# Project: Kaggle Challenge Movie Rating Prediction 
In this project, we write a K-Nearest Neighbor algorithm to predict the IMDB rating of movies. We use a Kaggle dataset that contains 28 variables for 5043 movies, spanning 100 years and 66 countries.
We write our K-Nearest Neighbor algorithm from scratch without using built-in libraries.
We’ll implement the three steps of the K-Nearest Neighbor Algorithm:
- Normalize the data
- Find the k nearest neighbors
- Classify the new point based on those neighbors

In [27]:
import pandas as pd

## Part 1: Load the data

In [39]:
# https://www.kaggle.com/carolzhangdc/predict-imdb-score-with-data-mining-algorithms
# The dataset comes from Kaggle (link above): 28 variables for 5043 movies.
df = pd.read_csv('08 ML_KNearestNeighbors_KaggleChallengeMovieRatingPredictionDataset.csv')

# Sort by movie title
df.sort_values("movie_title", inplace=True)

# Drop every movie whose title appears more than once (keep=False discards all copies)
df.drop_duplicates(subset="movie_title", keep=False, inplace=True)

df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 4798 entries, 4447 to 882
Data columns (total 28 columns):
color                        4779 non-null object
director_name                4698 non-null object
num_critic_for_reviews       4750 non-null float64
duration                     4783 non-null float64
director_facebook_likes      4698 non-null float64
actor_3_facebook_likes       4775 non-null float64
actor_2_name                 4785 non-null object
actor_1_facebook_likes       4791 non-null float64
gross                        3956 non-null float64
genres                       4798 non-null object
actor_1_name                 4791 non-null object
movie_title                  4798 non-null object
num_voted_users              4798 non-null int64
cast_total_facebook_likes    4798 non-null int64
actor_3_name                 4775 non-null object
facenumber_in_poster         4785 non-null float64
plot_keywords                4648 non-null object
movie_imdb_link              4798 no
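
The `keep=False` argument used above removes every row whose title is repeated, rather than keeping one copy, which is why the row count falls from 5043 to 4798. A minimal sketch of the difference, using a hypothetical toy dataframe (the titles and scores below are made up, not taken from the Kaggle file):

```python
import pandas as pd

# Hypothetical toy data, not taken from the Kaggle file.
toy = pd.DataFrame({
  "movie_title": ["A", "B", "B", "C"],
  "imdb_score": [7.1, 6.0, 6.2, 8.3],
})

# keep=False removes every row whose title occurs more than once.
no_repeats = toy.drop_duplicates(subset="movie_title", keep=False)

# keep="first" retains the first occurrence of each repeated title.
first_only = toy.drop_duplicates(subset="movie_title", keep="first")

print(no_repeats["movie_title"].tolist())  # ['A', 'C']
print(first_only["movie_title"].tolist())  # ['A', 'B', 'C']
```

Whether to discard all copies or keep one is a design choice; `keep=False` is the stricter option and avoids guessing which duplicate row is the correct one.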

## Part 2: Clean the data

In [41]:
# Make a subset of the dataframe with relevant features to use for our K-Nearest Neighbor 
movie_dataset = df[['movie_title', 'duration', 'actor_1_facebook_likes', 'num_critic_for_reviews','budget','gross','title_year','movie_facebook_likes']]

# Make a subset of the dataframe with the IMDB score (copied so a column can be added safely)
movie_labels = df[['movie_title','imdb_score']].copy()
movie_labels.head(3)

# Add a column dividing movies into bad (IMDB rating < 7, label 0) and good (IMDB rating >= 7, label 1)
movie_labels['l'] = (movie_labels['imdb_score'] >= 7).astype(int)

# movie_labels.head(3)
movie_dataset.head(3)

| | movie_title | duration | actor_1_facebook_likes | num_critic_for_reviews | budget | gross | title_year | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|
| 4447 | #Horror | 101.0 | 501.0 | 35.0 | 1500000.0 | NaN | 2015.0 | 750 |
| 3698 | 10 Cloverfield Lane | 104.0 | 14000.0 | 411.0 | 15000000.0 | 71897215.0 | 2016.0 | 33000 |
| 3015 | 10 Days in a Madhouse | 111.0 | 1000.0 | 1.0 | 12000000.0 | 14616.0 | 2015.0 | 26000 |
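
The preview above shows a missing `gross` value for the first movie. Missing values propagate through arithmetic as `NaN`, so they would make any distance involving that movie undefined. One possible way to handle this, which is an assumption and not a step taken in this notebook, is to drop incomplete rows and keep the labels aligned by index. The sketch uses hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy features; the first row has a missing gross value.
features = pd.DataFrame({
  "duration": [101.0, 104.0, 111.0],
  "gross": [float("nan"), 5.0e6, 1.2e4],
})
# Hypothetical good/bad labels sharing the same index as the features.
labels = pd.Series([0, 1, 0])

# Keep only rows where every feature is present.
complete = features.dropna()

# Select the labels for the surviving rows so both stay aligned.
complete_labels = labels.loc[complete.index]

print(len(complete))             # 2
print(complete_labels.tolist())  # [1, 0]
```

Dropping rows is the simplest option; filling missing values with a column median would keep more movies but adds an assumption about the data.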


## Part 2: Normalize the data

In [25]:
# For every feature, the minimum value of that feature gets transformed into a 0, the maximum value 
# gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.
def min_max_normalize(lst):
  # Compute the bounds once instead of on every iteration
  minimum, maximum = min(lst), max(lst)
  return [(value - minimum) / (maximum - minimum) for value in lst]
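
To illustrate what the function above does, here is a small usage sketch. The duration and budget values are hypothetical; they were chosen only to show that features on very different scales end up in the same 0 to 1 range, so that no single feature dominates the distance calculation.

```python
def min_max_normalize(lst):
  """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
  minimum, maximum = min(lst), max(lst)
  return [(value - minimum) / (maximum - minimum) for value in lst]

# Hypothetical durations (minutes) and budgets (dollars) on very different scales.
durations = [90, 120, 150]
budgets = [1_000_000, 3_000_000, 11_000_000]

print(min_max_normalize(durations))  # [0.0, 0.5, 1.0]
print(min_max_normalize(budgets))    # [0.0, 0.2, 1.0]
```

Note that a feature whose values are all identical would cause a division by zero here, so such a column would need to be dropped or handled separately.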


## Part 3: Distance function 

In [24]:
# Distance Between Points using Euclidean Distance
def distance(movie1, movie2):
  length_difference = 0
  for i in range(len(movie1)):
    length_difference += (movie1[i] - movie2[i]) ** 2
  return length_difference ** 0.5
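
With a distance function in place, the remaining two steps (finding the k nearest neighbors and classifying by majority vote) could look roughly like the sketch below. The `classify` function, its parameters, and the toy data are hypothetical names and values chosen for illustration; they are not the notebook's own implementation. The sketch assumes the features are already normalized and that labels are 1 for good movies and 0 for bad ones.

```python
def distance(movie1, movie2):
  # Euclidean distance between two equal-length feature lists.
  squared = 0
  for i in range(len(movie1)):
    squared += (movie1[i] - movie2[i]) ** 2
  return squared ** 0.5

def classify(unknown, dataset, labels, k):
  # Pair each known movie's distance to the unknown point with its title,
  # then sort so the closest movies come first.
  distances = sorted(
    (distance(features, unknown), title) for title, features in dataset.items()
  )
  neighbors = distances[:k]
  # Count how many of the k nearest neighbors are labeled good (1).
  good_votes = sum(labels[title] for _, title in neighbors)
  # Majority vote; an odd k avoids ties between the two classes.
  return 1 if good_votes > k / 2 else 0

# Hypothetical normalized features and labels, for illustration only.
dataset = {"a": [0.1, 0.2], "b": [0.15, 0.25], "c": [0.9, 0.8],
           "d": [0.85, 0.9], "e": [0.2, 0.1]}
labels = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 0}

print(classify([0.8, 0.85], dataset, labels, k=3))  # 1
print(classify([0.1, 0.1], dataset, labels, k=3))   # 0
```

Sorting every distance is simple but costs O(n log n) per query; for a dataset of a few thousand movies this is unlikely to matter.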

## Part 4: 