# Project: Kaggle Challenge Movie Rating Prediction 
In this project, we write a K-Nearest Neighbor algorithm to predict the IMDB rating of movies. We use a dataset from Kaggle that contains 28 variables for 5043 movies, spanning 100 years and 66 countries.
We write our K-Nearest Neighbor algorithm from scratch instead of calling a ready-made library implementation; pandas is used only to load and clean the data.
We’ll implement the three steps of the K-Nearest Neighbor Algorithm:
- Normalize the data
- Find the k nearest neighbors
- Classify the new point based on those neighbors
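
As a rough illustration of how these three steps fit together, the sketch below runs them on a tiny made-up dataset. The function names (`normalize_columns`, `euclidean`, `knn_classify`), the toy feature values, and the choice of `k=3` are assumptions for illustration only, not the notebook's final implementation.

```python
from collections import Counter

def normalize_columns(rows):
    """Min-max scale each column of a list of equal-length rows to [0, 1]."""
    columns = list(zip(*rows))
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column is mapped to 0.0 to avoid dividing by zero.
        scaled.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(r) for r in zip(*scaled)]

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_classify(query, points, labels, k):
    """Return the majority label among the k points closest to query."""
    ranked = sorted(zip(points, labels), key=lambda pl: euclidean(query, pl[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: [duration in minutes, budget in dollars]
raw = [[90, 1_000_000], [95, 2_000_000], [150, 90_000_000],
       [160, 100_000_000], [100, 3_000_000]]
labels = [0, 0, 1, 1, 0]  # 1 = good movie, 0 = bad movie

scaled = normalize_columns(raw)          # step 1: normalize
train, query = scaled[:4], scaled[4]     # treat the last movie as the new point
prediction = knn_classify(query, train, labels[:4], k=3)  # steps 2 and 3
print(prediction)
```

An odd `k` is a common choice for two-class problems because it rules out tied votes.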

In [27]:
import pandas as pd

## Part 1: Load and clean the data

In [148]:
# https://www.kaggle.com/carolzhangdc/predict-imdb-score-with-data-mining-algorithms
# The dataset is from the Kaggle website. It contains 28 variables for 5043 movies, spanning 100 years and 66 countries.
df = pd.read_csv('08 ML_KNearestNeighbors_KaggleChallengeMovieRatingPredictionDataset.csv')

# sorting by movie name 
df.sort_values("movie_title", inplace = True) 

# dropping ALL duplicate values: keep=False removes every row whose title appears more than once
df.drop_duplicates(subset ="movie_title", keep = False, inplace = True) 

df.head(3)


Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
4447,Color,Tara Subkoff,35.0,101.0,37.0,56.0,Balthazar Getty,501.0,,Drama|Horror|Mystery|Thriller,...,42.0,English,USA,Not Rated,1500000.0,2015.0,418.0,3.3,,750
3698,Color,Dan Trachtenberg,411.0,104.0,16.0,82.0,John Gallagher Jr.,14000.0,71897215.0,Drama|Horror|Mystery|Sci-Fi|Thriller,...,440.0,English,USA,PG-13,15000000.0,2016.0,338.0,7.3,2.35,33000
3015,Color,Timothy Hines,1.0,111.0,0.0,247.0,Kelly LeBrock,1000.0,14616.0,Drama,...,10.0,English,USA,R,12000000.0,2015.0,445.0,7.5,1.85,26000
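
To make the effect of `keep = False` concrete, here is a small sketch on a made-up dataframe (the titles and scores are invented for illustration). It compares `keep=False`, which the cell above uses, with `keep="first"`, which would retain one copy of each repeated title.

```python
import pandas as pd

# Hypothetical toy data: title "B" appears twice
toy = pd.DataFrame({
    "movie_title": ["A", "B", "B", "C"],
    "imdb_score": [7.1, 6.0, 6.0, 8.2],
})

# keep=False removes every row whose title appears more than once
dropped_all = toy.drop_duplicates(subset="movie_title", keep=False)

# keep="first" retains the first row for each title
kept_first = toy.drop_duplicates(subset="movie_title", keep="first")

print(dropped_all["movie_title"].tolist())
print(kept_first["movie_title"].tolist())
```

With `keep=False` the repeated movie disappears from the dataset entirely, so the cleaned dataframe can end up with fewer distinct titles than the raw file.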


## Part 2: Restructure and clean the data

In [154]:
# Make a movie dataset as a subset of the dataframe with the relevant features for our K-Nearest Neighbor algorithm.
# .copy() gives an independent dataframe, so the edits below do not trigger SettingWithCopyWarning.
md = df[['movie_title', 'duration', 'actor_1_facebook_likes', 'num_critic_for_reviews','budget','gross','title_year','movie_facebook_likes']].copy()

# check NaN values and replace them with 0, so missing entries are handled together with the zero values below
md.isna().any()
feature_cols = ['duration', 'actor_1_facebook_likes', 'num_critic_for_reviews', 'budget', 'gross', 'title_year', 'movie_facebook_likes']
# cast to float so a fractional mean can be stored in every feature column
md[feature_cols] = md[feature_cols].fillna(0).astype(float)

# replace all zero values with the average of the column
# (the column label in .loc limits the assignment to that column; without it the whole row, title included, is overwritten)
for col in feature_cols:
    md.loc[md[col] == 0, col] = md[col].mean()
# check any 0 value 
md.isin([0]).any().any()
# check any NaN value
md.isna().any()

False
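
One side effect of the zero-placeholder approach is that each column mean is computed while the placeholders are still in the column, and any genuine zeros are overwritten too. As a possible alternative, not what this notebook does, missing values can be filled directly with the mean of the observed entries. The sketch below uses a made-up dataframe; the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with one missing value per numeric column
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "duration": [100.0, None, 120.0],
    "budget": [1_000_000.0, 3_000_000.0, None],
})

numeric_cols = ["duration", "budget"]

# mean() skips NaN by default, so each fill value is the mean of the observed entries only;
# fillna with a Series matches fill values to columns by label
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

print(toy)
```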

In [157]:
# Make a subset of the dataframe with the IMDB score (.copy() so a new column can be added without SettingWithCopyWarning)
mls = df[['movie_title','imdb_score']].copy()
mls.head(3)

# Add a new column dividing movies into bad (IMDB rating < 7, label 0) and good (IMDB rating >= 7, label 1)
mls['l'] = (mls['imdb_score'] >= 7).astype(int)
mls.head(3)

# check any 0 value 
mls.isin([0]).any().any()
# check any NaN value
mls.isna().any()

movie_title    False
imdb_score     False
l              False
dtype: bool

## Part 2: Normalize the data

In [25]:
# For every feature, the minimum value of that feature gets transformed into a 0, the maximum value 
# gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.
def min_max_normalize(lst):
  # compute the bounds once instead of once per element
  minimum = min(lst)
  maximum = max(lst)
  return [(value - minimum) / (maximum - minimum) for value in lst]

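
A short worked example may help here. The sketch below applies min-max normalization to four made-up release years (the values are assumptions, not taken from the dataset): 1990 is the minimum and maps to 0, 2015 is the maximum and maps to 1, and the others land in between, e.g. (2000 - 1990) / (2015 - 1990) = 0.4.

```python
def min_max_normalize(lst):
    """Scale a list of numbers linearly so min -> 0 and max -> 1."""
    minimum = min(lst)
    maximum = max(lst)
    return [(value - minimum) / (maximum - minimum) for value in lst]

# Hypothetical release years
years = [1990, 2000, 2010, 2015]
normalized_years = min_max_normalize(years)
print(normalized_years)  # [0.0, 0.4, 0.8, 1.0]
```

Note that a column where every value is identical would make `maximum - minimum` zero and raise `ZeroDivisionError`, so such columns need separate handling.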

## Part 3: Distance function 

In [24]:
# Distance Between Points using Euclidean Distance
def distance(movie1, movie2):
  length_difference = 0
  for i in range(len(movie1)):
    length_difference += (movie1[i] - movie2[i]) ** 2
  return length_difference ** 0.5
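
As a quick sanity check of the formula, the sketch below evaluates the distance for the classic 3-4-5 right triangle and for two made-up movies with features `[duration, budget]`. The movie values are assumptions chosen to show why normalization comes first: on raw values the budget difference swamps the duration difference.

```python
def distance(movie1, movie2):
    """Euclidean distance between two equal-length feature lists."""
    squared_sum = 0
    for i in range(len(movie1)):
        squared_sum += (movie1[i] - movie2[i]) ** 2
    return squared_sum ** 0.5

# 3-4-5 right triangle: sqrt(3**2 + 4**2) = 5
print(distance([0, 0], [3, 4]))  # 5.0

# Hypothetical raw features: [duration in minutes, budget in dollars]
short_cheap = [90, 1_000_000]
long_cheap = [160, 1_500_000]
raw_distance = distance(short_cheap, long_cheap)

# The 70-minute gap barely registers next to the 500,000-dollar gap
print(raw_distance)
```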

## Part 4: 