## Objective: Build a movie recommendation system

## Question: What 5 movies can we recommend based on the movie "The Post"?

Recommend with k-nearest neighbors (kNN), using these feature values for "The Post" as the query:

IMDB Rating = 7.2

Biography = Yes

Drama = Yes

Thriller = No

Comedy = No

Crime = No

Mystery = No

History = Yes
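
kNN ranks movies by the distance between feature vectors, so it helps to see one distance worked out by hand. The sketch below assumes Euclidean distance (the default metric of scikit-learn's `NearestNeighbors`) and compares the inputs above with the first row of the dataset, "The Imitation Game". The variable names are illustrative only.

```python
import math

# Feature order: IMDB Rating, Biography, Drama, Thriller, Comedy, Crime, Mystery, History
the_post = [7.2, 1, 1, 0, 0, 0, 0, 1]

# First row of the dataset (The Imitation Game), encoded the same way
imitation_game = [8.0, 1, 1, 1, 0, 0, 0, 0]

# Euclidean distance: sqrt(0.8^2 + 1^2 + 1^2) = sqrt(2.64)
distance = math.dist(the_post, imitation_game)
print(round(distance, 3))  # 1.625
```

The rating gap contributes 0.64 to the squared distance, while each mismatched genre flag contributes 1, so a single genre difference weighs more than a 0.8-point rating difference.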

In [None]:
# import libraries
import pandas as pd
from sklearn.neighbors import NearestNeighbors

In [None]:
# load df
df = pd.read_csv('https://github.com/ArinB/MSBA-CA-Data/raw/main/CA05/movies_recommendation_data.csv')

In [None]:
# explore
df.head()

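| | Movie ID | Movie Name | IMDB Rating | Biography | Drama | Thriller | Comedy | Crime | Mystery | History | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | The Imitation Game | 8.0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 8 | Ex Machina | 7.7 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 46 | A Beautiful Mind | 8.2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 62 | Good Will Hunting | 8.3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 97 | Forrest Gump | 8.8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |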


In [None]:
# examine data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Movie ID     30 non-null     int64  
 1   Movie Name   30 non-null     object 
 2   IMDB Rating  30 non-null     float64
 3   Biography    30 non-null     int64  
 4   Drama        30 non-null     int64  
 5   Thriller     30 non-null     int64  
 6   Comedy       30 non-null     int64  
 7   Crime        30 non-null     int64  
 8   Mystery      30 non-null     int64  
 9   History      30 non-null     int64  
 10  Label        30 non-null     int64  
dtypes: float64(1), int64(9), object(1)
memory usage: 2.7+ KB


In [None]:
# data quality report
missing_values = df.isnull().sum()
percent_missing = df.isnull().mean()*100
data_types = df.dtypes
unique_values = df.nunique()

quality_report = pd.DataFrame(
    {"Missing Values": missing_values,
     'Percent Missing': percent_missing,
     'Data Type': data_types,
     'Number of Unique Values': unique_values
     })

quality_report

| | Missing Values | Percent Missing | Data Type | Number of Unique Values |
|---|---|---|---|---|
| Movie ID | 0 | 0.0 | int64 | 27 |
| Movie Name | 0 | 0.0 | object | 30 |
| IMDB Rating | 0 | 0.0 | float64 | 17 |
| Biography | 0 | 0.0 | int64 | 2 |
| Drama | 0 | 0.0 | int64 | 2 |
| Thriller | 0 | 0.0 | int64 | 2 |
| Comedy | 0 | 0.0 | int64 | 2 |
| Crime | 0 | 0.0 | int64 | 2 |
| Mystery | 0 | 0.0 | int64 | 2 |
| History | 0 | 0.0 | int64 | 2 |


In [None]:
# assign the features to keep
columns_to_keep = ['IMDB Rating', 'Biography', 'Drama', 'Thriller', 'Comedy', 'Crime', 'Mystery', 'History']

In [None]:
# generate a new df with only the features to keep
df_kNN_features = df[columns_to_keep]

In [None]:
# assign the user input given in the prompt as a dict
user_data = {'IMDB Rating': 7.2, 'Biography': "Yes", 'Drama': "Yes", 'Thriller': "No", 'Comedy': "No", 'Crime': "No", 'Mystery': "No", 'History': "Yes"}

In [None]:
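# the genre columns in df are already stored as 0/1 integers (see df.info() above),
# so this replace only matters if the source file ever contains 'Yes'/'No' strings
df_kNN_features = df_kNN_features.replace({'Yes': 1, 'No': 0})

# encode the Yes/No features of the user input as 1/0 and keep numeric values unchanged;
# a bare `value == "Yes"` test would turn the 7.2 rating into 0 and skew the neighbors toward the lowest-rated movies
yes_no = {'Yes': 1, 'No': 0}
encoded_user_data = {key: yes_no.get(value, value) for key, value in user_data.items() if key in columns_to_keep}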
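
Because the user input mixes a numeric rating with Yes/No flags, the encoding step has to treat the two kinds of value differently. The sketch below contrasts a plain `value == "Yes"` test with a hypothetical helper, `encode_user_input`, which is not part of the original notebook and is shown only to illustrate the difference.

```python
def encode_user_input(raw, feature_names):
    """Hypothetical helper: map 'Yes'/'No' to 1/0 and pass numbers through unchanged."""
    yes_no = {"Yes": 1, "No": 0}
    encoded = {}
    for name in feature_names:
        value = raw[name]
        if isinstance(value, str):
            # Fail loudly on anything other than the two expected labels
            if value not in yes_no:
                raise ValueError(f"Unexpected value for {name!r}: {value!r}")
            encoded[name] = yes_no[value]
        else:
            encoded[name] = value
    return encoded


features = ['IMDB Rating', 'Biography', 'Drama', 'Thriller', 'Comedy', 'Crime', 'Mystery', 'History']
the_post = {'IMDB Rating': 7.2, 'Biography': "Yes", 'Drama': "Yes", 'Thriller': "No",
            'Comedy': "No", 'Crime': "No", 'Mystery': "No", 'History': "Yes"}

# A plain equality test maps every non-"Yes" value, including numbers, to 0
naive = {key: (1 if value == "Yes" else 0) for key, value in the_post.items()}
safe = encode_user_input(the_post, features)

print(naive['IMDB Rating'])  # 0: the rating is lost
print(safe['IMDB Rating'])   # 7.2
```

With the rating collapsed to 0, the query sits far below every movie in the dataset on the rating axis, so the nearest neighbors end up being whichever movies have the lowest ratings.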

In [None]:
# convert the user input dict to a df and align the column order of the user input with our df
user_df = pd.DataFrame([encoded_user_data], columns=columns_to_keep)

In [None]:
# kNN
neigh = NearestNeighbors(n_neighbors=5) # initialize Nearest Neighbor with 5 neighbors
neigh.fit(df_kNN_features)  # fit the model with the encoded movie features

distances, indices = neigh.kneighbors(user_df)  # pass the encoded user input

closest_movies = df.iloc[indices.flatten()]['Movie Name']
print(closest_movies)

7                  Travelling Salesman
5                                   21
20                    The DaVinci Code
26    Spirit: Stallion of the Cimarron
9                       The Karate Kid
Name: Movie Name, dtype: object
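
`kneighbors` also returns the distance to each neighbor, which is useful for judging how close a recommendation really is. A minimal sketch of pairing names with distances, assuming only the five rows shown in the `df.head()` output (the full dataset has 30 rows, so this ranking is for illustration and is not the actual recommendation list). The names `sample`, `query`, and `ranked` are illustrative.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

feature_cols = ['IMDB Rating', 'Biography', 'Drama', 'Thriller',
                'Comedy', 'Crime', 'Mystery', 'History']

# Five rows copied from the df.head() output; not the full dataset
sample = pd.DataFrame(
    [
        ['The Imitation Game', 8.0, 1, 1, 1, 0, 0, 0, 0],
        ['Ex Machina',         7.7, 0, 1, 0, 0, 0, 1, 0],
        ['A Beautiful Mind',   8.2, 1, 1, 0, 0, 0, 0, 0],
        ['Good Will Hunting',  8.3, 0, 1, 0, 0, 0, 0, 0],
        ['Forrest Gump',       8.8, 0, 1, 0, 0, 0, 0, 0],
    ],
    columns=['Movie Name'] + feature_cols,
)

# Query vector for "The Post", with the rating kept numeric
query = pd.DataFrame([[7.2, 1, 1, 0, 0, 0, 0, 1]], columns=feature_cols)

model = NearestNeighbors(n_neighbors=3).fit(sample[feature_cols])
distances, indices = model.kneighbors(query)

# Pair each neighbor's name with its distance, nearest first
ranked = pd.DataFrame({
    'Movie Name': sample.iloc[indices[0]]['Movie Name'].to_numpy(),
    'Distance': distances[0].round(3),
})
print(ranked)
```

On this five-row sample the nearest movie is "A Beautiful Mind", which shares the Biography and Drama flags with the query and differs only in History and rating.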
