In [2]:
#import necessary libraries
import pandas as pd
import numpy as np 
from sklearn.neighbors import NearestNeighbors

In [4]:
#load the dataset
df = pd.read_csv('https://github.com/ArinB/MSBA-CA-Data/raw/main/CA05/movies_recommendation_data.csv')

In [8]:
#explore the dataset
df.head(10)

Unnamed: 0,Movie ID,Movie Name,IMDB Rating,Biography,Drama,Thriller,Comedy,Crime,Mystery,History,Label
0,58,The Imitation Game,8.0,1,1,1,0,0,0,0,0
1,8,Ex Machina,7.7,0,1,0,0,0,1,0,0
2,46,A Beautiful Mind,8.2,1,1,0,0,0,0,0,0
3,62,Good Will Hunting,8.3,0,1,0,0,0,0,0,0
4,97,Forrest Gump,8.8,0,1,0,0,0,0,0,0
5,98,21,6.8,0,1,0,0,1,0,1,0
6,31,Gifted,7.6,0,1,0,0,0,0,0,0
7,3,Travelling Salesman,5.9,0,1,0,0,0,1,0,0
8,51,Avatar,7.9,0,0,0,0,0,0,0,0
9,47,The Karate Kid,7.2,0,1,0,0,0,0,0,0


In [12]:
#drop the Label column since it is unnecessary
df = df.drop(['Label'], axis = 1)

In [16]:
#verify that encoding is proper for modeling
genre_columns = ['Biography','Drama','Thriller','Comedy','Crime','Mystery','History']

for genre in genre_columns: 
    unique_values = df[genre].unique()
    if set(unique_values).issubset({0, 1}):
        print('Genre is encoded properly.')
    else:
        print('Genre is not encoded properly, as it is not binary')

Genre is encoded properly.
Genre is encoded properly.
Genre is encoded properly.
Genre is encoded properly.
Genre is encoded properly.
Genre is encoded properly.
Genre is encoded properly.


In [18]:
df['IMDB Rating'].describe()

count    30.000000
mean      7.696667
std       0.666169
min       5.900000
25%       7.300000
50%       7.750000
75%       8.175000
max       8.800000
Name: IMDB Rating, dtype: float64

The IMDB ratings in this dataset range from 5.9 to 8.8, while the genre columns are binary (0 or 1). Because KNN is a distance-based algorithm, features on different scales contribute unequally to the distance, so scaling is strongly recommended. Here we apply min-max scaling, which maps the ratings onto the 0 to 1 range and puts them on the same scale as the genre indicators.

Citation: https://medium.com/analytics-vidhya/why-is-scaling-required-in-knn-and-k-means-8129e4d88ed7

In [42]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['IMDB Rating Scaled'] = scaler.fit_transform(df[['IMDB Rating']])
df[['Movie Name','IMDB Rating','IMDB Rating Scaled']].head(10)

Unnamed: 0,Movie Name,IMDB Rating,IMDB Rating Scaled
0,The Imitation Game,8.0,0.724138
1,Ex Machina,7.7,0.62069
2,A Beautiful Mind,8.2,0.793103
3,Good Will Hunting,8.3,0.827586
4,Forrest Gump,8.8,1.0
5,21,6.8,0.310345
6,Gifted,7.6,0.586207
7,Travelling Salesman,5.9,0.0
8,Avatar,7.9,0.689655
9,The Karate Kid,7.2,0.448276
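
The scaled values in this table follow the min-max formula, (rating - min) / (max - min). As a minimal sketch, the hand computation below uses the minimum (5.9) and maximum (8.8) reported by `describe()`. The helper name `min_max_scale` is hypothetical and is not part of scikit-learn.

```python
def min_max_scale(value, lo, hi):
    """Map value from the interval [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)

# Minimum and maximum IMDB Rating taken from the describe() output.
RATING_MIN, RATING_MAX = 5.9, 8.8

for rating in (5.9, 7.2, 8.0, 8.8):
    scaled = min_max_scale(rating, RATING_MIN, RATING_MAX)
    print(f"{rating} -> {scaled:.6f}")
```

The four ratings correspond to Travelling Salesman, The Karate Kid, The Imitation Game and Forrest Gump, so the printed values can be compared against the IMDB Rating Scaled column.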


Next, we select the features used to fit the model: the scaled rating plus six of the genre indicators (History is not included).

In [44]:
features = ['IMDB Rating Scaled','Biography','Drama','Thriller','Comedy','Crime','Mystery']
x = df[features]

In [46]:
knn_model = NearestNeighbors(n_neighbors = 5, metric = 'euclidean') #Euclidean distance is a common choice of metric
knn_model.fit(x)
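
With `metric = 'euclidean'`, the distance between two movies is the square root of the summed squared differences across the seven features. The sketch below uses only the standard library and scaled ratings copied from the table earlier. It illustrates that a single genre mismatch (a difference of 1) counts at least as much as the largest possible gap in scaled rating. The query vector is an assumption chosen for illustration.

```python
import math

# Feature order: scaled rating, Biography, Drama, Thriller, Comedy, Crime, Mystery.
query = [0.448276, 1, 1, 0, 0, 0, 0]               # illustrative query (rating 7.2)
a_beautiful_mind = [0.793103, 1, 1, 0, 0, 0, 0]    # same genres, higher rating
the_imitation_game = [0.724138, 1, 1, 1, 0, 0, 0]  # closer rating, extra Thriller flag

d_same_genres = math.dist(query, a_beautiful_mind)
d_extra_genre = math.dist(query, the_imitation_game)

print(f"same genres: {d_same_genres:.4f}")
print(f"one genre mismatch: {d_extra_genre:.4f}")
```

Even though The Imitation Game is closer in rating, its extra genre flag makes it the more distant of the two.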

Now we will test the model on the query movie provided to us. We start by scaling its rating with the scaler fitted above, then build the query and look up its nearest neighbors.

In [82]:
test_movie = pd.DataFrame([[7.2]], columns=['IMDB Rating'])
test_movie_scaled = scaler.transform(test_movie)

test_movie = [test_movie_scaled[0][0], 1, 1, 0, 0, 0, 0] #History is omitted because it is not one of the model's features
test_movie

[np.float64(0.4482758620689653), 1, 1, 0, 0, 0, 0]

In [88]:
#create a separate dataframe for the test movie
test_movie_df = pd.DataFrame([test_movie], columns=['IMDB Rating Scaled', 'Biography', 'Drama', 'Thriller',
                                                 'Comedy', 'Crime', 'Mystery'])
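
`kneighbors` expects a two-dimensional query: one row per query movie and one column per feature. The sketch below uses plain nested lists and a hypothetical helper, `nesting_depth`, as a stand-in for an array's number of dimensions. It shows how one extra pair of brackets turns a 2-D query into a 3-D one.

```python
def nesting_depth(obj):
    """Count how many list levels wrap the first innermost element."""
    depth = 0
    while isinstance(obj, list):
        depth += 1
        obj = obj[0]
    return depth

query_row = [0.448276, 1, 1, 0, 0, 0, 0]  # one movie, seven features
query_2d = [query_row]                    # shape (1, 7): what kneighbors expects
query_3d = [query_2d]                     # shape (1, 1, 7): one bracket too many

print(nesting_depth(query_2d))
print(nesting_depth(query_3d))
```

Passing the one-row DataFrame itself, rather than a list containing it, keeps the query two-dimensional.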

In [90]:
#find the 5 nearest neighbors; kneighbors expects a 2-D input, so pass the DataFrame itself
distances, indices = knn_model.kneighbors(test_movie_df)

print('Top 5 Recommended Movies:')
for dist, idx in zip(distances[0], indices[0]):
    print(f" {df.iloc[idx]['Movie Name']} (Distance: {dist:.4f})")

In [72]:
#verifying the prediction was accurate
similar_movies = pd.DataFrame({
    'Movie Name': ['Queen of Katwe', 'The Wind Rises', '12 Years a Slave', 'A Beautiful Mind', 'Hacksaw Ridge']
})
recommended_movies = df[df['Movie Name'].isin(similar_movies['Movie Name'])]
print("Details of the Recommended Movies:")
print(recommended_movies[['Movie Name', 'IMDB Rating', 'Biography', 'Drama', 'Thriller', 'Comedy', 'Crime', 'Mystery']])

Details of the Recommended Movies:
          Movie Name  IMDB Rating  Biography  Drama  Thriller  Comedy  Crime  \
2   A Beautiful Mind          8.2          1      1         0       0      0   
16    The Wind Rises          7.8          1      1         0       0      0   
27     Hacksaw Ridge          8.2          1      1         0       0      0   
28  12 Years a Slave          8.1          1      1         0       0      0   
29    Queen of Katwe          7.4          1      1         0       0      0   

    Mystery  
2         0  
16        0  
27        0  
28        0  
29        0  


After checking the result, we see that the KNN model not only matched the genres of "The Post" (Biography and Drama) in all 5 recommendations, but also ordered them by how close their IMDB rating is to the query's 7.2. Thus, we can confirm that this KNN-based movie recommender worked as intended for this test query.
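
The ordering claim can be checked without the model. All five recommendations share the query's genre flags, so their distance to the query depends only on the rating gap. A minimal sketch, using the ratings printed above and the query rating of 7.2:

```python
QUERY_RATING = 7.2

# Ratings copied from the "Details of the Recommended Movies" output.
ratings = {
    "A Beautiful Mind": 8.2,
    "The Wind Rises": 7.8,
    "Hacksaw Ridge": 8.2,
    "12 Years a Slave": 8.1,
    "Queen of Katwe": 7.4,
}

# Sort by absolute rating gap; ties keep their original order (stable sort).
ranked = sorted(ratings, key=lambda name: abs(ratings[name] - QUERY_RATING))

for name in ranked:
    print(f"{name}: gap {abs(ratings[name] - QUERY_RATING):.1f}")
```

A Beautiful Mind and Hacksaw Ridge are tied at 8.2, so their relative order is arbitrary.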

To build a more effective model, a significantly larger dataset is needed, one that covers the full range of IMDB ratings (1 to 10) rather than the narrow 5.9 to 8.8 band we currently have. Furthermore, we'll have to take into account additional genres, such as History, which was left out of the feature set used here.

In essence, to use this model correctly, any new rating must be scaled with the same min-max transform fitted on this dataset (minimum 5.9, maximum 8.8). Furthermore, a query can only describe a movie using Biography, Drama, Thriller, Comedy, Crime, and Mystery as genres.
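
These two constraints can be captured in one small helper. The sketch below is an assumption for illustration, not part of the notebook: `build_query` is a hypothetical function that scales a rating with the training minimum and maximum and rejects any genre outside the model's feature set.

```python
RATING_MIN, RATING_MAX = 5.9, 8.8  # range of the training ratings
GENRES = ["Biography", "Drama", "Thriller", "Comedy", "Crime", "Mystery"]

def build_query(rating, genres):
    """Return [scaled rating, genre flags...] in the model's feature order."""
    unknown = set(genres) - set(GENRES)
    if unknown:
        raise ValueError(f"Genres not in the model's feature set: {sorted(unknown)}")
    scaled = (rating - RATING_MIN) / (RATING_MAX - RATING_MIN)
    return [scaled] + [1 if g in genres else 0 for g in GENRES]

print(build_query(7.2, {"Biography", "Drama"}))
```

A rating outside 5.9 to 8.8 is still accepted, but it scales to a value below 0 or above 1. This is one reason a wider training range would help.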