# CA05: KNN based Movie Recommender Engine 

## (1) The Application
At scale, this would look like recommending products on Amazon, articles on Medium, movies on 
Netflix, or videos on YouTube. We can be fairly sure they all use more efficient means of making 
recommendations, given the enormous volume of data they process. However, we can replicate one of 
these recommender systems on a smaller scale using what we have learned here in this article. Let us 
build the core of a movie recommender system.

__What question are we trying to answer? 
Given a movie data set, what are the 5 movies most similar to a query movie?__ 

In [179]:
#Goal of this model: Return 5 most similar movies given a title of one (based on many different factors)

## (2) Data Source and Contents
If we worked at Netflix, Hulu, or IMDb, we could grab the data from their data warehouse. Since we 
don’t work at any of those companies, we have to get our data through some other means. We could use 
some movies data from the UCI Machine Learning Repository, IMDb’s data set, or painstakingly create 
our own. In our case, we will use a small sub-set of the data extracted from the UCI’s IMDB data set. 

__Data File Name:__ movies_recommendation_data.csv 

__File Location:__ https://github.com/ArinB/MSBA-CA-Data/raw/main/CA05/movies_recommendation_data.csv  

The data contains thirty movies, including data for each movie across seven genres and their IMDB 
ratings. The labels column values are all zeroes because we aren’t using this data set for classification or 
regression. You can ignore this column. The implementation assumes that all columns contain numerical 
data.  
Additionally, some relationships among the movies (e.g. actors, directors, and themes) will not be 
accounted for when using the KNN algorithm, simply because the data that captures those 
relationships is missing from the data set. Consequently, when we run the KNN algorithm on our data, 
similarity will be based solely on the included genres and the IMDB ratings of the movies. 

In [181]:
import pandas as pd
from sklearn.neighbors import NearestNeighbors
import numpy as np

In [183]:
df = pd.read_csv("C:/Users/ashle/Downloads/movies_recommendation_data.csv")

In [185]:
df.head()

Unnamed: 0,Movie ID,Movie Name,IMDB Rating,Biography,Drama,Thriller,Comedy,Crime,Mystery,History,Label
0,58,The Imitation Game,8.0,1,1,1,0,0,0,0,0
1,8,Ex Machina,7.7,0,1,0,0,0,1,0,0
2,46,A Beautiful Mind,8.2,1,1,0,0,0,0,0,0
3,62,Good Will Hunting,8.3,0,1,0,0,0,0,0,0
4,97,Forrest Gump,8.8,0,1,0,0,0,0,0,0


In [187]:
#I want to keep the movie titles for later use (in reality this would be for when we actually display the recommendations--see the end)
movie_titles = df['Movie Name']

In [189]:
#Here I want to drop the columns that aren't numeric features
#We can also drop "Movie ID" because it is only an identifier and doesn't describe the movie content
new_df = df.drop(['Movie Name', 'Label', 'Movie ID'], axis=1) 

In [191]:
new_df.head()

Unnamed: 0,IMDB Rating,Biography,Drama,Thriller,Comedy,Crime,Mystery,History
0,8.0,1,1,1,0,0,0,0
1,7.7,0,1,0,0,0,1,0
2,8.2,1,1,0,0,0,0,0
3,8.3,0,1,0,0,0,0,0
4,8.8,0,1,0,0,0,0,0
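
The data description notes that the implementation assumes every feature column is numeric. A minimal sketch of how that assumption could be checked before fitting is shown below. It rebuilds three of the rows shown in the `df.head()` output from an inline string so that it runs on its own; the names `sample_csv`, `sample_df`, `features`, and `non_numeric` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Three rows copied from the df.head() output, used here as a stand-in for the full file.
sample_csv = """Movie ID,Movie Name,IMDB Rating,Biography,Drama,Thriller,Comedy,Crime,Mystery,History,Label
58,The Imitation Game,8.0,1,1,1,0,0,0,0,0
8,Ex Machina,7.7,0,1,0,0,0,1,0,0
46,A Beautiful Mind,8.2,1,1,0,0,0,0,0,0
"""

sample_df = pd.read_csv(io.StringIO(sample_csv))

# Same column drop as in the notebook: keep only the numeric feature columns.
features = sample_df.drop(columns=["Movie Name", "Label", "Movie ID"])

# Collect any feature column whose dtype is not numeric.
non_numeric = features.select_dtypes(exclude="number").columns.tolist()
if non_numeric:
    raise ValueError(f"Non-numeric feature columns found: {non_numeric}")

print(features.dtypes)
```

Running the same check on the notebook's `new_df` would flag a stray text column before it reaches the distance computation.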


## (3) Building Your Own Recommender System
You are building a movie recommendation website that uses your Recommendation Engine at the 
back-end, and your task is to build that back-end Recommendation Engine. Imagine a user is 
navigating your website and encounters a movie named “The Post”. The user is not sure whether to 
watch it, but its genres are intriguing, and the user is curious about other similar movies. The user 
scrolls down to the “More Like This” section to see what recommendations your website will make, 
and the back-end algorithmic gears begin to turn. 

Your website sends a request to its back-end for the 5 movies that are most similar to The Post. The 
back-end has a recommendation data set exactly like ours. It begins by creating the row representation 
(better known as a feature vector) for The Post, then runs a program similar to the one below to search 
for the 5 movies that are most similar to The Post, and finally sends the results back to the user at your 
website.

__Following is the genre information about the movie “The Post”:__

IMDB Rating = 7.2, Biography = Yes, Drama = Yes, Thriller = No, Comedy = No, Crime = No, 
Mystery = No, History = Yes 
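
The Yes/No description above can be turned into the numeric row that the data set expects. The following is a minimal sketch of that encoding step. The dictionary `the_post_info` and the helper `encode_movie` are hypothetical names, and the column order is assumed to match the feature columns of the data set (IMDB Rating followed by the seven genres).

```python
# Column order assumed to match the feature columns of the data set.
FEATURE_ORDER = ["IMDB Rating", "Biography", "Drama", "Thriller",
                 "Comedy", "Crime", "Mystery", "History"]

# Hypothetical input: the description of "The Post" as given in the text.
the_post_info = {
    "IMDB Rating": 7.2,
    "Biography": "Yes", "Drama": "Yes", "Thriller": "No", "Comedy": "No",
    "Crime": "No", "Mystery": "No", "History": "Yes",
}


def encode_movie(info, feature_order=FEATURE_ORDER):
    """Map Yes/No genre flags to 1/0 and keep the rating as a float."""
    vector = []
    for name in feature_order:
        value = info[name]
        if isinstance(value, str):
            vector.append(1 if value.strip().lower() == "yes" else 0)
        else:
            vector.append(float(value))
    return vector


the_post_vector = encode_movie(the_post_info)
print(the_post_vector)  # [7.2, 1, 1, 0, 0, 0, 0, 1]
```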

In [193]:
#IMDB Rating = 7.2
#Biography = Yes, Drama = Yes, Thriller = No, Comedy = No, Crime = No, Mystery = No, History = Yes


In [195]:
# Order of features: ['IMDB Rating', 'Biography', 'Drama', 'Thriller', 'Comedy', 'Crime', 'Mystery', 'History']
#Here we encode the features of 'The Post' as a one-row DataFrame (its feature vector), in the same column order

the_post_df = pd.DataFrame([[7.2, 1, 1, 0, 0, 0, 0, 1]], columns=new_df.columns)


In [197]:
#Fit the model so that we can get 5 recommendations
knn = NearestNeighbors(n_neighbors=6, metric='euclidean')  # 5 recommendations + 1 in case the query movie itself is returned
knn.fit(new_df)

In [199]:
#Compare 'The Post' to the movies in the dataset (distances shows how far each returned movie is from 'The Post'; indices gives their row positions)

distances, indices = knn.kneighbors(the_post_df)
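
To make the distances concrete, here is a small sketch that computes the Euclidean distance by hand, using only the five rows visible in the `df.head()` output. It relies on Python's `math.dist` rather than scikit-learn, so it illustrates the metric and is not the notebook's actual code path; the names `the_post`, `head_rows`, and `hand_distances` are illustrative.

```python
import math

# Feature order: IMDB Rating, Biography, Drama, Thriller, Comedy, Crime, Mystery, History
the_post = [7.2, 1, 1, 0, 0, 0, 0, 1]

# The five rows shown in the df.head() output.
head_rows = {
    "The Imitation Game": [8.0, 1, 1, 1, 0, 0, 0, 0],
    "Ex Machina":         [7.7, 0, 1, 0, 0, 0, 1, 0],
    "A Beautiful Mind":   [8.2, 1, 1, 0, 0, 0, 0, 0],
    "Good Will Hunting":  [8.3, 0, 1, 0, 0, 0, 0, 0],
    "Forrest Gump":       [8.8, 0, 1, 0, 0, 0, 0, 0],
}

# Euclidean distance = square root of the sum of squared differences.
hand_distances = {title: math.dist(the_post, row) for title, row in head_rows.items()}

# Rank from closest to farthest, which is what a nearest-neighbor search does.
for title, d in sorted(hand_distances.items(), key=lambda item: item[1]):
    print(f"{title}: {d:.2f}")
```

For A Beautiful Mind, the rating differs by 1.0 and the History flag differs by 1, so the distance is sqrt(1.0² + 1²) ≈ 1.41, which agrees with the value reported for that movie in the notebook's results.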


In [209]:
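#Here I am using a function to return 'n' number of similar movies given an input vector using knn

def recommend_movies(input_vector, model, X, titles, n=5):
    #Wrap the vector in a one-row DataFrame so its column names match the fitted data
    input_df = pd.DataFrame([input_vector], columns=X.columns)
    distances, indices = model.kneighbors(input_df, n_neighbors=n+1)

    #Note: position 0 (the closest match) is skipped on the assumption that it is the query movie itself.
    #If the query movie is not a row in X, this skips a genuine nearest neighbor.
    recommendations = []
    for i in range(1, n + 1):
        idx = indices[0][i]
        score = distances[0][i]  #Euclidean distance: smaller means more similar
        recommendations.append((titles.iloc[idx], score))

    #Then we sort by distance, closest first
    return sorted(recommendations, key=lambda x: x[1])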
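
Skipping the first neighbor is only appropriate when the query movie is itself a row in the fitted data. As a hedged alternative, the sketch below drops a neighbor only when its distance is effectively zero. The helper name `top_neighbors` and the tolerance `tol` are hypothetical; the function works on plain sequences shaped like one row of the `distances` and `indices` arrays returned by `kneighbors`.

```python
def top_neighbors(distances, indices, n=5, tol=1e-9):
    """Return up to n (index, distance) pairs, closest first.

    Assumption: a neighbor at distance <= tol is the query movie itself
    (or an exact duplicate of it) and is therefore not a useful recommendation.
    """
    pairs = [(idx, dist) for idx, dist in zip(indices, distances) if dist > tol]
    pairs.sort(key=lambda pair: pair[1])
    return pairs[:n]


# Made-up numbers for illustration only (not taken from the data set).
# Case 1: the query is in the data, so the first neighbor has distance 0 and is dropped.
print(top_neighbors([0.0, 0.5, 0.8, 1.1], [7, 3, 9, 1], n=2))  # [(3, 0.5), (9, 0.8)]

# Case 2: the query is not in the data, so the closest neighbor is kept.
print(top_neighbors([0.4, 0.5, 0.8, 1.1], [7, 3, 9, 1], n=2))  # [(7, 0.4), (3, 0.5)]
```

With the notebook's objects, this could be called as `top_neighbors(distances[0], indices[0], n=5)` after requesting `n + 1` neighbors. One caveat: a different movie with features identical to the query would also be dropped.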


In [215]:
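#The input vector for 'The Post', in the same column order as new_df
the_post_vector = [7.2, 1, 1, 0, 0, 0, 0, 1]
#Call recommend_movies to find the closest movies
results = recommend_movies(the_post_vector, knn, new_df, movie_titles)

#Print each title with its score; the score is a Euclidean distance, so smaller means more similar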
for title, score in results:
    print(f"{title} — Similarity Score: {score:.2f}")


Hacksaw Ridge — Similarity Score: 1.00
Queen of Katwe — Similarity Score: 1.02
The Wind Rises — Similarity Score: 1.17
A Beautiful Mind — Similarity Score: 1.41
The Karate Kid — Similarity Score: 1.41


__What recommendations will he/she see?__

Implement this problem using Python scikit-learn and display the answer within the Notebook with 
proper narrative / comments. 

In [217]:
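#To show our result of the 5 most similar movies
#(recommended_movies is not defined in the cells shown above; the next line is an assumed reconstruction
# that matches the printed output: the neighbors from In [199], with the closest match skipped)
recommended_movies = movie_titles.iloc[indices[0][1:]]

print("Recommended Movies:\n", recommended_movies)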

Recommended Movies:
 27       Hacksaw Ridge
29      Queen of Katwe
16      The Wind Rises
2     A Beautiful Mind
9       The Karate Kid
Name: Movie Name, dtype: object
