# CA05 – kNN-Based Movie Recommender Engine

## 1. The Application

At scale, this would look like recommending products on Amazon, articles on Medium, movies on Netflix, or videos on YouTube. Those platforms almost certainly use more efficient methods, given the enormous volume of data they process. We can, however, replicate one of these recommender systems on a smaller scale using the k-nearest neighbors (kNN) algorithm. Let us build the core of a movie recommender system.

What question are we trying to answer?

Given a movie data set, what are the 5 movies most similar to a query movie?

## 2. Data Source and Contents

In [None]:
# Load libraries needed
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors

In [None]:
# Load in data
df = pd.read_csv("https://github.com/ArinB/MSBA-CA-Data/raw/main/CA05/movies_recommendation_data.csv")

In [None]:
# Check first few rows of the data
df.head()

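|   | Movie ID | Movie Name | IMDB Rating | Biography | Drama | Thriller | Comedy | Crime | Mystery | History | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | The Imitation Game | 8.0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 8 | Ex Machina | 7.7 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 46 | A Beautiful Mind | 8.2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 62 | Good Will Hunting | 8.3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 97 | Forrest Gump | 8.8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |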


In [None]:
# Check data types of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Movie ID     30 non-null     int64  
 1   Movie Name   30 non-null     object 
 2   IMDB Rating  30 non-null     float64
 3   Biography    30 non-null     int64  
 4   Drama        30 non-null     int64  
 5   Thriller     30 non-null     int64  
 6   Comedy       30 non-null     int64  
 7   Crime        30 non-null     int64  
 8   Mystery      30 non-null     int64  
 9   History      30 non-null     int64  
 10  Label        30 non-null     int64  
dtypes: float64(1), int64(9), object(1)
memory usage: 2.7+ KB


In [None]:
# Check for nulls
df.isnull().sum()

Movie ID       0
Movie Name     0
IMDB Rating    0
Biography      0
Drama          0
Thriller       0
Comedy         0
Crime          0
Mystery        0
History        0
Label          0
dtype: int64

## 3. Building Your Own Recommender System

In [None]:
# Slice the dataframe to just include the data needed in the model
model_data = df.iloc[:,2:9]
model_data

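|   | IMDB Rating | Biography | Drama | Thriller | Comedy | Crime | Mystery |
|---|---|---|---|---|---|---|---|
| 0 | 8.0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | 7.7 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 8.2 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 8.3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 8.8 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | 6.8 | 0 | 1 | 0 | 0 | 1 | 0 |
| 6 | 7.6 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7 | 5.9 | 0 | 1 | 0 | 0 | 0 | 1 |
| 8 | 7.9 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 7.2 | 0 | 1 | 0 | 0 | 0 | 0 |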


In [None]:
# Check dimensions of the dataset for the model
model_data.shape

(30, 7)
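
The positional slice `2:9` keeps the seven columns from `IMDB Rating` through `Mystery`, so `History` and `Label` are not part of the model. As a minimal sketch, the same selection can be written by column label, which makes that choice visible in the code. The two-row frame below is hypothetical and only mirrors the column layout of the movie data.

```python
import pandas as pd

# Hypothetical two-row frame with the same column layout as the movie data
toy = pd.DataFrame(
    [[58, "Movie A", 8.0, 1, 1, 1, 0, 0, 0, 0, 0],
     [8, "Movie B", 7.7, 0, 1, 0, 0, 0, 1, 0, 0]],
    columns=["Movie ID", "Movie Name", "IMDB Rating", "Biography", "Drama",
             "Thriller", "Comedy", "Crime", "Mystery", "History", "Label"],
)

# Positional slice: columns at positions 2 through 8 (the end, 9, is excluded)
by_position = toy.iloc[:, 2:9]

# Label slice: both end labels are included
by_label = toy.loc[:, "IMDB Rating":"Mystery"]

print(list(by_label.columns))
```

Both expressions select the same columns; the label form simply spells out where the feature set starts and stops.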

Since the dataset is small in both sample size and dimensionality (N = 30, D = 7), we will use the brute-force algorithm for our kNN model. Per the instructions, we will use k = 5, since we want to find the 5 movies most similar to our query movie.
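
To make "brute force" concrete, here is a minimal sketch that computes the Euclidean distance from a query to every row by hand and compares the result with scikit-learn's brute-force search. The toy matrix `X`, the `query` point, and `k = 3` are assumptions for illustration and are not taken from the movie data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy feature matrix: 6 points, 2 features (not the movie data)
X = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 3.0],
    [5.0, 1.0],
    [4.0, 4.0],
])
query = np.array([[0.9, 0.2]])
k = 3

# Brute force by hand: distance from the query to every row,
# then keep the positions of the k smallest distances
dists = np.sqrt(((X - query) ** 2).sum(axis=1))
manual_idx = np.argsort(dists)[:k]

# The same search through scikit-learn's brute-force backend
# (default metric is Euclidean)
nn = NearestNeighbors(n_neighbors=k, algorithm="brute").fit(X)
sk_dist, sk_idx = nn.kneighbors(query)

print(manual_idx)
print(sk_idx[0])
```

With only 30 rows, comparing the query against every row like this is cheap, which is why no tree-based index is needed here.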

In [None]:
# Create the model: k = 5 neighbors, exhaustive (brute-force) search
neigh = NearestNeighbors(n_neighbors=5, algorithm="brute")
# Fit the model, then query it with the same data as a sanity check.
# Each row of `indices` holds the positions of the 5 movies nearest to that movie;
# because the query rows are the fitted rows, each movie appears in its own row.
neigh.fit(model_data)
distances, indices = neigh.kneighbors(model_data)
indices

array([[ 0, 28, 16,  2, 27],
       [ 1,  6, 18, 21, 10],
       [27,  2, 28, 16, 29],
       [ 3, 12,  4,  6, 18],
       [ 4, 12,  3, 15, 17],
       [ 5,  9, 10, 21, 18],
       [ 6, 21, 18, 10,  9],
       [ 7, 20, 10,  9,  5],
       [ 8, 22, 24, 14, 19],
       [ 9, 10, 21, 18,  6],
       [ 9, 10, 21, 18,  6],
       [11,  5, 21, 18,  6],
       [12,  4,  3,  6, 17],
       [13, 23, 25, 27,  2],
       [14, 19, 26,  8, 22],
       [15, 17, 24, 22,  8],
       [16, 28, 29,  2, 27],
       [17, 15, 24, 22,  8],
       [21, 18, 10,  9,  6],
       [19, 14, 26,  8, 22],
       [20, 26, 19,  7, 14],
       [21, 18, 10,  9,  6],
       [22,  8, 24, 14, 17],
       [23, 25, 13, 19, 26],
       [24, 22,  8, 17, 14],
       [25,  8, 22, 24, 14],
       [26, 19, 14,  8, 22],
       [27,  2, 28, 16, 29],
       [28,  2, 27, 16, 29],
       [29, 16, 28, 27,  2]])
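
Each row of `indices` contains the movie's own position, though not always first (see the row `[27,  2, 28, 16, 29]`, presumably a tie in distance). As a minimal sketch of turning such an array into readable recommendations, the helper below maps positions to titles and skips the row itself. The toy catalogue and the `similar_titles` helper are hypothetical.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy catalogue (titles and features invented for illustration)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "rating": [8.0, 7.9, 6.0, 6.2, 9.5],
    "drama": [1, 1, 0, 0, 1],
})
features = toy[["rating", "drama"]]

# Fit on the toy features and query with the same rows
nn = NearestNeighbors(n_neighbors=3, algorithm="brute").fit(features)
distances, indices = nn.kneighbors(features)

def similar_titles(row):
    """Return the neighbor titles for one row, skipping the row itself."""
    return [toy["title"].iloc[j] for j in indices[row] if j != row]

print(similar_titles(0))
print(similar_titles(2))
```

Filtering on `j != row` rather than dropping the first entry keeps the helper correct when a tie places another movie ahead of the query.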

In [None]:
# Append our test point ("The Post") to the original dataframe
df.loc[len(df.index)] = [31,"The Post",7.2,1,1,0,0,0,0,1,0]

In [None]:
# Include the new point in the condensed dataframe
test_data = model_data = df.iloc[:,2:9]

In [None]:
# Run the kNN Model on "The Post" to find the 5 movies most similar to it
results = neigh.kneighbors(test_data)[1][0]
for x in results:
  print(df["Movie Name"].iloc[x])

The Imitation Game
12 Years a Slave
The Wind Rises
A Beautiful Mind
Hacksaw Ridge


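The 5 movies above are what the cell printed. Note, however, that `test_data` contains every row of the dataframe and `neigh.kneighbors(test_data)[1][0]` selects the neighbors of its first row, so this list is the neighbor set of "The Imitation Game" (it matches the first row of `indices`, `[0, 28, 16, 2, 27]`), not of "The Post". To get the recommendations for "The Post", query only the appended row, for example with `neigh.kneighbors(test_data.iloc[[-1]])[1][0]`.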
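
A minimal sketch of querying a fitted `NearestNeighbors` model with a single new point: the query is passed as a one-row frame with the same feature columns, so the returned index array has exactly one row. The toy catalogue, its titles, and the column names are hypothetical.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy catalogue (titles and features invented for illustration)
catalogue = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": [7.0, 7.5, 5.0, 9.0],
    "biography": [1, 1, 0, 0],
})
feature_cols = ["rating", "biography"]

nn = NearestNeighbors(n_neighbors=2, algorithm="brute").fit(catalogue[feature_cols])

# One query row with the same feature columns; it is never added to the fitted data
query = pd.DataFrame([[7.2, 1]], columns=feature_cols)
distances, indices = nn.kneighbors(query)

# indices has shape (1, 2): one row of neighbor positions for the one query row
recommended = catalogue["title"].iloc[indices[0]].tolist()
print(recommended)
```

Keeping the query separate from the fitted data means `indices[0]` always refers to the query itself, and the positions map directly back to rows of the catalogue.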