<h1>Movie Recommender System</h1>
<h4>Author: Chang Dakota Sum Kiu</h4>
<h4>Last Modified: 10/09/2021</h4>


---


<p>This notebook walks you through building a basic movie recommender system with a GroupLens (MovieLens) dataset. It covers some exploratory analysis and ends with a simple, easy-to-use recommender.</p>

<h3><u><b>What are recommender systems?</b></u></h3>
<p>Recommender systems are part of everyday life: TikTok, YouTube, Netflix, Spotify, and many other services use them to decide what content to show their users.</p>

<h4><u>Types of Recommender Systems</u></h4>

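The two main types are content-based filtering, which recommends items with similar attributes (such as genre), and collaborative filtering, which recommends items based on how many users have rated them. Today, we will build a recommender from user ratings alone, which makes it an item-based collaborative filtering system rather than a content-based one. The dataset comes from https://grouplens.org/datasets/movielens/ . It is curated by GroupLens, a research lab at the University of Minnesota, and contains user ratings for many movies.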

<img src='https://www.researchgate.net/profile/Lionel-Ngoupeyou-Tondji/publication/323726564/figure/fig5/AS:631605009846299@1527597777415/Content-based-filtering-vs-Collaborative-filtering-Source.png'>



---


<h1>Code</h1>

In [None]:
# @title import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
!git clone https://github.com/freezingMonkeys/freezingMonkeysPythonTrack

fatal: destination path 'freezingMonkeysPythonTrack' already exists and is not an empty directory.


In [None]:
# @title Process datasets
movies = pd.read_csv('/content/freezingMonkeysPythonTrack/files/movies.csv')
ratings = pd.read_csv('/content/freezingMonkeysPythonTrack/files/ratings.csv')

In [None]:
movies.head(2)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy


In [None]:
ratings.head(2)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247


In [None]:
# Merge the dataframes on their shared movieId column for easier processing
df = pd.merge(movies, ratings, on="movieId")

In [None]:
df.head(2)

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962


In [None]:
# drop the columns that a ratings-based recommender does not need
df = df.drop(columns=["genres", "timestamp"])

In [None]:
df.head(2)

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story (1995),1,4.0
1,1,Toy Story (1995),5,4.0


In [None]:
# creates a dataframe with total rating count of each movie 
# for example, if 5 people rated Toy Story, the count will show up as 5
film_df = df.groupby("movieId")["rating"].count().reset_index().rename(columns = {"rating": "total_rating_count"})

In [None]:
film_df.head(2)

Unnamed: 0,movieId,total_rating_count
0,1,215
1,2,110


In [None]:
# attach each movie's total rating count to every one of its rating rows
new_df = pd.merge(df, film_df, on="movieId")

In [None]:
new_df.describe()

Unnamed: 0,movieId,userId,rating,total_rating_count
count,100836.0,100836.0,100836.0,100836.0
mean,19435.295718,326.127564,3.501557,58.755801
std,35530.987199,182.618491,1.042529,61.96667
min,1.0,1.0,0.5,1.0
25%,1199.0,177.0,3.0,13.0
50%,2991.0,325.0,3.5,39.0
75%,8122.0,477.0,4.0,84.0
max,193609.0,610.0,5.0,329.0


In [None]:
# keep only movies with at least 30 ratings; a movie rated by only a handful of users
# does not give enough information to compare it reliably with other movies
data = new_df[new_df["total_rating_count"] >= 30]
data = data.reset_index(drop=True)

In [None]:
data.sample(5)

Unnamed: 0,movieId,title,userId,rating,total_rating_count
41095,4226,Memento (2000),68,4.0,159
38175,3499,Misery (1990),19,3.0,44
29528,2054,"Honey, I Shrunk the Kids (1989)",577,3.0,68
18179,1090,Platoon (1986),83,1.5,63
25049,1517,Austin Powers: International Man of Mystery (1...,408,4.0,100
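
As a side note, the count, merge, and threshold steps above can also be sketched in a more compact form with `groupby(...).transform("count")`, which writes the per-movie count straight onto each rating row. The toy data and the `min_ratings` name are hypothetical and only serve the illustration; the notebook itself uses a threshold of 30.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the MovieLens files
toy = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3],
    "userId":  [10, 11, 12, 10, 13, 11],
    "rating":  [4.0, 3.5, 5.0, 2.0, 3.0, 4.5],
})

# transform("count") returns one value per original row, so no merge is needed
toy["total_rating_count"] = toy.groupby("movieId")["rating"].transform("count")

# Hypothetical threshold for the toy data (the notebook uses 30)
min_ratings = 2
popular = toy[toy["total_rating_count"] >= min_ratings].reset_index(drop=True)
print(popular)
```

Movie 3 has a single rating, so it is the only one filtered out here.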




---
<h1>Processing the data into a usable form for the machine learning algorithm</h1>


In [None]:
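# pivot into a matrix with one row per movie title and one column per user,
# then fill the missing ratings with 0 (0 means "not rated") so that
# csr_matrix can leave those entries out of the sparse matrix
data = data.pivot_table(index="title", columns="userId", values="rating")
data = data.fillna(0)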
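
To make the reshaping step concrete, here is a minimal sketch on a hypothetical toy table (the titles, user ids, and ratings below are made up for illustration and are not from the MovieLens files). It uses the same `pivot_table` and `fillna` calls as the cell above.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (title, user) pair
toy = pd.DataFrame({
    "title":  ["A", "A", "B", "B", "C"],
    "userId": [1, 2, 1, 3, 2],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Wide format: one row per title, one column per user, NaN where a user gave no rating
toy_matrix = toy.pivot_table(index="title", columns="userId", values="rating")

# Replace the missing ratings with 0, which here stands for "not rated"
toy_matrix = toy_matrix.fillna(0)
print(toy_matrix)
```

Keep in mind that the 0 is only a placeholder: the actual rating scale in this dataset starts at 0.5, so a 0 never represents a real rating.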

In [None]:
# CSR (compressed sparse row) matrix: stores only the non-zero ratings, which saves memory
# and lets NearestNeighbors work directly on the sparse data
from scipy.sparse import csr_matrix
features = csr_matrix(data)
features

<882x609 sparse matrix of type '<class 'numpy.float64'>'
	with 58018 stored elements in Compressed Sparse Row format>
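
The output above reports 58018 stored elements in an 882x609 matrix, so only about 11% of the cells hold a rating. The sketch below, on a small made-up array, shows how CSR keeps just the non-zero values and how the density can be computed from the `nnz` and `shape` attributes.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x3 rating matrix where most entries are 0 ("not rated")
dense = np.array([
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 5.0],
    [0.0, 3.0, 0.0],
])
sparse = csr_matrix(dense)

# CSR stores three arrays: the non-zero values, their column indices,
# and row pointers marking where each row starts in the first two arrays
print(sparse.data)     # non-zero values, row by row
print(sparse.indices)  # column index of each stored value
print(sparse.indptr)   # row boundaries

# Fraction of cells that actually hold a value
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(density)
```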

In [None]:
# this recommender uses the k-nearest neighbors (KNN) algorithm with cosine distance to find
# movies whose rating patterns across users are similar (item-based collaborative filtering)
# for more information see KNN notebook (coming soon)
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(metric="cosine")
knn.fit(features)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)
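
To illustrate what the cosine metric measures, here is a small sketch on made-up rating rows (the array and variable names are hypothetical). Cosine distance is 1 minus the cosine of the angle between two rating vectors, so movies rated similarly by the same users end up close to 0, while movies with no raters in common end up at 1.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows are movies, columns are users (hypothetical ratings, 0 = not rated)
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],  # movie 0
    [4.0, 5.0, 0.0, 0.0],  # movie 1: rated by the same users as movie 0
    [0.0, 0.0, 5.0, 4.0],  # movie 2: rated by entirely different users
])

def cosine_distance(a, b):
    """1 - cos(angle between a and b), computed by hand for comparison."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

manual_01 = cosine_distance(toy_ratings[0], toy_ratings[1])  # small: similar pattern
manual_02 = cosine_distance(toy_ratings[0], toy_ratings[2])  # 1.0: no overlap

# The same comparison through NearestNeighbors, querying with movie 0
toy_knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(toy_ratings)
distances, indices = toy_knn.kneighbors(toy_ratings[[0]])
print(indices[0], distances[0])
```

The first neighbor returned is the query movie itself (distance 0), which is why the recommendation step has to skip it.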

In [None]:
# @title Find Your Favorite Movie!
movieName = 'Toy Story' #@param {type:'string'}
movieYear = '1995' #@param {type:'string'}
data.loc[f'{movieName} ({movieYear})']

userId
1      4.0
2      0.0
3      0.0
4      0.0
5      4.0
      ... 
606    2.5
607    4.0
608    2.5
609    3.0
610    5.0
Name: Toy Story (1995), Length: 609, dtype: float64

In [None]:
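# @title Get the recommended movies
# kneighbors returns (distances, indices) for the single query row
results = knn.kneighbors(data.loc[f'{movieName} ({movieYear})'].values.reshape(1,-1))
# keep the indices and skip the first neighbor, which is the query movie itself
results = results[1][0][1:]
for movie in results:
  # prints the full rating row of each recommended movie; its title is on the "Name:" line
  print(data.iloc[movie])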

userId
1      0.0
2      0.0
3      0.0
4      0.0
5      0.0
      ... 
606    0.0
607    3.0
608    2.5
609    0.0
610    5.0
Name: Toy Story 2 (1999), Length: 609, dtype: float64
userId
1      4.0
2      0.0
3      0.0
4      0.0
5      0.0
      ... 
606    2.5
607    4.0
608    3.0
609    3.0
610    5.0
Name: Jurassic Park (1993), Length: 609, dtype: float64
userId
1      3.0
2      0.0
3      0.0
4      0.0
5      0.0
      ... 
606    2.5
607    4.0
608    3.0
609    0.0
610    3.5
Name: Independence Day (a.k.a. ID4) (1996), Length: 609, dtype: float64
userId
1      5.0
2      0.0
3      0.0
4      5.0
5      0.0
      ... 
606    4.5
607    3.0
608    3.5
609    0.0
610    5.0
Name: Star Wars: Episode IV - A New Hope (1977), Length: 609, dtype: float64
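
The output above prints every user's rating for each recommended movie, although only the title on the `Name:` line is usually of interest. A minimal sketch of a helper that returns just the titles and their cosine distances could look like the following. The function name `recommend_titles`, its parameters, and the toy movies are all hypothetical and not part of the original notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical title x user rating matrix (0 = not rated)
toy_data = pd.DataFrame(
    [[5.0, 4.0, 0.0, 0.0],
     [4.0, 5.0, 0.0, 0.0],
     [0.0, 0.0, 5.0, 4.0],
     [0.0, 1.0, 4.0, 5.0]],
    index=["Movie A (2000)", "Movie B (2001)", "Movie C (2002)", "Movie D (2003)"],
    columns=[1, 2, 3, 4],
)
toy_model = NearestNeighbors(metric="cosine").fit(csr_matrix(toy_data.values))

def recommend_titles(matrix, model, title, n_recommendations=2):
    """Return (title, cosine distance) pairs for the movies closest to `title`."""
    if title not in matrix.index:
        raise KeyError(f"{title!r} is not in the rating matrix")
    query = matrix.loc[title].values.reshape(1, -1)
    # ask for one extra neighbor because the query movie is returned as well
    distances, indices = model.kneighbors(query, n_neighbors=n_recommendations + 1)
    recommendations = []
    for dist, idx in zip(distances[0], indices[0]):
        candidate = matrix.index[idx]
        if candidate != title:  # drop the query movie from its own results
            recommendations.append((candidate, float(dist)))
    return recommendations[:n_recommendations]

recs = recommend_titles(toy_data, toy_model, "Movie A (2000)")
for rec_title, rec_distance in recs:
    print(f"{rec_title}: distance {rec_distance:.3f}")
```

With the notebook's own objects, the equivalent call would be along the lines of `recommend_titles(data, knn, f'{movieName} ({movieYear})', n_recommendations=4)`, assuming the helper is defined in the same session.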
