# Movie Recommendation System
### By: Albert Wijaya

## Background
Movie streaming services have grown rapidly over the past decade. Netflix's subscriber base grew from 21.5 million in 2011 to 209 million in 2021, and its revenue grew from $3.1 billion to $24.9 billion over the same period. The number of subscribers is predicted to grow by around 10% each year for the next five years. For a movie streaming platform, it is important to keep subscribers watching, and one way to retain them is to keep recommending new movies they might like. This is where a recommendation system comes in.

## Goal
Our goal is to create a movie recommendation system that suggests new movies based on the movies a subscriber has already watched.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from ast import literal_eval
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# load the dataset
df_movies = pd.read_csv('movies_metadata.csv')
df_ratings = pd.read_csv('ratings_small.csv')



In [3]:
df_movies.head(3)

,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [4]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [5]:
df_movies.iloc[0]

adult                                                                False
belongs_to_collection    {'id': 10194, 'name': 'Toy Story Collection', ...
budget                                                            30000000
genres                   [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
homepage                              http://toystory.disney.com/toy-story
id                                                                     862
imdb_id                                                          tt0114709
original_language                                                       en
original_title                                                   Toy Story
overview                 Led by Woody, Andy's toys live happily in his ...
popularity                                                       21.946943
poster_path                               /rhIRbceoE9lR4veEXuwCC2wARtG.jpg
production_companies        [{'name': 'Pixar Animation Studios', 'id': 3}]
production_countries     

In [6]:
df_movies['original_title'].nunique()

43373

The number of unique movie titles (43,373) is lower than the number of rows in the dataset (45,466), so some movies share the same title.

In [7]:
# count how often each title occurs and keep the titles that appear more than once
df_same_ori_title = df_movies.groupby('original_title').count()['adult']
df_same_ori_title[df_same_ori_title > 1].sort_values()

original_title
12 Angry Men            2
Spiders                 2
Spider                  2
Spellbound              2
Speedway                2
                       ..
Cinderella              7
The Three Musketeers    7
Les Misérables          7
Alice in Wonderland     8
Hamlet                  8
Name: adult, Length: 1661, dtype: int64

In [8]:
df_movies[df_movies['original_title'] == '12 Angry Men']

,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
1161,False,,350000,"[{'id': 18, 'name': 'Drama'}]",,389,tt0050083,en,12 Angry Men,The defense and the prosecution have rested an...,...,1957-03-25,1000000.0,96.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Life is in their hands. Death is on their minds.,12 Angry Men,False,8.2,2130.0
15200,False,,0,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",,12219,tt0118528,en,12 Angry Men,During the trial of a man accused of his fathe...,...,1997-08-17,0.0,117.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,,12 Angry Men,False,7.5,59.0


On closer inspection, movies that share a title are distinct films with different `production_companies` and `release_date` values.
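
Since a title alone does not uniquely identify a film, one option is to build a label that combines the title with the release year. The following is a minimal sketch on toy rows modelled on the `12 Angry Men` entries above; the `toy_movies` frame and the `display_title` column are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# toy rows mirroring the two '12 Angry Men' entries plus one unrelated film
toy_movies = pd.DataFrame({
    'id': ['389', '12219', '862'],
    'title': ['12 Angry Men', '12 Angry Men', 'Toy Story'],
    'release_date': ['1957-03-25', '1997-08-17', '1995-10-30'],
})

# parse the year; missing or invalid dates become NaT instead of raising
release_year = pd.to_datetime(toy_movies['release_date'], errors='coerce').dt.year

# hypothetical label of the form "Title (YYYY)"
toy_movies['display_title'] = (
    toy_movies['title'] + ' (' + release_year.astype('Int64').astype(str) + ')'
)

print(toy_movies[['id', 'display_title']])
```

With such a label, the two films no longer collide when titles are used as column names.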

In [9]:
df_ratings.head()

,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [10]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100004 non-null  int64  
 1   movieId    100004 non-null  int64  
 2   rating     100004 non-null  float64
 3   timestamp  100004 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [11]:
# add movie titles to df_ratings
# df_movies['id'] is stored as strings, so cast movieId to str before joining
df_ratings['movieId'] = df_ratings['movieId'].astype(str)
df_ratings_with_titles = pd.merge(
    left=df_ratings,
    right=df_movies[['id', 'title']],
    how='inner',
    left_on='movieId',
    right_on='id'
)
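
The cast to `str` matters because the two join keys need a common dtype, and an inner join silently drops ratings whose `movieId` has no match in the metadata. The sketch below, on made-up toy frames, shows one way to check how many ratings would be lost; the frames, the id `999999`, and the variable `n_unmatched` are assumptions for illustration only.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [862, 8844, 999999],   # integer ids, as in df_ratings
    'rating': [4.0, 3.5, 5.0],
})
toy_movies = pd.DataFrame({
    'id': ['862', '8844', '15602'],   # string ids, as in df_movies
    'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men'],
})

# align the key dtypes before merging
toy_ratings['movieId'] = toy_ratings['movieId'].astype(str)

# a left join with indicator=True keeps unmatched ratings visible
merged = toy_ratings.merge(
    toy_movies, how='left', left_on='movieId', right_on='id', indicator=True
)
n_unmatched = (merged['_merge'] == 'left_only').sum()

print(merged[['userId', 'movieId', 'title', '_merge']])
print('ratings without metadata:', n_unmatched)
```

Running the same check on the real frames would show how many ratings the inner join discards.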

In [12]:
# generate user-movie rating matrices (rows: users, columns: movies)
# movies a user has not rated are filled with 0
df_user_movie_matrix = df_ratings_with_titles.pivot_table(
    index='userId', columns='title', values='rating', fill_value=0
)

df_user_movie_id_matrix = df_ratings_with_titles.pivot_table(
    index='userId', columns='movieId', values='rating', fill_value=0
)
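
One detail worth keeping in mind: `pivot_table` aggregates duplicate index/column pairs with the mean by default, so pivoting on `title` merges different films that share a title, while pivoting on `movieId` keeps them apart. A small sketch on toy data (the ratings below are invented for illustration):

```python
import pandas as pd

# two different films with the same title, rated by two users
toy = pd.DataFrame({
    'userId':  [1, 1, 2],
    'movieId': ['389', '12219', '389'],
    'title':   ['12 Angry Men', '12 Angry Men', '12 Angry Men'],
    'rating':  [5.0, 3.0, 4.0],
})

by_title = toy.pivot_table(index='userId', columns='title', values='rating', fill_value=0)
by_id = toy.pivot_table(index='userId', columns='movieId', values='rating', fill_value=0)

print(by_title)  # user 1's two ratings collapse into their mean
print(by_id)     # one column per film; missing ratings become 0
```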

In [13]:
df_user_movie_matrix

userId,!Women Art Revolution,'Gator Bait,'Twas the Night Before Christmas,...And God Created Woman,00 Schneider - Jagd auf Nihil Baxter,10 Items or Less,10 Things I Hate About You,"10,000 BC",11'09''01 - September 11,12 Angry Men,...,Zodiac,Zombie Flesh Eaters,Zombie Holocaust,Zozo,eXistenZ,xXx,¡Three Amigos!,À nos amours,Ödipussi,Şaban Oğlu Şaban
1,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0,...,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
2,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0,...,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
3,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0,...,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
4,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0,...,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
5,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0,...,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0,...,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
668,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0,...,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
669,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0,...,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
670,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0,...,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0


In [14]:
df_user_movie_id_matrix

userId,100,100017,100032,100272,100450,101,101362,1018,101904,102,...,987,988,99,990,991,99106,992,994,996,99846
1,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0
2,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0
3,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0
4,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0
5,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0
668,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0
669,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0
670,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0


In [15]:
# extract genre names from the movies metadata
# each movie's genres value is a list of dicts such as {'id': 16, 'name': 'Animation'}

def get_genre_list(genres):
    # return the genre names, or an empty list if the value is not a list
    if isinstance(genres, list):
        return [item['name'] for item in genres]
    return []

df_movie_genres = df_movies[
    df_movies['id'].isin(df_ratings['movieId'].unique())
][['id', 'genres']].copy()

df_movie_genres['genres'] = df_movie_genres['genres'].apply(literal_eval).apply(get_genre_list)

In [16]:
df_movie_genres.head()

,id,genres
5,949,"[Action, Crime, Drama, Thriller]"
9,710,"[Adventure, Action, Thriller]"
14,1408,"[Action, Adventure]"
15,524,"[Drama, Crime]"
16,4584,"[Drama, Romance]"
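
The `genres` column is stored as a string, which is why `literal_eval` is applied before `get_genre_list`. If some rows were malformed or missing, `literal_eval` would raise. The sketch below shows one defensive variant; `safe_parse_genres` is a hypothetical helper that is not part of the original notebook.

```python
from ast import literal_eval

def safe_parse_genres(raw):
    """Hypothetical helper: parse a genres string, returning [] on bad input."""
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # malformed strings, NaN and other non-string values end up here
        return []
    if not isinstance(parsed, list):
        return []
    return [item['name'] for item in parsed
            if isinstance(item, dict) and 'name' in item]

raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(safe_parse_genres(raw_genres))
print(safe_parse_genres('not a list'))
print(safe_parse_genres(float('nan')))
```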


In [17]:
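# generate movie-feature matrix (feature: genre)
df_movie_genres_id = df_movie_genres.set_index('id')
# expand each genre list into columns, then stack into one row per (movie id, genre) pair
df_movie_genres_stacked = df_movie_genres_id['genres'].apply(pd.Series).stack()
# one-hot encode the genre names and sum per movie id to get one row per movie
df_movie_feature_matrix = pd.get_dummies(df_movie_genres_stacked).groupby(level=0).sum()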

In [18]:
df_movie_genres_id

id,genres
949,"[Action, Crime, Drama, Thriller]"
710,"[Adventure, Action, Thriller]"
1408,"[Action, Adventure]"
524,"[Drama, Crime]"
4584,"[Drama, Romance]"
...,...
80831,[Drama]
3104,"[Horror, Science Fiction]"
64197,"[Romance, Drama]"
98604,"[Comedy, Romance]"


In [19]:
df_movie_genres_stacked

id      
949    0       Action
       1        Crime
       2        Drama
       3     Thriller
710    0    Adventure
              ...    
98604  0       Comedy
       1      Romance
49280  0      Fantasy
       1       Action
       2     Thriller
Length: 6764, dtype: object

In [20]:
df_movie_feature_matrix

id,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
100,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
100017,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
100032,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
100272,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0
101,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99106,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
992,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0
994,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0
996,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0
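
A matrix of the same shape as the one above can also be built with `Series.explode`, which avoids expanding every list into a wide intermediate frame. This is an alternative sketch on toy data, not the notebook's own method; `toy_genres` and `toy_feature_matrix` are illustrative names.

```python
import pandas as pd

# toy genre lists keyed by movie id
toy_genres = pd.Series(
    {'949': ['Action', 'Crime'], '710': ['Adventure', 'Action'], '524': ['Drama']},
    name='genres',
)
toy_genres.index.name = 'id'

# one row per (movie id, genre) pair; the id is repeated in the index
exploded = toy_genres.explode()

# one-hot encode the genre names, then sum the rows that belong to the same id
toy_feature_matrix = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()

print(toy_feature_matrix)
```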


In [21]:
# pick one user to give recommendation to
curr_user_id = 15

# get the ids of the movies this user has already rated
movies_watched_by_curr_user = df_ratings_with_titles[
    df_ratings_with_titles['userId'] == curr_user_id
]['movieId'].unique()

In [22]:
# generate similarity matrix: pairwise cosine similarity between the movies' genre vectors
df_cosine_matrix = pd.DataFrame(
    data=cosine_similarity(X=df_movie_feature_matrix),
    columns=df_movie_feature_matrix.index.tolist(),
    index=df_movie_feature_matrix.index.tolist()
)

In [23]:
df_cosine_matrix

,100,100017,100032,100272,101,101362,1018,101904,102,102165,...,987,988,99,990,991,99106,992,994,996,99846
100,1.000000,0.000000,0.000000,0.408248,0.408248,0.816497,0.000000,0.000000,0.000000,0.000000,...,0.408248,0.000000,0.500000,0.000000,0.000000,0.00000,0.353553,0.353553,0.000000,0.408248
100017,0.000000,1.000000,0.707107,0.577350,0.577350,0.000000,0.577350,0.707107,0.707107,0.707107,...,0.577350,0.707107,0.707107,1.000000,0.707107,0.00000,0.500000,0.500000,0.577350,0.577350
100032,0.000000,0.707107,1.000000,0.408248,0.408248,0.000000,0.408248,0.500000,0.500000,0.500000,...,0.408248,0.500000,0.500000,0.707107,0.500000,0.00000,0.353553,0.353553,0.408248,0.816497
100272,0.408248,0.577350,0.408248,1.000000,0.333333,0.333333,0.333333,0.408248,0.408248,0.408248,...,0.666667,0.408248,0.816497,0.577350,0.408248,0.57735,0.577350,0.288675,0.333333,0.333333
101,0.408248,0.577350,0.408248,0.333333,1.000000,0.333333,0.666667,0.408248,0.408248,0.408248,...,0.333333,0.816497,0.408248,0.577350,0.408248,0.00000,0.288675,0.866025,0.666667,0.666667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99106,0.000000,0.000000,0.000000,0.577350,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,1.00000,0.000000,0.000000,0.000000,0.000000
992,0.353553,0.500000,0.353553,0.577350,0.288675,0.288675,0.577350,0.707107,0.353553,0.353553,...,0.577350,0.353553,0.707107,0.500000,0.353553,0.00000,1.000000,0.500000,0.577350,0.288675
994,0.353553,0.500000,0.353553,0.288675,0.866025,0.288675,0.866025,0.353553,0.353553,0.353553,...,0.288675,0.707107,0.353553,0.500000,0.353553,0.00000,0.500000,1.000000,0.866025,0.577350
996,0.000000,0.577350,0.408248,0.333333,0.666667,0.000000,1.000000,0.408248,0.408248,0.408248,...,0.333333,0.816497,0.408248,0.577350,0.408248,0.00000,0.577350,0.866025,1.000000,0.333333
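
To make the numbers in the matrix above concrete, the entry for movies `100` and `100272` can be reproduced by hand. According to the movie-feature matrix, movie `100` has the genres Comedy and Crime, and movie `100272` has Comedy, Drama and Horror. They share one genre, so the cosine similarity is 1 / (√2 · √3) ≈ 0.408248. The sketch below checks this with NumPy only; the helper `cosine` and the reduced four-genre vectors are assumptions for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# reduced genre order for this example: Comedy, Crime, Drama, Horror
movie_100 = np.array([1, 1, 0, 0])      # Comedy, Crime
movie_100272 = np.array([1, 0, 1, 1])   # Comedy, Drama, Horror
movie_100017 = np.array([0, 0, 1, 0])   # Drama only

sim_shared = cosine(movie_100, movie_100272)
sim_disjoint = cosine(movie_100, movie_100017)

print(round(sim_shared, 6))    # 0.408248
print(round(sim_disjoint, 6))  # 0.0
```

Because the vectors only encode genres, two movies with identical genre sets get a similarity of exactly 1.0 regardless of any other property.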


In [24]:
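# pick one movie watched by the current user and look up its id
# (the variable first holds the title and is then overwritten with the id)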
movie_watched_by_curr_user = '2001: A Space Odyssey'
movie_watched_by_curr_user = df_movies[
    df_movies['title'] == movie_watched_by_curr_user
]['id'].values[0]

In [25]:
# get similarity vector for movie watched by curr user
df_sim_with_curr_movie = df_cosine_matrix[movie_watched_by_curr_user].reset_index().rename(
    columns={'index': 'id', movie_watched_by_curr_user: 'cosine_sim'}
)

df_sim_with_curr_movie = pd.merge(
    left=df_sim_with_curr_movie,
    right=df_movies[['id', 'title']],
    how='left',
    on='id'
)

# exclude watched movies from recommendation and show top n recommendations
n_recommendation = 5

df_sim_with_curr_movie[
    ~df_sim_with_curr_movie['id'].isin(movies_watched_by_curr_user)
].sort_values(by='cosine_sim', ascending=False).iloc[:n_recommendation]

,id,cosine_sim,title
230,152,1.0,Star Trek: The Motion Picture
2274,6974,0.816497,The Angry Red Planet
902,26581,0.816497,Dr. Who and the Daleks
332,168,0.816497,Star Trek IV: The Voyage Home
1845,5137,0.774597,Sky Captain and the World of Tomorrow
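
The lookup, filtering and ranking steps above could be wrapped in a small reusable function. The following is a minimal sketch on a made-up similarity matrix; `recommend_similar` and its parameters are hypothetical names, and the toy scores are not taken from the dataset.

```python
import pandas as pd

def recommend_similar(sim_matrix, seed_id, watched_ids, n=5):
    """Return the n unwatched movie ids most similar to seed_id."""
    # similarity of every movie to the seed movie
    scores = sim_matrix[seed_id]
    # drop the seed itself and everything the user has already watched
    scores = scores.drop(labels=[seed_id, *watched_ids], errors='ignore')
    return scores.sort_values(ascending=False).head(n)

ids = ['a', 'b', 'c', 'd']
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.5],
     [0.9, 1.0, 0.1, 0.4],
     [0.2, 0.1, 1.0, 0.3],
     [0.5, 0.4, 0.3, 1.0]],
    index=ids, columns=ids,
)

top = recommend_similar(toy_sim, seed_id='a', watched_ids={'b'}, n=2)
print(top)
```

Here movie `b` is the closest match to `a` but is skipped because the user has already watched it.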


## Conclusion
Based on genre similarity to *2001: A Space Odyssey*, one of the movies watched by user 15, the top five recommendations for this user are:
- Star Trek: The Motion Picture
- The Angry Red Planet
- Dr. Who and the Daleks
- Star Trek IV: The Voyage Home
- Sky Captain and the World of Tomorrow