# Building A Movie Recommendation System

The goal of this project is to build a movie recommendation system using the latest data downloaded from [GroupLens](https://grouplens.org/datasets/movielens/latest/) and metadata fetched from [The Movie Database](https://www.themoviedb.org/) API. The GroupLens data was last updated on September 26, 2018. The dataset covers 283,228 users between January 9, 1995 and September 26, 2018, and contains 27,753,444 ratings and 1,108,997 tag applications across 58,098 movies.

## Part I: Create a movies dataset

Compilation of movie ratings and metadata from GroupLens and The Movie Database

In [1]:
import pandas as pd
import requests
import sys

In [2]:
# loading the API key for TMDB
credentials = pd.read_csv("api_keys.csv")
key = credentials[credentials['API'] == 'tmdb']['KEY'].values[0]

In [3]:
# reading the links dataset downloaded from GroupLens which contains the tmdb movie ids
links = pd.read_csv("data/ml-latest-small/links.csv")
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [4]:
links.shape

(9742, 3)

In [5]:
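# grabbing the tmdb movie ids and putting them in a list
# (tmdbId is read as float, so missing ids are dropped and the rest cast to int)
movie_ids = links.tmdbId.dropna().astype(int).tolist()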
test_ids = movie_ids[:100]

In [6]:
# creating a helper function to grab each movie in the GroupLens dataset
def get_movies_data():
    movies = []
    
    for movie_id in test_ids:
        url = "https://api.themoviedb.org/3/movie/{movie_id}?api_key={key}".format(movie_id=movie_id,key=key)
        r = requests.get(url) # requesting the record from the TMDB API
        if r.status_code == 200: # making sure that the response is valid
            record = r.json() # parsing the response already received instead of requesting it again
            movies.append(record) # appending the record to a list called movies
    
    df = pd.DataFrame(movies)
    return df

In [7]:
df1 = get_movies_data()
df1.head()

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,/3Rfvhy1Nl6sSGJwyjb0QiZzZYlB.jpg,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 12, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,...,1995-10-30,373554033,81,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Toy Story,False,8.0,16201
1,False,/pYw10zrqfkdm3yD9JTO6vEGQhKy.jpg,"{'id': 495527, 'name': 'Jumanji Collection', '...",65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",http://www.sonypictures.com/movies/jumanji/,8844,tt0113497,en,Jumanji,...,1995-12-15,262821940,104,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Roll the dice and unleash the excitement!,Jumanji,False,7.235,9364
2,False,/1J4Z7VhdAgtdd97nCxY7dcBpjGT.jpg,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",25000000,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,...,1995-12-22,71500000,101,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.44,315
3,False,/jZjoEKXMTDoZAGdkjhAdJaKtXSN.jpg,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,...,1995-12-22,81452156,127,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.3,128
4,False,/lEsjVrGU21BeJjF5AF9EWsihDpw.jpg,"{'id': 96871, 'name': 'Father of the Bride (St...",0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10751, '...",,11862,tt0113041,en,Father of the Bride Part II,...,1995-12-08,76594107,106,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,6.228,620


In [8]:
df1.columns

Index(['adult', 'backdrop_path', 'belongs_to_collection', 'budget', 'genres',
       'homepage', 'id', 'imdb_id', 'original_language', 'original_title',
       'overview', 'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [9]:
df1.drop(['backdrop_path','budget','imdb_id',
         'homepage','original_language', 
         'original_title','poster_path',
         'production_countries','revenue',
         'status','video'], axis=1, inplace=True)

df1.rename({'belongs_to_collection':'collection'}, 
           axis=1, inplace=True)
df1.head()

Unnamed: 0,adult,collection,genres,id,overview,popularity,production_companies,release_date,runtime,spoken_languages,tagline,title,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...","[{'id': 16, 'name': 'Animation'}, {'id': 12, '...",862,"Led by Woody, Andy's toys live happily in his ...",113.49,"[{'id': 3, 'logo_path': '/1TjvGVDMYsj6JBxOAkUH...",1995-10-30,81,"[{'english_name': 'English', 'iso_639_1': 'en'...",,Toy Story,8.0,16201
1,False,"{'id': 495527, 'name': 'Jumanji Collection', '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,When siblings Judy and Peter discover an encha...,16.479,"[{'id': 559, 'logo_path': '/jqWioYeGSyTLuHth01...",1995-12-15,104,"[{'english_name': 'English', 'iso_639_1': 'en'...",Roll the dice and unleash the excitement!,Jumanji,7.235,9364
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,A family wedding reignites the ancient feud be...,9.941,"[{'id': 174, 'logo_path': '/IuAlhI9eVC9Z8UQWOI...",1995-12-22,101,"[{'english_name': 'English', 'iso_639_1': 'en'...",Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.44,315
3,False,,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,"Cheated on, mistreated and stepped on, the wom...",8.443,"[{'id': 25, 'logo_path': '/qZCc1lty5FzX30aOCVR...",1995-12-22,127,"[{'english_name': 'English', 'iso_639_1': 'en'...",Friends are the people who let you be yourself...,Waiting to Exhale,6.3,128
4,False,"{'id': 96871, 'name': 'Father of the Bride (St...","[{'id': 35, 'name': 'Comedy'}, {'id': 10751, '...",11862,Just when George Banks has recovered from his ...,11.822,"[{'id': 5842, 'logo_path': None, 'name': 'Sand...",1995-12-08,106,"[{'english_name': 'English', 'iso_639_1': 'en'...",Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,6.228,620


In [10]:
# creating a helper function to grab each movie's credits in the GroupLens dataset
def get_credits_data():
    credits = []
    
    for movie_id in test_ids:
        url = "https://api.themoviedb.org/3/movie/{id}/credits?api_key={key}".format(id=movie_id,key=key)
        r = requests.get(url) # requesting the record from the TMDB API
        if r.status_code == 200: # making sure that the response is valid
            record = r.json() # parsing the response already received instead of requesting it again
            credits.append(record) # appending the record to a list called credits
    
    df = pd.DataFrame(credits)
    return df

In [11]:
# getting the credits data
df2 = get_credits_data()
df2.head()

Unnamed: 0,id,cast,crew
0,862,"[{'adult': False, 'gender': 2, 'id': 31, 'know...","[{'adult': False, 'gender': 2, 'id': 7, 'known..."
1,8844,"[{'adult': False, 'gender': 2, 'id': 2157, 'kn...","[{'adult': False, 'gender': 2, 'id': 511, 'kno..."
2,15602,"[{'adult': False, 'gender': 2, 'id': 6837, 'kn...","[{'adult': False, 'gender': 2, 'id': 3117, 'kn..."
3,31357,"[{'adult': False, 'gender': 1, 'id': 8851, 'kn...","[{'adult': False, 'gender': 2, 'id': 2178, 'kn..."
4,11862,"[{'adult': False, 'gender': 2, 'id': 67773, 'k...","[{'adult': False, 'gender': 2, 'id': 37, 'know..."


In [12]:
# creating a helper function to grab each movie's keywords in the GroupLens dataset
def get_keywords_data():
    keywords = []
    
    for movie_id in test_ids:
        url = "https://api.themoviedb.org/3/movie/{id}/keywords?api_key={key}".format(id=movie_id,key=key)
        r = requests.get(url) # requesting the record from the TMDB API
        if r.status_code == 200: # making sure that the response is valid
            record = r.json() # parsing the response already received instead of requesting it again
            keywords.append(record) # appending the record to a list called keywords
    
    df = pd.DataFrame(keywords)
    return df

In [13]:
# getting keywords data
df3 = get_keywords_data()
df3.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 779, 'name': 'martial arts'}, {'id': 9..."
1,8844,"[{'id': 7035, 'name': 'giant insect'}, {'id': ..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 3335,..."
3,31357,"[{'id': 818, 'name': 'based on novel or book'}..."
4,11862,"[{'id': 970, 'name': 'parent child relationshi..."
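
The three helpers above differ only in the endpoint they call, so they could be folded into one function. The following is a minimal sketch under stated assumptions, not the notebook's actual code: `fetch_records`, its `endpoint` and `get` parameters, and the 10-second timeout are hypothetical names and choices introduced here. The `get` parameter exists only so the function can be exercised with a stub instead of a live API call.

```python
def fetch_records(movie_ids, endpoint="", api_key="YOUR_KEY", get=None):
    """Collect the JSON record returned for each movie id.

    endpoint is "" for movie details, "/credits" or "/keywords" otherwise.
    get is any callable with the same interface as requests.get
    (hypothetical hook, used here to allow testing without network access).
    """
    if get is None:
        # imported lazily so the function can also run with a stub getter
        import requests
        get = requests.get

    records = []
    for movie_id in movie_ids:
        # int() turns ids such as 862.0 into 862 before building the URL
        url = "https://api.themoviedb.org/3/movie/{}{}".format(int(movie_id), endpoint)
        r = get(url, params={"api_key": api_key}, timeout=10)
        if r.status_code == 200:  # keep only valid responses
            records.append(r.json())
    return records
```

With this in place, `pd.DataFrame(fetch_records(test_ids, "/credits", key))` would play the role of `get_credits_data()`, assuming `test_ids` and `key` are defined as in the cells above.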


In [14]:
# making sure the ids are the same across the datasets

((df1.id == df2.id) & (df1.id == df3.id)).all()

True

In [15]:
# merge the datasets
df = df1.merge(df2, on='id').merge(df3, on='id')
df.head()

Unnamed: 0,adult,collection,genres,id,overview,popularity,production_companies,release_date,runtime,spoken_languages,tagline,title,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...","[{'id': 16, 'name': 'Animation'}, {'id': 12, '...",862,"Led by Woody, Andy's toys live happily in his ...",113.49,"[{'id': 3, 'logo_path': '/1TjvGVDMYsj6JBxOAkUH...",1995-10-30,81,"[{'english_name': 'English', 'iso_639_1': 'en'...",,Toy Story,8.0,16201,"[{'adult': False, 'gender': 2, 'id': 31, 'know...","[{'adult': False, 'gender': 2, 'id': 7, 'known...","[{'id': 779, 'name': 'martial arts'}, {'id': 9..."
1,False,"{'id': 495527, 'name': 'Jumanji Collection', '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,When siblings Judy and Peter discover an encha...,16.479,"[{'id': 559, 'logo_path': '/jqWioYeGSyTLuHth01...",1995-12-15,104,"[{'english_name': 'English', 'iso_639_1': 'en'...",Roll the dice and unleash the excitement!,Jumanji,7.235,9364,"[{'adult': False, 'gender': 2, 'id': 2157, 'kn...","[{'adult': False, 'gender': 2, 'id': 511, 'kno...","[{'id': 7035, 'name': 'giant insect'}, {'id': ..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,A family wedding reignites the ancient feud be...,9.941,"[{'id': 174, 'logo_path': '/IuAlhI9eVC9Z8UQWOI...",1995-12-22,101,"[{'english_name': 'English', 'iso_639_1': 'en'...",Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.44,315,"[{'adult': False, 'gender': 2, 'id': 6837, 'kn...","[{'adult': False, 'gender': 2, 'id': 3117, 'kn...","[{'id': 1495, 'name': 'fishing'}, {'id': 3335,..."
3,False,,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,"Cheated on, mistreated and stepped on, the wom...",8.443,"[{'id': 25, 'logo_path': '/qZCc1lty5FzX30aOCVR...",1995-12-22,127,"[{'english_name': 'English', 'iso_639_1': 'en'...",Friends are the people who let you be yourself...,Waiting to Exhale,6.3,128,"[{'adult': False, 'gender': 1, 'id': 8851, 'kn...","[{'adult': False, 'gender': 2, 'id': 2178, 'kn...","[{'id': 818, 'name': 'based on novel or book'}..."
4,False,"{'id': 96871, 'name': 'Father of the Bride (St...","[{'id': 35, 'name': 'Comedy'}, {'id': 10751, '...",11862,Just when George Banks has recovered from his ...,11.822,"[{'id': 5842, 'logo_path': None, 'name': 'Sand...",1995-12-08,106,"[{'english_name': 'English', 'iso_639_1': 'en'...",Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,6.228,620,"[{'adult': False, 'gender': 2, 'id': 67773, 'k...","[{'adult': False, 'gender': 2, 'id': 37, 'know...","[{'id': 970, 'name': 'parent child relationshi..."


In [16]:
# saving the raw data
#df1.to_csv("movies_data.csv", index=False)
#df2.to_csv("credits_data.csv", index=False)
#df3.to_csv("keywords_data.csv", index=False)
#df.to_csv("fulldataset.csv", index=False)

In [17]:
df.to_csv("test_data.csv", index=False)

## Part II: Dataset cleaning & feature engineering

In [18]:
import pandas as pd
import requests
import sys
import numpy as np
from ast import literal_eval

In [19]:
df = pd.read_csv("test_data.csv")
df.head()

Unnamed: 0,adult,collection,genres,id,overview,popularity,production_companies,release_date,runtime,spoken_languages,tagline,title,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...","[{'id': 16, 'name': 'Animation'}, {'id': 12, '...",862,"Led by Woody, Andy's toys live happily in his ...",113.49,"[{'id': 3, 'logo_path': '/1TjvGVDMYsj6JBxOAkUH...",1995-10-30,81,"[{'english_name': 'English', 'iso_639_1': 'en'...",,Toy Story,8.0,16201,"[{'adult': False, 'gender': 2, 'id': 31, 'know...","[{'adult': False, 'gender': 2, 'id': 7, 'known...","[{'id': 779, 'name': 'martial arts'}, {'id': 9..."
1,False,"{'id': 495527, 'name': 'Jumanji Collection', '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,When siblings Judy and Peter discover an encha...,16.479,"[{'id': 559, 'logo_path': '/jqWioYeGSyTLuHth01...",1995-12-15,104,"[{'english_name': 'English', 'iso_639_1': 'en'...",Roll the dice and unleash the excitement!,Jumanji,7.235,9364,"[{'adult': False, 'gender': 2, 'id': 2157, 'kn...","[{'adult': False, 'gender': 2, 'id': 511, 'kno...","[{'id': 7035, 'name': 'giant insect'}, {'id': ..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,A family wedding reignites the ancient feud be...,9.941,"[{'id': 174, 'logo_path': '/IuAlhI9eVC9Z8UQWOI...",1995-12-22,101,"[{'english_name': 'English', 'iso_639_1': 'en'...",Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.44,315,"[{'adult': False, 'gender': 2, 'id': 6837, 'kn...","[{'adult': False, 'gender': 2, 'id': 3117, 'kn...","[{'id': 1495, 'name': 'fishing'}, {'id': 3335,..."
3,False,,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,"Cheated on, mistreated and stepped on, the wom...",8.443,"[{'id': 25, 'logo_path': '/qZCc1lty5FzX30aOCVR...",1995-12-22,127,"[{'english_name': 'English', 'iso_639_1': 'en'...",Friends are the people who let you be yourself...,Waiting to Exhale,6.3,128,"[{'adult': False, 'gender': 1, 'id': 8851, 'kn...","[{'adult': False, 'gender': 2, 'id': 2178, 'kn...","[{'id': 818, 'name': 'based on novel or book'}..."
4,False,"{'id': 96871, 'name': 'Father of the Bride (St...","[{'id': 35, 'name': 'Comedy'}, {'id': 10751, '...",11862,Just when George Banks has recovered from his ...,11.822,"[{'id': 5842, 'logo_path': None, 'name': 'Sand...",1995-12-08,106,"[{'english_name': 'English', 'iso_639_1': 'en'...",Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,6.228,620,"[{'adult': False, 'gender': 2, 'id': 67773, 'k...","[{'adult': False, 'gender': 2, 'id': 37, 'know...","[{'id': 970, 'name': 'parent child relationshi..."


In [20]:
df.columns

Index(['adult', 'collection', 'genres', 'id', 'overview', 'popularity',
       'production_companies', 'release_date', 'runtime', 'spoken_languages',
       'tagline', 'title', 'vote_average', 'vote_count', 'cast', 'crew',
       'keywords'],
      dtype='object')

In [21]:
# checking the data type of the adult column
df.adult.dtypes

dtype('bool')

In [22]:
# cleaning the collection column
df['collection'] = df['collection'].fillna('{}').apply(literal_eval).apply(lambda x: x.get('name'))
df.collection[:3]

0         Toy Story Collection
1           Jumanji Collection
2    Grumpy Old Men Collection
Name: collection, dtype: object

In [23]:
# cleaning the genres column
df['genres'] = df['genres'].fillna('[]').apply(literal_eval).apply(
                lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df.genres[:3]

0    [Animation, Adventure, Family, Comedy]
1              [Adventure, Fantasy, Family]
2                         [Romance, Comedy]
Name: genres, dtype: object
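
The same cleaning pattern is reused for several columns, so it may help to see it on a single value. After the round trip through CSV, each cell is a string that merely looks like a list of dictionaries; `literal_eval` turns it back into Python objects (it only evaluates literals, so it is safer than `eval`), and a list comprehension keeps the names. The `raw` string below is a shortened, illustrative value.

```python
from ast import literal_eval

# a cell as it is read back from the CSV file: a string, not a list
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'}]"

parsed = literal_eval(raw)                 # str -> list of dicts
names = [item['name'] for item in parsed]  # keep only the genre names

print(names)  # ['Animation', 'Adventure']
```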

In [24]:
# casting the id column to integers
df['id'] = df['id'].astype('int')

In [25]:
# inspecting the production companies column
df.production_companies[0]

"[{'id': 3, 'logo_path': '/1TjvGVDMYsj6JBxOAkUHpPEwLf7.png', 'name': 'Pixar', 'origin_country': 'US'}]"

In [26]:
# cleaning the production companies column
df['production_companies'] = df['production_companies'].fillna('[]').apply(literal_eval).apply(
                lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df.production_companies[:3]

0                                              [Pixar]
1    [TriStar Pictures, Teitler Film, Interscope Co...
2              [Warner Bros. Pictures, Lancaster Gate]
Name: production_companies, dtype: object

In [27]:
# cleaning the release date column and engineering the year column
df['year'] = pd.to_datetime(df['release_date'], errors='coerce').apply(
                lambda x: str(x.year) if pd.notnull(x) else np.nan)
df.drop(['release_date'], axis=1, inplace=True)
df.year[:3]

0    1995
1    1995
2    1995
Name: year, dtype: object
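
When deriving the year, missing dates need an explicit null check. Comparing a value against `np.nan` with `!=` does not detect them, because NaN is unequal to every value, itself included. A minimal illustration with plain Python floats (`np.nan` is an ordinary float NaN):

```python
import math

nan = float('nan')

print(nan != nan)       # True: the comparison cannot single out NaN
print(nan == nan)       # False
print(math.isnan(nan))  # True: an explicit check is needed
```

In pandas, `pd.isnull` and `pd.notnull` are the corresponding checks, and they also recognise `NaT`, the missing value that `pd.to_datetime(..., errors='coerce')` produces.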

In [28]:
# casting the runtime column to integers
df['runtime'] = df['runtime'].astype('int')

In [29]:
# inspecting the spoken languages column
df.spoken_languages[0]

"[{'english_name': 'English', 'iso_639_1': 'en', 'name': 'English'}]"

In [30]:
# cleaning the spoken languages column
df['spoken_languages'] = df['spoken_languages'].fillna('[]').apply(literal_eval).apply(
                lambda x: [i['english_name'] for i in x] if isinstance(x, list) else [])
df.spoken_languages[:3]

0            [English]
1    [English, French]
2            [English]
Name: spoken_languages, dtype: object

In [31]:
# casting the number of votes and the average vote to integers
# (note that this truncates the average vote, e.g. 7.235 becomes 7)
df['vote_count'] = df['vote_count'].astype('int')
df['vote_average'] = df['vote_average'].astype('int')

In [32]:
# parsing each cast record from a string into a list
df['cast'] = df['cast'].apply(literal_eval)

In [33]:
# inspecting the cast column
df.cast[0]

[{'adult': False,
  'gender': 2,
  'id': 31,
  'known_for_department': 'Acting',
  'name': 'Tom Hanks',
  'original_name': 'Tom Hanks',
  'popularity': 80.405,
  'profile_path': '/xndWFsBlClOJFRdhSt4NBwiPq2o.jpg',
  'cast_id': 14,
  'character': 'Woody (voice)',
  'credit_id': '52fe4284c3a36847f8024f95',
  'order': 0},
 {'adult': False,
  'gender': 2,
  'id': 12898,
  'known_for_department': 'Acting',
  'name': 'Tim Allen',
  'original_name': 'Tim Allen',
  'popularity': 24.816,
  'profile_path': '/woWhZzFILVhYMAvsPL171HjMY0y.jpg',
  'cast_id': 15,
  'character': 'Buzz Lightyear (voice)',
  'credit_id': '52fe4284c3a36847f8024f99',
  'order': 1},
 {'adult': False,
  'gender': 2,
  'id': 7167,
  'known_for_department': 'Acting',
  'name': 'Don Rickles',
  'original_name': 'Don Rickles',
  'popularity': 7.328,
  'profile_path': '/iJLQV4dcbTUgxlWJakjDldzlMXS.jpg',
  'cast_id': 16,
  'character': 'Mr. Potato Head (voice)',
  'credit_id': '52fe4284c3a36847f8024f9d',
  'order': 2},
 {'adult':

In [34]:
# cleaning the cast column: keeping the names of the first five cast members
df['cast'] = df['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df['cast'] = df['cast'].apply(lambda x: x[:5])
df.cast[:3]

0    [Tom Hanks, Tim Allen, Don Rickles, Jim Varney...
1    [Robin Williams, Kirsten Dunst, Bradley Pierce...
2    [Walter Matthau, Jack Lemmon, Ann-Margret, Sop...
Name: cast, dtype: object

In [35]:
# parsing each crew record from a string into a list
df['crew'] = df['crew'].apply(literal_eval)

In [36]:
# inspecting the crew column
df.crew[0]

[{'adult': False,
  'gender': 2,
  'id': 7,
  'known_for_department': 'Writing',
  'name': 'Andrew Stanton',
  'original_name': 'Andrew Stanton',
  'popularity': 5.321,
  'profile_path': '/gasNitCwepbqNcYBmDHpsCgZH0I.jpg',
  'credit_id': '52fe4284c3a36847f8024f55',
  'department': 'Writing',
  'job': 'Screenplay'},
 {'adult': False,
  'gender': 2,
  'id': 7,
  'known_for_department': 'Writing',
  'name': 'Andrew Stanton',
  'original_name': 'Andrew Stanton',
  'popularity': 5.321,
  'profile_path': '/gasNitCwepbqNcYBmDHpsCgZH0I.jpg',
  'credit_id': '5891edf9c3a36809700075e6',
  'department': 'Writing',
  'job': 'Original Story'},
 {'adult': False,
  'gender': 2,
  'id': 7,
  'known_for_department': 'Writing',
  'name': 'Andrew Stanton',
  'original_name': 'Andrew Stanton',
  'popularity': 5.321,
  'profile_path': '/gasNitCwepbqNcYBmDHpsCgZH0I.jpg',
  'credit_id': '5892132b9251412dc80097b1',
  'department': 'Visual Effects',
  'job': 'Character Designer'},
 {'adult': False,
  'gender': 

In [37]:
# helper function to get the names of the directors and producers
def crew_names(x):
    director = []
    producer = []
    
    for i in x:
        if i['job'] == 'Director':
            director.append(i['name'])
        elif i['job'] == 'Producer':
            producer.append(i['name'])
    return [director, producer]
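
To make the return value concrete, here is the helper applied to a small hand-made crew list. The records are invented for illustration and carry only the two keys the helper reads; the function body is repeated so the example runs on its own.

```python
def crew_names(x):
    # same logic as the helper above, repeated so this example is self-contained
    director, producer = [], []
    for i in x:
        if i['job'] == 'Director':
            director.append(i['name'])
        elif i['job'] == 'Producer':
            producer.append(i['name'])
    return [director, producer]

# made-up crew records (not taken from the dataset)
sample_crew = [
    {'name': 'Person A', 'job': 'Director'},
    {'name': 'Person B', 'job': 'Screenplay'},
    {'name': 'Person C', 'job': 'Producer'},
    {'name': 'Person D', 'job': 'Producer'},
]

directors, producers = crew_names(sample_crew)
print(directors)  # ['Person A']
print(producers)  # ['Person C', 'Person D']
```

Crew members with any other job, such as the screenwriter here, are ignored.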

In [38]:
# cleaning the crew column
crew = df['crew'].apply(crew_names) # one pass over the crew column instead of two
df['director'] = crew.apply(lambda x: x[0])
df['producer'] = crew.apply(lambda x: x[1])
df.drop(['crew'], axis=1,inplace=True)
df.head()

Unnamed: 0,adult,collection,genres,id,overview,popularity,production_companies,runtime,spoken_languages,tagline,title,vote_average,vote_count,cast,keywords,year,director,producer
0,False,Toy Story Collection,"[Animation, Adventure, Family, Comedy]",862,"Led by Woody, Andy's toys live happily in his ...",113.49,[Pixar],81,[English],,Toy Story,8,16201,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[{'id': 779, 'name': 'martial arts'}, {'id': 9...",1995,[John Lasseter],"[Bonnie Arnold, Ralph Guggenheim]"
1,False,Jumanji Collection,"[Adventure, Fantasy, Family]",8844,When siblings Judy and Peter discover an encha...,16.479,"[TriStar Pictures, Teitler Film, Interscope Co...",104,"[English, French]",Roll the dice and unleash the excitement!,Jumanji,7,9364,"[Robin Williams, Kirsten Dunst, Bradley Pierce...","[{'id': 7035, 'name': 'giant insect'}, {'id': ...",1995,[Joe Johnston],"[Scott Kroopf, William Teitler]"
2,False,Grumpy Old Men Collection,"[Romance, Comedy]",15602,A family wedding reignites the ancient feud be...,9.941,"[Warner Bros. Pictures, Lancaster Gate]",101,[English],Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6,315,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[{'id': 1495, 'name': 'fishing'}, {'id': 3335,...",1995,[Howard Deutch],"[John Davis, Richard C. Berman]"
3,False,,"[Comedy, Drama, Romance]",31357,"Cheated on, mistreated and stepped on, the wom...",8.443,[20th Century Fox],127,[English],Friends are the people who let you be yourself...,Waiting to Exhale,6,128,"[Whitney Houston, Angela Bassett, Loretta Devi...","[{'id': 818, 'name': 'based on novel or book'}...",1995,[Forest Whitaker],"[Ronald Bass, Ezra Swerdlow, Deborah Schindler..."
4,False,Father of the Bride (Steve Martin) Collection,"[Comedy, Family]",11862,Just when George Banks has recovered from his ...,11.822,"[Sandollar Productions, Touchstone Pictures]",106,[English],Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,6,620,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[{'id': 970, 'name': 'parent child relationshi...",1995,[Charles Shyer],[Nancy Meyers]


In [39]:
# parsing each keywords record from a string into a list
df['keywords'] = df['keywords'].apply(literal_eval)

In [40]:
# inspecting the keywords column
df.keywords[0]

[{'id': 779, 'name': 'martial arts'},
 {'id': 931, 'name': 'jealousy'},
 {'id': 6054, 'name': 'friendship'},
 {'id': 6733, 'name': 'bullying'},
 {'id': 8102, 'name': 'elementary school'},
 {'id': 9713, 'name': 'friends'},
 {'id': 9823, 'name': 'rivalry'},
 {'id': 10084, 'name': 'rescue'},
 {'id': 10364, 'name': 'mission'},
 {'id': 15214, 'name': 'buddy'},
 {'id': 33553, 'name': 'walkie talkie'},
 {'id': 158141, 'name': 'toy car'},
 {'id': 165503, 'name': 'boy next door'},
 {'id': 170722, 'name': 'new toy'},
 {'id': 180523, 'name': 'neighborhood'},
 {'id': 187065, 'name': 'toy comes to life'},
 {'id': 242792, 'name': 'resourcefulness'}]

In [41]:
# cleaning the keywords column
df['keywords'] = df['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df.head()

Unnamed: 0,adult,collection,genres,id,overview,popularity,production_companies,runtime,spoken_languages,tagline,title,vote_average,vote_count,cast,keywords,year,director,producer
0,False,Toy Story Collection,"[Animation, Adventure, Family, Comedy]",862,"Led by Woody, Andy's toys live happily in his ...",113.49,[Pixar],81,[English],,Toy Story,8,16201,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[martial arts, jealousy, friendship, bullying,...",1995,[John Lasseter],"[Bonnie Arnold, Ralph Guggenheim]"
1,False,Jumanji Collection,"[Adventure, Fantasy, Family]",8844,When siblings Judy and Peter discover an encha...,16.479,"[TriStar Pictures, Teitler Film, Interscope Co...",104,"[English, French]",Roll the dice and unleash the excitement!,Jumanji,7,9364,"[Robin Williams, Kirsten Dunst, Bradley Pierce...","[giant insect, board game, jungle, disappearan...",1995,[Joe Johnston],"[Scott Kroopf, William Teitler]"
2,False,Grumpy Old Men Collection,"[Romance, Comedy]",15602,A family wedding reignites the ancient feud be...,9.941,"[Warner Bros. Pictures, Lancaster Gate]",101,[English],Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6,315,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[fishing, halloween, sequel, old man, best fri...",1995,[Howard Deutch],"[John Davis, Richard C. Berman]"
3,False,,"[Comedy, Drama, Romance]",31357,"Cheated on, mistreated and stepped on, the wom...",8.443,[20th Century Fox],127,[English],Friends are the people who let you be yourself...,Waiting to Exhale,6,128,"[Whitney Houston, Angela Bassett, Loretta Devi...","[based on novel or book, interracial relations...",1995,[Forest Whitaker],"[Ronald Bass, Ezra Swerdlow, Deborah Schindler..."
4,False,Father of the Bride (Steve Martin) Collection,"[Comedy, Family]",11862,Just when George Banks has recovered from his ...,11.822,"[Sandollar Productions, Touchstone Pictures]",106,[English],Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,6,620,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[parent child relationship, baby, midlife cris...",1995,[Charles Shyer],[Nancy Meyers]


In [42]:
df.columns

Index(['adult', 'collection', 'genres', 'id', 'overview', 'popularity',
       'production_companies', 'runtime', 'spoken_languages', 'tagline',
       'title', 'vote_average', 'vote_count', 'cast', 'keywords', 'year',
       'director', 'producer'],
      dtype='object')

In [43]:
# reorganizing the dataset to be more intuitive
df = df[['id','year','title','runtime','collection','genres','tagline','overview',
         'cast','director','producer','keywords','adult','production_companies',
         'spoken_languages','popularity','vote_count','vote_average']].copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    100 non-null    int64  
 1   year                  100 non-null    object 
 2   title                 100 non-null    object 
 3   runtime               100 non-null    int64  
 4   collection            16 non-null     object 
 5   genres                100 non-null    object 
 6   tagline               89 non-null     object 
 7   overview              100 non-null    object 
 8   cast                  100 non-null    object 
 9   director              100 non-null    object 
 10  producer              100 non-null    object 
 11  keywords              100 non-null    object 
 12  adult                 100 non-null    bool   
 13  production_companies  100 non-null    object 
 14  spoken_languages      100 non-null    object 
 15  popularity            10

In [44]:
# saving the dataset
df.to_csv("clean_data.csv", index=False)

## Part III: Recommendation systems

In [45]:
import pandas as pd
import numpy as np
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from surprise import Reader, Dataset, SVD, accuracy
from surprise.model_selection import cross_validate, KFold

import warnings; warnings.simplefilter('ignore')

In [46]:
df = pd.read_csv("clean_data.csv")
df.head()

Unnamed: 0,id,year,title,runtime,collection,genres,tagline,overview,cast,director,producer,keywords,adult,production_companies,spoken_languages,popularity,vote_count,vote_average
0,862,1995,Toy Story,81,Toy Story Collection,"['Animation', 'Adventure', 'Family', 'Comedy']",,"Led by Woody, Andy's toys live happily in his ...","['Tom Hanks', 'Tim Allen', 'Don Rickles', 'Jim...",['John Lasseter'],"['Bonnie Arnold', 'Ralph Guggenheim']","['martial arts', 'jealousy', 'friendship', 'bu...",False,['Pixar'],['English'],113.49,16201,8
1,8844,1995,Jumanji,104,Jumanji Collection,"['Adventure', 'Fantasy', 'Family']",Roll the dice and unleash the excitement!,When siblings Judy and Peter discover an encha...,"['Robin Williams', 'Kirsten Dunst', 'Bradley P...",['Joe Johnston'],"['Scott Kroopf', 'William Teitler']","['giant insect', 'board game', 'jungle', 'disa...",False,"['TriStar Pictures', 'Teitler Film', 'Intersco...","['English', 'French']",16.479,9364,7
2,15602,1995,Grumpier Old Men,101,Grumpy Old Men Collection,"['Romance', 'Comedy']",Still Yelling. Still Fighting. Still Ready for...,A family wedding reignites the ancient feud be...,"['Walter Matthau', 'Jack Lemmon', 'Ann-Margret...",['Howard Deutch'],"['John Davis', 'Richard C. Berman']","['fishing', 'halloween', 'sequel', 'old man', ...",False,"['Warner Bros. Pictures', 'Lancaster Gate']",['English'],9.941,315,6
3,31357,1995,Waiting to Exhale,127,,"['Comedy', 'Drama', 'Romance']",Friends are the people who let you be yourself...,"Cheated on, mistreated and stepped on, the wom...","['Whitney Houston', 'Angela Bassett', 'Loretta...",['Forest Whitaker'],"['Ronald Bass', 'Ezra Swerdlow', 'Deborah Schi...","['based on novel or book', 'interracial relati...",False,['20th Century Fox'],['English'],8.443,128,6
4,11862,1995,Father of the Bride Part II,106,Father of the Bride (Steve Martin) Collection,"['Comedy', 'Family']",Just When His World Is Back To Normal... He's ...,Just when George Banks has recovered from his ...,"['Steve Martin', 'Diane Keaton', 'Martin Short...",['Charles Shyer'],['Nancy Meyers'],"['parent child relationship', 'baby', 'midlife...",False,"['Sandollar Productions', 'Touchstone Pictures']",['English'],11.822,620,6


### A simple movie recommender

In their basic form, recommenders offer the highest-rated items regardless of user preferences. However, since movies receive very different numbers of votes, a plain average is unreliable for movies with few votes. Simple recommenders therefore calculate a weighted score for each item, which pulls the average of rarely rated items toward the overall mean. The weighted score of an item can be calculated using the formula:
$$
ws = \left(\frac{v}{v+m} \times R\right) + \left(\frac{m}{v+m} \times C\right)
$$
where: <br>
$ws$ is the weighted score for an item <br>
$v$ is the number of votes an item has <br>
$m$ is the minimum number of votes an item needs in order to be recommended <br>
$R$ is the average score for an item <br>
$C$ is the average score for all items in the system <br>
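
A short numeric sketch shows how the formula behaves at its extremes. The function name `weighted_score` and the values of `m`, `C` and `R` below are illustrative assumptions, not figures from the dataset.

```python
def weighted_score(v, m, R, C):
    """Blend an item's own average R with the global average C.

    The more votes v an item has relative to the threshold m,
    the more weight its own average receives.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# illustrative values only
m, C = 1000, 6.0

print(weighted_score(v=0, m=m, R=8.0, C=C))      # no votes: the score equals C
print(weighted_score(v=1000, m=m, R=8.0, C=C))   # v == m: halfway between R and C
print(weighted_score(v=99000, m=m, R=8.0, C=C))  # many votes: the score approaches R
```

The two weights always sum to 1, so the score is a weighted mean that never leaves the interval between $R$ and $C$.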

In [47]:
def simple_recsys(df=df):
    """
    This algorithm takes in a database of movies 
    and recommends the highest rated ones
    """
    
    # copying the dataframe to maintain its integrity
    df = df.copy()
    
    # calculating the minimum number of votes movies need to be recommended (m)
    vote_counts = df[df['vote_count'].notnull()]['vote_count']
    m = vote_counts.quantile(0.90)
    
    # calculating the average vote across all movies in the dataset (C) 
    vote_averages = df[df['vote_average'].notnull()]['vote_average']
    C = vote_averages.mean()
    
    # keeping only movies with at least m votes and no missing vote data
    recs = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())]
    
    # copying the qualified movies' title, release year, vote count, 
    # average vote, popularity and genres to a new df
    recs = recs[['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']].copy()
    
    # cleaning the genres column: "['Crime', 'Drama']" -> "Crime, Drama"
    recs['genres'] = recs['genres'].apply(lambda x: x.strip("][").replace("'", ''))
    
    # calculating the weighted score
    v = recs['vote_count'].astype('int')
    R = recs['vote_average'].astype('int')
    recs['weighted_score'] = round((v/(v+m) * R) + (m/(m+v) * C), 1)
    
    # sorting by weighted score and returning the top 10
    recs = recs.sort_values('weighted_score', ascending=False)

    return recs.head(10)

In [48]:
simple_recsys()

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,weighted_score
43,Se7en,1995,18146,8,53.38,"Crime, Mystery, Thriller",7.6
0,Toy Story,1995,16201,8,113.49,"Animation, Adventure, Family, Comedy",7.5
98,Taxi Driver,1976,10171,8,47.968,"Crime, Drama",7.4
46,The Usual Suspects,1995,9050,8,26.657,"Drama, Crime, Thriller",7.3
1,Jumanji,1995,9364,7,16.479,"Adventure, Fantasy, Family",6.7
97,Braveheart,1995,8800,7,43.811,"Action, Drama, History, War",6.7
5,Heat,1995,5977,7,71.24,"Action, Crime, Drama, Thriller",6.6
31,Twelve Monkeys,1995,7243,7,55.618,"Science Fiction, Thriller, Mystery",6.6
62,From Dusk Till Dawn,1996,5096,7,28.231,"Horror, Action, Thriller, Crime",6.5
44,Pocahontas,1995,5031,6,41.225,"Adventure, Animation, Family, Romance",6.0


#### Improving the simple recommender by introducing a genres filter

To improve on the simple recommender, I let the user choose the class of items they want to consider. For the movies dataset, the genre serves as that filter: the weighted score is computed only over movies in the selected genre.
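
The filtering step on its own can be sketched in plain Python. The rows below reuse three titles and genre lists from the dataset, stored as strings the way the `genres` column is; the helper name `titles_in_genre` and the case-insensitive comparison are my own assumptions for this sketch, not the approach taken in the notebook's function.

```python
from ast import literal_eval

# Rows shaped like the dataset: each genre list is stored as a string
movies = [
    {"title": "Twelve Monkeys", "genres": "['Science Fiction', 'Thriller', 'Mystery']"},
    {"title": "Se7en", "genres": "['Crime', 'Mystery', 'Thriller']"},
    {"title": "Toy Story", "genres": "['Animation', 'Adventure', 'Family', 'Comedy']"},
]


def titles_in_genre(rows, user_input):
    """Return the titles whose genre list contains the requested genre."""
    # Normalize the user's text once so case and stray spaces do not matter
    wanted = user_input.strip().casefold()
    matches = []
    for row in rows:
        # literal_eval turns the stored string back into a Python list
        genres = [g.casefold() for g in literal_eval(row["genres"])]
        if wanted in genres:
            matches.append(row["title"])
    return matches


print(titles_in_genre(movies, "science fiction"))
print(titles_in_genre(movies, " MYSTERY "))
```

Comparing case-folded strings avoids having to guess the capitalization of multi-word genres such as "Science Fiction".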

In [49]:
def simple_recsys_2(df=df):
    """
    This function takes in a movie db and a user-defined
    movie genre to recommend movies with the same genre 
    based on their weighted score
    """
    
    # copying the dataframe to maintain its integrity
    df = df.copy()
    
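    # user-defined genre; title-casing so multi-word genres
    # such as "science fiction" match the dataset's "Science Fiction"
    print("Select a Genre")
    genre = input().strip().title()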
    
    # exploding the genres column so each movie's genre is in one record
    df.genres = df.genres.apply(literal_eval)
    g = df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
    g.name = 'genre'
    
    # merging the genres to the df
    df = df.drop('genres', axis=1).join(g)
    
    # filtering for movies that are in the selected genre 
    df = df[df['genre'] == genre]
    
    # calculating the minimum number of votes movies need to be recommended (m)
    vote_counts = df[df['vote_count'].notnull()]['vote_count']
    m = vote_counts.quantile(0.90)
    
    # calculating the average vote across the movies in the selected genre (C) 
    vote_averages = df[df['vote_average'].notnull()]['vote_average']
    C = vote_averages.mean()
    
    # filtering out the database against null vote count and average vote values 
    recs = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())]
    
    # copying the qualified movies' title, release year, vote count, 
    # average vote and popularity to a new df
    recs = recs[['title', 'year', 'vote_count', 'vote_average', 'popularity']].copy()
    
    # calculating the weighted score
    v = recs['vote_count'].astype('int')
    R = recs['vote_average'].astype('int')
    recs['weighted_score'] = round((v/(v+m) * R) + (m/(m+v) * C), 1)
    
    # sort by weighted score
    recs = recs.sort_values('weighted_score', ascending=False)

    return recs.head(10)

In [50]:
simple_recsys_2()

Select a Genre
crime


Unnamed: 0,title,year,vote_count,vote_average,popularity,weighted_score
43,Se7en,1995,18146,8,53.38,7.5
98,Taxi Driver,1976,10171,8,47.968,7.2


### Content-based recommender

While simple recommenders work, their biggest pitfall is their inability to account for context. For instance, both Toy Story and Jumanji are adventure movies. However, the audiences for the two movies differ, as one is animated and the other is not. Content-based recommenders compare the attributes of the items themselves to suggest items similar to one the user picks. For this project, I will be building two content-based movie recommenders: one based on the movie's description (tagline and overview) and one based on its details (cast, director, producers, production companies, keywords, language). The cosine similarity score will be the metric used to determine the similarity between movies based on their content. Once each body of text has been converted to a numeric vector, the similarity between two such vectors ($x$, $y$) can be determined using the formula:

$$
\cos(\theta) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
$$
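
As a rough illustration of the formula, the sketch below computes the cosine similarity of two short texts from raw word counts, using only the standard library. The function name `cosine_similarity_text` and the example sentences are made up for this sketch; the recommenders in this notebook rely on scikit-learn's vectorizers and kernels instead.

```python
import math
from collections import Counter


def cosine_similarity_text(a, b):
    """Cosine similarity between two texts represented as word-count vectors."""
    x = Counter(a.lower().split())
    y = Counter(b.lower().split())

    # Dot product: only words present in both texts contribute
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())

    # Euclidean norms of the two count vectors
    norm_x = math.sqrt(sum(c * c for c in x.values()))
    norm_y = math.sqrt(sum(c * c for c in y.values()))

    # An empty text has no direction, so treat its similarity as zero
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


same = cosine_similarity_text("detective hunts a killer", "detective hunts a killer")
close = cosine_similarity_text("detective hunts a killer", "detective hunts a thief")
apart = cosine_similarity_text("detective hunts a killer", "toys come alive")
print(same, close, apart)
```

Identical texts score 1, texts with no words in common score 0, and everything else falls in between.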

#### A description-based recommendation system

In [51]:
def more_like_this(df=df):
    """
    This function takes a movie title in any case and returns
    up to ten similar movies based on the input movie's description
    """
    # copying the dataframe to maintain its integrity
    df = df.copy()
    
    # dropping records with no titles, if any
    no_title = df[df['title'].isnull()]
    df = df.drop(no_title.index)
    
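    # creating the description text for each movie; the space keeps the last
    # word of the overview from fusing with the first word of the tagline
    df['tagline'] = df['tagline'].fillna('')
    df['description'] = df['overview'].fillna('') + ' ' + df['tagline']
    
    # vectorize the description text; min_df=1 keeps every term, and newer
    # scikit-learn releases require an integer min_df to be at least 1
    vect = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')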
    mtrx = vect.fit_transform(df['description'])

    # calculating the cosine similarity using sklearn's linear_kernel
    cosine_sim = linear_kernel(mtrx, mtrx)
    
    # create an indexer using the titles column
    df = df.reset_index()
    df['title_i'] = df['title']
    titles = df['title']
    indices = pd.Series(df.index, index=df['title_i'].str.lower())
    
    # user-defined title
    print("Select a Movie")
    title = input().lower()
    
    # find the input title in the indexer
    idx = indices[title]
    
    # create the similarity score for the selected movie and sort from highest to lowest
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # return the 10 most similar movies in the dataset
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    recs = df.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    
    # calculating the minimum number of votes movies need to be recommended (m)
    vote_counts = df[df['vote_count'].notnull()]['vote_count']
    m = vote_counts.quantile(0.90)
    
    # calculating the average vote across all movies in the dataset (C) 
    vote_averages = df[df['vote_average'].notnull()]['vote_average']
    C = vote_averages.mean()
    
    # filtering out the database against null vote count and average vote values 
    recs = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())]
    
    # copying the qualified movies' title, release year, 
    # vote count, and average vote to a new df
    recs = recs[['title', 'year', 'vote_count', 'vote_average']].copy()
        
    # calculating the weighted score
    v = recs['vote_count'].astype('int')
    R = recs['vote_average'].astype('int')
    recs['weighted_score'] = round((v/(v+m) * R) + (m/(m+v) * C), 1)
    
    # sort by weighted score
    recs = recs.sort_values('weighted_score', ascending=False).head(10)

    return recs

In [52]:
more_like_this()

Select a Movie
se7en


Unnamed: 0,title,year,vote_count,vote_average,weighted_score
43,Se7en,1995,18146,8,7.6
0,Toy Story,1995,16201,8,7.5
98,Taxi Driver,1976,10171,8,7.4
46,The Usual Suspects,1995,9050,8,7.3
1,Jumanji,1995,9364,7,6.7
97,Braveheart,1995,8800,7,6.7
5,Heat,1995,5977,7,6.6
31,Twelve Monkeys,1995,7243,7,6.6
62,From Dusk Till Dawn,1996,5096,7,6.5
44,Pocahontas,1995,5031,6,6.0
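
The step that turns one row of the similarity matrix into a shortlist can be looked at in isolation. The sketch below is a plain-Python variant with a hypothetical helper, `top_k_similar`, and made-up similarity values. Unlike slicing off the first sorted entry, it excludes the query movie by its index, which I assume is the safer choice when another movie ties with the query at a similarity of 1.

```python
def top_k_similar(sim_row, query_idx, k=10):
    """Return the indices of the k most similar items, excluding the query itself."""
    # Pair every item index with its similarity to the query, skipping the query
    scored = [(i, s) for i, s in enumerate(sim_row) if i != query_idx]

    # Highest similarity first; Python's sort is stable, so ties keep index order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:k]]


# Hypothetical similarity row: item 1 is the query, item 3 has an identical description
sim_row = [0.10, 1.00, 0.35, 1.00, 0.00]
shortlist = top_k_similar(sim_row, query_idx=1, k=3)
print(shortlist)
```

Any later scoring or filtering would then operate on this shortlist of indices.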


#### A details-based recommendation system with the weighted score filter

In [53]:
def others_also_like(df=df):
    """
    This function takes a movie title in any case and returns
    up to ten similar movies based on the input movie's details, 
    which include the cast members, director, genres, producers,
    production companies, keywords and spoken languages.
    """
    
    # copying the dataframe to maintain its integrity
    df = df.copy()
    
    # dropping records with no titles, if any
    no_title = df[df['title'].isnull()]
    df = df.drop(no_title.index)
    
    # dealing with the input features
    features = ['cast','genres','producer',
                'production_companies',
                'spoken_languages']

    for feat in features:
        df[feat] = df[feat].apply(literal_eval)
        df[feat] = df[feat].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

    df['director'] = df['director'].apply(lambda x: [x, x])
    
    # exploding the keywords so each keyword is a record
    df['keywords'] = df['keywords'].apply(literal_eval)
    s = df.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
    s.name = 'keyword'
    
    # getting the frequency of each keyword & filtering out unique keywords as they are not useful
    s = s.value_counts()    
    s = s[s > 1]
    
    # helper function to filter keywords by frequency
    def freq_filter(x):
        words = []
        for i in x:
            if i in s:
                words.append(i)
        return words
    
    # loading the stem function
    stem_filter = SnowballStemmer('english')
    
    # apply the keyword filters
    df['keywords'] = df['keywords'].apply(freq_filter)
    df['keywords'] = df['keywords'].apply(lambda x: [stem_filter.stem(i) for i in x])
    df['keywords'] = df['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
    
    # creating the details text
    details = ['cast','director','genres','producer',
               'production_companies','keywords',
               'spoken_languages']

    df['details'] = df[details].apply(lambda x: ' '.join(x.astype(str).str.strip("'[]").str.replace("', '", ", ")), axis=1)
    
    # create an indexer using the titles column
    df = df.reset_index()
    df['title_i'] = df['title']
    titles = df['title']
    indices = pd.Series(df.index, index=df['title_i'].str.lower())
    
    # user-defined title
    print("Select a Movie")
    title = input().lower()
    
    # find the input title in the indexer
    idx = indices[title]
    
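    # vectorize the details text; min_df=1 keeps every term, and newer
    # scikit-learn releases require an integer min_df to be at least 1
    vect = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')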
    mtrx = vect.fit_transform(df['details'])
    
    # calculating the cosine similarity
    cosine_sim = cosine_similarity(mtrx, mtrx)

    # create the similarity score for the selected movie and sort from highest to lowest
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # return the 10 most similar movies in the dataset
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    recs = df.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    
    # calculating the minimum number of votes movies need to be recommended (m)
    vote_counts = df[df['vote_count'].notnull()]['vote_count']
    m = vote_counts.quantile(0.90)
    
    # calculating the average vote across all movies in the dataset (C) 
    vote_averages = df[df['vote_average'].notnull()]['vote_average']
    C = vote_averages.mean()
    
    # filtering out the database against null vote count and average vote values 
    recs = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())]
    
    # copying the qualified movies' title, release year, 
    # vote count, and average vote to a new df
    recs = recs[['title', 'year', 'vote_count', 'vote_average']].copy()
        
    # calculating the weighted score
    v = recs['vote_count'].astype('int')
    R = recs['vote_average'].astype('int')
    recs['weighted_score'] = round((v/(v+m) * R) + (m/(m+v) * C), 1)
    
    # sort by weighted score
    recs = recs.sort_values('weighted_score', ascending=False).head(10)

    return recs

In [54]:
others_also_like()

Select a Movie
se7en


Unnamed: 0,title,year,vote_count,vote_average,weighted_score
43,Se7en,1995,18146,8,7.6
0,Toy Story,1995,16201,8,7.5
98,Taxi Driver,1976,10171,8,7.4
46,The Usual Suspects,1995,9050,8,7.3
1,Jumanji,1995,9364,7,6.7
97,Braveheart,1995,8800,7,6.7
5,Heat,1995,5977,7,6.6
31,Twelve Monkeys,1995,7243,7,6.6
62,From Dusk Till Dawn,1996,5096,7,6.5
44,Pocahontas,1995,5031,6,6.0


### Collaborative Filtering

Despite the improvements brought by the content-based data, the recommendation system still suffers from severe limitations. Specifically, content-based recommenders suggest items that are <i>close</i> to a user-defined item. However, they cannot predict a user's preferences, nor can they provide recommendations across multiple categories. Therefore, I will apply a collaborative filter to the system to improve its predictive power. Collaborative filters leverage the votes and opinions of a community of users to predict how likely a user is to be interested in an item or service. For this project, the collaborative filter will be built on the ratings users assigned over a period of 23 years.

In [55]:
# importing the ratings dataset
ratings = pd.read_csv('data/ml-latest-small/ratings.csv', low_memory=False)
ratings.head(2)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247


In [56]:
# load the reader function
reader = Reader()

# load the ratings dataset in the Dataset function
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

In [57]:
# choose an algorithm to predict the ratings a user might give a movie
svd = SVD()

# evaluating the algorithm's accuracy
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8813  0.8684  0.8711  0.8752  0.8693  0.8731  0.0047  
MAE (testset)     0.6779  0.6653  0.6696  0.6737  0.6647  0.6702  0.0050  
Fit time          0.86    1.04    1.10    1.00    0.90    0.98    0.09    
Test time         0.16    0.16    0.27    0.16    0.23    0.20    0.05    


{'test_rmse': array([0.88134551, 0.86841242, 0.87110409, 0.87521617, 0.86930812]),
 'test_mae': array([0.67788761, 0.66534208, 0.66960489, 0.67366964, 0.66473808]),
 'fit_time': (0.8561291694641113,
  1.0416672229766846,
  1.1020028591156006,
  1.002683162689209,
  0.899803876876831),
 'test_time': (0.15795302391052246,
  0.16362595558166504,
  0.2714359760284424,
  0.15778565406799316,
  0.23114395141601562)}

In [58]:
# defining a cross-validation iterator
kf = KFold(n_splits=5)
for trainset, testset in kf.split(data):

    # train and test algorithm.
    svd.fit(trainset)
    predictions = svd.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

RMSE: 0.8714
RMSE: 0.8771
RMSE: 0.8705
RMSE: 0.8752
RMSE: 0.8674


We can check the accuracy of the filter by comparing its prediction to an actual user rating.

In [59]:
# let's look at movie 302, only a few users have given it a rating
mov_302 = ratings[ratings['movieId'] == 302]
mov_302

Unnamed: 0,userId,movieId,rating,timestamp
693,6,302,3.0,845555436
5061,33,302,3.0,939716327
13140,84,302,3.0,858771796
67509,437,302,3.0,859721556
96175,603,302,4.0,954482471


In [60]:
# calculating the average rating for movie id 302
mov_302['rating'].mean()

3.2

In [61]:
# use the algorithm to predict a value
uid = str(1)   # user id 
iid = str(302)  # movie id 

# get a prediction for specific users and items.
predictor = svd.predict(uid, iid, r_ui=3.71, verbose=True)

user: 1          item: 302        r_ui = 3.71   est = 3.50   {'was_impossible': False}
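
To help read an estimate like the one above, here is a minimal sketch of the prediction rule that, as I understand the Surprise documentation, its biased `SVD` model uses: the global mean rating, plus a user bias, plus an item bias, plus the dot product of the user and item factor vectors, with the terms for an unknown user or item left out. The function `svd_estimate` and all numbers are hypothetical; this is not Surprise's code.

```python
def svd_estimate(mu, b_u=None, b_i=None, p_u=None, q_i=None):
    """Estimate a rating from a global mean, biases and latent factors.

    Passing None for a user's (or item's) bias and factors models an id
    that was not seen during training.
    """
    known_user = b_u is not None and p_u is not None
    known_item = b_i is not None and q_i is not None

    est = mu
    if known_user:
        est += b_u
    if known_item:
        est += b_i
    if known_user and known_item:
        # The interaction term needs both factor vectors
        est += sum(p * q for p, q in zip(p_u, q_i))
    return est


# Hypothetical model parameters, for illustration only
mu = 3.5
full = svd_estimate(mu, b_u=-0.5, b_i=0.25, p_u=[1.0, 2.0], q_i=[0.5, 0.25])
item_only = svd_estimate(mu, b_i=0.25, q_i=[0.5, 0.25])
nothing_known = svd_estimate(mu)
print(full, item_only, nothing_known)
```

One practical consequence, again as far as I can tell from the Surprise FAQ: raw ids are matched exactly as they were loaded, so an id passed in a different type than the DataFrame column holds (for example the string `'1'` against an integer column) would be treated as unknown, and the estimate would fall back toward the global mean.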


In [62]:
def collab_filter(ratings=ratings):
    df = ratings.copy()
    
    # load the reader
    reader = Reader()
    
    # load the ratings dataset in the Dataset function
    data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)
    
    # choose an algorithm to predict the ratings a user might give a movie
    svd = SVD()
    
    # input the user's id, movie id and mean rating of a movie
    print("User Id:")
    uid = input()
    
    print("Movie Id")
    iid = input()
    
    sub_df = df[df['movieId'] == iid]
    r_ui = sub_df['rating'].mean()

    # defining a cross-validation iterator
    kf = KFold(n_splits=5)
    for trainset, testset in kf.split(data):
    
        # train and test algorithm.
        svd.fit(trainset)
        predictions = svd.test(testset)
    
    # get a prediction for specific users and items.
    estimate = round(svd.predict(uid, iid, r_ui, verbose=False).est,1)
    
    print(f"The estimated user rating for this movie is {estimate}.")

In [63]:
collab_filter()

User Id:
1
Movie Id
se7en
The estimated user rating for this movie is 3.5.


Overall, a collaborative filter based on the movie ratings has a mean cross-validated RMSE of $\approx$ __0.87__, indicating that this filter could be beneficial for the recommendation system.
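
For readers unfamiliar with the two error measures, the sketch below computes RMSE and MAE by hand on a few made-up (actual, predicted) rating pairs, and then averages the five per-fold RMSE values reported by `cross_validate` above. The helper names `rmse` and `mae` are my own, not Surprise's.

```python
import math
from statistics import mean


def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((actual - est) ** 2 for actual, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (actual, predicted) pairs."""
    return sum(abs(actual - est) for actual, est in pairs) / len(pairs)


# Made-up ratings and predictions, for illustration only
pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0), (2.0, 3.0)]
print(round(rmse(pairs), 4), mae(pairs))

# Per-fold RMSE values copied from the cross-validation table above
fold_rmse = [0.8813, 0.8684, 0.8711, 0.8752, 0.8693]
mean_rmse = mean(fold_rmse)
print(round(mean_rmse, 4))
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does, which is why it is the larger of the two figures in the table.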

### The improved recommendation system (a hybrid)

I chose to apply the collaborative filter to the second content-based recommender (`others_also_like`), as it yielded the best results of the three recommenders I built.
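
The hybrid idea boils down to two stages: the content-based step proposes a shortlist, and the collaborative filter re-orders it for a given user. The sketch below shows only the second stage in plain Python. The `predict_rating` stand-in, the placeholder titles and the estimates are all hypothetical; in the notebook this role is played by `svd.predict(...).est`.

```python
def rerank(candidates, predict_rating, user_id):
    """Order (title, movie_id) candidates by the user's predicted rating."""
    scored = [(title, predict_rating(user_id, movie_id)) for title, movie_id in candidates]
    # Highest predicted rating first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in for a trained collaborative filter
fake_estimates = {(1, 10): 3.9, (1, 20): 4.3, (1, 30): 4.0}


def predict_rating(user_id, movie_id):
    # Fall back to an assumed global mean for unseen (user, movie) pairs
    return fake_estimates.get((user_id, movie_id), 3.5)


# Shortlist as it might come out of the content-based step (placeholder titles)
candidates = [("Movie A", 10), ("Movie B", 20), ("Movie C", 30)]
ranked = rerank(candidates, predict_rating, user_id=1)
print(ranked)
```

Keeping the predictor as a plain function argument makes it easy to swap in a different model without touching the ranking logic.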

In [64]:
# To build the hybrid algorithm, I need a map of movie ids and tmdbids to work as indexers

# helper function: cast an id to int, or return NaN if it is missing or malformed
def convert_int(x):
    try:
        return int(x)
    except (TypeError, ValueError):
        return np.nan

# create an index using the movie id and tmdbids of movies in the dataset
id_map = pd.read_csv('data/ml-latest-small/links.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(df[['title', 'id']], on='id').set_index('title')

# set record id as the index
indices_map = id_map.set_index('id')

In [65]:
def hybrid_recsys(df=df,ratings=ratings,id_map=id_map):
    """
    This function takes a movie title in any case and returns
    up to ten similar movies based on the input movie's details, 
    which include the cast members, director, genres, producers,
    production companies, keywords and spoken languages.
    
    It also applies a collaborative filter based on the ratings of 
    other users to estimate how the current user would rate each 
    recommendation.
    """
    
    # copying the dataframe to maintain its integrity
    df = df.copy()
    
    # dropping records with no titles, if any
    no_title = df[df['title'].isnull()]
    df = df.drop(no_title.index)
    
    # dealing with the input features
    features = ['cast','genres','producer',
                'production_companies',
                'spoken_languages']

    for feat in features:
        df[feat] = df[feat].apply(literal_eval)
        df[feat] = df[feat].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

    df['director'] = df['director'].apply(lambda x: [x, x])
    
    # exploding the keywords so each keyword is a record
    df['keywords'] = df['keywords'].apply(literal_eval)
    s = df.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
    s.name = 'keyword'
    
    # getting the frequency of each keyword & filtering out unique keywords as they are not useful
    s = s.value_counts()    
    s = s[s > 1]
    
    # helper function to filter keywords by frequency
    def freq_filter(x):
        words = []
        for i in x:
            if i in s:
                words.append(i)
        return words
    
    # loading the stem function
    stem_filter = SnowballStemmer('english')
    
    # apply the keyword filters
    df['keywords'] = df['keywords'].apply(freq_filter)
    df['keywords'] = df['keywords'].apply(lambda x: [stem_filter.stem(i) for i in x])
    df['keywords'] = df['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
    
    # creating the details text
    details = ['cast','director','genres','producer',
               'production_companies','keywords',
               'spoken_languages']

    df['details'] = df[details].apply(lambda x: ' '.join(x.astype(str).str.strip("'[]").str.replace("', '", ", ")), axis=1)
    
    # create an indexer using the titles column
    df = df.reset_index()
    df['title_i'] = df['title']
    titles = df['title']
    indices = pd.Series(df.index, index=df['title_i'].str.lower())
    
    # user-defined title
    print("Select a Movie")
    title = input().lower()
    
    # find the input title in the indexer
    idx = indices[title]
    
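    # vectorize the details text; min_df=1 keeps every term, and newer
    # scikit-learn releases require an integer min_df to be at least 1
    vect = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')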
    mtrx = vect.fit_transform(df['details'])
    
    # calculating the cosine similarity
    cosine_sim = cosine_similarity(mtrx, mtrx)

    # create the similarity score for the selected movie and sort from highest to lowest
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # return the 10 most similar movies in the dataset
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    
    # get the movie data
    movies = df.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']].copy()
    
    # select a user identity
    print("User Id:")
    userId = input()

    # build the filter
    ratings = ratings.copy()
    reader = Reader()
    data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
    svd = SVD()
    
    # defining a cross-validation iterator
    kf = KFold(n_splits=5)
    for trainset, testset in kf.split(data):
    
        # train and test algorithm.
        svd.fit(trainset)
        predictions = svd.test(testset)

    # apply the collaborative filter and sort the df by the estimated rating
    movies['est'] = round(movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est),1)    
    movies = movies.sort_values('est', ascending=False)
    
    # calculating the minimum number of votes movies need to be recommended (m)
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count']
    m = vote_counts.quantile(0.60)
    
    # calculating the average vote across the shortlisted movies (C) 
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average']
    C = vote_averages.mean()
    
    # filtering out the database against null vote count and average vote values 
    movies = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())]
    
    # copying the qualified movies' title, release year, vote count, 
    # average vote, and collaborative filter estimate to a new df
    movies = movies[['title', 'year', 'vote_count', 'vote_average', 'est']].copy()
       
    # calculating the weighted score
    v = movies['vote_count'].astype('int')
    R = movies['vote_average'].astype('int')
    movies['weighted_score'] = round((v/(v+m) * R) + (m/(m+v) * C), 1)
    
    # sort by the collaborative filter's estimated rating
    movies = movies.sort_values('est', ascending=False)

    return movies.head(10)

In [66]:
hybrid_recsys()

Select a Movie
se7en
User Id:
1


Unnamed: 0,title,year,vote_count,vote_average,est,weighted_score
46,The Usual Suspects,1995,9050,8,4.3,7.6
98,Taxi Driver,1976,10171,8,4.1,7.6
31,Twelve Monkeys,1995,7243,7,4.0,6.8
5,Heat,1995,5977,7,3.9,6.8


The goal of this project was to build a movie recommendation system using data scraped from [GroupLens](https://grouplens.org/datasets/movielens/latest/) and [The Movie Database](https://www.themoviedb.org/). I began by building a simple recommendation system that provided suggestions based on the movies' weighted ratings. To fine-tune the system, I also added a genre filter to allow users to pick a movie category of interest. Unfortunately, this simple recommender failed to detect subtle differences between movies of the same genre and did not account for users' preferences. Thus, I decided to build a content-based recommender instead. This algorithm used the movies' metadata, including the cast, crew and keywords, to base its suggestions on the content of a selected movie. Nevertheless, it still failed to account for user preferences, so I added a collaborative filter based on user ratings to the system. This filter uses the ratings made by a community of movie watchers to predict how a given user would rate a recommended movie.

Tags: ML, prediction, recommendation system, collaborative filter