# Introduction

This notebook builds a recommendation engine from a dataset of close to 5000 movies. The dataset contains information about each movie, such as its description, cast, and genres.

The user specifies a movie, and the engine returns the 10 movies that are most similar to it.

The dataset describes the content of each movie but does not include ratings from individual users, so I will build a **content-based recommendation engine** rather than one based on collaborative filtering. The dataset can be found [here](https://www.kaggle.com/tmdb/tmdb-movie-metadata).

# Loading data

In [23]:
# Importing libraries
import pandas as pd
import numpy as np
import json
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [24]:
# Loading data
df_cred = pd.read_csv("tmdb_5000_credits.csv")
df_mov = pd.read_csv("tmdb_5000_movies.csv")

In [25]:
print(df_cred.shape)
print(df_mov.shape)

(4803, 4)
(4803, 20)


In [26]:
# Rename columns for merge
df_cred.rename(columns={"movie_id": "id","title":'title_cred'}, inplace=True)

# Merge dataframes
df = df_mov.merge(df_cred,on='id')
df.shape

(4803, 23)

In [27]:
df.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,title_cred,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."


# Data pre-processing

### Converting JSON features

Several of the columns contain JSON-encoded strings, so these need to be decoded into Python objects first.

In [28]:
df.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count', 'title_cred', 'cast', 'crew'],
      dtype='object')

In [29]:
# Checking type of columns
json_features = ['cast', 'crew', 'keywords', 'genres']
for feat in json_features:
  print(isinstance(df[feat][0], list))

False
False
False
False


In [30]:
# Decoding JSON columns
json_features = ['cast', 'crew', 'keywords', 'genres']
for column in json_features:
  df[column] = df[column].apply(json.loads)

In [31]:
# Checking type of decoded columns
for feat in json_features:
  print(isinstance(df[feat][0], list))

True
True
True
True
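
The decoding step can be tried on a single cell in isolation. The sketch below uses a hand-written string in the same shape as a raw `genres` cell (the value is illustrative, not copied from the dataset) and shows that `json.loads` turns it into a list of dictionaries.

```python
import json

# Hand-written sample in the same shape as a raw `genres` cell (illustrative only)
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# Before decoding, the cell is a plain string
print(isinstance(raw_genres, list))  # False

# json.loads parses the string into Python lists and dictionaries
decoded = json.loads(raw_genres)
print(isinstance(decoded, list))     # True
print([g["name"] for g in decoded])  # ['Action', 'Adventure']
```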


### Functions to get necessary features

In [32]:
# Keywords are stored in dictionaries with a 'name' key
df.keywords[2]

[{'id': 470, 'name': 'spy'},
 {'id': 818, 'name': 'based on novel'},
 {'id': 4289, 'name': 'secret agent'},
 {'id': 9663, 'name': 'sequel'},
 {'id': 14555, 'name': 'mi6'},
 {'id': 156095, 'name': 'british secret service'},
 {'id': 158431, 'name': 'united kingdom'}]

In [33]:
# Function to return keywords
def get_keyword(x):
    if isinstance(x, list):
        keywords = [i['name'] for i in x] 
        return keywords
    return []

In [34]:
# Function to return the director out of crew column, if it contains the director
def get_dir(crew):
    for i in crew:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [35]:
# Return the names of the first three elements in the list
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x] 
        if len(names) > 3:
            names = names[:3]
        return names
    return []
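
To see what these helpers extract, here is a minimal, self-contained sketch on hand-made records. The `crew` and `cast` values are invented for illustration, and `first_director` and `top_names` are hypothetical stand-ins for `get_dir` and `get_list`; `first_director` returns `None` instead of `np.nan` so that the example needs no imports.

```python
# Invented records in the same shape as the decoded `crew` and `cast` columns
crew = [
    {"job": "Editor", "name": "Jane Roe"},
    {"job": "Director", "name": "John Doe"},
]
cast = [{"name": n} for n in ["Actor A", "Actor B", "Actor C", "Actor D"]]

def first_director(crew_list):
    """Return the name of the first crew member whose job is 'Director', else None."""
    return next((m["name"] for m in crew_list if m.get("job") == "Director"), None)

def top_names(items, n=3):
    """Return at most the first n 'name' values; non-list input gives an empty list."""
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items[:n]]

print(first_director(crew))     # John Doe
print(top_names(cast))          # ['Actor A', 'Actor B', 'Actor C']
print(top_names(float("nan")))  # []
```

Slicing with `items[:n]` already handles lists shorter than `n`, so no separate length check is needed.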

In [36]:
# Apply functions to columns
df['director'] = df['crew'].apply(get_dir)
features = ['cast', 'genres']
for feature in features:
    df[feature] = df[feature].apply(get_list)
df['keywords'] = df['keywords'].apply(get_keyword)

In [37]:
df.keywords[2]

['spy',
 'based on novel',
 'secret agent',
 'sequel',
 'mi6',
 'british secret service',
 'united kingdom']

### Cleaning features

In [38]:
# Function to convert to lowercase and strip names of spaces.
# Values that are neither lists nor strings (such as NaN) become ''.
def clean_data(feature):
    if isinstance(feature, list):
        return [item.replace(" ", "").lower() for item in feature]
    elif isinstance(feature, str):
        return feature.replace(" ", "").lower()
    else:
        return ''
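
Removing the spaces matters because the vectorizer used later splits text into word tokens. The sketch below imitates that split with the regular expression `(?u)\b\w\w+\b`, which is the default `token_pattern` of scikit-learn's `CountVectorizer`. The `tokenize` helper is a hypothetical stand-in for illustration and is not part of the notebook.

```python
import re

# Default token pattern of sklearn's CountVectorizer: runs of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Approximate CountVectorizer's default tokenization (lowercase, then regex split)."""
    return TOKEN_PATTERN.findall(text.lower())

# With the space kept, first and last name become separate tokens,
# so two different people named Johnny would share the token 'johnny'
print(tokenize("Johnny Depp"))  # ['johnny', 'depp']

# With the space removed, the full name is a single token
print(tokenize("johnnydepp"))   # ['johnnydepp']
```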

In [39]:
# Cleaning features
features = ['cast', 'keywords', 'director', 'genres']
for feature in features:
    df[feature] = df[feature].apply(clean_data)

In [40]:
features = ['title','keywords','cast','genres','director']

In [41]:
df[features].head()

| | title | keywords | cast | genres | director |
|---|---|---|---|---|---|
| 0 | Avatar | [cultureclash, future, spacewar, spacecolony, ...] | [samworthington, zoesaldana, sigourneyweaver] | [action, adventure, fantasy] | jamescameron |
| 1 | Pirates of the Caribbean: At World's End | [ocean, drugabuse, exoticisland, eastindiatrad...] | [johnnydepp, orlandobloom, keiraknightley] | [adventure, fantasy, action] | goreverbinski |
| 2 | Spectre | [spy, basedonnovel, secretagent, sequel, mi6, ...] | [danielcraig, christophwaltz, léaseydoux] | [action, adventure, crime] | sammendes |
| 3 | The Dark Knight Rises | [dccomics, crimefighter, terrorist, secretiden...] | [christianbale, michaelcaine, garyoldman] | [action, crime, drama] | christophernolan |
| 4 | John Carter | [basedonnovel, mars, medallion, spacetravel, p...] | [taylorkitsch, lynncollins, samanthamorton] | [action, adventure, sciencefiction] | andrewstanton |


In [42]:
# Fill all NaNs with an empty string
for feature in features:
    df[feature] = df[feature].fillna('')

In [43]:
def combine_features(row):
    return ' '.join(row['keywords']) + ' ' + ' '.join(row['cast']) + ' ' + ' '.join(row['genres']) + ' ' + row['director']  

In [44]:
# Combining features
df["combined_features"] = df.apply(combine_features,axis=1) 

# Create content-based recommender system

### Create vectorizer and similarity score matrix

In [45]:
# Initialize the count vectorizer and build the term-count matrix
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df["combined_features"])

I will compare movies based on their cosine similarity scores.

In [46]:
cosine_sim = cosine_similarity(count_matrix)
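
For intuition, cosine similarity can be worked through by hand on two short feature strings. The sketch below is a simplified, pure-Python version: it splits on whitespace and applies no stop-word filtering, so it only approximates what `CountVectorizer` followed by `cosine_similarity` computes. The two feature strings and the `cosine` helper are made up for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    # Dot product over the words the two strings share
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty feature string is similar to nothing
    return dot / (norm_a * norm_b)

# Made-up combined feature strings: 3 of the 5 words are shared
movie_a = "spy sequel danielcraig action sammendes"
movie_b = "spy heist danielcraig action martincampbell"

score = cosine(movie_a, movie_b)
print(round(score, 3))  # 0.6
```

Each string has five distinct words, so both vectors have length sqrt(5), and three shared words give 3 / (sqrt(5) * sqrt(5)) = 0.6. A movie compared with itself scores 1.0, which is why the diagonal of the matrix is all ones.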

In [47]:
# Display cosine similarity matrix: each column and each row corresponds to a movie
df_cosine_sim = pd.DataFrame(cosine_sim)
df_cosine_sim.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,4789,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799,4800,4801,4802
0,1.0,0.115728,0.101015,0.035714,0.197028,0.115728,0.0,0.141737,0.09759,0.141737,0.137505,0.113961,0.123718,0.084515,0.141737,0.086711,0.09167,0.236433,0.104828,0.169031,0.130066,0.101015,0.104828,0.09759,0.09167,0.109109,0.133631,0.065795,0.118217,0.086711,0.130066,0.118217,0.188982,0.101015,0.0,0.09167,0.151523,0.141737,0.146385,0.09167,...,0.077152,0.0,0.0,0.0,0.0,0.071429,0.077152,0.054554,0.0,0.059761,0.0,0.062994,0.0,0.0,0.0,0.059761,0.0,0.0,0.05698,0.0,0.0,0.066815,0.0,0.059761,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052414,0.066815,0.052414,0.0,0.0
1,0.115728,1.0,0.109109,0.038576,0.085126,0.208333,0.0,0.102062,0.105409,0.153093,0.148522,0.123091,0.579066,0.182574,0.153093,0.093659,0.148522,0.340503,0.056614,0.136931,0.140488,0.109109,0.113228,0.105409,0.19803,0.078567,0.144338,0.1066,0.085126,0.093659,0.234146,0.127688,0.204124,0.163663,0.051031,0.099015,0.109109,0.102062,0.158114,0.099015,...,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.06455,0.068041,0.0,0.0,0.0,0.0,0.06455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.056614,0.0,0.0,0.0,0.0
2,0.101015,0.109109,1.0,0.101015,0.167183,0.163663,0.0,0.200446,0.069007,0.133631,0.194461,0.402911,0.116642,0.119523,0.133631,0.122628,0.129641,0.111456,0.074125,0.119523,0.122628,0.142857,0.074125,0.138013,0.129641,0.0,0.188982,0.093048,0.167183,0.429198,0.183942,0.111456,0.133631,0.142857,0.0,0.129641,0.214286,0.066815,0.20702,0.129641,...,0.0,0.0,0.0,0.0,0.0,0.0,0.109109,0.0,0.0,0.084515,0.0,0.0,0.0,0.0,0.0,0.169031,0.0,0.101015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054554,0.0,0.0,0.0,0.106904,0.0,0.0,0.0,0.0,0.0,0.14825,0.0,0.0,0.0,0.0


In [48]:
# Create index column for finding movie
df['index'] = df.index

# Functions to find movie from index and vice versa
def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]
def get_index_from_title(movie):
    return df[df.title == movie]["index"].values[0]
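
Note that `get_index_from_title` raises an `IndexError` when the title is not in the dataframe, because `.values[0]` indexes an empty array. A more forgiving lookup might look like the sketch below. It is a hypothetical alternative, not the notebook's method: `find_index` is an invented name, and the plain list `titles` stands in for `df["title"].tolist()`.

```python
def find_index(titles, query):
    """Return the position of the first title matching query (case-insensitive), or None."""
    wanted = query.strip().casefold()
    for position, title in enumerate(titles):
        if title.casefold() == wanted:
            return position
    return None

titles = ["Avatar", "Spectre", "John Carter"]  # stand-in for df["title"].tolist()

print(find_index(titles, "spectre"))    # 1
print(find_index(titles, "Inception"))  # None
```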

### Get similar movies

As an example, suppose the user chooses the movie "Inception".

In [79]:
# Example of a user's chosen movie
users_movie = "Inception"

In [80]:
# Get movie index
movie_index = get_index_from_title(users_movie)

# Create a list of tuples containing movie index and similarity score
similar_movies = list(enumerate(cosine_sim[movie_index]))

In [81]:
# Display first five tuples containing movie index and similarity score to user's movie
similar_movies[0:5]

[(0, 0.0472455591261534),
 (1, 0.051031036307982884),
 (2, 0.0668153104781061),
 (3, 0.0944911182523068),
 (4, 0.10425720702853739)]

In [82]:
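# Sort movies by similarity (highest first) and keep the 10 best matches,
# skipping the first entry, which is expected to be the chosen movie itself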
sorted_similar_movies = sorted(similar_movies,key=lambda x:x[1],reverse=True)[1:11]

In [83]:
# Get similar movies' titles (the list already holds exactly 10 entries)
print("Top 10 similar movies to "+users_movie, ":\n")
for item in sorted_similar_movies:
    print(get_title_from_index(item[0]))

Top 10 similar movies to Inception :

The Helix... Loaded
Timecop
The One
Premium Rush
Surrogates
Impostor
Megiddo: The Omega Code 2
Triple 9
The Cold Light of Day
Looper
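
The slice `[1:11]` used above relies on the chosen movie sorting first with a score of 1.0. If another movie happened to have identical features, it could take the first position and be dropped instead. The sketch below shows one possible way to avoid that assumption by excluding the chosen movie by index. `top_k_similar` and the toy scores are invented for illustration.

```python
import heapq

def top_k_similar(scores, query_index, k=10):
    """Return the k (index, score) pairs with the highest score, skipping query_index."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != query_index)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Toy similarity row for movie 0; movie 3 has identical features (score 1.0)
toy_scores = [1.0, 0.2, 0.5, 1.0, 0.1]

print(top_k_similar(toy_scores, query_index=0, k=3))
# [(3, 1.0), (2, 0.5), (1, 0.2)]
```

In the notebook this would correspond to calling the helper with `cosine_sim[movie_index]` and `movie_index`.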


### Sort similar movies by average vote

In [84]:
# Re-sort the 10 similar movies by vote average (highest first)
sort_by_average_vote = sorted(sorted_similar_movies,key=lambda x: df["vote_average"][x[0]],reverse=True)
sort_by_average_vote[0:5]

[(1568, 0.24514516892273006),
 (1431, 0.2581988897471611),
 (1272, 0.25),
 (476, 0.25),
 (1002, 0.2651650429449553)]

In [85]:
# Print the 10 movie recommendations, ordered by average vote
print("Top 10 similar movies to",users_movie, "based on average ratings:\n")
for element in sort_by_average_vote:
    print(get_title_from_index(element[0]))

Top 10 similar movies to Inception based on average ratings:

Looper
Premium Rush
Impostor
Surrogates
The One
Triple 9
Timecop
The Helix... Loaded
The Cold Light of Day
Megiddo: The Omega Code 2
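
The re-ranking step can be illustrated on toy data. In the sketch below the (index, similarity) pairs and the vote averages are invented; the dictionary `toy_votes` stands in for the `vote_average` column. Because Python's `sorted` is stable, movies with the same vote average keep their original similarity order.

```python
# Toy (index, similarity) pairs, already in descending similarity order
toy_similar = [(10, 0.9), (11, 0.8), (12, 0.7), (13, 0.6)]

# Invented vote averages keyed by movie index; movies 11 and 13 tie at 7.0
toy_votes = {10: 6.5, 11: 7.0, 12: 8.1, 13: 7.0}

# Sort by vote average, highest first; ties keep their similarity order
reranked = sorted(toy_similar, key=lambda pair: toy_votes[pair[0]], reverse=True)

print(reranked)
# [(12, 0.7), (11, 0.8), (13, 0.6), (10, 0.9)]
```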
