# **Metadata-based Recommender Systems**

Now, let's say a person is very fond of certain directors and genres and wants movie recommendations based on that kind of metadata. This metadata can describe the movie or the user. For such users, let us build a recommender system that recommends movies based on metadata.

The dataset we will be using is the MovieLens 100k dataset on Kaggle:

https://www.kaggle.com/prajitdatta/movielens-100k-dataset





In [0]:
#importing necessary libraries

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer


In [0]:
from google.colab import files
uploaded = files.upload()

Saving movies_metadata.csv to movies_metadata (1).csv


In [0]:
movies = pd.read_csv('movies_metadata.csv')



In [0]:
from google.colab import files
uploaded = files.upload()

Saving credits.csv to credits (1).csv


In [0]:
credits = pd.read_csv('credits.csv')

In [0]:
from google.colab import files
uploaded = files.upload()

Saving keywords.csv to keywords (1).csv


In [0]:
#keywords like jealousy, fishing, etc. that belong to particular movies are also part of the metadata.
#we will grab the keywords from keywords.csv
keywords = pd.read_csv('keywords.csv')

In [0]:
#selecting the necessary columns (copy() avoids pandas warnings when we modify the columns later)
movies = movies[['id','title','genres']].copy()

In [0]:
movies.head()

Unnamed: 0,id,title,genres
0,862,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,8844,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,15602,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
3,31357,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4,11862,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]"


In [0]:
movies.dtypes

id        object
title     object
genres    object
dtype: object

Movie ids are stored as strings (object datatype) rather than integers because of some dirty values. Let's clean the `id` column before proceeding.

In [0]:
sum(movies['id'].isnull())

0

In [0]:
movies['id'].value_counts()

141971    3
12600     2
25541     2
5511      2
11115     2
         ..
61532     1
50942     1
26293     1
115162    1
34138     1
Name: id, Length: 45436, dtype: int64

In [0]:
#function to clean the id column

def clean_id(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        #values that cannot be converted to an integer become NaN
        return np.nan

In [0]:
movies['id'] = movies['id'].apply(clean_id)
movies = movies[movies['id'].notnull()]
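
The same cleanup can also be sketched with `pd.to_numeric`, which turns unparseable values into NaN in one vectorized call. The toy frame and its values below are made up for illustration; only the `id` column name comes from the dataset. Note that this is not identical to `clean_id`: a string such as `'1.5'` would be parsed as a number here, whereas `int('1.5')` fails.

```python
import pandas as pd

# Hypothetical toy frame: two valid ids and one dirty value.
toy_movies = pd.DataFrame({'id': ['862', '8844', '1997-08-20'],
                           'title': ['Toy Story', 'Jumanji', 'bad row']})

# errors='coerce' replaces anything that cannot be parsed as a number with NaN.
toy_movies['id'] = pd.to_numeric(toy_movies['id'], errors='coerce')

# Drop the rows whose id could not be parsed, then cast the rest to int.
toy_movies = toy_movies[toy_movies['id'].notnull()].copy()
toy_movies['id'] = toy_movies['id'].astype('int')

print(toy_movies)
```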

In [0]:
#converting everything into integer
movies['id'] = movies['id'].astype('int')
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')

In [0]:
#merging the 3 dataframes to get all the required data in 1 dataframe, movies
movies = movies.merge(credits, on='id')
movies = movies.merge(keywords, on='id')
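
The `value_counts()` output earlier shows that some ids appear more than once. With an inner merge, every repeated id on one side is paired with every matching row on the other, so duplicates can multiply. Here is a minimal sketch on made-up frames, assuming you want one row per id; the notebook itself does not deduplicate, so treat this as an optional extra step.

```python
import pandas as pd

# Hypothetical frames: id 1 appears twice on both sides.
toy_left = pd.DataFrame({'id': [1, 1, 2], 'title': ['A', 'A', 'B']})
toy_right = pd.DataFrame({'id': [1, 1, 2], 'cast': ['x', 'x', 'y']})

# Plain inner merge: the two rows for id 1 pair up into 2 * 2 = 4 rows.
merged_raw = toy_left.merge(toy_right, on='id')

# Dropping duplicate ids first keeps one row per movie.
merged_dedup = (toy_left.drop_duplicates(subset='id')
                        .merge(toy_right.drop_duplicates(subset='id'), on='id'))

print(len(merged_raw), len(merged_dedup))
```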

In [0]:
movies.head()

Unnamed: 0,id,title,genres,cast,crew,keywords
0,862,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]","[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [0]:
movies.dtypes

id           int64
title       object
genres      object
cast        object
crew        object
keywords    object
dtype: object

The **genres**, **cast**, **crew** and **keywords** columns all have the object datatype (here, strings). Let's extract the words we need from these columns by first using `literal_eval` to convert the strings into Python objects and then using pandas and numpy to wrangle them.
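
To make the conversion concrete, here is a small standalone sketch of what `literal_eval` does to one string shaped like a cell of the genres column. The variable names are illustrative only.

```python
from ast import literal_eval

# A string shaped like one cell of the genres column.
raw_genres = "[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]"

# literal_eval parses Python literals (lists, dicts, strings, numbers)
# without executing arbitrary code.
parsed_genres = literal_eval(raw_genres)

# Pull out just the names, lower-cased.
genre_names = [g['name'].lower() for g in parsed_genres]

print(type(parsed_genres).__name__, genre_names)
```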

In [0]:
movies['genres'][2]

"[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]"

In [0]:
print(sum(movies['genres'].isnull()))
print(sum(movies['cast'].isnull()))
print(sum(movies['crew'].isnull()))
print(sum(movies['keywords'].isnull()))

0
0
0
0


In [0]:
#changing the 4 columns into python objects ( list of dictionaries here)
movies['genres'] = movies['genres'].apply(literal_eval)
movies['cast'] = movies['cast'].apply(literal_eval)
movies['crew'] = movies['crew'].apply(literal_eval)
movies['keywords'] = movies['keywords'].apply(literal_eval)


In [0]:
movies.dtypes

id           int64
title       object
genres      object
cast        object
crew        object
keywords    object
dtype: object

In [0]:
movies.head()

Unnamed: 0,id,title,genres,cast,crew,keywords
0,862,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'cast_id': 1, 'character': 'Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]","[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [0]:
#grabbing the names of all the genres attached to each movie
movies['genres'] = movies['genres'].apply(lambda x: [i['name'].lower() for i in x])

In [0]:
movies.head()

Unnamed: 0,id,title,genres,cast,crew,keywords
0,862,Toy Story,"[animation, comedy, family]","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,"[adventure, fantasy, family]","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,Grumpier Old Men,"[romance, comedy]","[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,Waiting to Exhale,"[comedy, drama, romance]","[{'cast_id': 1, 'character': 'Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,Father of the Bride Part II,[comedy],"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [0]:
#grabbing the names of the directors from all the crew members
#we will only use directors from the crew column for our purpose
movies['crew'] = movies['crew'].apply(lambda x: [i['name'].lower() for i in x if i['job']=='Director'])

In [0]:
#grabbing the cast and keywords from the list of dictionaries of those columns

movies['cast'] = movies['cast'].apply(lambda x: [i['name'].lower() for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [i['name'].lower() for i in x])

In [0]:
movies.head()

Unnamed: 0,id,title,genres,cast,crew,keywords
0,862,Toy Story,"[animation, comedy, family]","[tom hanks, tim allen, don rickles, jim varney...",[john lasseter],"[jealousy, toy, boy, friendship, friends, riva..."
1,8844,Jumanji,"[adventure, fantasy, family]","[robin williams, jonathan hyde, kirsten dunst,...",[joe johnston],"[board game, disappearance, based on children'..."
2,15602,Grumpier Old Men,"[romance, comedy]","[walter matthau, jack lemmon, ann-margret, sop...",[howard deutch],"[fishing, best friend, duringcreditsstinger, o..."
3,31357,Waiting to Exhale,"[comedy, drama, romance]","[whitney houston, angela bassett, loretta devi...",[forest whitaker],"[based on novel, interracial relationship, sin..."
4,11862,Father of the Bride Part II,[comedy],"[steve martin, diane keaton, martin short, kim...",[charles shyer],"[baby, midlife crisis, confidence, aging, daug..."


In [0]:
#keeping at most 3 genres/cast members/keywords for each movie
#(slicing with [:3] already returns the whole list when it has 3 or fewer items)
movies['genres'] = movies['genres'].apply(lambda x: x[:3])
movies['cast'] = movies['cast'].apply(lambda x: x[:3])
movies['keywords'] = movies['keywords'].apply(lambda x: x[:3])

In [0]:
movies.head()

Unnamed: 0,id,title,genres,cast,crew,keywords
0,862,Toy Story,"[animation, comedy, family]","[tom hanks, tim allen, don rickles]",[john lasseter],"[jealousy, toy, boy]"
1,8844,Jumanji,"[adventure, fantasy, family]","[robin williams, jonathan hyde, kirsten dunst]",[joe johnston],"[board game, disappearance, based on children'..."
2,15602,Grumpier Old Men,"[romance, comedy]","[walter matthau, jack lemmon, ann-margret]",[howard deutch],"[fishing, best friend, duringcreditsstinger]"
3,31357,Waiting to Exhale,"[comedy, drama, romance]","[whitney houston, angela bassett, loretta devine]",[forest whitaker],"[based on novel, interracial relationship, sin..."
4,11862,Father of the Bride Part II,[comedy],"[steve martin, diane keaton, martin short]",[charles shyer],"[baby, midlife crisis, confidence]"


Now we have the clean data required for building the recommender system. We just need to **remove spaces** between first names and surnames. If we don't remove them, the vectorizer will split each name into separate words, so **Tom Cruise** and **Tom Hanks** would share the token "tom" and their movies would look partly similar to the machine. Let's remove the spaces so that tom hanks becomes tomhanks and tom cruise becomes tomcruise.
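
The effect is easy to check on two made-up strings. This sketch assumes `CountVectorizer` with its default tokenizer and only compares the vocabularies it learns.

```python
from sklearn.feature_extraction.text import CountVectorizer

names_with_spaces = ['tom hanks', 'tom cruise']
names_without_spaces = ['tomhanks', 'tomcruise']

# With spaces, each name is split into two tokens and "tom" is shared.
vocab_with = set(CountVectorizer().fit(names_with_spaces).vocabulary_)

# Without spaces, each actor is a single, distinct token.
vocab_without = set(CountVectorizer().fit(names_without_spaces).vocabulary_)

print(sorted(vocab_with))
print(sorted(vocab_without))
```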

In [0]:
#removing spaces
movies['cast'] = movies['cast'].apply(lambda x: [i.replace(' ','') for i in x])
movies['crew'] = movies['crew'].apply(lambda x: [i.replace(' ','') for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [i.replace(' ','') for i in x])
movies['genres'] = movies['genres'].apply(lambda x: [i.replace(' ','') for i in x])

In [0]:
movies.head()

Unnamed: 0,id,title,genres,cast,crew,keywords
0,862,Toy Story,"[animation, comedy, family]","[tomhanks, timallen, donrickles]",[johnlasseter],"[jealousy, toy, boy]"
1,8844,Jumanji,"[adventure, fantasy, family]","[robinwilliams, jonathanhyde, kirstendunst]",[joejohnston],"[boardgame, disappearance, basedonchildren'sbook]"
2,15602,Grumpier Old Men,"[romance, comedy]","[waltermatthau, jacklemmon, ann-margret]",[howarddeutch],"[fishing, bestfriend, duringcreditsstinger]"
3,31357,Waiting to Exhale,"[comedy, drama, romance]","[whitneyhouston, angelabassett, lorettadevine]",[forestwhitaker],"[basedonnovel, interracialrelationship, single..."
4,11862,Father of the Bride Part II,[comedy],"[stevemartin, dianekeaton, martinshort]",[charlesshyer],"[baby, midlifecrisis, confidence]"


**Now, let's make a single metadata column by joining the values in the genres, cast, crew and keywords columns**

In [0]:
movies['metadata'] = movies.apply(lambda x : ' '.join(x['genres']) + ' ' + ' '.join(x['cast']) + ' ' + ' '.join(x['crew']) + ' ' + ' '.join(x['keywords']), axis = 1)

In [0]:
movies.head()

Unnamed: 0,id,title,genres,cast,crew,keywords,metadata
0,862,Toy Story,"[animation, comedy, family]","[tomhanks, timallen, donrickles]",[johnlasseter],"[jealousy, toy, boy]",animation comedy family tomhanks timallen donr...
1,8844,Jumanji,"[adventure, fantasy, family]","[robinwilliams, jonathanhyde, kirstendunst]",[joejohnston],"[boardgame, disappearance, basedonchildren'sbook]",adventure fantasy family robinwilliams jonatha...
2,15602,Grumpier Old Men,"[romance, comedy]","[waltermatthau, jacklemmon, ann-margret]",[howarddeutch],"[fishing, bestfriend, duringcreditsstinger]",romance comedy waltermatthau jacklemmon ann-ma...
3,31357,Waiting to Exhale,"[comedy, drama, romance]","[whitneyhouston, angelabassett, lorettadevine]",[forestwhitaker],"[basedonnovel, interracialrelationship, single...",comedy drama romance whitneyhouston angelabass...
4,11862,Father of the Bride Part II,[comedy],"[stevemartin, dianekeaton, martinshort]",[charlesshyer],"[baby, midlifecrisis, confidence]",comedy stevemartin dianekeaton martinshort cha...


Due to memory limits in Google Colab, I will use only the first 10000 movies to build the recommender system here. The same code can be used at a larger scale given enough memory.
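
A rough back-of-the-envelope estimate shows why memory becomes the bottleneck: a dense similarity matrix has one value per pair of movies. The sketch below assumes 8 bytes per value (float64), and the 45,000 figure is only an illustrative size for a much larger catalogue, not a measured row count.

```python
def dense_similarity_bytes(n_movies, bytes_per_value=8):
    """Bytes needed for a dense n x n matrix (8 bytes per float64 value)."""
    return n_movies * n_movies * bytes_per_value

# 10,000 is the subset size used here; 45,000 is a hypothetical larger size.
for n in (10_000, 45_000):
    gib = dense_similarity_bytes(n) / 1024**3
    print(f'{n} movies -> {gib:.2f} GiB')
```

Because the cost grows with the square of the number of movies, a catalogue 4.5 times larger needs about 20 times the memory.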

In [0]:
movies_df = movies.iloc[0:10000,:]

In [0]:
movies_df.shape

(10000, 7)

I will use a **CountVectorizer** to build numeric features from our metadata. I am not using TF-IDF here because many movies may share the same director, and TF-IDF would down-weight that director's name. A user may well want to be recommended movies by that director. Most of the words we have are names and genres, whose counts are actually useful for recommending movies.
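
The down-weighting can be seen on a toy corpus. In this sketch the metadata strings and the director names `busydirector` and `raredirector` are made up; it only compares the inverse document frequency weights that `TfidfVectorizer` learns with the raw counts from `CountVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical metadata: one director shared by three movies, one by a single movie.
toy_docs = ['comedy busydirector',
            'drama busydirector',
            'thriller busydirector',
            'comedy raredirector']

# Raw counts: each director contributes 1 to a movie, however common they are.
toy_count_vec = CountVectorizer()
toy_counts = toy_count_vec.fit_transform(toy_docs)

# TF-IDF: the more documents a term appears in, the smaller its idf weight.
toy_tfidf_vec = TfidfVectorizer()
toy_tfidf_vec.fit(toy_docs)
toy_idf = {term: toy_tfidf_vec.idf_[col]
           for term, col in toy_tfidf_vec.vocabulary_.items()}

print('idf busydirector:', round(toy_idf['busydirector'], 3))
print('idf raredirector:', round(toy_idf['raredirector'], 3))
```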

I will use **cosine similarity** to measure the similarity between any two movies. Now let's build a cosine similarity matrix from the count vectorizer output and then write a recommender function.
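
As a quick worked example, cosine similarity is the dot product of two vectors divided by the product of their lengths. The two count vectors below are made up; the sketch checks the hand-computed value against scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical count vectors for two movies over a 4-word vocabulary.
vec_a = np.array([1, 1, 0, 1])
vec_b = np.array([1, 0, 1, 1])

# cosine(a, b) = (a . b) / (||a|| * ||b||)
manual_score = vec_a.dot(vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

# scikit-learn expects 2D inputs, one row per sample.
library_score = cosine_similarity(vec_a.reshape(1, -1), vec_b.reshape(1, -1))[0, 0]

print(manual_score, library_score)
```

The two movies share 2 of their 3 words, so both computations give 2/3.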

In [0]:
count_vec = CountVectorizer(stop_words='english')
count_vec_matrix = count_vec.fit_transform(movies_df['metadata'])

cosine_sim_matrix = cosine_similarity(count_vec_matrix, count_vec_matrix)

In [0]:
#movies index mapping
mapping = pd.Series(movies_df.index,index = movies_df['title'])

In [0]:
mapping

title
Toy Story                             0
Jumanji                               1
Grumpier Old Men                      2
Waiting to Exhale                     3
Father of the Bride Part II           4
                                   ... 
National Lampoon's Gold Diggers    9995
Blind Horizon                      9996
Islands in the Stream              9997
Go for Broke!                      9998
The Blood on Satan's Claw          9999
Length: 10000, dtype: int64

In [0]:
#recommender function to recommend movies based on metadata

def recommend_movies_based_on_metadata(movie_input):
  
  movie_index = mapping[movie_input]
  #get similarity values with other movies
  similarity_score = list(enumerate(cosine_sim_matrix[movie_index]))

  similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)
  
  # Keep the 14 most similar movies. Skip the first entry, which is normally the input movie itself.
  similarity_score = similarity_score[1:15]

  
  movie_indices = [i[0] for i in similarity_score]
  return (movies_df['title'].iloc[movie_indices])
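
The function above relies on two things: every title being unique, and the input movie always sorting first. A slightly more defensive variant might look like the sketch below. The function name, its parameters and the toy data are all hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def recommend_sketch(title, titles, sim_matrix, top_n=14):
    """Return up to top_n titles most similar to `title`.

    titles: pd.Series of titles whose positions match the rows of sim_matrix.
    """
    matches = np.flatnonzero((titles == title).to_numpy())
    if len(matches) == 0:
        raise KeyError(f'Title not found: {title!r}')
    idx = matches[0]  # if a title is repeated, use its first occurrence

    # Sort positions by descending score, then drop the query movie by position
    # instead of assuming it is always the first entry.
    order = np.argsort(-sim_matrix[idx], kind='stable')
    order = order[order != idx][:top_n]
    return titles.iloc[order]

# Toy data: four made-up movies and a symmetric similarity matrix.
toy_titles = pd.Series(['A', 'B', 'C', 'D'])
toy_sim = np.array([[1.0, 0.2, 0.9, 0.5],
                    [0.2, 1.0, 0.1, 0.3],
                    [0.9, 0.1, 1.0, 0.4],
                    [0.5, 0.3, 0.4, 1.0]])

print(recommend_sketch('A', toy_titles, toy_sim, top_n=2).tolist())
```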

In [0]:
recommend_movies_based_on_metadata('Blind Horizon')

1648              Ill Gotten Gains
3487    Jails, Hospitals & Hip-Hop
8331                        Fabled
1801               Little Boy Blue
3940                 Kill Me Again
1391              Hearts and Minds
2715             The Pelican Brief
9506                        Trauma
3020                       Country
5763                   Raggedy Man
9586          When Will I Be Loved
111               Before and After
458                  Guilty as Sin
627                          Frisk
Name: title, dtype: object

For the input movie 'Blind Horizon', we have been recommended 14 movies based on the metadata of the input movie. Isn't that amazing?

**Thank you**

Keep learning