### Movie recommendation - Content based filtering

There are three popular and widely used methods of approaching a recommendation system.

**1. Content based filtering:** *Similar content*

Recommendations are based on the similarity of the content. This method requires knowledge about the content and its categorization.

**2. Collaborative filtering:** *Similar users*

Recommendations are based on the similarity of users, judged by their watch history. This method doesn't require information on the type of content.

**3. Demographic filtering:** *Similar demographic features* or some generic recommendations

This implementation will use the movies and metadata dataset from [Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata "Kaggle") and apply content based similarity rules for recommendations.


### Importing dataset

In [200]:
import pandas as pd

credits_df = pd.read_csv(r"credits.csv")
movies_df = pd.read_csv(r"movies.csv")

In [201]:
credits_df.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [202]:
movies_df.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",10-12-2009,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",19-05-2007,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",26-10-2015,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466


### Combining datasets
Let us combine both datasets so that the cast and crew metadata sits alongside the genres and keywords.

In [203]:
movies_df = movies_df.merge(credits_df, left_on="id", right_on="movie_id")
movies_df.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title_x', 'vote_average',
       'vote_count', 'movie_id', 'title_y', 'cast', 'crew'],
      dtype='object')
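
The `title_x` and `title_y` columns in the output above appear because both files carry a `title` column, and pandas adds suffixes to overlapping non-key columns. A minimal sketch on hypothetical toy frames (`movies_toy`, `credits_toy` and their rows are illustrative, not taken from the dataset) shows the effect and one possible way to tidy it:

```python
import pandas as pd

# Hypothetical stand-ins for movies.csv and credits.csv
movies_toy = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "budget": [10, 20]})
credits_toy = pd.DataFrame({"movie_id": [1, 2], "title": ["A", "B"], "cast": ["[]", "[]"]})

# Same join as in the notebook: different key names on each side
merged_toy = movies_toy.merge(credits_toy, left_on="id", right_on="movie_id")
print(list(merged_toy.columns))

# Optional tidy-up: drop the redundant key and title, restore the plain name
tidy_toy = merged_toy.drop(columns=["movie_id", "title_y"]).rename(columns={"title_x": "title"})
print(list(tidy_toy.columns))
```

The notebook keeps the suffixed columns, which is harmless here because later cells look movies up through `original_title`.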

### Extracting prominent features

Let us understand how the data looks

In [204]:
movies_df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,spoken_languages,status,tagline,title_x,vote_average,vote_count,movie_id,title_y,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


Let us extract 3 cast members

In [205]:
type(movies_df['cast'][0])

str

As we can see, cast holds a list of dictionaries but is currently stored as a string. Let us apply literal_eval to convert it to the actual data type

In [206]:
from ast import literal_eval

features = ["cast", "crew", "keywords", "genres"]

for feature in features:
    movies_df[feature] = movies_df[feature].apply(literal_eval)

In [207]:
type(movies_df['cast'][0])

list

In [208]:
type(movies_df['cast'][0][0])

dict

In [209]:
movies_df['cast'][0][0]

{'cast_id': 242,
 'character': 'Jake Sully',
 'credit_id': '5602a8a7c3a3685532001c9a',
 'gender': 2,
 'id': 65731,
 'name': 'Sam Worthington',
 'order': 0}

Now we have a list of dictionaries for each row of cast data
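
The conversion can be illustrated on a single hand-written string. This is only a sketch: `raw` below is a shortened, made-up stand-in for one cell of the cast column. Because the stored text looks like JSON, `json.loads` gives the same result on this sample, though whether that holds for every row of the dataset is an assumption.

```python
from ast import literal_eval
import json

# Shortened, hand-written stand-in for one cell of the cast column
raw = '[{"cast_id": 242, "name": "Sam Worthington", "order": 0}]'

parsed = literal_eval(raw)  # safely evaluates a Python literal, no arbitrary code
print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed[0]["name"])

# For this sample the JSON parser agrees with literal_eval
same_as_json = json.loads(raw) == parsed
print(same_as_json)
```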

**Cast:** The cast feature lists every member of the cast, but we will take only the first 3 cast members as input

**Crew:** For now, let us extract only director from crew as input.

We extract up to 3 cast members into a main_cast column, where each row contains the list of up to 3 cast member names

In [210]:
movies_df['cast'][0][0]['name']

'Sam Worthington'

### Extracting the 3 cast members 

In [211]:
# Keep the names of the first three cast members (fewer if the list is shorter)
movies_df['main_cast'] = movies_df['cast'].apply(
    lambda x: [member['name'] for member in x[:3]] if isinstance(x, list) else []
)

In [212]:
movies_df['main_cast'][:5]

0    [Sam Worthington, Zoe Saldana, Sigourney Weaver]
1       [Johnny Depp, Orlando Bloom, Keira Knightley]
2        [Daniel Craig, Christoph Waltz, Léa Seydoux]
3        [Christian Bale, Michael Caine, Gary Oldman]
4      [Taylor Kitsch, Lynn Collins, Samantha Morton]
Name: main_cast, dtype: object

### Extracting the director name

Let us extract the director name with a function

In [213]:
import numpy as np

def get_director(crew):
    # Return the name of the first crew member whose job is "Director"
    for member in crew:
        if member['job'] == "Director":
            return member['name']
    # No director listed for this movie
    return np.nan

In [214]:
movies_df['director'] = movies_df['crew'].apply(get_director)

In [215]:
movies_df['director']

0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4          Andrew Stanton
              ...        
4798     Robert Rodriguez
4799         Edward Burns
4800          Scott Smith
4801          Daniel Hsia
4802     Brian Herzlinger
Name: director, Length: 4803, dtype: object

### Extracting 3 items from keywords and genres

The type of a movie can be characterised with available features such as genres and keywords. Since the extraction is the same for both, let us write a user-defined function for this repetitive task

In [216]:
def get_3_features(the_list):
    names = []
    if isinstance(the_list, list):
        for dictionary in the_list:
            names.append(dictionary['name'])
    if len(names) > 3:
        names = names[:3]
        
    return names    

In [217]:
movies_df['genres'] = movies_df['genres'].apply(get_3_features)
movies_df['keywords'] = movies_df['keywords'].apply(get_3_features)

In [219]:
movies_df.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,tagline,title_x,vote_average,vote_count,movie_id,title_y,cast,crew,main_cast,director
0,237000000,"[Action, Adventure, Fantasy]",http://www.avatarmovie.com/,19995,"[culture clash, future, space war]",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,Enter the World of Pandora.,Avatar,7.2,11800,19995,Avatar,"[{'cast_id': 242, 'character': 'Jake Sully', '...","[{'credit_id': '52fe48009251416c750aca23', 'de...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",James Cameron
1,300000000,"[Adventure, Fantasy, Action]",http://disney.go.com/disneypictures/pirates/,285,"[ocean, drug abuse, exotic island]",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,Pirates of the Caribbean: At World's End,"[{'cast_id': 4, 'character': 'Captain Jack Spa...","[{'credit_id': '52fe4232c3a36847f800b579', 'de...","[Johnny Depp, Orlando Bloom, Keira Knightley]",Gore Verbinski
2,245000000,"[Action, Adventure, Crime]",http://www.sonypictures.com/movies/spectre/,206647,"[spy, based on novel, secret agent]",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,A Plan No One Escapes,Spectre,6.3,4466,206647,Spectre,"[{'cast_id': 1, 'character': 'James Bond', 'cr...","[{'credit_id': '54805967c3a36829b5002c41', 'de...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",Sam Mendes


### Cleaning the data and making a corpus

Removing spaces between crew and cast names turns each name into a single token. For example, 'James Cameron' becomes jamescameron (the function also lowercases the text), which then acts as a unique identifier for the director

In [222]:
def clean_data(row):
    if isinstance(row, list):
        
        return [str.lower(i.replace(" ", "")) for i in row]
    else:
        if isinstance(row, str):
            return str.lower(row.replace(" ", ""))
        else:
            return ""

features = ['main_cast', 'keywords', 'director', 'genres']
for feature in features:
    movies_df[feature] = movies_df[feature].apply(clean_data)

In [223]:
movies_df[features]

Unnamed: 0,main_cast,keywords,director,genres
0,"[samworthington, zoesaldana, sigourneyweaver]","[cultureclash, future, spacewar]",jamescameron,"[action, adventure, fantasy]"
1,"[johnnydepp, orlandobloom, keiraknightley]","[ocean, drugabuse, exoticisland]",goreverbinski,"[adventure, fantasy, action]"
2,"[danielcraig, christophwaltz, léaseydoux]","[spy, basedonnovel, secretagent]",sammendes,"[action, adventure, crime]"
3,"[christianbale, michaelcaine, garyoldman]","[dccomics, crimefighter, terrorist]",christophernolan,"[action, crime, drama]"
4,"[taylorkitsch, lynncollins, samanthamorton]","[basedonnovel, mars, medallion]",andrewstanton,"[action, adventure, sciencefiction]"
...,...,...,...,...
4798,"[carlosgallardo, jaimedehoyos, petermarquardt]","[unitedstates–mexicobarrier, legs, arms]",robertrodriguez,"[action, crime, thriller]"
4799,"[edwardburns, kerrybishé, marshadietlein]",[],edwardburns,"[comedy, romance]"
4800,"[ericmabius, kristinbooth, crystallowe]","[date, loveatfirstsight, narration]",scottsmith,"[comedy, drama, romance]"
4801,"[danielhenney, elizacoupe, billpaxton]",[],danielhsia,[]
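
Why the spaces matter becomes clearer when looking at how a word-level tokenizer splits text. The sketch below uses the regular expression that scikit-learn documents as the default `token_pattern` of `CountVectorizer`; the helper `tokenize` and the two director names are only illustrative.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer: words of 2+ characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Lowercase the text and split it the way a default word tokenizer would."""
    return re.findall(TOKEN_PATTERN, text.lower())

# With spaces, two different directors share the token 'james'
shared_with = set(tokenize("James Cameron")) & set(tokenize("James Wan"))

# With spaces removed, each name is one token and nothing is shared
shared_without = set(tokenize("James Cameron".replace(" ", ""))) & set(
    tokenize("James Wan".replace(" ", ""))
)

print(shared_with)     # {'james'}
print(shared_without)  # set()
```

Without the cleaning step, movies by unrelated people who happen to share a first name would look slightly more similar than they should.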


Making a metadata feature by combining all the extracted features 

In [226]:
def create_soup(features):
    return ' '.join(features['keywords']) + ' ' + ' '.join(features['main_cast']) + ' ' + features['director'] + ' ' + ' '.join(features['genres'])


movies_df["soup"] = movies_df[features].apply(create_soup, axis=1)
print(movies_df["soup"].head())

0    cultureclash future spacewar samworthington zo...
1    ocean drugabuse exoticisland johnnydepp orland...
2    spy basedonnovel secretagent danielcraig chris...
3    dccomics crimefighter terrorist christianbale ...
4    basedonnovel mars medallion taylorkitsch lynnc...
Name: soup, dtype: object


### Creating Sparse matrix

Using CountVectorizer to create a frequency vector from soup and avoiding stop_words which don't add much value 

In [227]:
from sklearn.feature_extraction.text import CountVectorizer


count_vectorizer = CountVectorizer(stop_words="english")
count_matrix = count_vectorizer.fit_transform(movies_df["soup"])




In [232]:
count_matrix.shape

(4803, 11520)

This sparse matrix has one column for every distinct word in the corpus, and each entry counts how often that word appears in the given row

Let us look at the first row

In [244]:
print(count_matrix.toarray()[0][:500])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 

This shows that wherever a word from the corpus is present in the first row, the value is 1 or more
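
The same idea is easier to read on a tiny corpus. The following is a sketch with two made-up soups (`toy_soups`, `toy_vectorizer` and `toy_matrix` are illustrative names), not a slice of the real matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hand-written soups sharing only the genre token 'action'
toy_soups = [
    "cultureclash spacewar jamescameron action",
    "spy secretagent sammendes action",
]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_soups)

# vocabulary_ maps each token to its column index; sort tokens by that index
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(columns)
print(toy_matrix.toarray())
```

Each row has a 1 in the columns of its own tokens and 0 elsewhere, and only the `action` column is set in both rows.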

### Creating similarity matrix

In [246]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(count_matrix, count_matrix) 
print(similarity.shape)

(4803, 4803)
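
Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal hand-rolled sketch (the function `cosine` and the toy vectors are illustrative, and scikit-learn's implementation is vectorised rather than written like this) shows how a score arises from shared tokens. It assumes two soups of ten distinct tokens each, seven of which are shared:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# 13 vocabulary columns: each movie uses 10 of them, 7 are common to both
movie_a = [1] * 10 + [0] * 3
movie_b = [0] * 3 + [1] * 10

score = cosine(movie_a, movie_b)
print(round(score, 4))  # 0.7
```

For binary vectors of equal size the score is simply the fraction of shared tokens, which makes the values in the similarity matrix easy to interpret.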


### Checking the results

To look up similarity scores by movie title, let us build a series that maps each title to its row position in the movie dataframe

In [252]:
movies_df = movies_df.reset_index()
indices = pd.Series(movies_df.index, index=movies_df['original_title']).drop_duplicates()

In [254]:
indices.head()

original_title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
dtype: int64

In [255]:
indices['Avatar']

0

In [257]:
indices['The Dark Knight Rises']

3
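
One detail worth knowing about the lookup series: `Series.drop_duplicates()` removes repeated values, not repeated index labels, so it leaves a series of unique row positions unchanged. Whether the dataset actually contains repeated titles is not checked here; the sketch below only demonstrates the behaviour on a made-up series (`toy_indices` is an illustrative name) and one possible way to keep the first occurrence of each title.

```python
import pandas as pd

# Made-up lookup series in which the title 'Batman' appears twice
toy_indices = pd.Series([0, 1, 2], index=["Batman", "Spectre", "Batman"])

# The values 0, 1, 2 are all unique, so nothing is removed
after_drop = toy_indices.drop_duplicates()
print(len(after_drop))

# A repeated label returns a Series instead of a single position
print(len(toy_indices["Batman"]))

# Keeping only the first occurrence of each label gives a scalar lookup
first_only = toy_indices[~toy_indices.index.duplicated(keep="first")]
print(first_only["Batman"])
```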

Creating a dataframe using indices series

In [344]:
indices_df = pd.DataFrame(indices, columns= ['index']).reset_index()

Let us find the cosine similarity for 'The Dark Knight Rises'

In [271]:
list(enumerate(similarity[3]))[:10]

[(0, 0.1),
 (1, 0.1),
 (2, 0.2),
 (3, 0.9999999999999999),
 (4, 0.1),
 (5, 0.1),
 (6, 0.0),
 (7, 0.1),
 (8, 0.0),
 (9, 0.2)]

Making sorted list in descending order

In [283]:
sorted(list(enumerate(similarity[3])), key=lambda x: x[1], reverse=True)[1:11]

[(65, 0.7),
 (119, 0.7),
 (4638, 0.5477225575051663),
 (1196, 0.4),
 (3073, 0.4),
 (3326, 0.3585685828003181),
 (1503, 0.33541019662496846),
 (1986, 0.33541019662496846),
 (303, 0.31622776601683794),
 (747, 0.31622776601683794)]

It is observed that the movie at row index 65 is the most similar movie

In [302]:
indices.loc[lambda x : (x ==65 )] 

original_title
The Dark Knight    65
dtype: int64

The second most similar movie is

In [335]:
indices.loc[lambda x: x == 119]

original_title
Batman Begins    119
dtype: int64

### Creating a production ready output

Creating a function to create a list of 10 recommendations based on similarity score

In [358]:
def recommend(movie_title):
    movie_list = []
    #find the index from title
    index = indices[movie_title]
    # find similar movies
    top_10_movies = sorted(list(enumerate(similarity[index])), key=lambda x: x[1], reverse=True)[1:11]
    
    for i in top_10_movies:
        movie_list.append( indices_df['original_title'][(indices_df['index'] == i[0])].values[0] )
    print(movie_list)    

In [359]:
recommend("Avatar")

['Clash of the Titans', 'The Mummy: Tomb of the Dragon Emperor', '西游记之孙悟空三打白骨精', "The Sorcerer's Apprentice", 'G-Force', '4: Rise of the Silver Surfer', 'The Time Machine', 'The Scorpion King', "Pirates of the Caribbean: At World's End", 'Spider-Man 3']


In [360]:
recommend("Superman Returns")

['Man of Steel', 'Superman', 'Superman II', 'Batman v Superman: Dawn of Justice', 'X-Men: Days of Future Past', 'Superman III', "The Warrior's Way", 'Superman IV: The Quest for Peace', 'The Mummy: Tomb of the Dragon Emperor', '西游记之孙悟空三打白骨精']


In [362]:
recommend("The Prestige")

['Fabled', 'The Dark Knight Rises', 'Batman Begins', 'The Village', 'Stir of Echoes', 'Spider', 'The Statement', 'Exorcist: The Beginning', 'Goddess of Love', 'Amnesiac']
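
For reuse outside the notebook it may be more convenient to return the list instead of printing it, and to handle titles that are not in the data. The following is one possible variant under stated assumptions, not the notebook's own function: `recommend_titles`, `toy_titles` and `toy_similarity` are hypothetical names, and the 4x4 matrix stands in for the real similarity matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the title column and `similarity`
toy_titles = pd.Series(["A", "B", "C", "D"])
toy_similarity = np.array([
    [1.0, 0.7, 0.1, 0.4],
    [0.7, 1.0, 0.0, 0.2],
    [0.1, 0.0, 1.0, 0.3],
    [0.4, 0.2, 0.3, 1.0],
])

def recommend_titles(title, titles, sim, top_n=10):
    """Return up to top_n titles most similar to `title`, or [] if it is unknown."""
    matches = np.flatnonzero(titles.to_numpy() == title)
    if matches.size == 0:
        return []
    row = matches[0]                               # first movie with that title
    order = np.argsort(-sim[row], kind="stable")   # highest score first
    order = order[order != row][:top_n]            # drop the movie itself by position
    return titles.iloc[order].tolist()

print(recommend_titles("A", toy_titles, toy_similarity, top_n=2))  # ['B', 'D']
print(recommend_titles("Unknown", toy_titles, toy_similarity))     # []
```

Removing the query movie by position rather than by slicing off the first result avoids dropping the wrong entry when another movie has an identical soup and therefore the same top score.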
