# Content-Based Recommendation System

### Content-based filtering methods are based on a description of the item and a profile of the user's preferences. These methods are best suited to situations where there is known data on an item (name, location, description, etc.) but not on the user. Content-based recommenders treat recommendation as a user-specific classification problem and learn a classifier for the user's likes and dislikes based on product features.

![cbrsexa](https://user-images.githubusercontent.com/19235560/61588820-2c30a700-abbf-11e9-8d3c-cc64402e618d.png)

Post your doubts and feedback on our FB group <a href='https://www.facebook.com/groups/colearninglounge/learning_content/?filter=870558249994697'>here</a>.

### Table of contents
* Introduction
* Types of content-based recommender
* Document vectors
   1. CountVectorizer
   2. TfidfVectorizer
* Plot description-based recommender
* Creating the TF-IDF matrix
* Cosine similarity
* Building the recommender function for the plot-based recommender
* Metadata-based recommender
* Wrangling keywords, cast, and crew
* Creating the metadata soup
* Summary


### Introduction

In this tutorial, we cover the basics of content-based recommendation systems and their types. We then move on to document vectorization and cosine similarity, and use them to build a recommender function for a plot-based recommender. Finally, we build a metadata-based recommender: we wrangle the data, build a metadata soup, and reuse the recommender function on it.

### We are going to build two types of content-based recommender:
### 1 Plot description-based recommender
### 2 Metadata-based recommender

In [1]:
# importing pandas and numpy library
import pandas as pd
import numpy as np

In [2]:
# importing metadata dataset
df_clean=pd.read_csv("metadata_clean.csv")
# print the head of metadata
df_clean.head(5)

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995


### [Document vectors](https://en.wikipedia.org/wiki/Vector_space_model)

To compare documents numerically, we first convert each one into a vector.
The two most popular vectorizers for this are:
###### 1 CountVectorizer: The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

![count-vector](https://user-images.githubusercontent.com/19235560/61589263-3190f000-abc5-11e9-9626-a953068db567.png)
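
As a small illustration of the bag-of-words idea, the sketch below runs scikit-learn's `CountVectorizer` on two made-up sentences. The sentences are assumptions for this example only and are not taken from the movie dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents (not from the movie dataset)
docs = ["the lion sleeps", "the lion roars at the lion"]

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column; sort the words by column position
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = count_matrix.toarray().tolist()

print(vocab)   # ['at', 'lion', 'roars', 'sleeps', 'the']
print(counts)  # [[0, 1, 0, 1, 1], [1, 2, 1, 0, 2]]
```

Each row is one document and each column is one vocabulary word, so the second row records that "lion" and "the" both occur twice in the second sentence.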

###### 2 TfidfVectorizer: [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is an abbreviation for Term Frequency-Inverse Document Frequency and is a very common way to transform text into a meaningful numeric representation.


TfidfVectorizer takes into account that a word occurring in many documents says less about any one of them, and assigns a weight to each word according to the following formula. For every word i in document j, the following applies:

![tfidf](https://user-images.githubusercontent.com/19235560/61589223-8aac5400-abc4-11e9-8c6a-122de177955f.png)

In this formula, the following is true:
* $w_{i,j}$ is the weight of word i in document j
* $df_i$ is the number of documents that contain the term i
* $N$ is the total number of documents
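
To make the weights concrete, here is a minimal sketch in plain Python. It assumes the pictured formula is the common textbook form, weight = tf × log(N / df), where tf is the number of times the word occurs in the document. The three tiny documents are made up. Note that scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three made-up documents
docs = ["lion king lion", "lion cub", "toy story"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# df[word] = number of documents that contain the word
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_weight(word, tokens):
    """Weight of `word` in one document, assuming w = tf * log(N / df)."""
    tf = tokens.count(word)
    return tf * math.log(N / df[word])

w_lion = tfidf_weight("lion", tokenized[0])  # 2 * log(3 / 2)
w_king = tfidf_weight("king", tokenized[0])  # 1 * log(3 / 1)
print(round(w_lion, 3), round(w_king, 3))    # 0.811 1.099
```

Although "lion" occurs twice in the first document, it also appears in another document, so it ends up with a lower weight than "king", which appears only once but in no other document.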


### Plot description-based recommender

######  Our plot description-based recommender will take in a movie title as an argument and output a list of movies that are most similar based on their plots. These are the steps we are going to perform in building this model:
###### 1 Obtain the data required to build the model
###### 2 Create TF-IDF vectors for the plot description (or overview) of every movie
###### 3 Compute the pairwise cosine similarity score of every movie
###### 4 Write the recommender function that takes in a movie title as an argument and outputs movies most similar                   to it based on the plot

In [3]:
# importing movies_metadata
df_meta=pd.read_csv("movies_metadata.csv")
df_meta.head()


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [4]:
df_clean['overview'], df_clean['id'] = df_meta['overview'], df_meta['id']

df_clean.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862


In [5]:
df_clean.shape

(45466, 8)

The DataFrame df_clean has 45466 rows and 8 columns. To reduce the computation required, we keep only the first 15000 rows.

In [6]:
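# keep the first 15000 rows; .copy() avoids a SettingWithCopyWarning when columns are modified later
df_clean=df_clean.iloc[:15000].copy()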

In [7]:
df_clean.shape

(15000, 8)

### Creating the TF-IDF matrix


In [8]:
# import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# define a TfidfVectorizer object that removes all English stop words
tfidf=TfidfVectorizer(stop_words="english")
# replace missing overviews with an empty string
df_clean["overview"]=df_clean["overview"].fillna('')
# construct the TF-IDF matrix by applying the fit_transform method to the overview feature
tfidf_matrix=tfidf.fit_transform(df_clean["overview"])


In [9]:
tfidf_matrix.shape

(15000, 40226)

## [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)

#### It is the cosine of the angle between two vectors. For example, if the vectors of movie1 and movie2 are 45° apart, then similarity(movie1, movie2) = cos 45°, which is around 0.7. A cosine similarity of 1 denotes the highest similarity, while a value of zero denotes no similarity.

![cosinesimi](https://user-images.githubusercontent.com/19235560/61589020-107ad000-abc2-11e9-8243-62cf62501a6f.png)

###### The next step is to calculate the pairwise cosine similarity score of every movie. In other words, we are going to create a 15000 × 15000 matrix, where the cell in the ith row and jth column represents the similarity score between movies i and j. We can easily see that this matrix is symmetric in nature and every element in the diagonal is 1, since it is the similarity score of the movie with itself.
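
The properties of this matrix can be checked on a tiny example. The sketch below uses three made-up vectors (not real movie data) and NumPy only: each row is divided by its length, and the pairwise dot products of the resulting unit vectors are the cosine similarities.

```python
import numpy as np

# Three made-up, non-negative "document vectors", one per row
vectors = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
])

# Divide each row by its length, then take all pairwise dot products
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = unit @ unit.T

print(np.round(sim, 3))
```

The first two vectors are 45° apart, so their score is about 0.707. The third vector shares no component with the others, so its scores against them are 0. The matrix is symmetric and has ones on the diagonal.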

In [10]:
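# import linear_kernel to calculate dot products
from sklearn.metrics.pairwise import linear_kernel
# compute the cosine similarity matrix; TfidfVectorizer L2-normalizes each row by default,
# so the plain dot product already equals the cosine similarity
cosine_sim=linear_kernel(tfidf_matrix,tfidf_matrix)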

In [12]:
cosine_sim

array([[1.        , 0.01641771, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.01641771, 1.        , 0.04875137, ..., 0.        , 0.        ,
        0.00326425],
       [0.        , 0.04875137, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.00859075,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.00859075, 1.        ,
        0.        ],
       [0.        , 0.00326425, 0.        , ..., 0.        , 0.        ,
        1.        ]])
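
A plain dot product (`linear_kernel`) can stand in for cosine similarity here because `TfidfVectorizer` scales every row to unit length by default (`norm='l2'`). The sketch below checks this on three made-up overviews; the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up overviews (not taken from the dataset)
toy_overviews = [
    "a young lion prince flees his kingdom",
    "a lion cub grows up in the wild",
    "toys come to life when humans leave",
]
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_overviews)

# Length of every row vector; each should be 1 with the default norm='l2'
row_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1)).ravel())

# With unit-length rows, the dot product and the cosine similarity coincide
same = np.allclose(
    linear_kernel(toy_tfidf, toy_tfidf),
    cosine_similarity(toy_tfidf, toy_tfidf),
)
print(row_norms, same)
```

`linear_kernel` skips the normalization step that `cosine_similarity` performs, which is why it is the cheaper choice for TF-IDF vectors.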

### Building the recommender function

The final step is to create our recommender function. However, before we do that, let's create a reverse mapping of movie titles and their respective indices. In other words, let's create a pandas series with the index as the movie title and the value as the corresponding index in the main DataFrame:

In [12]:
#Construct a reverse mapping of movie titles to indices, keeping only the first entry of any duplicate title
indices = pd.Series(df_clean.index, index=df_clean['title'])
indices = indices[~indices.index.duplicated(keep='first')]

In [13]:
indices.head()

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
dtype: int64
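
One thing to watch with a title-to-index mapping is duplicate titles. The sketch below uses a hypothetical three-row catalogue to show that `Series.drop_duplicates()` compares the values (the row positions), not the titles, and that looking up a repeated title returns a Series rather than a single index. Filtering on `index.duplicated` is one way to keep a single entry per title.

```python
import pandas as pd

# Hypothetical mini catalogue in which one title appears twice
toy = pd.DataFrame({"title": ["Hamlet", "Jumanji", "Hamlet"]})
toy_indices = pd.Series(toy.index, index=toy["title"])

# drop_duplicates() looks at the values 0, 1, 2, so nothing is removed
after_value_dedup = toy_indices.drop_duplicates()

# Filtering on the index keeps only the first row of each title
first_per_title = toy_indices[~toy_indices.index.duplicated(keep="first")]

print(len(after_value_dedup))            # 3
print(type(toy_indices["Hamlet"]))       # a Series, because the title is ambiguous
print(first_per_title["Hamlet"])         # 0
```

Keeping the first occurrence is an arbitrary choice made for this sketch; a real system might instead disambiguate titles by release year.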

#### We will perform the following steps in building the recommender function:

###### 1 Declare the title of the movie as an argument.
###### 2 Obtain the index of the movie from the indices reverse mapping.
###### 3 Get the list of cosine similarity scores for that particular movie with all movies using cosine_sim. Convert this into a list of tuples where the first element is the position and the second is the similarity score.
###### 4 Sort this list of tuples on the basis of the cosine similarity scores.
###### 5 Get the top 10 elements of this list. Ignore the first element as it refers to the similarity score with itself (the movie most similar to a particular movie is obviously the movie itself).
###### 6 Return the titles corresponding to the indices of the top 10 elements, excluding the first:
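
The steps above can be traced on a toy example to see the intermediate values. The titles and the similarity matrix below are hand-written assumptions, not real scores, and only the top 2 results are kept because the toy catalogue has just four entries.

```python
# Hypothetical titles and a hand-written similarity matrix (not real scores)
titles = ["A", "B", "C", "D"]
toy_sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
]
title_to_idx = {t: i for i, t in enumerate(titles)}

idx = title_to_idx["A"]                                  # step 2: title -> index
pairs = list(enumerate(toy_sim[idx]))                    # step 3: (position, score) tuples
pairs = sorted(pairs, key=lambda p: p[1], reverse=True)  # step 4: highest score first
top = pairs[1:3]                                         # step 5: skip the movie itself
recommended = [titles[i] for i, _ in top]                # step 6: positions -> titles

print(pairs)        # [(0, 1.0), (2, 0.7), (1, 0.2), (3, 0.0)]
print(recommended)  # ['C', 'B']
```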

In [14]:
# Function that takes in movie title as input and gives recommendations 
def content_recommender(title, cosine_sim=cosine_sim, df_clean=df_clean, indices=indices):
    # Obtain the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    # And convert it into a list of tuples as described above
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the movies based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies. Ignore the first movie.
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df_clean['title'].iloc[movie_indices]

Congratulations! We've built our very first content-based recommender. Now it is time to see our recommender in action! Let's ask it for recommendations of movies similar to The Lion King:

In [15]:
#Get recommendations for The Lion King
content_recommender("The Lion King")

9353                         The Lion King 1½
9115           The Lion King 2: Simba's Pride
6094                                Born Free
3203                         The Waiting Game
14402    Michael Jackson: Life of a Superstar
6574            Once Upon a Time in China III
3293                                 The Bear
2779                    Napoleon and Samantha
11507                     David and Bathsheba
892                          The Wizard of Oz
Name: title, dtype: object

We see that our recommender has suggested all of The Lion King's sequels in its top-10 list. We also notice that most of the movies in the list have to do with lions.
It goes without saying that a person who loves The Lion King is very likely to have a thing for Disney movies. They may also prefer to watch animated movies. Unfortunately, our plot description recommender isn't able to capture all this information.

Therefore, in the next section, we will build a recommender that uses more advanced metadata, such as genres, cast, crew, and keywords (or sub-genres). This recommender will be able to do a much better job of identifying an individual's taste for a particular director, actor, sub-genre, and so on.

### Metadata based recommender

We will largely follow the same steps as the plot description-based recommender to build our metadata-based model. The main difference, of course, is in the type of data we use to build the model.

###### To build this model, we will be using the following metadata:
###### The genre of the movie.
###### The director of the movie. This person is part of the crew.
###### The movie's three major stars. They are part of the cast.
###### Sub-genres or keywords.

With the exception of genres, our DataFrames (both original and cleaned) do not contain the data that we require. Therefore, we will need to import two additional files: credits.csv, which contains information on the cast and crew of the movies, and keywords.csv, which contains information on the sub-genres.

In [17]:
# contains information on the cast and crew of the movies 
cred_df=pd.read_csv("credits.csv")
cred_df.head()
#df=pd.read_csv("metadata_clean.csv")

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [18]:
# contains information on the sub-genres
key_df=pd.read_csv("keywords.csv")
key_df.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [19]:
# Function to convert all non-integer IDs to NaN
def clean_ids(x):
    try:
        return int(x)
    except (ValueError, TypeError, OverflowError):
        return np.nan

#Clean the ids of df_clean
df_clean['id'] = df_clean['id'].apply(clean_ids)

#Filter all rows that have a null ID (.copy() avoids a SettingWithCopyWarning below)
df_clean = df_clean[df_clean['id'].notnull()].copy()
# Convert IDs into integer
df_clean['id'] = df_clean['id'].astype('int')
key_df['id'] = key_df['id'].astype('int')
cred_df['id'] = cred_df['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
df_clean = df_clean.merge(cred_df, on='id')
df_clean = df_clean.merge(key_df, on='id')

#Display the head of the merged df
df_clean.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id,cast,crew,keywords
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."
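
It is worth knowing how `merge` behaves before relying on the merged row count. By default it performs an inner join, so a movie without a matching id is dropped, and an id that appears more than once on the right-hand side produces repeated rows. The two small frames below are made up to show both effects; whether the real credits.csv contains repeated ids is not established here.

```python
import pandas as pd

# Made-up frames: id 2 has no credits, id 3 is listed twice
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits = pd.DataFrame({"id": [1, 3, 3], "cast": ["x", "y", "y"]})

merged = movies.merge(credits, on="id")
print(merged["title"].tolist())   # ['A', 'C', 'C']

# One possible guard: keep a single credits row per id before merging
deduped = movies.merge(credits.drop_duplicates(subset="id"), on="id")
print(deduped["title"].tolist())  # ['A', 'C']
```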


In [20]:
df_meta.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [21]:
#Only keep those features that we require 
df_clean = df_clean[['title','genres','runtime','vote_average','vote_count',"overview","id","cast","crew","keywords"]]

df_clean.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,overview,id,cast,crew,keywords
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,"Led by Woody, Andy's toys live happily in his ...",862,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,When siblings Judy and Peter discover an encha...,8844,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,A family wedding reignites the ancient feud be...,15602,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,"Cheated on, mistreated and stepped on, the wom...",31357,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,Just when George Banks has recovered from his ...,11862,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


## Wrangling keywords, cast, and crew

Convert keywords into a list of strings where each string is a keyword (similar to genres). We will include only the top three keywords. Therefore, this list can have a maximum of three elements.

Convert cast into a list of strings where each string is a star. Like keywords, we will only include the top three stars in our cast. 

Convert crew into director. In other words, we will extract only the director of the movie and ignore all other crew members.

The first step is to convert these stringified objects into native Python objects:

In [22]:
# Convert the stringified objects into the native python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df_clean[feature] = df_clean[feature].apply(literal_eval)
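
To see what this conversion does, the sketch below applies `literal_eval` to a made-up string shaped like a cell of the keywords column. The `safe_literal_eval` helper is hypothetical and not part of the tutorial's code; it shows one way to tolerate cells that cannot be parsed, such as missing values.

```python
from ast import literal_eval

# Made-up value in the same shape as the keywords column (not a real row)
raw = "[{'id': 1, 'name': 'alpha'}, {'id': 2, 'name': 'beta'}]"
parsed = literal_eval(raw)
print(type(raw).__name__, "->", type(parsed).__name__)  # str -> list
print(parsed[0]["name"])                                # alpha

def safe_literal_eval(value):
    """Hypothetical helper: return [] when the cell cannot be parsed."""
    try:
        return literal_eval(value)
    except (ValueError, SyntaxError):
        return []

print(safe_literal_eval(float("nan")))  # []
```

Unlike `eval`, `literal_eval` only accepts Python literals (strings, numbers, lists, dicts, and so on), so it does not execute arbitrary code found in the CSV.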

Next, let's extract the director from our crew list. To do this, we will first examine the structure of the dictionary in the crew list:

In [23]:
#Print the first crew member of the first movie in df_clean
df_clean.iloc[0]['crew'][0]


{'credit_id': '52fe4284c3a36847f8024f49',
 'department': 'Directing',
 'gender': 2,
 'id': 7879,
 'job': 'Director',
 'name': 'John Lasseter',
 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}

We see that this dictionary consists of job and name keys. Since we're only interested in the director, we will loop through all the crew members in a particular list and extract the name when the job is Director. Let's write a function that does this:

In [24]:
# Extract the director's name. If director is not listed, return NaN
def get_director(x):
    for crew_member in x:
        if crew_member["job"]=="Director":
            return crew_member["name"]
    return np.nan

Now that we have the get_director function, we can define the new director feature:

In [25]:
#Define the new director feature
df_clean["Director"]=df_clean["crew"].apply(get_director)

In [26]:
df_clean["Director"].head()

0      John Lasseter
1       Joe Johnston
2      Howard Deutch
3    Forest Whitaker
4      Charles Shyer
Name: Director, dtype: object


Both keywords and cast are lists of dictionaries as well. In both cases, we need to extract the top three name attributes of each list. Therefore, we can write a single function to wrangle both these features. Also, just like keywords and cast, we will only consider the top three genres for every movie:

In [27]:
# Returns the first 3 elements of the list, or the entire list if it has fewer than 3.
def generate_list(x):
    if isinstance(x, list):
        names = [ele['name'] for ele in x]
        #Check if more than 3 elements exist. If yes, return only first three. 
        #If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []


We will use this function to wrangle our cast and keywords features. We will also only consider the first three genres listed:

In [28]:
#Apply the generate_list function to cast and keywords
df_clean['cast'] = df_clean['cast'].apply(generate_list)
df_clean['keywords'] = df_clean['keywords'].apply(generate_list)

In [29]:
#Only consider a maximum of 3 genres
df_clean['genres'] = df_clean['genres'].apply(lambda x: x[:3])

Let's now take a look at a sample of our wrangled data:

In [30]:
# Print the new features of the first 5 movies along with title
df_clean[['title', 'cast', 'Director', 'keywords', 'genres']].head(5)

Unnamed: 0,title,cast,Director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[animation, comedy, family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[adventure, fantasy, family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[romance, comedy]"
3,Waiting to Exhale,"[Whitney Houston, Angela Bassett, Loretta Devine]",Forest Whitaker,"[based on novel, interracial relationship, sin...","[comedy, drama, romance]"
4,Father of the Bride Part II,"[Steve Martin, Diane Keaton, Martin Short]",Charles Shyer,"[baby, midlife crisis, confidence]",[comedy]


In [31]:
# Function to sanitize data to prevent ambiguity. 
# Removes spaces and converts to lowercase
def sanitize(x):
    if isinstance(x, list):
        #Strip spaces and convert to lowercase
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
#Apply the sanitize function to cast, keywords, director and genres
for feature in ['cast', 'Director', 'genres', 'keywords']:
    df_clean[feature] = df_clean[feature].apply(sanitize)
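
Removing spaces matters because the vectorizer splits text on whitespace. Without it, two different people who share a first name would also share a token and look partly similar. The sketch below illustrates this with a small helper, `tokens`, which is an assumption made for this example and not part of the tutorial's code.

```python
def tokens(names, strip_spaces):
    """Split a list of names into the word tokens a vectorizer would see."""
    out = set()
    for name in names:
        name = name.lower()
        if strip_spaces:
            name = name.replace(" ", "")
        out.update(name.split())
    return out

cast_a = ["Ryan Reynolds"]
cast_b = ["Ryan Gosling"]

shared_raw = tokens(cast_a, strip_spaces=False) & tokens(cast_b, strip_spaces=False)
shared_clean = tokens(cast_a, strip_spaces=True) & tokens(cast_b, strip_spaces=True)

print(shared_raw)    # {'ryan'}
print(shared_clean)  # set()
```

After sanitizing, "ryanreynolds" and "ryangosling" are distinct tokens, so the two casts no longer overlap.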

## Creating the metadata soup

What we need to do is create a soup that contains the actors, director, keywords, and genres. This way, we can feed this soup into our vectorizer and perform similar follow-up steps to before:

In [33]:
#Function that creates a soup out of the desired metadata
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['Director'] + ' ' + ' '.join(x['genres'])

In [34]:
# Create the new soup feature
df_clean['soup'] = df_clean.apply(create_soup, axis=1)

In [36]:
df_clean["soup"][1]

"boardgame disappearance basedonchildren'sbook robinwilliams jonathanhyde kirstendunst joejohnston adventure fantasy family"

In [39]:
# display the soup of the third movie
df_clean.iloc[2]["soup"]

'fishing bestfriend duringcreditsstinger waltermatthau jacklemmon ann-margret howarddeutch romance comedy'

With the soup created, we are now in a good position to create our document vectors, compute similarity scores, and build the metadata-based recommender function.

We keep only the first 10000 instances to reduce the computation required.

In [38]:
# .copy() avoids a SettingWithCopyWarning when the soup column is modified later
df_clean=df_clean[:10000].copy()

In [39]:
df_clean.shape

(10000, 12)

Instead of using **TfidfVectorizer**, we will be using **CountVectorizer**. This is because TfidfVectorizer would give less weight to actors and directors who have acted in or directed a relatively large number of movies, which is not what we want here.
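
The down-weighting effect can be observed directly through the fitted `idf_` attribute of `TfidfVectorizer`. The four soups below are made up: the token `prolificactor` appears in three of them and `rareactor` in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up soups: "prolificactor" appears in three of four, "rareactor" in one
toy_soups = [
    "prolificactor comedy",
    "prolificactor drama",
    "prolificactor thriller",
    "rareactor comedy",
]
toy_vec = TfidfVectorizer()
toy_vec.fit(toy_soups)

# idf_ holds one inverse-document-frequency weight per vocabulary column
idf_prolific = toy_vec.idf_[toy_vec.vocabulary_["prolificactor"]]
idf_rare = toy_vec.idf_[toy_vec.vocabulary_["rareactor"]]
print(idf_prolific < idf_rare)  # True
```

With raw counts, both actors contribute equally to a match, which suits metadata where sharing a well-known actor is just as meaningful as sharing an obscure one.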

In [40]:
from sklearn.feature_extraction.text import CountVectorizer
#Define a new CountVectorizer object and create vectors for the soup
count = CountVectorizer(stop_words='english')
df_clean["soup"]=df_clean["soup"].fillna("")
count_matrix = count.fit_transform(df_clean['soup'])
count_matrix

<10000x20330 sparse matrix of type '<class 'numpy.int64'>'
	with 85395 stored elements in Compressed Sparse Row format>

In [None]:
#Import cosine_similarity function
from sklearn.metrics.pairwise import cosine_similarity

#Compute the cosine similarity scores (count vectors are not normalized, so we use cosine_similarity rather than a plain dot product)
cosine_sim2 = cosine_similarity(count_matrix,count_matrix)

In [44]:
# Reset index of your df and construct reverse mapping again
df_clean = df_clean.reset_index()
indices2 = pd.Series(df_clean.index, index=df_clean['title'])

With the new reverse mapping constructed and the similarity scores computed, we can reuse the content_recommender function by passing in cosine_sim2 as an argument. Let's now try out our new model by asking for recommendations for a few movies:

In [45]:
content_recommender('Jumanji',cosine_sim2, df_clean, indices2)

552                      The Pagemaster
1996                       Return to Oz
2298             Santa Claus: The Movie
3229                 The Legend of Lobo
3505           The Slipper and the Rose
59           The Indian in the Cupboard
902                    The Wizard of Oz
1047    Aladdin and the King of Thieves
1168                 The Princess Bride
1823                     Small Soldiers
Name: title, dtype: object

In [146]:
content_recommender('Father of the Bride Part II', cosine_sim2, df_clean, indices2)

3969                                            Baby Boom
6826                                  Father of the Bride
2377                                       ¡Three Amigos!
3196                                           Hanging Up
9201    Das merkwürdige Verhalten geschlechtsreifer Gr...
7428                                       The Lonely Guy
183                                           Nine Months
202                                       Unstrung Heroes
813                                  The First Wives Club
1137                                      Mina Tannenbaum
Name: title, dtype: object

In [148]:
content_recommender('Baby Boom', cosine_sim2, df_clean, indices2)

4       Father of the Bride Part II
252                          Junior
9993       With Six You Get Eggroll
4766                    On the Line
5423    Children on Their Birthdays
6826            Father of the Bride
202                 Unstrung Heroes
2001                 Son of Flubber
2405               The Other Sister
8022         How I Got Into College
Name: title, dtype: object

# Summary

We have come a long way in this notebook. We first learned about document vectors and gained a brief introduction to the cosine similarity score. Next, we built a recommender that identified movies with similar plot descriptions. We then proceeded to build a more advanced model that leveraged the power of other metadata, such as genres, keywords, and credits.
With this, we come to the end of our tour of content-based recommendation systems. In the next notebook, we will cover what is arguably the most popular recommendation model in the industry today: collaborative filtering.

>This tutorial is intended to be a public resource. As such, if you see any glaring inaccuracies or if a critical topic is missing, please feel free to point it out or (preferably) submit a pull request to improve the tutorial. Also, we are always looking to improve the scope of this article. For anything else, feel free to mail us at colearninglounge@gmail.com

>The author of this article is **Sajid Ali**. You can contact him on [GitHub](https://github.com/sajidmeo), [LinkedIn](https://www.linkedin.com/in/sajidalimeo), [Facebook](https://www.facebook.com/meosajidali.balot), [Instagram](https://www.instagram.com/sajidali.ai/)