# Movie Recommender System on TMDB Dataset

Some common types of recommendation systems:

1. **Content-based filtering**: This approach analyzes item characteristics and user profiles to provide recommendations. It suggests items similar to the ones users have shown interest in, based on shared attributes such as genre or keywords.

2. **Collaborative filtering**: By identifying patterns among users or items, collaborative filtering generates recommendations. It looks for similarities between users’ preferences or item ratings to suggest items that others with similar tastes have enjoyed.

3. **Hybrid recommender systems**: These systems combine multiple techniques to enhance recommendation accuracy and diversity. They leverage the strengths of different approaches, such as content-based and collaborative filtering, to provide more personalized suggestions.

4. **Knowledge-based systems**: Using explicit knowledge and rules, knowledge-based systems make recommendations based on specific criteria or constraints. They incorporate domain knowledge to suggest items that align with predefined rules or user preferences.

5. **Popularity-based systems**: These systems rely on the overall popularity of products or items within a particular demographic. Recommendations are based on what is trending or frequently purchased by others.

We will focus on building a **content-based recommender engine** for movies, where recommendations are based on the type of movies users search for.

Dataset: TMDB 5000 dataset

This dataset contains information about movies, including their titles, genres, keywords, and user ratings. There are two CSV files: one with the credits (cast and crew) and one with the movie details.

## Feature Engineering

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df_credits = pd.read_csv('tmdb_5000_credits.csv')
df_movies = pd.read_csv('tmdb_5000_movies.csv')

In [3]:
df_credits.shape, df_movies.shape

((4803, 4), (4803, 20))

In [4]:
df_credits.head(4)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."


In [5]:
df_movies.head(4)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106


In [6]:
# Count the rows where id in df_movies differs from movie_id in df_credits (0 means they match row by row)
(df_credits.movie_id != df_movies.id).sum()

0

In [7]:
# Rename movie_id in df_credits to id so that both dataframes share the merge key
df_credits.rename(columns = {'movie_id': 'id'}, inplace=True)

In [8]:
# Merge the two dataframes
df = df_credits.merge(df_movies, on = 'id')

In [9]:
df.head(4)

Unnamed: 0,id,title_x,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title_y,vote_average,vote_count
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,...,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106


In [10]:
# Let's take a look at the data types of the columns
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    4803 non-null   int64  
 1   title_x               4803 non-null   object 
 2   cast                  4803 non-null   object 
 3   crew                  4803 non-null   object 
 4   budget                4803 non-null   int64  
 5   genres                4803 non-null   object 
 6   homepage              1712 non-null   object 
 7   keywords              4803 non-null   object 
 8   original_language     4803 non-null   object 
 9   original_title        4803 non-null   object 
 10  overview              4800 non-null   object 
 11  popularity            4803 non-null   float64
 12  production_companies  4803 non-null   object 
 13  production_countries  4803 non-null   object 
 14  release_date          4802 non-null   object 
 15  revenue              

In [11]:
df.isna().sum()

id                         0
title_x                    0
cast                       0
crew                       0
budget                     0
genres                     0
homepage                3091
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title_y                    0
vote_average               0
vote_count                 0
dtype: int64

By combining elements such as the movie title, genres, overview, and cast/crew (if we want to search based on individuals), we can build a single text representation of each movie that lets us find similar movies effectively.

Since we are using the overview column and it has 3 null values, let's drop those rows first.

In [12]:
# drop null overviews
df.dropna(subset = ['overview'], inplace=True)

In [16]:
# filter out target columns
df = df[['id', 'title_x', 'genres', 'overview', 'cast', 'crew','vote_average']]

In [17]:
# check new df info
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            4800 non-null   int64  
 1   title_x       4800 non-null   object 
 2   genres        4800 non-null   object 
 3   overview      4800 non-null   object 
 4   cast          4800 non-null   object 
 5   crew          4800 non-null   object 
 6   vote_average  4800 non-null   float64
dtypes: float64(1), int64(1), object(5)
memory usage: 429.0+ KB


In [18]:
df.head()

Unnamed: 0,id,title_x,genres,overview,cast,crew,vote_average
0,19995,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",7.2
1,285,Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",6.9
2,206647,Spectre,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",A cryptic message from Bond’s past sends him o...,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",6.3
3,49026,The Dark Knight Rises,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",Following the death of District Attorney Harve...,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",7.6
4,49529,John Carter,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","John Carter is a war-weary, former military ca...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",6.1


The **overview** and **title** columns contain simple string values, which makes them straightforward to handle.
On the other hand, the **genres, cast, and crew** columns share a similar structure: each value is a string encoding a list of dictionaries.

To extract the desired values from each dictionary, we need a specific approach for each column. For genres, we include all genres associated with each movie as relevant tags.

However, including the entire cast list would add noise, so we will only add the top 4 cast members (you can vary this number). Similarly, for the crew, we will focus on directors and producers.

In [19]:
df.genres[0] # we will include all the "name" tags from Genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [20]:
' '.join([i['name'] for i in eval(df.genres[0])])

'Action Adventure Fantasy Science Fiction'

In [21]:
df.cast[0]  # We will include the "name" of the first 4 cast members (you can vary this number)

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

In [39]:
' '.join([i['name'] for i in eval(df.cast[0])[:4]])

'Sam Worthington Zoe Saldana Sigourney Weaver Stephen Lang'

In [36]:
df.crew[0]  # We will consider the "name" of entries whose "job" is "Director" or "Producer"

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [22]:
' '.join(list(set([i['name'] for i in eval(df.crew[0]) if (i['job'] == 'Director' or i['job'] == 'Producer')])))

'Jon Landau James Cameron'

In [23]:
# Function to fetch the desired details from the columns
def get_required_details(overview, genres, cast, crew):
    corpus = ""
    genre = ' '.join([i['name'] for i in eval(genres)])
    cast =  ' '.join([i['name'] for i in eval(cast)[:4]])
    crew = ' '.join(list(set([i['name'] for i in eval(crew) if i['job']=='Director' or i['job']=='Producer'])))
    corpus += overview+ " " + genre + " " + cast + " " + crew
    return corpus
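
The extraction above relies on `eval`, which executes whatever the string contains. Since these columns hold JSON text, a safer variant can parse them with the standard `json` module. The following is only a sketch under that assumption: `names_from_json` is a hypothetical helper (not part of the original notebook), and the sample strings are shortened, illustrative stand-ins for real rows.

```python
import json

def names_from_json(raw, limit=None, jobs=None):
    """Return the "name" values from a JSON-encoded list of dicts.

    Hypothetical helper: `limit` keeps only the first N names,
    `jobs` keeps only records whose "job" is in the given set.
    """
    records = json.loads(raw)  # parses the text without executing it
    if jobs is not None:
        records = [r for r in records if r.get("job") in jobs]
    names = []
    for record in records:
        if record["name"] not in names:  # de-duplicate while keeping order
            names.append(record["name"])
    return names[:limit]

# Shortened, illustrative sample values in the same shape as the dataset columns
genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
crew = ('[{"job": "Editor", "name": "Stephen E. Rivkin"}, '
        '{"job": "Director", "name": "James Cameron"}, '
        '{"job": "Producer", "name": "James Cameron"}, '
        '{"job": "Producer", "name": "Jon Landau"}]')

print(" ".join(names_from_json(genres)))
print(" ".join(names_from_json(crew, jobs={"Director", "Producer"})))
```

Unlike the `set`-based version, this sketch keeps the names in their original order, so the resulting string is the same on every run.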

In [24]:
# Build one text document per movie by iterating over the four columns in parallel
corpus = [
    get_required_details(overview, genres, cast, crew)
    for overview, genres, cast, crew in zip(df.overview, df.genres, df.cast, df.crew)
]


In [25]:
len(corpus)

4800

In [26]:
corpus[0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy Science Fiction Sam Worthington Zoe Saldana Sigourney Weaver Stephen Lang Jon Landau James Cameron'

At this point, we can add the newly created corpus to our DataFrame as a column and either retain or drop the original columns; the choice is yours. Here I will drop them to reduce the size of the DataFrame.

In [27]:
# rename title column (reassigning instead of using inplace=True avoids a SettingWithCopyWarning on the column subset)
df = df.rename(columns={'title_x': 'title'})

In [28]:
# drop old columns
df = df.drop(columns=['genres', 'overview', 'cast', 'crew'])

In [29]:
# add corpus
df['corpus'] = corpus


In [30]:
df.head()

Unnamed: 0,id,title,vote_average,corpus
0,19995,Avatar,7.2,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,6.9,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,6.3,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,7.6,Following the death of District Attorney Harve...
4,49529,John Carter,6.1,"John Carter is a war-weary, former military ca..."


## Text Representation and Text Similarity

To enable machine learning algorithms to process text, the text must first be transformed into a mathematical form, typically vectors of numbers. This is known as the **vector space model (VSM)**, in which text units are encoded as numerical vectors.

Establishing relationships between texts is a two-step process:

1. **Text representation**: generate the vectors.
2. **Text similarity**: choose a method for comparing the vectors to measure how similar or different they are.


In the NLP universe, this conversion of raw text to a suitable numerical form is called **text representation**.
Various techniques are used to collect texts, split them into atoms, and transform them into numerical vectors.

To extract the meaningful words/tokens from the corpus, the first step is usually to **break each sentence into lexical units (tokenization)**, followed by the **removal of stopwords** and of any other words or sentences that are irrelevant to the use case. This is usually done with **regular expressions, NLTK, spaCy, or TextVectorization.**

This leaves us with a **list of tokens whose unique entries form the vocabulary**, ready to be converted into numerical vectors.

Before directly jumping into word embedding let’s have a look at some of the **traditional methods of numerical encoding and why they are not good methods for feeding into neural networks.**

**Label Encoding**

   *Assigns each token in the corpus a unique number.* The numbers range up to the size of the vocabulary, and because they are ordered, the model can read an artificial ranking into them: tokens that happen to receive larger numbers appear to matter more, which biases the model. This can lead to poorly performing models and unexpected results, so label encoding is a big NO for feature representation. It is used for target variables/labels instead, hence the name.
    
**One Hot Encoder**

   *Each token in the corpus vocabulary is given a unique integer index between 1 and n, where n is the number of unique tokens in the corpus vocabulary (set of tokens)*. The token is then represented by an n-dimensional vector filled with 0s except at its own index, which is set to 1. Each token therefore occupies its own dimension in n-dimensional space, meaning no two tokens have any similarity between them, irrespective of their context.


**Dummies**
  
   *A more flexible way to convert categorical values into dummy/indicator values*. By default, pandas' get_dummies() produces the one-hot version; to get the dummy version we need to pass drop_first=True. One-hot encoding carries some redundancy, because an n-dimensional one-hot vector can be represented in n-1 dimensions. For example, with three city categories, if the indicators for two of the cities are 0, the third must be 1, so one column can be dropped without losing information.
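
To make the three encodings concrete, here is a minimal sketch using pandas on a made-up `city` column. The city names are purely illustrative, and `dtype=int` is passed only so the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Illustrative categorical data (not from the TMDB dataset)
cities = pd.Series(["Delhi", "Mumbai", "Pune", "Mumbai"], name="city")

# Label encoding: one integer per category (categories are sorted alphabetically)
labels = cities.astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(cities, dtype=int)

# Dummy encoding: the first category is dropped, so a row of all zeros means "Delhi"
dummies = pd.get_dummies(cities, drop_first=True, dtype=int)

print(labels.tolist())
print(one_hot)
print(dummies)
```

The dummy version has one column fewer than the one-hot version but still identifies every city uniquely.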
    
**Why are these not a good fit for neural networks?**

1. They’re discrete representations: the ability to capture relationships between words is lost.
2. The feature vectors are sparse (most values are zero for any vector) and high-dimensional (proportional to the size of the vocabulary).
3. They cannot handle OOV (out-of-vocabulary) words.

All of these hamper learning, and the high dimensionality makes the representations computationally inefficient. Dense representations come to the rescue.


### Word Embeddings (Extra)

   *An embedding is a representation of the Natural Language that can be learned where words that have the same meaning have a similar representation.* 
    
   For the set of words in a corpus, embedding is a mapping between vector space coming from distributional representation to vector space coming from distributed representation. If we’re given the word “USA,” distributionally similar words could be other countries (e.g., Canada, Germany, India, etc.) or cities in the USA.
    
   The distributed representation is learned based on the usage of words. This allows words that are used in similar ways to result in having similar representations, naturally capturing their meaning.
    
   Each word is represented by a real-valued vector, often tens or hundreds of dimensions. This is contrasted to the thousands or millions of dimensions required for sparse word representations, such as a one-hot encoding.
    
   **Create your own embedding layer**
   
   In order to create your own word embedding layer, you need to perform the following operations:

   1. Text cleaning
   2. Tokenization
   3. Embedding
        
   **Step 1: Text Cleaning**

   This is the foundation and a crucial step: the quality of the embedding depends on how well the text is cleaned. The less noise, the better the model. We can leverage NLP libraries depending on the type of cleaning required. This step is not covered in this tutorial.

   **Step 2: Tokenization**

   Tokenization here means mapping tokens directly to numbers. We will use **TensorFlow’s TextVectorization** for this, a preprocessing layer that maps text features directly to integer sequences. It performs the following operations:

    1. Standardize each example (usually lowercasing + punctuation stripping)
    2. Split each example into substrings (usually words)
    3. Recombine substrings into tokens (usually ngrams)
    4. Index tokens (associate a unique int value with each token)
    5. Transform each example using this index, either into a vector of ints or a dense float vector.

   Here are some of the key parameters that need to be taken care of:

    max_tokens = maximum size of the vocabulary.
    standardize = the standardization to apply, such as “lower”, “strip_punctuation” or “lower_and_strip_punctuation”.
    split = how to split the text, for example on “whitespace” or into individual characters (“character”).
    ngrams = create groups of n words; None (the default) creates no groups.
    output_sequence_length = how long each output sequence should be.

   A good practice is to set output_sequence_length to a value ≥ the average number of tokens per sentence.
    

In [74]:
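# Note: this toy corpus reuses (and overwrites) the name `corpus` from the movie data above
corpus = ['I love TensorFlow', 'TensorFlow loves Tensors', 'Tensors loves Vectors']
# average number of tokens per sentence
round(sum([len(i.split()) for i in corpus]) / len(corpus))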

3

In [68]:
!pip install tensorflow



In [75]:
from tensorflow.keras.layers import TextVectorization

vectorizer = TextVectorization(
    max_tokens=10,
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    ngrams=None,
    output_mode='int',
    output_sequence_length=4,
    pad_to_max_tokens=True,
)

In [76]:
# Now it’s time for our TextVectorization instance to adapt to the sequences in our corpus.
vectorizer.adapt(corpus)

# Vectorize the whole corpus
vectorizer(corpus)


<tf.Tensor: shape=(3, 4), dtype=int64, numpy=
array([[7, 6, 3, 0],
       [3, 4, 2, 0],
       [2, 4, 5, 0]], dtype=int64)>

The shape is very important here: (3, 4) is the number of rows followed by the length of each sequence. Since we defined “output_sequence_length = 4”, each sequence is padded with zeros to length 4 even though it contains only 3 tokens. The opposite also holds: a sequence longer than this limit is truncated to it, which is why choosing this parameter is tricky. Even a single token passed to the layer is padded in the same way.
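
The padding and truncation rule can be sketched in plain Python. This is an illustration of the behaviour described above, not the Keras implementation; `pad_or_truncate` is a hypothetical helper and the token ids are taken from the output shown earlier.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Force a list of token ids to exactly `length` entries (hypothetical helper)."""
    clipped = token_ids[:length]                          # drop anything beyond the limit
    return clipped + [pad_id] * (length - len(clipped))   # fill the rest with the pad id

print(pad_or_truncate([7, 6, 3], 4))           # shorter than 4: padded with one 0
print(pad_or_truncate([3, 4, 2, 4, 5, 2], 4))  # longer than 4: truncated
print(pad_or_truncate([3], 4))                 # a single token is padded as well
```

A limit that is too small silently discards tokens, while one that is too large wastes space on padding, which is why the average sentence length is a reasonable starting point.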

**Step 3: Embedding**

Here we are at the final step in ***creating word embedding layers***! For this, we will use **TensorFlow’s Embedding**. Again, here are some of the key parameters that need to be taken care of:

input_dim = Size of the vocabulary.

After the vectorizer is fitted to the corpus we can get the length of the vocabulary by using:

In [77]:
vocab_len = len(vectorizer.get_vocabulary())

In [78]:
vocab_len

8

output_dim = Size of the output embedding vector; a value of 10 means each token is represented by a vector of 10 numbers.

Note: Prefer dimensions that are divisible by 8; this can speed up computation on some hardware.

input_length = Length of the sequences being passed to the embedding layer, so it should be equal to the output_sequence_length. (Recent Keras releases deprecate this argument and infer the length from the input.)

Now, let’s create an instance of Embedding and apply it to the output of the previous step.

In [80]:
from tensorflow.keras.layers import Embedding

embed = Embedding(input_dim=vocab_len,
                  output_dim=4,
                  input_length=4)

In [81]:
# apply the (still untrained) embedding layer to the tokenized corpus
embed(vectorizer(corpus))

<tf.Tensor: shape=(3, 4, 4), dtype=float32, numpy=
array([[[ 0.0162386 , -0.04755688, -0.01951839, -0.0486214 ],
        [ 0.0140211 , -0.00024027, -0.04202429,  0.02287618],
        [ 0.02416504,  0.01927711, -0.01489583, -0.03634726],
        [ 0.01321355, -0.02965011,  0.00294185,  0.04639276]],

       [[ 0.02416504,  0.01927711, -0.01489583, -0.03634726],
        [ 0.03322167,  0.02805246,  0.04029535,  0.03265033],
        [-0.00534086, -0.0002786 , -0.02870411,  0.02548721],
        [ 0.01321355, -0.02965011,  0.00294185,  0.04639276]],

       [[-0.00534086, -0.0002786 , -0.02870411,  0.02548721],
        [ 0.03322167,  0.02805246,  0.04029535,  0.03265033],
        [ 0.03240671,  0.01602124, -0.03174583,  0.00118878],
        [ 0.01321355, -0.02965011,  0.00294185,  0.04639276]]],
      dtype=float32)>

The output shape is very important, since it determines how this embedding is fed into a neural network.

So what does (3, 4, 4) mean?

3: three sequences are passed.

4: each sentence is padded to 4 tokens; every padding token (0) gets the same representation.

4: each token is represented by a vector of 4 features.

In [82]:
embed(vectorizer(['TensorFlow']))

<tf.Tensor: shape=(1, 4, 4), dtype=float32, numpy=
array([[[ 0.02416504,  0.01927711, -0.01489583, -0.03634726],
        [ 0.01321355, -0.02965011,  0.00294185,  0.04639276],
        [ 0.01321355, -0.02965011,  0.00294185,  0.04639276],
        [ 0.01321355, -0.02965011,  0.00294185,  0.04639276]]],
      dtype=float32)>

As you can see, the shape this time is (1, 4, 4). Although a single token is passed, it is treated as a sequence and gets padded. Therefore the first row is unique and the remaining rows all share the representation of the padding token 0.
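
Conceptually, an embedding layer behaves like a lookup table with one row per vocabulary entry. The NumPy sketch below illustrates that idea only: the table is filled with random values standing in for the layer's weights (an assumption for illustration), and the token ids are the ones produced by the vectorizer above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_len, output_dim = 8, 4

# One row of `output_dim` numbers per token id; random stand-in for learned weights
table = rng.uniform(-0.05, 0.05, size=(vocab_len, output_dim))

# Token ids as produced by the vectorizer for the three toy sentences
token_ids = np.array([[7, 6, 3, 0],
                      [3, 4, 2, 0],
                      [2, 4, 5, 0]])

# Looking up every id replaces it with its row, adding one trailing dimension
embedded = table[token_ids]

print(embedded.shape)                                  # (3, 4, 4)
print(np.array_equal(embedded[0, 3], embedded[1, 3]))  # padding rows are identical: True
```

Training a real embedding layer amounts to adjusting the rows of this table so that tokens used in similar ways end up with similar vectors.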

Now we know how to create a basic word embedding layer in TensorFlow! Two famous algorithms for learning such word vectors from large corpora are word2vec and GloVe.


**Summary:**

    Neural networks work with numbers, so you can’t just throw words at them.
    You could one-hot encode all the words, but that results in a sparse matrix and loses the notion of similarity between words: a big NO-NO.
    Neural networks work best with dense representations.
    Therefore we feed the output of the Embedding layer to the neural network.

### Bag of Words & TF-IDF (Text Representation)

Text representation techniques:

**Bag of Words**

This method represents a text as a collection of words, disregarding their order and context. It assumes that text belonging to a specific class in the dataset can be characterized by a unique set of words. Each sentence or document becomes a vector whose length equals the number of unique words in the vocabulary, with each entry recording whether (or how often) the corresponding word occurs, forming a Bag 💼 representation.

If two text pieces share similar words, they are considered to belong to the same class or Bag. By analyzing the words present in a text, one can identify the corresponding class it falls into. Consequently, in this representation, documents with the same words will have their vector representations located close to each other in the vector space.

However, this approach results in a sparse representation, with the majority of vector entries being zeroes. This sparsity can lead to inefficiencies in storage, computation, and learning (overfitting). It also treats every word as equally important, a limitation that the next method, TF-IDF, addresses.
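
As a minimal sketch of the idea, the snippet below builds count-based Bag of Words vectors in plain Python for a toy corpus. The lowercase and whitespace tokenization is a simplifying assumption, and `bag_of_words` is a hypothetical helper.

```python
from collections import Counter

docs = ["I love TensorFlow", "TensorFlow loves Tensors", "Tensors loves Vectors"]

# Simplified tokenization: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of unique tokens across all documents
vocab = sorted({token for tokens in tokenized for token in tokens})

def bag_of_words(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [bag_of_words(tokens) for tokens in tokenized]

print(vocab)
for vector in vectors:
    print(vector)
```

Even in this tiny example half of the entries in every vector are zero; with a vocabulary of thousands of words, almost all of them would be.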


**TF-IDF**

In the previous text representation techniques, all words within a text are treated as equally significant. **TF-IDF (term frequency–inverse document frequency)** addresses this limitation. It strives to **measure the relative importance of a word by comparing its frequency in a specific document to its occurrence across the entire corpus.** This is a widely used representation scheme in information-retrieval systems, where it helps extract relevant documents from a corpus for a given text query.

**TF (term frequency)** measures how often a term or word occurs in a given document and TF of a term t in a document d is given as

    TF(t,d) = (Number of occurrences of term t in document d) / (Total number of terms in document d)
   
   
**IDF (inverse document frequency)** measures the importance of the term across a corpus. In computing TF (t), all terms are given equal importance (weightage). **IDF weighs down the terms that are very common across a corpus and weighs up the rare terms**. IDF of a term t is given as

    IDF(t) = log[ (Total number of documents in the corpus) / (Number of documents with term t in them) ]
    
    
    TF-IDF = TF(t,d) * IDF(t)

The TF-IDF score is calculated for every term in the vocabulary of the corpus (the union of unique words across all documents). Each document is then represented as a vector of the TF-IDF scores of the words present in it.
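
The formulas above can be sketched directly in plain Python. This is a minimal illustration on a made-up toy corpus; the function names `tf`, `idf`, and `tfidf_vector` are chosen for this example only.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration), already split into tokens
docs = [
    "dog bites man",
    "man bites dog",
    "dog eats meat",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({w for tokens in tokenized for w in tokens})

def tf(term, tokens):
    """Occurrences of `term` in the document divided by the document length."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    """log(total documents / documents containing `term`)."""
    n_docs_with_term = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_docs_with_term)

def tfidf_vector(tokens, corpus, vocab):
    """One TF-IDF score per vocabulary term for a single document."""
    return [tf(t, tokens) * idf(t, corpus) for t in vocab]

tfidf_rows = [tfidf_vector(tokens, tokenized, vocab) for tokens in tokenized]

for doc, row in zip(docs, tfidf_rows):
    print(doc, [round(score, 3) for score in row])
```

"dog" occurs in every document, so its IDF is log(3/3) = 0 and it contributes nothing, while "eats" and "meat" occur in a single document and receive the highest scores. scikit-learn's `TfidfVectorizer` will not reproduce these exact numbers, because by default it uses a smoothed IDF and L2-normalizes each row.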

Note: Word preprocessing such as stop-word removal, stemming, and lemmatization is crucial here. Mapping Eat, Eating, and Eats to Eat, for example, shrinks the vocabulary and therefore lowers the dimensions of the array.
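
To make the effect on dimensionality concrete, here is a small sketch. The stop-word list and the suffix stripper `crude_stem` are deliberately naive, hypothetical stand-ins for a real stop-word list and a real stemmer or lemmatizer.

```python
# Tiny illustrative stop-word list (hypothetical, not a real library list)
STOP_WORDS = {"the", "a", "is", "and", "to"}

def crude_stem(word):
    """Strip a couple of common suffixes; a naive stand-in for real stemming."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

raw = "Eat eating eats and the dog is eating"
raw_vocab = set(raw.lower().split())
clean_vocab = set(preprocess(raw))

print(len(raw_vocab), sorted(raw_vocab))
print(len(clean_vocab), sorted(clean_vocab))
```

The vocabulary drops from seven distinct tokens to two, so each document vector would need two columns instead of seven.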

### Cosine Similarity (Text Similarity)

There are multiple distance measures in data science:

    1. Euclidean
    2. Cosine
    3. Hamming
    4. Manhattan
    5. Minkowski
    6. Chebyshev
    7. Jaccard
    8. Haversine
    9. Sorensen-Dice
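
For orientation, a few of these measures can be computed by hand in plain Python. This is only a sketch on made-up inputs; note that cosine and Jaccard are usually stated as similarities, with the corresponding distance being 1 minus the similarity.

```python
import math

# Two toy points (made up for illustration)
p = [1, 2]
q = [2, 3]

# Straight-line distance
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
# Sum of absolute coordinate differences
manhattan = sum(abs(a - b) for a, b in zip(p, q))
# Largest absolute coordinate difference
chebyshev = max(abs(a - b) for a, b in zip(p, q))

# Cosine similarity: dot product divided by the product of the lengths
dot = sum(a * b for a, b in zip(p, q))
cosine_sim = dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

# Jaccard works on sets rather than coordinates: |intersection| / |union|
s1 = {"action", "adventure", "fantasy"}
s2 = {"action", "fantasy", "drama"}
jaccard_sim = len(s1 & s2) / len(s1 | s2)

print(euclidean, manhattan, chebyshev, cosine_sim, jaccard_sim)
```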
    

**Cosine Similarity**:

Cosine similarity is a mathematical approach **used to measure the similarity between pairs of vectors or rows of numbers treated as vectors.** Each value in a sample is treated as a coordinate of the vector's endpoint, with the other endpoint at the origin of the coordinate system. This is done for two samples, and the cosine of the angle between the two vectors is computed in an m-dimensional space, where m is the number of values in each sample. **Vectors pointing in the same direction have a similarity of 1 since the cosine of 0 is 1. As the vectors become more dissimilar, the cosine value approaches 0, indicating a lower similarity.** For vectors with non-negative entries, such as word counts or TF-IDF scores, the value always stays between 0 and 1.

The cosine similarity between two vectors A and B, each with n components, is computed as follows:

    similarity = cos(theta) = A.B / (||A|| * ||B||)
    
The dot product of vectors A (x1, x2, x3) and B (y1, y2, y3), denoted as A.B, is computed as:

A.B = x1∗y1 + x2∗y2 + x3∗y3

and the lengths ||A|| and ||B|| are calculated as:

||A|| = √(x1² + x2² + x3²)

||B|| = √(y1² + y2² + y3²)


Let’s consider a simple example in two-dimensional space. It’s worth noting the beauty of mathematics lies in the fact that if we can comprehend a concept in two dimensions, we can extend that understanding to any number of dimensions. Let's walk through the example⚡

In [84]:
# Define three vectors
A = [1, 2]
B = [2, 3]
C = [3, 1]

In [85]:
# Calculate dot products
ab = np.dot(A,B)
bc = np.dot(B,C)
ca = np.dot(C,A)

In [86]:
# calculate the length of the vector
a = np.linalg.norm(A)
b = np.linalg.norm(B)
c = np.linalg.norm(C)

In [87]:
# calculate cosine similarity for each pair using the above formula
sim_ab = ab/(a*b)
sim_bc = bc/(b*c)
sim_ca = ca/(c*a)

In [88]:
# lets see the similarities
sim_ab, sim_bc, sim_ca

(0.9922778767136677, 0.7893522173763263, 0.7071067811865475)

By using the **cosine_similarity** function from sklearn.metrics.pairwise, we can replicate all these computations with a single line of code.

In [31]:
from sklearn.metrics.pairwise import cosine_similarity

In [91]:

# compute cosine similarity of the example above
cosine_similarity([A,B,C])

array([[1.        , 0.99227788, 0.70710678],
       [0.99227788, 1.        , 0.78935222],
       [0.70710678, 0.78935222, 1.        ]])

The cosine_similarity function returns a matrix of pairwise cosine similarities, similar to a correlation matrix. Each row and column in the matrix represents a vector, so the matrix is symmetric and its diagonal values are always one (since each vector is being compared to itself).

From the obtained results, it becomes apparent that vector A is more similar to vector B (0.99) than to vector C (0.71). Similarly, vector C is closer to vector B (0.79) than to vector A. To make the matrix easier to read, let’s create a labeled DataFrame from the cosine similarity matrix.

In [93]:
data = [A, B, C]
pd.DataFrame(cosine_similarity(data), 
             columns=['A', 'B', 'C'], 
             index=['A', 'B', 'C'])

Unnamed: 0,A,B,C
A,1.0,0.992278,0.707107
B,0.992278,1.0,0.789352
C,0.707107,0.789352,1.0


## Generating Recommendation Engine

We will take the DataFrame we have generated and apply TF-IDF vectorization to the corpus column. This yields a feature matrix in which the rows represent the movies and the columns correspond to the terms of the vocabulary. We can verify this by comparing the shapes of the DataFrame and the matrix.

In [32]:
df.head()

Unnamed: 0,id,title,vote_average,corpus
0,19995,Avatar,7.2,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,6.9,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,6.3,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,7.6,Following the death of District Attorney Harve...
4,49529,John Carter,6.1,"John Carter is a war-weary, former military ca..."


In [33]:
df.corpus[0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy Science Fiction Sam Worthington Zoe Saldana Sigourney Weaver Stephen Lang Jon Landau James Cameron'

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the object and remove stopwords
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['corpus'])


In [35]:
tfidf_matrix

<4800x30258 sparse matrix of type '<class 'numpy.float64'>'
	with 202256 stored elements in Compressed Sparse Row format>

In [36]:
# Compare shapes
df.shape

(4800, 4)

In [37]:
tfidf_matrix.shape

(4800, 30258)

We are now in a situation similar to the earlier example with A, B, and C. Our dataset comprises 4,800 movies, and each movie is represented by a vector with 30,258 entries, one per term in the vocabulary of the entire corpus. To compute the cosine similarity, we need to generate a matrix whose rows and columns both represent the 4,800 movies. Each movie is compared with every other movie, and the dot product of their vectors divided by the product of the vectors' lengths yields the cosine similarity.

Given the basic example, we can picture the matrix we are aiming for. Rather than calculating it by hand, we can use the **linear kernel provided by scikit-learn**, which computes plain dot products. Because TfidfVectorizer L2-normalizes each row by default (norm='l2'), every movie vector already has length 1, so the dot product of two rows equals their cosine similarity and the division by the lengths can be skipped.
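
The following sketch illustrates, on a small made-up matrix, why a plain dot product can stand in for cosine similarity once every row has unit length. It uses only NumPy; the matrix `X` and its values are assumptions for the example, not data from the notebook.

```python
import numpy as np

# Toy non-negative feature matrix: 3 documents x 4 terms (made-up values)
X = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 4.0],
])

# Cosine similarity from the definition: dot product over product of lengths
norms = np.linalg.norm(X, axis=1)
cos_from_definition = (X @ X.T) / np.outer(norms, norms)

# Scale every row to unit length, which is what L2 normalization does
X_unit = X / norms[:, np.newaxis]

# For unit-length rows, a plain dot product (a linear kernel) gives the same matrix
linear = X_unit @ X_unit.T

print(np.allclose(linear, cos_from_definition))  # True
```

Skipping the normalization step is what makes the linear kernel a cheaper choice when the rows are already normalized.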

In [38]:
from sklearn.metrics.pairwise import linear_kernel

# compute the similarity matrix
cos_mat = linear_kernel(tfidf_matrix, tfidf_matrix)

In [39]:
cos_mat.shape

(4800, 4800)

We have our similarity matrix. As per our theory, the diagonal elements of the matrix should be 1 since each movie is being compared to itself. To verify this, if we sum up all the diagonal elements of the matrix, it should yield a value of 4800. Let’s put it to the test and confirm our expectations!

In [40]:
diag = 0
for i in range(len(cos_mat)):
    diag += cos_mat[i][i]
    
print(diag)

4800.0
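
The same check can be written without an explicit loop. The sketch below uses a small stand-in matrix with made-up values rather than the notebook's `cos_mat`; `np.allclose` is used because individual diagonal entries can come out as values like 0.9999999999999997 due to floating-point rounding.

```python
import numpy as np

# Stand-in similarity matrix with made-up values (the real one is 4800 x 4800)
sim = np.array([
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.4],
    [0.1, 0.4, 1.0],
])

diag_sum = np.trace(sim)                    # sum of the diagonal entries
all_ones = np.allclose(np.diag(sim), 1.0)   # tolerant of floating-point rounding
is_symmetric = np.allclose(sim, sim.T)      # similarity of (i, j) equals (j, i)

print(diag_sum, all_ones, is_symmetric)
```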


Now that we have everything prepared, here’s how the workflow will unfold:

1. When the user provides a movie name, the model locates the index of that movie in our DataFrame.
2. We then use this index to retrieve the corresponding row from the similarity matrix.
3. As the DataFrame and the cosine matrix are aligned, this row is an array containing the cosine similarity scores of that movie with all other movies in the database.

However, *the array is not sorted in any particular order, and we want to showcase the most similar movies*. To achieve this, we *need to sort the array in descending order*. The first element will always correspond to the movie itself, with a similarity score of 1, followed by the other movies in descending order of similarity. Here lies a challenge: *Sorting the array will disrupt the original order, making it difficult to fetch the movie titles from the database.*

To overcome this, *we can store the movie index and similarity score as tuples*. Then, we can perform the sorting based on the score alone while keeping the index intact. Subsequently, we can retrieve the movie details using the index from the tuple, ensuring we maintain the correct movie-title association. This approach allows us to obtain the desired similarity rankings while preserving the necessary information for fetching movie details. We can also pass a parameter n for slicing i.e. to fetch top n similar movies.
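
The index-and-score idea can be sketched on a toy score list. The scores and titles below are made up; `heapq.nlargest` is shown as one possible alternative when only a handful of results are needed, not as the notebook's method.

```python
import heapq

# Made-up similarity scores of one query movie against five movies;
# position 2 is the query movie itself (score 1.0)
scores = [0.12, 0.05, 1.0, 0.33, 0.27]
titles = ["Movie A", "Movie B", "Query Movie", "Movie D", "Movie E"]

# Pair each score with its original position, then sort by the score only
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

n = 2
# Skip the first entry (the query movie itself) and keep the next n
top_n = [titles[idx] for idx, _ in ranked[1:n + 1]]

# heapq.nlargest avoids sorting the whole list when only a few items are needed
largest = heapq.nlargest(n + 1, enumerate(scores), key=lambda pair: pair[1])
top_n_heap = [titles[idx] for idx, _ in largest[1:]]

print(top_n)       # ['Movie D', 'Movie E']
print(top_n_heap)  # ['Movie D', 'Movie E']
```

Because each tuple carries the original position, the titles can still be looked up after the scores have been reordered.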

In [41]:
sorted(list(enumerate(cos_mat[0])), reverse=True, key = lambda x: x[1])

[(0, 0.9999999999999997),
 (4399, 0.1415899568738931),
 (942, 0.13498538684879574),
 (3603, 0.12258512012128736),
 (2403, 0.11852902613288109),
 (838, 0.11581508223430054),
 (1245, 0.11431512309211009),
 (740, 0.113079183458059),
 (43, 0.11196032334004538),
 (94, 0.11032651423288937),
 (2138, 0.09253531408518909),
 (3104, 0.09224403598052139),
 (634, 0.09154705701674891),
 (47, 0.09115720100066335),
 (1341, 0.0902650431649201),
 (1178, 0.08732362348424455),
 (529, 0.08591994756833259),
 (150, 0.08530818125228448),
 (2966, 0.08072035871975969),
 (2060, 0.07985113908534044),
 (1053, 0.07905762857235527),
 (1796, 0.0784787631654588),
 (775, 0.07819004432036487),
 (182, 0.07814662454080266),
 (68, 0.07788068882087319),
 (3055, 0.07747171918786659),
 (2244, 0.07681834061302681),
 (3157, 0.07463520401253124),
 (311, 0.07248613632315021),
 (2130, 0.07204524347418373),
 (1804, 0.0719901836216657),
 (3648, 0.0714273250138737),
 (1092, 0.07114847515784227),
 (2103, 0.07109389288326345),
 (2500, 

In [42]:
def get_recomm(movie, n):
    """Return the titles of the n movies most similar to `movie`."""

    # get the row index of the movie (raises IndexError if the title is not in df)
    index = df[df['title'] == movie].index[0]

    # pair every movie index with its similarity score, highest score first
    similar_movies = sorted(enumerate(cos_mat[index]), reverse=True, key=lambda x: x[1])

    # skip the first entry (the movie itself) and return the next n titles
    return [df.iloc[i].title for i, _ in similar_movies[1:n + 1]]

In [43]:
# test the function
get_recomm('Avatar',4)

['The Helix... Loaded', 'The Book of Life', 'Apollo 18', 'Aliens']

In [44]:
get_recomm("The Dark Knight", 3)

['The Dark Knight Rises', 'Batman Begins', 'Batman Returns']

In [45]:
get_recomm("Mission: Impossible", 3)

['Mission: Impossible III',
 'Mission: Impossible II',
 'Mission: Impossible - Ghost Protocol']

Up until now, our movie recommendation system has relied solely on movie names. This has limitations: the system fails if a movie name is not present in the DataFrame, and it cannot serve a user who wants recommendations based on cast or crew. But fear not, for we have a solution at hand! Remember how we incorporated the cast and crew information into the corpus? Well, it’s time to leverage that.

Although the cast and crew details are not stored as separate columns in our DataFrame, they are part of the corpus text, so we can still make use of them. Here’s where TF-IDF comes to the rescue. By applying the already fitted TF-IDF transformation to the keywords or tags, we convert them into a vector with the same number of columns as the TF-IDF matrix (30,258, one per vocabulary term), which lets us compare the query against every movie.

In [46]:
keywords = "Christopher Nolan"
keywords = keywords.split()
keywords = " ".join(keywords)
keywords

'Christopher Nolan'

In [47]:
# transform the string to vector representation
key_tfidf = tfidf.transform([keywords])
key_tfidf

<1x30258 sparse matrix of type '<class 'numpy.float64'>'
	with 2 stored elements in Compressed Sparse Row format>

In [48]:
key_tfidf.shape

(1, 30258)

In [49]:
# compute cosine similarity    
result = cosine_similarity(key_tfidf, tfidf_matrix)
result

array([[0., 0., 0., ..., 0., 0., 0.]])

In [50]:
# sort top n similar movies   
similar_key_movies = sorted(list(enumerate(result[0])), reverse=True, key=lambda x: x[1])
similar_key_movies

[(1196, 0.24981625443638547),
 (1033, 0.2269891672477714),
 (14, 0.2173854313908544),
 (119, 0.20393152996372776),
 (95, 0.2003657551723531),
 (1393, 0.1974346497892047),
 (96, 0.19031466183323786),
 (65, 0.1674131953333691),
 (196, 0.16184034862758323),
 (4644, 0.15860445527894698),
 (3463, 0.15540336731717094),
 (3, 0.154612547194409),
 (3572, 0.14683765801039356),
 (2868, 0.13499580580114212),
 (106, 0.13381206120830644),
 (3577, 0.13307587784510655),
 (423, 0.12305006734967193),
 (4484, 0.1191808653725676),
 (3694, 0.10018861343708382),
 (3769, 0.10010832362964717),
 (1758, 0.09940750004125574),
 (3551, 0.09652012833828355),
 (2199, 0.09564541105303324),
 (584, 0.0931283399329479),
 (1650, 0.09261229846062927),
 (1658, 0.0879336104268975),
 (1802, 0.0833693181712317),
 (2388, 0.07681298891404421),
 (3235, 0.07551090003440986),
 (1649, 0.07523808200620612),
 (2745, 0.07227935010452113),
 (1109, 0.07185852011805015),
 (506, 0.07178809112789503),
 (2028, 0.07137891631069808),
 (4605, 

In [51]:
def get_recomm_keywords(keywords, n):

    # normalize whitespace in the query string
    keywords = " ".join(keywords.split())

    # transform the string to a vector in the fitted TF-IDF vocabulary
    key_tfidf = tfidf.transform([keywords])

    # compute cosine similarity between the query and every movie
    result = cosine_similarity(key_tfidf, tfidf_matrix)

    # pair every movie index with its score, highest score first
    similar_key_movies = sorted(enumerate(result[0]), reverse=True, key=lambda x: x[1])

    # extract names from dataframe and return movie names
    # NOTE: [1:n+1] mirrors get_recomm and skips the top-ranked match;
    # a keyword query has no "self" row, so [:n] would also return the best match
    recomm_mov = []
    for i in similar_key_movies[1:n+1]:
        recomm_mov.append(df.iloc[i[0]].title)

    return recomm_mov

In [52]:
get_recomm_keywords("Christopher Nolan", 4)

['Insomnia', 'Man of Steel', 'Batman Begins', 'Interstellar']

In [53]:
get_recomm_keywords("Daniel Craig", 4)

['Harrison Montgomery', 'Scary Movie 4', 'Action Jackson', 'Dream House']
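
One detail worth noting: `get_recomm_keywords` slices the ranked list with `[1:n+1]`, just like `get_recomm`. For a title lookup that skips the query movie itself, but for a free-text query it skips the best-scoring movie (index 1196 in the sorted listing above). A possible variant that makes this choice explicit might look like the sketch below; `top_n_titles` and its `skip_first` parameter are hypothetical names, and the scores are made up.

```python
def top_n_titles(scores, titles, n, skip_first=False):
    """Return up to n titles with the highest scores (hypothetical helper).

    skip_first=True drops the top-ranked entry, which suits a title-based
    lookup where the top hit is the query movie itself. A free-text query
    has no such row, so the default keeps it.
    """
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    start = 1 if skip_first else 0
    # Ignore movies that share no terms with the query (score 0)
    return [titles[i] for i, score in ranked[start:start + n] if score > 0]

# Made-up scores of a keyword query against five movies
titles = ["Film A", "Film B", "Film C", "Film D", "Film E"]
scores = [0.0, 0.25, 0.0, 0.23, 0.0]

print(top_n_titles(scores, titles, 3))                   # ['Film B', 'Film D']
print(top_n_titles(scores, titles, 3, skip_first=True))  # ['Film D']
```

Filtering out zero scores also avoids padding the result with unrelated movies when fewer than n movies match the query at all.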

Now, it’s time to switch to the last gear and transition from the notebook environment to a dynamic web application using Streamlit. To reuse the similarity matrix, the fitted vectorizer, the TF-IDF matrix, and the DataFrame in the app, we will save them as binary files using joblib (pickle would work as well).

In [55]:
df.head()

Unnamed: 0,id,title,vote_average,corpus
0,19995,Avatar,7.2,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,6.9,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,6.3,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,7.6,Following the death of District Attorney Harve...
4,49529,John Carter,6.1,"John Carter is a war-weary, former military ca..."


In [54]:
import joblib

In [56]:
import os
os.makedirs('models', exist_ok=True)  # joblib.dump does not create missing directories
joblib.dump(df, 'models/movie_db.df')
joblib.dump(cos_mat, 'models/cos_mat.mt')
joblib.dump(tfidf, 'models/vectorizer.tf')
joblib.dump(tfidf_matrix, 'models/tfidf_mat.tf')

['models/tfidf_mat.tf']

This way, we can easily load and utilize them within our web application. With these preparations, we bring this section to a fulfilling conclusion.
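
As a rough sketch of the save-and-load round trip, the example below uses pickle from the standard library on small stand-in objects inside a temporary directory. The dictionary contents and the file name `artifacts.pkl` are assumptions for illustration; in the notebook the saved objects would be `df`, `cos_mat`, `tfidf`, and `tfidf_matrix`.

```python
import os
import pickle
import tempfile

# Stand-in artifacts with made-up values
artifacts = {
    "titles": ["Movie A", "Movie B"],
    "similarity": [[1.0, 0.3], [0.3, 1.0]],
}

with tempfile.TemporaryDirectory() as model_dir:
    path = os.path.join(model_dir, "artifacts.pkl")

    # Save: done once, after everything has been computed
    with open(path, "wb") as f:
        pickle.dump(artifacts, f)

    # Load: what a web app would do at startup
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifacts)  # True
```

With joblib, the counterpart of `joblib.dump(obj, path)` is `joblib.load(path)`. In either case, only load files that come from a trusted source, since unpickling can execute arbitrary code.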