# Movie Recommender using Word Embeddings
## Learning Objective
In this assignment, you will train word embeddings using gensim. You will then use the learned embeddings to build a movie recommender application.

<b><div style="text-align: right">[TOTAL POINTS: 20]</div></b>

## Assignment Overview

In this assignment, you will demonstrate how word embeddings can be applied to movie recommendation. You will pre-process the dataset and extract the useful columns (those that describe the movie as text). You will then learn embeddings for the words in the descriptions using the [gensim](https://radimrehurek.com/gensim/) library. Finally, you will average the word embeddings of each description and build a movie recommender based on the cosine similarity between the averaged description vectors.

## Dataset Description

In this assignment, you will use [The Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset?select=movies_metadata.csv). The dataset contains metadata for 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017.

This dataset consists of the following files:

movies_metadata.csv: This is the main Movies Metadata file. It contains information about 45,000 movies featured in the Full MovieLens dataset.
The features of the `movies_metadata.csv` include:
* genres
* overview
* tagline
* posters
* backdrops
* budget
* review
* release dates
* languages
* production countries
* production companies

keywords.csv: Contains the movie plot keywords for MovieLens movies.

credits.csv: Consists of Cast and Crew Information for the movies.

links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.

links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.

ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.

\
In this programming assignment you will only use the `movies_metadata.csv` dataset.

\
**Licence**:  [CC0: Public Domain](https://creativecommons.org/publicdomain/zero/1.0/)

**Source**: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=movies_metadata.csv \
**Number of instances**: 45466



In [1]:
!pip install -q gensim
!pip install -q scikit-learn

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Users\\acer\\Desktop\\fuse_ml\\env\\Lib\\site-packages\\~umpy.libs\\libscipy_openblas64_-caad452230ae4ddb57899b8b3a33c55c.dll'
Check the permissions.

You should consider upgrading via the 'C:\Users\acer\Desktop\fuse_ml\env\Scripts\python.exe -m pip install --upgrade pip' command.


In [2]:
import warnings
warnings.filterwarnings('ignore')

Let's take a quick look at a portion of the dataset.

In [3]:
import pandas as pd
df1 = pd.read_csv('movies_metadata.csv')
df1.head(5)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [4]:
df2 = pd.read_csv('keywords.csv')
df2.head(5)

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


Some rows contain a date in the `id` column instead of an ID number. Let's find and delete those rows.

In [5]:
df1[df1['id'].str.contains(r'\d{4}-\d{2}-\d{2}', regex=True, na=False)]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[{'name': 'Carousel Productions', 'id': 11176}...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",1997-08-20,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,1,,,,,,,,,
29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-09-29,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,...,12,,,,,,,,,
35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",2014-01-01,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,22,,,,,,,,,


In [6]:
df1 = df1.drop([19730, 29503, 35587])
df2 = df2.drop([19730, 29503, 35587])

Let's change the data type of the `id` column to `int64`.

In [7]:
df1['id'] = df1['id'].astype('int64')
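
The cells above find the malformed rows by eye and then drop them by hard-coded index. As a minimal sketch of the same idea on toy data (the frame `toy` and its values are made up for illustration, not taken from the dataset), the boolean mask from `str.contains` can be used directly to filter the rows before casting:

```python
import pandas as pd

# Hypothetical toy frame: one row has a date where an ID should be.
toy = pd.DataFrame({
    "id": ["862", "8844", "1997-08-20", "15602"],
    "title": ["Toy Story", "Jumanji", None, "Grumpier Old Men"],
})

# True for rows whose id looks like YYYY-MM-DD.
is_date = toy["id"].str.contains(r"\d{4}-\d{2}-\d{2}", regex=True, na=False)

# Keep only the well-formed rows, then cast the id column to int64.
clean = toy.loc[~is_date].copy()
clean["id"] = clean["id"].astype("int64")
print(clean)
```

Filtering with the mask avoids having to keep a list of row numbers in sync with the data.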

Let's merge the two DataFrames on `id`.

In [8]:
df = df1.merge(df2,on='id')
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."
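
`merge` performs an inner join by default, so only ids present in both frames survive, and an id that is repeated in either frame produces one output row per matching pair. The sketch below uses made-up toy frames (`left`, `right`) to show how a merged frame can end up with more rows than its inputs; whether this applies to the real files is an assumption you would need to check yourself.

```python
import pandas as pd

# Hypothetical toy frames: id 2 is duplicated in both, id 3 exists only on the right.
left = pd.DataFrame({"id": [1, 2, 2], "title": ["A", "B", "B"]})
right = pd.DataFrame({"id": [1, 2, 2, 3], "keywords": ["x", "y", "y", "z"]})

# Inner join on id: id 1 gives 1 row, id 2 gives 2 x 2 = 4 rows, id 3 is dropped.
merged = left.merge(right, on="id")
print(len(merged))

# Dropping duplicate ids first gives exactly one row per shared id.
deduped = left.drop_duplicates("id").merge(right.drop_duplicates("id"), on="id")
print(len(deduped))
```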


Let's remove the following columns from the dataset. They are either numeric or say little about the movie as text.

adult: whether the movie is an "adult" movie or not

belongs_to_collection: the collection the movie belongs to, if any

budget: the budget of the movie

homepage: the original homepage of the movie

imdb_id: the IMDB ID of the movie

id: the unique identifier of the movie

original_title: the original title of the movie

release_date: the release date of the movie

poster_path: the path to the poster image of the movie

production_countries: the production countries of the movie

popularity: the popularity score of the movie

revenue: the revenue of the movie

runtime: the runtime of the movie

spoken_languages: the spoken languages of the movie

status: the release status of the movie

video: whether the movie has a video or not

vote_average: the average vote of the movie

vote_count: the number of votes for the movie

In [9]:
df = df.drop(['adult', 'belongs_to_collection','budget', 
              'homepage', 'imdb_id', 'id', 'original_title',
              'release_date', 'poster_path', 'production_countries',
              'popularity','revenue','runtime', 'spoken_languages', 
              'status', 'video','vote_average', 'vote_count'], axis=1)
df.head()

Unnamed: 0,genres,original_language,overview,production_companies,tagline,title,keywords
0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",en,"Led by Woody, Andy's toys live happily in his ...","[{'name': 'Pixar Animation Studios', 'id': 3}]",,Toy Story,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",en,When siblings Judy and Peter discover an encha...,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",Roll the dice and unleash the excitement!,Jumanji,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",en,A family wedding reignites the ancient feud be...,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",en,"Cheated on, mistreated and stepped on, the wom...",[{'name': 'Twentieth Century Fox Film Corporat...,Friends are the people who let you be yourself...,Waiting to Exhale,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,"[{'id': 35, 'name': 'Comedy'}]",en,Just when George Banks has recovered from his ...,"[{'name': 'Sandollar Productions', 'id': 5842}...",Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


### Exercise 1 : Concat Description
<b><div style="text-align: right">[POINTS: 2]</div></b>

The dataset now has the following columns:

title: the name (title) of the movie

keywords: the plot keywords of the movie

genres: the genres of the movie

original_language: the original language of the movie

overview: the overview (description) of the movie

production_companies: the production companies of the movie

tagline: the tagline (a brief description) of the movie

Your task is to concatenate these descriptive columns into a single description string for each movie.

**Tasks:**

* Concatenate the descriptive columns of each movie into a single string and store it in `df['concat_description']`.

* The concatenation should follow this format:\
`keywords + space + genres + space + original_language + space + production_companies + space + tagline + space + overview`\
`Note: Don't forget to typecast to string format.`


In [10]:
df['concat_description'] = None
### Ex-1-Task-1
### BEGIN SOLUTION 
# YOUR CODE HERE
df['concat_description'] = (
    df['keywords'].astype(str) + " " +
    df['genres'].astype(str) + " " +
    df['original_language'].astype(str) + " " +
    df['production_companies'].astype(str) + " " +
    df['tagline'].astype(str) + " " +
    df['overview'].astype(str)
)
### END SOLUTION
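
The solution above casts every column to string. With that cast, a missing value may end up as the literal text `nan` in the description, which is presumably why `nan` is on the blacklist in Exercise 2. As an alternative sketch, not the approach the assignment asks for, missing values can be replaced with an empty string before concatenating. The frame `toy` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing tagline.
toy = pd.DataFrame({
    "tagline": [np.nan, "Roll the dice"],
    "overview": ["Toys come to life.", "A board game goes wrong."],
})

# Replace missing values with "" so no placeholder text enters the description.
combined = toy["tagline"].fillna("") + " " + toy["overview"].fillna("")

# Trim the leading space left behind when the tagline was empty.
combined = combined.str.strip()
print(combined.tolist())
```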

In [11]:
# Intentionally Left Blank

In [12]:
df = df[['title', 'concat_description']].copy()  # copy so later column assignments do not act on a slice
df.head()

Unnamed: 0,title,concat_description
0,Toy Story,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,Jumanji,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,Grumpier Old Men,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,Waiting to Exhale,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,Father of the Bride Part II,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


### Exercise 2 : Preprocessing Dataset
<b><div style="text-align: right">[POINTS: 5]</div></b>

Now that you have the title and the concatenated description of every movie, your task is to clean `concat_description`: lowercase the text, remove stop words, remove punctuation, remove blacklisted words such as **id, name and nan**, and remove numbers.

**Tasks:**

* lowercase the description
* remove the stop words
* remove punctuations such as "{", ".", "?", ":" etc.\
`Hint: Use nltk.tokenize.RegexpTokenizer()`
* remove blacklist words ("id", "name", and "nan")
* remove numbers



In [13]:
%%time
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

CPU times: total: 1.47 s
Wall time: 13.8 s


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:
def make_lower_case(text):
    """
    Args:
        text (str): a string
    Returns:
        text_lower (str):  a lowercased string
    """
    text_lower = None
    ### Ex-2-Task-1
    ### BEGIN SOLUTION 
    # YOUR CODE HERE
    if isinstance(text, str):   # check to avoid NaN
        text_lower = text.lower()
    else:
        text_lower = ""
    ### END SOLUTION
    return text_lower

In [15]:
# Intentionally Left Blank

In [16]:
from nltk.corpus import stopwords
def remove_stop_words(text):
    """
    Args:
        text (str): a string
    Returns:
        removed_stop_word_text (str):  string with removed stop words in the text
    """
    text = text.split()
    stop_words = set(stopwords.words("english"))
    removed_stop_word_text = None
    ### Ex-2-Task-2
    ### BEGIN SOLUTION 
    # YOUR CODE HERE
    filtered_words = [word for word in text if word.lower() not in stop_words]
    removed_stop_word_text = " ".join(filtered_words)
    ### END SOLUTION
    return removed_stop_word_text
    

In [17]:
# Intentionally Left Blank

In [18]:
def remove_blacklist_words(text):
    """
    Args:
        text (str): a string
    Returns:
        removed_black_list_text (str):  string with the blacklisted words removed
    """
    text = text.split()
    black_list = ["id", "name", "nan"]
    removed_black_list_text = None
    ### Ex-2-Task-3
    ### BEGIN SOLUTION 
    # YOUR CODE HERE
    filtered_words = [word for word in text if word.lower() not in black_list]
    removed_black_list_text = " ".join(filtered_words)
    ### END SOLUTION
    return removed_black_list_text

In [19]:
# Intentionally Left Blank

In [20]:
import re
def remove_numbers(text):
    """
    Args:
        text (str): a string
    Returns:
        removed_numbers_text (str):  string with the numbers removed
    """
    pattern = r'[0-9]'
    ### Ex-2-Task-4
    ### BEGIN SOLUTION 
    # YOUR CODE HERE
    removed_numbers_text = re.sub(pattern, '', text)
    ### END SOLUTION
    return removed_numbers_text

In [21]:
# Intentionally Left Blank

In [22]:
from nltk.tokenize import RegexpTokenizer
def remove_punctuation(text):
    """
    Args:
        text (str): a string
    Returns:
        removed_punctuation_text (str):  string with the punctuation removed
    """
    tokenizer = RegexpTokenizer(r'\w+')
    removed_punctuation_text = None
    ### Ex-2-Task-5
    ### BEGIN SOLUTION 
    # YOUR CODE HERE
    tokens = tokenizer.tokenize(text)
    removed_punctuation_text = " ".join(tokens)
    ### END SOLUTION
    return removed_punctuation_text

In [23]:
# Intentionally Left Blank

In [24]:
df['description'] = df['concat_description'].apply(make_lower_case)
df['description'] = df.description.apply(remove_stop_words)
df['description'] = df.description.apply(remove_punctuation)
df['description'] = df.description.apply(remove_blacklist_words)
df['description'] = df.description.apply(remove_numbers)
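
To see what each step contributes, here is a self-contained sketch of the same five steps, in the same order, applied to one made-up string. It is only an approximation of the cell above: `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English stop word list, and `re.findall(r"\w+", ...)` plays the role of `RegexpTokenizer(r'\w+')`.

```python
import re

# Hypothetical stand-ins, chosen only to keep the sketch free of downloads.
STOP_WORDS = {"the", "in", "his", "by", "a", "of"}
BLACKLIST = {"id", "name", "nan"}

def clean(text):
    text = text.lower()                                         # 1. lowercase
    tokens = [t for t in text.split() if t not in STOP_WORDS]   # 2. stop words
    tokens = re.findall(r"\w+", " ".join(tokens))               # 3. punctuation
    tokens = [t for t in tokens if t not in BLACKLIST]          # 4. blacklist
    return re.sub(r"[0-9]", "", " ".join(tokens))               # 5. digits

sample = "[{'id': 931, 'name': 'jealousy'}] nan Led by Woody, Andy's toys live happily in his room."
cleaned = clean(sample)
print(cleaned.split())
# ['jealousy', 'led', 'woody', 'andy', 's', 'toys', 'live', 'happily', 'room']
```

Two details of the real output show up here as well. Splitting on `\w+` turns `andy's` into `andy` and `s`, and deleting digits after the tokens have been joined leaves extra spaces where the numbers used to be.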

Let's look at the cleaned description.

In [25]:
df = df.drop(['concat_description'], axis=1)
df.head()

Unnamed: 0,title,description
0,Toy Story,jealousy toy boy friendship friends riva...
1,Jumanji,board game disappearance based children s b...
2,Grumpier Old Men,fishing best friend duringcreditsstinger o...
3,Waiting to Exhale,based novel interracial relationship single...
4,Father of the Bride Part II,baby midlife crisis confidence aging daug...


In [26]:
print(df['description'][0])

 jealousy  toy  boy  friendship  friends  rivalry  boy next door  new toy  toy comes life  animation  comedy  family en pixar animation studios  led woody andy s toys live happily room andy s birthday brings buzz lightyear onto scene afraid losing place andy s heart woody plots buzz circumstances separate buzz woody owner duo eventually learns put aside differences


In [27]:
df['description'][:10]

0     jealousy  toy  boy  friendship  friends  riva...
1     board game  disappearance  based children s b...
2     fishing  best friend  duringcreditsstinger  o...
3     based novel  interracial relationship  single...
4     baby  midlife crisis  confidence  aging  daug...
5     robbery  detective  bank  obsession  chase  s...
6     paris  brother brother relationship  chauffeu...
7     action  adventure  drama  family en walt disn...
8     terrorist  hostage  explosive  vice president...
9     cuba  falsely accused  secret identity  compu...
Name: description, dtype: object

### Exercise 3: Word Embeddings with Gensim Word2Vec 
<b><div style="text-align: right">[POINTS: 3]</div></b>

Now that you have preprocessed the dataset, your task is to build a model for embeddings with [gensim](https://radimrehurek.com/gensim/).

You will have to use the following parameters for your word2vec model.

Parameters:

1. **sentences = (list) processed text in the format of list of list of words.**

In your case `sentences` should be like:
```
[
  ['jealousy',  'toy', 'boy', 'friendship', 'friends', .....],  # description 0
  ['board',  'game',  'disappearance', 'based', 'children',.......],   # description 1
  ['fishing',  'best', 'friend', 'duringcreditsstinger', ....],  # description 2
                      ......
                      ......
  ['terrorist',  'hostage',  'explosive', 'vice', 'president', .....]  # description 8
                     ......
                     ......
]
```

2. **sg = 1**

3. **vector_size = 300** (this parameter was called `size` in gensim versions before 4.0)

4. **window = 10**

5. **min_count = 3**

6. **seed = 14**

**Tasks:**
* Assign `sentences` the processed text as a list of lists of words, as described under parameter 1 above.

* Create a gensim `Word2Vec` model using the parameters listed above.
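
To build some intuition for `sg = 1` and `window`, the sketch below lists the (center word, context word) pairs that a skip-gram model learns from, for a single tokenized description. It is a simplified illustration only: the helper `skipgram_pairs` is hypothetical, and the library applies further steps during training that are not modelled here.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the context window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["jealousy", "toy", "boy", "friendship"]
pairs = skipgram_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

With `window = 10`, as in this assignment, most words of a short description fall into each other's context, so words that co-occur in the same descriptions are pulled toward similar vectors.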

In [28]:
sentences = [lines.split() for lines in df['description']]
len(sentences)

46479

In [30]:
import pickle
from gensim.models import Word2Vec

### Ex-3-Task-1

import time
start_time = time.time()

sentences = None
model = None

### BEGIN SOLUTION 
# YOUR CODE HERE
sentences = [line.split() for line in df['description']]
model = Word2Vec(
    sentences=sentences,
    vector_size=300,   # size parameter
    window=10,         # context window
    min_count=3,       # ignore words with freq < 3
    sg=1,              # skip-gram
    seed=14
)

### END SOLUTION

print(f"Time for execution: {time.time() - start_time} secs")


Time for execution: 68.10414004325867 secs


In [31]:
# Intentionally Left Blank

In [32]:
model.wv.most_similar('pilot')

[('aviation', 0.7125465869903564),
 ('pilots', 0.708581805229187),
 ('airline', 0.7084277868270874),
 ('bomber', 0.6474608182907104),
 ('bombardier', 0.6446314454078674),
 ('stewardess', 0.6406971216201782),
 ('air', 0.639661967754364),
 ('carrier', 0.6387283205986023),
 ('controller', 0.6373656392097473),
 ('dirigible', 0.6369784474372864)]

In [33]:
model.wv.most_similar('animation')

[('animated', 0.6976865530014038),
 ('cartoon', 0.6790888905525208),
 ('pixar', 0.660089910030365),
 ('aardman', 0.6595355272293091),
 ('fleischer', 0.6400825381278992),
 ('barbera', 0.6341936588287354),
 ('disneytoon', 0.6327579021453857),
 ('animations', 0.6249199509620667),
 ('cgi', 0.6247080564498901),
 ('soyuzmultfilm', 0.6177193522453308)]

### Exercise 4: Averaging Word Embeddings of all of the Words in the Description
<b><div style="text-align: right">[POINTS: 5]</div></b>

Consider an example. Suppose you have the following description:
```
jealousy  toy  boy  friendship  friends  rivalry  boy next door  new toy  toy comes ...
```

You will sum all of the word vectors in the description


i.e.

$\text{sum_desc_vec = vec(jealousy) + vec(toy)+ vec(boy)+ vec(friendship)+ vec(friends) + vec(rivalry) + vec(boy) + vec(next) + ....}$


Here, $\text{vec(jealousy)}$ represents word vectors of the word **"jealousy"** and so on.


Then, you will estimate the averaged vector for a description as:


$\text{avg_desc_vec} = \frac{\text{sum_desc_vec}}{\text{number of words in the description}}$

Your task is to calculate the average of the word vectors for each description and store it in a column named `avg_description_vector`.


**Tasks:**
* Get the averaged vector of all of the words in the description.

**Note: average only the words that are in the vocabulary of the word embeddings, i.e. words in `model.wv.key_to_index` (in gensim versions before 4.0, `model.wv.vocab`).**

If none of the words in a description are in the vocabulary, assign the description an array of zeros with 300 dimensions, i.e. shape `(300,)`.

**Hint:** The word embedding vectors for the word `'word'` can be accessed as `model.wv['word']`.
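
As a worked example of the averaging rule, the sketch below uses a made-up 3-dimensional vocabulary (`toy_vectors`) in place of the trained model. The helper `average_vector` is hypothetical and only mirrors the rule described above: skip out-of-vocabulary words, and fall back to zeros when no word is known.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" standing in for model.wv.
toy_vectors = {
    "toy": np.array([1.0, 0.0, 2.0]),
    "boy": np.array([3.0, 2.0, 0.0]),
}

def average_vector(description, vectors, dim=3):
    # Collect the vectors of the words that are in the vocabulary.
    known = [vectors[word] for word in description.split() if word in vectors]
    if not known:
        return np.zeros(dim)         # no known words: zero vector
    return np.mean(known, axis=0)    # element-wise average

avg_known = average_vector("toy boy unknownword", toy_vectors)
avg_empty = average_vector("unknownword", toy_vectors)
print(avg_known)   # [2. 1. 1.]
print(avg_empty)   # [0. 0. 0.]
```

Note that the divisor is the number of known words (2 here), not the total number of words in the description (3).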

In [38]:
%%time
import numpy as np
def avg_desc_vector(description):
    """
    Get the average of the word vectors of the words in the description.

    Args:
        description(str): the combined description of a movie

    Returns:
        avg_desc_vec(numpy array): average of the vectors of the in-vocabulary
            words; an array of zeros if no word is in the vocabulary
    """
    num_words = 0
    words = description.split()
    vector_sum = np.zeros(model.vector_size)
    ### Ex-4-Task-1
    ### BEGIN SOLUTION 
    # YOUR CODE HERE
    for word in words:
        if word in model.wv.key_to_index:  # check if word is in vocab
            vector_sum += model.wv[word]
            num_words += 1
    if num_words > 0:
        avg_desc_vec = vector_sum / num_words
    else:
        avg_desc_vec = vector_sum  # no known words: keep the zero vector
    ### END SOLUTION
    return avg_desc_vec

CPU times: total: 0 ns
Wall time: 0 ns


In [39]:
# Intentionally Left Blank

In [40]:
df['avg_description_vector'] = None
### Ex-4-Task-2
### BEGIN SOLUTION 
# YOUR CODE HERE
df['avg_description_vector'] = df['description'].apply(avg_desc_vector)
### END SOLUTION

In [41]:
# Intentionally Left Blank

In [42]:
df.head()

Unnamed: 0,title,description,avg_description_vector
0,Toy Story,jealousy toy boy friendship friends riva...,"[-0.0406301780767216, 0.15173303371850858, 0.0..."
1,Jumanji,board game disappearance based children s b...,"[0.05615052352301203, 0.14558140348929624, 0.0..."
2,Grumpier Old Men,fishing best friend duringcreditsstinger o...,"[0.04322379522232546, 0.12761024502752763, 0.0..."
3,Waiting to Exhale,based novel interracial relationship single...,"[-0.0022949514868245884, 0.19027576149779965, ..."
4,Father of the Bride Part II,baby midlife crisis confidence aging daug...,"[-0.02222414949781024, 0.17766872473864578, -0..."


In [43]:
import numpy as np
avg_desc_vector_all = np.array(df['avg_description_vector'].to_list())
avg_desc_vector_all.shape

(46479, 300)


### Exercise 5: Cosine Similarities between Movie and all of the Movies in the DataFrame
<b><div style="text-align: right">[POINTS: 3]</div></b>


Your task is to get the cosine similarities between the given movie and all of the movies in the DataFrame.

**Tasks:**
* `movie_index`: Get the index of the movie in the DataFrame

* `movie_avg_desc_vector`: Get the Description Vector of the movie

* `cosine_similarities`: Get the cosine similarities between description vector of the movie and all of the descriptions (vectors) of the DataFrame.
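
Cosine similarity between two vectors $a$ and $b$ is $\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$: 1 when they point the same way and 0 when they are orthogonal. The NumPy sketch below computes it between one query vector and every row of a small made-up matrix. The helper `cosine_to_all` is hypothetical, and it makes the assumption that a zero vector (a description with no known words) should get a similarity of 0 rather than trigger a division by zero.

```python
import numpy as np

def cosine_to_all(query, matrix):
    """Cosine similarity between `query` and each row of `matrix`."""
    dots = matrix @ query                                    # a . b for every row
    denom = np.linalg.norm(query) * np.linalg.norm(matrix, axis=1)
    # Divide only where the denominator is non-zero; leave 0.0 elsewhere.
    return np.divide(dots, denom, out=np.zeros_like(dots, dtype=float), where=denom > 0)

matrix = np.array([
    [1.0, 0.0],   # the query itself
    [0.0, 2.0],   # orthogonal to the query
    [3.0, 3.0],   # 45 degrees from the query
    [0.0, 0.0],   # zero vector
])
scores = cosine_to_all(matrix[0], matrix)
print(scores)
```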

In [45]:
from sklearn.metrics.pairwise import cosine_similarity

def similarity_scores(movie, df=df, avg_desc_vector_all=avg_desc_vector_all):
    """
    Similarity scores between movie and all of the movies in the dataframe.
    
    Args:
        movie (str): Title of the Movie
        df(DataFrame): DataFrame containing Movie title and Movie Description
        avg_desc_vector_all(numpy array): average embeddings of all of the Movie Descriptions

    Returns:
        list: list of top 5 similar Movie title with their cosine similarities score
    """
    movie_index = None
    movie_avg_desc_vector = None
    cosine_similarities = None
    ### Ex-5-Task-1
    ### BEGIN SOLUTION 
    # YOUR CODE HERE
    movie_index = df[df['title'] == movie].index[0]
    movie_avg_desc_vector = df.loc[movie_index, 'avg_description_vector'].reshape(1, -1)
    # reuse the precomputed matrix of description vectors passed in as an argument
    cosine_similarities = cosine_similarity(movie_avg_desc_vector, avg_desc_vector_all).flatten()
    similar_indices = cosine_similarities.argsort()[::-1][1:6]
    top_movies = [(df.iloc[i]['title'], cosine_similarities[i]) for i in similar_indices]

    ### END SOLUTION
    
    return top_movies

In [46]:
# Intentionally Left Blank

### Exercise 6: Movie Recommendation
<b><div style="text-align: right">[POINTS: 2]</div></b>


Now, your task is to recommend the top 5 most similar movies based on the cosine similarity scores between the averaged description vector of the given movie and all of the averaged description vectors.

**Tasks:**
* cosine_similarities: cosine similarities between description vector of the movie and all of the descriptions (vectors) of the dataset.

In [47]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def recommendations(movie, df=df):
    """
    Recommending the top 5 similar movies

    Args:
        movie (str): Title of the Movie
        df (DataFrame): DataFrame containing Movie title and Movie Description
    Returns:
        list: list of top 5 similar Movie title with their cosine similarities score
    """
    
    cosine_similarities = None
    ### Ex-6-Task-1
    ### BEGIN SOLUTION 
    # YOUR CODE HERE
    movie_index = df[df['title'] == movie].index[0]
    movie_vector = df.loc[movie_index, 'avg_description_vector'].reshape(1, -1)
    all_vectors = np.vstack(df['avg_description_vector'].values)
    cosine_similarities = cosine_similarity(movie_vector, all_vectors)
    ### END SOLUTION
    similarities_scores = list(enumerate(cosine_similarities.squeeze().tolist()))  # index and vector values of cosine similarities
    sorted_similarities_scores = sorted(similarities_scores, key = lambda x: x[1], reverse = True) # sorted in descending order of index and vector values of cosine similarities
    top5_sim_scores = sorted_similarities_scores[1:6]  # top 5 indices and similarity scores
    top5_movie_indices = [index for index, score in top5_sim_scores] # top 5 movie indices
    top5_movie_scores = [score for index, score in top5_sim_scores] # top 5 movie scores
    top5_movie_titles = df.iloc[top5_movie_indices]['title'].tolist() # top 5 movie titles
    
    return list(zip(top5_movie_titles, top5_movie_scores)) 
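
The function above skips the first entry of the sorted scores on the assumption that it is the query movie itself. If another movie had an identical description vector, the two would tie at the top and the query might not be the entry that gets skipped. As a sketch of one alternative, the hypothetical helper `top_k` below excludes the query by its position instead. The scores are made up.

```python
import numpy as np

def top_k(scores, query_index, k=5):
    """Positions of the k highest scores, excluding the query's own position."""
    order = np.argsort(scores)[::-1]                      # highest score first
    order = [int(i) for i in order if i != query_index]   # drop the query itself
    return order[:k]

# Hypothetical scores: positions 0 and 2 tie at 1.0, and 2 is the query.
scores = np.array([1.0, 0.2, 1.0, 0.9, 0.5])
best = top_k(scores, query_index=2, k=3)
print(best)   # [0, 3, 4]
```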

In [48]:
# Intentionally Left Blank

Now that you have successfully built the recommender, let's look at some of the recommendations the model gives for a few test movies.

In [49]:
recommendations("Toy Story")

[('Toy Story 2', 0.9539537628291224),
 ('Toy Story 3', 0.9484250666014553),
 ('The Tangerine Bear: Home in Time for Christmas!', 0.9372639755000818),
 ('Simply Irresistible', 0.9331920970979619),
 ('The Mouse and His Child', 0.9309377852336275)]

In [50]:
recommendations("The Godfather")

[('The Godfather: Part II', 0.9566672321526715),
 ('The Godfather: Part III', 0.9416939692131607),
 ('A Bronx Tale', 0.9372969003235181),
 ("Jane Austen's Mafia!", 0.9325540149339953),
 ('Trespass Against Us', 0.932519292878916)]

In [51]:
recommendations("Avatar")

[('Aliens', 0.9624112131472289),
 ('Planet of the Apes', 0.9545358557637249),
 ('Planet of the Apes', 0.9530490941916165),
 ('The Day the Earth Stood Still', 0.9514794141844011),
 ('Dune', 0.9494578456715068)]

In [52]:
recommendations("The Fault in Our Stars")

[('Fallen', 0.9410216198280146),
 ('The Lovely Bones', 0.939780525951233),
 ('Miracles from Heaven', 0.9393165585512979),
 ('Twilight', 0.9390665253379721),
 ('A Life Less Ordinary', 0.9368484984300887)]

Congratulations on successfully implementing a recommendation system using various NLP concepts!