<a id='Q0'></a>
<center><a target="_blank" href="https://learning.constructor.org/"><img src="https://drive.google.com/uc?id=1RNy-ds7KWXFs7YheGo9OQwO3OnpvRSU1" width="200" style="background:none; border:none; box-shadow:none;" /></a> </center>

_____

<center> <h1> Movie Recommendations with Document Similarity </h1> </center>

<p style="margin-bottom:1cm;"></p>

_____

<center>Constructor Academy, 2024</center>



Recommender systems are among the most popular and widely adopted applications of machine learning. They recommend entities to users, and these entities can be anything: products, movies, services and so on.

Popular examples of recommendations include:
- Amazon suggesting products on its website
- Amazon Prime, Netflix and Hotstar recommending movies and shows
- YouTube recommending videos to watch

Typically recommender systems can be implemented in three ways:

- Simple Rule-based Recommenders: Typically based on specific global metrics and thresholds such as movie popularity or global ratings.
- Content-based Recommenders: These suggest entities that are similar to a specific entity of interest. Content metadata such as movie descriptions, genre, cast and director can be used here.
- Collaborative filtering Recommenders: Here we don't need metadata; instead we try to predict recommendations and ratings from the past ratings that different users gave to specific items.

We will build a movie recommendation system that uses data and metadata about different movies to recommend similar movies of interest!

![](https://i.imgur.com/c7Go7d3.png)

Since our focus is not really recommendation engines but NLP, we will leverage the text-based metadata of each movie to recommend movies similar to a specific movie of interest. This falls under content-based recommenders.

# Install Dependencies

In [None]:
!pip install textsearch
!pip install contractions
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# Load and View Data

In [None]:
import pandas as pd

df = pd.read_csv('https://github.com/dipanjanS/nlp_workshop_dhs18/raw/master/Unit%2010%20-%20Project%208%20-%20Movie%20Recommendations%20with%20Document%20Similarity/tmdb_5000_movies.csv.gz', compression='gzip')
df.info()

In [None]:
df.head()

## Let's focus on only the tagline and overview fields

__Your Task:__ Concatenate the `tagline` and `overview` fields and store the result in a new dataframe column called `description`.

In [None]:
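df = df[['title', 'tagline', 'overview', 'popularity']].copy()  # copy so we do not modify a slice of the original frame
df['tagline'] = df['tagline'].fillna('')  # missing taglines become empty strings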

df['description'] = <YOUR CODE HERE>

df.dropna(inplace=True)
df = df.sort_values(by=['popularity'], ascending=False)
df.info()

In [None]:
df.head()

# Build a Movie Recommender System

Here you will build your own movie recommender system. We will use the following pipeline:
- Text pre-processing
- Feature Engineering
- Document Similarity Computation
- Find top similar movies
- Build a movie recommendation function


## Document Similarity

Recommendations are about understanding the underlying features which make us favour one choice over another. Similarity between items (in this case movies) is one way of understanding why we choose one movie over another. There are different ways to calculate similarity between two items. One of the most widely used measures is __cosine similarity__, which we have already used in the previous unit.

### Cosine Similarity

Cosine Similarity is used to calculate a numeric score to denote the similarity between two text documents. Mathematically, it is defined as follows:

$$ \text{cosine}(x, y) = \frac{x \, y^\intercal}{\lVert x \rVert \, \lVert y \rVert} $$
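
As a quick worked example of the formula, the sketch below computes cosine similarity for a few small hand-made vectors. The function name `cosine_similarity_score` and the toy vectors are illustrative assumptions and are not part of the notebook's pipeline.

```python
import math

def cosine_similarity_score(x, y):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # convention chosen for this sketch: an all-zero vector is similar to nothing
        return 0.0
    return dot / (norm_x * norm_y)

doc_a = [1, 2, 0]
doc_b = [2, 4, 0]   # same direction as doc_a, twice the length
doc_c = [0, 0, 3]   # shares no non-zero dimension with doc_a

same_direction = cosine_similarity_score(doc_a, doc_b)
orthogonal = cosine_similarity_score(doc_a, doc_c)
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while vectors with no overlapping dimensions score 0. This is why the measure suits documents of different lengths.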

In [None]:
import nltk
import re
import numpy as np
import contractions

stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # fix contractions
    doc = <YOUR CODE HERE>
    # remove special characters
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    # lower case
    doc = <YOUR CODE HERE>
    # strip whitespaces
    doc = <YOUR CODE HERE>
    # tokenize document
    tokens = <YOUR CODE HERE>
    #filter stopwords out of document
    filtered_tokens = <YOUR CODE HERE>
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(list(df['description']))
len(norm_corpus)
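
To make the effect of these pre-processing steps concrete, here is a minimal standard-library sketch on a single toy sentence. It is a simplification under stated assumptions: `simple_normalize` and `TOY_STOP_WORDS` are hypothetical names, the tiny stopword set stands in for NLTK's English list, whitespace splitting stands in for a real tokenizer, and contraction expansion is left out.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def simple_normalize(doc):
    # drop everything except ASCII letters, digits and whitespace
    doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc)
    # lower-case and trim surrounding whitespace
    doc = doc.lower().strip()
    # naive whitespace tokenization (assumption: good enough for a toy example)
    tokens = doc.split()
    # remove stopwords and rebuild the document
    kept = [tok for tok in tokens if tok not in TOY_STOP_WORDS]
    return " ".join(kept)

example = simple_normalize("  The Minions: a STORY of loyalty, and bananas!  ")
print(example)
```

The cleaned string keeps only the content-bearing words, which is what the feature extraction step should see.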

## Extract TF-IDF Features

In [None]:
<YOUR CODE HERE>

tfidf_matrix = <YOUR CODE HERE>
tfidf_matrix.shape
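
To show what a TF-IDF matrix contains, the sketch below builds one by hand for three toy documents. It is not the notebook's solution: `tfidf_rows` is a hypothetical helper, and the weighting (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows) is an assumption chosen to resemble the defaults scikit-learn documents for its TF-IDF vectorizer. In practice a library implementation is preferable.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # smoothed inverse document frequency (assumed variant, see lead-in)
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocab, rows

toy_docs = [
    "space crew explores planet",
    "crew fights alien planet",
    "family comedy road trip",
]
vocab, rows = tfidf_rows(toy_docs)
print(len(rows), len(vocab))
```

Each document becomes one row with one column per vocabulary term. A term that occurs in fewer documents (such as `space`) receives a higher weight than one shared across documents (such as `crew`).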

## Compute Pairwise Document Similarity

In [None]:
<YOUR CODE HERE>

doc_sim = <YOUR CODE HERE>
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()
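
The pairwise similarity matrix can be understood as a matrix product of unit-length rows. The sketch below illustrates this with NumPy on a toy feature matrix. `pairwise_cosine` and `toy_features` are hypothetical names used only for illustration, and a dense array is assumed (a sparse TF-IDF matrix would need a sparse-aware routine).

```python
import numpy as np

def pairwise_cosine(matrix):
    """Cosine similarity between every pair of rows of a dense 2-D array."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by zero
    unit = matrix / norms
    return unit @ unit.T     # entry (i, j) is the cosine between row i and row j

toy_features = np.array([[1.0, 1.0, 0.0],
                         [2.0, 2.0, 0.0],
                         [0.0, 0.0, 5.0]])
toy_sim = pairwise_cosine(toy_features)
print(np.round(toy_sim, 2))
```

The result is a square, symmetric matrix with ones on the diagonal, since every document is maximally similar to itself.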

## Get List of Movie Titles

In [None]:
movies_list = df['title'].values
movies_list, movies_list.shape

## Find Top Similar Movies for a Sample Movie

Let's take __Minions__, the most popular movie in the dataframe above, and try to find the most similar movies which can be recommended.

#### Find movie ID for 'Minions'

In [None]:
movie_idx = <YOUR CODE HERE>
movie_idx

#### Get movie similarities

In [None]:
movie_similarities = <YOUR CODE HERE>
movie_similarities

#### Get top 5 similar movie IDs

In [None]:
similar_movie_idxs = <YOUR CODE HERE>
similar_movie_idxs

#### Get top 5 similar movies

In [None]:
similar_movies = <YOUR CODE HERE>
similar_movies
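
The lookup steps above (find the index, read its similarity row, sort, map indices back to titles) can be sketched on a toy similarity matrix as follows. The helper `top_k_similar`, the titles and the scores are all made up for illustration. Excluding the query movie from its own results is a design choice assumed here.

```python
import numpy as np

def top_k_similar(title, titles, sim_matrix, k=5):
    """Return the k titles most similar to `title`, excluding the title itself."""
    titles = np.asarray(titles)
    matches = np.where(titles == title)[0]
    if len(matches) == 0:
        raise ValueError(f"unknown title: {title!r}")
    idx = matches[0]
    scores = np.asarray(sim_matrix)[idx]
    order = np.argsort(-scores)        # indices from most to least similar
    order = order[order != idx][:k]    # drop the query itself, keep the top k
    return titles[order].tolist()

toy_titles = ["A", "B", "C", "D"]
toy_sims = np.array([[1.0, 0.2, 0.9, 0.5],
                     [0.2, 1.0, 0.1, 0.3],
                     [0.9, 0.1, 1.0, 0.4],
                     [0.5, 0.3, 0.4, 1.0]])
top_for_a = top_k_similar("A", toy_titles, toy_sims, k=2)
print(top_for_a)
```

For movie `A`, the highest off-diagonal scores in its row belong to `C` (0.9) and `D` (0.5), so those two are returned in that order.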

### Build a movie recommender function to recommend top 5 similar movies for any movie

The movie title, movie title list and document similarity matrix dataframe will be given as inputs to the function

In [None]:
def movie_recommender(movie_title, movies=movies_list, doc_sims=doc_sim_df):
    # find movie id
    movie_idx = <YOUR CODE HERE>
    # get movie similarities
    movie_similarities = <YOUR CODE HERE>
    # get top 5 similar movie IDs
    similar_movie_idxs = <YOUR CODE HERE>
    # get top 5 movies
    similar_movies = <YOUR CODE HERE>
    # return the top 5 movies
    return similar_movies

# Get popular Movie Recommendations

In [None]:
popular_movies = ['Minions', 'Interstellar', 'Deadpool', 'Jurassic World', 'Pirates of the Caribbean: The Curse of the Black Pearl',
              'Dawn of the Planet of the Apes', 'The Hunger Games: Mockingjay - Part 1', 'Terminator Genisys',
              'Captain America: Civil War', 'The Dark Knight', 'The Martian', 'Batman v Superman: Dawn of Justice',
              'Pulp Fiction', 'The Godfather', 'The Shawshank Redemption', 'The Lord of the Rings: The Fellowship of the Ring',
              'Harry Potter and the Chamber of Secrets', 'Star Wars', 'The Hobbit: The Battle of the Five Armies',
              'Iron Man']

In [None]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie, movies=movies_list, doc_sims=doc_sim_df))
    print()

# Movie Recommendation with Embeddings

We used count-based, normalized (TF-IDF) features in the previous section. Can we use word embeddings instead and then compute movie similarity? We definitely can! Here we will use the FastText model and train it on our corpus.

The FastText model was introduced by Facebook in 2016 as an extension of, and arguably an improvement on, the vanilla Word2Vec model. It is based on the paper ‘Enriching Word Vectors with Subword Information’ by Bojanowski et al. (with Mikolov as a co-author), which is an excellent read for an in-depth understanding of how the model works. Overall, FastText is a framework for learning word representations and for performing robust, fast and accurate text classification. The framework is open-sourced by Facebook on GitHub and claims to offer the following:
- Recent state-of-the-art English word vectors.
- Word vectors for 157 languages trained on Wikipedia and Crawl.
- Models for language identification and various supervised tasks.

Though I haven’t implemented this model from scratch, here is what I learnt from the research paper about how it works. In general, predictive models like Word2Vec treat each word as a distinct entity (e.g. `where`) and generate a dense embedding for it. This is a serious limitation for languages with massive vocabularies and many rare words that seldom occur in a given corpus. Word2Vec ignores the morphological structure of each word and treats the word as a single unit. FastText instead considers each word as a bag of character n-grams; the paper calls this a subword model.

We add special boundary symbols < and > at the beginning and end of words. This enables us to distinguish prefixes and suffixes from other character sequences. We also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to its character n-grams). Taking the word `where` and n=3 (tri-grams) as an example, it will be represented by the character n-grams `<wh, whe, her, ere, re>` and the special sequence `<where>` representing the whole word. Note that the sequence `<her>`, corresponding to the word `her`, is different from the tri-gram `her` from the word `where`.
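
The n-gram extraction described above is easy to reproduce by hand. The sketch below uses a hypothetical helper, `char_ngrams`, with a single fixed n for clarity. This is a simplification, since the paper extracts n-grams over a range of lengths rather than just one.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in < and >, plus the whole wrapped word."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    if padded not in grams:
        grams.append(padded)  # special sequence representing the full word
    return grams

where_grams = char_ngrams("where", n=3)
her_grams = char_ngrams("her", n=3)
print(where_grams)
print(her_grams)
```

The tri-gram `her` appears among the n-grams of `where`, while the whole-word sequence `<her>` only appears for the word `her` itself, which is exactly the distinction the boundary symbols are meant to preserve.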

Here we leverage `gensim` to build our embeddings

## Build the FastText embedding model here

Remember: more iterations usually give better embeddings, but training will take longer depending on your system's CPU.

50 iterations might take 15-20 minutes.

In [None]:
<YOUR CODE HERE>

tokenized_docs = <YOUR CODE HERE>
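# ideal config params: vector size 300, window 30, min_count 2 or more, 50 or more training iterations (use 10 if it takes too much time)
# note: gensim 4.x names these parameters vector_size and epochs; older releases used size and iter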
ft_model = <YOUR CODE HERE>

# Generate document level embeddings

Word embedding models give us an embedding for each word. How can we use them for downstream ML and DL tasks? One way is to flatten the word embeddings or to use sequential models. A simpler approach is to average the embeddings of all words in a document to produce a fixed-length, document-level embedding.
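
The averaging idea can be sketched with a toy lookup table in place of a trained model. Everything here is an illustrative assumption: `average_word_vectors` and `toy_vectors` are hypothetical names, and a plain dictionary stands in for the model's word vectors. Skipping unknown words is also a simplification, since a subword model such as FastText can build vectors for unseen words from their character n-grams.

```python
import numpy as np

def average_word_vectors(tokens, word_vectors, num_features):
    """Mean of the vectors of all known tokens; a zero vector if none are known."""
    total = np.zeros(num_features, dtype=float)
    n_known = 0
    for tok in tokens:
        if tok in word_vectors:
            total += word_vectors[tok]
            n_known += 1
    return total / n_known if n_known else total

# toy 2-dimensional "embeddings" (made-up values)
toy_vectors = {
    "space": np.array([1.0, 0.0]),
    "crew": np.array([0.0, 1.0]),
}
doc_vec = average_word_vectors(["space", "crew", "unknownword"], toy_vectors, num_features=2)
print(doc_vec)
```

Whatever the document length, the result always has `num_features` entries, so documents can be compared directly with cosine similarity.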

In [None]:
def averaged_word2vec_vectorizer(corpus, model, num_features):
    <YOUR CODE HERE>
    return np.array(features)

In [None]:
doc_vecs_ft = <YOUR CODE HERE>
doc_vecs_ft.shape

# Get Movie Recommendations

We will leverage cosine similarity again to generate recommendations

In [None]:
doc_sim = <YOUR CODE HERE>
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

In [None]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie, movies=movies_list, doc_sims=doc_sim_df))
    print()