<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Recommendation-System" data-toc-modified-id="Recommendation-System-1">Recommendation System</a></span><ul class="toc-item"><li><span><a href="#Importing-Libraries" data-toc-modified-id="Importing-Libraries-1.1">Importing Libraries</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-2">Conclusion</a></span></li></ul></div>

![Screen%20Shot%202023-04-28%20at%205.03.05%20PM.png](attachment:Screen%20Shot%202023-04-28%20at%205.03.05%20PM.png)

![Screen%20Shot%202023-05-08%20at%201.49.10%20PM.png](attachment:Screen%20Shot%202023-05-08%20at%201.49.10%20PM.png)

### Recommendation System

In addition to predicting movie popularity, we implement a recommendation system that embeds each movie's overview and ranks movies by cosine similarity to a user's text query. The aim is to keep users engaged by surfacing movies they are likely to enjoy.

#### Importing Libraries

In [1]:
import pandas as pd
from tqdm.notebook import tqdm
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from sentence_transformers import SentenceTransformer

tqdm.pandas()

In [3]:
# Loading the data
tmdb_5000_cred = pd.read_csv(r'D:\OneDrive - NITT\Custom_Download\tmdb_5000_credits.csv', index_col=False)
tmdb_5000_mov = pd.read_csv(r'D:\OneDrive - NITT\Custom_Download\tmdb_5000_movies.csv',index_col=False)

In [4]:
# Merging the data
tmdb_5000_cred.columns = ['id','tittle','cast','crew']
tmdb_5000_mov = tmdb_5000_mov.merge(tmdb_5000_cred,on='id')

In [5]:
# Creating a copy 
df = tmdb_5000_mov.copy()

In [6]:
# Storing the necessary columns
imp_cols = ['genres','original_title','overview','popularity']

In [7]:
# Storing the columns in a new dataframe
data = df[imp_cols]

In [8]:
# Viewing dataframe 
data.head()

| | genres | original_title | overview | popularity |
|---|---|---|---|---|
| 0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 |
| 1 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 |
| 2 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 |
| 3 | [{"id": 28, "name": "Action"}, {"id": 80, "nam... | The Dark Knight Rises | Following the death of District Attorney Harve... | 112.31295 |
| 4 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | John Carter | John Carter is a war-weary, former military ca... | 43.926995 |


In [9]:
# Extracting names from a JSON-encoded list of dictionaries
import json

def get_val(dictionary_list):
    # json.loads parses the string without executing it, unlike eval
    val = [d['name'] for d in json.loads(dictionary_list)]
    return val
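
To make the extraction concrete, here is a minimal standalone sketch on one raw `genres` cell. It assumes the cell is a JSON-encoded list of objects with a `name` key, as in the table above, and parses it with the standard `json` module. The helper name `extract_names` and the sample string are illustrative only.

```python
import json

def extract_names(cell):
    """Return the 'name' field of every object in a JSON-encoded list."""
    return [item["name"] for item in json.loads(cell)]

# Illustrative sample shaped like one raw `genres` cell
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
names = extract_names(sample)
print(names)
```

An empty list such as `'[]'` simply yields an empty list of names, which matches the empty `genres` entries that appear in the dataset.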

In [10]:
# Replacing the raw genres column with the extracted names
data['genres'] = data['genres'].progress_apply(get_val)

  0%|          | 0/4803 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['genres'] = data['genres'].progress_apply(get_val)


In [11]:
# Viewing genres
data['genres']

0       [Action, Adventure, Fantasy, Science Fiction]
1                        [Adventure, Fantasy, Action]
2                          [Action, Adventure, Crime]
3                    [Action, Crime, Drama, Thriller]
4                [Action, Adventure, Science Fiction]
                            ...                      
4798                        [Action, Crime, Thriller]
4799                                [Comedy, Romance]
4800               [Comedy, Drama, Romance, TV Movie]
4801                                               []
4802                                    [Documentary]
Name: genres, Length: 4803, dtype: object

In [12]:
# Viewing the overview at index 10
data['overview'][10]

'Superman returns to discover his 5-year absence has allowed Lex Luthor to walk free, and that those he was closest too felt abandoned and have moved on. Luthor plots his ultimate revenge that could see millions killed and change the face of the planet forever, as well as ridding himself of the Man of Steel.'

# Huggingface embedding

The "paraphrase-MiniLM-L6-v2" model is a sentence-embedding model that converts text into fixed-length numerical vectors. These embeddings capture the meaning of the input text, so they can be compared, in our case by cosine similarity, to find the movie overviews that most closely match a query.
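
For reference, the cosine similarity between a query embedding $\mathbf{q}$ and a movie embedding $\mathbf{m}$ (symbols chosen here for illustration, not taken from the notebook) is

$$
\operatorname{sim}(\mathbf{q}, \mathbf{m}) = \frac{\mathbf{q} \cdot \mathbf{m}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{m} \rVert} = \frac{\sum_{i=1}^{d} q_i m_i}{\sqrt{\sum_{i=1}^{d} q_i^2}\,\sqrt{\sum_{i=1}^{d} m_i^2}}
$$

where $d$ is the embedding dimension. The value lies in $[-1, 1]$, and a larger value means the two texts point in a more similar direction in embedding space.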

In [13]:
# pip install -U sentence-transformers

In [14]:
# Storing the model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [15]:
# Viewing 
data

| | genres | original_title | overview | popularity |
|---|---|---|---|---|
| 0 | [Action, Adventure, Fantasy, Science Fiction] | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 |
| 1 | [Adventure, Fantasy, Action] | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 |
| 2 | [Action, Adventure, Crime] | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 |
| 3 | [Action, Crime, Drama, Thriller] | The Dark Knight Rises | Following the death of District Attorney Harve... | 112.312950 |
| 4 | [Action, Adventure, Science Fiction] | John Carter | John Carter is a war-weary, former military ca... | 43.926995 |
| ... | ... | ... | ... | ... |
| 4798 | [Action, Crime, Thriller] | El Mariachi | El Mariachi just wants to play his guitar and ... | 14.269792 |
| 4799 | [Comedy, Romance] | Newlyweds | A newlywed couple's honeymoon is upended by th... | 0.642552 |
| 4800 | [Comedy, Drama, Romance, TV Movie] | Signed, Sealed, Delivered | "Signed, Sealed, Delivered" introduces a dedic... | 1.444476 |
| 4801 | [] | Shanghai Calling | When ambitious New York attorney Sam is sent t... | 0.857008 |


The function below takes an overview text as input and returns its embedding from the pre-trained Hugging Face model.

In [16]:
# Getting the embedding
def get_embedding_sent(overview):
    
    #Sentences we want to encode.
    sentence = [overview]

    #Sentences are encoded by calling model.encode()
    embedding = model.encode(sentence)
    return embedding[0].tolist()
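
Encoding one overview per call works, but `model.encode` also accepts a list of sentences, so overviews could be passed in chunks instead. The sketch below shows only the chunking logic. `embed_in_batches` and `fake_encode` are hypothetical names, and `fake_encode` merely stands in for `model.encode` so that the example runs without downloading a model.

```python
def embed_in_batches(encode_fn, texts, batch_size=64):
    """Call encode_fn on consecutive chunks of texts and collect the vectors."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # encode_fn is expected to return one vector per input text
        embeddings.extend(list(vec) for vec in encode_fn(chunk))
    return embeddings

def fake_encode(sentences):
    """Stand-in for model.encode: maps each text to a tiny 2-number vector."""
    return [[float(len(s)), float(s.count(" "))] for s in sentences]

overviews = ["A Marine on an alien moon", "A spy on a mission", "Pirates"]
vectors = embed_in_batches(fake_encode, overviews, batch_size=2)
print(len(vectors), vectors[2])
```

With the real model, a call might look like `embed_in_batches(model.encode, data['overview'].tolist())`, though whether this is faster in practice depends on the hardware and batch size.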

In [17]:
# Viewing data
data

| | genres | original_title | overview | popularity |
|---|---|---|---|---|
| 0 | [Action, Adventure, Fantasy, Science Fiction] | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 |
| 1 | [Adventure, Fantasy, Action] | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 |
| 2 | [Action, Adventure, Crime] | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 |
| 3 | [Action, Crime, Drama, Thriller] | The Dark Knight Rises | Following the death of District Attorney Harve... | 112.312950 |
| 4 | [Action, Adventure, Science Fiction] | John Carter | John Carter is a war-weary, former military ca... | 43.926995 |
| ... | ... | ... | ... | ... |
| 4798 | [Action, Crime, Thriller] | El Mariachi | El Mariachi just wants to play his guitar and ... | 14.269792 |
| 4799 | [Comedy, Romance] | Newlyweds | A newlywed couple's honeymoon is upended by th... | 0.642552 |
| 4800 | [Comedy, Drama, Romance, TV Movie] | Signed, Sealed, Delivered | "Signed, Sealed, Delivered" introduces a dedic... | 1.444476 |
| 4801 | [] | Shanghai Calling | When ambitious New York attorney Sam is sent t... | 0.857008 |


In [18]:
# Dropping null values
data.dropna(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.dropna(inplace=True)


We add a new column holding the embedding of each movie's overview.

In [None]:
# Calculating sentence embeddings for movie overviews
data['Sent_Embedding'] = data['overview'].progress_apply(get_embedding_sent)
data.head()

  0%|          | 0/4800 [00:00<?, ?it/s]

We convert the input text into a numerical representation using an embedding function. We then compute the cosine similarity between the input text embedding and the embedding of every movie, sort the dataframe by this similarity, and select the 10 movies with the highest values.

The selected movies are returned with their titles, similarity values, genres, and popularity, giving a list of recommendations that closely match the type of movie described in the input.
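
The ranking step described above can be sketched without any libraries. This is an illustration on made-up three-dimensional vectors, not the notebook's data. The names `cosine`, `top_k`, and the toy catalogue are hypothetical, and the sketch assumes no vector is all zeros.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, catalogue, k=10):
    """Return the k (title, similarity) pairs with the highest similarity."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in catalogue.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up embeddings, for illustration only
catalogue = {
    "Movie A": [1.0, 0.0, 0.0],
    "Movie B": [0.7, 0.7, 0.0],
    "Movie C": [0.0, 0.0, 1.0],
}
ranked = top_k([1.0, 0.1, 0.0], catalogue, k=2)
print([title for title, _ in ranked])
```

The notebook's own function does the same thing at scale, using `cosine_similarity` from scikit-learn and a dataframe sort in place of the pure-Python loop.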

In [None]:
def get_recommendation(text,embd_fn,embd_col):
    
    # Creating a copy so the original dataframe is left untouched
    temp = data.copy()
    
    # Getting the embedding of the input text
    y = embd_fn(text)
    
    # Stacking the stored movie embeddings into a 2-D array
    x_embed = np.array(temp[embd_col].tolist())
    
    # Calculating the cosine similarity between every movie and the input text
    cs = cosine_similarity(x_embed,np.array(y).reshape(1,-1))
    
    # Assigning the cosine similarity values to the 'similarity' column
    temp['similarity'] = cs.ravel()
    
    # Sorting the dataframe by similarity, highest first
    temp = temp.sort_values('similarity',ascending=False)
    
    # Selecting the columns of interest and keeping the top 10 results
    temp = temp[['original_title','similarity',
                 'genres','popularity']].head(10)

    return temp

In [None]:
# Store Example 
ex = get_recommendation('horror movie with action',get_embedding_sent,'Sent_Embedding') # return 10 recommendation

In [None]:
# Resetting the index 
ex.reset_index(drop=True)

In [None]:
# Looking into the data before we store it
data

In [None]:
# Exporting the data that is needed
data.to_csv('Embedding_chkpoint_1.csv',index=False)


# Huggingface

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
from transformers import AutoModelForSequenceClassification
from transformers import (
    AutoModel, AutoConfig, 
    AutoTokenizer, logging
)
logging.set_verbosity_error()
logging.set_verbosity_warning()

max_seq_length = 512

In [None]:
# _pretrained_model = "nickmuchi/finbert-tone-finetuned-fintwitter-classification" #'roberta-base'

Steps in `get_embedding_bert`:

* Retrieving the name of the pre-trained checkpoint.

* Loading the model configuration with the option to output hidden states enabled.

* Initializing the BERT sequence-classification model from the pre-trained checkpoint with that configuration.

* Initializing the tokenizer associated with the pre-trained checkpoint.

* Tokenizing the input overview text, adding special tokens and padding or truncating to `max_seq_length`.

* Passing the tokenized input to the BERT model and retrieving the outputs, including the hidden states.

* Stacking all the hidden states into a single tensor.

* Extracting the last layer's hidden state for the first token (`[CLS]`) and converting it to an array.

* Returning that hidden state as a list.
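
The last steps amount to indexing into a stack shaped (layers, texts, tokens, features). The toy sketch below uses plain nested lists with made-up numbers to show which vector is kept. The function name `cls_vector` is hypothetical, and the real code performs the same selection on a PyTorch tensor.

```python
# Toy stack of hidden states: 2 layers, 1 text, 3 tokens, 2 features per token.
# With a BERT tokenizer, the first token of each text is the special [CLS] token.
hidden_states = [
    [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]],   # layer 0
    [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]],   # layer 1 (last)
]

def cls_vector(stack, text_index=0):
    """Pick the first-token vector of the given text from the last layer."""
    last_layer = stack[-1]
    return last_layer[text_index][0]

selected = cls_vector(hidden_states)
print(selected)
```

Only one vector per text survives this step, so the embedding length equals the model's hidden size regardless of how long the overview is.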


In [None]:
def get_embedding_bert(text,chkpoint):
    
    #  Loading the pre-trained model/tokenizer
    _pretrained_model = chkpoint

    config = AutoConfig.from_pretrained(_pretrained_model)
    
    
    config.update({'output_hidden_states':True}) 
    # config.update({'max_position_embeddings':256})
    model_bert = AutoModelForSequenceClassification.from_pretrained(_pretrained_model, config=config)
    tokenizer = AutoTokenizer.from_pretrained(_pretrained_model,use_fast=False)
    
    
    
    # Tokenizing the input text
    features = tokenizer.batch_encode_plus(
        [text],
        add_special_tokens=True,
        padding='max_length',
        max_length=max_seq_length,
        truncation=True,
        return_tensors='pt',
        return_attention_mask=True
    )

    # Passing the input through the BERT model without tracking gradients
    with torch.no_grad():
        outputs = model_bert(input_ids=features['input_ids'],
                             attention_mask=features['attention_mask'])

    # Stacking the hidden states of all layers into one tensor
    all_hidden_states = torch.stack(outputs['hidden_states'])

    # Returning the last layer's first-token ([CLS]) vector as a list
    return all_hidden_states[-1][:, 0].cpu().detach().numpy()[0].tolist()
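
As written, `get_embedding_bert` loads the configuration, model, and tokenizer again on every call, which becomes costly when it is applied to thousands of overviews. One possible remedy, sketched here under the assumption that a checkpoint's resources can be reused across calls, is to cache the loader with `functools.lru_cache`. The names `load_checkpoint` and `LOAD_LOG` are hypothetical, and the loader body is a placeholder where the `from_pretrained` calls would go.

```python
from functools import lru_cache

LOAD_LOG = []  # records each real load, only to make the caching visible

@lru_cache(maxsize=None)
def load_checkpoint(chkpoint):
    """Load resources for a checkpoint once; repeat calls hit the cache."""
    LOAD_LOG.append(chkpoint)
    # Placeholder: a real version would build and return (model, tokenizer)
    # here using the from_pretrained calls from get_embedding_bert.
    return {"checkpoint": chkpoint}

first = load_checkpoint("bert-base-uncased")
second = load_checkpoint("bert-base-uncased")  # served from the cache
print(first is second, LOAD_LOG)
```

The embedding function could then call the cached loader at its top instead of the three `from_pretrained` lines, so each checkpoint is loaded once per session.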

In [None]:
# Storing checkpoint # 1
chkpoint1 = 'bert-base-uncased'
# Passing the function using lambda
data['Embedding_chkpoint_1'] = data['overview'].progress_apply(lambda x: get_embedding_bert(x,chkpoint1))

In [None]:
# chkpoint1 = 'bert-base-uncased'
# c = data['overview'][:2].progress_apply(lambda x: get_embedding_bert(x,chkpoint1))

In [None]:
# Storing checkpoint # 2
chkpoint2 = 'bert-large-uncased'
# Passing the function using lambda

data['Embedding_chkpoint_2'] = data['overview'].progress_apply(lambda x: get_embedding_bert(x,chkpoint2))

In [None]:
# chkpoint2 = 'bert-large-uncased'
# c = data['overview'][:2].progress_apply(lambda x: get_embedding_bert(x,chkpoint2))

In [None]:
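# Example: top 10 recommendations using the checkpoint 1 embeddings
# (get_recommendation calls embd_fn with the text only, so the checkpoint is bound here)
get_recommendation('horror movie with action',
                   lambda text: get_embedding_bert(text, chkpoint1),
                   'Embedding_chkpoint_1')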

In [None]:
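# Example: top 10 recommendations using the checkpoint 2 embeddings
# (get_recommendation calls embd_fn with the text only, so the checkpoint is bound here)
get_recommendation('horror movie with action',
                   lambda text: get_embedding_bert(text, chkpoint2),
                   'Embedding_chkpoint_2')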

### Conclusion
Our recommendation system uses embeddings from pre-trained language models (Sentence-Transformers and BERT) to recommend movies based on their overviews. By comparing the content of each overview with a user's text query, it suggests movies that closely match what the user is looking for. Such recommendations aim to keep customers engaged and encourage them to explore movies suited to their preferences.