# Notebook 4B: Doc2Vec Recommender

I will create a character recommendation system by using gensim's Doc2Vec model to calculate the similarity between one chosen character's movie lines and every other character's movie lines. The final result will be a function that takes a movie character's name as input and returns the top 10 most similar characters.


[**Doc2Vec Process**](https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e)

Doc2Vec is a neural network that extends Word2Vec with a paragraph matrix (a.k.a. document vectors). In essence, it builds on Word2Vec's two algorithms, continuous bag of words (CBOW) and skip-gram. CBOW uses the surrounding context words in a text to predict a target word, whereas skip-gram uses a single word to predict each of its surrounding context words. Word2Vec learns a vector that represents the concept of a word, and Doc2Vec adds a document vector that acts as a memory trained to represent the concept/topic of a document.
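To make the difference between the two Word2Vec algorithms concrete, here is a minimal sketch of the training pairs each one would build from a short sentence. The helper names `cbow_pairs` and `skipgram_pairs` and the `window` parameter are illustrative assumptions, not gensim functions.

```python
def cbow_pairs(tokens, window=2):
    """CBOW: the surrounding context words (input) predict the target word."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Skip-gram: a single word (input) predicts each neighbour in turn."""
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        for context_word in left + right:
            pairs.append((center, context_word))
    return pairs


tokens = "you forgot to pay".split()
print(cbow_pairs(tokens, window=1))
print(skipgram_pairs(tokens, window=1))
```

With a window of 1, CBOW builds one example per word (for instance, the context `['you', 'to']` predicting `'forgot'`), while skip-gram builds one example per (word, neighbour) pair.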


![alt text](../photos/doc2vec_graph.png "Title")


In our particular case, each character's combined movie lines will be treated as one document. Based on the `mov_model` dataframe, there are 76127 documents (one per character) that need to be processed.

## Pre-Processing

In [195]:
import pandas as pd
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

In [100]:
mov_model = pd.read_pickle("../data/mov_model.pkl")

There are 76127 movie characters.

In [101]:
len(mov_model)

76127

In [102]:
mov_model[['character','clean_text']].head(3)

| | character | clean_text |
|---|---|---|
| 0 | bartender | What can I get you You forgot to pay |
| 1 | bianca | Did you change your hair You might wanna think... |
| 2 | bianca and walter | The sound of a fifteen year old in labor |


Convert the `clean_text` values to lowercase.

In [103]:
mov_model['clean_text'] = mov_model['clean_text'].map(lambda x: x.lower())

I will now collect the documents and tag them for identification purposes. Tagging the documents by character name is impractical because many movie characters share the same name, so filtering would be a nightmare. Instead, each character in the `mov_model` dataframe can be identified by its unique id, which ranges from 0 to 76126. As a result, I will tag each character's movie lines with that unique id as well.
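As a small illustration of why names make poor tags, the sketch below uses a made-up three-row dataset (not taken from `mov_model`) in which two different characters are both called "bartender". Keying by name silently drops one of them, while keying by position keeps every document.

```python
from collections import Counter

# Hypothetical mini-dataset: (character name, lines). Two distinct bartenders.
rows = [
    ("bartender", "what can i get you"),
    ("bianca", "did you change your hair"),
    ("bartender", "last call folks"),
]

# Names that appear more than once cannot identify a single document
name_counts = Counter(name for name, _ in rows)
duplicated = [name for name, n in name_counts.items() if n > 1]

# Keyed by name: the second bartender overwrites the first
by_name = {name: text for name, text in rows}

# Keyed by position (as a string, like the tags used in this notebook)
by_id = {str(pos): text for pos, (name, text) in enumerate(rows)}

print(duplicated, len(by_name), len(by_id))
```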

In [104]:
all_docs = list(mov_model['clean_text'])

In [105]:
tagged_docs = [TaggedDocument(words = word_tokenize(doc.lower()), tags = [str(pos)]) for pos, doc in enumerate(all_docs)]

Example of examining the bartender's `clean_text` and its entry in `tagged_docs` by using its unique id 0.

In [106]:
mov_model.at[0,'clean_text']

'what can i get you you forgot to pay '

In [107]:
tagged_docs[0]

TaggedDocument(words=['what', 'can', 'i', 'get', 'you', 'you', 'forgot', 'to', 'pay'], tags=['0'])

## Training Doc2Vec NN Model
Doc2Vec is a shallow neural network with a single hidden (projection) layer that is trained on the corpus, which is all of the combined documents. The inputs of the model consist of two kinds of vectors: the word vectors of the context words in a document and that document's paragraph vector.
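The way the two kinds of input vectors meet in the projection layer can be sketched with toy numbers. This assumes the averaging variant of PV-DM (the paragraph vector is averaged with the context word vectors); the vectors are random, the dimensionality is shrunk to 4, and `pv_dm_input` is a hypothetical helper rather than gensim's internal code.

```python
import random

random.seed(0)
DIM = 4  # toy dimensionality; the notebook's model uses 150


def random_vector(dim=DIM):
    return [random.uniform(-0.5, 0.5) for _ in range(dim)]


# One vector per vocabulary word and one per tagged document (toy values)
word_vectors = {w: random_vector() for w in ["what", "can", "i", "get", "you"]}
doc_vectors = {"0": random_vector()}


def pv_dm_input(doc_tag, context_words):
    """Average the paragraph vector with the context word vectors."""
    vectors = [doc_vectors[doc_tag]] + [word_vectors[w] for w in context_words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Context around the target word "i" in the bartender's first line
hidden = pv_dm_input("0", ["what", "can", "get", "you"])
print(hidden)
```

During training, a projection like `hidden` would be used to predict the missing target word, and the prediction error would update both the word vectors and the document vector.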

The following code is adapted from a [Medium post](https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5).

Gensim's Doc2Vec model has several parameters that can be modified/trained:

`vector_size` = 150
- A size of 150 means that each document's vector will contain 150 elements, so each document will occupy a point in a 150-dimensional space. A higher size means more dimensions, which allows for more differentiation between documents.

`alpha` = 0.025
- `alpha` is the initial learning rate, which controls the step size used when minimizing the loss function.

`min_count` = 2
- Ignores all words with a total frequency lower than 2.

`dm` = 1
- `dm = 1` means that the Distributed Memory version of Paragraph Vector (PV-DM) will be used as the training algorithm. Essentially, the paragraph vector acts as a memory that remembers what is missing from the current context of a character's movie lines.

`max_epochs` = 100
- This is a loop variable rather than a Doc2Vec argument. The training loop calls `train()` 100 times, and each call passes all the tagged documents through the neural network (for as many passes as its `epochs` argument specifies), using gradient descent to decrease the loss function.
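To illustrate what `min_count` does to the vocabulary, here is a minimal sketch using only the standard library. The function `build_vocab_sketch` is a hypothetical stand-in for the filtering step and is not gensim's `build_vocab`; the three token lists are made-up examples.

```python
from collections import Counter


def build_vocab_sketch(token_lists, min_count=2):
    """Keep only words whose total frequency across all documents is >= min_count."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, n in counts.items() if n >= min_count}


docs = [
    ["what", "can", "i", "get", "you", "you", "forgot", "to", "pay"],
    ["did", "you", "change", "your", "hair"],
    ["can", "i", "pay", "later"],
]

vocab = build_vocab_sketch(docs, min_count=2)
print(sorted(vocab))  # ['can', 'i', 'pay', 'you']
```

Words that occur only once in the whole corpus, such as "forgot" here, are dropped before training and therefore never receive a word vector.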


In [182]:
model = Doc2Vec(vector_size = 150,
                alpha = 0.025, 
                min_alpha = 0.00025,
                min_count = 2,
                dm = 1)

# Builds the vocabulary from all of the documents
model.build_vocab(tagged_docs)

max_epochs = 100

for epoch in range(max_epochs):    
    if epoch % 5 == 0:
        print(f'Processing epoch number: {epoch}')
        
    model.train(tagged_docs,
                total_examples=model.corpus_count,
                epochs=model.epochs)  # named `model.iter` in older gensim releases
    
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("../models/d2v.model")
print("Model Saved")

Processing epoch number: 0
Processing epoch number: 5
Processing epoch number: 10
Processing epoch number: 15
Processing epoch number: 20
Processing epoch number: 25
Processing epoch number: 30
Processing epoch number: 35
Processing epoch number: 40
Processing epoch number: 45
Processing epoch number: 50
Processing epoch number: 55
Processing epoch number: 60
Processing epoch number: 65
Processing epoch number: 70
Processing epoch number: 75
Processing epoch number: 80
Processing epoch number: 85
Processing epoch number: 90
Processing epoch number: 95
Model Saved
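The manual learning-rate handling in the training loop above can be traced without gensim. The sketch below only reproduces the arithmetic of `model.alpha -= 0.0002`; the function name `manual_alpha_schedule` is an illustrative assumption.

```python
def manual_alpha_schedule(start_alpha=0.025, step=0.0002, max_epochs=100):
    """Return the value of alpha at the start of each pass of the manual loop."""
    alphas = []
    alpha = start_alpha
    for _ in range(max_epochs):
        alphas.append(alpha)
        alpha -= step  # mirrors `model.alpha -= 0.0002`
    return alphas


schedule = manual_alpha_schedule()
print(f"first pass: {schedule[0]:.4f}, last pass: {schedule[-1]:.4f}")
# first pass: 0.0250, last pass: 0.0052
```

Two caveats: because `min_alpha` is only pinned to `alpha` at the end of each iteration, the very first `train()` call still decays from 0.025 toward the original `min_alpha`, and each call makes several internal passes over the corpus. The gensim documentation describes a single `train()` call with an explicit `epochs` value as the common and recommended usage, which leaves the decay from `alpha` to `min_alpha` to the library.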


## Character Similarity
With the saved model, I will now examine the top 10 most similar characters using Doc2Vec's [most_similar()](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Doc2VecKeyedVectors.most_similar) function. For a single query document, most_similar() ranks every other document by the cosine similarity between its vector and the query document's vector (when several documents are given, it uses the simple mean of their vectors as the query). I am going to use the bartender from 10 Things I Hate About You (1999) as an example.

In [190]:
mov_model.loc[0,['imdb_title','character','text']]

imdb_title         10 Things I Hate About You (1999)
character                                  bartender
text          What can I get you? You forgot to pay!
Name: 0, dtype: object

In [200]:
model.docvecs.most_similar(0, topn = 10)

[(28619, 0.32750070095062256),
 (30941, 0.31470227241516113),
 (54820, 0.301654577255249),
 (56930, 0.2961265742778778),
 (14410, 0.29411712288856506),
 (64972, 0.29212257266044617),
 (67517, 0.29112035036087036),
 (21364, 0.2874348759651184),
 (12569, 0.2868254482746124),
 (58117, 0.28625547885894775)]
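The ranking above can be reproduced in miniature with plain Python. This is a sketch of cosine-similarity ranking on made-up 2-dimensional vectors, not gensim's implementation; `rank_by_cosine` and `toy_vectors` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_id, vectors, topn=10):
    """Return (id, similarity) tuples for the topn documents closest to the query."""
    query = vectors[query_id]
    scores = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in vectors.items() if doc_id != query_id]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


toy_vectors = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [-1.0, 0.0],
}
print(rank_by_cosine(0, toy_vectors, topn=2))
```

Document 1 points in almost the same direction as document 0 and ranks first, while document 3 points the opposite way and would rank last.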

The function returns the top 10 most similar characters. The first number of each tuple is a character's unique id, and the second number is that character's cosine similarity to the bartender. I will now investigate the movie lines of character 30941, the second result:

In [203]:
mov_model.loc[30941,['imdb_title','character','text']]

imdb_title                                      I Am Sam (2001)
character                                               bailiff
text          ...the whole truth, and nothing but the truth,...
Name: 30941, dtype: object

In [204]:
mov_model.loc[30941,'text']

'...the whole truth, and nothing but the truth, so help you God?'

Because each character's lines are so short, it is hard to judge from these examples exactly how 'similar' the characters are. With the saved Doc2Vec model, I will now create a character recommendation function that lets the user filter for characters by toggling genres, movies, and other parameters.

# Proceed to Notebook 5: Character Recommendation Function