# Part 2: Doc2Vec

<hr>

In [3]:
import importlib
import gensim
import nltk
from materials.code import utils
importlib.reload(utils)
import matplotlib.pyplot as plt

# IMPORT SOME BASIC TOOLS:
from pprint import pprint
import pyarrow

[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1056)>


<br><br>
###  Introduction
Recall that an embedding is just another way of representing our text data. In the case of word2vec, we were trying to learn an embedding that allowed us to relate the [one-hot](https://en.wikipedia.org/wiki/One-hot) representation of a given word to the one-hot representations of its context words. Because of this, word2vec provides a single vector representation for each distinct word in our vocabulary. This could be useful if we wanted to discover similar words based on their contextual usage, or to compare corpora based on the embeddings of their individual words.

But word-level vector representations may not be the best way to represent our text in all circumstances. Consider the problem of classifying the Rotten Tomatoes movie reviews; each review has a *varying number of words* and we would like to use **all these words together** when classifying the reviews. If we represent each sentence as the set of word vectors that make it up, we get (for each sentence) a matrix with as many rows as words and as many columns as the embedding size. The problem is that most classifiers expect a fixed-size input [tensor](https://en.wikipedia.org/wiki/Tensor#:~:text=In%20mathematics%2C%20a%20tensor%20is,scalars%2C%20and%20even%20other%20tensors.).
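To make the shape mismatch concrete, here is a minimal sketch that uses random stand-in vectors rather than a trained model. The toy reviews, the `EMBEDDING_SIZE` constant, and the `embed` helper are assumptions made only for illustration.

```python
import numpy as np

EMBEDDING_SIZE = 100  # assumed embedding width, for illustration only
rng = np.random.default_rng(0)

# Two toy "reviews" of different lengths (hypothetical examples).
reviews = [
    ["a", "charming", "film"],
    ["slow", "and", "far", "too", "long"],
]

def embed(tokens):
    """Stand-in for a word2vec lookup: one random row per token."""
    return rng.normal(size=(len(tokens), EMBEDDING_SIZE))

matrices = [embed(tokens) for tokens in reviews]
shapes = [m.shape for m in matrices]
print(shapes)  # one (n_words, EMBEDDING_SIZE) matrix per review

# Stacking into a single array fails because the row counts differ.
try:
    np.stack(matrices)
    stackable = True
except ValueError:
    stackable = False
print("Can be stacked into one tensor:", stackable)
```

Each review yields a matrix with a different number of rows, so the reviews cannot be stacked into the single fixed-size array that most classifiers expect.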

So, how can we solve this problem? Doc2Vec is a simple modification to the word2vec algorithm proposed by [[Le and Mikolov, 2014]](https://arxiv.org/pdf/1405.4053.pdf) that creates a fixed-length numeric representation of a document (e.g. a movie review) regardless of the document's length. Technically, this is accomplished by providing a one-hot `document_id` as one of the inputs to word2vec when training, alongside the context. Effectively, doc2vec combines the semantic meaning of each document's words. You can read the paper for the full technical details, or see [this blog](https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e) for a more intuitive explanation of how doc2vec differs from word2vec.
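The idea of feeding a `document_id` alongside the context can be sketched as a single forward pass in plain `numpy`. This is a simplified illustration under stated assumptions, not `gensim`'s implementation: the toy sizes, the random weights, and the `pv_dm_forward` helper are hypothetical, the document and context vectors are combined by averaging (concatenation is another option), and a full softmax is used for clarity where practical implementations typically use cheaper approximations.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE, N_DOCS, VECTOR_SIZE = 6, 2, 4  # toy sizes, assumed for illustration

# Lookup tables: multiplying by a one-hot vector is the same as selecting a row.
word_vectors = rng.normal(scale=0.1, size=(VOCAB_SIZE, VECTOR_SIZE))
doc_vectors = rng.normal(scale=0.1, size=(N_DOCS, VECTOR_SIZE))
output_weights = rng.normal(scale=0.1, size=(VECTOR_SIZE, VOCAB_SIZE))

def pv_dm_forward(document_id, context_word_ids):
    """Return a probability distribution over the vocabulary for the target word."""
    # Average the document vector together with the context word vectors.
    inputs = np.vstack([doc_vectors[document_id], word_vectors[context_word_ids]])
    hidden = inputs.mean(axis=0)
    # Full softmax over the vocabulary (numerically stabilised).
    scores = hidden @ output_weights
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

probs = pv_dm_forward(document_id=0, context_word_ids=[1, 2, 4])
print(probs.shape, probs.sum())
```

During training, the prediction error would be used to update both the word vectors and the row of `doc_vectors` for that document, which is how each document ends up with its own fixed-length vector.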

**Note:** Doc2Vec is just one approach to combining the word2vec vectors and it is by no means authoritative or "the best". There are several simple alternatives: we could take a simple average of the word vectors, or we could represent the sentences as a (very large and very sparse) bag-of-word-vectors; like any solution, these have their pros and cons. I bring this to your attention because it's important to understand that there is no universally "good" or "bad" method; the appropriate method depends on the problem you are trying to solve.

### Data
In this component of the tutorial, we will train a doc2vec model. To begin, let's import the Rotten Tomatoes dataset again, split it into `sentences` and class labels `y`, and tokenize the sentences using the `nltk` tokenizer:

In [12]:
#-------------------------------------------------
# Import the Rotten Tomatoes dataset:
#-------------------------------------------------
from datasets import load_dataset
dataset   = load_dataset('rotten_tomatoes')

#-------------------------------------------------
# Flatten out the dataset into a list of sentences and outcome, y
#-------------------------------------------------
sentences = dataset['train']['text']  + dataset['validation']['text'] + dataset['test']['text']
y         = dataset['train']['label'] + dataset['validation']['label'] + dataset['test']['label']

#-------------------------------------------------
# Tokenize each of the sentences using nltk:
#-------------------------------------------------
for i,sentence in enumerate(sentences):
    sentences[i] = nltk.word_tokenize(gensim.utils.to_unicode(sentence))


Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/Users/ghamut/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/9198dbc50858df8bdb0d5f18ccaf33125800af96ad8434bc8b829918c987ee8a)


#### Training
In addition to `word2vec`, `gensim` provides a doc2vec model that we can use to transform our movie reviews into fixed length vectors. Note that `gensim` assumes a special format for the documents - which I illustrate below:

In [13]:
#-------------------------------------------------
# Convert data into 'documents' for processing by gensim
#-------------------------------------------------
import gensim
from gensim.models import doc2vec

documents = [doc2vec.TaggedDocument(doc, [i]) for i, doc in enumerate(sentences)]
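To see what this format looks like, here is a small sketch that does not depend on `gensim`. It uses a `namedtuple` stand-in with the same two fields, `words` and `tags`, that `TaggedDocument` exposes; the stand-in and the toy sentences are assumptions for illustration only.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (words, tags).
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Hypothetical, already-tokenized toy sentences.
toy_sentences = [["the", "rock", "rocks"], ["a", "very", "long", "way"]]

# Each document pairs its token list with a list of tags; here the tag is the index.
toy_documents = [TaggedDocument(words, [i]) for i, words in enumerate(toy_sentences)]

for doc in toy_documents:
    print(doc.tags, doc.words)
```

The tag plays the role of the `document_id` described in the introduction: it tells the model which document vector to update for a given list of words.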

With our data properly formatted, we can generate the sentence-level embeddings for the Rotten Tomatoes data using the [gensim](https://radimrehurek.com/gensim/) doc2vec implementation. 

In [14]:
#-------------------------------------------------
# Speed things up with multiprocessing
#-------------------------------------------------
import multiprocessing
CORES = multiprocessing.cpu_count()

#-------------------------------------------------
# Train Doc2Vec Model, selecting hyper-parameters
#-------------------------------------------------
doc2vec_model = doc2vec.Doc2Vec(documents    = documents,
                                 dm          = 1,    # ({1,0}, optional) – Defines the training algorithm. If dm=1, ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.
                                 dbow_words  = 1,    # ({1,0}, optional) – If 1, also trains word vectors (skip-gram style) alongside PV-DBOW doc-vector training; if 0, trains only doc vectors. Relevant when dm=0.
                                 vector_size = 100,  # (int, optional)   – Dimensionality of the feature vectors
                                 window      = 8,    # (int, optional)   – The maximum distance between the current and predicted word within a sentence.
                                 min_count   = 2,    # (int, optional)   – Ignores all words with total frequency lower than this.
                                 epochs      = 10,   # (int, optional)   – Number of iterations (epochs) over the corpus.
                                 workers     = CORES # Allows for parallelization across multiple cores of your machine - should speed things up.
                                )

<br><br>And just like that - we've trained the doc2vec model. Note that just like `word2vec`, `doc2vec` has several hyper-parameters that will impact the precise nature of the embeddings. I've provided some additional comments next to each of the hyper-parameters that should help clarify what they do. `gensim` also provides a built-in function that infers a document vector for a given tokenized input:

In [7]:
import numpy as np

document = "The rock rocks"
vector = doc2vec_model.infer_vector(nltk.word_tokenize(gensim.utils.to_unicode(document)))
print(document,':\n', vector,'\n')

document = "and so he went a very very long way"
vector = doc2vec_model.infer_vector(nltk.word_tokenize(gensim.utils.to_unicode(document)))
print(document,':\n', vector)

The rock rocks :
 [-1.44153601e-02 -2.39591170e-02  2.69290917e-02  1.95036866e-02
  7.44855823e-03 -1.72724389e-02 -1.54242264e-02 -1.25857601e-02
 -2.89890729e-02  6.63137250e-03  4.52466455e-04  7.94259831e-03
  1.89221483e-02  1.85004249e-02 -1.03155021e-02  3.43167712e-03
 -2.12104619e-02  2.97078281e-03  8.10250174e-03  2.34576147e-02
  2.73774359e-02 -3.13795768e-02  3.16853151e-02 -7.49370956e-05
 -1.85929593e-02  3.14153987e-03 -8.53542425e-03  1.94764752e-02
  3.36200086e-04 -6.48112455e-03  2.77458131e-02 -4.89503220e-02
 -1.34389261e-02 -2.37140339e-03 -1.46882078e-02  6.82174275e-03
 -7.46581284e-03 -1.24850636e-03  2.46334635e-02 -1.44057879e-02
 -7.13158213e-03 -3.47599242e-04 -7.98465312e-03 -2.82172728e-02
 -5.69180846e-02  1.76429693e-02  6.27975038e-04  3.72305624e-02
 -2.87549570e-02  6.36350140e-02  1.80693567e-02  3.46224234e-02
 -2.10719015e-02 -2.13983841e-02  6.81256456e-03  3.51674296e-03
 -2.00131256e-02 -2.89899344e-03 -3.54144052e-02 -8.41913186e-03
 -1.022

<br><br> Notice that the representations of both "documents" are 100-dimensional, just as we requested. Now let's infer a fixed-length vector for each of our documents:

In [8]:
#--------------------------------------------------------            
# Generate Vector Representations of the documents
#--------------------------------------------------------
vectors = []
for sentence in sentences:
    vectors.append(doc2vec_model.infer_vector(sentence))
vectors = np.array(vectors)

print('-------------------------------------------------------')
print('Dimensions of our document vector matrix')
print('-------------------------------------------------------')
np.shape(vectors)

-------------------------------------------------------
Dimensions of our document vector matrix
-------------------------------------------------------


(10662, 100)

<hr> 

## Learning Exercise 2: 
#### Worth 1/5 Points
#### A. Use Doc2Vec Features for Rotten Tomatoes Classification
Train a simple Logistic Regression model using `sklearn` that uses Doc2Vec features to predict the class of the Rotten Tomatoes movie reviews. Use an 80%-20% training-test split of the data. Try 3-5 configurations of the doc2vec hyper-parameters (the exact settings are up to you) and report the AUROC of the logistic regression for each configuration. Comment on any differences you observe in the performance of the model as a function of the hyper-parameter settings and on why this might be the case.

In [9]:
################################################################################
# INSERT YOUR CODE HERE
# DO NOT FORGET TO PRINT YOUR MEANINGFUL RESULTS TO THE SCREEN.
################################################################################

<span style="color:red"> INSERT AN INTERPRETATION OF YOUR RESULTS HERE </span>

#### B. Creating Document Vectors from Word Vectors
Train a `word2vec` model (with hyper-parameter settings of your choice) using the Rotten Tomatoes movie reviews. For each document, construct a fixed-size document vector from the word vectors in the document. For instance, if you have a set of five 100-dimensional vectors that describe the five words in a review, you should combine those into one 100-dimensional vector. Propose a way to combine the vectors, assuming that you want to use them for review classification; justify your method.

In [11]:
################################################################################
# INSERT YOUR CODE HERE
# DO NOT FORGET TO PRINT YOUR MEANINGFUL RESULTS TO THE SCREEN.
################################################################################

<span style="color:red"> INSERT AN INTERPRETATION OF YOUR RESULTS HERE </span>

#### C. Comparing Embeddings through Visualization
Project the best-performing doc2vec embedding from **part A** and the vectors you generated in **part B** into two-dimensional representations using t-SNE or PCA. Use the `matplotlib` `scatter` function to compare the 2D representations from part A to part B. More specifically, please visualize each document as a point in two-dimensional space and color each document (i.e. point) according to its class (e.g. positive reviews colored red, and negative reviews colored blue). Comment on differences you observe (if any) and reflect on why these differences (if any) might exist.

In [None]:
################################################################################
# INSERT YOUR CODE HERE
# DO NOT FORGET TO PRINT YOUR MEANINGFUL RESULTS TO THE SCREEN.
################################################################################

<span style="color:red"> INSERT AN INTERPRETATION OF YOUR RESULTS HERE </span>

<hr>
<h1><span style="color:red"> Self Assessment </span></h1>
Please provide an assessment of how successfully you accomplished the learning exercises in this assignment according to the instructions provided; do not assign yourself points for effort. This self-assessment will be used as a starting point when I grade your assignments. Please note that if you overestimate your grade on a given learning exercise, you will face a 50% penalty on the total points granted for that exercise. If you underestimate your grade, there will be no penalty.

* Learning Exercise: 
    * <span style="color:red">X</span>/1 points