# Doc2Vec sentiment classification (IMDB)
Doc2Vec (also known as Paragraph Vector) is a method for learning document embeddings, first introduced by Le & Mikolov in the paper "Distributed Representations of Sentences and Documents" <http://cs.stanford.edu/~quocle/paragraph_vector.pdf>.
In this notebook we replicate part of the experiments described in the paper, specifically the sentiment classification on the IMDB dataset.

The experiment consists of two trained models:

1. A Doc2Vec model trained on the train and unlabeled data, to obtain document embeddings for the IMDB dataset.
2. A Logistic Regression model trained on the Doc2Vec embeddings of the train data.

To measure the accuracy of the system, the Doc2Vec model first infers an embedding for each test document. The trained Logistic Regression then predicts the sentiment of each review from these embeddings.

First, we download the IMDB dataset from tensorflow_datasets:

In [1]:
import tensorflow_datasets as tfds


data = tfds.load(
    name="imdb_reviews",
    as_supervised=False)


Then we create a class whose iterator is a generator that reads the IMDB reviews, tokenizes them, and wraps each one in a TaggedDocument to be fed into the Doc2Vec models.

We need this class to stream the data one document at a time during Doc2Vec vocabulary creation and training. This way the whole corpus never has to be held in memory at once, and the machine does not start swapping.
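
A class with `__iter__` is used rather than a plain generator because Doc2Vec reads the corpus several times (once to build the vocabulary, then once per training epoch). The sketch below illustrates the difference with toy data; `one_shot`, `Restartable` and the sample reviews are hypothetical names introduced only for this illustration.

```python
def one_shot(lines):
    """Generator function: the returned generator can be consumed only once."""
    for line in lines:
        yield line.split()


class Restartable:
    """Iterable whose __iter__ starts a fresh pass on every call."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()


# Hypothetical stand-in for the reviews stored on disk
reviews = ["a great film", "a dull film"]

gen = one_shot(reviews)
first_pass_gen = sum(1 for _ in gen)
second_pass_gen = sum(1 for _ in gen)     # the generator is already exhausted

corpus = Restartable(reviews)
first_pass_cls = sum(1 for _ in corpus)
second_pass_cls = sum(1 for _ in corpus)  # __iter__ runs again from the start

print(first_pass_gen, second_pass_gen)  # 2 0
print(first_pass_cls, second_pass_cls)  # 2 2
```

Only the restartable iterable yields documents on the second pass, which is the behavior multi-epoch training relies on.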

In [2]:
import re

from gensim.models.doc2vec import TaggedDocument
from gensim.utils import to_unicode

def tokenize_text(text: str) -> list:
    """Strip HTML tags and split the text into whitespace-separated tokens."""
    # Replace HTML tags (e.g. <br />) with a newline so adjacent words stay separated
    cleanr = re.compile('<.*?>')
    temp = re.sub(cleanr, '\n', text)

    # Collapse runs of spaces into a single space
    temp = re.sub(" +", " ", temp)

    clean = temp.strip()
    return to_unicode(clean).split()


class TxtCorpus(list):
    """
    Iterable that returns TaggedDocument objects, used for Doc2Vec training.
    Processes documents one by one using a generator, to avoid filling up RAM.
    """
    def __init__(self, tf_dataset, parts=('train', 'test', 'unsupervised')):
        self.tf_dataset = tf_dataset
        self.parts = parts

    def __iter__(self):
        for part in self.parts:
            for i, line in enumerate(self.tf_dataset[part]):
                text = line['text'].numpy().decode()
                tokens = tokenize_text(text)
                if i % 10000 == 0:
                    print(f"DOCUMENT WITH ID(TAG): {part}-{i}")
                yield TaggedDocument(tokens, [f"{part}-{i}"])
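
To see what the tokenization step produces, here is a simplified standalone version that does not depend on gensim. `TAG_RE`, `simple_tokenize` and the sample review are made up for this illustration; the logic mirrors `tokenize_text` above (strip HTML tags, then split on whitespace).

```python
import re

# Non-greedy match of anything that looks like an HTML tag, e.g. <br />
TAG_RE = re.compile(r"<.*?>")


def simple_tokenize(text):
    """Replace HTML tags with whitespace and split on whitespace."""
    without_tags = TAG_RE.sub("\n", text)
    return without_tags.split()


sample = "Great movie!<br /><br />Would  watch again."
tokens = simple_tokenize(sample)
print(tokens)  # ['Great', 'movie!', 'Would', 'watch', 'again.']
```

Note that punctuation stays attached to the words and case is preserved, since the text is only split on whitespace.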


The embeddings used to train the Logistic Regression model are learned during the training of the Doc2Vec model, so they can be read directly from it. The embeddings for the test set, as in a real application scenario, must instead be inferred once the model is trained.

In [3]:
import numpy as np

def load_from_model(d2v_model, tf_dataset, part):
    """ Load embeddings created during training phase."""
    X = []
    Y = []
    for i, line in enumerate(tf_dataset[part]):
        X.append(d2v_model.docvecs[f"{part}-{i}"])
        Y.append(line['label'].numpy())
    return np.asarray(X), np.asarray(Y)

def infer_from_model(d2v_model, tf_dataset, part):
    """ Infer embedding for a text not seen in training phase."""
    X = []
    Y = []
    for i, line in enumerate(tf_dataset[part]):
        text = line['text'].numpy().decode()

        X.append(d2v_model.infer_vector(tokenize_text(text)))
        Y.append(line['label'].numpy())
    return np.asarray(X), np.asarray(Y)

When training the logistic regression, we tune a small set of hyperparameters with a grid search, scoring each candidate by 5-fold cross-validation on the train data. The best model found is then evaluated on the test set. The limited number of features (typical of embeddings, compared to bag of words) allows us to do this relatively quickly.
We expect the tuning to give a small extra boost in accuracy.

Since the problem is binary classification, the dataset is balanced, and we do not have specific requirements on precision and recall, accuracy is an acceptable metric in this case.

In [4]:
from time import time

from sklearn import linear_model
from sklearn.model_selection import GridSearchCV

def train_lr_evaluate(model, data):
    print(f"Evaluating {model}")
    X_train, Y_train = load_from_model(model, data, 'train')
    print("Training data embeddings loaded.")

    X_test, Y_test = infer_from_model(model, data, 'test')
    print("Test data embeddings inferred.")

    lr = linear_model.LogisticRegression()

    penalty = ['l1', 'l2']
    C = [0.0001, 0.001, 0.1, 1, 10, 100]
    hyperparameters = dict(C=C, penalty=penalty, solver=['liblinear'])

    clf = GridSearchCV(lr, hyperparameters, cv=5, verbose=2, n_jobs=4)
    print("Training and testing logistic regression... ", end='')
    start = time()
    clf.fit(X_train, Y_train)
    print(f"finished in {time() - start} seconds")
    return clf.best_estimator_.score(X_test, Y_test)
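
As a rough guide to the cost of this search, the number of logistic regression fits can be counted from the grid in the cell above. This is only a back-of-the-envelope sketch; it assumes scikit-learn's default `refit=True`, under which `GridSearchCV` trains the best candidate once more on all the train data.

```python
from itertools import product

# Grid taken from the cell above
penalties = ["l1", "l2"]
C_values = [0.0001, 0.001, 0.1, 1, 10, 100]
n_folds = 5

# Every (C, penalty) pair is one candidate model
candidates = list(product(C_values, penalties))

# Each candidate is fitted once per cross-validation fold
cv_fits = len(candidates) * n_folds

# Plus one final fit of the best candidate on the whole train set
total_fits = cv_fits + 1

print(len(candidates), cv_fits, total_fits)  # 12 60 61
```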


In the next steps, we combine unsupervised and train data for the Doc2Vec vocabulary creation and embeddings training. 

Following guidance from the gensim creators (https://radimrehurek.com/gensim/auto_examples/howtos/run_doc2vec_imdb.html) on parameters that reduce training time without hurting performance, we define the default parameters for the trained models:
* ``vector_size=100``, as there does not seem to be a drop in performance compared to the paper's size of 400.
* ``epochs=20``, large enough for the model to learn features from the text.
* ``min_count=2``, which discards words that appear only once in the corpus; such words cannot benefit from training that relies on the co-occurrence of words across documents.
* ``sample=0``: no downsampling of frequent words.
* ``hs=0`` and ``negative=5``: negative sampling is used instead of hierarchical softmax, with 5 "noise words" updated for every update of a positive sample.
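
The effect of a frequency threshold such as ``min_count=2`` can be sketched in plain Python. The toy documents and variable names below are made up for this illustration; this shows the filtering idea, not gensim's internal implementation.

```python
from collections import Counter

# Hypothetical tokenized toy corpus
docs = [
    "a great film".split(),
    "a dull film".split(),
    "great acting".split(),
]
min_count = 2

# Count how often each word occurs over the whole corpus
freq = Counter(token for doc in docs for token in doc)

# Keep words that reach the threshold, drop the rest
vocab = sorted(w for w, c in freq.items() if c >= min_count)
dropped = sorted(w for w, c in freq.items() if c < min_count)

print(vocab)    # ['a', 'film', 'great']
print(dropped)  # ['acting', 'dull']
```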
 
Starting from those base parameters, we train three models:

* One with the Distributed Bag of Words (``DBOW``) method described in the paper.
* One with the Distributed Memory (``DM``) method described in the paper, where the paragraph vector and word vectors are averaged during training to predict the next word in a context.
* One with the Distributed Memory (``DM``) method described in the paper, where the paragraph vector and word vectors are concatenated during training to predict the next word in a context. Concatenation results in a bigger model that is slower to train, compared to averaging.
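
The size difference between the two DM variants can be sketched with toy vectors. The example assumes the input is built from one paragraph vector plus `window` context words on each side; all values and names are illustrative.

```python
vector_size = 4  # toy size; the notebook uses 100
window = 2       # toy window; the notebook uses 5 for the concatenated model

# One paragraph vector and 2 * window context word vectors (made-up values)
doc_vec = [1.0] * vector_size
context_vecs = [[float(i)] * vector_size for i in range(2 * window)]
inputs = [doc_vec] + context_vecs

# DM with averaging: element-wise mean, the input keeps vector_size dimensions
mean_input = [sum(column) / len(inputs) for column in zip(*inputs)]

# DM with concatenation: vectors are laid end to end, so the input grows
concat_input = [x for vec in inputs for x in vec]

print(len(mean_input), len(concat_input))  # 4 20


def concat_input_size(vector_size, window):
    """Input size under the assumption stated above."""
    return (1 + 2 * window) * vector_size


print(concat_input_size(100, 5))  # 1100
```

Under that assumption, the concatenated model feeds a 1100-dimensional input to the prediction layer instead of a 100-dimensional one, which is why it is bigger and slower to train.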

In the paper, the authors noted that concatenating the trained DBOW and DM paragraph vectors improves performance. We will repeat the same experiment.
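
Concatenating paragraph vectors simply means that, for each document, the vectors from the two models are joined into one longer feature vector. A minimal sketch with made-up two-dimensional vectors follows; the dictionaries and the helper `concatenated_vector` are hypothetical stand-ins for the trained models.

```python
# Hypothetical paragraph vectors from two separately trained models,
# keyed by the same document tags
dbow_vectors = {"train-0": [0.1, 0.2], "train-1": [0.3, 0.4]}
dm_vectors = {"train-0": [0.5, 0.6], "train-1": [0.7, 0.8]}


def concatenated_vector(tag):
    """Join the DBOW and DM vectors of one document end to end."""
    return dbow_vectors[tag] + dm_vectors[tag]


features = [concatenated_vector(tag) for tag in ("train-0", "train-1")]
print(features[0])       # [0.1, 0.2, 0.5, 0.6]
print(len(features[0]))  # 4
```

With the notebook's ``vector_size=100``, each concatenated document vector would have 200 features for the logistic regression.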

In [5]:
from time import time
import multiprocessing

from gensim.models.doc2vec import Doc2Vec
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec


td_reviews = TxtCorpus(data, parts=['train', 'unsupervised'])

common_params = dict(
    vector_size=100,
    epochs=20,
    min_count=2,
    sample=0,
    workers=multiprocessing.cpu_count(),
    negative=5,
    hs=0,
)

models = [
    # PV-DBOW training
    Doc2Vec(dm=0, **common_params),
    # PV-DM w/ default averaging; a higher starting learning rate may improve CBOW/PV-DM modes
    Doc2Vec(dm=1, window=10, alpha=0.05, comment='alpha=0.05', **common_params),
    # PV-DM w/ concatenation
    Doc2Vec(dm=1, dm_concat=1, window=5, **common_params),
]
models_by_name = {}

print("creating vocabularies...")
start = time()
for model in models:
    model.build_vocab(td_reviews)
    print(f"{model} vocabulary created")
    models_by_name[str(model)] = model
print(f"vocabularies created in {time() - start} seconds")

models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([models[0], models[1]])
models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([models[0], models[2]])

creating vocabularies...
DOCUMENT WITH ID(TAG): train-0
DOCUMENT WITH ID(TAG): train-10000
DOCUMENT WITH ID(TAG): train-20000
DOCUMENT WITH ID(TAG): unsupervised-0
DOCUMENT WITH ID(TAG): unsupervised-10000
DOCUMENT WITH ID(TAG): unsupervised-20000
DOCUMENT WITH ID(TAG): unsupervised-30000
DOCUMENT WITH ID(TAG): unsupervised-40000
Doc2Vec(dbow,d100,n5,mc2,t8) vocabulary created
DOCUMENT WITH ID(TAG): train-0
DOCUMENT WITH ID(TAG): train-10000
DOCUMENT WITH ID(TAG): train-20000
DOCUMENT WITH ID(TAG): unsupervised-0
DOCUMENT WITH ID(TAG): unsupervised-10000
DOCUMENT WITH ID(TAG): unsupervised-20000
DOCUMENT WITH ID(TAG): unsupervised-30000
DOCUMENT WITH ID(TAG): unsupervised-40000
Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8) vocabulary created
DOCUMENT WITH ID(TAG): train-0
DOCUMENT WITH ID(TAG): train-10000
DOCUMENT WITH ID(TAG): train-20000
DOCUMENT WITH ID(TAG): unsupervised-0
DOCUMENT WITH ID(TAG): unsupervised-10000
DOCUMENT WITH ID(TAG): unsupervised-20000
DOCUMENT WITH ID(TAG): un

In [6]:
accuracies = {}

for model_name, model in models_by_name.items():
    if model_name.startswith("Doc2Vec"):
        print(f"Training {model_name}... ", end='')
        start = time()
        model.train(td_reviews, total_examples=model.corpus_count, epochs=model.epochs)
        print(f"finished in {time() - start} seconds")
        accuracies[str(model)] = train_lr_evaluate(model, data)
        print(f"Accuracy: {accuracies[str(model)]} \n")


for model in [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]:
    accuracies[str(model)] = train_lr_evaluate(model, data)
    print(f"Accuracy: {accuracies[str(model)]} \n")

-10000
DOCUMENT WITH ID(TAG): unsupervised-20000
DOCUMENT WITH ID(TAG): unsupervised-30000
DOCUMENT WITH ID(TAG): unsupervised-40000
DOCUMENT WITH ID(TAG): train-0
DOCUMENT WITH ID(TAG): train-10000
DOCUMENT WITH ID(TAG): train-20000
DOCUMENT WITH ID(TAG): unsupervised-0
DOCUMENT WITH ID(TAG): unsupervised-10000
DOCUMENT WITH ID(TAG): unsupervised-20000
DOCUMENT WITH ID(TAG): unsupervised-30000
DOCUMENT WITH ID(TAG): unsupervised-40000
DOCUMENT WITH ID(TAG): train-0
DOCUMENT WITH ID(TAG): train-10000
DOCUMENT WITH ID(TAG): train-20000
DOCUMENT WITH ID(TAG): unsupervised-0
DOCUMENT WITH ID(TAG): unsupervised-10000
DOCUMENT WITH ID(TAG): unsupervised-20000
DOCUMENT WITH ID(TAG): unsupervised-30000
DOCUMENT WITH ID(TAG): unsupervised-40000
DOCUMENT WITH ID(TAG): train-0
DOCUMENT WITH ID(TAG): train-10000
DOCUMENT WITH ID(TAG): train-20000
DOCUMENT WITH ID(TAG): unsupervised-0
DOCUMENT WITH ID(TAG): unsupervised-10000
DOCUMENT WITH ID(TAG): unsupervised-20000
DOCUMENT WITH ID(TAG): unsuper

In [7]:
from pprint import pprint
pprint(accuracies)

{'Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8)': 0.81804,
 'Doc2Vec(dbow,d100,n5,mc2,t8)': 0.88948,
 'Doc2Vec(dbow,d100,n5,mc2,t8)+Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t8)': 0.8884,
 'Doc2Vec(dbow,d100,n5,mc2,t8)+Doc2Vec(dm/c,d100,n5,w5,mc2,t8)': 0.8876,
 'Doc2Vec(dm/c,d100,n5,w5,mc2,t8)': 0.717}


As the results show, DBOW embeddings alone perform as well as any other solution. Concatenating DBOW and DM vectors brings no improvement here (about 88.8% accuracy, against 88.9% for DBOW alone), so it does not justify the extra time and computational power needed to train and predict with an additional DM model.

We did not reach the accuracy reported in the paper (92.58%): our best result was about 89%.
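
The gap to the paper can be made explicit with a few lines of Python. The accuracies are copied from the output above; the short keys are labels introduced here for readability, not the model names printed by gensim.

```python
# Accuracies copied from the results above, under shortened labels
accuracies = {
    "dbow": 0.88948,
    "dm/m": 0.81804,
    "dm/c": 0.717,
    "dbow+dm/m": 0.8884,
    "dbow+dm/c": 0.8876,
}
paper_accuracy = 0.9258

# Gap to the paper's accuracy, in percentage points
gaps = {name: (paper_accuracy - acc) * 100 for name, acc in accuracies.items()}

# Print the models from most to least accurate
for name in sorted(accuracies, key=accuracies.get, reverse=True):
    print(f"{name:<10} accuracy={accuracies[name]:.4f} gap={gaps[name]:.2f} points")

best_name = max(accuracies, key=accuracies.get)
print(best_name)  # dbow
```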