<h1 style='color:#00868b; text-align:center;'>Doc2Vec<span class="tocSkip"></span></h1>

# Read in csv file

We will be training a model on the balanced dataset.

In [3]:
import pandas as pd
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [4]:
# balanced
df_balanced = pd.read_csv("corpus_sprint3_balanced_cleaned_all.csv", encoding="utf-8")

In [6]:
print("nulls in df_balanced:", df_balanced["Consumer complaint narrative"].isnull().sum())

nulls in df_balanced: 0


We can run <code>conda install -c conda-forge gensim</code> in an Anaconda terminal to install Gensim.

We will now use the complaint narratives from our CSV file as input for the Doc2Vec algorithm. [Gensim](https://radimrehurek.com/gensim/models/doc2vec.html)'s implementation of Doc2Vec expects each consumer complaint narrative to be wrapped in a <code>TaggedDocument</code>. We can use the following code to achieve this, where <code>df_balanced</code> is our dataframe:

# Tokenization

In [7]:
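import nltk

# Note: nltk.sent_tokenize and nltk.word_tokenize need NLTK's Punkt tokenizer data to be downloaded.
def tokenize_text(text):
    """Split text into lowercase word tokens, dropping single-character tokens."""
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            # skip single-character tokens (mostly punctuation)
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens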

The input for a Doc2Vec model should be a list of TaggedDocument(['list','of','words'], [TAG]) objects. We could give every document a unique ID (such as a sequential serial number) as a document tag, or a shared string tag representing something else about it (such as the Product), or both at the same time. In the code below, each complaint is tagged with its row index, which gives every document a unique ID.

In [10]:
tagged_balanced = [TaggedDocument(words=tokenize_text(complaint), tags=[i]) for i, complaint in enumerate(df_balanced["Consumer complaint narrative"])]

# Distributed Bag of Words (DBOW)

[Lau and Baldwin](https://arxiv.org/pdf/1607.05368.pdf) provide recommendations on hyper-parameter settings for general-purpose applications.

The hyperparameters that were tuned by Lau and Baldwin are:
* vector size: the dimension of word vectors;
* window size: left/right context window size;
* min count: minimum frequency threshold for word types;
* sub-sampling: threshold to downsample high frequency words;
* negative sample: number of negative word samples;
* epoch: number of training epochs. More iterations take more time and eventually reach a point of diminishing returns.

They did not tune the initial and minimum learning rates (α and α<sub>min</sub>, respectively), and used the following values for all experiments: α = 0.025 and α<sub>min</sub> = 0.0001. The learning rate decreases linearly per epoch from the initial rate to the minimum rate ([source](https://arxiv.org/pdf/1607.05368.pdf)).
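As a rough illustration of a linear per-epoch decay, the sketch below computes one learning rate per epoch. The function name <code>linear_alpha_schedule</code> is hypothetical, and the assumption that the last epoch runs exactly at α<sub>min</sub> is ours; real implementations may apply the decay at a finer granularity than whole epochs.

```python
def linear_alpha_schedule(alpha, min_alpha, epochs):
    """Return one learning rate per epoch, decreasing linearly.

    Assumption: the first epoch uses `alpha` and the last epoch uses `min_alpha`.
    """
    if epochs == 1:
        return [alpha]
    step = (alpha - min_alpha) / (epochs - 1)
    return [alpha - step * epoch for epoch in range(epochs)]


rates = linear_alpha_schedule(alpha=0.025, min_alpha=0.0001, epochs=20)
for epoch, rate in enumerate(rates, start=1):
    print(f"epoch {epoch:2d}: alpha = {rate:.5f}")
```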

Creating the model and setting its parameters:
* We set <code>dm = 0</code>, which means we are using DBOW. The feature vectors are obtained by training a neural network on the task of predicting a probability distribution of words in a paragraph given a randomly-sampled word from the paragraph ([source](https://arxiv.org/pdf/1607.05368.pdf)).
* We set <code>window = 2</code>, which is the maximum distance between the current and predicted word within a sentence.
* We set <code>min_count = 2</code>, which means we ignore all words with a total frequency lower than this number.
* We set <code>vector_size = 50</code>, which is the dimensionality of the generated feature vectors.
* We set <code>alpha = 0.025</code> as the initial learning rate, which will linearly drop to <code>min_alpha = 0.0001</code> as training progresses.
* We train it for 20 epochs, <code>epochs = 20</code>.
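* We set <code>workers</code> to the number of available CPU cores, which determines how many worker threads are used for training (for example, <code>workers = 4</code> uses 4 worker threads).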
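To make the effect of <code>min_count</code> concrete, here is a small sketch in plain Python that drops every word whose total frequency across the corpus is below the threshold. The helper <code>filter_rare_words</code> and the toy documents are hypothetical; Gensim applies the threshold internally when it builds the vocabulary, so this only mimics the rule.

```python
from collections import Counter


def filter_rare_words(tokenized_docs, min_count):
    """Keep only words whose total corpus frequency is at least min_count."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return [[word for word in doc if counts[word] >= min_count]
            for doc in tokenized_docs]


# Toy corpus: "late" and "fee" occur twice, every other word once.
docs = [
    ["late", "fee", "charged"],
    ["late", "payment", "fee"],
    ["account", "closed"],
]

filtered = filter_rare_words(docs, min_count=2)
print(filtered)  # [['late', 'fee'], ['late', 'fee'], []]
```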

We can speed up training by using several worker threads. The <code>multiprocessing</code> module tells us how many CPU cores are available.

In [11]:
import multiprocessing
cores = multiprocessing.cpu_count()

Progress bar:

In [12]:
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")


## Create the model, build the vocabulary and train the model

### Balanced dataset

In [13]:
# create the model
model_balanced = Doc2Vec(dm=0, vector_size=50, window=2, min_count=2, alpha= 0.025, min_alpha=0.0001, epochs=20, workers=cores)

# build the vocabulary
model_balanced.build_vocab([x for x in tqdm(tagged_balanced)])

100%|██████████| 126593/126593 [00:00<00:00, 2946344.19it/s]


**Training**

We have chosen to train for 20 epochs, because 10 to 20 epochs are typical in published work ([source](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html)) on corpora of tens of thousands to millions of documents.

In [14]:
model_balanced.train(tagged_balanced, total_examples=model_balanced.corpus_count, epochs=model_balanced.epochs)

**Save the model**

We can use <code>save()</code> to save our model to a binary file.

In [15]:
model_balanced.save("model_doc2vec_balanced_20epochs")

This model can later be loaded using <code>Doc2Vec.load()</code>.

## Usage of this model 

The document vectors learned by this Doc2Vec model can now be combined with user-selected columns, such as Products, Issues, Sub-products and Sub-issues, to form the input for the unsupervised algorithms.
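One possible way to build such an input matrix is sketched below. It rests on several assumptions: the document vectors are replaced by random placeholder values, the product names are made up, and one-hot encoding is only one of several ways to represent a categorical column.

```python
import numpy as np
import pandas as pd

# Placeholder for the learned document vectors: 4 documents, 50 dimensions.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(4, 50))

# Hypothetical values of a user-selected column, one per document.
df = pd.DataFrame({"Product": ["Mortgage", "Credit card", "Mortgage", "Student loan"]})

# One-hot encode the column (assumption: one-hot is an acceptable encoding).
product_dummies = pd.get_dummies(df["Product"]).to_numpy(dtype=float)

# Concatenate document vectors and encoded columns into one feature matrix.
features = np.hstack([doc_vectors, product_dummies])
print(features.shape)  # (4, 53)
```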