<h1 style='color:#00868b; text-align:center;'>Doc2Vec<span class="tocSkip"></span></h1>

# Read in the CSV files

We will be training a model on both the imbalanced and balanced dataset.

In [1]:
import pandas as pd
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [2]:
# imbalanced
df_imbalanced = pd.read_csv("corpus_sprint2_imbalanced_cp.csv", encoding="utf-8")
# balanced
df_balanced = pd.read_csv("corpus_sprint2_balanced_cp.csv", encoding="utf-8")

We can run <code>conda install -c conda-forge gensim</code> in our Anaconda terminal to install Gensim.

We will now use our CSV files as input for the Doc2Vec algorithm. [Gensim](https://radimrehurek.com/gensim/models/doc2vec.html)'s implementation of Doc2Vec expects each consumer complaint narrative to be wrapped in a <code>TaggedDocument</code>. Before converting the narratives, we look at the first one in the imbalanced dataframe:

In [3]:
df_imbalanced["Consumer complaint narrative"][0]

'i have complained many times that the credit reporting by experian is inaccurate and they always just say wait awhile it will be fixed later and yet it never is. they are incapable of providing accurate information and do not take responsibility for their errors. this is a fake service designed to serve only their needs and not the consumer s needs. i can not find any way to cancel and when i mention it on the phone i am immediately disconnected. i want to be a part of the class action lawsuit against them since they are responsible for reducing my credit rating releasing my personal information and my fraud complaints have multiplied over the last 2-3 years. i have spoken to many ineffectual and uncaring agents on their phone support lines and am just tired of paying them to degrade my credit for random incorrect reasons like i paid a utility bill with another bank account and not raising it for legitimate things like paying off thousands of dollars of debt getting 3 new credit cards

There appears to be one null narrative, so we drop the rows that contain missing values.

In [4]:
df_imbalanced = df_imbalanced.dropna()
df_balanced = df_balanced.dropna()
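Note that <code>dropna()</code> without arguments removes every row that has a null in any column, not only in the narrative column. A minimal sketch of restricting the check to the narrative, assuming that is the only column that must be non-null (the miniature dataframe here is hypothetical):

```python
import pandas as pd

# Hypothetical miniature frame standing in for the complaint CSVs.
df = pd.DataFrame({
    "Product": ["Credit reporting", "Mortgage", None],
    "Consumer complaint narrative": ["report is wrong", None, "card was charged twice"],
})

# Drop only the rows whose narrative is missing; nulls in other columns are kept.
cleaned = df.dropna(subset=["Consumer complaint narrative"]).reset_index(drop=True)
print(len(df), "->", len(cleaned))
```

Whether nulls in other columns should also be dropped depends on how those columns are used later on.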

# Tokenization

In [5]:
import nltk
from nltk.corpus import stopwords

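def tokenize_text(text):
    """Split text into lowercase word tokens, skipping single-character tokens."""
    tokens = []
    # split into sentences first, then into words
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            # drop one-character tokens such as "a", "s" and most punctuation
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens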
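To get a feel for what this step produces without downloading the NLTK tokenizer data, here is a rough stand-in. It is a sketch only: <code>simple_tokenize</code> is a hypothetical helper that swaps the NLTK tokenizers for a plain regular expression, so its output will differ from <code>tokenize_text</code> on punctuation, contractions and tokens such as "2-3".

```python
import re

def simple_tokenize(text):
    """Lowercase the text, keep alphanumeric runs, and drop one-character tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= 2]

tokens = simple_tokenize("I can not find any way to cancel.")
print(tokens)  # ['can', 'not', 'find', 'any', 'way', 'to', 'cancel']
```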

The input for a Doc2Vec model should be a list of <code>TaggedDocument(['list', 'of', 'words'], [TAG])</code> objects. We could give every document a unique ID (such as a sequential serial number) as its tag, a shared string tag representing something else about it (such as the Product), or both at the same time. In the code below, each document is tagged with its sequential index.

In [6]:
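# tokenize each narrative first: TaggedDocument expects a list of words, not a raw string
tagged_imbalanced = [TaggedDocument(tokenize_text(complaint), [i]) for i, complaint in enumerate(df_imbalanced["Consumer complaint narrative"])]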


In [None]:
tagged_balanced = [TaggedDocument(tokenize_text(complaint), [i]) for i, complaint in enumerate(df_balanced["Consumer complaint narrative"])]

# Distributed Bag of Words (DBOW)

[Lau and Baldwin](https://arxiv.org/pdf/1607.05368.pdf) provide recommendations on hyper-parameter settings for general-purpose applications. They also conclude that DBOW is a better model than DMPV.

The hyperparameters that were tuned by Lau and Baldwin are:
* vector size: the dimension of word vectors;
* window size: left/right context window size;
* min count: minimum frequency threshold for word types;
* sub-sampling: threshold to downsample high frequency words;
* negative sample: number of negative word samples;
* epoch: number of training epochs. More iterations take more time and eventually reach a point of diminishing returns.

They did not tune the initial and minimum learning rates (α and α<sub>min</sub>, respectively), and used the following values for all experiments: α = .025 and α<sub>min</sub> = .0001. The learning rate decreases linearly per epoch from the initial rate to the minimum rate ([source](https://arxiv.org/pdf/1607.05368.pdf)).
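The linear decay can be sketched as follows. This is a simplified illustration, assuming one learning rate per epoch and that the last epoch uses exactly α<sub>min</sub>; <code>linear_alpha_schedule</code> is a hypothetical helper, and Gensim's internal schedule may update the rate more finely than once per epoch.

```python
def linear_alpha_schedule(alpha=0.025, min_alpha=0.0001, epochs=20):
    """Return one learning rate per epoch, linearly interpolated from alpha to min_alpha."""
    if epochs == 1:
        return [alpha]
    step = (alpha - min_alpha) / (epochs - 1)
    return [alpha - step * epoch for epoch in range(epochs)]

schedule = linear_alpha_schedule()
print(f"first={schedule[0]:.4f}, last={schedule[-1]:.4f}")
```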

Creating the model and setting its parameters:
* We set <code>dm = 0</code>, which means we are using DBOW. The feature vectors are obtained by training a neural network to predict words randomly sampled from a paragraph, given that paragraph's vector ([source](https://arxiv.org/pdf/1607.05368.pdf)).
* We set <code>window = 2</code>, which is the maximum distance between the current and predicted word within a sentence.
* We set <code>min_count = 2</code>, which means we ignore all words with a total frequency lower than this number.
* We set <code>vector_size = 50</code>, which is the dimensionality of the generated feature vectors.
* We set <code>alpha = 0.025</code> as the initial learning rate, which will linearly drop to <code>min_alpha = 0.0001</code> as training progresses.
* We set <code>epochs = 20</code>, so the model is trained for 20 epochs.
* We set <code>workers</code> to the number of CPU cores (<code>workers = cores</code>), so training uses one worker thread per core.

We can use multiple worker threads to speed up training, so we use <code>multiprocessing</code> to look up the number of available CPU cores.

In [None]:
import multiprocessing
cores = multiprocessing.cpu_count()

Progress bar:

In [None]:
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

## Create the model, build the vocabulary and train the model

### Imbalanced dataset

In [None]:
# create the model
model_imbalanced = Doc2Vec(dm=0, vector_size=50, window=2, min_count=2, alpha=0.025, min_alpha=0.0001, epochs=20, workers=cores)

# build the vocabulary
model_imbalanced.build_vocab([x for x in tqdm(tagged_imbalanced)])

**Training**

We have chosen to train for 20 epochs, because 10-20 epochs is common in published work ([source](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html)) when dealing with tens of thousands to millions of documents.

In [None]:
model_imbalanced.train(tagged_imbalanced, total_examples=model_imbalanced.corpus_count, epochs=model_imbalanced.epochs)

**Save the model**

We can use <code>save()</code> to save our model to a binary file.

In [None]:
model_imbalanced.save("model_doc2vec_imbalanced_20epochs")

This model can later be loaded using <code>Doc2Vec.load()</code>.

### Balanced dataset

In [None]:
# create the model
model_balanced = Doc2Vec(dm=0, vector_size=50, window=2, min_count=2, alpha=0.025, min_alpha=0.0001, epochs=20, workers=cores)

# build the vocabulary
model_balanced.build_vocab([x for x in tqdm(tagged_balanced)])

**Training**

We have chosen to train for 20 epochs, because 10-20 epochs is common in published work ([source](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html)) when dealing with tens of thousands to millions of documents.

In [None]:
model_balanced.train(tagged_balanced, total_examples=model_balanced.corpus_count, epochs=model_balanced.epochs)

**Save the model**

We can use <code>save()</code> to save our model to a binary file.

In [None]:
model_balanced.save("model_doc2vec_balanced_20epochs")

This model can later be loaded using <code>Doc2Vec.load()</code>.

## Usage of this model 

The feature vectors in this Doc2Vec model can now be combined with user-selected columns, such as Products, Issues, Sub-products and Sub-issues. Together they form the input for the unsupervised algorithms.
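One possible way to combine the two is sketched below. It assumes the selected columns are categorical and are one-hot encoded before being appended to the document vectors; the source does not specify the encoding, and the vectors and products shown are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: three 4-dimensional document vectors and their products.
doc_vectors = np.array([
    [0.1, -0.2, 0.3, 0.0],
    [0.0, 0.5, -0.1, 0.2],
    [-0.3, 0.1, 0.4, 0.6],
])
products = pd.Series(["Credit reporting", "Mortgage", "Credit reporting"], name="Product")

# One-hot encode the categorical column and append it to the document vectors.
product_dummies = pd.get_dummies(products).to_numpy(dtype=float)
features = np.hstack([doc_vectors, product_dummies])
print(features.shape)  # (3, 6)
```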