<h1 style='color:#00868b; text-align:center;'>Doc2Vec<span class="tocSkip"></span></h1>

# Read in csv file

We will be training a model on both the imbalanced and balanced dataset.

In [55]:
import pandas as pd
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [56]:
# imbalanced
df_imbalanced = pd.read_csv("corpus_sprint2_imbalanced_cp.csv", encoding="utf-8")
# balanced
df_balanced = pd.read_csv("corpus_sprint2_balanced_cp.csv", encoding="utf-8")

We can run <code>conda install -c conda-forge gensim</code> in our anaconda terminal to install Gensim.

We will now use our CSV files as input for our Doc2Vec algorithm. [Gensim](https://radimrehurek.com/gensim/models/doc2vec.html)'s implementation of Doc2Vec expects our consumer complaint narratives to be of class TaggedDocuments. We can use the following code to achieve this, where <code>df</code> represents our dataframe:

In [57]:
df_imbalanced["Consumer complaint narrative"][0]

'i have complained many times that the credit reporting by experian is inaccurate and they always just say wait awhile it will be fixed later and yet it never is. they are incapable of providing accurate information and do not take responsibility for their errors. this is a fake service designed to serve only their needs and not the consumer s needs. i can not find any way to cancel and when i mention it on the phone i am immediately disconnected. i want to be a part of the class action lawsuit against them since they are responsible for reducing my credit rating releasing my personal information and my fraud complaints have multiplied over the last 2-3 years. i have spoken to many ineffectual and uncaring agents on their phone support lines and am just tired of paying them to degrade my credit for random incorrect reasons like i paid a utility bill with another bank account and not raising it for legitimate things like paying off thousands of dollars of debt getting 3 new credit cards

There appears to be 1 narrative that is null. Drop it.

In [58]:
df_imbalanced = df_imbalanced.dropna()
df_balanced = df_balanced.dropna()

# Tokenization

In [59]:
import nltk
from nltk.corpus import stopwords

def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens

The input for a Doc2Vec model should be a list of TaggedDocument(['list','of','words'], [TAG]). We could give every document a unique ID (such as a sequential serial number) as a document tag, or a shared string tag representing something else about it, or both at the same time. We have chosen to pass the Product as a tag.

In [60]:
tagged_imbalanced = df.apply(
    lambda r: TaggedDocument(words=tokenize_text(r["Consumer complaint narrative"]), tags=[r.Product]), axis=1)


In [61]:
tagged_balanced = df.apply(
    lambda r: TaggedDocument(words=tokenize_text(r["Consumer complaint narrative"]), tags=[r.Product]), axis=1)

# Distributed Bag of Words (DBOW)

[Lau and Baldwin](https://arxiv.org/pdf/1607.05368.pdf) provide recommendations on hyper-parameter settings for general-purpose applications. They also conclude that DBOW is a better model than DMPV.

The hyperparameters that were tuned by Lau and Baldwin are:
* vector size: the dimension of word vectors;
* window size: left/right context window size;
* min count: minimum fequency threshold for word types;
* sub-sampling: threshold to downsample high frequency words;
* negative sample: number of negative word samples;
* epoch: number of training epochs. More iterations take more time and eventually reach a point of diminishing returns.

They did not tune the initial and minimum learning rates (α and α<sub>min</sub>, respectively), and uses the the following values for all experiments: α = .025 and α<sub>min</sub> = .0001. The learning rate decreases linearly per epoch from the initial rate to the minimum rate ([source](https://arxiv.org/pdf/1607.05368.pdf)).

Creating the model and setting its parameters:
* We set <code>dm = 0</code>, which means we are using DBOW. The feature vectors are obtained by training a neural network on the task of predicting a probability distribution of words in a paragraph given a randomly-sampled word from the paragraph ([source](https://arxiv.org/pdf/1607.05368.pdf)).
* We set <code>window = 2</code>, which is the maximum distance between the current and predicted word within a sentence.
* We set <code>min_count = 2</code>, which means we ignore all words with a total frequency lower than this number.
* We set <code>vector_size = 50</code>, which means we choose the dimensionality of the generated feature vectors.
* We set <code>alpha = 0.025</code> as the initial learning rate, which will linearly drop to <code>min_alpha = 0.0001</code> as training progresses.
* We train it for 20 epochs, <code>epochs = 20</code>.
* <code>workers = 4</code> allows us to use 4 worker threads.

We can use multiprocessing to speed up learning.

In [62]:
import multiprocessing
cores = multiprocessing.cpu_count()

Progress bar:

In [67]:
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

  from pandas import Panel


## Create the model, build the vocabulary and train the model

### Imbalanced dataset

In [68]:
# create the model
model_imbalanced = Doc2Vec(dm=0, vector_size=50, window=2, min_count=2, alpha= 0.025, min_alpha=0.0001, epochs=20, workers=cores)

# build the vocabulary
model_imbalanced.build_vocab([x for x in tqdm(tagged_imbalanced)])

100%|██████████| 485700/485700 [00:00<00:00, 2585724.58it/s]

collecting all words and their counts
PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags





PROGRESS: at example #10000, processed 1686727 words (5492981/s), 23089 word types, 13 tags
PROGRESS: at example #20000, processed 3347511 words (5191453/s), 32853 word types, 14 tags
PROGRESS: at example #30000, processed 5074107 words (5396762/s), 40885 word types, 15 tags
PROGRESS: at example #40000, processed 6776756 words (5592520/s), 47901 word types, 15 tags
PROGRESS: at example #50000, processed 8388078 words (5125932/s), 53878 word types, 15 tags
PROGRESS: at example #60000, processed 10021353 words (4634181/s), 59470 word types, 16 tags
PROGRESS: at example #70000, processed 11636600 words (5069954/s), 64668 word types, 16 tags
PROGRESS: at example #80000, processed 13358393 words (5041528/s), 70242 word types, 16 tags
PROGRESS: at example #90000, processed 15167662 words (4665715/s), 75921 word types, 17 tags
PROGRESS: at example #100000, processed 16981389 words (4772450/s), 81543 word types, 17 tags
PROGRESS: at example #110000, processed 18819956 words (5591383/s), 87511 

**Training**

We have chosen to train for 20 epochs, because a value of 10-20 is common in published work ([source](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html)) when dealing with tens-of-thousands to millions of documents. 

In [71]:
model_imbalanced.train(tagged_imbalanced, total_examples=model_imbalanced.corpus_count, epochs=model_imbalanced.epochs)

training model with 8 workers on 99613 vocabulary and 50 features, using sg=1 hs=0 sample=0.001 negative=5 window=2
EPOCH 1 - PROGRESS: at 1.90% examples, 1148070 words/s, in_qsize 15, out_qsize 1
EPOCH 1 - PROGRESS: at 3.99% examples, 1201078 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 6.19% examples, 1250797 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 8.40% examples, 1272008 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 10.56% examples, 1262608 words/s, in_qsize 16, out_qsize 0
EPOCH 1 - PROGRESS: at 12.79% examples, 1269948 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 15.04% examples, 1277982 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 17.14% examples, 1284837 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 19.15% examples, 1285931 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 21.23% examples, 1291629 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 23.29% examples, 1295081 words/s, in_qsize 15

EPOCH 2 - PROGRESS: at 92.81% examples, 1235401 words/s, in_qsize 15, out_qsize 0
EPOCH 2 - PROGRESS: at 94.97% examples, 1237509 words/s, in_qsize 15, out_qsize 0
EPOCH 2 - PROGRESS: at 97.13% examples, 1239551 words/s, in_qsize 15, out_qsize 0
EPOCH 2 - PROGRESS: at 99.33% examples, 1241935 words/s, in_qsize 15, out_qsize 0
worker thread finished; awaiting finish of 7 more threads
worker thread finished; awaiting finish of 6 more threads
worker thread finished; awaiting finish of 5 more threads
worker thread finished; awaiting finish of 4 more threads
worker thread finished; awaiting finish of 3 more threads
worker thread finished; awaiting finish of 2 more threads
worker thread finished; awaiting finish of 1 more threads
worker thread finished; awaiting finish of 0 more threads
EPOCH - 2 : training on 82938993 raw words (61537809 effective words) took 49.5s, 1242862 effective words/s
EPOCH 3 - PROGRESS: at 2.16% examples, 1301027 words/s, in_qsize 15, out_qsize 0
EPOCH 3 - PROGRESS:

EPOCH 4 - PROGRESS: at 78.79% examples, 1269343 words/s, in_qsize 16, out_qsize 0
EPOCH 4 - PROGRESS: at 80.91% examples, 1270131 words/s, in_qsize 15, out_qsize 0
EPOCH 4 - PROGRESS: at 83.00% examples, 1270415 words/s, in_qsize 15, out_qsize 0
EPOCH 4 - PROGRESS: at 85.11% examples, 1270418 words/s, in_qsize 15, out_qsize 0
EPOCH 4 - PROGRESS: at 86.94% examples, 1266807 words/s, in_qsize 14, out_qsize 1
EPOCH 4 - PROGRESS: at 88.81% examples, 1263568 words/s, in_qsize 15, out_qsize 0
EPOCH 4 - PROGRESS: at 90.59% examples, 1259420 words/s, in_qsize 15, out_qsize 0
EPOCH 4 - PROGRESS: at 92.41% examples, 1256795 words/s, in_qsize 15, out_qsize 0
EPOCH 4 - PROGRESS: at 94.40% examples, 1256578 words/s, in_qsize 15, out_qsize 0
EPOCH 4 - PROGRESS: at 96.56% examples, 1257858 words/s, in_qsize 15, out_qsize 0
EPOCH 4 - PROGRESS: at 98.74% examples, 1260052 words/s, in_qsize 15, out_qsize 0
worker thread finished; awaiting finish of 7 more threads
worker thread finished; awaiting finish 

EPOCH 6 - PROGRESS: at 62.95% examples, 1282034 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 65.12% examples, 1283554 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 67.20% examples, 1281786 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 68.80% examples, 1273527 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 70.83% examples, 1273686 words/s, in_qsize 14, out_qsize 1
EPOCH 6 - PROGRESS: at 72.76% examples, 1269486 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 74.83% examples, 1270942 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 76.77% examples, 1267970 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 78.93% examples, 1269714 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 80.96% examples, 1269429 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 82.95% examples, 1267855 words/s, in_qsize 16, out_qsize 0
EPOCH 6 - PROGRESS: at 85.06% examples, 1267823 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRE

EPOCH 8 - PROGRESS: at 46.76% examples, 1308931 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 48.97% examples, 1310532 words/s, in_qsize 15, out_qsize 1
EPOCH 8 - PROGRESS: at 51.18% examples, 1312973 words/s, in_qsize 16, out_qsize 0
EPOCH 8 - PROGRESS: at 53.32% examples, 1312394 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 55.50% examples, 1312230 words/s, in_qsize 14, out_qsize 1
EPOCH 8 - PROGRESS: at 57.70% examples, 1311701 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 59.93% examples, 1312322 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 62.10% examples, 1311835 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 64.12% examples, 1309925 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 66.35% examples, 1310500 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 68.59% examples, 1312359 words/s, in_qsize 13, out_qsize 2
EPOCH 8 - PROGRESS: at 70.74% examples, 1314280 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRE

EPOCH 10 - PROGRESS: at 25.51% examples, 1311969 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 27.68% examples, 1318654 words/s, in_qsize 16, out_qsize 0
EPOCH 10 - PROGRESS: at 29.40% examples, 1298678 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 31.33% examples, 1290605 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 33.29% examples, 1281899 words/s, in_qsize 16, out_qsize 0
EPOCH 10 - PROGRESS: at 35.33% examples, 1277558 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 37.48% examples, 1279052 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 39.70% examples, 1284630 words/s, in_qsize 14, out_qsize 1
EPOCH 10 - PROGRESS: at 41.76% examples, 1283258 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 43.94% examples, 1288600 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 46.15% examples, 1292076 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 48.27% examples, 1291375 words/s, in_qsize 15, out_qsize 0
EPOC

EPOCH 12 - PROGRESS: at 8.08% examples, 1225134 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 10.12% examples, 1217199 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 12.36% examples, 1232619 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 14.66% examples, 1247433 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 16.91% examples, 1268372 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 19.07% examples, 1283635 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 21.27% examples, 1298283 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 23.46% examples, 1308176 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 25.61% examples, 1315394 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 27.75% examples, 1320317 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 29.98% examples, 1322894 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 32.19% examples, 1321606 words/s, in_qsize 15, out_qsize 0
EPOCH

worker thread finished; awaiting finish of 7 more threads
worker thread finished; awaiting finish of 6 more threads
worker thread finished; awaiting finish of 5 more threads
worker thread finished; awaiting finish of 4 more threads
worker thread finished; awaiting finish of 3 more threads
worker thread finished; awaiting finish of 2 more threads
worker thread finished; awaiting finish of 1 more threads
worker thread finished; awaiting finish of 0 more threads
EPOCH - 13 : training on 82938993 raw words (61534813 effective words) took 49.0s, 1256097 effective words/s
EPOCH 14 - PROGRESS: at 2.14% examples, 1284649 words/s, in_qsize 16, out_qsize 1
EPOCH 14 - PROGRESS: at 4.30% examples, 1286851 words/s, in_qsize 15, out_qsize 0
EPOCH 14 - PROGRESS: at 6.44% examples, 1298776 words/s, in_qsize 15, out_qsize 0
EPOCH 14 - PROGRESS: at 8.37% examples, 1269777 words/s, in_qsize 14, out_qsize 1
EPOCH 14 - PROGRESS: at 10.45% examples, 1254503 words/s, in_qsize 16, out_qsize 0
EPOCH 14 - PROGR

EPOCH 15 - PROGRESS: at 81.78% examples, 1283032 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 83.96% examples, 1283761 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 86.00% examples, 1282450 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 88.23% examples, 1284222 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 90.48% examples, 1286338 words/s, in_qsize 14, out_qsize 1
EPOCH 15 - PROGRESS: at 92.69% examples, 1288608 words/s, in_qsize 16, out_qsize 0
EPOCH 15 - PROGRESS: at 94.95% examples, 1290873 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 97.17% examples, 1292664 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 99.30% examples, 1293302 words/s, in_qsize 16, out_qsize 0
worker thread finished; awaiting finish of 7 more threads
worker thread finished; awaiting finish of 6 more threads
worker thread finished; awaiting finish of 5 more threads
worker thread finished; awaiting finish of 4 more threads
worker thread finishe

EPOCH 17 - PROGRESS: at 62.62% examples, 1276403 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 64.45% examples, 1271914 words/s, in_qsize 16, out_qsize 1
EPOCH 17 - PROGRESS: at 66.13% examples, 1263335 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 68.08% examples, 1260069 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 70.06% examples, 1260506 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 71.95% examples, 1257110 words/s, in_qsize 16, out_qsize 0
EPOCH 17 - PROGRESS: at 74.11% examples, 1258830 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 76.02% examples, 1256838 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 77.78% examples, 1251329 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 79.89% examples, 1253103 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 82.10% examples, 1256479 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 84.42% examples, 1259722 words/s, in_qsize 15, out_qsize 0
EPOC

EPOCH 19 - PROGRESS: at 47.73% examples, 1334107 words/s, in_qsize 16, out_qsize 1
EPOCH 19 - PROGRESS: at 49.78% examples, 1330073 words/s, in_qsize 16, out_qsize 0
EPOCH 19 - PROGRESS: at 51.87% examples, 1329257 words/s, in_qsize 15, out_qsize 0
EPOCH 19 - PROGRESS: at 54.12% examples, 1329658 words/s, in_qsize 15, out_qsize 1
EPOCH 19 - PROGRESS: at 56.42% examples, 1331108 words/s, in_qsize 15, out_qsize 0
EPOCH 19 - PROGRESS: at 58.73% examples, 1332742 words/s, in_qsize 16, out_qsize 0
EPOCH 19 - PROGRESS: at 61.03% examples, 1332684 words/s, in_qsize 15, out_qsize 0
EPOCH 19 - PROGRESS: at 63.22% examples, 1333708 words/s, in_qsize 14, out_qsize 1
EPOCH 19 - PROGRESS: at 65.49% examples, 1335080 words/s, in_qsize 15, out_qsize 0
EPOCH 19 - PROGRESS: at 67.64% examples, 1332946 words/s, in_qsize 16, out_qsize 0
EPOCH 19 - PROGRESS: at 69.55% examples, 1330109 words/s, in_qsize 16, out_qsize 1
EPOCH 19 - PROGRESS: at 71.56% examples, 1326350 words/s, in_qsize 16, out_qsize 1
EPOC

**Save the model**

We can use <code>save()</code> to save our model to a binary file.

In [72]:
model_imbalanced.save("model_doc2vec_imbalanced_20epochs")

saving Doc2Vec object under model_doc2vec_20epochs, separately None
saved model_doc2vec_20epochs


This model can later be loaded using <code>load()</code>.

### Balanced dataset

In [73]:
# create the model
model_balanced = Doc2Vec(dm=0, vector_size=50, window=2, min_count=2, alpha= 0.025, min_alpha=0.0001, epochs=20, workers=cores)

# build the vocabulary
model_balanced.build_vocab([x for x in tqdm(tagged_balanced)])

100%|██████████| 485700/485700 [00:00<00:00, 2585353.77it/s]

collecting all words and their counts
PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags





PROGRESS: at example #10000, processed 1686727 words (5785054/s), 23089 word types, 13 tags
PROGRESS: at example #20000, processed 3347511 words (5739817/s), 32853 word types, 14 tags
PROGRESS: at example #30000, processed 5074107 words (5244739/s), 40885 word types, 15 tags
PROGRESS: at example #40000, processed 6776756 words (5454493/s), 47901 word types, 15 tags
PROGRESS: at example #50000, processed 8388078 words (5427309/s), 53878 word types, 15 tags
PROGRESS: at example #60000, processed 10021353 words (5585354/s), 59470 word types, 16 tags
PROGRESS: at example #70000, processed 11636600 words (5252634/s), 64668 word types, 16 tags
PROGRESS: at example #80000, processed 13358393 words (5097625/s), 70242 word types, 16 tags
PROGRESS: at example #90000, processed 15167662 words (4873896/s), 75921 word types, 17 tags
PROGRESS: at example #100000, processed 16981389 words (5540804/s), 81543 word types, 17 tags
PROGRESS: at example #110000, processed 18819956 words (5438757/s), 87511 

**Training**

We have chosen to train for 20 epochs, because a value of 10-20 is common in published work ([source](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html)) when dealing with tens-of-thousands to millions of documents. 

In [74]:
model_balanced.train(tagged_imbalanced, total_examples=model_imbalanced.corpus_count, epochs=model_imbalanced.epochs)

training model with 8 workers on 99613 vocabulary and 50 features, using sg=1 hs=0 sample=0.001 negative=5 window=2
EPOCH 1 - PROGRESS: at 2.10% examples, 1277073 words/s, in_qsize 13, out_qsize 2
EPOCH 1 - PROGRESS: at 4.36% examples, 1315870 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 6.55% examples, 1328947 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 8.83% examples, 1341852 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 11.09% examples, 1333002 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 13.31% examples, 1328991 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 15.10% examples, 1291897 words/s, in_qsize 14, out_qsize 1
EPOCH 1 - PROGRESS: at 17.24% examples, 1299177 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 19.44% examples, 1314293 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 21.70% examples, 1329268 words/s, in_qsize 15, out_qsize 0
EPOCH 1 - PROGRESS: at 23.96% examples, 1340971 words/s, in_qsize 14

worker thread finished; awaiting finish of 4 more threads
worker thread finished; awaiting finish of 3 more threads
worker thread finished; awaiting finish of 2 more threads
worker thread finished; awaiting finish of 1 more threads
worker thread finished; awaiting finish of 0 more threads
EPOCH - 2 : training on 82938993 raw words (61536620 effective words) took 46.1s, 1333758 effective words/s
EPOCH 3 - PROGRESS: at 2.13% examples, 1290070 words/s, in_qsize 15, out_qsize 0
EPOCH 3 - PROGRESS: at 4.30% examples, 1291774 words/s, in_qsize 15, out_qsize 0
EPOCH 3 - PROGRESS: at 6.44% examples, 1302058 words/s, in_qsize 15, out_qsize 0
EPOCH 3 - PROGRESS: at 8.62% examples, 1306586 words/s, in_qsize 15, out_qsize 0
EPOCH 3 - PROGRESS: at 10.86% examples, 1301197 words/s, in_qsize 15, out_qsize 0
EPOCH 3 - PROGRESS: at 13.11% examples, 1304722 words/s, in_qsize 15, out_qsize 0
EPOCH 3 - PROGRESS: at 15.38% examples, 1312103 words/s, in_qsize 16, out_qsize 0
EPOCH 3 - PROGRESS: at 17.60% ex

EPOCH 4 - PROGRESS: at 93.57% examples, 1302151 words/s, in_qsize 15, out_qsize 0
EPOCH 4 - PROGRESS: at 95.90% examples, 1304939 words/s, in_qsize 15, out_qsize 0
EPOCH 4 - PROGRESS: at 98.00% examples, 1304678 words/s, in_qsize 15, out_qsize 0
worker thread finished; awaiting finish of 7 more threads
worker thread finished; awaiting finish of 6 more threads
worker thread finished; awaiting finish of 5 more threads
worker thread finished; awaiting finish of 4 more threads
worker thread finished; awaiting finish of 3 more threads
worker thread finished; awaiting finish of 2 more threads
worker thread finished; awaiting finish of 1 more threads
worker thread finished; awaiting finish of 0 more threads
EPOCH - 4 : training on 82938993 raw words (61540176 effective words) took 47.2s, 1304988 effective words/s
EPOCH 5 - PROGRESS: at 2.19% examples, 1331832 words/s, in_qsize 15, out_qsize 0
EPOCH 5 - PROGRESS: at 4.44% examples, 1339273 words/s, in_qsize 16, out_qsize 0
EPOCH 5 - PROGRESS: 

EPOCH 6 - PROGRESS: at 54.52% examples, 982288 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 56.83% examples, 993032 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 59.10% examples, 1003068 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 61.39% examples, 1012823 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 63.35% examples, 1018150 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 65.33% examples, 1022807 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 66.91% examples, 1020624 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 68.63% examples, 1021974 words/s, in_qsize 16, out_qsize 0
EPOCH 6 - PROGRESS: at 70.82% examples, 1030499 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 72.94% examples, 1035589 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 74.77% examples, 1038895 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS: at 76.45% examples, 1038224 words/s, in_qsize 15, out_qsize 0
EPOCH 6 - PROGRESS

EPOCH 8 - PROGRESS: at 35.49% examples, 1281495 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 37.57% examples, 1281488 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 39.79% examples, 1286776 words/s, in_qsize 16, out_qsize 1
EPOCH 8 - PROGRESS: at 41.99% examples, 1289384 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 44.14% examples, 1293472 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 46.18% examples, 1292338 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 48.20% examples, 1289052 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 50.44% examples, 1291445 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 52.66% examples, 1294980 words/s, in_qsize 16, out_qsize 0
EPOCH 8 - PROGRESS: at 54.85% examples, 1296159 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 57.04% examples, 1295253 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRESS: at 59.13% examples, 1293193 words/s, in_qsize 15, out_qsize 0
EPOCH 8 - PROGRE

EPOCH 10 - PROGRESS: at 16.54% examples, 1237430 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 18.60% examples, 1249526 words/s, in_qsize 14, out_qsize 1
EPOCH 10 - PROGRESS: at 20.27% examples, 1234265 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 22.16% examples, 1234856 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 23.78% examples, 1216001 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 25.56% examples, 1211910 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 27.55% examples, 1216263 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 29.48% examples, 1214035 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 31.38% examples, 1210406 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 33.43% examples, 1209751 words/s, in_qsize 15, out_qsize 0
EPOCH 10 - PROGRESS: at 35.15% examples, 1198203 words/s, in_qsize 14, out_qsize 1
EPOCH 10 - PROGRESS: at 37.29% examples, 1203535 words/s, in_qsize 15, out_qsize 0
EPOC

worker thread finished; awaiting finish of 2 more threads
worker thread finished; awaiting finish of 1 more threads
worker thread finished; awaiting finish of 0 more threads
EPOCH - 11 : training on 82938993 raw words (61536369 effective words) took 48.3s, 1273347 effective words/s
EPOCH 12 - PROGRESS: at 2.14% examples, 1283072 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 4.41% examples, 1325329 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 6.70% examples, 1351508 words/s, in_qsize 14, out_qsize 1
EPOCH 12 - PROGRESS: at 8.91% examples, 1348922 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 10.91% examples, 1308545 words/s, in_qsize 15, out_qsize 1
EPOCH 12 - PROGRESS: at 13.11% examples, 1305279 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 15.26% examples, 1301803 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 17.23% examples, 1294004 words/s, in_qsize 15, out_qsize 0
EPOCH 12 - PROGRESS: at 19.28% examples, 1298363 words/s,

EPOCH 13 - PROGRESS: at 88.10% examples, 1253952 words/s, in_qsize 15, out_qsize 0
EPOCH 13 - PROGRESS: at 90.30% examples, 1255857 words/s, in_qsize 15, out_qsize 0
EPOCH 13 - PROGRESS: at 92.46% examples, 1258053 words/s, in_qsize 15, out_qsize 0
EPOCH 13 - PROGRESS: at 94.63% examples, 1260178 words/s, in_qsize 15, out_qsize 0
EPOCH 13 - PROGRESS: at 96.84% examples, 1262202 words/s, in_qsize 15, out_qsize 0
EPOCH 13 - PROGRESS: at 99.03% examples, 1264298 words/s, in_qsize 16, out_qsize 0
worker thread finished; awaiting finish of 7 more threads
worker thread finished; awaiting finish of 6 more threads
worker thread finished; awaiting finish of 5 more threads
worker thread finished; awaiting finish of 4 more threads
worker thread finished; awaiting finish of 3 more threads
worker thread finished; awaiting finish of 2 more threads
worker thread finished; awaiting finish of 1 more threads
worker thread finished; awaiting finish of 0 more threads
EPOCH - 13 : training on 82938993 raw 

EPOCH 15 - PROGRESS: at 74.03% examples, 1330117 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 76.15% examples, 1330311 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 78.29% examples, 1328351 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 80.42% examples, 1327748 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 82.34% examples, 1324910 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 84.54% examples, 1324496 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 86.71% examples, 1324599 words/s, in_qsize 14, out_qsize 1
EPOCH 15 - PROGRESS: at 89.02% examples, 1326807 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 91.34% examples, 1328361 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 93.44% examples, 1328844 words/s, in_qsize 15, out_qsize 0
EPOCH 15 - PROGRESS: at 95.59% examples, 1328546 words/s, in_qsize 16, out_qsize 0
EPOCH 15 - PROGRESS: at 97.71% examples, 1328184 words/s, in_qsize 15, out_qsize 0
work

EPOCH 17 - PROGRESS: at 61.31% examples, 1292253 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 63.47% examples, 1293909 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 65.71% examples, 1296051 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 67.93% examples, 1296084 words/s, in_qsize 14, out_qsize 1
EPOCH 17 - PROGRESS: at 70.20% examples, 1301576 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 72.52% examples, 1304262 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 74.65% examples, 1305268 words/s, in_qsize 14, out_qsize 1
EPOCH 17 - PROGRESS: at 76.86% examples, 1305611 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 79.00% examples, 1306064 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 81.22% examples, 1307832 words/s, in_qsize 15, out_qsize 0
EPOCH 17 - PROGRESS: at 83.45% examples, 1309065 words/s, in_qsize 16, out_qsize 0
EPOCH 17 - PROGRESS: at 85.70% examples, 1310525 words/s, in_qsize 15, out_qsize 0
EPOC

EPOCH 19 - PROGRESS: at 50.53% examples, 1350217 words/s, in_qsize 16, out_qsize 0
EPOCH 19 - PROGRESS: at 52.69% examples, 1349677 words/s, in_qsize 16, out_qsize 0
EPOCH 19 - PROGRESS: at 54.85% examples, 1348443 words/s, in_qsize 15, out_qsize 0
EPOCH 19 - PROGRESS: at 56.93% examples, 1343510 words/s, in_qsize 15, out_qsize 0
EPOCH 19 - PROGRESS: at 59.07% examples, 1340784 words/s, in_qsize 15, out_qsize 0
EPOCH 19 - PROGRESS: at 61.22% examples, 1338512 words/s, in_qsize 15, out_qsize 0
EPOCH 19 - PROGRESS: at 63.25% examples, 1336005 words/s, in_qsize 15, out_qsize 0
EPOCH 19 - PROGRESS: at 65.38% examples, 1334213 words/s, in_qsize 15, out_qsize 0
EPOCH 19 - PROGRESS: at 67.40% examples, 1329867 words/s, in_qsize 15, out_qsize 0
EPOCH 19 - PROGRESS: at 69.55% examples, 1331330 words/s, in_qsize 14, out_qsize 1
EPOCH 19 - PROGRESS: at 71.49% examples, 1326622 words/s, in_qsize 15, out_qsize 0
EPOCH 19 - PROGRESS: at 73.75% examples, 1327241 words/s, in_qsize 15, out_qsize 0
EPOC

**Save the model**

We can use <code>save()</code> to save our model to a binary file.

In [75]:
model_balanced.save("model_doc2vec_balanced_20epochs")

saving Doc2Vec object under model_doc2vec_balanced_20epochs, separately None
saved model_doc2vec_balanced_20epochs


This model can later be loaded using <code>load()</code>.

## Usage of this model 

The feature vectors in this Doc2Vec Model can now be used as inputs in conjunction with user-selected columns, such as Products, Issues, Sub-products and Sub-issues. This creates the input for the unsupervised algorithms.