# Doc2Vec Model
In this notebook I am using Gensim API to learn paragraph and document embeddings via the distributed memory and distributed bag of words models from Quoc Le and Tomas Mikolov: “Distributed Representations of Sentences and Documents”.



## 1. Introduction
Word2vec, created by a team of researchers at Google led by Tomáš Mikolov, implements a word embedding model that enables us to create these kinds of distributed representations. The word2vec algorithm trains word representations based on either a continuous bag-of-words (CBOW) or skip-gram model, such that words are embedded in space along with similar words based on their context. For example, Gensim’s implementation uses a feedforward network.


The doc2vec algorithm is an extension of word2vec. It proposes a paragraph vector—an unsupervised algorithm that learns fixed-length feature representations from variable length documents. This representation attempts to inherit the semantic properties of words such that “red” and “colorful” are more similar to each other than they are to “river” or “governance.” Moreover, the paragraph vector takes into consideration the ordering of words within a narrow context, similar to an n-gram model. The combined result is much more effective than a bag-of-words or bag-of-n-grams model because it generalizes better and has a lower dimensionality but still is of a fixed length so it can be used in common machine learning algorithms.

##### The Gensim way
Neither NLTK nor Scikit-Learn provide implementations of these kinds of word embeddings. Gensim’s implementation allows users to train both word2vec and doc2vec models on custom corpora and also conveniently comes with a model that is pretrained on the Google news corpus.

To train the model first, I load the corpus into memory and create a list of TaggedDocument objects, which extend the LabeledSentence, and in turn the distributed representation of word2vec. TaggedDocument objects consist of words and tags. I can instantiate the tagged document with the list of tokens along with the article index, that uniquely identifies the instance.

Once I have a list of tagged documents, the code instantiate the Doc2Vec model and specify the size of the vector as well as the minimum count, which ignores all tokens that have a frequency less than that number. Once instantiated, an unsupervised neural network is trained to learn the vector representations, which can then be accessed via the docvecs property.

The model itself can be saved to disk and retrained in an active fashion, making it extremely flexible for a variety of use cases. However, on larger corpora, training can be slow and memory intensive, and it might not be as good as a TF–IDF model with Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) applied to reduce the feature space. 

In [30]:
## Import libraries

## General
import pandas as pd
from tqdm import tqdm
from pprint import pprint
import time

## Gensim
from gensim.models.doc2vec import TaggedDocument, Doc2Vec
from gensim.utils import tokenize, simple_preprocess


## Scikit-Learn 
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from skopt import BayesSearchCV
from skopt import dump, load

## 2. Import data
First I load the pre-processed text into the notebook and use the categories I generated in the previous notebook to merge less frequent categories with the most appropriate ones. The aim of this step is to reduce the number of categories to 5. The reason for this apporach is the imbalance among the categories in the dataset.

In [31]:
# Read the data
df_normal_text = pd.read_csv('../data/interim/covid_articles_normalized.csv')

## Merge Tags

tag_map = {'consumer':'general',
           'healthcare':'science',
           'automotive':'business',
           'environment':'science',
           'construction':'business',
           'ai':'tech'}

df_normal_text['tags'] = [(lambda tags: tag_map[tags] if tags in tag_map.keys() else tags)(tags)
                          for tags in df_normal_text['topic_area']]
df_normal_text.tags.value_counts()

business    245652
general      86372
finance      22386
tech          8915
science       5595
Name: tags, dtype: int64

## 3. Gensim Doc2Vec model
This step involves 
* **Splitting the data in train and test groups**: I use the Scikit-Learn train-test split model with a ratio of 30% test data. 
* **Tokenizing and tagging the articles**: Gensim has a tokenizer to which create word tokens. I used article index as the tag for each entry. Tags are unique IDs for each article used to look-up the learned vectors after training. 
* **Initializing the model**: Training a Doc2Vec model is a memory intensive process. I had to adopt measures to fit the data in memory. A tool for managing the size of the model is the vector_size which indicates the dimensionality of the feature vectors. I set it to 100 for the base case. The default number of iterations (epochs) over the corpus is 10 for Doc2Vec. Typical iteration counts in the published Paragraph Vector paper results, using 10s-of-thousands to millions of docs, are 10-20. More iterations take more time and eventually reach a point of diminishing returns. I used 15 epochs for the base case and will try different values with bayes search.
* **Building the vocabulary dictionary**: The vocabulary is a list of all of the unique words extracted from the training corpus.
* **training the doc2wev NN**: Use the ```model.train``` to fit the create embedings for each article.
* **Infer Vectors**: I use the trained model to infer a vector for any piece of text by passing a list of words to the ```model.infer_vector``` function. This vector can then be compared with other vectors via cosine similarity.

In [32]:
## Split the data in train and test
train, test = train_test_split(df_normal_text[['content', 'tags']], test_size=0.3, random_state=21)

## Tokenizing and tagging the articles
train_tagged=[]
for index, row in train.iterrows():
    train_tagged.append(TaggedDocument(words=list(tokenize(row.content)), tags=str(index)))

In [33]:
## Let's look at a tagged article
train_tagged[0]

TaggedDocument(words=['send', 'aviation', 'industry', 'liquidity', 'strap', 'across', 'world', 'send', 'administration', 'part', 'government', 'ownership', 'follow', 'global', 'massive', 'reduction', 'air', 'travel', 'international', 'air', 'transport', 'association', 'predict', 'sector', 'head', 'us', 'billion', 'billion', 'net', 'loss', 'direct', 'result', 'covid', 'many', 'struggle', 'navigate', 'uncharted', 'territory', 'one', 'survive', 'crisis', 'strategically', 'creative', 'must', 'find', 'way', 'public', 'health', 'maintain', 'profitability', 'one', 'way', 'embrace', 'ultra', 'long', 'haul', 'flight', 'research', 'aviation', 'consultant', 'scholar', 'university', 'investigate', 'operator', 'ultra', 'long', 'haul', 'flight', 'capacity', 'like', 'well', 'place', 'new', 'normal', 'emerge', 'people', 'priority', 'shift', 'toward', 'concern', 'health', 'ultra', 'long', 'haul', 'flight', 'significant', 'benefit', 'current', 'common', 'model', 'stop', 'change', 'plane', 'big', 'hub', 

In [34]:
## Initialize the model with vector size and number of epochs
model = Doc2Vec(vector_size=100, epochs=15)

In [35]:
## Building the vocabulary dictionary
model.build_vocab(train_tagged)

In [36]:
## Let's check how mnay times a word appeared in the corpus.
print(f"Word 'vaccine' appeared {model.wv.get_vecattr('vaccine', 'count')} times in the training corpus.")

Word 'vaccine' appeared 124781 times in the training corpus.


In [37]:
## Training the model
model.train(pd.Series(train_tagged).values, total_examples=model.corpus_count, epochs=model.epochs)

In [38]:
## A function to infer the documents
def vectorize(model, corpus):
    regressors, tags = zip(*[(model.infer_vector(doc[0].split(), steps=20), doc[1]) for doc in corpus.values])
    return regressors, tags

In [39]:
X_train, y_train = vectorize(model, train)
X_test, y_test = vectorize(model, test)

In [19]:
## Save model
#model.save('../models/gensim-model-doc2vec-vs300')

## Load model
#new_model = gensim.models.Word2Vec.load('../models/gensim-model-doc2vec-')

## 4. Classification Algorithm - Logistic Regression
In the previous notebook (classification with Tfidf), I concluded that showed that logistic regression is the most appropriate model for classsification of articles in this dataset. So, I will use logistic regression to classifiy the articles using the embeding generated with the doc2vec model.

In [40]:
clf_lr = LogisticRegression(solver='liblinear')
clf_lr.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

In [41]:
y_pred = clf_lr.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    business       0.76      0.92      0.83     73674
     finance       0.52      0.01      0.01      6763
     general       0.69      0.55      0.62     25951
     science       0.51      0.09      0.15      1614
        tech       0.50      0.07      0.12      2674

    accuracy                           0.75    110676
   macro avg       0.60      0.33      0.35    110676
weighted avg       0.72      0.75      0.71    110676



The results indicate that the class imbalance causes issues for classification of less populated categories. I've seen similar behaviour with Tfidf. In the next step I use bayes search to explore the hyperparamter space.

## 5. Hyper-parameter Tuning

### 5.1. Define a Doc2Vec Class To Integrate with Scikit-Learn API
In order to use the ```bayessearchcv``` I have to create a wrapper around the Gensim's Doc2Vec model that uses a similar structure as native Scikit-Learn classes. I passed the doc2vec parameters to ```__init__``` and then defined the tagging, tokenization, building vocabulary, and training in the fit method. The transform method retruns the infer_vector for train and test data.

In [15]:
class Doc2VecModel(BaseEstimator):

    def __init__(self, dm=1, vector_size=100, window=1, epochs=10, max_vocab_size=12e7, min_count=1):
        #print('>>>>>>>>init() called.\n')
        self.d2v_model = None
        self.vector_size = vector_size
        self.window = window
        self.dm = dm
        self.epochs = epochs
        self.max_vocab_size = max_vocab_size
        self.min_count = min_count

    def fit(self, corpus, y=None):
        #print('>>>>>>>>fit() called.\n')
        ## Initialize model
        self.d2v_model = Doc2Vec(vector_size=self.vector_size, window=self.window, dm=self.dm, epochs=self.epochs,
                                 max_vocab_size=self.max_vocab_size, min_count=self.min_count,
                                 alpha=0.025, min_alpha=0.001, seed=21)
        ## Tag docs
        docs_tagged=[]
        for index, row in corpus.iteritems():
            docs_tagged.append(TaggedDocument(words=list(tokenize(row)), tags=str(index)))
        ## Build vocabulary
        self.d2v_model.build_vocab(docs_tagged)
        ## Train model
        self.d2v_model.train(pd.Series(docs_tagged).values, total_examples=self.d2v_model.corpus_count, epochs=self.d2v_model.epochs)
        return self

    def transform(self, corpus):
        #print('>>>>>>>>transform() called.\n')
        sents = corpus.values
        regressors = [(self.d2v_model.infer_vector(doc[0].split(), steps=20)) for doc in sents]
        regressors = pd.DataFrame(regressors, index=corpus.index)
        return regressors


    def fit_transform(self, corpus, y=None):
        self.fit(corpus)
        return self.transform(corpus)

### 5.2. Search Configuration
The next step is to define the pipeline and parameter grid for bayessearch. 
* Doc2Vec hyperparameters
    * window – The maximum distance between the current and predicted word within a sentence.
    * dm -  Defines the training algorithm. If dm=1, ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed. 
    * vector_size – Dimensionality of the feature vectors.
    * min_count – Ignores all words with total frequency lower than this.
    * epochs – Number of iterations (epochs) over the corpus. Defaults to 10 for Doc2Vec.
    
The bayessearchcv with doc2vec model is a memory intensive process. To handle this task I had to limit the number of parallel processors. Since the doc2vec takes a longtime to create the embeddings and I had limited parallel workers, number of bayessearch iterations and crossfolds are reduced.

In [16]:
## Model Specifications
# Create pipeline
model_pipe = Pipeline([('vect', Doc2VecModel()),
                      ('clf', LogisticRegression(solver='liblinear'))])

# Parameter grid
param_grid = {'vect__window': list(range(5)),
              'vect__dm': [0,1],
              'vect__vector_size': list(range(100,400)),
              'vect__min_count': list(range(100)),
              'vect__epochs': [10,15,20,25,30],
              'clf__dual': (True,False),
              'clf__max_iter': [100,110,120,130,140],
              'clf__C': (1e-5, 1e2, "log-uniform"),
}

# Number of iterations: Number of parameter settings that are sampled.
n_iter = 15

# Split the data to train and test
train, test = train_test_split(df_normal_text[['content', 'tags']], test_size=0.3, random_state=21)

### 5.3. BayesSearchCV

In [17]:
## Initialize the search
bayes_search_cv = BayesSearchCV(estimator=model_pipe, search_spaces=param_grid,
                                n_iter=n_iter, n_jobs=2, verbose=1, random_state=21, cv=4)

## Print search configurations
print("Performing grid search...")
print("pipeline:", [name for name, _ in model_pipe.steps])
print("parameters:")
for key, value in param_grid.items():
    print('{}: {}'.format(key, value))

## Run the search
bayes_search_cv.fit(train.content, train.tags.values)

## Print best parameters and results
print("Best score: %0.3f" % bayes_search_cv.best_score_)
print("Best parameters set:")
best_parameters = bayes_search_cv.best_estimator_.get_params()

for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['vect', 'clf']
parameters:
vect__window: [0, 1, 2, 3, 4]
vect__dm: [0, 1]
vect__vector_size: [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273,

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 65.6min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 342.1min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 163.9min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 108.3min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 311.1min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 351.4min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 145.4min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 154.6min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 144.0min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 153.2min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 266.1min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 173.4min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 183.5min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 101.8min finished


Fitting 4 folds for each of 1 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed: 136.2min finished


Best score: 0.691
Best parameters set:
	clf__C: 0.005206429077063814
	clf__dual: True
	clf__max_iter: 120
	vect__dm: 0
	vect__epochs: 25
	vect__min_count: 93
	vect__vector_size: 219
	vect__window: 1


In [24]:
y_pred =bayes_search_cv.predict(test.content)
y_test = test.tags.values
print(classification_report(y_test, y_pred))

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

    business       0.70      0.96      0.81     73674
     finance       0.00      0.00      0.00      6763
     general       0.62      0.22      0.33     25951
     science       0.00      0.00      0.00      1614
        tech       0.00      0.00      0.00      2674

    accuracy                           0.69    110676
   macro avg       0.26      0.24      0.23    110676
weighted avg       0.61      0.69      0.62    110676



In [29]:
pprint(bayes_search_cv.cv_results_)

defaultdict(<class 'list'>,
            {'mean_fit_time': [1908.7735111117363,
                               10165.226145923138,
                               4814.506383061409,
                               3139.0467371940613,
                               9222.512724280357,
                               10486.558747112751,
                               4255.642635285854,
                               4540.6362399458885,
                               4223.315940797329,
                               4544.508498668671,
                               7915.9643377661705,
                               5124.964894413948,
                               5392.34798771143,
                               2971.5023089647293,
                               3988.6643110513687],
             'mean_score_time': [37.514176428318024,
                                 39.1325164437294,
                                 81.95469522476196,
                                 60.08116686344147,
      

In [20]:
## Save model
# dump(bayes_search_cv, '../models/bayes_d2v.pkl')

## Load model
# new_model = load('../models/bayes_d2v.pkl')

Running a bayessearchcv is very time intensive step. The results of bayessearch do not improve the classification results. So, I will not explore a more focused hyperparameter space and will move to another model to investigate performance of DL for the given dataset.