# Training Doc2Vec

In this notebook we will train a DBOW document embedding model from scratch on the Yelp dataset.

Take your time and pay attention to the model: notice how little code it takes to define and train Doc2Vec in gensim. Note that gensim ships its own implementation built on NumPy; it is not a layer over Keras.

You can run this lab either locally or in Colab.

- To run in Colab, go to `https://colab.research.google.com`, sign in, and upload this notebook. Colab offers free GPU access.
- To run locally, first install the requirements in `requirements.txt`, then run `jupyter notebook` and open the notebook for this lab.

Follow the instructions. Good luck!


In [None]:
import random

import gensim
import numpy as np
import pandas as pd
import smart_open
from gensim.models.callbacks import CallbackAny2Vec
from sklearn.model_selection import train_test_split
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

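# Seed both random number generators so that random picks are reproducible
random.seed(42)
np.random.seed(42)
embedding_dim = 100
# Number of training documents to keep. In production you would train for days on the full dataset, in batches
vocabulary_size_to_use = 50000
epochs = 10  # In production you would also train for more epochs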
train_file_path = './train_yelp.csv'
test_file_path = './test_yelp.csv'

In [None]:
%%writefile get_data.sh
if [ ! -f yelp.csv ]; then
  wget -O yelp.csv "https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0"
fi

In [None]:
!bash get_data.sh

In [None]:
path = './yelp.csv'
yelp = pd.read_csv(path)
# Create a new DataFrame that only contains the 5-star and 1-star reviews.
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]
X = yelp_best_worst.text
y = yelp_best_worst.stars.map({1:0, 5:1})
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
X_train.to_csv(train_file_path, header=False, index=False, columns=['text'])
X_test.to_csv(test_file_path, header=False, index=False, columns=['text'])

In [None]:
# FILL in the gaps
def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = None  # FILL: tokenize and preprocess the line. Hint: look for a helper in gensim
            if tokens_only:
                yield tokens
            else:
                # For training data, attach a tag to the tokens. The yielded object should be a TaggedDocument
                yield None  # FILL

Notice that we add a unique identifier (a tag) to each training document. Doc2Vec learns one vector per tag, so every document needs its own tag before we can train DBOW.
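
To make the tagging idea concrete, here is a minimal sketch that runs without gensim installed. It is not the solution to the exercise: `TaggedDoc` is a stand-in that only mirrors the `words` and `tags` fields of gensim's `TaggedDocument`, and `toy_tokenize` and `tag_lines` are hypothetical helpers invented for this illustration.

```python
import re
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (assumption:
# used here only so the sketch runs without gensim).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def toy_tokenize(line):
    """Hypothetical tokenizer: lowercase the line and keep alphabetic runs."""
    return re.findall(r"[a-z]+", line.lower())


def tag_lines(lines):
    """Pair each line's tokens with its position, giving every document a unique tag."""
    for i, line in enumerate(lines):
        yield TaggedDoc(words=toy_tokenize(line), tags=[i])


reviews = ["Great food, great service!", "Never going back."]
tagged = list(tag_lines(reviews))
for doc in tagged:
    print(doc.tags, doc.words)
```

Using the line index as the tag keeps tags unique and lets you map a tag back to its position in the training corpus later.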

In [None]:
# Create the train and test corpora with the read_corpus function defined above.
# Limit train_corpus to vocabulary_size_to_use documents; read the test corpus with tokens_only=True.
train_corpus = None
test_corpus = None

In [None]:
print(train_corpus[:2])


In [None]:
# Create a gensim Doc2Vec model with vector size embedding_dim and the number of epochs specified above (dm=0 selects DBOW)
model = None

# Build the vocabulary with the build_vocab method on the model (initialize the weights)
None

In [None]:

# Train the model with the train method
None

In [None]:
vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
print(vector)


We quickly converted a sentence into a 100-dimensional vector!
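
Once documents are vectors, they can be compared numerically. gensim's similarity queries rank documents by cosine similarity; the sketch below shows that ranking in plain Python on toy 3-dimensional vectors. The names `cosine_similarity`, `rank_by_similarity` and `toy_vectors` are hypothetical and the numbers are made up for illustration; they are not outputs of the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return (tag, similarity) pairs sorted from most to least similar."""
    scored = [(tag, cosine_similarity(query, vec)) for tag, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy document vectors keyed by tag (made-up values, not model output)
toy_vectors = {0: [1.0, 0.0, 0.0], 1: [0.9, 0.1, 0.0], 2: [0.0, 1.0, 0.0]}
query = [1.0, 0.05, 0.0]

ranking = rank_by_similarity(query, toy_vectors)
for tag, score in ranking:
    print(tag, round(score, 3))
```

Cosine similarity ignores vector length and compares direction only, so a long and a short review can still score as close neighbours.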

In [None]:

# Pick a random document from the test corpus and infer a vector from the model
doc_id = None
inferred_vector = None

# Get the most similar documents on the train corpus
sims = None

# Compare and print the most similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print('MOST SIMILAR {}: «{}»\n'.format(sims[0], ' '.join(train_corpus[sims[0][0]].words)))