# Lab 3

## Training Doc2Vec

In this notebook we will train from scratch a DBOW document embedding model based on the Yelp dataset.

Take it easy and pay attention to the model, how easy it is to define it, and how easy it is to define Doc2Vec in gensim (which adds a layer over Keras).

You can run this lab both locally or in Colab.

- To run in Colab just go to `https://colab.research.google.com`, sign-in and you upload this notebook. Colab has GPU access for free.
- To run locally just run `jupyter notebook` and access the notebook in this lab. You would need to first install the requirements in `requirements.txt`

Follow the instructions. Good luck!


In [None]:
import random

import gensim
import numpy as np
import pandas as pd
import smart_open
from gensim.models.callbacks import CallbackAny2Vec
from sklearn.model_selection import train_test_split
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

np.random.seed(42)
embedding_dim = 100
vocabulary_size_to_use = 50000  # Of course in production you would train this for days, with all your dataset in batches
epochs = 10  # And with more epochs
train_file_path = './train_yelp.csv'
test_file_path = './test_yelp.csv'

In [3]:
%%writefile get_data.sh
if [ ! -f yelp.csv ]; then
  wget -O yelp.csv https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0
fi

Writing get_data.sh


In [4]:
!bash get_data.sh

--2022-10-20 18:09:40--  https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.85.18, 2620:100:6035:18::a27d:5512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.85.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/xds4lua69b7okw8/yelp.csv [following]
--2022-10-20 18:09:41--  https://www.dropbox.com/s/raw/xds4lua69b7okw8/yelp.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucfaca51b4d5808e489cc5f7ca41.dl.dropboxusercontent.com/cd/0/inline/BvMAjzyhlOdRsamqC7x_TylcjV9L0Z-KnwYCa7Dn8a7XdbQ-6vAww1i5P7pLDpuB37hPqAKVBeQLQUnsPpyIFtwI1iqz2Ivq_N6Cd_9MJ5ZFBPlMrs_cEmv2y5f-BFLJp5ecM6ap5yMMX-KCzd6e3u41v1y-XKI_jN4iaVBP8T0VNA/file# [following]
--2022-10-20 18:09:41--  https://ucfaca51b4d5808e489cc5f7ca41.dl.dropboxusercontent.com/cd/0/inline/BvMAjzyhlOdRsamqC7x_TylcjV9L0Z-KnwYCa7Dn8a7XdbQ-6vAww1i5P7pLDpuB37hPqAKVBeQLQUnsPpyI

In [5]:
path = './yelp.csv'
yelp = pd.read_csv(path)
# Create a new DataFrame that only contains the 5-star and 1-star reviews.
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]
X = yelp_best_worst.text
y = yelp_best_worst.stars.map({1:0, 5:1})
y = yelp_best_worst.stars.map({1:0, 5:1})
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
X_train.to_csv(train_file_path, header=False, index=False, columns=['text'])
X_test.to_csv(test_file_path, header=False, index=False, columns=['text'])

In [17]:
# FILL in the gaps
def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            print(line)
            tokens = list(gensim.utils.simple_preprocess(line))  # tokenize and preprocess line. Try to search in gensim
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags and yield the result. The end yielded result should be a TaggedDocument
                
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

Notice that we add a unique identifier for each document, preparing it for DBOW.

In [18]:
# Create the train and test corpora by using the read_corpus we have done. Filter the train_corpus to size vocabulary_size_to_use
train_corpus = list(read_corpus(train_file_path))[:vocabulary_size_to_use]
test_corpus = list(read_corpus(test_file_path, tokens_only = True))

[1;30;43mStreaming output truncated to the last 5000 lines.[0m


My constructive feedback would be that I ordered my burger medium well and I received it medium rare. It still tasted yummy and I ate it. Had it been served that way to my hubby - he would not have been able to stomach the red juice flowing from the burger."

"In Phoenix for a concert and felt like Thai. Found this place on Yelp! and decided to check it out based on the reviews and the fact that it was near the friend we were visiting at the time. The restaurant is in a strip shopping center that we never would have noticed (or found) had we not been looking for it. Oh, but what a magical dining experience! The cashew chicken was the best I'd ever had (and we eat Thai a LOT!) and the chicken satay was exceptional. Loved the veggie Pad Thai and my husband got a vegetarian (tofu) curry dish he loved. Wish this place wasn't 400 miles from my house. I'd be eating here every week!"

"Can I tell you how much I despised Fate w

In [11]:
print(train_corpus[:2])


[TaggedDocument(words=['if', 'could', 'give', 'it', 'more', 'than', 'would', 'sweet', 'pea', 'and', 'live', 'down', 'the', 'street', 'literally', 'down', 'the', 'street', 'from', 'this', 'bar', 'we', 'waited', 'for', 'it', 'to', 'open', 'for', 'what', 'seemed', 'like', 'decades', 'praying', 'that', 'this', 'was', 'going', 'to', 'be', 'the', 'type', 'of', 'place', 'that', 'could', 'become', 'our', 'local', 'it', 'has', 'exceeded', 'our', 'expectations', 'the', 'atmosphere', 'is', 'amazing', 'the', 'drinks', 'are', 'amazing', 'every', 'last', 'one', 'of', 'them', 'but', 'the', 'margaritas', 'are', 'the', 'best', 've', 'ever', 'had', 'they', 'tasted', 'like', 'fresh', 'squeeze', 'of', 'sunshine', 'that', 'makes', 'me', 'happy', 'inside', 'margarita', 'mondays', 'margs', 'and', 'free', 'food', 'happy', 'hours', 'are', 'amazing', 'new', 'year', 'eve', 'last', 'year', 'was', 'amazing', 'the', 'year', 'anniversary', 'party', 'was', 'amazing', 'but', 'most', 'of', 'all', 'the', 'owner', 'and',

In [13]:
# Generate a Doc2Vec model in Gensim of embedding size embedding_dim and the number of epochs we specified above
model = gensim.models.doc2vec.Doc2Vec(vector_size=embedding_dim, min_count=2, epochs=3, workers=5)

# Build the vocabulary with the build_vocab method on the model (initialize the weights)
model.build_vocab(train_corpus)

In [14]:

# Train the model with the train method
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

In [15]:
vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
print(vector)


[ 9.09216178e-05  3.16491089e-04 -8.63005314e-03  1.42279850e-05
  6.94743963e-03  4.36570728e-03  1.38467003e-03  1.54486988e-02
 -3.69562046e-03 -3.63791687e-03  8.87537096e-03 -2.40826324e-04
 -1.34757720e-02 -6.89669861e-04  3.37656215e-03  1.06333597e-02
 -1.44060142e-02 -1.66702438e-02  4.21322044e-03 -1.24942819e-02
 -1.31013878e-02 -3.55944154e-03  8.31544341e-04  7.85583723e-03
 -7.17031304e-03  5.57244383e-03  1.72701888e-02  1.73577368e-02
 -5.72809204e-03 -8.12169164e-03  7.23810494e-03 -1.41612045e-03
  7.33559439e-03  1.40575925e-02  1.00075835e-02  1.38595188e-02
  2.01686591e-04  6.53354917e-03  8.35621916e-03  2.89464399e-04
  9.67292488e-03  2.38656183e-04 -1.59995649e-02 -1.80802215e-02
  5.43972105e-03  2.64760153e-03  2.89576041e-04  1.27990730e-02
  9.61644109e-03 -1.33465389e-02  2.38366961e-03 -3.85149382e-03
 -6.42450433e-03  9.14748758e-03 -4.98663634e-03 -2.19437061e-03
 -2.88071600e-03 -5.23647387e-03 -1.39376940e-02 -1.48095684e-02
 -1.28885387e-02 -1.46616

We successfully and quickly converted sentences into 100 dimensional vectors!

In [16]:

# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])

# Get the most similar documents on the train corpus
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

# Compare and print the most similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'MOST SIMILAR %s: «%s»\n' % (sims[0], ' '.join(train_corpus[sims[0][0]].words)))

Test Document (767): «food mediocre at best got the traditional eggs benedict with home fries the only saving grace with the amount of ham on it the english muffin didn seem to be toasted the eggs were ok but nothing great the home fries my year old niece could do better job my burps right now aren even good enough to brag about»

MOST SIMILAR (13201, 0.975865364074707): «have now had the pleasure of experiencing all of their different types of bruschetta my top two favorites are the prosciutto mascarpone figs and tomato jam which has great combination of salty and sweet and the apples brie and fig jam one which is on the sweeter side but not overwhelming because the bread and the tart granny smith apple slices help balance out the flavor the chopped salad is very light and refreshing option to accompany the brushcetta appetizer th and wine is also restaurant so other food options are available have sampled the chicken and pesto pasta which has delicious cream sauce and their mac chees