# Day 2 - Exercise 2 - Doc2Vec

## Necessary imports

Before we can handle the data, we have to import the modules we need:

In [1]:
# modules
import pandas as pd
import numpy as np
import re
import nltk
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

First of all, you need to download the data set "tweets.csv" from the GitHub repository https://github.com/assenmacher-mat/nlp_notebooks.

__If you are running this notebook on colab ( https://colab.research.google.com/ ), you also need to run the next chunk in order to upload the data to colab.  
Choose the file in the upload window and it will be available on colab from then on.__  
(If you are running this notebook locally on your machine, you can skip the execution of this chunk)

In [124]:
from google.colab import files
uploaded = files.upload()
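
If the same notebook should run both on colab and locally, the upload can be guarded so that it is skipped outside colab. The following is only a minimal sketch; the flag name `IN_COLAB` is an assumption and not part of the original notebook.

```python
# The colab helper module only exists inside a colab runtime,
# so a failed import tells us that we are running somewhere else.
try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    # Opens the upload window, exactly as in the chunk above.
    uploaded = files.upload()
else:
    print("Not running on colab, skipping the upload.")
```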

### Import the data set and perform pre-processing

__If you are running this notebook locally on your machine, you might need to adjust the path (depending on where you've saved the data).__  
(If you are running this notebook on colab, you can leave the path unchanged)

In [5]:
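# load the data and extract the raw tweet texts
tweet_data = pd.read_csv("trump.csv")
tweets_raw = tweet_data.text.tolist()
# lowercase everything
tweets = [doc.lower() for doc in tweets_raw]
# remove links (together with the rest of the line), typographic quotes and "@"
tweets = [re.sub(r"https://.*|“|”|@", "", doc) for doc in tweets]
# remove punctuation and special characters
tweets = [re.sub(r"[\)\(\.\,;:!?\+\-\_\#\'\*\§\$\%\&]", "", doc) for doc in tweets]
# split every tweet into a list of tokens
tweets = [nltk.word_tokenize(doc) for doc in tweets]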
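
To see what the two regular expressions do, the cleaning steps can be applied to a single made-up tweet. This sketch uses `str.split()` instead of `nltk.word_tokenize` so that it runs without the nltk tokenizer data; the example text and the helper name `clean_tweet` are assumptions for illustration only.

```python
import re

def clean_tweet(text):
    """Apply the same cleaning steps as above to one tweet (hypothetical helper)."""
    text = text.lower()
    # Removes "@", typographic quotes and everything from "https://" to the end of the line.
    text = re.sub(r"https://.*|“|”|@", "", text)
    # Removes punctuation and special characters.
    text = re.sub(r"[\)\(\.\,;:!?\+\-\_\#\'\*\§\$\%\&]", "", text)
    # Whitespace split as a simple stand-in for nltk.word_tokenize.
    return text.split()

example = "Thank you @FoxNews! Great ratings: https://t.co/abc123 more text"
tokens = clean_tweet(example)
print(tokens)  # ['thank', 'you', 'foxnews', 'great', 'ratings']
```

Note that the words after the link are lost as well, because `.*` matches up to the end of the line.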

### Now we can finally start with the modeling part!  
(If you want to have a look at the help page, just execute the following chunk)

In [8]:
help(Doc2Vec)

### First, we determine the number of CPUs that are available on our machine  
(The more cores are available, the faster we can train our model)

In [9]:
import multiprocessing
cpus = multiprocessing.cpu_count()
print(cpus)

8
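
The models further down are created with `workers = cpus - 1`, which evaluates to 0 on a machine with a single core. A small guard avoids that case. This is only a sketch; the variable name `workers` is an assumption.

```python
import multiprocessing

cpus = multiprocessing.cpu_count()
# Leave one core free for the rest of the system, but never go below one worker.
workers = max(1, cpus - 1)
print(cpus, workers)
```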


### Prepare the data set by transforming every tweet to a TaggedDocument

In [10]:
tagged_tweets = [TaggedDocument(words = d, tags = ["doc_" + str(i)]) for i, d in enumerate(tweets)]

### Display a "tagged" tweet

In [11]:
tagged_tweets[0]

TaggedDocument(words=['at', 'the', 'request', 'of', 'senthomtillis', 'i', 'have', 'declared', 'a', 'major', 'disaster', 'for', 'the', 'great', 'state', 'of', 'north', 'carolina', 'to', 'help', 'with', 'damages', 'from', 'hurricane', 'dorian', 'assistance', 'now', 'unlocked', 'to', 'recover', 'stronger', 'than', 'ever', 'thom', 'loves', 'nc', 'and', 'so', 'do', 'i'], tags=['doc_0'])

### Additional Task:  
Think about assigning multiple tags to each of the tweets.  
This could be interesting if we had tweets from different politicians and wanted to learn additional representations for their style of tweeting.  
Try to assign a document identifier as well as the label ``donald_trump`` to all our tweets.

In [12]:
two_tagged_tweets = [TaggedDocument(words = d, tags = ["doc_" + str(i), "donald_trump"]) for i, d in enumerate(tweets)]
two_tagged_tweets[0]

TaggedDocument(words=['at', 'the', 'request', 'of', 'senthomtillis', 'i', 'have', 'declared', 'a', 'major', 'disaster', 'for', 'the', 'great', 'state', 'of', 'north', 'carolina', 'to', 'help', 'with', 'damages', 'from', 'hurricane', 'dorian', 'assistance', 'now', 'unlocked', 'to', 'recover', 'stronger', 'than', 'ever', 'thom', 'loves', 'nc', 'and', 'so', 'do', 'i'], tags=['doc_0', 'donald_trump'])
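
With tweets from several politicians, the second tag would differ per author. The sketch below illustrates this on made-up data. To stay free of dependencies it uses a `namedtuple` called `TaggedDoc` with the same two fields (`words`, `tags`) as gensim's `TaggedDocument`; the author names and tweets are hypothetical.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (assumption for this sketch).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Hypothetical, already tokenized tweets from two authors.
tweets_by_author = {
    "donald_trump": [["make", "america", "great", "again"]],
    "other_politician": [["thank", "you", "for", "your", "support"]],
}

multi_tagged = []
for author, docs in tweets_by_author.items():
    for doc in docs:
        # Running document identifier plus the shared author label.
        doc_id = "doc_" + str(len(multi_tagged))
        multi_tagged.append(TaggedDoc(words=doc, tags=[doc_id, author]))

for doc in multi_tagged:
    print(doc.tags)
```

Every author tag then gets its own vector, which is trained on all tweets carrying that tag.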

### Set up the model parameters for the Distributed memory model  
(Now again with the corpus in which each document is assigned only one tag)

In [13]:
d2v_model = Doc2Vec(dm = 1, dm_concat = 0, vector_size = 100, alpha = 0.025, min_alpha = 0.0001, 
                    window = 5, min_count = 5, sample = 0.001, negative = 5, workers = cpus - 1)
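
The arguments `alpha` and `min_alpha` set the start and end value of the learning rate, which is lowered roughly linearly while training progresses. The sketch below only illustrates such a linear schedule at the granularity of epochs; gensim adjusts the rate in finer steps during training, and the function name `linear_alpha` is an assumption.

```python
def linear_alpha(epoch, epochs, alpha=0.025, min_alpha=0.0001):
    """Learning rate at the start of a given epoch under linear decay (illustrative)."""
    progress = epoch / epochs
    return alpha - (alpha - min_alpha) * progress

epochs = 20
# One value per epoch boundary, from the start (0) to the end of training (epochs).
schedule = [linear_alpha(e, epochs) for e in range(epochs + 1)]
for e in (0, 10, 20):
    print(e, round(schedule[e], 5))
```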

### Initialize the model with our Twitter data:

In [14]:
# note: gensim >= 4.0 names the first argument `corpus_iterable` instead of `documents`
d2v_model.build_vocab(documents = tagged_tweets, update = False, progress_per = 10000)
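
`build_vocab` counts how often every token occurs and, because of `min_count = 5`, ignores all tokens that appear fewer than five times. The effect can be sketched with a `Counter` on a tiny made-up corpus; the corpus and the lower threshold of 2 are assumptions chosen to keep the example small.

```python
from collections import Counter

# Made-up, already tokenized documents.
toy_corpus = [
    ["fake", "news", "media"],
    ["fake", "news", "again"],
    ["great", "news"],
]
min_count = 2

# Count every token over the whole corpus, then keep only the frequent ones.
counts = Counter(token for doc in toy_corpus for token in doc)
vocab = sorted(token for token, n in counts.items() if n >= min_count)
print(vocab)  # ['fake', 'news']
```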

### Train our model:  
(Hint: If you want to compare the runtime of the model for different numbers of cores or epochs, just put "%time" in front of the command,  
 as in the next chunk. You will then get a report of how long the process takes.)

In [15]:
%time d2v_model.train(documents = tagged_tweets, total_examples = d2v_model.corpus_count, epochs = 20)

Wall time: 4.19 s


### Choose a document and display it as a text

In [16]:
" ".join(tagged_tweets[10].words)

'the washington times ukraine envoy blows ‘ massive hole ’ into democrat accusations republicans at hearing find no trump pressure the ukrainian president also strongly stated that no pressure was put on him case closed'

### Display the most similar tweets to the one you chose

In [17]:
# note: gensim >= 4.0 renames `docvecs` to `dv`
ids_similar = d2v_model.docvecs.most_similar(["doc_10"], topn = 3)
ids_similar

[('doc_2688', 0.9226269721984863),
 ('doc_132', 0.9198408126831055),
 ('doc_3014', 0.9137082099914551)]

In [18]:
[" ".join(tweet.words) for tweet in tagged_tweets if tweet.tags[0] in [id[0] for id in ids_similar]]

['the whistleblower ’ s complaint is completely different and at odds from my actual conversation with the new president of ukraine the socalled whistleblower knew practically nothing in that those ridiculous charges were far more dramatic amp wrong just like liddle ’ adam schiff',
 'john solomon as russia collusion fades ukrainian plot to help clinton emerges seanhannity foxnews',
 'disgraced fbi acting director andrew mccabe pretends to be a poor little angel when in fact he was a big part of the crooked hillary scandal amp the russia hoax a puppet for leakin ’ james comey ig report on mccabe was devastating part of insurance policy in case i won']
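
The list comprehension above scans the whole corpus and returns the tweets in corpus order, not in order of similarity. A dictionary from tag to text returns the neighbours in ranked order instead. The sketch uses placeholder documents rather than the real `tagged_tweets`; the names `toy_docs` and `text_by_tag` are assumptions.

```python
# Placeholder (tag, tokens) pairs standing in for the entries of tagged_tweets.
toy_docs = [
    ("doc_132", ["placeholder", "tweet", "a"]),
    ("doc_2688", ["placeholder", "tweet", "b"]),
    ("doc_3014", ["placeholder", "tweet", "c"]),
]
# Result of most_similar as printed above (scores shortened).
ids_similar = [("doc_2688", 0.9226), ("doc_132", 0.9198), ("doc_3014", 0.9137)]

# Build the lookup once, then read the texts in order of decreasing similarity.
text_by_tag = {tag: " ".join(words) for tag, words in toy_docs}
ranked_texts = [text_by_tag[tag] for tag, _ in ids_similar]
for (tag, score), text in zip(ids_similar, ranked_texts):
    print(tag, score, text)
```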

In [20]:
d2v_model.docvecs.similarity("doc_10", "doc_2688")

0.9226269
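
The value returned by `similarity` is the cosine similarity between the two document vectors. A minimal sketch with plain Python lists shows the computation; the three short vectors are made up and much smaller than the 100-dimensional vectors of the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, similarity close to 1
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a, similarity 0
print(cosine_similarity(vec_a, vec_b), cosine_similarity(vec_a, vec_c))
```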

### Train a Distributed Bag-of-words model

In [26]:
dbow_model = Doc2Vec(dm = 0, dm_concat = 0, vector_size = 100, alpha = 0.025, min_alpha = 0.0001, 
                    window = 5, min_count = 5, sample = 0.001, negative = 5, workers = cpus - 1)
dbow_model.build_vocab(documents = tagged_tweets, update = False, progress_per = 10000)
%time dbow_model.train(documents = tagged_tweets, total_examples = dbow_model.corpus_count, epochs = 20)

Wall time: 2.43 s


### Compare how well the two models were able to learn meaningful word embeddings

In [31]:
d2v_model.wv.most_similar(positive = ["democrats"])

[('dems', 0.936676561832428),
 ('radical', 0.8013758063316345),
 ('left', 0.7550336122512817),
 ('trying', 0.7318845987319946),
 ('republicans', 0.7197769284248352),
 ('facts', 0.7146976590156555),
 ('they', 0.7040835618972778),
 ('end', 0.648250937461853),
 ('congress', 0.6472545862197876),
 ('asking', 0.647139310836792)]

In [32]:
dbow_model.wv.most_similar(positive = ["democrats"])

[('2016', 0.30525755882263184),
 ('fishing', 0.2913818955421448),
 ('product', 0.28849905729293823),
 ('times', 0.2806693911552429),
 ('judiciary', 0.2791927456855774),
 ('proved', 0.2754194438457489),
 ('debt', 0.2723675072193146),
 ('announcement', 0.26838499307632446),
 ('question', 0.2646797299385071),
 ('or', 0.26391512155532837)]

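#### Now train a second Distributed Bag-of-words model and set the ``dbow_words``-option to 1

(With ``dm = 0`` and the default ``dbow_words = 0``, only the document vectors are trained and the word vectors stay at their random initialization, which explains the low and seemingly arbitrary similarities above. Setting ``dbow_words = 1`` additionally trains word vectors in skip-gram fashion.)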

In [34]:
dbow2_model = Doc2Vec(dm = 0, dm_concat = 0, vector_size = 100, alpha = 0.025, min_alpha = 0.0001, dbow_words = 1,
                    window = 5, min_count = 5, sample = 0.001, negative = 5, workers = cpus - 1)
dbow2_model.build_vocab(documents = tagged_tweets, update = False, progress_per = 10000)
%time dbow2_model.train(documents = tagged_tweets, total_examples = dbow2_model.corpus_count, epochs = 20)

Wall time: 4.29 s


In [35]:
dbow2_model.wv.most_similar(positive = ["democrats"])

[('dems', 0.6945326328277588),
 ('democrat', 0.5769639015197754),
 ('committees', 0.5477724075317383),
 ('partner', 0.4884350895881653),
 ('loopholes', 0.48344436287879944),
 ('congresswomen', 0.47495728731155396),
 ('investigations', 0.47154688835144043),
 ('13', 0.47105997800827026),
 ('cities', 0.46607354283332825),
 ('radical', 0.46355095505714417)]
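
One simple way to compare the neighbour lists of two models is to look at their overlap. The sketch below copies the ten nearest neighbours of "democrats" from the outputs of `d2v_model` and `dbow2_model` above and computes the shared words and the Jaccard index; the variable names are assumptions.

```python
# Nearest neighbours of "democrats", copied from the outputs above.
dm_neighbours = ["dems", "radical", "left", "trying", "republicans",
                 "facts", "they", "end", "congress", "asking"]
dbow2_neighbours = ["dems", "democrat", "committees", "partner", "loopholes",
                    "congresswomen", "investigations", "13", "cities", "radical"]

shared = set(dm_neighbours) & set(dbow2_neighbours)
union = set(dm_neighbours) | set(dbow2_neighbours)
# Jaccard index: size of the intersection divided by size of the union.
jaccard = len(shared) / len(union)
print(sorted(shared), round(jaccard, 3))  # ['dems', 'radical'] 0.111
```

Both models agree on "dems" as the closest neighbour, but otherwise the lists differ considerably on this small corpus.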