# Day 2 - Exercise 2 - Doc2Vec

## Necessary imports

In order to handle the data properly we have to import the data and the modules we need:

In [1]:
# modules
import pandas as pd
import numpy as np
import re
import nltk
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

First of all, you need to download the data set "tweets.csv" from the GitHub repository https://github.com/assenmacher-mat/nlp_notebooks.

__If you are running this notebook on colab ( https://colab.research.google.com/ ), you also need to run the next chunk in order to upload the data to colab.  
Choose it in the upload window and in it will be available on colab from now on.__  
(If you are running this notebook locally on your machine, you can skip the execution of this chunk)

In [124]:
from google.colab import files
uploaded = files.upload()

### Import the data set

__If you are running this notebook locally on your machine, you might need to adjust the path (depending on where you've saved the data).__  
(If you are running this notebook on colab, you can can leave the path unchanged)

In [2]:
tweet_data = pd.read_csv("trump.csv")

### Next, just have a look at the data set in order to see what's inside

In [3]:
tweet_data.loc[:3,:]

Unnamed: 0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str
0,Twitter for iPhone,At the request of @SenThomTillis I have declar...,10-04-2019 21:59:44,8562,36356,False,1180241114403610626
1,Twitter for iPhone,Under my Administration Medicare Advantage pre...,10-04-2019 21:57:17,15248,54729,False,1180240498478534658
2,Twitter for iPhone,WOW this is big stuff! https://t.co/H12yxMfua3,10-04-2019 19:46:59,15655,50526,False,1180207709985165313
3,Twitter for iPhone,“I think it’s outrages that a Whistleblower is...,10-04-2019 14:12:23,19441,73966,False,1180123504924151809


### Extract the tweets to a list of texts:

In [4]:
tweets_raw = [tweet for tweet in list(tweet_data.text)]

### Display one exemplary tweet:

In [5]:
print(tweets_raw[1])

Under my Administration Medicare Advantage premiums next year will be their lowest in the last 13 years. We are providing GREAT healthcare to our Seniors. We cannot let the radical socialists take that away through Medicare for All!


### Perform the basic preprocessing steps before we continue:
    - everything to lowercase
    - expand contractions
    - delete url adresses
    - delete other unwanted tokens
    - tokenizing

In [6]:
tweets = [doc.lower() for doc in tweets_raw]

In [13]:
conts = [(r"don't", "do not"), (r"isn't", "is not")]
def expand(text, contractions = conts):
    for c in contractions:
        t = re.sub(c[0], c[1], text)
    return t

In [14]:
print(tweets[33])
print(expand(tweets[33]))

leader mccarthy we look forward to you soon becoming speaker of the house. the do nothing dems don’t have a chance! https://t.co/uwpdgjg99f
leader mccarthy we look forward to you soon becoming speaker of the house. the do nothing test don’t have a chance! https://t.co/uwpdgjg99f


### Print a list of unique tokens that are at the moment present in our corpus
### Based on this, we can identify which tokens occur the we potentially want to exclude

In [7]:
print(sorted(set(nltk.word_tokenize(" ".join(tweets))))[:100])

['!', '#', '$', '%', '&', "'", "''", "'case", "'collusion", "'could", "'crisis", "'forgotten", "'god", "'right", "'s", "'spying", "'ve", '(', ')', '+', '-', '--', '.', '..', '...', '..again', '..all', '..also', '..amounts', '..are', '..between', '..breaking', '..but', '..call', '..came', '..chairman', '..comcast', '..congresswomen', '..deferral', '..despite', '..if', '..mexico', '..much', '..my', '..news', '..nice', '..not', '..now', '..on', '..other', '..saying', '..shouting', '..sorry', '..spread', '..thank', '..that', '..the', '..there', '..this', '..to', '..tv', '..united', '..was', '..we', '..who', '..why', '..willing', '..years', '.33000', '.a', '.about', '.adds', '.after', '.again', '.agricultural', '.alabama', '.alex', '.all', '.almost', '.also', '.alternative', '.amendment.', '.amounts', '.an', '.and', '.another', '.are', '.as', '.asking', '.at', '.average', '.back', '.bad', '.based', '.be', '.became', '.because', '.best', '.better', '.between']


In [8]:
tweets = [re.sub(r"https://.*|“|”|@", "", doc) for doc in tweets]

In [9]:
tweets = [re.sub(r"[\)\(\.\,;:!?\+\-\_\#\'\*\§\$\%\&]", "", doc) for doc in tweets]

In [10]:
tweets = [nltk.word_tokenize(doc) for doc in tweets]

In [11]:
print(sorted(set([word for tweet in tweets for word in tweet]))[:100])

["''", '0', '03', '09', '1', '1/1024th', '1/2', '10', '100', '1000', '1000/24th', '10000', '100000', '1000000', '1036', '104th', '105', '107', '10th', '11', '11000000', '1112', '1130', '11th', '12', '122', '125th', '12th', '13', '133000', '135', '138', '14', '145', '14th', '15', '150', '1500', '150th', '157005000', '158000000', '15th', '16', '160th', '17', '170', '17000', '170000', '18', '180', '1800', '1874', '18959495168', '18th', '19', '191', '1951', '196000', '1969', '1970s', '1972', '1976', '1977', '1980', '1984', '1990', '1994', '1997', '1998', '19th', '1st', '2', '20', '200', '2000', '20000', '2001', '2002', '2005', '2010', '2011', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2020takebackthehouse', '2021', '2024', '205', '20th', '21', '21st', '22', '223306', '23', '232', '24']


# After all, we can finally start with the modeling part!  
(If you want to have a look at the help page, just execute the following chunk)

In [12]:
help(Doc2Vec)

Help on class Doc2Vec in module gensim.models.doc2vec:

class Doc2Vec(gensim.models.base_any2vec.BaseWordEmbeddingsModel)
 |  Class for training, using and evaluating neural networks described in
 |  `Distributed Representations of Sentences and Documents <http://arxiv.org/abs/1405.4053v2>`_.
 |  
 |  Some important internal attributes are the following:
 |  
 |  Attributes
 |  ----------
 |  wv : :class:`~gensim.models.keyedvectors.Word2VecKeyedVectors`
 |      This object essentially contains the mapping between words and embeddings. After training, it can be used
 |      directly to query those embeddings in various ways. See the module level docstring for examples.
 |  
 |  docvecs : :class:`~gensim.models.keyedvectors.Doc2VecKeyedVectors`
 |      This object contains the paragraph vectors. Remember that the only difference between this model and
 |      :class:`~gensim.models.word2vec.Word2Vec` is that besides the word vectors we also include paragraph embeddings
 |      to captur

### First, we determine the number of CPUs that are available on our machine  
(The more cores are available, the faster we can train our model)

In [13]:
import multiprocessing
cpus = multiprocessing.cpu_count()
print(cpus)

8


### Prepare the data set

In [14]:
tagged_tweets = [TaggedDocument(words = d, tags = ["doc_" + str(i)]) for i, d in enumerate(tweets)]

### Display a "tagged" tweet

In [15]:
tagged_tweets[0]

TaggedDocument(words=['at', 'the', 'request', 'of', 'senthomtillis', 'i', 'have', 'declared', 'a', 'major', 'disaster', 'for', 'the', 'great', 'state', 'of', 'north', 'carolina', 'to', 'help', 'with', 'damages', 'from', 'hurricane', 'dorian', 'assistance', 'now', 'unlocked', 'to', 'recover', 'stronger', 'than', 'ever', 'thom', 'loves', 'nc', 'and', 'so', 'do', 'i'], tags=['doc_0'])

### Optional Task:  
Think about assigning multiple tags to each of the tweets.  
This could be interesting, if we had tweets from different politicians and wanted to learn additional representations for their style of tweeting.

In [16]:
two_tagged_tweets = [TaggedDocument(words = d, tags = ["doc_" + str(i), "donald_trump"]) for i, d in enumerate(tweets)]
two_tagged_tweets[0]

TaggedDocument(words=['at', 'the', 'request', 'of', 'senthomtillis', 'i', 'have', 'declared', 'a', 'major', 'disaster', 'for', 'the', 'great', 'state', 'of', 'north', 'carolina', 'to', 'help', 'with', 'damages', 'from', 'hurricane', 'dorian', 'assistance', 'now', 'unlocked', 'to', 'recover', 'stronger', 'than', 'ever', 'thom', 'loves', 'nc', 'and', 'so', 'do', 'i'], tags=['doc_0', 'donald_trump'])

### Set up the model parameters

In [17]:
d2v_model = Doc2Vec(dm = 1, dm_concat = 0, vector_size = 100, alpha = 0.025, min_alpha = 0.0001, 
                    window = 5, min_count = 5, sample = 0.001, negative = 5, workers = cpus - 1)

### Initialize the model with our twitter data:

In [18]:
d2v_model.build_vocab(documents = tagged_tweets, update = False, progress_per = 10000)

### Train our model:  
(Hint: If you want to compre the runtime of the model for different number of cores or epochs, just put "%timeit" in front of the command  
 in the next chunk. You will then get an evaluation of how long the process takes.)

In [20]:
%time d2v_model.train(documents = tagged_tweets, total_examples = d2v_model.corpus_count, epochs = 20)

Wall time: 4.16 s


### Chose a document and display it as a text

In [30]:
" ".join(tagged_tweets[10].words)

'the washington times ukraine envoy blows ‘ massive hole ’ into democrat accusations republicans at hearing find no trump pressure the ukrainian president also strongly stated that no pressure was put on him case closed'

### Display the most similar tweets to the one you chose

In [47]:
ids_similar = d2v_model.docvecs.most_similar(["doc_10"], topn = 3)
ids_similar

[('doc_3177', 0.8913683295249939),
 ('doc_1821', 0.870488166809082),
 ('doc_2688', 0.8682408332824707)]

In [37]:
[" ".join(tweet.words) for tweet in tagged_tweets if tweet.tags[0] in [id[0] for id in ids_similar]]

['john solomon factual errors and major omissions in the mueller report show that it is totally biased against trump',
 'john solomon as russia collusion fades ukrainian plot to help clinton emerges seanhannity foxnews',
 'former fbi top lawyer james baker just admitted involvement in fisa warrant and further admitted there were irregularities in the way the russia probe was handled they relied heavily on the unverified trump dossier paid for by the dnc amp clinton campaign amp funded through a']

In [50]:
d2v_model.docvecs.similarity("doc_10", "doc_1821")

0.8704882

### Now: Explore your model!

In [29]:
print(d2v_model.wv.most_similar(positive = ["germany"]))
print(d2v_model.wv.most_similar(positive = ["clinton"]))
print(d2v_model.wv.most_similar(positive = ["democrats"]))
print(d2v_model.wv.most_similar(positive = ["mexico"]))
print(d2v_model.wv.most_similar(positive = ["china"]))
print(d2v_model.wv.most_similar(positive = ["mexico", "trade"], negative = ["wall"]))

[('afghanistan', 0.9131811261177063), ('cases', 0.9091619253158569), ('hispanics', 0.8991455435752869), ('decisions', 0.8955312967300415), ('european', 0.8937901258468628), ('parts', 0.8920109272003174), ('ahead', 0.884681224822998), ('lowering', 0.8811957836151123), ('records', 0.8771611452102661), ('technology', 0.8753639459609985)]
[('hillary', 0.9773930311203003), ('crooked', 0.9718740582466125), ('dnc', 0.8628735542297363), ('deleted', 0.8587695360183716), ('comey', 0.8489434719085693), ('james', 0.8456957340240479), ('bob', 0.8333559036254883), ('ig', 0.8320258855819702), ('dirty', 0.8250784277915955), ('13', 0.818506121635437)]
[('dems', 0.9202031493186951), ('radical', 0.8052688837051392), ('left', 0.747321367263794), ('they', 0.7006311416625977), ('end', 0.6940463185310364), ('congresswomen', 0.6931781768798828), ('trying', 0.6682571172714233), ('facts', 0.6605888605117798), ('thinking', 0.6578940749168396), ('hearings', 0.6559553146362305)]
[('large', 0.7701966762542725), ('i

### Explore the possibilities the model by e.g. switching from skip-gram to cbow, using averaging instead of concatenation, chosing a larger embedding size, more negative examples, etc.

### Try using ``gensim.models.phrases`` in order to form bigrams

In [167]:
from gensim.models.phrases import Phrases, Phraser

phrases = Phrases(tweets, min_count=100, threshold=10)
bigram = Phraser(phrases)

In [172]:
sorted(list(bigram.phrasegrams.items()), reverse = True)

[((b'\xe2\x80\x99', b't'), 40.14282388966591),
 ((b'\xe2\x80\x99', b's'), 39.051824673367356),
 ((b'witch', b'hunt'), 24.661771250556296),
 ((b'will', b'be'), 15.202834497541446),
 ((b'united', b'states'), 98.71593416819549),
 ((b'thank', b'you'), 49.87788116523749),
 ((b'our', b'country'), 27.60977332033378),
 ((b'fake', b'news'), 74.46145685997172),
 ((b'don', b'\xe2\x80\x99'), 16.771136693276688)]

In [176]:
tweets = list(bigram[tweets])

In [193]:
w2v_model.wv["clinton"]

array([-2.9408352 , -2.9448547 , -0.0644585 ,  2.5617635 ,  0.11626738,
        0.9916419 , -0.982047  , -0.36746535,  2.282364  ,  4.807505  ,
       -0.57372135, -0.1776367 ,  3.6928544 , -0.07633026, -0.3486168 ,
       -1.1801438 , -2.0244286 ,  3.9215038 ,  5.7360845 ,  1.1472751 ,
       -0.6600361 ,  3.4406743 ,  0.42411965,  2.2777426 ,  4.44871   ,
       -0.51950186, -3.310043  , -0.11698956,  1.4302472 , -1.8335285 ,
       -1.2929436 , -3.6467178 ,  0.5088786 ,  0.97771716, -2.7727382 ,
       -1.151348  , -2.8728178 ,  0.28081998,  0.15078373,  0.28773436,
        2.4083438 ,  1.5199484 ,  0.94393754, -3.9986844 , -0.9181879 ,
        2.4555998 , -0.99208266,  0.16184539, -2.7211847 , -0.95676094,
       -1.9936352 ,  0.18663087,  0.7953555 , -1.7466047 ,  1.345637  ,
        0.13392718, -0.39346752, -1.2971301 ,  1.736669  ,  1.0345834 ,
        3.0065439 , -0.919131  , -0.42990413,  1.9712603 , -1.983691  ,
       -0.59394854, -1.0466216 ,  2.0349941 ,  2.022733  , -0.22