# Day 2 - Exercise 1 - Word2Vec

## Necessary imports

To work with the data, we first import the modules we need:

In [5]:
# modules
import pandas as pd
import numpy as np
import re
import nltk
from gensim.models import Word2Vec

First of all, you need to download the data set "trump.csv" from the GitHub repository https://github.com/assenmacher-mat/nlp_notebooks.

__If you are running this notebook on Colab (https://colab.research.google.com/), you also need to run the next chunk in order to upload the data to Colab.  
Choose the file in the upload window and it will then be available on Colab.__  
(If you are running this notebook locally on your machine, you can skip the execution of this chunk)

In [124]:
from google.colab import files
uploaded = files.upload()

### Import the data set

__If you are running this notebook locally on your machine, you might need to adjust the path (depending on where you've saved the data).__  
(If you are running this notebook on Colab, you can leave the path unchanged)

In [6]:
tweet_data = pd.read_csv("trump.csv")

### Next, just have a look at the data set in order to see what's inside

In [7]:
tweet_data.loc[:3,:]

| Unnamed: 0 | source | text | created_at | retweet_count | favorite_count | is_retweet | id_str |
|---|---|---|---|---|---|---|---|
| 0 | Twitter for iPhone | At the request of @SenThomTillis I have declar... | 10-04-2019 21:59:44 | 8562 | 36356 | False | 1180241114403610626 |
| 1 | Twitter for iPhone | Under my Administration Medicare Advantage pre... | 10-04-2019 21:57:17 | 15248 | 54729 | False | 1180240498478534658 |
| 2 | Twitter for iPhone | WOW this is big stuff! https://t.co/H12yxMfua3 | 10-04-2019 19:46:59 | 15655 | 50526 | False | 1180207709985165313 |
| 3 | Twitter for iPhone | “I think it’s outrages that a Whistleblower is... | 10-04-2019 14:12:23 | 19441 | 73966 | False | 1180123504924151809 |


### Extract the tweets to a list of texts:

In [8]:
tweets_raw = tweet_data["text"].tolist()

### Display one example tweet:

In [9]:
print(tweets_raw[1])

Under my Administration Medicare Advantage premiums next year will be their lowest in the last 13 years. We are providing GREAT healthcare to our Seniors. We cannot let the radical socialists take that away through Medicare for All!


### Perform the (most) basic preprocessing steps before we continue:
    - everything to lowercase
    - delete URLs
    - delete other unwanted tokens
    - tokenizing
Although thorough preprocessing is _essential_ for a good model, we won't spend much time on it here,  
as the main focus of this tutorial is on the modeling part.

In [10]:
tweets = [doc.lower() for doc in tweets_raw]                                          # everything to lowercase

### Print a list of the unique tokens that are currently present in our corpus
### Based on this, we can identify tokens that we might want to exclude

In [11]:
print(sorted(set(nltk.word_tokenize(" ".join(tweets))))[:100])

['!', '#', '$', '%', '&', "'", "''", "'case", "'collusion", "'could", "'crisis", "'forgotten", "'god", "'right", "'s", "'spying", "'ve", '(', ')', '+', '-', '--', '.', '..', '...', '..again', '..all', '..also', '..amounts', '..are', '..between', '..breaking', '..but', '..call', '..came', '..chairman', '..comcast', '..congresswomen', '..deferral', '..despite', '..if', '..mexico', '..much', '..my', '..news', '..nice', '..not', '..now', '..on', '..other', '..saying', '..shouting', '..sorry', '..spread', '..thank', '..that', '..the', '..there', '..this', '..to', '..tv', '..united', '..was', '..we', '..who', '..why', '..willing', '..years', '.33000', '.a', '.about', '.adds', '.after', '.again', '.agricultural', '.alabama', '.alex', '.all', '.almost', '.also', '.alternative', '.amendment.', '.amounts', '.an', '.and', '.another', '.are', '.as', '.asking', '.at', '.average', '.back', '.bad', '.based', '.be', '.became', '.because', '.best', '.better', '.between']


In [12]:
tweets = [re.sub(r"https://.*|“|”|@", "", doc) for doc in tweets]                      # URLs (plus the rest of that line), curly quotes, @
tweets = [re.sub(r"[\)\(\.\,;:!?\+\-\_\#\'\*\§\$\%\&]", "", doc) for doc in tweets]    # other unwanted tokens
tweets = [nltk.word_tokenize(doc) for doc in tweets]                                   # tokenization
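
The cleaning steps can also be collected in one function. The following is only a sketch under a few assumptions: it uses a narrower URL pattern (`https?://\S+`) that stops at the first whitespace instead of deleting the rest of the line, it also drops the curly apostrophe `’`, and it splits on whitespace instead of calling `nltk.word_tokenize`, so that it runs without extra packages. The function name `clean_tweet` is hypothetical, and the sentence after the URL in the example is invented for illustration. Using it instead of the cell above would change the counts and similarities shown in this notebook.

```python
import re

# Assumed patterns, not the ones used in the cell above
URL_PATTERN = re.compile(r"https?://\S+")            # a URL ends at the first whitespace
QUOTE_PATTERN = re.compile(r"[“”‘’\"@]")             # curly/straight quotes and @
PUNCT_PATTERN = re.compile(r"[)(.,;:!?+\-_#'*§$%&]")  # same characters as in the cell above


def clean_tweet(text):
    """Lowercase a tweet, strip URLs and punctuation, and split on whitespace."""
    text = text.lower()
    text = URL_PATTERN.sub("", text)
    text = QUOTE_PATTERN.sub("", text)
    text = PUNCT_PATTERN.sub("", text)
    return text.split()


# First sentence and URL taken from the data preview; the last sentence is made up
example = "WOW this is big stuff! https://t.co/H12yxMfua3 More to come."
print(clean_tweet(example))
```

With the narrower pattern, the words after the URL survive the cleaning step.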

# Now we can finally start with the modeling part!  
(If you want to have a look at the help page, just execute the following chunk)

In [71]:
help(Word2Vec)

### First, we determine the number of CPUs that are available on our machine  
(The more cores are available, the faster we can train our model)

In [17]:
import multiprocessing
cpus = multiprocessing.cpu_count()
print(cpus)

8


### Set up the model parameters

In [18]:
w2v_model = Word2Vec(sg = 0, cbow_mean = 1, size = 100, alpha = 0.025, min_alpha = 0.0001, 
                     window = 5, min_count = 5, sample = 0.001, negative = 5, workers = cpus - 1)
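
To get a feeling for what `sg = 0` (CBOW) and `window` mean, the sketch below lists the (context, target) pairs that a symmetric window produces for one tokenized sentence. It is a simplified illustration in plain Python, not gensim's implementation: the real training loop additionally applies subsampling (`sample`) and can shrink the window at random. The function name `cbow_examples` and the toy sentence are hypothetical.

```python
def cbow_examples(tokens, window=5):
    """Return (context_words, target_word) pairs for a symmetric window."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after the target
        context = left + right
        if context:                               # skip one-word sentences
            examples.append((context, target))
    return examples


# Toy sentence, invented for illustration
sentence = ["we", "will", "build", "the", "wall"]
for context, target in cbow_examples(sentence, window=2):
    print(context, "->", target)
```

In CBOW the context vectors are combined (averaged if `cbow_mean = 1`) to predict the target; skip-gram (`sg = 1`) reverses the direction and predicts each context word from the target.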

### Initialize the model with our Twitter data:

In [19]:
w2v_model.build_vocab(sentences = tweets, update = False, progress_per = 10000)
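
The effect of `min_count` during vocabulary building can be sketched in a few lines: tokens that occur fewer than `min_count` times in the whole corpus are dropped. This is only an illustration with the standard library; `build_vocabulary` is a hypothetical helper and the toy corpus is invented.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=5):
    """Count all tokens and keep those that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}


# Toy corpus, invented for illustration
toy_corpus = [["great", "wall"], ["great", "trade", "deal"], ["great", "deal"]]
vocab = build_vocabulary(toy_corpus, min_count=2)
print(vocab)
```

Rare tokens have too few training examples to get useful vectors, which is why they are removed before training.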

### Train our model:  
(Hint: If you want to compare the runtime of the model for different numbers of cores or epochs, just put "%time" in front of the command  
 in the next chunk. You will then see how long the process takes. "%timeit" also works, but it repeats the training call several times.)

In [20]:
w2v_model.train(sentences = tweets, total_examples = w2v_model.corpus_count, epochs = 100)

(7089354, 10534100)
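
A single timed run can also be measured in plain Python, which avoids executing the training call more than once. The helper `timed` below is a hypothetical name, and the summation is only a stand-in workload so that the sketch runs on its own.

```python
import time


def timed(func, *args, **kwargs):
    """Call `func` once and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook one could pass w2v_model.train and its arguments instead
result, seconds = timed(sum, range(1_000_000))
print(f"result={result}, took {seconds:.4f}s")
```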

### Now: Explore your model!

In [23]:
print(w2v_model.wv.most_similar(positive = ["germany"]))
print(w2v_model.wv.most_similar(positive = ["clinton"]))
print(w2v_model.wv.most_similar(positive = ["democrats"]))
print(w2v_model.wv.most_similar(positive = ["mexico"]))
print(w2v_model.wv.most_similar(positive = ["china"]))
print(w2v_model.wv.most_similar(positive = ["mexico", "trade"], negative = ["wall"]))

[('biarritz', 0.5301661491394043), ('g7', 0.46447229385375977), ('paying', 0.43259501457214355), ('200', 0.4279639720916748), ('france', 0.4116533696651459), ('countries', 0.40288442373275757), ('interest', 0.39133787155151367), ('warren', 0.37755119800567627), ('suggested', 0.372894823551178), ('eu', 0.37280911207199097)]
[('hillary', 0.6644464731216431), ('crooked', 0.6265661716461182), ('dnc', 0.5168795585632324), ('foundation', 0.4984697103500366), ('deleted', 0.4910370111465454), ('antitrump', 0.4390960931777954), ('33000', 0.43572330474853516), ('russians', 0.4113352298736572), ('campaign', 0.3964681625366211), ('acid', 0.3951607644557953)]
[('dems', 0.7258269786834717), ('democrat', 0.4641879200935364), ('they', 0.4560657739639282), ('we', 0.3688978850841522), ('losers', 0.3611629009246826), ('people', 0.325318306684494), ('others', 0.3235127925872803), ('fix', 0.3084690570831299), ('migrants', 0.3068027198314667), ('dnc', 0.2996412515640259)]
[('illegals', 0.4281608760356903), 
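
The scores in these lists are cosine similarities between word vectors. The sketch below ranks words for a single query word using three-dimensional toy vectors that are invented for illustration; it ignores the `negative` argument and is not gensim's implementation. The names `cosine`, `rank_neighbours` and `toy_vectors` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbours(vectors, query, topn=3):
    """Rank all other words by their cosine similarity to the query word."""
    scores = [(word, cosine(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for illustration
toy_vectors = {
    "germany": [0.9, 0.1, 0.0],
    "france": [0.8, 0.2, 0.1],
    "wall": [0.0, 0.1, 0.9],
}
print(rank_neighbours(toy_vectors, "germany"))
```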

### Explore the possibilities of the model, e.g. by switching from CBOW to skip-gram, summing instead of averaging the context vectors, choosing a larger embedding size, using more negative examples, etc.

### Try using ``gensim.models.phrases`` in order to form bigrams

In [34]:
from gensim.models.phrases import Phrases, Phraser

phrases = Phrases(tweets, min_count=20, threshold=10)
bigram = Phraser(phrases)
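
To see how `min_count` and `threshold` interact, the sketch below reproduces the default scoring formula as the gensim documentation describes it: the bigram count minus `min_count`, divided by both unigram counts and multiplied by the vocabulary size. A pair is kept as a phrase if its score exceeds `threshold`. The function name `phrase_score` is hypothetical, and all counts are invented for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram (a, b) from corpus counts."""
    return (count_ab - min_count) / count_a / count_b * vocab_size


# Invented counts: the two words appear together in most of their occurrences
score = phrase_score(count_a=60, count_b=50, count_ab=40, vocab_size=3000, min_count=20)
threshold = 10
print(score, score > threshold)
```

Pairs of frequent words that rarely occur together get a low score, while pairs that almost always occur together (such as "witch hunt" in the output below) get a high one.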

In [35]:
sorted(list(bigram.phrasegrams.items()), reverse = True)

[((b'\xe2\x80\x99', b've'), 37.407930594794614),
 ((b'\xe2\x80\x99', b't'), 50.150730122546605),
 ((b'\xe2\x80\x99', b's'), 49.4195657370932),
 ((b'\xe2\x80\x99', b're'), 40.11865020311306),
 ((b'\xe2\x80\x99', b'm'), 30.050559826469744),
 ((b'\xe2\x80\x99', b'll'), 24.282340912410536),
 ((b'york', b'times'), 395.1158645276292),
 ((b'years', b'ago'), 42.35973092799266),
 ((b'would', b'be'), 18.73117701107837),
 ((b'working', b'hard'), 67.47640791476408),
 ((b'won', b'\xe2\x80\x99'), 16.404070305272896),
 ((b'witch', b'hunt'), 468.57365376056964),
 ((b'will', b'soon'), 18.06519967400163),
 ((b'will', b'never'), 10.333906593178899),
 ((b'will', b'be'), 22.72071968863337),
 ((b'white', b'house'), 405.0390956994731),
 ((b'we', b'need'), 13.451875227576162),
 ((b'we', b'are'), 12.779281466197354),
 ((b'wasn', b'\xe2\x80\x99'), 17.087573234659263),
 ((b'washington', b'post'), 498.48575712143924),
 ((b'very', b'well'), 16.03628892232897),
 ((b'very', b'important'), 24.038235111854743),
 ((b'v

In [36]:
bigram_tweets = list(bigram[tweets])

### Retrain your model based on the new corpus containing bigrams

In [64]:
bi_model = Word2Vec(sg = 0, cbow_mean = 1, size = 100, alpha = 0.025, min_alpha = 0.0001, 
                     window = 5, min_count = 5, sample = 0.001, negative = 5, workers = cpus - 1)
bi_model.build_vocab(sentences = bigram_tweets, update = False, progress_per = 10000)
bi_model.train(sentences = bigram_tweets, total_examples = bi_model.corpus_count, epochs = 100)

(6668007, 9777800)

### Select one of the bigrams and compute the cosine similarity with the sum of the corresponding vectors from the unigram model  
(e.g. $similarity[v_{united\_states}, (v_{united} + v_{states})]$)  

Should we expect a high or a low similarity?

In [65]:
import math

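# Sum of the two unigram vectors (unigram model) and the bigram vector (bigram model).
# The names avoid overwriting the `bigram` Phraser defined above.
unigram_sum = w2v_model.wv["united"] + w2v_model.wv["states"]
bigram_vec = bi_model.wv["united_states"]

# Cosine similarity: dot product divided by the product of the vector norms
sum(bigram_vec * unigram_sum) / (math.sqrt(sum(bigram_vec**2)) * math.sqrt(sum(unigram_sum**2)))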

0.18019650461807343
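
For reference, the quantity computed above is the cosine similarity, written here with $u = v_{united\_states}$ taken from the bigram model and $w = v_{united} + v_{states}$ taken from the unigram model:

$$
\cos(u, w) = \frac{\sum_i u_i w_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i w_i^2}}
$$

One plausible reading of the low value: the two models were trained separately, each from its own random initialization, so their embedding spaces are not aligned with each other. Under that assumption, vectors from different models are not directly comparable, and a similarity close to zero is not surprising.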

### Explore the embeddings for the bigrams

In [67]:
print(bi_model.wv.most_similar(positive = ["united_states"]))
print(bi_model.wv.most_similar(positive = ["mueller_report"]))
print(bi_model.wv.most_similar(positive = ["north_carolina"]))

[('us', 0.535220742225647), ('importance', 0.37955600023269653), ('increase', 0.37705570459365845), ('usa', 0.3671112656593323), ('venezuelan', 0.35408592224121094), ('venezuela', 0.35022133588790894), ('chinese', 0.32250237464904785), ('republican_party', 0.32117512822151184), ('mexico', 0.31740838289260864), ('billions_of', 0.31637147068977356)]
[('unhappy', 0.5052096247673035), ('no_collusion', 0.4682365953922272), ('mueller', 0.4426061809062958), ('no_obstruction', 0.4348973333835602), ('fact_that', 0.43326354026794434), ('obstruction', 0.43227651715278625), ('dossier', 0.42278459668159485), ('russian', 0.4212345480918884), ('robert_mueller', 0.409782350063324), ('ig', 0.396261602640152)]
[('dan_bishop', 0.6345086097717285), ('nc03', 0.6164795160293579), ('murphy', 0.6152015924453735), ('greg', 0.5877571105957031), ('pennsylvania', 0.54262775182724), ('wisconsin', 0.5342702865600586), ('rally', 0.49785950779914856), ('tuesday', 0.49734747409820557), ('georgia', 0.49626803398132324)

### Optional task: Run the ``Phraser`` again, but this time on the corpus which already contains the bigrams.
### This allows the model to build meaningful trigrams, e.g. ``new_york_times``

In [69]:
phrases_tri = Phrases(bigram_tweets, min_count=20, threshold=10)
trigram = Phraser(phrases_tri)

sorted(list(trigram.phrasegrams.items()), reverse = True)

In [70]:
trigram_tweets = list(trigram[bigram_tweets])    # apply to the bigram corpus the trigram model was built on
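
Why the second pass needs the bigram corpus as input can be shown with a small stand-in for the phraser. The helper `merge_pairs` below is hypothetical and simply merges known pairs greedily from left to right; it is a sketch of the idea, not gensim's implementation, and the token list is invented.

```python
def merge_pairs(tokens, pairs, delimiter="_"):
    """Greedily merge adjacent tokens whose pair is in `pairs`, left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2                                   # both tokens are consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy example, invented for illustration
tokens = ["the", "new", "york", "times", "reported"]
first_pass = merge_pairs(tokens, {("new", "york")})
second_pass = merge_pairs(first_pass, {("new_york", "times")})
print(first_pass)
print(second_pass)

# Applying the second-pass pairs to the raw tokens finds nothing to merge
print(merge_pairs(tokens, {("new_york", "times")}))
```

The pair `("new_york", "times")` only exists after the first pass has run, so the second model has to be applied to the output of the first one.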