# Day 2 - Exercise 3 - FastText

## Necessary imports

In order to handle the data properly we have to import the data and the modules we need:

In [1]:
# modules
import pandas as pd
import numpy as np
import re
import nltk
from gensim.models.fasttext import FastText

First of all, you need to download the data set "tweets.csv" from the GitHub repository https://github.com/assenmacher-mat/nlp_notebooks.

__If you are running this notebook on colab ( https://colab.research.google.com/ ), you also need to run the next chunk in order to upload the data to colab.  
Choose it in the upload window and in it will be available on colab from now on.__  
(If you are running this notebook locally on your machine, you can skip the execution of this chunk)

In [124]:
from google.colab import files
uploaded = files.upload()

### Import the data set

__If you are running this notebook locally on your machine, you might need to adjust the path (depending on where you've saved the data).__  
(If you are running this notebook on colab, you can can leave the path unchanged)

In [2]:
tweet_data = pd.read_csv("trump.csv")

### Next, just have a look at the data set in order to see what's inside

In [3]:
tweet_data.loc[:3,:]

Unnamed: 0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str
0,Twitter for iPhone,At the request of @SenThomTillis I have declar...,10-04-2019 21:59:44,8562,36356,False,1180241114403610626
1,Twitter for iPhone,Under my Administration Medicare Advantage pre...,10-04-2019 21:57:17,15248,54729,False,1180240498478534658
2,Twitter for iPhone,WOW this is big stuff! https://t.co/H12yxMfua3,10-04-2019 19:46:59,15655,50526,False,1180207709985165313
3,Twitter for iPhone,“I think it’s outrages that a Whistleblower is...,10-04-2019 14:12:23,19441,73966,False,1180123504924151809


### Extract the tweets to a list of texts:

In [4]:
tweets_raw = [tweet for tweet in list(tweet_data.text)]

### Display one exemplary tweet:

In [5]:
print(tweets_raw[1])

Under my Administration Medicare Advantage premiums next year will be their lowest in the last 13 years. We are providing GREAT healthcare to our Seniors. We cannot let the radical socialists take that away through Medicare for All!


### Perform the basic preprocessing steps before we continue:
    - everything to lowercase
    - expand contractions
    - delete url adresses
    - delete other unwanted tokens
    - tokenizing

In [6]:
tweets = [doc.lower() for doc in tweets_raw]

In [7]:
conts = [(r"don't", "do not"), (r"isn't", "is not")]
def expand(text, contractions = conts):
    for c in contractions:
        t = re.sub(c[0], c[1], text)
    return t

In [8]:
print(tweets[33])
print(expand(tweets[33]))

leader mccarthy we look forward to you soon becoming speaker of the house. the do nothing dems don’t have a chance! https://t.co/uwpdgjg99f
leader mccarthy we look forward to you soon becoming speaker of the house. the do nothing dems don’t have a chance! https://t.co/uwpdgjg99f


### Print a list of unique tokens that are at the moment present in our corpus
### Based on this, we can identify which tokens occur the we potentially want to exclude

In [9]:
print(sorted(set(nltk.word_tokenize(" ".join(tweets))))[:100])

['!', '#', '$', '%', '&', "'", "''", "'case", "'collusion", "'could", "'crisis", "'forgotten", "'god", "'right", "'s", "'spying", "'ve", '(', ')', '+', '-', '--', '.', '..', '...', '..again', '..all', '..also', '..amounts', '..are', '..between', '..breaking', '..but', '..call', '..came', '..chairman', '..comcast', '..congresswomen', '..deferral', '..despite', '..if', '..mexico', '..much', '..my', '..news', '..nice', '..not', '..now', '..on', '..other', '..saying', '..shouting', '..sorry', '..spread', '..thank', '..that', '..the', '..there', '..this', '..to', '..tv', '..united', '..was', '..we', '..who', '..why', '..willing', '..years', '.33000', '.a', '.about', '.adds', '.after', '.again', '.agricultural', '.alabama', '.alex', '.all', '.almost', '.also', '.alternative', '.amendment.', '.amounts', '.an', '.and', '.another', '.are', '.as', '.asking', '.at', '.average', '.back', '.bad', '.based', '.be', '.became', '.because', '.best', '.better', '.between']


In [10]:
tweets = [re.sub(r"https://.*|“|”|@", "", doc) for doc in tweets]

In [11]:
tweets = [re.sub(r"[\)\(\.\,;:!?\+\-\_\#\'\*\§\$\%\&]", "", doc) for doc in tweets]

In [12]:
tweets = [nltk.word_tokenize(doc) for doc in tweets]

In [13]:
print(sorted(set([word for tweet in tweets for word in tweet]))[:100])

["''", '0', '03', '09', '1', '1/1024th', '1/2', '10', '100', '1000', '1000/24th', '10000', '100000', '1000000', '1036', '104th', '105', '107', '10th', '11', '11000000', '1112', '1130', '11th', '12', '122', '125th', '12th', '13', '133000', '135', '138', '14', '145', '14th', '15', '150', '1500', '150th', '157005000', '158000000', '15th', '16', '160th', '17', '170', '17000', '170000', '18', '180', '1800', '1874', '18959495168', '18th', '19', '191', '1951', '196000', '1969', '1970s', '1972', '1976', '1977', '1980', '1984', '1990', '1994', '1997', '1998', '19th', '1st', '2', '20', '200', '2000', '20000', '2001', '2002', '2005', '2010', '2011', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2020takebackthehouse', '2021', '2024', '205', '20th', '21', '21st', '22', '223306', '23', '232', '24']


# After all, we can finally start with the modeling part!  
(If you want to have a look at the help page, just execute the following chunk)

In [14]:
help(FastText)

Help on class FastText in module gensim.models.fasttext:

class FastText(gensim.models.base_any2vec.BaseWordEmbeddingsModel)
 |  Train, use and evaluate word representations learned using the method
 |  described in `Enriching Word Vectors with Subword Information <https://arxiv.org/abs/1607.04606>`_, aka FastText.
 |  
 |  The model can be stored/loaded via its :meth:`~gensim.models.fasttext.FastText.save` and
 |  :meth:`~gensim.models.fasttext.FastText.load` methods, or loaded from a format compatible with the original
 |  Fasttext implementation via :meth:`~gensim.models.fasttext.FastText.load_fasttext_format`.
 |  
 |  Attributes
 |  ----------
 |  wv : :class:`~gensim.models.keyedvectors.FastTextKeyedVectors`
 |      This object essentially contains the mapping between words and embeddings. These are similar to the embeddings
 |      computed in the :class:`~gensim.models.word2vec.Word2Vec`, however here we also include vectors for n-grams.
 |      This allows the model to compute

### First, we determine the number of CPUs that are available on our machine  
(The more cores are available, the faster we can train our model)

In [15]:
import multiprocessing
cpus = multiprocessing.cpu_count()
print(cpus)

8


### Set up the model parameters

In [16]:
ft_model = FastText(sg = 0, cbow_mean = 1, size = 100, alpha = 0.025, min_alpha = 0.0001, min_n = 3, max_n = 5,
                    window = 5, min_count = 5, sample = 0.001, negative = 5, workers = cpus - 1)

### Initialize the model with our twitter data:

In [18]:
ft_model.build_vocab(sentences = tweets, update = False, progress_per = 10000)

### Train our model:  
(Hint: If you want to compre the runtime of the model for different number of cores or epochs, just put "%timeit" in front of the command  
 in the next chunk. You will then get an evaluation of how long the process takes.)

In [19]:
ft_model.train(sentences = tweets, total_examples = ft_model.corpus_count, epochs = 100)

In [20]:
"example" in ft_model.wv.vocab

False

In [21]:
ft_model.wv["example"]

array([ 0.12539083, -1.2392306 ,  0.70108515,  0.46797296,  0.8685001 ,
       -0.95858866,  0.33056358,  1.0688312 ,  0.1591232 , -0.87566835,
       -0.8907124 , -0.9613014 ,  0.01457906,  0.18087572,  0.13759342,
        0.04009672,  0.59700274, -0.21330199, -0.82340777, -0.2022132 ,
       -0.9784647 ,  0.3887765 , -0.04538687,  0.21417032, -0.52254695,
        0.1549123 , -0.5647041 ,  0.801213  , -0.3639199 , -0.97105604,
        0.0258203 ,  1.505556  , -0.07820963, -0.543697  , -0.04845   ,
       -0.42218333, -0.9409691 ,  0.16737086, -0.8970547 , -0.6361535 ,
       -0.86656135,  0.22954378,  0.02725276, -1.3792439 ,  1.2192922 ,
       -0.28787604,  0.9299774 ,  0.14909214, -0.7312445 ,  1.2596277 ,
       -1.0304984 , -0.26936567, -1.1585648 , -0.65198857,  0.0481079 ,
       -0.08059145, -0.38537207, -0.6508886 ,  0.27607346, -0.14374235,
       -1.2735661 , -1.2851834 ,  0.20870413, -0.6537805 ,  0.24656989,
       -0.38214043, -1.047298  , -0.1520072 , -1.1581744 ,  0.24

In [22]:
ft_model.wv.most_similar(positive = ["example"])

[('simple', 0.6218103170394897),
 ('completely', 0.41479939222335815),
 ('complete', 0.37091994285583496),
 ('exactly', 0.37088626623153687),
 ('exact', 0.3245594799518585),
 ('completed', 0.3221316933631897),
 ('apple', 0.3216664791107178),
 ('nice', 0.3191584348678589),
 ('texas', 0.31394606828689575),
 ('simply', 0.2968178391456604)]

### Now: Explore your model!

In [23]:
print(ft_model.wv.most_similar(positive = ["germany"]))
print(ft_model.wv.most_similar(positive = ["clinton"]))
print(ft_model.wv.most_similar(positive = ["democrats"]))
print(ft_model.wv.most_similar(positive = ["mexico"]))
print(ft_model.wv.most_similar(positive = ["china"]))
print(ft_model.wv.most_similar(positive = ["mexico", "trade"], negative = ["wall"]))

[('many', 0.7087990045547485), ('countries', 0.4370884299278259), ('interest', 0.4294402599334717), ('currency', 0.4294150769710541), ('any', 0.42037394642829895), ('fentanyl', 0.41558194160461426), ('interesting', 0.40786013007164), ('company', 0.4027581810951233), ('european', 0.39604395627975464), ('6', 0.3857548236846924)]
[('crooked', 0.6381134986877441), ('hilton', 0.5869311690330505), ('hillary', 0.5161709189414978), ('dnc', 0.4843655526638031), ('33000', 0.46502190828323364), ('acid', 0.4564186930656433), ('classified', 0.4558030962944031), ('campaign', 0.4372970461845398), ('deleted', 0.4247213900089264), ('climate', 0.4107172191143036)]
[('democracy', 0.8606078624725342), ('democrat', 0.8518213033676147), ('democratic', 0.8404862284660339), ('dems', 0.7639465928077698), ('they', 0.4987036883831024), ('demean', 0.4773533046245575), ('demand', 0.4628133773803711), ('debt', 0.44863998889923096), ('committees', 0.3985235095024109), ('means', 0.3982166647911072)]
[('rico', 0.44321

### Explore the possibilities the model by e.g. switching from skip-gram to cbow, using averaging instead of concatenation, chosing a larger embedding size, more negative examples, etc.

### Try using ``gensim.models.phrases`` in order to form bigrams

In [167]:
from gensim.models.phrases import Phrases, Phraser

phrases = Phrases(tweets, min_count=100, threshold=10)
bigram = Phraser(phrases)

In [172]:
sorted(list(bigram.phrasegrams.items()), reverse = True)

[((b'\xe2\x80\x99', b't'), 40.14282388966591),
 ((b'\xe2\x80\x99', b's'), 39.051824673367356),
 ((b'witch', b'hunt'), 24.661771250556296),
 ((b'will', b'be'), 15.202834497541446),
 ((b'united', b'states'), 98.71593416819549),
 ((b'thank', b'you'), 49.87788116523749),
 ((b'our', b'country'), 27.60977332033378),
 ((b'fake', b'news'), 74.46145685997172),
 ((b'don', b'\xe2\x80\x99'), 16.771136693276688)]

In [176]:
tweets = list(bigram[tweets])

In [193]:
w2v_model.wv["clinton"]

array([-2.9408352 , -2.9448547 , -0.0644585 ,  2.5617635 ,  0.11626738,
        0.9916419 , -0.982047  , -0.36746535,  2.282364  ,  4.807505  ,
       -0.57372135, -0.1776367 ,  3.6928544 , -0.07633026, -0.3486168 ,
       -1.1801438 , -2.0244286 ,  3.9215038 ,  5.7360845 ,  1.1472751 ,
       -0.6600361 ,  3.4406743 ,  0.42411965,  2.2777426 ,  4.44871   ,
       -0.51950186, -3.310043  , -0.11698956,  1.4302472 , -1.8335285 ,
       -1.2929436 , -3.6467178 ,  0.5088786 ,  0.97771716, -2.7727382 ,
       -1.151348  , -2.8728178 ,  0.28081998,  0.15078373,  0.28773436,
        2.4083438 ,  1.5199484 ,  0.94393754, -3.9986844 , -0.9181879 ,
        2.4555998 , -0.99208266,  0.16184539, -2.7211847 , -0.95676094,
       -1.9936352 ,  0.18663087,  0.7953555 , -1.7466047 ,  1.345637  ,
        0.13392718, -0.39346752, -1.2971301 ,  1.736669  ,  1.0345834 ,
        3.0065439 , -0.919131  , -0.42990413,  1.9712603 , -1.983691  ,
       -0.59394854, -1.0466216 ,  2.0349941 ,  2.022733  , -0.22