# Canada's Response to Economic Impacts of COVID-19
  
Amanda Cheney  
Metis Final Project  
Part 2 of 4  
December 8, 2020    

**Objective**  
Use unsupervised learning and natural language processing to identify two sets of clusters of Twitter conversations about the Canada Emergency Response Benefit (CERB) and Canada Recovery Benefit (CRB), the programs created to address unemployment and the economic impacts of the COVID-19 pandemic. 

1. One that captures the contours of everyday user conversations. 

2. Another that highlights especially dense clusters of conversation, in which users make collaborative efforts to shape public opinion and perception.  

**Data Sources**  
250,000+ tweets from March 1 - December 1, 2020, collected using snscrape.  

**This Notebook**  
Clean and preprocess tweets using spaCy. Create document embeddings using FastText.

## Imports

In [1]:
import pandas as pd
import pickle
import numpy as np
import re

#FastText
import fasttext as ft

#VADER 
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Avoid warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

## Import data

In [3]:
with open('cerb_df.pickle', 'rb') as read_file:
    df = pickle.load(read_file)

In [4]:
df1 = df.copy()

In [5]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 240808 entries, 0 to 255326
Data columns (total 24 columns):
 #   Column           Non-Null Count   Dtype              
---  ------           --------------   -----              
 0   url              240808 non-null  object             
 1   date             240808 non-null  datetime64[ns, UTC]
 2   content          240808 non-null  object             
 3   renderedContent  240808 non-null  object             
 4   id               240808 non-null  int64              
 5   user             240808 non-null  object             
 6   outlinks         240808 non-null  object             
 7   tcooutlinks      240808 non-null  object             
 8   replyCount       240808 non-null  int64              
 9   retweetCount     240808 non-null  int64              
 10  likeCount        240808 non-null  int64              
 11  quoteCount       240808 non-null  int64              
 12  conversationId   240808 non-null  int64              
 13 

## Implement preprocessing to return clean data 

Note that my preprocessing is fairly limited. The pre-trained embeddings I later use to create my word vectors include #hashtag content, punctuation and stop words, so it was not necessary to do much more than remove numbers, extraneous symbols and @usernames.

In [4]:
from preprocessing_functions import remove_usernames, make_alphabetic, extract_hashtags, make_vader_score

In [7]:
df1['no_user'] = [remove_usernames(post) for post in df1['content']]

In [95]:
df1[['clean', 'clean_split']] = [make_alphabetic(post) for post in df1['no_user']]

In [96]:
df1.head()

Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,quotedTweet,mentionedUsers,user_name,simple_date,location,no_user,clean,hashtags,spacy_doc,clean_split
0,https://twitter.com/SandraLynnColl3/status/133...,2020-12-01 00:12:39+00:00,"@mini_bubbly In our Entire extended family, on...","@mini_bubbly In our Entire extended family, on...",1333564633320480769,"{'username': 'SandraLynnColl3', 'displayname':...",[],[],0,0,...,,"[{'username': 'mini_bubbly', 'displayname': '🇨...",SandraLynnColl3,2020-12-01,Canada,"In our Entire extended family, only one Niece ...",In our Entire extended family only one Niece l...,[],"(In, our, Entire, extended, family, only, one,...","[In, our, Entire, extended, family, only, one,..."
2,https://twitter.com/bunmzi/status/133356432966...,2020-12-01 00:11:27+00:00,@MrStache9 Many Canadians dont realize this. L...,@MrStache9 Many Canadians dont realize this. L...,1333564329665441793,"{'username': 'bunmzi', 'displayname': 'Mikel A...",[],[],0,1,...,,"[{'username': 'MrStache9', 'displayname': 'Mr ...",bunmzi,2020-12-01,,Many Canadians dont realize this. Look at the ...,Many Canadians dont realize this Look at the s...,[],"(Many, Canadians, do, nt, realize, this, Look,...","[Many, Canadians, dont, realize, this, Look, a..."
3,https://twitter.com/D313131Daniel/status/13335...,2020-12-01 00:08:32+00:00,@nationalpost @TheGrowthOp So that won’t cost ...,@nationalpost @TheGrowthOp So that won’t cost ...,1333563595607699464,"{'username': 'D313131Daniel', 'displayname': '...",[],[],0,0,...,,"[{'username': 'nationalpost', 'displayname': '...",D313131Daniel,2020-12-01,"Kingston, Ontario",So that won’t cost me taxes for cerb payments ...,So that wont cost me taxes for cerb payments f...,[],"(So, that, wo, nt, cost, me, taxes, for, cerb,...","[So, that, wont, cost, me, taxes, for, cerb, p..."
4,https://twitter.com/Paulbyjove1/status/1333563...,2020-12-01 00:07:28+00:00,@exposforever @MJosling53 @erinotoole @PierreP...,@exposforever @MJosling53 @erinotoole @PierreP...,1333563328040464385,"{'username': 'Paulbyjove1', 'displayname': 'Pa...",[],[],0,0,...,,"[{'username': 'exposforever', 'displayname': '...",Paulbyjove1,2020-12-01,,False. The NUMBER 1 ROLE OF GOVT is to PROTECT...,False The NUMBER ROLE OF GOVT is to PROTECT TH...,[],"(False, The, NUMBER, ROLE, OF, GOVT, is, to, P...","[False, The, NUMBER, ROLE, OF, GOVT, is, to, P..."
5,https://twitter.com/GavinBamber/status/1333563...,2020-12-01 00:07:00+00:00,@journo_dale Something like over $400 million ...,@journo_dale Something like over $400 million ...,1333563211216424961,"{'username': 'GavinBamber', 'displayname': 'Ga...",[],[],0,0,...,,"[{'username': 'journo_dale', 'displayname': 'D...",GavinBamber,2020-12-01,North Vancouver,Something like over $400 million was mistakenl...,Something like over million was mistakenly sen...,[],"(Something, like, over, million, was, mistaken...","[Something, like, over, million, was, mistaken..."


In [89]:
df1.iloc[22222].content

'@88_leanna @Sheliaelaine2 @thecooperpiercy @KCTenants This says Canada’s real unemployment dropped to 9% in September\nhttps://t.co/EuT9CzOWGV\n\nAnd this says the US real unemployment is at 12.8% in September \n\nhttps://t.co/WXkxKrBVX4'

In [97]:
df1.iloc[22222].clean

'This says Canadas real unemployment dropped to in September And this says the US real unemployment is at in September '
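The `preprocessing_functions` module is not shown in this notebook. The following is only a minimal sketch of what helpers like `remove_usernames` and `make_alphabetic` might do, inferred from the before/after example above. The `_sketch` names, the regular expressions and the sample tweet are all assumptions, not the actual implementation.

```python
import re

def remove_usernames_sketch(text):
    """Hypothetical stand-in for remove_usernames: drop @mentions."""
    return re.sub(r'@\w+', '', text).strip()

def make_alphabetic_sketch(text):
    """Hypothetical stand-in for make_alphabetic.

    Returns the letters-only string and its whitespace-split tokens.
    """
    no_urls = re.sub(r'https?://\S+', ' ', text)         # drop links
    no_apostrophes = re.sub(r"['’]", '', no_urls)        # "won’t" -> "wont"
    letters_only = re.sub(r'[^A-Za-z\s]', ' ', no_apostrophes)
    clean = re.sub(r'\s+', ' ', letters_only).strip()    # collapse whitespace
    return clean, clean.split()

# Made-up tweet, used only to exercise the sketch
tweet = "@CP24 So that won’t cost me $400 for #cerb in 2020 https://t.co/abc"
clean, tokens = make_alphabetic_sketch(remove_usernames_sketch(tweet))
print(clean)
```

Note that a letters-only filter like this one would also strip accented characters, which may or may not match what the real helper does.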

The embeddings library that I will use to turn words into document embeddings can handle hashtag content, so I have not removed hashtags from the substance of the tweet texts, but I do want to record them for future EDA.

In [14]:
df1['hashtags'] = [extract_hashtags(post) for post in df1['no_user']]
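The implementation of `extract_hashtags` is not shown here. A minimal sketch, assuming it simply collects the text that follows each `#` (the `_sketch` name, the regular expression and the sample text are hypothetical), might look like this:

```python
import re

def extract_hashtags_sketch(text):
    """Hypothetical stand-in for extract_hashtags: return tags without '#'."""
    return re.findall(r'#(\w+)', text)

# Made-up example text
tags = extract_hashtags_sketch("Honestly about #cerb and the #cra today")
print(tags)
```

Returning the tags without the leading `#` matches the form of the `hashtags` column shown in the output below.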

In [15]:
df1.iloc[55555:55560]

Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,media,retweetedTweet,quotedTweet,mentionedUsers,user_name,simple_date,location,no_user,clean,hashtags
59429,https://twitter.com/monkik0u/status/1301533824...,2020-09-03 14:53:39+00:00,Honestly about #cerb i hadnt rly gotten anythi...,Honestly about #cerb i hadnt rly gotten anythi...,1301533824547827713,"{'username': 'monkik0u', 'displayname': 'kiki'...",[],[],1,0,...,,,,,monkik0u,2020-09-03,🇨🇦,Honestly about #cerb i hadnt rly gotten anythi...,Honestly about cerb i hadnt rly gotten anythin...,"[cerb, cra]"
59430,https://twitter.com/quwiyu_rasaq/status/130153...,2020-09-03 14:52:52+00:00,Click on this and no how to make money https:/...,Click on this and no how to make money fiverr....,1301533628422389760,"{'username': 'quwiyu_rasaq', 'displayname': 'R...",[https://www.fiverr.com/s2/19bd53591f?utm_sour...,[https://t.co/P9uhOExxAt],0,0,...,,,,,quwiyu_rasaq,2020-09-03,,Click on this and no how to make money https:/...,Click on this and no how to make money BBNaiji...,"[BBNaijia2020, BBNaija, bbnajia2020, BBNaijia,..."
59431,https://twitter.com/TothJC/status/130153348261...,2020-09-03 14:52:18+00:00,@Travisdhanraj Yup no CERB yet!!!,@Travisdhanraj Yup no CERB yet!!!,1301533482619998208,"{'username': 'TothJC', 'displayname': 'Jenn', ...",[],[],0,0,...,,,,"[{'username': 'Travisdhanraj', 'displayname': ...",TothJC,2020-09-03,"Mississauga, Ontario",Yup no CERB yet!!!,Yup no CERB yet,[]
59432,https://twitter.com/ali_alitajdin/status/13015...,2020-09-03 14:52:04+00:00,***Capabilities***\nHave you identified your s...,***Capabilities***\nHave you identified your s...,1301533425116090374,"{'username': 'ali_alitajdin', 'displayname': '...",[http://Www.aptm.ca],[https://t.co/iETd4r81EW],0,0,...,[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,,ali_alitajdin,2020-09-03,"Montréal, Québec",***Capabilities*** Have you identified your st...,Capabilities Have you identified your strength...,"[CERB, thursdaymorning, EnoughIsEnough, busine..."
59433,https://twitter.com/Peter93443314/status/13015...,2020-09-03 14:52:01+00:00,@My_opinions74 @shaggysmith @cmaconthehill @Ca...,@My_opinions74 @shaggysmith @cmaconthehill @Ca...,1301533414378614786,"{'username': 'Peter93443314', 'displayname': '...",[],[],0,0,...,,,,"[{'username': 'My_opinions74', 'displayname': ...",Peter93443314,2020-09-03,,I'm back at work in a restaurant and only doin...,Im back at work in a restaurant and only doing...,[]


## Sentiment Analysis

In [91]:
sid = SentimentIntensityAnalyzer()

In [98]:
df1['list_vader_scores'] = df1['clean'].apply(lambda content: sid.polarity_scores(content))

In [100]:
df1['compound']  = df1['list_vader_scores'].apply(lambda score_dict: score_dict['compound'])

In [128]:
scores = make_vader_score(df1,'compound')

In [129]:
df1['vader_score'] = scores 
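`make_vader_score` is also imported from the unseen helper module. As a rough sketch, it could map each compound score to a label using the cut-offs suggested in the VADER documentation (at or above 0.05 is positive, at or below -0.05 is negative). The function name, the `threshold` parameter and the cut-off values are assumptions; the notebook does not say which thresholds were actually used.

```python
def make_vader_score_sketch(df, column, threshold=0.05):
    """Hypothetical stand-in for make_vader_score.

    Works on anything indexable by column name, such as a pandas
    DataFrame or a plain dict of lists.
    """
    labels = []
    for score in df[column]:
        if score >= threshold:
            labels.append('Positive')
        elif score <= -threshold:
            labels.append('Negative')
        else:
            labels.append('Neutral')
    return labels

# A dict of lists stands in for the DataFrame in this toy example
demo = {'compound': [0.8658, 0.0, -0.3182]}
labels = make_vader_score_sketch(demo, 'compound')
print(labels)
```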

In [131]:
df1.iloc[50:55]

Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,simple_date,location,no_user,clean,hashtags,spacy_doc,clean_split,list_vader_scores,compound,vader_score
52,https://twitter.com/CafeLakeSide_/status/13335...,2020-11-30 23:12:14+00:00,@CP24 @JustinTrudeau I should not have said th...,@CP24 @JustinTrudeau I should not have said th...,1333549429111070728,"{'username': 'CafeLakeSide_', 'displayname': '...",[],[],0,0,...,2020-11-30,"Toronto, Ontario",I should not have said that. I am happy for he...,I should not have said that I am happy for her...,[],"(I, should, not, have, said, that, I, am, happ...","[I, should, not, have, said, that, I, am, happ...","{'neg': 0.0, 'neu': 0.835, 'pos': 0.165, 'comp...",0.8658,Positive
53,https://twitter.com/MichelletypoQ/status/13335...,2020-11-30 23:11:33+00:00,@VanasseSimon @samifouad Did you even read the...,@VanasseSimon @samifouad Did you even read the...,1333549255965851648,"{'username': 'MichelletypoQ', 'displayname': '...",[],[],1,0,...,2020-11-30,Canada,Did you even read the article you posted? Ther...,Did you even read the article you posted There...,[],"(Did, you, even, read, the, article, you, post...","[Did, you, even, read, the, article, you, post...","{'neg': 0.081, 'neu': 0.739, 'pos': 0.18, 'com...",0.6705,Positive
54,https://twitter.com/JacquieNlgirl/status/13335...,2020-11-30 23:11:12+00:00,@mini_bubbly I did collect as my industry wash...,@mini_bubbly I did collect as my industry wash...,1333549166992125953,"{'username': 'JacquieNlgirl', 'displayname': '...",[],[],1,1,...,2020-11-30,,I did collect as my industry wash shut down fo...,I did collect as my industry wash shut down fo...,[],"(I, did, collect, as, my, industry, wash, shut...","[I, did, collect, as, my, industry, wash, shut...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,Neutral
55,https://twitter.com/SandyHasCandy/status/13335...,2020-11-30 23:10:54+00:00,@NEWSTALK1010 Could someone please read the Co...,@NEWSTALK1010 Could someone please read the Co...,1333549093134725120,"{'username': 'SandyHasCandy', 'displayname': '...",[],[],0,0,...,2020-11-30,,Could someone please read the Constitution to ...,Could someone please read the Constitution to ...,[],"(Could, someone, please, read, the, Constituti...","[Could, someone, please, read, the, Constituti...","{'neg': 0.085, 'neu': 0.772, 'pos': 0.143, 'co...",0.296,Positive
56,https://twitter.com/AnneElk80/status/133354898...,2020-11-30 23:10:28+00:00,@CTVNews Congratulations to Minister Freeland ...,@CTVNews Congratulations to Minister Freeland ...,1333548984921698304,"{'username': 'AnneElk80', 'displayname': 'Anne...",[],[],0,0,...,2020-11-30,"Dartford, UK",Congratulations to Minister Freeland for a fir...,Congratulations to Minister Freeland for a fir...,[],"(Congratulations, to, Minister, Freeland, for,...","[Congratulations, to, Minister, Freeland, for,...","{'neg': 0.0, 'neu': 0.862, 'pos': 0.138, 'comp...",0.7506,Positive


In [132]:
with open('df1_pre_spacy.pkl', 'wb') as write_file:
    pickle.dump(df1, write_file)

In [133]:
df1.shape

(240808, 32)

## Tokenize with spaCy

In [220]:
%%time
spacy_docs = list(nlp.pipe(df1['clean']))

CPU times: user 10min 58s, sys: 3min 12s, total: 14min 10s
Wall time: 15min 28s


In [252]:
spacy_docs[0:5]

[In our Entire extended family only one Niece lost her job she was in the Travel Industry She enrolled in classes online and the CERB allowed her to do this,
 Many Canadians dont realize this Look at the squealing when its time to pay taxes on those CERB money next spring,
 So that wont cost me taxes for cerb payments for years,
 False The NUMBER ROLE OF GOVT is to PROTECT THE CITIZENS And per the rest Canada is doing well per AAA credit rating unemployment amp job recovery Know what would help tho If the idiot Con Premiers would DO THEIR JOBS amp not rely on cheesecake slogans amp prayer,
 Something like over million was mistakenly sent out for CERB so yes they overspent on the CERB But you knew that]

In [224]:
len(spacy_docs)

240808

In [237]:
df1['spacy_doc'] = spacy_docs

In [261]:
df1.head()

Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,simple_date,location,no_user,clean,hashtags,clean_split,list_vader_scores,compound,vader_score,spacy_doc
0,https://twitter.com/SandraLynnColl3/status/133...,2020-12-01 00:12:39+00:00,"@mini_bubbly In our Entire extended family, on...","@mini_bubbly In our Entire extended family, on...",1333564633320480769,"{'username': 'SandraLynnColl3', 'displayname':...",[],[],0,0,...,2020-12-01,Canada,"In our Entire extended family, only one Niece ...",In our Entire extended family only one Niece l...,[],"[In, our, Entire, extended, family, only, one,...","{'neg': 0.073, 'neu': 0.927, 'pos': 0.0, 'comp...",-0.3182,Negative,"(In, our, Entire, extended, family, only, one,..."
2,https://twitter.com/bunmzi/status/133356432966...,2020-12-01 00:11:27+00:00,@MrStache9 Many Canadians dont realize this. L...,@MrStache9 Many Canadians dont realize this. L...,1333564329665441793,"{'username': 'bunmzi', 'displayname': 'Mikel A...",[],[],0,1,...,2020-12-01,,Many Canadians dont realize this. Look at the ...,Many Canadians dont realize this Look at the s...,[],"[Many, Canadians, dont, realize, this, Look, a...","{'neg': 0.065, 'neu': 0.935, 'pos': 0.0, 'comp...",-0.1027,Negative,"(Many, Canadians, do, nt, realize, this, Look,..."
3,https://twitter.com/D313131Daniel/status/13335...,2020-12-01 00:08:32+00:00,@nationalpost @TheGrowthOp So that won’t cost ...,@nationalpost @TheGrowthOp So that won’t cost ...,1333563595607699464,"{'username': 'D313131Daniel', 'displayname': '...",[],[],0,0,...,2020-12-01,"Kingston, Ontario",So that won’t cost me taxes for cerb payments ...,So that wont cost me taxes for cerb payments f...,[],"[So, that, wont, cost, me, taxes, for, cerb, p...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,Neutral,"(So, that, wo, nt, cost, me, taxes, for, cerb,..."
4,https://twitter.com/Paulbyjove1/status/1333563...,2020-12-01 00:07:28+00:00,@exposforever @MJosling53 @erinotoole @PierreP...,@exposforever @MJosling53 @erinotoole @PierreP...,1333563328040464385,"{'username': 'Paulbyjove1', 'displayname': 'Pa...",[],[],0,0,...,2020-12-01,,False. The NUMBER 1 ROLE OF GOVT is to PROTECT...,False The NUMBER ROLE OF GOVT is to PROTECT TH...,[],"[False, The, NUMBER, ROLE, OF, GOVT, is, to, P...","{'neg': 0.102, 'neu': 0.689, 'pos': 0.209, 'co...",0.6774,Positive,"(False, The, NUMBER, ROLE, OF, GOVT, is, to, P..."
5,https://twitter.com/GavinBamber/status/1333563...,2020-12-01 00:07:00+00:00,@journo_dale Something like over $400 million ...,@journo_dale Something like over $400 million ...,1333563211216424961,"{'username': 'GavinBamber', 'displayname': 'Ga...",[],[],0,0,...,2020-12-01,North Vancouver,Something like over $400 million was mistakenl...,Something like over million was mistakenly sen...,[],"[Something, like, over, million, was, mistaken...","{'neg': 0.068, 'neu': 0.763, 'pos': 0.169, 'co...",0.339,Positive,"(Something, like, over, million, was, mistaken..."


spaCy returns a `Doc` object, not a list, for each item in `df1['clean']`, but the tokens need to be in list form in order to create the word embedding vectors. To do that, iterate through `df1['spacy_doc']` to make a list of the tokens for each tweet. 

In [262]:
# Convert each spaCy Doc into a plain list of token strings
row_list = [[token.text for token in doc] for doc in df1['spacy_doc']]

Add this to the dataframe

In [263]:
df1['for_vecs'] = row_list

In [264]:
df1.head()

Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,location,no_user,clean,hashtags,clean_split,list_vader_scores,compound,vader_score,spacy_doc,for_vecs
0,https://twitter.com/SandraLynnColl3/status/133...,2020-12-01 00:12:39+00:00,"@mini_bubbly In our Entire extended family, on...","@mini_bubbly In our Entire extended family, on...",1333564633320480769,"{'username': 'SandraLynnColl3', 'displayname':...",[],[],0,0,...,Canada,"In our Entire extended family, only one Niece ...",In our Entire extended family only one Niece l...,[],"[In, our, Entire, extended, family, only, one,...","{'neg': 0.073, 'neu': 0.927, 'pos': 0.0, 'comp...",-0.3182,Negative,"(In, our, Entire, extended, family, only, one,...","[In, our, Entire, extended, family, only, one,..."
2,https://twitter.com/bunmzi/status/133356432966...,2020-12-01 00:11:27+00:00,@MrStache9 Many Canadians dont realize this. L...,@MrStache9 Many Canadians dont realize this. L...,1333564329665441793,"{'username': 'bunmzi', 'displayname': 'Mikel A...",[],[],0,1,...,,Many Canadians dont realize this. Look at the ...,Many Canadians dont realize this Look at the s...,[],"[Many, Canadians, dont, realize, this, Look, a...","{'neg': 0.065, 'neu': 0.935, 'pos': 0.0, 'comp...",-0.1027,Negative,"(Many, Canadians, do, nt, realize, this, Look,...","[Many, Canadians, do, nt, realize, this, Look,..."
3,https://twitter.com/D313131Daniel/status/13335...,2020-12-01 00:08:32+00:00,@nationalpost @TheGrowthOp So that won’t cost ...,@nationalpost @TheGrowthOp So that won’t cost ...,1333563595607699464,"{'username': 'D313131Daniel', 'displayname': '...",[],[],0,0,...,"Kingston, Ontario",So that won’t cost me taxes for cerb payments ...,So that wont cost me taxes for cerb payments f...,[],"[So, that, wont, cost, me, taxes, for, cerb, p...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,Neutral,"(So, that, wo, nt, cost, me, taxes, for, cerb,...","[So, that, wo, nt, cost, me, taxes, for, cerb,..."
4,https://twitter.com/Paulbyjove1/status/1333563...,2020-12-01 00:07:28+00:00,@exposforever @MJosling53 @erinotoole @PierreP...,@exposforever @MJosling53 @erinotoole @PierreP...,1333563328040464385,"{'username': 'Paulbyjove1', 'displayname': 'Pa...",[],[],0,0,...,,False. The NUMBER 1 ROLE OF GOVT is to PROTECT...,False The NUMBER ROLE OF GOVT is to PROTECT TH...,[],"[False, The, NUMBER, ROLE, OF, GOVT, is, to, P...","{'neg': 0.102, 'neu': 0.689, 'pos': 0.209, 'co...",0.6774,Positive,"(False, The, NUMBER, ROLE, OF, GOVT, is, to, P...","[False, The, NUMBER, ROLE, OF, GOVT, is, to, P..."
5,https://twitter.com/GavinBamber/status/1333563...,2020-12-01 00:07:00+00:00,@journo_dale Something like over $400 million ...,@journo_dale Something like over $400 million ...,1333563211216424961,"{'username': 'GavinBamber', 'displayname': 'Ga...",[],[],0,0,...,North Vancouver,Something like over $400 million was mistakenl...,Something like over million was mistakenly sen...,[],"[Something, like, over, million, was, mistaken...","{'neg': 0.068, 'neu': 0.763, 'pos': 0.169, 'co...",0.339,Positive,"(Something, like, over, million, was, mistaken...","[Something, like, over, million, was, mistaken..."


In [265]:
with open('df1_w_vecs.pkl', 'wb') as write_file:
    pickle.dump(df1, write_file)

-----

## Transfer learning to generate embeddings

To generate word embeddings, which I then combined into document embeddings, I used transfer learning from [Frederic Godin's Twitter Embeddings library](https://github.com/FredericGodin/TwitterEmbeddings). Godin offers two sets of embeddings, one trained with Word2Vec and the other with FastText. I chose the FastText option because over 25,000 of my tokens are missing from the embeddings vocabulary, far too many to replace by hand or simply drop. FastText, which can build vectors for words that are outside its vocabulary or that it has not seen before, was therefore a much better option. 
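FastText can handle unseen words because it represents a word through its character n-grams, so an out-of-vocabulary token still shares pieces with known words. The sketch below only illustrates that idea; the `char_ngrams` helper is hypothetical, and the 3 to 6 character range is the fastText library default, not necessarily the setting of the pre-trained Twitter model.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Illustrative character n-grams with fastText-style '<' and '>' boundaries."""
    wrapped = f'<{word}>'
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

# An unseen compound token shares many n-grams with a known word
shared = set(char_ngrams('forgetunemployment')) & set(char_ngrams('unemployment'))
print(sorted(shared)[:5])
```

The overlap in n-grams is what lets a token such as `forgetunemployment` land close to `unemployment` in the embedding space.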

------

In [2]:
with open('df1_w_vecs.pkl', 'rb') as read_file:
    df1 = pickle.load(read_file)

In [3]:
df1.shape

(240808, 33)

In [5]:
# df1.head()

-------

FastText

In [4]:
model_path = '/Users/AmandaCheney/Downloads/fasttext_twitter_raw.bin'

In [5]:
%%time
ft_model = ft.load_model(model_path)

CPU times: user 26.6 s, sys: 38.3 s, total: 1min 4s
Wall time: 2min 38s




In [8]:
len(ft_model.words)

3818774

Taking a look at a random slice of the words in the model shows how broad the FastText vocabulary is. It contains terms with hashtags and at signs as well as words with apostrophes and punctuation.  

In [9]:
ft_model.words[1110:1120]

['gentleman',
 'granny',
 'Steam',
 "everybody's",
 'Goodnight!',
 'rank',
 '@AP:',
 'Olive',
 'MAY',
 '#design']

Here's what the model has for Canada's nearest neighbors.

In [10]:
ft_model.get_nearest_neighbors('Canada')

[(0.8826490640640259, 'Canada}'),
 (0.8428902626037598, 'Canada;'),
 (0.8328725099563599, 'Canada.)'),
 (0.818520188331604, 'Canada/US'),
 (0.815723717212677, 'Canada~'),
 (0.8113458752632141, 'Canada-3'),
 (0.810616672039032, 'Canada/USA'),
 (0.8073190450668335, 'Canada-'),
 (0.8052198886871338, 'Canadaland'),
 (0.7982547879219055, 'Canada--')]

Let's see how the model handles a random hashtag I observed in the data.

In [394]:
ft_model.get_nearest_neighbors("#forgetunemployment")

[(0.886472225189209, '#unemployment'),
 (0.8833913207054138, '#funemployment'),
 (0.8503758907318115, '#youthunemployment'),
 (0.7962139248847961, '#Unemployment'),
 (0.7865886688232422, '#selfemployment'),
 (0.7707514762878418, 'unemployment'),
 (0.7688292860984802, '#employment'),
 (0.768085777759552, 'ESLemployment'),
 (0.7653712034225464, '#ExtendUnemployment'),
 (0.7538452744483948, '#employmentlaw')]

We get comparable results with and without the hashtag.

In [393]:
ft_model.get_nearest_neighbors('forgetunemployment')

[(0.8567148447036743, 'unemployment'),
 (0.8388547301292419, '#unemployment'),
 (0.8345044851303101, '#funemployment'),
 (0.8146728873252869, 'ESLemployment'),
 (0.8087956309318542, 'unemployment?'),
 (0.8072128891944885, '#youthunemployment'),
 (0.8067687749862671, 'unemployment,'),
 (0.8031488656997681, 'unemployment.'),
 (0.8021907210350037, 'unemployment"'),
 (0.7957878708839417, 'unemployment!')]

## Generate embeddings

For each tweet document, I start by creating a zero vector of length 400, as this is the dimensionality of the pre-trained model. Then for each word/token in the tweet, I use the FastText model to create an embedding, which is added to the document vector. After adding the embeddings for every word in the tweet, I create the document embedding by dividing the summed vector by its norm.

In [11]:
%%time
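ft_doc_vectors = []
for doc in df1['for_vecs']:
    vector = np.zeros(400)  # same dimensionality as the pre-trained model
    for word in doc:
        try:
            # Add this token's embedding to the running document sum
            vector = np.add(ft_model[word], vector)
        except KeyError:
            # Defensive only: a token without a vector contributes nothing
            continue
    # L2-normalise; a tweet with no tokens has norm 0 and becomes NaN here
    vector = vector / np.linalg.norm(vector)
    ft_doc_vectors.append(vector)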

CPU times: user 2min 16s, sys: 32.2 s, total: 2min 48s
Wall time: 4min 30s
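The sum-then-normalise step can be seen on a toy example. This is a minimal sketch in which a small dictionary of 4-dimensional vectors stands in for the 400-dimensional FastText model; the `embed_document` helper, the toy vectors and the zero-norm guard are assumptions added for illustration.

```python
import numpy as np

# Toy stand-in for the pre-trained model: token -> 4-dimensional vector
toy_vectors = {
    'cerb': np.array([1.0, 0.0, 2.0, 0.0]),
    'helps': np.array([0.0, 3.0, 0.0, 0.0]),
}

def embed_document(tokens, lookup, dim=4):
    """Sum the token vectors, then scale the result to unit length."""
    total = np.zeros(dim)
    for token in tokens:
        total += lookup.get(token, np.zeros(dim))
    norm = np.linalg.norm(total)
    # Guard against division by zero for documents with no usable tokens
    return total / norm if norm > 0 else total

doc_vector = embed_document(['cerb', 'helps'], toy_vectors)
empty_vector = embed_document([], toy_vectors)
print(np.linalg.norm(doc_vector))
```

Unlike the loop above, this sketch returns a zero vector instead of NaN for an empty document, which is one possible design choice rather than what the notebook does.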


Make sure that there are no empty (NaN) document vectors.

In [12]:
# Boolean mask: True where the document vector contains no NaN values
filtered_docs = [not np.isnan(doc).any() for doc in ft_doc_vectors]

In [13]:
len(filtered_docs)

240808

In [16]:
doc_vectors = np.array(ft_doc_vectors)[filtered_docs]

In [17]:
len(doc_vectors)

240808
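Here every one of the 240,808 vectors survives the filter, so the vectors still line up row for row with `df1`. If some rows had been dropped, the same mask would also need to be applied to the dataframe to keep them aligned. A small sketch of that idea, with made-up vectors and ids:

```python
import numpy as np

# Made-up data: the second document vector is NaN (for example, an empty tweet)
vectors = np.array([[0.6, 0.8], [np.nan, np.nan], [1.0, 0.0]])
tweet_ids = np.array([101, 102, 103])

# One boolean mask, applied to both the vectors and their identifiers
keep = ~np.isnan(vectors).any(axis=1)
kept_vectors = vectors[keep]
kept_ids = tweet_ids[keep]
print(kept_ids)
```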

To see what a document vector looks like, I show only the first 100 values, as the full vector has 400.

In [18]:
doc_vectors[0][:100]

array([-0.003058  , -0.10491421,  0.03684855, -0.012267  , -0.00919887,
       -0.08231937, -0.0335966 , -0.02481934, -0.02251031,  0.03845023,
       -0.0002641 , -0.05904942, -0.08859264,  0.0153966 , -0.00884376,
        0.05222693,  0.0093316 , -0.04216076,  0.04405529, -0.03084019,
       -0.03554146, -0.01006025,  0.02306784, -0.03808308, -0.05935347,
       -0.03907043,  0.02869161,  0.05053029, -0.04654245,  0.03511478,
        0.04197958, -0.1183917 , -0.00134065, -0.05671608, -0.17690775,
       -0.08795649,  0.07938551,  0.00104496,  0.01450981, -0.05626698,
       -0.06712566,  0.1045928 ,  0.01922257,  0.03041594,  0.06595997,
       -0.03783401, -0.00904829, -0.03726572,  0.03684051,  0.00326786,
       -0.02243839,  0.03751467, -0.05560829,  0.07148362, -0.00089446,
        0.00708171, -0.0165192 ,  0.01236192, -0.15906664, -0.03559208,
       -0.05971879,  0.00336894,  0.01338454,  0.0325871 , -0.07559386,
       -0.05471139,  0.05658214,  0.0127135 ,  0.00343086, -0.03

In [19]:
with open('final_doc_vectors.pickle', 'wb') as to_write:
    pickle.dump(doc_vectors, to_write)