# Canada's Response to Economic Impacts of COVID-19
  
Amanda Cheney  
Metis Final Project  
Part 1 of 4  
December 8, 2020    

**Objective**  
Use unsupervised learning and natural language processing to identify two sets of clusters of Twitter conversations about the Canada Emergency Response Benefit (CERB) and Canada Recovery Benefit (CRB), programs introduced to address unemployment and the economic impacts of the COVID-19 pandemic:

1. One that captures the contours of everyday user conversations. 

2. Another that highlights unusually dense clusters of conversation, in which users appear to make collaborative efforts to shape public opinion and perception.

**Data Sources**  
250,000+ tweets from March 1 - December 1, 2020, collected using snscrape.  

**This Notebook**  
Collects tweets and performs initial EDA and cleaning.

## Imports

In [1]:
import json
import csv
import pandas as pd
import pickle

## Twitter Query

The Twitter API limits developers to tweets from within the past week, but I wanted to collect tweets from as far back as March 2020, when the pandemic began and these programs were first rolled out. For my historical tweet collection, I therefore used [snscrape](https://github.com/JustAnotherArchivist/snscrape). Snscrape offers somewhat less metadata than the Twitter API itself, but still more than enough for the purposes of this analysis.

To keep extraneous tweets out of my results, I used a relatively complex search query. It gathers tweets since March 1, 2020, and filters out retweets as well as content that uses the acronyms in ways unrelated to these Canadian temporary social security and financial aid programs. Judging by the excluded terms, the noise around CRB comes mostly from football and railway accounts.

The snscrape call is made at the command line: `snscrape --jsonl --max-results 2000000 --since 2020-03-01 twitter-search '((CRB or #CRB) -clubes -coupe -SuperCoupe -SUPER_COUPE -football -belouizdad -chabab -USMA -brasileño -Brasileirão -Algérie -Railway -@IR_CRB -@RailMinIndia -@PiyushGoyal) OR CERB OR #CERB OR (#EI (Canada OR #onpoli OR #cdnpoli)) OR (Canada unemployment) OR ((#CRA or cra) (#covid19canada OR #covid19relief OR covid OR covid19 OR coronavirus or pandemic)) -filter:retweets' > cerb_no_retweets.json`
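To make the structure of the query easier to see, here is a minimal sketch that assembles roughly the same search string from its parts. The term lists are copied from the command above; the names `CRB_NOISE_TERMS`, `exclude`, `clauses`, and `command` are hypothetical, and the sketch writes every `OR` in upper case, which is an assumption about the intended query (the original mixes `or` and `OR`). The argument list is only built, not executed.

```python
# Term lists copied from the command above; helper and variable names are hypothetical.
CRB_NOISE_TERMS = [
    "clubes", "coupe", "SuperCoupe", "SUPER_COUPE", "football", "belouizdad",
    "chabab", "USMA", "brasileño", "Brasileirão", "Algérie", "Railway",
    "@IR_CRB", "@RailMinIndia", "@PiyushGoyal",
]


def exclude(terms):
    """Prefix each term with '-' so the search drops tweets containing it."""
    return " ".join(f"-{term}" for term in terms)


# Each clause is one alternative; a tweet matching any of them is collected.
clauses = [
    f"((CRB OR #CRB) {exclude(CRB_NOISE_TERMS)})",
    "CERB",
    "#CERB",
    "(#EI (Canada OR #onpoli OR #cdnpoli))",
    "(Canada unemployment)",
    "((#CRA OR cra) (#covid19canada OR #covid19relief OR covid OR covid19 "
    "OR coronavirus OR pandemic))",
]
query = " OR ".join(clauses) + " -filter:retweets"

# Argument list mirroring the flags used above; it is only built here, not run.
command = [
    "snscrape", "--jsonl", "--max-results", "2000000",
    "--since", "2020-03-01", "twitter-search", query,
]
print(query)
```

Keeping the exclusion terms in a list makes it easier to add new noise terms as they turn up during EDA.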

In [3]:
cerb_df = pd.read_json('cerb_no_retweets.json', lines=True)
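The `--jsonl` flag writes one JSON object per line, which is why `lines=True` is needed when reading the file. A minimal sketch with two made-up records (the field names match the columns in this dataset, the values are invented) shows the behaviour:

```python
import io

import pandas as pd

# Two invented records in JSON Lines form: one complete JSON object per line.
sample = (
    '{"id": 1, "content": "CERB payment arrived", "lang": "en", "retweetedTweet": null}\n'
    '{"id": 2, "content": "La PCU est arrivée", "lang": "fr", "retweetedTweet": null}\n'
)

# lines=True tells pandas to parse each line as a separate record.
toy_df = pd.read_json(io.StringIO(sample), lines=True)
print(toy_df.shape)
print(toy_df["lang"].tolist())
```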

In [4]:
cerb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255327 entries, 0 to 255326
Data columns (total 21 columns):
 #   Column           Non-Null Count   Dtype              
---  ------           --------------   -----              
 0   url              255327 non-null  object             
 1   date             255327 non-null  datetime64[ns, UTC]
 2   content          255327 non-null  object             
 3   renderedContent  255327 non-null  object             
 4   id               255327 non-null  int64              
 5   user             255327 non-null  object             
 6   outlinks         255327 non-null  object             
 7   tcooutlinks      255327 non-null  object             
 8   replyCount       255327 non-null  int64              
 9   retweetCount     255327 non-null  int64              
 10  likeCount        255327 non-null  int64              
 11  quoteCount       255327 non-null  int64              
 12  conversationId   255327 non-null  int64              
 13 

Note that `retweetedTweet` has 0 non-null rows, which we should expect having filtered out retweets. 
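A quick way to confirm this kind of expectation directly, rather than reading it off the `info()` output, is to test the column for missing values. This is a sketch on an invented three-row frame; on the real data the same two lines would be applied to `cerb_df`.

```python
import pandas as pd

# Invented stand-in for the scraped data: no row carries a retweeted tweet.
toy_df = pd.DataFrame({
    "content": ["first tweet", "second tweet", "third tweet"],
    "retweetedTweet": [None, None, None],
})

# True only if every value in the column is missing.
no_retweets = toy_df["retweetedTweet"].isna().all()
# Number of rows that do carry a retweeted tweet (should be zero).
n_retweets = toy_df["retweetedTweet"].notna().sum()
print(no_retweets, n_retweets)
```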

In [5]:
cerb_df.shape

(255327, 21)

The next step is to limit the data to English-language tweets. Although Canada is officially a bilingual country, the French-language equivalents of CERB and CRB have different names and acronyms: Prestation canadienne d’urgence (PCU) and Prestation canadienne de la relance économique (PCRE). Sifting through the noise of Twitter for these is beyond the scope of this project but would be an excellent avenue for future work.

In [6]:
cerb_df = cerb_df[cerb_df['lang'] == 'en'].copy()  # .copy() avoids SettingWithCopyWarning when columns are added later

In [7]:
cerb_df.shape

(240808, 21)
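Going from 255,327 to 240,808 rows means the language filter drops 14,519 tweets and keeps about 94.3% of the data. The sketch below, on an invented four-row frame, shows one way to inspect the language mix before filtering and to compute the retained share; the column name `lang` is the one used above, everything else is illustrative.

```python
import pandas as pd

# Invented rows; "und" is used here as a placeholder for an undetermined language.
toy_df = pd.DataFrame({
    "content": ["CERB helped", "La PCU a aidé", "CRB application", "#CERB"],
    "lang": ["en", "fr", "en", "und"],
})

# How many tweets fall under each language code.
lang_counts = toy_df["lang"].value_counts()

# Keep English rows only; .copy() gives an independent frame to add columns to.
english_df = toy_df[toy_df["lang"] == "en"].copy()
retained_share = len(english_df) / len(toy_df)

print(lang_counts)
print(f"retained: {retained_share:.1%}")

# The same arithmetic for the row counts reported in this notebook.
notebook_share = 240808 / 255327
```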

In [8]:
cerb_df.head()

Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,quoteCount,conversationId,lang,source,sourceUrl,sourceLabel,media,retweetedTweet,quotedTweet,mentionedUsers
0,https://twitter.com/SandraLynnColl3/status/133...,2020-12-01 00:12:39+00:00,"@mini_bubbly In our Entire extended family, on...","@mini_bubbly In our Entire extended family, on...",1333564633320480769,"{'username': 'SandraLynnColl3', 'displayname':...",[],[],0,0,...,0,1333538159846715392,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,"[{'username': 'mini_bubbly', 'displayname': '🇨..."
2,https://twitter.com/bunmzi/status/133356432966...,2020-12-01 00:11:27+00:00,@MrStache9 Many Canadians dont realize this. L...,@MrStache9 Many Canadians dont realize this. L...,1333564329665441793,"{'username': 'bunmzi', 'displayname': 'Mikel A...",[],[],0,1,...,0,1333526557697187847,en,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,,,,"[{'username': 'MrStache9', 'displayname': 'Mr ..."
3,https://twitter.com/D313131Daniel/status/13335...,2020-12-01 00:08:32+00:00,@nationalpost @TheGrowthOp So that won’t cost ...,@nationalpost @TheGrowthOp So that won’t cost ...,1333563595607699464,"{'username': 'D313131Daniel', 'displayname': '...",[],[],0,0,...,0,1333472122270851073,en,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': 'nationalpost', 'displayname': '..."
4,https://twitter.com/Paulbyjove1/status/1333563...,2020-12-01 00:07:28+00:00,@exposforever @MJosling53 @erinotoole @PierreP...,@exposforever @MJosling53 @erinotoole @PierreP...,1333563328040464385,"{'username': 'Paulbyjove1', 'displayname': 'Pa...",[],[],0,0,...,0,1333380605426495489,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,"[{'username': 'exposforever', 'displayname': '..."
5,https://twitter.com/GavinBamber/status/1333563...,2020-12-01 00:07:00+00:00,@journo_dale Something like over $400 million ...,@journo_dale Something like over $400 million ...,1333563211216424961,"{'username': 'GavinBamber', 'displayname': 'Ga...",[],[],0,0,...,0,1333549485725798400,en,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,,,,"[{'username': 'journo_dale', 'displayname': 'D..."


Let's have a look at the content of a single example tweet.

In [9]:
cerb_df.iloc[10].content

"@RosieBarton Mr. O'Toole is not a leader, you need to actually care about Cdn's, I heard him say the CERB was a bad idea being released when it did. Won't answer what he would cut even though he says he agrees with half the measures. He's actually a bit of a joke!"

In [10]:
cerb_df.iloc[10].user

{'username': 'dmc1701',
 'displayname': 'David McRae',
 'id': 238865666,
 'description': 'I am of the Odawa First Nation with some Scottish blood thrown in for good measure, I love people, politics! Hate Racists/Bullies!',
 'rawDescription': 'I am of the Odawa First Nation with some Scottish blood thrown in for good measure, I love people, politics! Hate Racists/Bullies!',
 'descriptionUrls': [],
 'verified': False,
 'created': '2011-01-16T06:21:46+00:00',
 'followersCount': 2563,
 'friendsCount': 1421,
 'statusesCount': 31945,
 'favouritesCount': 34332,
 'listedCount': 23,
 'mediaCount': 205,
 'location': 'Ottawa, Ontario Canada',
 'protected': False,
 'linkUrl': None,
 'linkTcourl': None,
 'profileImageUrl': 'https://pbs.twimg.com/profile_images/1148297030529425408/aIQau2tO_normal.jpg',
 'profileBannerUrl': 'https://pbs.twimg.com/profile_banners/238865666/1496371695',
 'url': 'https://twitter.com/dmc1701'}

Note that all of this user information is stored as a dictionary (parsed from the JSON), so individual fields can be extracted by key.

In [11]:
cerb_df.iloc[10].user['username']

'dmc1701'

Let's create a list of all usernames that can then be incorporated into the dataframe. 

In [12]:
# Pull the username out of each user dictionary in one pass over the column
user_name_list = cerb_df['user'].apply(lambda user: user['username']).tolist()

In [13]:
cerb_df['user_name'] = user_name_list
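If more than one user field is needed later, the dictionaries can be flattened in one step instead of one list per field. This is a sketch using `pd.json_normalize` on an invented two-row frame; the keys mirror a few of those shown in the user record above, and the second user is made up.

```python
import pandas as pd

# Invented frame whose "user" column holds dictionaries like the one shown above.
toy_df = pd.DataFrame({
    "content": ["tweet one", "tweet two"],
    "user": [
        {"username": "dmc1701", "followersCount": 2563,
         "location": "Ottawa, Ontario Canada"},
        {"username": "example_user", "followersCount": 10, "location": ""},
    ],
})

# One column per dictionary key, one row per tweet.
user_df = pd.json_normalize(toy_df["user"].tolist())
# Align on the original index so the columns can be assigned back safely.
user_df.index = toy_df.index

toy_df["user_name"] = user_df["username"]
toy_df["followers"] = user_df["followersCount"]
print(toy_df[["user_name", "followers"]])
```

Re-using the original index matters here because `cerb_df` no longer has a contiguous index after the language filter.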

In [14]:
cerb_df.tail()

Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,conversationId,lang,source,sourceUrl,sourceLabel,media,retweetedTweet,quotedTweet,mentionedUsers,user_name
255321,https://twitter.com/Son5000Yp/status/123397408...,2020-03-01 04:35:22+00:00,@GretaThunberg I am certainly disgusted by the...,@GretaThunberg I am certainly disgusted by the...,1233974083169247232,"{'username': 'Son5000Yp', 'displayname': 'Yvet...",[],[],0,0,...,1233775476998582278,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,"[{'username': 'GretaThunberg', 'displayname': ...",Son5000Yp
255322,https://twitter.com/thekaptainbee/status/12339...,2020-03-01 03:51:20+00:00,@CaracciGmc @AdamParkhomenko -Lowest black une...,@CaracciGmc @AdamParkhomenko -Lowest black une...,1233963002250039296,"{'username': 'thekaptainbee', 'displayname': '...",[],[],1,0,...,1224907023990562816,en,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': 'CaracciGmc', 'displayname': 'Gi...",thekaptainbee
255324,https://twitter.com/patrick26539519/status/123...,2020-03-01 01:36:07+00:00,@JustinTrudeau And unemployment rates through ...,@JustinTrudeau And unemployment rates through ...,1233928976541736961,"{'username': 'patrick26539519', 'displayname':...",[],[],0,0,...,1233890614879703041,en,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': 'JustinTrudeau', 'displayname': ...",patrick26539519
255325,https://twitter.com/RAndrewsCWE/status/1233925...,2020-03-01 01:21:03+00:00,@RachelNotley Who do you think pays for our he...,@RachelNotley Who do you think pays for our he...,1233925183477448705,"{'username': 'RAndrewsCWE', 'displayname': 'Ry...",[],[],0,0,...,1233866203568852992,en,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': 'RachelNotley', 'displayname': '...",RAndrewsCWE
255326,https://twitter.com/gmelaabez/status/123391390...,2020-03-01 00:36:15+00:00,@diggingforoil @sammideedub @paulparrazzilvr @...,@diggingforoil @sammideedub @paulparrazzilvr @...,1233913906604756993,"{'username': 'gmelaabez', 'displayname': 'Gene...",[],[],2,0,...,1233827163666669568,en,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': 'diggingforoil', 'displayname': ...",gmelaabez


## EDA and Cleaning

Although our data does not appear to contain retweets, let's look at how many of the tweets have been retweeted by other users. About 79% of tweets have never been retweeted.

In [15]:
cerb_df['retweetCount'].value_counts(normalize=True)

0      0.793898
1      0.091625
2      0.033558
3      0.017794
4      0.011021
         ...   
383    0.000004
443    0.000004
442    0.000004
186    0.000004
572    0.000004
Name: retweetCount, Length: 463, dtype: float64
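Summing the first five rows above, about 94.8% of tweets have four or fewer retweets, so engagement is heavily concentrated in a small tail. A cumulative view makes that easier to read than the raw frequencies. The sketch below uses ten invented retweet counts rather than the real column.

```python
import pandas as pd

# Ten invented retweet counts standing in for cerb_df['retweetCount'].
retweet_counts = pd.Series([0, 0, 0, 1, 2, 0, 5, 0, 0, 0], name="retweetCount")

# Share of tweets that were never retweeted.
share_never = (retweet_counts == 0).mean()

# Cumulative share of tweets with at most k retweets, ordered by k.
cumulative = (
    retweet_counts.value_counts(normalize=True)
    .sort_index()
    .cumsum()
)
print(share_never)
print(cumulative)
```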

**Unique Content**  
A crude check for duplicate content. Since retweets were filtered out, any duplicates must arise for other reasons, such as bots.

In [17]:
print("Number of tweets with unique content:", len(cerb_df['content'].unique()))
print('Number of tweets with non-unique content:', cerb_df.shape[0]-len(cerb_df['content'].unique()))

Number of tweets with unique content: 239291
Number of tweets with non-unique content: 1517
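The counts above say how many duplicates there are, but not whether they come from one account repeating itself or from many accounts posting the same text. A sketch of that breakdown, on invented data with hypothetical column values, could look like this:

```python
import pandas as pd

# Invented tweets: one text posted three times by two different accounts.
toy_df = pd.DataFrame({
    "content": ["same petition text", "same petition text",
                "an original thought", "same petition text"],
    "user_name": ["user_a", "user_b", "user_c", "user_a"],
})

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = toy_df[toy_df["content"].duplicated(keep=False)]

# For each repeated text: how many tweets, and from how many distinct users.
summary = (
    dupes.groupby("content")
    .agg(n_tweets=("user_name", "size"), n_users=("user_name", "nunique"))
    .sort_values("n_tweets", ascending=False)
)
print(summary)
```

Texts with many tweets from many distinct users are the more interesting candidates for coordinated activity; many tweets from a single user look more like one account repeating itself.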


**Unique Timestamps**  
One prominent indicator of botnets is that the bot accounts tweet content at exactly the same time. Since our timestamps are recorded to the second, we can measure this fairly precisely, though natural coincidences will of course also occur. I will return to this question in later notebooks.

In [19]:
print("Number of tweets with unique timestamps:", len(cerb_df['date'].unique()))
print('Number of tweets with non-unique timestamps:', cerb_df.shape[0]-len(cerb_df['date'].unique()))

Number of tweets with unique timestamps: 235211
Number of tweets with non-unique timestamps: 5597
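To separate coincidence from coordination, it helps to count how many distinct users share each timestamp and to have a rough baseline for chance collisions. The sketch below does both on invented data. The baseline assumes tweets are spread uniformly over the 275 days from March 1 to December 1, which is certainly too simple (real traffic is bursty), so it should be read as a lower bound on coincidences rather than a prediction.

```python
import pandas as pd

# Invented tweets: three accounts share one timestamp, a fourth does not.
toy_df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-08-18 03:04:27+00:00", "2020-08-18 03:04:27+00:00",
         "2020-08-18 03:04:27+00:00", "2020-08-18 03:05:18+00:00"],
        utc=True,
    ),
    "user_name": ["user_a", "user_b", "user_c", "user_d"],
})

# Number of distinct accounts tweeting in each second.
users_per_second = toy_df.groupby("date")["user_name"].nunique()
# Hypothetical threshold: flag seconds shared by three or more accounts.
flagged = users_per_second[users_per_second >= 3]
print(flagged)

# Rough chance baseline under the (unrealistic) uniform assumption:
# expected number of tweet pairs that land in the same second.
n_tweets = 240808
n_seconds = 275 * 24 * 60 * 60
expected_pairs = n_tweets * (n_tweets - 1) / (2 * n_seconds)
print(round(expected_pairs))
```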


Create a new simple date column.

In [16]:
cerb_df['simple_date'] = cerb_df.date.dt.date
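One thing to keep in mind is that `date` is in UTC, so `simple_date` is the UTC calendar date. A tweet sent in the evening in Canada can therefore land on the following day. If local dates matter for a later analysis, the timestamps can be converted first. The sketch below uses two timestamps that appear in this notebook and assumes Eastern time (`America/Toronto`) purely for illustration, since users are spread across several Canadian time zones.

```python
import pandas as pd

# Two timestamps taken from rows shown in this notebook.
utc_times = pd.Series(
    pd.to_datetime(
        ["2020-12-01 00:12:39+00:00", "2020-08-18 03:05:18+00:00"], utc=True
    )
)

# Calendar date in UTC, as in simple_date.
utc_date = utc_times.dt.date
# Calendar date after converting to Eastern time (an illustrative choice).
eastern_date = utc_times.dt.tz_convert("America/Toronto").dt.date

print(pd.DataFrame({"utc_date": utc_date, "eastern_date": eastern_date}))
```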

In [18]:
cerb_df.head()

Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,lang,source,sourceUrl,sourceLabel,media,retweetedTweet,quotedTweet,mentionedUsers,user_name,simple_date
0,https://twitter.com/SandraLynnColl3/status/133...,2020-12-01 00:12:39+00:00,"@mini_bubbly In our Entire extended family, on...","@mini_bubbly In our Entire extended family, on...",1333564633320480769,"{'username': 'SandraLynnColl3', 'displayname':...",[],[],0,0,...,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,"[{'username': 'mini_bubbly', 'displayname': '🇨...",SandraLynnColl3,2020-12-01
2,https://twitter.com/bunmzi/status/133356432966...,2020-12-01 00:11:27+00:00,@MrStache9 Many Canadians dont realize this. L...,@MrStache9 Many Canadians dont realize this. L...,1333564329665441793,"{'username': 'bunmzi', 'displayname': 'Mikel A...",[],[],0,1,...,en,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,,,,"[{'username': 'MrStache9', 'displayname': 'Mr ...",bunmzi,2020-12-01
3,https://twitter.com/D313131Daniel/status/13335...,2020-12-01 00:08:32+00:00,@nationalpost @TheGrowthOp So that won’t cost ...,@nationalpost @TheGrowthOp So that won’t cost ...,1333563595607699464,"{'username': 'D313131Daniel', 'displayname': '...",[],[],0,0,...,en,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': 'nationalpost', 'displayname': '...",D313131Daniel,2020-12-01
4,https://twitter.com/Paulbyjove1/status/1333563...,2020-12-01 00:07:28+00:00,@exposforever @MJosling53 @erinotoole @PierreP...,@exposforever @MJosling53 @erinotoole @PierreP...,1333563328040464385,"{'username': 'Paulbyjove1', 'displayname': 'Pa...",[],[],0,0,...,en,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,"[{'username': 'exposforever', 'displayname': '...",Paulbyjove1,2020-12-01
5,https://twitter.com/GavinBamber/status/1333563...,2020-12-01 00:07:00+00:00,@journo_dale Something like over $400 million ...,@journo_dale Something like over $400 million ...,1333563211216424961,"{'username': 'GavinBamber', 'displayname': 'Ga...",[],[],0,0,...,en,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,,,,"[{'username': 'journo_dale', 'displayname': 'D...",GavinBamber,2020-12-01


Let's have a look at location data. Data acquired from snscrape does not include the same geolocation data that one could acquire from Tweepy. However, only 1-2% of tweets are understood to have geolocation data in any case, and users can also post their location in their profile information, so let's see what that looks like.

In [20]:
# Pull the self-reported location out of each user dictionary
locations = cerb_df['user'].apply(lambda user: user['location']).tolist()

In [21]:
len(locations)

240808

In [22]:
from collections import Counter

In [23]:
loc_counts = Counter(locations)

In [24]:
len(loc_counts)

16928

That's nearly 17,000 unique locations, which appear too messy to engineer into a useful feature for the time being.
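Two things stand out in the counts printed in this notebook: the empty string alone accounts for 65,564 of the 240,808 tweets (about 27%), and many entries differ only in case or whitespace. A minimal sketch on invented strings shows how to get the most frequent values and how much a very light normalization reduces the number of distinct entries. It does not attempt real geocoding.

```python
from collections import Counter

# Invented profile locations with the same kinds of noise seen in the real data.
toy_locations = [
    "Canada", "", "Toronto, Ontario", "", "toronto, ontario ", "🇨🇦", "Canada",
]

toy_counts = Counter(toy_locations)
# The three most frequent raw strings.
top_three = toy_counts.most_common(3)
# Share of profiles with no location at all.
blank_share = toy_counts[""] / len(toy_locations)

# Light normalization: trim whitespace and ignore case.
normalized_counts = Counter(loc.strip().casefold() for loc in toy_locations)

print(top_three)
print(len(toy_counts), "raw values ->", len(normalized_counts), "normalized values")
```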

In [25]:
loc_counts

Counter({'Canada': 20716,
         '': 65564,
         'Kingston, Ontario': 497,
         'North Vancouver': 87,
         'Calgary, Alberta': 3054,
         'British Columbia Canada': 32,
         'Fredericton New Brunswick': 1,
         'Ottawa, Ontario Canada': 24,
         'Safe refuge for some Albertans': 162,
         'Ontario, Canada': 6404,
         'Rothesay, NB, Canada': 1,
         'Winnipeg, Manitoba': 1467,
         'he / she': 4,
         'Calgary': 868,
         'mississauga': 9,
         'Kelowna, British Columbia': 160,
         'Toronto, Ontario': 10325,
         '... da\' "Bert"': 76,
         'Greater Vancouver, British Columbia': 71,
         'Saskatoon': 113,
         'CANADA █ ♥ █': 16,
         'Northwestern Ontario': 8,
         'Winnipeg, MB': 256,
         'Ottawa, ON': 313,
         'Dartford, UK': 25,
         '🇨🇦': 441,
         'ヘソでも噛んで死んじゃえばぁ?': 475,
         'Alberta, Canada': 2125,
         'Amherstburg, Ontario ': 40,
         'Ottawa': 1414,
         

Just in case I want to look at this later I will incorporate it as a new dataframe column. 

In [26]:
cerb_df['location'] = locations

One more look, this time at an arbitrary slice from the middle of the data.

In [27]:
cerb_df.iloc[72000:72010]

Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,source,sourceUrl,sourceLabel,media,retweetedTweet,quotedTweet,mentionedUsers,user_name,simple_date,location
76690,https://twitter.com/JoyceMurray/status/1295557...,2020-08-18 03:05:18+00:00,‘@Bill_Morneau - capable and caring - has serv...,‘@Bill_Morneau - capable and caring - has serv...,1295557353539022848,"{'username': 'JoyceMurray', 'displayname': 'Jo...",[],[],2,4,...,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': 'Bill_Morneau', 'displayname': '...",JoyceMurray,2020-08-18,Vancouver Quadra
76691,https://twitter.com/BigAlx/status/129555715940...,2020-08-18 03:04:31+00:00,I just signed in support of converting the CER...,I just signed in support of converting the CER...,1295557159401459718,"{'username': 'BigAlx', 'displayname': 'Alex Mo...",[https://www.leahgazan.ca/basicincomemotion?re...,[https://t.co/SWYtQEtgU8],0,0,...,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,,BigAlx,2020-08-18,
76692,https://twitter.com/gla4refugees/status/129555...,2020-08-18 03:04:27+00:00,@JustinTrudeau @Bill_Morneau “Bill Money” than...,@JustinTrudeau @Bill_Morneau “Bill Money” than...,1295557141986607104,"{'username': 'gla4refugees', 'displayname': 'B...",[],[],0,0,...,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': 'JustinTrudeau', 'displayname': ...",gla4refugees,2020-08-18,"Quebec, England"
76693,https://twitter.com/minsooklee/status/12955565...,2020-08-18 03:01:56+00:00,"We are in a global pandemic, 3 million Canadia...","We are in a global pandemic, 3 million Canadia...",1295556508139298816,"{'username': 'minsooklee', 'displayname': 'Min...",[https://twitter.com/PnPCBC/status/12955080447...,[https://t.co/zYYJq67FvW],1,7,...,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,{'url': 'https://twitter.com/PnPCBC/status/129...,,minsooklee,2020-08-18,Toronto
76694,https://twitter.com/NewthAndrea/status/1295555...,2020-08-18 02:58:58+00:00,@204Girl0574 I am a small business owner and I...,@204Girl0574 I am a small business owner and I...,1295555760701747201,"{'username': 'NewthAndrea', 'displayname': 'An...",[],[],1,3,...,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': '204Girl0574', 'displayname': '🇨...",NewthAndrea,2020-08-18,"Brighton, Ontario"
76695,https://twitter.com/A44037682871/status/129555...,2020-08-18 02:58:00+00:00,@battousai1130 @LaurenToronto4 I hear you. No ...,@battousai1130 @LaurenToronto4 I hear you. No ...,1295555517864120322,"{'username': 'A44037682871', 'displayname': 'I...",[],[],1,0,...,"<a href=""https://mobile.twitter.com"" rel=""nofo...",https://mobile.twitter.com,Twitter Web App,,,,"[{'username': 'battousai1130', 'displayname': ...",A44037682871,2020-08-18,The hinterland
76696,https://twitter.com/bryanyuBC/status/129555465...,2020-08-18 02:54:34+00:00,Why the end of CERB could jolt consumer spendi...,Why the end of CERB could jolt consumer spendi...,1295554653069795328,"{'username': 'bryanyuBC', 'displayname': 'brya...",[https://www.theglobeandmail.com/business/arti...,[https://t.co/RHgQ6DvNrX],0,0,...,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,,bryanyuBC,2020-08-18,
76697,https://twitter.com/liminal67/status/129555441...,2020-08-18 02:53:38+00:00,I just signed in support of converting the CER...,I just signed in support of converting the CER...,1295554418994274307,"{'username': 'liminal67', 'displayname': 'Davi...",[https://www.leahgazan.ca/basicincomemotion?re...,[https://t.co/HsUbilQeUw],0,1,...,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,,liminal67,2020-08-18,"Toronto, Canada"
76698,https://twitter.com/NewthAndrea/status/1295553...,2020-08-18 02:51:51+00:00,@204Girl0574 There are checks and balances. I...,@204Girl0574 There are checks and balances. I...,1295553971168456705,"{'username': 'NewthAndrea', 'displayname': 'An...",[],[],1,0,...,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': '204Girl0574', 'displayname': '🇨...",NewthAndrea,2020-08-18,"Brighton, Ontario"
76699,https://twitter.com/NNorma192/status/129555378...,2020-08-18 02:51:07+00:00,@kimcreynolds1 @TheJasonPugh New Brunswickers!...,@kimcreynolds1 @TheJasonPugh New Brunswickers!...,1295553786539397125,"{'username': 'NNorma192', 'displayname': 'NNor...",[],[],2,1,...,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': 'kimcreynolds1', 'displayname': ...",NNorma192,2020-08-18,


This gives us a fairly good sense of what the data looks like from a high level. Let's pickle the dataframe to use in the next notebook for text preprocessing and tokenization to extract word embeddings.

In [28]:
with open('cerb_df.pickle', 'wb') as to_write:
    pickle.dump(cerb_df, to_write)
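For reference, the round trip can be checked before moving on. This sketch uses pandas' own `to_pickle` and `read_pickle` on an invented frame written to a temporary directory, so it does not touch `cerb_df.pickle`; the next notebook could load the real file the same way with `pd.read_pickle('cerb_df.pickle')`.

```python
import os
import tempfile

import pandas as pd

# Invented frame standing in for cerb_df.
toy_df = pd.DataFrame({
    "content": ["tweet one", "tweet two"],
    "user_name": ["user_a", "user_b"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_df.pickle")
    toy_df.to_pickle(path)            # write the frame to disk
    restored = pd.read_pickle(path)   # read it back

# True if values, dtypes, and index all survived the round trip.
print(restored.equals(toy_df))
```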