# Data Mining ⛏

**Purpose:** Collect all relevant tweets pertaining to the reopening of schools during the COVID-19 pandemic, posted between Jan. 1, 2020 and Sept. 15, 2020.

**Pipeline:**
1. Connect to Twitter's Search Tweets API, to the `full archive` endpoint
2. Go province by province<sup>1</sup> and:
    1. Collect all tweets that mention that province's education minister
    2. Collect all tweets that contain a dedicated list of keywords/hashtags
3. Store the collected tweets in a Pandas dataframe, keeping only the relevant features (date, geocode, text, author, *etc.*)
4. Add an extra column that is the cleaned tweet text.
5. Save dataframe to CSV
6. Solve the pandemic 🎊


<sup>1</sup> For more information on which tweets are geocoded, see [Twitter's geofiltering guide](https://developer.twitter.com/en/docs/tutorials/filtering-tweets-by-location).

In [3]:
import pandas as pd
import numpy as np
from searchtweets import collect_results, gen_rule_payload, load_credentials, ResultStream

premium_search_args = load_credentials(filename="../secrets/new_secret.yaml",yaml_key="search_tweets_api",env_overwrite=False)


Grabbing bearer token from OAUTH


## Location Filtering Rules

**Query Rules:** Each aspect of the query (mentions, keywords, hashtags, geo, etc.) is wrapped in its own brackets. A tweet only needs to match one of the aspects *other than geo*, so those are all ORed together. Geo must always be satisfied, so the ORed group is wrapped in brackets and the geo clause is appended after it (a space between clauses acts as AND).

**IMPORTANT:** Location filtering does not work on the `sandbox` API tier, so we need to pay for `premium` first.

To collect tweets from province $X$, search for tweets whose author's profile location contains $X$ **OR** that are geocoded to a place in $X$.

Note: the `geo` attribute is deprecated and is ignored accordingly. For geocoded tweets only the `place` attribute will be used.
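The per-province rule described above could be generated rather than typed out by hand. The following is a minimal sketch, not a validated rule: `province_rule` is a hypothetical helper, and the operators it emits (`place_contains`, `profile_region`, `bio_location`) are simply copied from the commented-out `places` dictionary in the next cell, which is itself flagged as needing validation against the API.

```python
def province_rule(abbrev, name):
    """Build a location clause for one province (hypothetical helper).

    Matches either a geocoded place or a profile location that mentions
    the province by full name or by its ", XX" abbreviation.
    """
    lowered = name.lower()
    # Quote multi-word names so they are treated as a single phrase.
    region = f'"{lowered}"' if " " in lowered else lowered
    clauses = [
        f'place_contains:", {abbrev}"',
        f'place_contains:"{name}"',
        f"profile_region:{region}",
        f"bio_location:{region}",
        f'bio_location:",{abbrev}"',
    ]
    # Any one clause is enough, so they are ORed inside one bracket pair.
    return "(" + " OR ".join(clauses) + ")"


print(province_rule("BC", "British Columbia"))
```

Generating the clauses from a `{abbreviation: name}` mapping keeps the thirteen rules consistent with each other, which makes a single validation run against the API more informative.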

In [4]:
# Need to validate that these work
# places = {
#     "AB":'place_contains:", AB" OR place_contains:"Alberta" OR (profile_region:alberta) OR (bio_location:alberta OR bio_location:",AB")',
#     "BC":'place_contains:", BC" OR place_contains:"British Columbia" OR (profile_region:"british columbia") OR (bio_location:"british columbia" OR bio_location:",BC")',
#     "MB":'place_contains:", MB" OR place_contains:"Manitoba" OR (profile_region:manitoba) OR (bio_location:manitoba OR bio_location:",MB")',
#     "NB":'place_contains:", NB" OR place_contains:"New Brunswick" OR (profile_region:"new brunswick") OR (bio_location:"new brunswick" OR bio_location:",NB")',
#     "NL":'place_contains:", NL" OR place_contains:"Newfoundland and Labrador" OR (profile_region:"newfoundland and labrador") OR (bio_location:"newfoundland and labrador" OR bio_location:",NL")',
#     "NT":'place_contains:", NT" OR place_contains:"Northwest Territories" OR (profile_region:"northwest territories") OR (bio_location:"northwest territories" OR bio_location:",NT")',
#     "NS":'place_contains:", NS" OR place_contains:"Nova Scotia" OR (profile_region:"nova scotia") OR (bio_location:"nova scotia" OR bio_location:",NS")',
#     "NU":'place_contains:", NU" OR place_contains:"Nunavut" OR (profile_region:nunavut) OR (bio_location:nunavut OR bio_location:",NU")',
#     "ON":'place_contains:", ON" OR place_contains:"Ontario" OR (profile_region:ontario) OR (bio_location:ontario OR bio_location:",ON")',
#     "PEI":'place_contains:", PEI" OR place_contains:"Prince Edward Island" OR (profile_region:"prince edward island") OR (bio_location:"prince edward island" OR bio_location:",PEI")',
#     "QC":'place_contains:", QC" OR place_contains:"Quebec" OR (profile_region:qu\u00e9be) OR (bio_location:qu\u00e9be OR bio_location:",QC")',
#     "SK":'place_contains:", SK" OR place_contains:"Saskatchewan" OR (profile_region:saskatchewan) OR (bio_location:saskatchewan OR bio_location:",SK")',
#     "YT":'place_contains:", YT" OR place_contains:"Yukon" OR (profile_region:yukon) OR (bio_location:yukon OR bio_location:",YT")'
# }

# has geo AND one of these place markers
country = '((has:geo OR has:profile_geo) (place_country:CA OR profile_country:CA))'

In [5]:
edu_minister_dict = {
    "AB": "@davideggenAB",
    "BC": "@Rob_Fleming",
    "MB": "@mingoertzen",
    "NB": "@DominicCardy",
    "NL": "@BrianWarr709",
    "NT": "@RJSimpson_NWT",
    "NS": "@zachchurchill",
    "ON": "@Sflecce",
    "PEI": "@bradtrivers",
    "QC": "@jfrobergeQc",
    "SK": "@GordWyant",
    "YT": "@TracyMcPheeRS"
}

education_ministers = [val for _,val in edu_minister_dict.items()]

education_ministers = " OR ".join(education_ministers)
education_ministers = f"({education_ministers})"
print(education_ministers)

(@davideggenAB OR @Rob_Fleming OR @mingoertzen OR @DominicCardy OR @BrianWarr709 OR @RJSimpson_NWT OR @zachchurchill OR @Sflecce OR @bradtrivers OR @jfrobergeQc OR @GordWyant OR @TracyMcPheeRS)


In [6]:
premier_dict = {
    "AB": "@jkenney",
    "BC": "@jjhorgan",
    "MB": "@BrianPallister",
    "NB": "@blainehiggs",
    "NL": "@PremierofNL",
    "NT": "@CCochrane_NWT",
    "NS": "@StephenMcNeil",
    "NU": "@JSavikataaq",
    "ON": "@fordnation",
    "PEI": "@dennyking",
    "QC": "@francoislegault",
    "SK": "@PremierScottMoe",
    "YT": "@Premier_Silver"
}

premiers = " OR ".join([val for _,val in premier_dict.items()])
premiers = f"(({premiers}) ((covid OR covid-19 OR coronavirus) (school OR childcare OR child)))"
print(premiers)

((@jkenney OR @jjhorgan OR @BrianPallister OR @blainehiggs OR @PremierofNL OR @CCochrane_NWT OR @StephenMcNeil OR @JSavikataaq OR @fordnation OR @dennyking OR @francoislegault OR @PremierScottMoe OR @Premier_Silver) ((covid OR covid-19 OR coronavirus) (school OR childcare OR child)))


In [7]:
keywords = [
    '"back to school"',
    '(school OR classroom OR child OR children) (covid OR covid-19 OR coronavirus OR risk OR open OR reopen OR safe OR safety OR safely)',
]
keywords = "("+") OR (".join(keywords)+")"
keywords = f"({keywords})"
keywords

'(("back to school") OR ((school OR classroom OR child OR children) (covid OR covid-19 OR coronavirus OR risk OR open OR reopen OR safe OR safety OR safely)))'

In [8]:
hashtags = [
    '#safeseptember',
    '#unsafeseptember',
    '#backtoschool']
hashtags = "("+" OR ".join(hashtags)+")"
hashtags

'(#safeseptember OR #unsafeseptember OR #backtoschool)'
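The cells above all build the same shape: a list of terms joined with `OR` and wrapped in brackets. A minimal sketch of that pattern as one reusable helper, with a length check added, is shown here. Both `or_group` and `check_length` are hypothetical names, and the `max_chars=1024` default is an assumption; confirm the actual rule-length limit for your API tier before relying on it.

```python
def or_group(terms):
    """Join terms with OR and wrap the result in brackets."""
    return "(" + " OR ".join(terms) + ")"


def check_length(rule, max_chars=1024):
    """Return the rule unchanged, or raise if it is longer than max_chars.

    max_chars is an assumed limit, not one taken from this notebook.
    """
    if len(rule) > max_chars:
        raise ValueError(f"rule is {len(rule)} characters, limit is {max_chars}")
    return rule


hashtag_group = or_group(["#safeseptember", "#unsafeseptember", "#backtoschool"])
print(check_length(hashtag_group))
```

Checking the length before sending the request is cheap, and it fails early when a long list of mentions and keywords pushes the combined rule over the limit.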

#### Sample Tweets

From: 
* March: 9, 16, 23
* April: 6, 13, 20
* May: 4, 11, 18
* June: 1, 8, 29
* July: 6, 13, 20
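
The sample days listed above can be turned into `from_date`/`to_date` pairs programmatically. This is a sketch under two assumptions: the days are all in 2020 (the collection window stated in the purpose), and each sample covers a single day, mirroring the `2020-06-08` to `2020-06-09` pair used in the query cell. `SAMPLE_DAYS` and `sample_windows` are hypothetical names.

```python
from datetime import date, timedelta

# Month number -> sampled days of that month, copied from the list above.
SAMPLE_DAYS = {
    3: [9, 16, 23],
    4: [6, 13, 20],
    5: [4, 11, 18],
    6: [1, 8, 29],
    7: [6, 13, 20],
}


def sample_windows(year=2020):
    """Yield (from_date, to_date) strings, one pair per sampled day."""
    for month, days in SAMPLE_DAYS.items():
        for day in days:
            start = date(year, month, day)
            end = start + timedelta(days=1)
            yield start.isoformat(), end.isoformat()


windows = list(sample_windows())
print(len(windows), windows[0])
```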
    

In [19]:
import os
import json


def create_query(filters,geo="",lang="en"):
    """
        Takes in a list of fully formed filter groups (keywords, mentions, hashtags, ...), ORs them
        together, and appends a language filter and an optional geo filter to form a query for
        Twitter's historical search API
    """
    lang = f"lang:{lang}"
    filter_str = " OR ".join(filters)
    query = f"({filter_str}) {lang} {geo}"
    return query.strip()


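def make_query(query,from_date,to_date):
    """
        Runs the query against the premium search endpoint and returns at most one page
        (up to 500 tweets) of results as a list
    """
    rule = gen_rule_payload(query,
                        from_date=from_date, # dates are interpreted as UTC, e.g. 2018-10-21 00:00
                        to_date=to_date,
                        results_per_call=500)
    rs = ResultStream(rule_payload=rule,
                  max_pages=1,
#                   max_results=10**10,
                  **premium_search_args)
    return list(rs.stream())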
    

query = create_query([keywords,premiers,hashtags,education_ministers],country)
from_date = "2020-06-08"
to_date = "2020-06-09"
fp = "../data/raw_data/{}_{}.json".format(from_date,to_date)
tweets = None
if os.path.isfile(fp):
    print(fp)
    with open(fp) as fin:
        tweets = json.load(fin)
else:
    tweets = make_query(query,from_date=from_date,to_date=to_date)
    with open(fp, 'w') as fout:
        json.dump(tweets,fout,indent=4)
        
tweets

../data/raw_data/2020-06-08_2020-06-09.json


[{'created_at': 'Mon Jun 08 23:34:45 +0000 2020',
  'id': 1270137220297457665,
  'id_str': '1270137220297457665',
  'text': 'Instant improvement. Open in September without Applied level. All kids in Academic program, studying at academic le… https://t.co/aRvbTFL63c',
  'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
  'truncated': True,
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'in_reply_to_screen_name': None,
  'user': {'id': 2790351250,
   'id_str': '2790351250',
   'name': 'Doug Little 🌹✊ 🍊',
   'screen_name': 'jdouglaslittle',
   'location': 'Toronto, and Vancouver Canada',
   'url': 'http://www.thelittleeducationreport.ca',
   'description': 'Education Writer blog http://thelittleeducationreport.ca\n\n Former trustee & OSSTF staff political action. Retired history teacher M Ed. Critical supporter  NDP.',
   'translator_type': 'none',
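
The query cell above caches results on disk so that a paid API call is made only once per date range. A minimal sketch of that load-or-fetch pattern as a standalone function follows. `load_or_fetch` and `fake_fetch` are hypothetical names; in the notebook the fetcher would be the function that calls the search API.

```python
import json
import os
import tempfile


def load_or_fetch(path, fetch):
    """Return cached JSON from path, or call fetch() and cache its result."""
    if os.path.isfile(path):
        with open(path) as fin:
            return json.load(fin)
    result = fetch()
    with open(path, "w") as fout:
        json.dump(result, fout, indent=4)
    return result


# Stand-in fetcher that records how often it is called.
calls = []

def fake_fetch():
    calls.append(1)
    return [{"id": 1, "text": "example"}]


with tempfile.TemporaryDirectory() as tmp:
    fp = os.path.join(tmp, "2020-06-08_2020-06-09.json")
    first = load_or_fetch(fp, fake_fetch)   # no file yet: fetches and writes
    second = load_or_fetch(fp, fake_fetch)  # file exists: reads from disk

print(len(calls))
```

Passing the fetcher in as a function keeps the caching logic independent of the API client, so it can be exercised without credentials.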
   

## Process Tweets

Feature construction, tweet cleaning, etc.

In [94]:
import re

def clean_tweet(text,extended_tweet):
    if pd.isna(extended_tweet):
        return pd.Series([clean_text(text), text])
    to_dict = dict(extended_tweet)
    return pd.Series([clean_text(to_dict["full_text"]),to_dict["full_text"]])

rex = re.compile(r'<a.*?>(.*?)</a>',re.S|re.M)
def clean_source(source):
    match = rex.match(source)
    return match.groups()[0].strip()

clean_user = lambda x : x["screen_name"]

def clean_entities(entities):
    # Empty lists become NaN so that all three columns mark "nothing found" the same way
    hashtags = [h["text"] for h in entities["hashtags"]] if entities["hashtags"] else np.nan
    urls = [h["expanded_url"] for h in entities["urls"]] if entities["urls"] else np.nan
    mentions = [h["screen_name"] for h in entities["user_mentions"]] if entities["user_mentions"] else np.nan
    return hashtags,urls,mentions

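def clean_location(place):
    # Tweets matched through profile geo alone may carry no place object
    if not isinstance(place, dict):
        return np.nan,np.nan
    split = [l.strip() for l in place["full_name"].split(",")]
    # "City, Province"
    if len(split) == 2:
        return tuple(split)
    ## AFAIK the only time there's more than 1 comma in a place field is when the place is labelled 'unorganized'
    elif len(split) > 2:
        return np.nan,split[-1]
    # No comma: keep the single value as the city
    else:
        return split[0],np.nan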
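
To make the source and location cleaning above concrete, here is a small standalone sketch that applies the same two ideas to literal values. The source string is the one from the sample tweet shown earlier; the place name `"Vancouver, British Columbia"` is an assumed input chosen to match the first row of the output. `source_label` and `split_place` are hypothetical names, and `None` stands in for `np.nan` so the example needs only the standard library.

```python
import re

# Captures the link text of the <a> tag Twitter uses for the source field.
anchor = re.compile(r"<a.*?>(.*?)</a>", re.S)


def source_label(source):
    """Return the link text of a source anchor, or None if it does not parse."""
    match = anchor.match(source)
    return match.group(1).strip() if match else None


def split_place(full_name):
    """Split 'City, Province' into (city, province); None marks a missing part."""
    parts = [p.strip() for p in full_name.split(",")]
    if len(parts) == 2:
        return parts[0], parts[1]
    if len(parts) > 2:
        # More than one comma: keep only the last part as the province.
        return None, parts[-1]
    return parts[0], None


src = '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'
print(source_label(src))
print(split_place("Vancouver, British Columbia"))
```

Returning `None` when the anchor does not match avoids an `AttributeError` on unexpected source strings, at the cost of having to handle missing values downstream.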

In [97]:
from text_cleaning import clean_text
clean_fp = "../data/processed_data/{}_{}.csv".format(from_date,to_date)
cov_tweets = pd.DataFrame(tweets)
cov_tweets = cov_tweets[['id','user','created_at', 'source', 'text','extended_tweet','place','entities','favorite_count', 'retweet_count']]
# Get twitter handle from user
cov_tweets["user"] = cov_tweets["user"].apply(clean_user)
# clean the tweet text
cov_tweets[["text","extended_tweet"]] = cov_tweets[["text","extended_tweet"]].apply(lambda x: clean_tweet(*x),axis=1)
cov_tweets = cov_tweets.rename({"text": "clean_text","extended_tweet":"original_text"},axis=1)
# Get the city/province from the location data
cov_tweets[["city","province"]] = cov_tweets[["place"]].apply(lambda x : clean_location(*x),axis=1,result_type="expand")
cov_tweets = cov_tweets.drop("place",axis=1)
# Through what medium did they post the tweet?
cov_tweets["source"] = cov_tweets["source"].apply(clean_source)
# Extract tweet entities (hashtags, linked urls, etc...)
cov_tweets[["hashtags","urls","mentions"]] = cov_tweets[["entities"]].apply(lambda x : clean_entities(x["entities"]),result_type="expand",axis=1)
cov_tweets = cov_tweets.drop("entities",axis=1)
cov_tweets.to_csv(clean_fp)
cov_tweets.head(50)

| Unnamed: 0 | id | user | created_at | source | clean_text | original_text | favorite_count | retweet_count | city | province | hashtags | urls | mentions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1270137220297457665 | jdouglaslittle | Mon Jun 08 23:34:45 +0000 2020 | Twitter for Android | [instant, improvement, open, september, withou... | Instant improvement. Open in September without... | 0 | 0 | Vancouver | British Columbia | | [https://twitter.com/i/web/status/127013722029... | |
| 1 | 1270119672336093185 | ShaneWenzel | Mon Jun 08 22:25:02 +0000 2020 | Twitter for iPhone | [bellis1994, toadamvaughan, somebody, send, ba... | @bellis1994 @TOAdamVaughan Somebody should sen... | 1 | 0 | Calgary | Alberta | | | [bellis1994, TOAdamVaughan] |
| 2 | 1270113767981027330 | aforgrave | Mon Jun 08 22:01:34 +0000 2020 | Twitter for iPhone | [samoosterhoff, deepakanandmpp, sflecce, pleas... | @samoosterhoff @DeepakAnandMPP @Sflecce Please... | 2 | 2 | Belleville | Ontario | | [https://twitter.com/i/web/status/127011376798... | [samoosterhoff, DeepakAnandMPP, Sflecce] |
| 3 | 1270105966441385984 | MsBelvitt | Mon Jun 08 21:30:34 +0000 2020 | Twitter for iPhone | [covid-19, middle, devine, leadership, potenti... | Before COVID-19 hit CA&amp;US I was in the mid... | 2 | 0 | Kitchener | Ontario | | [https://twitter.com/i/web/status/127010596644... | |
| 4 | 1270100052745183232 | pairsonnalitesN | Mon Jun 08 21:07:04 +0000 2020 | dlvr.it | [fight, stigma, covid-19, pod, inhumane, say, ... | Fighting Stigma : Covid-19 pods are 'inhumane'... | 0 | 0 | Varennes | Québec | | [https://twitter.com/i/web/status/127010005274... | |
| 5 | 1270100051709198336 | pairsonnalitesN | Mon Jun 08 21:07:04 +0000 2020 | dlvr.it | [fight, stigma, ireland, must, completely, eli... | Fighting Stigma : Ireland must completely elim... | 0 | 0 | Varennes | Québec | | [https://twitter.com/i/web/status/127010005170... | |
| 6 | 1270098922799869958 | PHump72 | Mon Jun 08 21:02:35 +0000 2020 | Twitter for iPhone | [stuff, make, people, sick, every, single, res... | This is the stuff that makes people sick. How ... | 0 | 0 | Pickering | Ontario | | [https://twitter.com/i/web/status/127009892279... | |
| 7 | 1270094829217857536 | tbcschools_STMT | Mon Jun 08 20:46:19 +0000 2020 | Instagram | [virtual, bike, rodeos, virtual, bike, rodeo, ... | About Virtual Bike Rodeos\nVirtual Bike Rodeos... | 1 | 0 | Thunder Bay | Ontario | | [https://twitter.com/i/web/status/127009482921... | |
| 8 | 1270082574459613190 | REngString | Mon Jun 08 19:57:37 +0000 2020 | Twitter for iPhone | [please, forward, anyone, know, provide, food,... | Please forward to anyone you know who is provi... | 0 | 2 | Saskatoon | Saskatchewan | | [https://www.healthyschoolfood.ca/post/a-surve... | |
| 9 | 1270078688520835072 | MsYouDoYou | Mon Jun 08 19:42:10 +0000 2020 | Twitter for iPhone | [stay, facebook, week, argue, stick, mom, say,... | I stayed off Facebook for a week and now I’m a... | 7 | 0 | Nanaimo | British Columbia | | [https://twitter.com/i/web/status/127007868852... | |
