# Data Mining ⛏

**Purpose:** Collect all relevant tweets pertaining to the reopening of schools during the COVID-19 pandemic, posted between Jan. 1, 2020 and Sept. 15, 2020.

**Pipeline:**
1. Connect to the `full archive` endpoint of Twitter's Search Tweets API
2. Go province by province<sup>1</sup> and:
    1. Collect all tweets that mention a premier or an education minister
    2. Collect all tweets that contain a dedicated list of keywords/hashtags
3. Store the collection of tweets in a Pandas dataframe, keeping only the relevant features (date, geocode, text, author, *etc.*)
4. Add an extra column containing the cleaned tweet text
5. Save dataframe to CSV
6. Solve the pandemic 🎊


<sup>1</sup> For more information on which tweets are geocoded, see [Twitter's geofiltering guide](https://developer.twitter.com/en/docs/tutorials/filtering-tweets-by-location).

In [1]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm 
from searchtweets import collect_results, gen_rule_payload, load_credentials, ResultStream

premium_search_args = load_credentials(filename="../secrets/new_secret.yaml",yaml_key="search_tweets_api",env_overwrite=False)


Grabbing bearer token from OAUTH


## Location Filtering Rules

**Query Rules:** Each aspect of the query (mentions, keywords, hashtags, geo, etc.) is wrapped in its own brackets. Apart from *geo*, only one of these parts needs to be satisfied, so they are all ORed together. Since geo must always be satisfied, the ORed parts are put in brackets and the geo clause is appended at the end.

**IMPORTANT** This does not work with the `sandbox` API tier so we need to pony up for `premium` first.

To collect tweets from province $X$, search for tweets whose account profile location contains $X$ **OR** geocoded tweets that fall within $X$.

Note: the `geo` attribute is deprecated and is ignored accordingly. For geocoded tweets only the `place` attribute will be used.
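The province-level rule described above could be written as a small helper. This is only a sketch: the function name `province_geo_clause` is hypothetical, and it assumes the `place:` and `profile_region:` operators are available on the API tier in use and accept a quoted province name, which should be checked against the operator documentation before relying on it.

```python
def province_geo_clause(province, country_code="CA"):
    """Build a geo clause for tweets tagged in, or profiles located in, a province."""
    # Quote the name so multi-word provinces ("British Columbia") stay one phrase.
    quoted = f'"{province}"'
    # Assumed operators: place: (geocoded tweets) and profile_region: (profile location).
    in_province = f"(place:{quoted} OR profile_region:{quoted})"
    in_country = f"(place_country:{country_code} OR profile_country:{country_code})"
    # A space between bracketed groups acts as AND in the query language.
    return f"({in_province} {in_country})"


print(province_geo_clause("British Columbia"))
# ((place:"British Columbia" OR profile_region:"British Columbia") (place_country:CA OR profile_country:CA))
```

The country check is kept alongside the province check so that places sharing a name outside Canada are less likely to match.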

In [2]:
# has geo AND one of these place markers
country = '((has:geo OR has:profile_geo) (place_country:CA OR profile_country:CA))'

### Keyword Strategy
*TODO: UPDATE*

A tweet must satisfy three conditions:
1. It needs to be about the COVID-19 pandemic or reopening (covid OR covid-19 OR coronavirus OR pandemic OR lockdown OR shutdown OR closure OR open OR reopen OR risk OR safe, etc.)
2. It needs to be about children or parents (child OR children OR toddler OR kid OR mom OR dad OR mother OR father OR parent, etc.)
3. It needs to be about school or child care (school OR preschool OR daycare OR childcare OR class OR classroom OR cohort OR online/distance/remote learning)
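The three conditions combine as "at least one term from every group". A rough local check of that logic, useful for sanity-testing keyword lists on sample text before spending API requests, might look like the sketch below. The helper names are hypothetical, and whole-word regex matching is only an approximation of how Twitter tokenizes and matches tweets.

```python
import re


def matches_any(text, terms):
    """True if any term appears in the text as a whole word (case-insensitive)."""
    return any(re.search(rf"\b{re.escape(t)}\b", text, re.IGNORECASE) for t in terms)


def satisfies_all(text, groups):
    """A tweet passes only if every group contributes at least one match."""
    return all(matches_any(text, group) for group in groups)


# Shortened, illustrative versions of the three keyword groups.
groups = [["covid", "pandemic"], ["kid", "kids", "parent"], ["school", "schools"]]

print(satisfies_all("Are schools safe for kids during the pandemic?", groups))  # True
print(satisfies_all("The pandemic closed every school", groups))  # False
```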

In [3]:
covid_filters = ["covid",
                 "covid-19",
                 "coronavirus",
                 "pandemic",
                 "lockdown",
                 "shutdown",
                 "closure",
                 "closures",
                 "open",
                 "reopen",
                 "risk",
                 "safe",
                 "safety",
                 "safely"]

covid_filters = "(("+") OR (".join(covid_filters)+"))"

school_filters = ["school",
          "schools",
          "preschools",
          "preschool",
          "daycare",
          "childcare",
          "class",
          "classroom",
          "classrooms",
          "cohort",
          "(online OR distance OR remote) learning"]

school_filters = "(("+") OR (".join(school_filters)+"))"

child_filters = ["child",
                 "children",
                 "toddler",
                 "toddlers",
                 "kid",
                 "kids",
                 "mom",
                 "moms",
                 "mother",
                 "mothers",
                 "dad",
                 "dads",
                 "father",
                 "fathers",
                 "parent",
                 "parents"]

child_filters = "(("+") OR (".join(child_filters)+"))"

keywords = "("+" ".join([covid_filters,child_filters,school_filters])+")"
keywords



'(((covid) OR (covid-19) OR (coronavirus) OR (pandemic) OR (lockdown) OR (shutdown) OR (closure) OR (closures) OR (open) OR (reopen) OR (risk) OR (safe) OR (safety) OR (safely)) ((child) OR (children) OR (toddler) OR (toddlers) OR (kid) OR (kids) OR (mom) OR (moms) OR (mother) OR (mothers) OR (dad) OR (dads) OR (father) OR (fathers) OR (parent) OR (parents)) ((school) OR (schools) OR (preschools) OR (preschool) OR (daycare) OR (childcare) OR (class) OR (classroom) OR (classrooms) OR (cohort) OR ((online OR distance OR remote) learning)))'
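The same bracket-and-join pattern is repeated for each keyword list in the cell above. As a sketch, it could be factored into two small helpers (`or_group` and `and_groups` are hypothetical names, not part of the notebook):

```python
def or_group(terms):
    """Wrap each term in brackets and OR them together inside one outer group."""
    return "((" + ") OR (".join(terms) + "))"


def and_groups(*groups):
    """Join already-bracketed groups with spaces, which the query language treats as AND."""
    return "(" + " ".join(groups) + ")"


demo = and_groups(or_group(["covid", "pandemic"]), or_group(["kid", "parent"]))
print(demo)
# (((covid) OR (pandemic)) ((kid) OR (parent)))
```

This produces strings in the same shape as the `keywords` value shown above, so it would be a drop-in replacement for the three repeated `join` lines.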

In [4]:
hashtags = [
    '#safeseptember',
    '#safeseptemberAB',
    '#safeseptemberBC',
    '#SafeSeptemberMB',
    '#safeseptemberNB',
    '#safeseptemberNL',
    '#safeseptemberNS',
    '#safeseptemberON',
    '#safeseptemberPEI',
    '#safeseptemberQC',
    '#safeseptemberSK',
    '#safeseptemberYT',
    '#unsafeseptember',
    '#unsafeseptemberAB',
    '#unsafeseptemberBC',
    '#unsafeseptemberMB',
    '#unsafeseptemberNS',
    '#unsafeseptemberON',
    '#unsafeseptemberQC',
]

hashtags = "("+" OR ".join(hashtags)+")"
hashtags


'(#safeseptember OR #safeseptemberAB OR #safeseptemberBC OR #SafeSeptemberMB OR #safeseptemberNB OR #safeseptemberNL OR #safeseptemberNS OR #safeseptemberON OR #safeseptemberPEI OR #safeseptemberQC OR #safeseptemberSK OR #safeseptemberYT OR #unsafeseptember OR #unsafeseptemberAB OR #unsafeseptemberBC OR #unsafeseptemberMB OR #unsafeseptemberNS OR #unsafeseptemberON OR #unsafeseptemberQC)'

#### Sample Tweets

From: 
* March: 8, 20
* April: 8, 20
* May: 8, 20
* June: 8, 20
* July: 8, 20
* August: 8, 20
    

In [5]:
import os
import json


def create_query(filters,geo="",lang="en"):
    """
        Takes in a list of fully formed filters, ORs them together, and ANDs the
        result with the language filter and the (optional) geo clause.
    """
    lang = f"lang:{lang}"
    filter_str = " OR ".join(filters)
    query = f"({filter_str}) {lang} {geo}"
    return query.strip()


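def return_tweets(query,from_date,to_date,f_name=None,max_pages=1):
    """
        Fetches tweets matching the query between from_date and to_date (YYYY-MM-DD).
        Raw results are cached as JSON so repeated calls don't spend API requests.
    """
    name = f"{from_date}_{to_date}" if not f_name else f"{f_name}-{from_date}_{to_date}"
    fp = "../data/raw_data/{}.json".format(name)
    # Reuse the cached results if this date range has already been collected
    if os.path.isfile(fp):
        with open(fp) as fin:
            return json.load(fin),name
    print(f"Making request: {name}")
    rule = gen_rule_payload(query,
                            from_date=from_date, # dates are in UTC, e.g. 2018-10-21 00:00
                            to_date=to_date,
                            results_per_call=500)
    # max_results is set high enough that max_pages is the effective limit
    rs = ResultStream(rule_payload=rule,
                      max_pages=max_pages,
                      max_results=10**10,
                      **premium_search_args)
    tweets = list(rs.stream())
    with open(fp, 'w') as fout:
        json.dump(tweets,fout,indent=4)
    return tweets,name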
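
Because `return_tweets` caches one file per date range, a long collection period can be split into consecutive windows and fetched one window at a time. The sketch below shows one way to generate such windows. `date_windows` is a hypothetical helper, and it assumes the end date of each window is exclusive, so adjacent windows can share a boundary date without overlapping.

```python
from datetime import date, timedelta


def date_windows(start, end, days=30):
    """Split [start, end) into consecutive (from_date, to_date) ISO date pairs."""
    current = date.fromisoformat(start)
    stop = date.fromisoformat(end)
    windows = []
    while current < stop:
        # The last window is clipped so it never runs past the overall end date.
        upper = min(current + timedelta(days=days), stop)
        windows.append((current.isoformat(), upper.isoformat()))
        current = upper
    return windows


print(date_windows("2020-01-01", "2020-01-25", days=10))
# [('2020-01-01', '2020-01-11'), ('2020-01-11', '2020-01-21'), ('2020-01-21', '2020-01-25')]
```

Each pair can then be passed as `from_date` and `to_date` to `return_tweets`.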

In [13]:
query = create_query([keywords,hashtags],country)
from_date = "2020-02-15"
to_date = "2020-03-15"
print(query,len(query))


((((covid) OR (covid-19) OR (coronavirus) OR (pandemic) OR (lockdown) OR (shutdown) OR (closure) OR (closures) OR (open) OR (reopen) OR (risk) OR (safe) OR (safety) OR (safely)) ((child) OR (children) OR (toddler) OR (toddlers) OR (kid) OR (kids) OR (mom) OR (moms) OR (mother) OR (mothers) OR (dad) OR (dads) OR (father) OR (fathers) OR (parent) OR (parents)) ((school) OR (schools) OR (preschools) OR (preschool) OR (daycare) OR (childcare) OR (class) OR (classroom) OR (classrooms) OR (cohort) OR ((online OR distance OR remote) learning))) OR (#safeseptember OR #safeseptemberAB OR #safeseptemberBC OR #SafeSeptemberMB OR #safeseptemberNB OR #safeseptemberNL OR #safeseptemberNS OR #safeseptemberON OR #safeseptemberPEI OR #safeseptemberQC OR #safeseptemberSK OR #safeseptemberYT OR #unsafeseptember OR #unsafeseptemberAB OR #unsafeseptemberBC OR #unsafeseptemberMB OR #unsafeseptemberNS OR #unsafeseptemberON OR #unsafeseptemberQC)) lang:en ((has:geo OR has:profile_geo) (place_country:CA OR pro
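The cell prints the query length because the API caps how long a query may be. A guard like the following sketch could fail early instead of waiting for the API to reject the request. The 1024-character figure is an assumption about the premium tier and should be confirmed against the current documentation; `check_query_length` is a hypothetical helper.

```python
MAX_QUERY_CHARS = 1024  # assumed premium-tier limit; verify for the tier in use


def check_query_length(query, limit=MAX_QUERY_CHARS):
    """Return the number of spare characters, or raise if the query is too long."""
    spare = limit - len(query)
    if spare < 0:
        raise ValueError(f"Query is {-spare} characters over the {limit}-character limit")
    return spare


print(check_query_length("(covid OR pandemic) lang:en"))
# 997
```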

In [14]:
tweets,f_name = return_tweets(query,from_date=from_date,to_date=to_date,max_pages=290)
f_name,len(tweets)


('2020-02-15_2020-03-15', 13871)

## Process Tweets

Feature construction, tweet cleaning, etc.
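One detail of tweet cleaning is choosing which text field to read: long tweets carry their full text under `extended_tweet`, and retweets carry the original tweet under `retweeted_status`. The sketch below illustrates that selection on plain dictionaries. `pick_text` is a hypothetical, simplified stand-in that skips the text normalisation step entirely.

```python
def pick_text(tweet):
    """Return (text, is_retweet), preferring the untruncated text where available."""
    retweet = tweet.get("retweeted_status")
    if isinstance(retweet, dict):
        # For retweets, read the text from the original tweet instead.
        text, _ = pick_text(retweet)
        return text, True
    extended = tweet.get("extended_tweet")
    if isinstance(extended, dict) and "full_text" in extended:
        return extended["full_text"], False
    return tweet.get("text"), False


short = {"text": "Schools reopen Monday"}
long_rt = {
    "text": "RT @someone: Schools reopen...",
    "retweeted_status": {
        "text": "Schools reopen...",
        "extended_tweet": {"full_text": "Schools reopen Monday with new safety rules"},
    },
}

print(pick_text(short))    # ('Schools reopen Monday', False)
print(pick_text(long_rt))  # ('Schools reopen Monday with new safety rules', True)
```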

In [35]:
from utils import DTYPE, PARSE_DATES, PROVINCES,CONVERTERS
from text_cleaning import clean_text
from unidecode import unidecode
from math import isnan
import re

decode = lambda x : unidecode(x) if type(x) is str else x

def clean_tweet(text,extended_tweet,retweeted_status=None):
    if retweeted_status and type(retweeted_status) is dict:
        retweeted_status = dict(retweeted_status)
        # Clean the original (retweeted) tweet and drop its is_retweet flag, which is replaced with True
        cleaned = clean_tweet(retweeted_status.get("text"),retweeted_status.get("extended_tweet"))[:-1]
        return (*cleaned,True)
    if pd.isna(extended_tweet):
        return clean_text(text), text, False
    to_dict = dict(extended_tweet)
    return clean_text(to_dict["full_text"]),to_dict["full_text"], False

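rex = re.compile(r'<a.*?>(.*?)</a>',re.S|re.M)
def clean_source(source):
    # The source field is an HTML anchor tag; keep only its visible text
    if isinstance(source,str):
        match = rex.match(source)
        # Guard against sources that are not wrapped in an anchor tag
        if match:
            return match.group(1).strip()
    return np.nan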

clean_user = lambda x : x["screen_name"] if x["screen_name"] else None

def clean_entities(entities):
    hashtags = [h["text"] for h in entities["hashtags"]] if entities["hashtags"] else np.nan
    urls = [h["expanded_url"] for h in entities["urls"]] if entities["urls"] else np.nan
    mentions = [h["screen_name"] for h in entities["user_mentions"]] if entities["user_mentions"] else np.nan
    return hashtags,urls,mentions

in_province = lambda prov : prov in PROVINCES

def check_user(user):
    user = dict(user)
    if "derived" in user and "locations" in user["derived"]:
        # Use the first location derived from the user's profile
        loc = user["derived"]["locations"][0]
        # Fall back to NaN coordinates if the derived location has no geo point
        long_lat = loc.get("geo",{}).get("coordinates",(np.nan,np.nan))
        city = loc.get("locality",np.nan)
        prov = loc.get("region",np.nan)
        city, prov = decode(city), decode(prov)
        loc_tup = (city, prov,*long_lat)
        return loc_tup
    return (np.nan,np.nan,np.nan,np.nan)
    
def clean_location(place,user):
    if place:
        place = dict(place)
        long_lat = place["bounding_box"]["coordinates"][0][0]
        split = [decode(l.strip()) for l in place["full_name"].split(",")]
        user_loc = check_user(user)
        if len(split) == 2:
            return tuple(split+long_lat) if in_province(split[-1]) else user_loc
        ## AFAIK the only time there's more than 1 comma in a place field is when the place is labelled 'unorganized'
        elif len(split) > 2:
            # If the tweet location object is having problems and we can derive a user location, do so.
            if not user_loc.count(np.nan) or not in_province(split[-1]):
                return user_loc
            return (np.nan,split[-1],*long_lat)
        else:
            # If the tweet location object is having problems and we can derive a user location, do so.
            if not user_loc.count(np.nan):
                return user_loc
            return (split[0],np.nan,*long_lat)
    else:
        return check_user(user)

def JSON_to_CSV(f_name):
    clean_fp = "../data/processed_data/{}.csv".format(f_name)
    # If the CSV file already exists, don't waste time doing all the cleaning again
    if os.path.isfile(clean_fp):
        return pd.read_csv(clean_fp,
                           index_col=0,
                           header=0,
                           dtype=DTYPE,
                           converters=CONVERTERS,
                           parse_dates=PARSE_DATES)
    # If the zipped file already exists, read it, write the plain CSV to local storage and return the dataframe
    if os.path.isfile(clean_fp+".gz"):
        cov_tweets = pd.read_csv(clean_fp+".gz",
                                 index_col=0,
                                 header=0,
                                 compression='gzip',
                                 dtype=DTYPE,
                                 converters=CONVERTERS,
                                 parse_dates=PARSE_DATES)
        cov_tweets.to_csv(clean_fp)
        return cov_tweets
    cov_tweets = pd.read_json("../data/raw_data/{}.json".format(f_name))
    cov_tweets = cov_tweets[['id','user','created_at', 'source', 'text','extended_tweet','retweeted_status','place','entities','favorite_count', 'retweet_count']].set_index("id")
    # Get twitter handle from user
    cov_tweets["screen_name"] = cov_tweets["user"].apply(clean_user)
    # clean the tweet text
    cov_tweets[["text","extended_tweet","is_retweet"]] = cov_tweets[["text","extended_tweet","retweeted_status"]].apply(lambda x: clean_tweet(*x),axis=1,result_type="expand")
    cov_tweets = cov_tweets.rename({"text": "clean_text","extended_tweet":"original_text"},axis=1)
    # Get the city/province from the location data
    cov_tweets[["city","province","longitude","latitude"]] = cov_tweets[["place","user"]].apply(lambda x : clean_location(*x),axis=1,result_type="expand")
    cov_tweets = cov_tweets.drop(["place","user"],axis=1)
    # Through what medium did they post the tweet?
    cov_tweets["source"] = cov_tweets["source"].apply(clean_source)
    # Extract tweet entities (hashtags, linked urls, etc...)
    cov_tweets[["hashtags","urls","mentions"]] = cov_tweets[["entities"]].apply(lambda x : clean_entities(x["entities"]),result_type="expand",axis=1)
    cov_tweets = cov_tweets.drop("entities",axis=1)
    cov_tweets = cov_tweets[["created_at","screen_name","source","clean_text","original_text","is_retweet","favorite_count","retweet_count","hashtags","urls","mentions","city","province","longitude","latitude"]]
    cov_tweets.to_csv(clean_fp+".gz",compression='gzip')
    cov_tweets.to_csv(clean_fp)
    return cov_tweets
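
Note that `clean_location` takes the first corner of the place's bounding box as the tweet's longitude/latitude. If a point nearer the middle of the place is preferred, averaging the corners is one alternative. The sketch below is that swapped-in technique, not what the notebook does. `bbox_centroid` is a hypothetical helper, and it assumes the bounding box is a single ring of `[longitude, latitude]` corners, as in the `place` objects handled above.

```python
def bbox_centroid(bounding_box):
    """Average the corner points of a bounding box into a (longitude, latitude) pair."""
    corners = bounding_box["coordinates"][0]
    lon = sum(corner[0] for corner in corners) / len(corners)
    lat = sum(corner[1] for corner in corners) / len(corners)
    return lon, lat


# Made-up box for illustration only.
box = {"coordinates": [[[-80.0, 43.0], [-78.0, 43.0], [-78.0, 45.0], [-80.0, 45.0]]]}
print(bbox_centroid(box))
# (-79.0, 44.0)
```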
        

In [34]:
cov_tweets = JSON_to_CSV(f_name)
cov_tweets.head()

id,created_at,screen_name,source,clean_text,original_text,is_retweet,favorite_count,retweet_count,hashtags,urls,mentions,city,province,longitude,latitude
1238978134785101824,2020-03-14 23:59:41+00:00,wandreef,Twitter for Android,trudeau say canada pull stop respond covid kno...,"Trudeau says Canada is ""pulling out all the st...",True,0,0,,,[ABC],Slave Lake,Alberta,-114.76896,55.28344
1238977998805766145,2020-03-14 23:59:08+00:00,TheDevinaKaur,Twitter Web App,diary single school closure dear friend past w...,Diary of a single mom #Covid19 school closure...,True,0,0,"[Covid19, SexyBrilliant]",,[TheDevinaKaur],,,-113.64258,60.10867
1238977939166744577,2020-03-14 23:58:54+00:00,DarrylSeguin,Twitter for iPhone,chief medical officer health hinshaw ask kid s...,Chief Medical Officer of Health Hinshaw asked ...,True,0,0,,,[MattWolfAB],Lethbridge,Alberta,-112.81856,49.69999
1238977881881169921,2020-03-14 23:58:40+00:00,claramanoucheka,Twitter for Android,parent vermont school fundraising campaign jan...,Parents at two Vermont schools set up a fundra...,True,0,0,,,[NikkiHaley],,,-113.64258,60.10867
1238977677186314240,2020-03-14 23:57:51+00:00,unjadedmb,Twitter for iPhone,trudeau say canada pull stop respond covid kno...,"Trudeau says Canada is ""pulling out all the st...",True,0,0,,,[ABC],Winnipeg,Manitoba,-97.14704,49.88440
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1228472099464810498,2020-02-15 00:12:27+00:00,DianneWatts4BC,Twitter for iPhone,still discuss pilot do already safety kid rid ...,Why is this still being discussed and piloted ...,False,27,4,,[https://twitter.com/i/web/status/122847209946...,,Surrey,British Columbia,-122.82509,49.10635
1228470535530647552,2020-02-15 00:06:14+00:00,camille4change,Twitter for iPhone,student tcdsb would nice walk place know safe ...,"to me, a student in the TCDSB, this would be s...",True,0,0,,,[leahbanning],Hamilton,Ontario,-79.84963,43.25011
1228470466668564481,2020-02-15 00:05:57+00:00,Mom_ASDadvocate,Twitter for iPhone,student tcdsb would nice walk place know safe ...,"to me, a student in the TCDSB, this would be s...",True,0,0,,,[leahbanning],Toronto,Ontario,-79.41630,43.70011
1228470050996113408,2020-02-15 00:04:18+00:00,4Everanimalz1,Twitter for iPad,improve many priority pilot project important ...,Improving #RoadSafety in #Canada is one of our...,True,0,0,"[RoadSafety, Canada, seatbelts]",,[Transport_gc],Calgary,Alberta,-114.08529,51.05011
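The `hashtags`, `urls` and `mentions` columns hold lists, which come back as plain strings when the CSV is read again. The `CONVERTERS` mapping imported from `utils` presumably handles this, but its contents are not shown here. As a sketch, assuming the lists were written in Python's default `str(list)` form, a converter could look like this (`parse_list_cell` is a hypothetical name):

```python
import ast


def parse_list_cell(cell):
    """Turn a stringified list such as "['a', 'b']" back into a list; empty cells become None."""
    if not isinstance(cell, str) or not cell.strip():
        return None
    # literal_eval only evaluates Python literals, so it is safer than eval.
    return ast.literal_eval(cell)


print(parse_list_cell("['Covid19', 'SexyBrilliant']"))
# ['Covid19', 'SexyBrilliant']
print(parse_list_cell(""))
# None
```

A function like this can be passed to `pd.read_csv` through its `converters` argument, for example `converters={"hashtags": parse_list_cell}`.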


In [37]:
from glob import glob

date_frames = glob("../data/raw_data/2*.json")
date_frames = [d.split("/")[-1].split(".")[0] for d in date_frames]
list(map(JSON_to_CSV,tqdm(date_frames)))






[                                   created_at      screen_name  \
 id                                                               
 1238978134785101824 2020-03-14 23:59:41+00:00         wandreef   
 1238977998805766145 2020-03-14 23:59:08+00:00    TheDevinaKaur   
 1238977939166744577 2020-03-14 23:58:54+00:00     DarrylSeguin   
 1238977881881169921 2020-03-14 23:58:40+00:00  claramanoucheka   
 1238977677186314240 2020-03-14 23:57:51+00:00        unjadedmb   
 ...                                       ...              ...   
 1228472099464810498 2020-02-15 00:12:27+00:00   DianneWatts4BC   
 1228470535530647552 2020-02-15 00:06:14+00:00   camille4change   
 1228470466668564481 2020-02-15 00:05:57+00:00  Mom_ASDadvocate   
 1228470050996113408 2020-02-15 00:04:18+00:00    4Everanimalz1   
 1228469111451242497 2020-02-15 00:00:34+00:00     Transport_gc   
 
                                   source  \
 id                                         
 1238978134785101824  Twitter for And

## Scraping Politician Mentions
Tweets must mention (@) a politician (premier or education minister) and match both the COVID/reopening keywords AND the child/parent keywords.

In [7]:
edu_minister_dict = {
    "AB": "@davideggenAB",
    "BC": "@Rob_Fleming",
    "MB": "@mingoertzen",
    "NB": "@DominicCardy",
    "NL": "@BrianWarr709",
    "NT": "@RJSimpson_NWT",
    "NS": "@zachchurchill",
    "ON": "@Sflecce",
    "PEI": "@bradtrivers",
    "QC": "@jfrobergeQc",
    "SK": "@GordWyant",
    "YT": "@TracyMcPheeRS"
}

premier_dict = {
    "AB": "@jkenney",
    "BC": "@jjhorgan",
    "MB": "@BrianPallister",
    "NB": "@blainehiggs",
    "NL": "@PremierofNL",
    "NT": "@CCochrane_NWT",
    "NS": "@StephenMcNeil",
    "NU": "@JSavikataaq",
    "ON": "@fordnation",
    "PEI": "@dennyking",
    "QC": "@francoislegault",
    "SK": "@PremierScottMoe",
    "YT": "@Premier_Silver"
}

politicians = " OR ".join(list(premier_dict.values())+list(edu_minister_dict.values()))

politicians = f"(({politicians}) ({covid_filters} {child_filters}))"
pol_query = create_query([politicians],country)
from_date = "2020-02-15"
to_date = "2020-08-23"
print(pol_query,len(pol_query))


(((@jkenney OR @jjhorgan OR @BrianPallister OR @blainehiggs OR @PremierofNL OR @CCochrane_NWT OR @StephenMcNeil OR @JSavikataaq OR @fordnation OR @dennyking OR @francoislegault OR @PremierScottMoe OR @Premier_Silver OR @davideggenAB OR @Rob_Fleming OR @mingoertzen OR @DominicCardy OR @BrianWarr709 OR @RJSimpson_NWT OR @zachchurchill OR @Sflecce OR @bradtrivers OR @jfrobergeQc OR @GordWyant OR @TracyMcPheeRS) (((covid) OR (covid-19) OR (coronavirus) OR (pandemic) OR (lockdown) OR (shutdown) OR (closure) OR (closures) OR (open) OR (reopen) OR (risk) OR (safe) OR (safety) OR (safely)) ((child) OR (children) OR (toddler) OR (toddlers) OR (kid) OR (kids) OR (mom) OR (moms) OR (mother) OR (mothers) OR (dad) OR (dads) OR (father) OR (fathers) OR (parent) OR (parents))))) lang:en ((has:geo OR has:profile_geo) (place_country:CA OR profile_country:CA)) 854
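The query above ORs every handle into a single rule. To follow the province-by-province plan instead, the two dictionaries could first be grouped by province, as in this sketch. `handles_by_province` is a hypothetical helper, and the two-entry dictionaries are shortened samples of the ones defined above.

```python
def handles_by_province(*handle_dicts):
    """Merge several {province: handle} dicts into {province: [handles]}."""
    merged = {}
    for handles in handle_dicts:
        for province, handle in handles.items():
            merged.setdefault(province, []).append(handle)
    return merged


# Shortened samples; a province may appear in only one dictionary.
premiers = {"ON": "@fordnation", "NU": "@JSavikataaq"}
ministers = {"ON": "@Sflecce"}

for province, handles in handles_by_province(premiers, ministers).items():
    print(province, "(" + " OR ".join(handles) + ")")
# ON (@fordnation OR @Sflecce)
# NU (@JSavikataaq)
```

Each per-province group can then be combined with the keyword filters and a geo clause in the same way as the single combined query.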


In [None]:
tweets,f_name = return_tweets(pol_query,from_date=from_date,to_date=to_date,max_pages=300,f_name="pol_mention")
f_name,len(tweets)