# Data Mining ⛏

**Purpose:** Collect all relevant tweets pertaining to the reopening of schools during the COVID-19 pandemic between Jan. 1, 2020 and Sept. 15, 2020.

**Pipeline:**
1. Connect to Twitter's Search Tweets API, to the `full archive` endpoint
2. Go province by province<sup>1</sup> and:
    1. Collect all tweets that mention an education minister
    2. Collect all tweets that contain a dedicated list of keywords/hashtags
3. Store the collected tweets in a pandas DataFrame, keeping only the relevant features (date, geocode, text, author, *etc.*)
4. Add an extra column that is the cleaned tweet text.
5. Save dataframe to CSV
6. Solve the pandemic 🎊


<sup>1</sup> For more information on what tweets are geocoded, see [Twitter's geofiltering guide](https://developer.twitter.com/en/docs/tutorials/filtering-tweets-by-location)

In [2]:
import pandas as pd
import numpy as np
from searchtweets import collect_results, gen_rule_payload, load_credentials, ResultStream

premium_search_args = load_credentials(filename="../secrets/new_secret.yaml",yaml_key="search_tweets_api",env_overwrite=False)


Grabbing bearer token from OAUTH


## Location Filtering Rules

**Query Rules:** Each aspect of the query (mentions, keywords, hashtags, geo, etc.) is wrapped in its own brackets. Apart from geo, a tweet only needs to satisfy one of these parts, so they are all ORed together. Since the geo condition must always be satisfied, the ORed parts are put in brackets and the geo clause is appended at the end (a space between clauses acts as AND).

**IMPORTANT** This does not work with the `sandbox` API tier so we need to pony up for `premium` first.

To collect tweets from province $X$, search for tweets whose author's profile location contains $X$ **OR** that are geocoded to a place in $X$.

Note: the `geo` attribute is deprecated and is ignored accordingly. For geocoded tweets only the `place` attribute will be used.

In [3]:
# has geo AND one of these place markers
country = '((has:geo OR has:profile_geo) (place_country:CA OR profile_country:CA))'
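The rule above covers all of Canada. For the province-by-province step, a rule of the same shape could be generated per province. The following is a minimal sketch, not the notebook's own method: `province_rule` is a hypothetical helper, and it assumes the premium tier's `place:` and `profile_region:` operators accept a quoted province name.

```python
def province_rule(province):
    """Build a geo clause for one province (hypothetical helper).

    Mirrors the country rule: the tweet must carry some geo information
    AND either its tagged place or the author's profile region must match.
    """
    quoted = f'"{province}"'  # quote so multi-word names stay one phrase
    return f"((has:geo OR has:profile_geo) (place:{quoted} OR profile_region:{quoted}))"


print(province_rule("British Columbia"))
```

The resulting string can be passed as the `geo` clause wherever the `country` rule would otherwise be used.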

### Keyword Strategy
*TODO: UPDATE*

A tweet must satisfy three conditions:
1. It needs to be about the COVID-19 pandemic or reopening (covid OR covid-19 OR coronavirus OR pandemic OR lockdown OR shutdown OR closure OR open OR reopen OR risk OR safe, etc.)
2. It needs to be about children or parents (child OR children OR toddler OR kid OR mom OR mother OR dad OR father OR parent, etc.)
3. It needs to be about school or childcare (school OR preschool OR daycare OR childcare OR class OR classroom OR cohort OR online/distance/remote learning)
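The three conditions form an AND of ORs: a tweet must hit at least one word in every group. A small offline check like the one below can help sanity-test a word list against example tweets before spending API requests. It is only a sketch: the word lists are abbreviated and illustrative, `mentions_any` and `is_relevant` are hypothetical names, and whole-token matching only approximates how Twitter matches keywords.

```python
import re

# Abbreviated, illustrative word lists (not the full filters)
COVID = ["covid", "coronavirus", "pandemic", "lockdown"]
CHILD = ["child", "children", "kid", "toddler", "parent"]
SCHOOL = ["school", "schools", "classroom", "daycare"]


def mentions_any(text, words):
    """True if any word appears as a whole token in the text (case-insensitive)."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return any(word in tokens for word in words)


def is_relevant(text):
    """A tweet is relevant only if every group has at least one hit."""
    return all(mentions_any(text, group) for group in (COVID, CHILD, SCHOOL))


print(is_relevant("Is it safe to reopen school for my kid during the pandemic?"))
print(is_relevant("Schools reopen Monday"))
```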

In [23]:
covid_filters = ["covid",
                 "covid-19",
                 "coronavirus",
                 "pandemic",
                 "lockdown",
                 "shutdown",
                 "closure",
                 "closures",
                 "open",
                 "reopen",
                 "risk",
                 "safe",
                 "safety",
                 "safely"]

covid_filters = "(("+") OR (".join(covid_filters)+"))"

school_filters = ["school",
          "schools",
          "preschools",
          "preschool",
          "daycare",
          "childcare",
          "class",
          "classroom",
          "classrooms",
          "cohort",
          "(online OR distance OR remote) learning"]

school_filters = "(("+") OR (".join(school_filters)+"))"

child_filters = ["child",
                 "children",
                 "toddler",
                 "toddlers",
                 "kid",
                 "kids",
                 "mom",
                 "moms",
                 "mother",
                 "mothers",
                 "dad",
                 "dads",
                 "father",
                 "fathers",
                 "parent",
                 "parents"]

child_filters = "(("+") OR (".join(child_filters)+"))"

keywords = "("+" ".join([covid_filters,child_filters,school_filters])+")"
keywords



'(((covid) OR (covid-19) OR (coronavirus) OR (pandemic) OR (lockdown) OR (shutdown) OR (closure) OR (closures) OR (open) OR (reopen) OR (risk) OR (safe) OR (safety) OR (safely)) ((child) OR (children) OR (toddler) OR (toddlers) OR (kid) OR (kids) OR (mom) OR (moms) OR (mother) OR (mothers) OR (dad) OR (dads) OR (father) OR (fathers) OR (parent) OR (parents)) ((school) OR (schools) OR (preschools) OR (preschool) OR (daycare) OR (childcare) OR (class) OR (classroom) OR (classrooms) OR (cohort) OR ((online OR distance OR remote) learning)))'
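The three filter groups above are all built with the same join pattern. As a sketch, that pattern could be factored into two small helpers; `or_group` and `and_groups` are hypothetical names, not part of the notebook.

```python
def or_group(terms):
    """Wrap each term in brackets and OR them together."""
    return "((" + ") OR (".join(terms) + "))"


def and_groups(*groups):
    """Space-separate groups (a space acts as AND) inside one outer bracket."""
    return "(" + " ".join(groups) + ")"


print(and_groups(or_group(["covid", "pandemic"]), or_group(["kid", "kids"])))
# (((covid) OR (pandemic)) ((kid) OR (kids)))
```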

In [24]:
hashtags = [
    '#safeseptember',
    '#safeseptemberAB',
    '#safeseptemberBC',
    '#SafeSeptemberMB',
    '#safeseptemberNB',
    '#safeseptemberNL',
    '#safeseptemberNS',
    '#safeseptemberON',
    '#safeseptemberPEI',
    '#safeseptemberQC',
    '#safeseptemberSK',
    '#safeseptemberYT',
    '#unsafeseptember',
    '#unsafeseptemberAB',
    '#unsafeseptemberBC',
    '#unsafeseptemberMB',
    '#unsafeseptemberNS',
    '#unsafeseptemberON',
    '#unsafeseptemberQC',
]

hashtags = "("+" OR ".join(hashtags)+")"
hashtags


'(#safeseptember OR #safeseptemberAB OR #safeseptemberBC OR #SafeSeptemberMB OR #safeseptemberNB OR #safeseptemberNL OR #safeseptemberNS OR #safeseptemberON OR #safeseptemberPEI OR #safeseptemberQC OR #safeseptemberSK OR #safeseptemberYT OR #unsafeseptember OR #unsafeseptemberAB OR #unsafeseptemberBC OR #unsafeseptemberMB OR #unsafeseptemberNS OR #unsafeseptemberON OR #unsafeseptemberQC)'

#### Sample Tweets

From: 
* March: 8, 20
* April: 8, 20
* May: 8, 20
* June: 8, 20
* July: 8, 20
* August: 8, 20
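
The sample days listed above can be turned into request windows programmatically. The sketch below assumes one-day windows, matching the `from_date`/`to_date` pairs used later in this notebook; `sample_windows` is a hypothetical helper.

```python
from datetime import date, timedelta


def sample_windows(year=2020, months=range(3, 9), days=(8, 20)):
    """Return (from_date, to_date) ISO strings, one one-day window per sample day."""
    windows = []
    for month in months:
        for day in days:
            start = date(year, month, day)
            end = start + timedelta(days=1)  # to_date is the following day
            windows.append((start.isoformat(), end.isoformat()))
    return windows


for from_date, to_date in sample_windows():
    print(from_date, to_date)
```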
    

In [25]:
import os
import json


def create_query(filters,geo="",lang="en"):
    """
        Takes in a list of fully formed filters, any one of which may be satisfied
        (they are ORed together), then appends the language and geo clauses,
        which must always be satisfied.
    """
    lang = f"lang:{lang}"
    filter_str = " OR ".join(filters)
    query = f"({filter_str}) {lang} {geo}"
    return query.strip()


def return_tweets(query,from_date,to_date,f_name=None):
    """
        Returns (tweets, name) for the given date window. Results are cached as JSON
        under ../data/raw_data/, so a repeated call does not hit the API again.
    """
    name = f"{from_date}_{to_date}" if not f_name else f"{f_name}-{from_date}_{to_date}"
    fp = "../data/raw_data/{}.json".format(name)
    # Reuse the cached response if this window was already fetched
    if os.path.isfile(fp):
        with open(fp) as fin:
            return json.load(fin),name
    print("Making request")
    rule = gen_rule_payload(query,
                        from_date=from_date, # dates are interpreted as UTC
                        to_date=to_date,
                        results_per_call=500)
    # max_pages=1 limits each window to a single page of results
    rs = ResultStream(rule_payload=rule,
                  max_pages=1,
#                   max_results=10**10,
                  **premium_search_args)
    tweets = list(rs.stream())
    with open(fp, 'w') as fout:
        json.dump(tweets,fout,indent=4)
    return tweets,name

In [26]:
query = create_query([keywords,hashtags],country)
from_date = "2020-04-20"
to_date = "2020-04-21"
print(query,len(query))


((((covid) OR (covid-19) OR (coronavirus) OR (pandemic) OR (lockdown) OR (shutdown) OR (closure) OR (closures) OR (open) OR (reopen) OR (risk) OR (safe) OR (safety) OR (safely)) ((child) OR (children) OR (toddler) OR (toddlers) OR (kid) OR (kids) OR (mom) OR (moms) OR (mother) OR (mothers) OR (dad) OR (dads) OR (father) OR (fathers) OR (parent) OR (parents)) ((school) OR (schools) OR (preschools) OR (preschool) OR (daycare) OR (childcare) OR (class) OR (classroom) OR (classrooms) OR (cohort) OR ((online OR distance OR remote) learning))) OR (#safeseptember OR #safeseptemberAB OR #safeseptemberBC OR #SafeSeptemberMB OR #safeseptemberNB OR #safeseptemberNL OR #safeseptemberNS OR #safeseptemberON OR #safeseptemberPEI OR #safeseptemberQC OR #safeseptemberSK OR #safeseptemberYT OR #unsafeseptember OR #unsafeseptemberAB OR #unsafeseptemberBC OR #unsafeseptemberMB OR #unsafeseptemberNS OR #unsafeseptemberON OR #unsafeseptemberQC)) lang:en ((has:geo OR has:profile_geo) (place_country:CA OR pro
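
The cell above prints the query length alongside the query, which matters because the API caps how long a rule may be. A guard along these lines could fail early instead of sending an oversized rule. This is a sketch under an assumption: the 1024-character limit is what I understand the premium tier to allow and should be checked against the current documentation; `check_query_length` is a hypothetical helper.

```python
PREMIUM_MAX_QUERY_CHARS = 1024  # assumption: premium-tier rule length limit


def check_query_length(query, limit=PREMIUM_MAX_QUERY_CHARS):
    """Return the query unchanged, or raise if it exceeds the assumed limit."""
    if len(query) > limit:
        raise ValueError(
            f"query is {len(query)} characters, over the assumed limit of {limit}"
        )
    return query


print(len(check_query_length("(covid OR pandemic) lang:en")))
```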

In [43]:
tweets,f_name = return_tweets(query,from_date=from_date,to_date=to_date,f_name="new_search")
f_name


'new_search-2020-04-20_2020-04-21'

## Process Tweets

Feature constructing, tweet cleaning, etc...

In [44]:
import re
from utils import PROVINCES
from unidecode import unidecode
from text_cleaning import clean_text  # used by clean_tweet below

decode = lambda x : unidecode(x) if type(x) is str else x

def clean_tweet(text,extended_tweet,retweeted_status=None):
    if retweeted_status and type(retweeted_status) is dict:
        retweeted_status = dict(retweeted_status)
        cleaned = clean_tweet(retweeted_status.get("text"),retweeted_status.get("extended_tweet"))[:-1]
        return (*cleaned,True)
    if pd.isna(extended_tweet):
        return clean_text(text), text, False
    to_dict = dict(extended_tweet)
    return clean_text(to_dict["full_text"]),to_dict["full_text"], False

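rex = re.compile(r'<a.*?>(.*?)</a>',re.S|re.M)
def clean_source(source):
    # The source field is an HTML anchor; keep only its link text.
    match = rex.match(source)
    # Fall back on the raw value if it is not wrapped in an anchor tag
    return match.group(1).strip() if match else source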

clean_user = lambda x : x["screen_name"] if x["screen_name"] else None

def clean_entities(entities):
    hashtags = [h["text"] for h in entities["hashtags"]] if entities["hashtags"] else np.nan
    urls = [h["expanded_url"] for h in entities["urls"]] if entities["urls"] else np.nan
    mentions = [h["screen_name"] for h in entities["user_mentions"]] if entities["user_mentions"] else np.nan
    return hashtags,urls,mentions

in_province = lambda prov : prov in PROVINCES

def check_user(user):
    # Fall back on the location Twitter derived from the user's profile
    user = dict(user)
    locations = user.get("derived",{}).get("locations")
    if locations:
        loc = locations[0]
        # Guard against a derived location that has no coordinates
        long_lat = (loc.get("geo") or {}).get("coordinates") or [np.nan,np.nan]
        city = loc.get("locality",np.nan)
        prov = loc.get("region",np.nan)
        city, prov = decode(city), decode(prov)
        loc_tup = (city, prov,*long_lat)
        return loc_tup
    return (np.nan,np.nan,np.nan,np.nan)
    
def clean_location(place,user):
    if place:
        place = dict(place)
        long_lat = place["bounding_box"]["coordinates"][0][0]
        split = [decode(l.strip()) for l in place["full_name"].split(",")]
        user_loc = check_user(user)
        if len(split) == 2:
            return tuple(split+long_lat) if in_province(split[-1]) else user_loc
        ## AFAIK the only time there's more than 1 comma in a place field is when the place is labelled 'unorganized'
        elif len(split) > 2:
            # If the tweet location object is having problems and we can derive a user location, do so.
            if not user_loc.count(np.nan) or not in_province(split[-1]):
                return user_loc
            return (np.nan,split[-1],*long_lat)
        else:
            # If the tweet location object is having problems and we can derive a user location, do so.
            if not user_loc.count(np.nan):
                return user_loc
            return (split[0],np.nan,*long_lat)
    else:
        return check_user(user)
        

In [45]:
from text_cleaning import clean_text
clean_fp = "../data/processed_data/{}.csv".format(f_name)
cov_tweets = pd.DataFrame(tweets)
cov_tweets = cov_tweets[['id','user','created_at', 'source', 'text','extended_tweet','retweeted_status','place','entities','favorite_count', 'retweet_count']].set_index("id")
# Get twitter handle from user
cov_tweets["screen_name"] = cov_tweets["user"].apply(clean_user)
# clean the tweet text
cov_tweets[["text","extended_tweet","is_retweet"]] = cov_tweets[["text","extended_tweet","retweeted_status"]].apply(lambda x: clean_tweet(*x),axis=1,result_type="expand")
cov_tweets = cov_tweets.rename({"text": "clean_text","extended_tweet":"original_text"},axis=1)
# Get the city/province from the location data
cov_tweets[["city","province","longitude","latitude"]] = cov_tweets[["place","user"]].apply(lambda x : clean_location(*x),axis=1,result_type="expand")
cov_tweets = cov_tweets.drop(["place","user"],axis=1)
# Through what medium did they post the tweet?
cov_tweets["source"] = cov_tweets["source"].apply(clean_source)
# Extract tweet entities (hashtags, linked urls, etc...)
cov_tweets[["hashtags","urls","mentions"]] = cov_tweets[["entities"]].apply(lambda x : clean_entities(x["entities"]),result_type="expand",axis=1)
cov_tweets = cov_tweets.drop("entities",axis=1)
cov_tweets = cov_tweets[["created_at","screen_name","source","clean_text","original_text","is_retweet","favorite_count","retweet_count","hashtags","urls","mentions","city","province","longitude","latitude"]]
cov_tweets.to_csv(clean_fp)
cov_tweets.head()

| id | created_at | screen_name | source | clean_text | original_text | is_retweet | favorite_count | retweet_count | hashtags | urls | mentions | city | province | longitude | latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1252384749127004161 | Mon Apr 20 23:52:47 +0000 2020 | DebbieC50234788 | Twitter for Android | [fear, life, school, parent, worry, whether, c... | No kid should fear for their life at school &a... | True | 0 | 0 | | | [JoeBiden] | Colwood | British Columbia | -123.48591 | 48.43293 |
| 1252384493907832834 | Mon Apr 20 23:51:46 +0000 2020 | MrsSmithOttawa | Twitter for Android | [qualified, teacher, teach, child, grade, leve... | I am a qualified teacher. I have taught my own... | True | 0 | 0 | | | [ESL_fairy] | Ottawa | Ontario | -75.69812 | 45.41117 |
| 1252384195487232006 | Mon Apr 20 23:50:35 +0000 2020 | parentaction4ed | Twitter for iPhone | [dear, kid, hero, fight, lose, everything, sch... | Dear Kids\n\nYou are all the heroes of this fi... | True | 0 | 0 | | | [GaritoMaria] | Hamilton | Ontario | -79.84963 | 43.25011 |
| 1252383526973870080 | Mon Apr 20 23:47:55 +0000 2020 | Chadicg | Twitter for Android | [marcus, rashford, raise, PS20, million, help,... | Marcus Rashford has now raised £20 million to ... | True | 0 | 0 | | | [TheManUtdWay] | Toronto | Ontario | -79.4163 | 43.70011 |
| 1252383000039264257 | Mon Apr 20 23:45:50 +0000 2020 | Grace39029880 | Twitter for iPhone | [school, daycare, close, care, eye, identify, ... | With schools+daycares closed, caring eyes that... | True | 0 | 0 | [childabuse, neglect, Children] | | [AlexMunter] | Hamilton | Ontario | -79.84963 | 43.25011 |
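
Because `to_csv` writes list-valued columns (such as `clean_text`, `hashtags` and `mentions`) as their string representation, they come back as plain strings when the file is read again. The sketch below shows one possible way to restore them and to parse the `created_at` timestamps. `parse_list_cell` and `parse_created_at` are hypothetical helpers, and the commented `read_csv` call assumes that empty cells reach a converter as empty strings.

```python
import ast
from datetime import datetime


def parse_list_cell(cell):
    """Turn a CSV cell like "['a', 'b']" back into a list; empty cells become []."""
    return ast.literal_eval(cell) if cell else []


def parse_created_at(stamp):
    """Parse Twitter's created_at format, e.g. 'Mon Apr 20 23:52:47 +0000 2020'."""
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %z %Y")


print(parse_list_cell("['childabuse', 'neglect', 'Children']"))
print(parse_created_at("Mon Apr 20 23:52:47 +0000 2020").isoformat())

# Possible use when loading the saved file again:
# list_cols = ["clean_text", "hashtags", "urls", "mentions"]
# df = pd.read_csv(clean_fp, index_col="id",
#                  converters={col: parse_list_cell for col in list_cols})
```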


## Scraping Politician Mentions
A tweet must mention (@) a politician (premier or education minister) and be pertinent to COVID-19 AND school reopenings.

In [43]:
edu_minister_dict = {
    "AB": "@davideggenAB",
    "BC": "@Rob_Fleming",
    "MB": "@mingoertzen",
    "NB": "@DominicCardy",
    "NL": "@BrianWarr709",
    "NT": "@RJSimpson_NWT",
    "NS": "@zachchurchill",
    "ON": "@Sflecce",
    "PEI": "@bradtrivers",
    "QC": "@jfrobergeQc",
    "SK": "@GordWyant",
    "YT": "@TracyMcPheeRS"
}

premier_dict = {
    "AB": "@jkenney",
    "BC": "@jjhorgan",
    "MB": "@BrianPallister",
    "NB": "@blainehiggs",
    "NL": "@PremierofNL",
    "NT": "@CCochrane_NWT",
    "NS": "@StephenMcNeil",
    "NU": "@JSavikataaq",
    "ON": "@fordnation",
    "PEI": "@dennyking",
    "QC": "@francoislegault",
    "SK": "@PremierScottMoe",
    "YT": "@Premier_Silver"
}

politicians = " OR ".join(list(premier_dict.values())+list(edu_minister_dict.values()))

politicians = f"(({politicians}) ({covid_filters} {child_filters}))"
query = create_query([politicians],country)
from_date = "2020-08-17"
to_date = "2020-08-18"
print(query,len(query))


(((@jkenney OR @jjhorgan OR @BrianPallister OR @blainehiggs OR @PremierofNL OR @CCochrane_NWT OR @StephenMcNeil OR @JSavikataaq OR @fordnation OR @dennyking OR @francoislegault OR @PremierScottMoe OR @Premier_Silver OR @davideggenAB OR @Rob_Fleming OR @mingoertzen OR @DominicCardy OR @BrianWarr709 OR @RJSimpson_NWT OR @zachchurchill OR @Sflecce OR @bradtrivers OR @jfrobergeQc OR @GordWyant OR @TracyMcPheeRS) (((covid) OR (covid-19) OR (coronavirus) OR (pandemic) OR (lockdown) OR (shutdown) OR (closure) OR (closures) OR (open) OR (reopen) OR (risk) OR (safe) OR (safety) OR (safely)) ((child) OR (children) OR (toddler) OR (toddlers) OR (kid) OR (kids) OR (mom) OR (moms) OR (dad) OR (dads) OR (parent) OR (parents))))) lang:en ((has:geo OR has:profile_geo) (place_country:CA OR profile_country:CA)) 804
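
The query above ORs every handle into a single rule. For the province-by-province step, the two dictionaries could instead be merged so that each province gets its own mention clause. This is a sketch only: `handles_by_province` and `politician_clause` are hypothetical helpers, and the dictionaries here are a small subset of the handles listed above so the block stays self-contained.

```python
def handles_by_province(*handle_dicts):
    """Merge {province: handle} dicts into {province: [handles, ...]}."""
    merged = {}
    for handles in handle_dicts:
        for prov, handle in handles.items():
            merged.setdefault(prov, []).append(handle)
    return merged


def politician_clause(handles):
    """OR a province's handles together inside one bracket."""
    return "(" + " OR ".join(handles) + ")"


# Subset of the handles above, for illustration
premiers = {"ON": "@fordnation", "NU": "@JSavikataaq"}
ministers = {"ON": "@Sflecce"}

by_prov = handles_by_province(premiers, ministers)
for prov, handles in by_prov.items():
    print(prov, politician_clause(handles))
```

A province that appears in only one dictionary (such as NU here) still gets a clause, containing a single handle.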
