# Data Mining ⛏

**Purpose:** Collect all relevant tweets pertaining to the reopening of schools during the COVID-19 pandemic between Jan. 1, 2020 and Sept. 15, 2020.

**Pipeline:**
1. Connect to Twitter's Search Tweets API, to the `full archive` endpoint
2. Go province by province<sup>1</sup> and:
    1. Collect all tweets that mention an education minister
    2. Collect all tweets that contain a dedicated list of keywords/hashtags
3. Store the collection of tweets in a Pandas dataframe, keeping only relevant features (date, geocode, text, author, *etc.*)
4. Add an extra column that is the cleaned tweet text.
5. Save dataframe to CSV
6. Solve the pandemic 🎊


<sup>1</sup> For more information on which tweets are geocoded, see [Twitter's geofiltering guide](https://developer.twitter.com/en/docs/tutorials/filtering-tweets-by-location).

In [1]:
import pandas as pd
import numpy as np
from searchtweets import collect_results, gen_rule_payload, load_credentials, ResultStream

premium_search_args = load_credentials(filename="../secrets/new_secret.yaml",yaml_key="search_tweets_api",env_overwrite=False)


Grabbing bearer token from OAUTH


## Location Filtering Rules

**Query Rules:** Each part of the query (mentions, keywords, hashtags, geo, etc.) is wrapped in its own brackets. A tweet only needs to match one of the non-geo parts, so those are all ORed together. Geo must always be satisfied, so the ORed parts are put in brackets and the geo clause is appended at the end.

**IMPORTANT:** This does not work with the `sandbox` API tier, so we need to pony up for `premium` first.
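
As a rough illustration of the query rule described above, the sketch below ORs the non-geo parts together inside one set of brackets and appends the geo clause after a space, which the search syntax reads as AND. The `compose_query` helper and the sample clauses are hypothetical and only meant to show the shape of the final string.

```python
def compose_query(parts, geo=""):
    """OR the bracketed parts together, then AND the geo clause (hypothetical helper)."""
    # Each part gets its own brackets; only one of them has to match.
    any_part = " OR ".join(f"({p})" for p in parts)
    # A space between clauses means AND, so geo must always be satisfied.
    return f"({any_part}) {geo}".strip()


# Sample clauses are placeholders, not the real filters used in this notebook.
example = compose_query(
    ["@some_minister", "#backtoschool", '"back to school"'],
    geo="place_country:CA",
)
print(example)
```

Leaving `geo` empty yields the ORed parts on their own, which can be handy for checking a rule before adding the location constraint.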

To collect tweets from province $X$, search for tweets whose account profile location contains $X$ **OR** geocoded tweets that fall in $X$.

Note: the `geo` attribute is deprecated and is ignored accordingly. For geocoded tweets only the `place` attribute will be used.
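
A per-province rule of this kind could be generated rather than typed out by hand. The sketch below is a minimal example: the `province_rule` helper is hypothetical, and the operators it emits (`place_contains`, `profile_region`, `bio_location`) are copied from this notebook's draft rules, which have not been validated against the API.

```python
def province_rule(name, abbrev):
    """Build a location clause for one province (hypothetical helper, unvalidated operators)."""
    lowered = name.lower()
    # Multi-word values are quoted so they are matched as a phrase.
    quoted = f'"{lowered}"' if " " in lowered else lowered
    # Geocoded tweets: match on the tweet's place name.
    place = f'place_contains:", {abbrev}" OR place_contains:"{name}"'
    # Account profile: match on the derived region or the free-text bio location.
    profile = f"(profile_region:{quoted})"
    bio = f'(bio_location:{quoted} OR bio_location:",{abbrev}")'
    return f"{place} OR {profile} OR {bio}"


print(province_rule("Alberta", "AB"))
```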

In [2]:
# Need to validate that these work
# places = {
#     "AB":'place_contains:", AB" OR place_contains:"Alberta" OR (profile_region:alberta) OR (bio_location:alberta OR bio_location:",AB")',
#     "BC":'place_contains:", BC" OR place_contains:"British Columbia" OR (profile_region:"british columbia") OR (bio_location:"british columbia" OR bio_location:",BC")',
#     "MB":'place_contains:", MB" OR place_contains:"Manitoba" OR (profile_region:manitoba) OR (bio_location:manitoba OR bio_location:",MB")',
#     "NB":'place_contains:", NB" OR place_contains:"New Brunswick" OR (profile_region:"new brunswick") OR (bio_location:"new brunswick" OR bio_location:",NB")',
#     "NL":'place_contains:", NL" OR place_contains:"Newfoundland and Labrador" OR (profile_region:"newfoundland and labrador") OR (bio_location:"newfoundland and labrador" OR bio_location:",NL")',
#     "NT":'place_contains:", NT" OR place_contains:"Northwest Territories" OR (profile_region:"northwest territories") OR (bio_location:"northwest territories" OR bio_location:",NT")',
#     "NS":'place_contains:", NS" OR place_contains:"Nova Scotia" OR (profile_region:"nova scotia") OR (bio_location:"nova scotia" OR bio_location:",NS")',
#     "NU":'place_contains:", NU" OR place_contains:"Nunavut" OR (profile_region:nunavut) OR (bio_location:nunavut OR bio_location:",NU")',
#     "ON":'place_contains:", ON" OR place_contains:"Ontario" OR (profile_region:ontario) OR (bio_location:ontario OR bio_location:",ON")',
#     "PEI":'place_contains:", PEI" OR place_contains:"Prince Edward Island" OR (profile_region:"prince edward island") OR (bio_location:"prince edward island" OR bio_location:",PEI")',
#     "QC":'place_contains:", QC" OR place_contains:"Quebec" OR (profile_region:qu\u00e9be) OR (bio_location:qu\u00e9be OR bio_location:",QC")',
#     "SK":'place_contains:", SK" OR place_contains:"Saskatchewan" OR (profile_region:saskatchewan) OR (bio_location:saskatchewan OR bio_location:",SK")',
#     "YT":'place_contains:", YT" OR place_contains:"Yukon" OR (profile_region:yukon) OR (bio_location:yukon OR bio_location:",YT")'
# }

# has geo AND one of these place markers
country = '((has:geo OR has:profile_geo) (place_country:CA OR profile_country:CA))'

In [3]:
edu_minister_dict = {
    "AB": "@davideggenAB",
    "BC": "@Rob_Fleming",
    "MB": "@mingoertzen",
    "NB": "@DominicCardy",
    "NL": "@BrianWarr709",
    "NT": "@RJSimpson_NWT",
    "NS": "@zachchurchill",
    "ON": "@Sflecce",
    "PEI": "@bradtrivers",
    "QC": "@jfrobergeQc",
    "SK": "@GordWyant",
    "YT": "@TracyMcPheeRS"
}

education_ministers = [val for _,val in edu_minister_dict.items()]

education_ministers = " OR ".join(education_ministers)
education_ministers = f"({education_ministers})"
print(education_ministers)

(@davideggenAB OR @Rob_Fleming OR @mingoertzen OR @DominicCardy OR @BrianWarr709 OR @RJSimpson_NWT OR @zachchurchill OR @Sflecce OR @bradtrivers OR @jfrobergeQc OR @GordWyant OR @TracyMcPheeRS)


In [4]:
premier_dict = {
    "AB": "@jkenney",
    "BC": "@jjhorgan",
    "MB": "@BrianPallister",
    "NB": "@blainehiggs",
    "NL": "@PremierofNL",
    "NT": "@CCochrane_NWT",
    "NS": "@StephenMcNeil",
    "NU": "@JSavikataaq",
    "ON": "@fordnation",
    "PEI": "@dennyking",
    "QC": "@francoislegault",
    "SK": "@PremierScottMoe",
    "YT": "@Premier_Silver"
}

premiers = " OR ".join([val for _,val in premier_dict.items()])
premiers = f"(({premiers}) ((covid OR covid-19 OR coronavirus) (school OR childcare OR child)))"
print(premiers)

((@jkenney OR @jjhorgan OR @BrianPallister OR @blainehiggs OR @PremierofNL OR @CCochrane_NWT OR @StephenMcNeil OR @JSavikataaq OR @fordnation OR @dennyking OR @francoislegault OR @PremierScottMoe OR @Premier_Silver) ((covid OR covid-19 OR coronavirus) (school OR childcare OR child)))


In [6]:
keywords = [
    '"back to school"',
    '(school OR classroom OR child OR children) (covid OR covid-19 OR coronavirus OR risk OR open OR reopen OR safe OR safety OR safely)',
]
keywords = "("+") OR (".join(keywords)+")"
keywords = f"({keywords})"
keywords

'(("back to school") OR ((school OR classroom OR child OR children) (covid OR covid-19 OR coronavirus OR risk OR open OR reopen OR safe OR safety OR safely)))'

In [7]:
hashtags = [
    '#safeseptember',
    '#unsafeseptember',
    '#backtoschool',
    '#BackToSchool2020'
]
hashtags = "("+" OR ".join(hashtags)+")"
hashtags

'(#safeseptember OR #unsafeseptember OR #backtoschool OR #BackToSchool2020)'

#### Sample Tweets

From: 
* March: 9, 16, 23
* April: 6, 13, 20
* May: 4, 11, 18
* June: 1, 8, 29
* July: 6, 13, 20
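
One possible way to turn the sample days above into search windows is sketched below. It assumes every date falls in 2020 and that each sample covers a single day, mirroring the `from_date`/`to_date` pair used later in this notebook; `SAMPLE_DAYS` and `sample_windows` are hypothetical names.

```python
from datetime import date, timedelta

# Sample days listed above, keyed by month number (assumed to be in 2020).
SAMPLE_DAYS = {
    3: [9, 16, 23],
    4: [6, 13, 20],
    5: [4, 11, 18],
    6: [1, 8, 29],
    7: [6, 13, 20],
}


def sample_windows(year=2020, days=SAMPLE_DAYS):
    """Yield (from_date, to_date) strings, each covering one sample day."""
    for month, month_days in days.items():
        for day in month_days:
            start = date(year, month, day)
            end = start + timedelta(days=1)
            yield start.isoformat(), end.isoformat()


windows = list(sample_windows())
print(len(windows), windows[0], windows[-1])
```

Each pair is a pair of `YYYY-MM-DD` strings, the same format as the `from_date` and `to_date` values used in the search cell.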
    

In [50]:
import os
import json


def create_query(filters,geo="",lang="en"):
    """
        Takes in a list of fully formed filters (mentions, keywords, hashtags, ...), an optional geo filter
        and a language code, and forms a query for Twitter's historical search API
    """
    lang = f"lang:{lang}"
    filter_str = " OR ".join(filters)
    query = f"({filter_str}) {lang} {geo}"
    return query.strip()


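def return_tweets(query,from_date,to_date):
    """
        Returns the tweets matching the query between from_date and to_date, along with the path of the
        JSON file they are cached in
    """
    fp = "../data/raw_data/{}_{}.json".format(from_date,to_date)
    # Reuse results already saved to disk instead of calling the API again
    if os.path.isfile(fp):
        with open(fp) as fin:
            return json.load(fin),fp
    rule = gen_rule_payload(query,
                            from_date=from_date, # dates are in UTC, e.g. 2020-08-08 00:00
                            to_date=to_date,
                            results_per_call=500)
    rs = ResultStream(rule_payload=rule,
                      max_pages=1,
                      # max_results=10**10,
                      **premium_search_args)
    tweets = list(rs.stream())
    # Cache the raw results for later runs
    with open(fp, 'w') as fout:
        json.dump(tweets,fout,indent=4)
    return tweets,fp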
    

query = create_query([keywords,premiers,hashtags,education_ministers],country)
from_date = "2020-08-08"
to_date = "2020-08-09"
tweets,fp = return_tweets(query,from_date=from_date,to_date=to_date)
print(fp)
tweets

../data/raw_data/2020-08-08_2020-08-09.json


[{'created_at': 'Sat Aug 08 23:59:56 +0000 2020',
  'id': 1292249214966206464,
  'id_str': '1292249214966206464',
  'text': 'RT @freedomgirl2011: They are losing control of the narrative on manufactured dangerous pandemic.\nNews 2 frighten parents..someday we shall…',
  'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>',
  'truncated': False,
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'in_reply_to_screen_name': None,
  'user': {'id': 753278309597847552,
   'id_str': '753278309597847552',
   'name': 'BEEHEMOTH ⏳',
   'screen_name': 'surveyorX',
   'location': '🇨🇦 Canada',
   'url': None,
   'description': '⚒️ The Architects of Free Trade ⚒️ Really Did Want a World Govt of Corporations http://bit.ly/2m5zJiR #NonPartisanDemocracy\nGAB/Parler: Bee@Beehemoth',
   'translator_type': 'none',
   'derived': {'locations': [{'country': 'Canada',
      'country_code': 'CA

## Process Tweets

Feature construction, tweet cleaning, etc.

In [47]:
import re
from collections import Counter

def clean_tweet(text,extended_tweet):
    if pd.isna(extended_tweet):
        return pd.Series([clean_text(text), text])
    to_dict = dict(extended_tweet)
    return pd.Series([clean_text(to_dict["full_text"]),to_dict["full_text"]])

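rex = re.compile(r'<a.*?>(.*?)</a>',re.S|re.M)
def clean_source(source):
    # Pull the client name out of the anchor tag, e.g. "Twitter Web App"
    match = rex.match(source)
    # Fall back to the raw string if the source isn't wrapped in an anchor tag
    return match.group(1).strip() if match else source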

clean_user = lambda x : x["screen_name"] if x["screen_name"] else None

def clean_entities(entities):
    hashtags = [h["text"] for h in entities["hashtags"]] if entities["hashtags"] else ""
    urls = [h["expanded_url"] for h in entities["urls"]] if entities["urls"] else np.nan
    mentions = [h["screen_name"] for h in entities["user_mentions"]] if entities["user_mentions"] else np.nan
    return hashtags,urls,mentions

def check_user(user):
    user = dict(user)
    if "derived" in user and "locations" in user["derived"]:
        loc = user["derived"]["locations"][0]
        # Fall back to NaN coordinates if the derived location has no geo point
        long_lat = loc.get("geo",{}).get("coordinates",[np.nan,np.nan])
        return (loc.get("locality",np.nan),loc.get("region",np.nan),*long_lat)
    return (np.nan,np.nan,np.nan,np.nan)
    
def clean_location(place,user):
    if place:
        place = dict(place)
        long_lat = place["bounding_box"]["coordinates"][0][0]
        split = [l.strip() for l in place["full_name"].split(",")]
        if len(split) == 2:
            return tuple(split+long_lat)
        ## AFAIK the only time there's more than 1 comma in a place field is when the place is labelled 'unorganized'
        elif len(split) > 2:
            user_loc = check_user(user)
            # If the tweet location object is having problems and we can derive a user location, do so.
            if not user_loc.count(np.nan):
                return user_loc
            return (np.nan,split[-1],*long_lat)
        else:
            user_loc = check_user(user)
            # If the tweet location object is having problems and we can derive a user location, do so.
            if not user_loc.count(np.nan):
                return user_loc
            return (split[0],np.nan,*long_lat)
    else:
        return check_user(user)
        

In [51]:
from text_cleaning import clean_text
clean_fp = "../data/processed_data/{}_{}.csv".format(from_date,to_date)
cov_tweets = pd.DataFrame(tweets)
cov_tweets = cov_tweets[['id','user','created_at', 'source', 'text','extended_tweet','place','entities','favorite_count', 'retweet_count']].set_index("id")
# Get twitter handle from user
cov_tweets["screen_name"] = cov_tweets["user"].apply(clean_user)
# clean the tweet text
cov_tweets[["text","extended_tweet"]] = cov_tweets[["text","extended_tweet"]].apply(lambda x: clean_tweet(*x),axis=1)
cov_tweets = cov_tweets.rename({"text": "clean_text","extended_tweet":"original_text"},axis=1)
# Get the city/province from the location data
cov_tweets[["city","province","longitude","latitude"]] = cov_tweets[["place","user"]].apply(lambda x : clean_location(*x),axis=1,result_type="expand")
cov_tweets = cov_tweets.drop(["place","user"],axis=1)
# Through what medium did they post the tweet?
cov_tweets["source"] = cov_tweets["source"].apply(clean_source)
# Extract tweet entities (hashtags, linked urls, etc...)
cov_tweets[["hashtags","urls","mentions"]] = cov_tweets[["entities"]].apply(lambda x : clean_entities(x["entities"]),result_type="expand",axis=1)
cov_tweets = cov_tweets.drop("entities",axis=1)
cov_tweets = cov_tweets[["created_at","screen_name","source","clean_text","original_text","favorite_count","retweet_count","hashtags","urls","mentions","city","province","longitude","latitude"]]
cov_tweets.to_csv(clean_fp)
cov_tweets.head(50)

id,created_at,screen_name,source,clean_text,original_text,favorite_count,retweet_count,hashtags,urls,mentions,city,province,longitude,latitude
1292249214966206464,Sat Aug 08 23:59:56 +0000 2020,surveyorX,Twitter Web App,"[freedomgirl2011, lose, control, narrative, ma...",RT @freedomgirl2011: They are losing control o...,0,0,,,[freedomgirl2011],,,-113.64258,60.10867
1292249179167891458,Sat Aug 08 23:59:47 +0000 2020,Konstruktive_,Twitter for iPhone,"[safeseptember, miss, canada, plan, kid, back,...",RT @SafeSeptember: What's missing from Canada'...,0,0,,[https://www.cbc.ca/news/health/coronavirus-ca...,[SafeSeptember],Québec,Quebec,-71.21454,46.81228
1292249168774172672,Sat Aug 08 23:59:45 +0000 2020,siegburgerdaddy,Twitter for iPhone,"[c_tepperman, alberta, back-to-school, plan, s...",RT @c_tepperman: The Alberta back-to-school pl...,0,0,,,[c_tepperman],,Alberta,-117.469,52.28333
1292249164777050118,Sat Aug 08 23:59:44 +0000 2020,Blue_Star_mv1,Twitter for Android,"[torontostar, icymi, sick, kid, president, ron...",RT @TorontoStar: #ICYMI Sick Kids President Ro...,0,0,[ICYMI],,[TorontoStar],Mississauga,Ontario,-79.6583,43.5789
1292249109915742210,Sat Aug 08 23:59:31 +0000 2020,carabreac,Twitter for iPad,"[torontostar, sick, kid, president, ronald, co...",RT @TorontoStar: Sick Kids President Ronald Co...,0,0,,,[TorontoStar],,,-113.64258,60.10867
1292249109039063040,Sat Aug 08 23:59:31 +0000 2020,our_children,Twitter for iPhone,"[austinsaral, class, size, vital, curb, spread...",RT @austinsaral: Class sizes are vital to curb...,0,0,,,[austinsaral],Hamilton,Ontario,-79.84963,43.25011
1292249078223392768,Sat Aug 08 23:59:23 +0000 2020,FatFarmGma,Twitter for Android,"[cmoh_alberta, alberta_moms, mean, alberta, ch...",@CMOH_Alberta @alberta_moms\nWhat does this me...,0,0,,[https://www.cbsnews.com/amp/news/covid-19-kid...,"[CMOH_Alberta, alberta_moms]",,Alberta,-117.469,52.28333
1292249053212954628,Sat Aug 08 23:59:17 +0000 2020,RenataAncans,Twitter for iPhone,"[maritstiles, worth, remind, folk, low, number...",RT @maritstiles: Worth reminding folks that lo...,0,0,,,[maritstiles],Toronto,Ontario,-79.4163,43.70011
1292249048833900544,Sat Aug 08 23:59:16 +0000 2020,Col8675309,Twitter Web App,"[movingparadigms, believe, incumbent, question...","RT @MovingParadigms: "" I believe it is incumbe...",0,0,,,[MovingParadigms],,British Columbia,-125.0032,53.99983
1292249018760921093,Sat Aug 08 23:59:09 +0000 2020,alternativesart,Twitter Web App,"[torontostar, icymi, sick, kid, president, ron...",RT @TorontoStar: #ICYMI Sick Kids President Ro...,0,0,[ICYMI],,[TorontoStar],Hamilton,Ontario,-79.84963,43.25011
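
The `created_at` column above is still a raw string. For any time-based analysis it would likely need parsing; the sketch below shows one way to do that with the standard library, assuming every value follows the format seen in the output above. `TWITTER_TIME_FORMAT` and `parse_created_at` are hypothetical names, not part of this notebook.

```python
from datetime import datetime

# Format of values such as "Sat Aug 08 23:59:56 +0000 2020" (assumed for all rows).
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Convert a raw created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


ts = parse_created_at("Sat Aug 08 23:59:56 +0000 2020")
print(ts.isoformat())
```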
