# Data Mining ⛏

**Purpose:** Collect all relevant tweets pertaining to the reopening of schools during the COVID-19 pandemic between Jan. 1, 2020 and Sept. 15, 2020.

**Pipeline:**
1. Connect to the `full archive` endpoint of Twitter's Search Tweets API
2. Go province by province<sup>1</sup> and:
    1. Collect all tweets that mention an education minister
    2. Collect all tweets that contain any of a dedicated list of keywords/hashtags
3. Store the collection of tweets in a Pandas dataframe, and only keep relevant features (date, geocode, text, author, *etc.*)
4. Add an extra column that holds the cleaned tweet text
5. Save the dataframe to CSV
6. Solve the pandemic 🎊


<sup>1</sup> For more information on which tweets are geocoded, see [Twitter's geofiltering guide](https://developer.twitter.com/en/docs/tutorials/filtering-tweets-by-location).
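Steps 2 to 5 of the pipeline can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: `run_pipeline`, `fetch`, `clean`, and the `province` column are hypothetical names, and `fetch` stands in for whatever function calls the Search Tweets API.

```python
import pandas as pd

def run_pipeline(queries, fetch, clean, keep=("id", "created_at", "text")):
    """Collect tweets per province, keep selected columns, add cleaned text.

    queries: dict mapping province code -> query string (hypothetical input)
    fetch:   callable(query) -> list of tweet dicts (stand-in for the API call)
    clean:   callable(text) -> cleaned text
    """
    frames = []
    for province, query in queries.items():
        tweets = fetch(query)
        if not tweets:
            continue  # nothing returned for this province
        # Keep only the requested features; missing ones become NaN columns
        df = pd.DataFrame(tweets).reindex(columns=list(keep))
        df["province"] = province
        frames.append(df)
    if not frames:
        return pd.DataFrame(columns=[*keep, "province", "clean_text"])
    result = pd.concat(frames, ignore_index=True)
    result["clean_text"] = result["text"].map(clean)
    return result

def fake_fetch(query):
    """Stub that returns one canned tweet instead of calling the API."""
    return [{"id": 1,
             "created_at": "Tue Jul 07 01:10:06 +0000 2020",
             "text": "Back to SCHOOL",
             "lang": "en"}]

tweets = run_pipeline({"AB": "query for AB"}, fake_fetch, str.lower)
# Step 5 (placeholder path); index=False avoids writing an extra index column
# tweets.to_csv("tweets.csv", index=False)
```

Passing `fetch` in as an argument keeps the loop testable without API credentials or quota.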

In [1]:
import pandas as pd
import numpy as np
from searchtweets import collect_results, gen_rule_payload, load_credentials, ResultStream

premium_search_args = load_credentials(filename="../secrets/secret.yaml",yaml_key="search_tweets_api",env_overwrite=False)

## Location Filtering Rules

**IMPORTANT:** Location filtering does not work with the `sandbox` API tier, so we need to pony up for `premium` first.

To collect tweets from province $X$, search for tweets whose author's profile location contains $X$ **OR** geocoded tweets whose place falls within $X$.

Note: the `geo` attribute is deprecated and is ignored accordingly. For geocoded tweets only the `place` attribute will be used.

In [15]:
# Need to validate that these work
places = {
    "AB":'place_contains:", AB" OR place_contains:"Alberta" OR (profile_region:alberta) OR (bio_location:alberta OR bio_location:",AB")',
    "BC":'place_contains:", BC" OR place_contains:"British Columbia" OR (profile_region:"british columbia") OR (bio_location:"british columbia" OR bio_location:",BC")',
    "MB":'place_contains:", MB" OR place_contains:"Manitoba" OR (profile_region:manitoba) OR (bio_location:manitoba OR bio_location:",MB")',
    "NB":'place_contains:", NB" OR place_contains:"New Brunswick" OR (profile_region:"new brunswick") OR (bio_location:"new brunswick" OR bio_location:",NB")',
    "NL":'place_contains:", NL" OR place_contains:"Newfoundland and Labrador" OR (profile_region:"newfoundland and labrador") OR (bio_location:"newfoundland and labrador" OR bio_location:",NL")',
    "NT":'place_contains:", NT" OR place_contains:"Northwest Territories" OR (profile_region:"northwest territories") OR (bio_location:"northwest territories" OR bio_location:",NT")',
    "NS":'place_contains:", NS" OR place_contains:"Nova Scotia" OR (profile_region:"nova scotia") OR (bio_location:"nova scotia" OR bio_location:",NS")',
    "NU":'place_contains:", NU" OR place_contains:"Nunavut" OR (profile_region:nunavut) OR (bio_location:nunavut OR bio_location:",NU")',
    "ON":'place_contains:", ON" OR place_contains:"Ontario" OR (profile_region:ontario) OR (bio_location:ontario OR bio_location:",ON")',
    "PEI":'place_contains:", PEI" OR place_contains:"Prince Edward Island" OR (profile_region:"prince edward island") OR (bio_location:"prince edward island" OR bio_location:",PEI")',
    "QC":'place_contains:", QC" OR place_contains:"Quebec" OR (profile_region:qu\u00e9bec) OR (bio_location:qu\u00e9bec OR bio_location:",QC")',
    "SK":'place_contains:", SK" OR place_contains:"Saskatchewan" OR (profile_region:saskatchewan) OR (bio_location:saskatchewan OR bio_location:",SK")',
    "YT":'place_contains:", YT" OR place_contains:"Yukon" OR (profile_region:yukon) OR (bio_location:yukon OR bio_location:",YT")'
}
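Most of the entries above follow a single pattern, so they could be generated rather than typed by hand. The helper below is a hypothetical sketch (`build_place_rule` and `provinces` are not part of the notebook); it reproduces the hand-written rules for provinces whose name needs no special spelling. Quebec would still need manual handling because its profile operators use the accented form.

```python
def build_place_rule(abbr, name):
    """Build a location rule for one province from its abbreviation and name."""
    term = name.lower()
    if " " in term:
        # Multi-word values must be quoted to be treated as a phrase
        term = f'"{term}"'
    return (
        f'place_contains:", {abbr}" OR place_contains:"{name}" '
        f'OR (profile_region:{term}) '
        f'OR (bio_location:{term} OR bio_location:",{abbr}")'
    )

# Illustrative subset; extend with the remaining provinces and territories
provinces = {"AB": "Alberta", "BC": "British Columbia"}
generated = {abbr: build_place_rule(abbr, name) for abbr, name in provinces.items()}
print(generated["AB"])
```

Generating the rules keeps spacing and quoting consistent across provinces, which makes the pending validation easier.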

## Mentions and Keywords

In [65]:
mentions = {
    "AB": "@davideggenAB",
    "BC": "@Rob_Fleming",
    "MB": "@mingoertzen",
    "NB": "@DominicCardy",
    "NL": "@BrianWarr709",
    "NT": "@RJSimpson_NWT",
    "NS": "@zachchurchill",
    "ON": "@Sflecce",
    "PEI": "@bradtrivers",
    "QC": "@jfrobergeQc",
    "SK": "@GordWyant",
    "YT": "@TracyMcPheeRS"
}

keywords = [
    '"back to school"',
    '"child care" (contains:covid OR coronavirus)',
    '(contains:reopen contains:school)' # Will match any tweet that contains reopen and school in the same tweet (including words like [reopen]ing)
]
print("("+") OR (".join(keywords)+")")

("back to school") OR ("child care" (contains:covid OR coronavirus)) OR ((contains:reopen contains:school))


In [66]:
def create_query(mention,keywords,place=None,lang="en"):
    """
        Takes in a single mention (@someone), a list of keywords and an optional place filter (fully formed)
        and forms a query for Twitter's historical search API
    """
    keyword_str = ") OR (".join(keywords)
    # Group the OR terms so that lang: and the place filter apply to all of them
    # (a space is an implicit AND, which binds tighter than OR)
    query =  f"(({mention}) OR ({keyword_str})) lang:{lang}"
    return query if not place else f"{query} ({place})"

def return_tweets(query):
    rule = gen_rule_payload(query,
                        from_date="2019-10-21", #UTC 2019-10-21 00:00
                        to_date="2020-07-15",
                        results_per_call=100)
    rs = ResultStream(rule_payload=rule,
                  max_pages=1,
#                   max_results=10**10,
                  **premium_search_args)
    # ResultStream is consumed through its stream() generator
    return list(rs.stream())


create_query(mentions["AB"],keywords,place=places["AB"])

'((@davideggenAB) OR ("back to school") OR ("child care" (contains:covid OR coronavirus)) OR ((contains:reopen contains:school))) lang:en (place_contains:", AB" OR place_contains:"Alberta" OR (profile_region:alberta) OR (bio_location:alberta OR bio_location:",AB"))'
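Two practical points when building one query per province: `mentions` has no entry for `NU`, so `mentions["NU"]` would raise a `KeyError`, and long location rules can push a query past the API's length cap. The sketch below handles both. `build_query`, `build_all_queries`, and the demo dictionaries are hypothetical names, and the 1,024-character limit is an assumption that should be checked against the documentation for your API tier. The OR terms are wrapped in an outer pair of parentheses so that `lang:` and the place filter apply to every term.

```python
MAX_QUERY_LEN = 1024  # assumed limit; verify for your API tier

def build_query(keywords, place, mention=None, lang="en"):
    """Combine an optional mention and keywords (OR), then AND with lang and place."""
    terms = [f"({k})" for k in keywords]
    if mention:
        terms.insert(0, f"({mention})")
    return f"({' OR '.join(terms)}) lang:{lang} ({place})"

def build_all_queries(places, mentions, keywords):
    """Return one query per province, falling back to keywords only if no mention exists."""
    queries = {}
    for province, place in places.items():
        query = build_query(keywords, place, mentions.get(province))
        if len(query) > MAX_QUERY_LEN:
            raise ValueError(f"{province}: query is {len(query)} characters long")
        queries[province] = query
    return queries

# Shortened demo inputs; the real dictionaries are defined earlier in the notebook
demo_places = {"NU": 'place_contains:"Nunavut"', "YT": 'place_contains:"Yukon"'}
demo_mentions = {"YT": "@TracyMcPheeRS"}
demo_queries = build_all_queries(demo_places, demo_mentions, ['"back to school"'])
print(demo_queries["NU"])
```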

In [59]:
import json
rule = gen_rule_payload("from:cjkraymond lang:en",
                        from_date="2019-10-21", #UTC 2019-10-21 00:00
                        to_date="2020-07-15",
                        results_per_call=100)
rs = ResultStream(rule_payload=rule,
                  max_pages=1,
#                   max_results=10**10,
                  **premium_search_args)

raw_data = list(rs.stream())
with open('../data/raw_data/test2.json', 'w') as fout:
    json.dump(raw_data,fout,indent=4)

In [60]:
cov_tweets = pd.DataFrame(raw_data)
cov_tweets = cov_tweets[['id','created_at', 'source', 'text','extended_tweet','favorite_count', 'retweet_count','geo']]
cov_tweets.head()

Unnamed: 0,id,created_at,source,text,extended_tweet,favorite_count,retweet_count,geo
0,1280308075279286273,Tue Jul 07 01:10:06 +0000 2020,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @katecallen: Story: Clearview AI to pull ou...,,0,0,
1,1280190767731036160,Mon Jul 06 17:23:58 +0000 2020,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @TDataScience: Understanding Word2Vec throu...,,0,0,
2,1280173907119616000,Mon Jul 06 16:16:58 +0000 2020,"<a href=""https://mobile.twitter.com"" rel=""nofo...",New article in @TDataScience discussing #Word...,{'full_text': 'New article in @TDataScience d...,1,0,
3,1278296070855045120,Wed Jul 01 11:55:07 +0000 2020,"<a href=""https://mobile.twitter.com"" rel=""nofo...",RT @schock: This is huge. The ACM is calling f...,,0,0,
4,1278050691488059393,Tue Jun 30 19:40:04 +0000 2020,"<a href=""https://mobile.twitter.com"" rel=""nofo...",@asifrazzaq1988 Glad you liked it :),,0,0,
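The `created_at` values above are plain strings, so any date filtering needs them parsed first. The following is a small sketch of one way to do that with `pd.to_datetime`; the `sample` frame, the format constant, and the cutoff date are illustrative assumptions rather than part of the notebook.

```python
import pandas as pd

# Format of the created_at strings shown above, e.g. "Tue Jul 07 01:10:06 +0000 2020"
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

sample = pd.DataFrame({
    "created_at": [
        "Tue Jul 07 01:10:06 +0000 2020",
        "Tue Jun 30 19:40:04 +0000 2020",
    ]
})
# Parse into timezone-aware UTC timestamps
sample["created_at"] = pd.to_datetime(
    sample["created_at"], format=TWITTER_DATE_FORMAT, utc=True
)

# Example filter: keep tweets posted on or after July 1, 2020 (UTC)
cutoff = pd.Timestamp("2020-07-01", tz="UTC")
in_window = sample[sample["created_at"] >= cutoff]
```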


In [63]:
# TODO: implement text cleaning_alg
from text_cleaning import clean_text

def clean_tweet(text,extended_tweet):
    # Long tweets are truncated in `text`; the untruncated version lives in extended_tweet["full_text"]
    full_text = extended_tweet["full_text"] if isinstance(extended_tweet, dict) else text
    return pd.Series({"clean_text": clean_text(full_text), "original_text": full_text})

# Build the two new columns, then swap them in for the raw text columns
cleaned = cov_tweets.apply(lambda x: clean_tweet(x["text"],x["extended_tweet"]),axis=1)
cov_tweets = cov_tweets.drop(["text","extended_tweet"],axis=1).join(cleaned)
cov_tweets.head()

Unnamed: 0,id,created_at,source,favorite_count,retweet_count,geo,clean_text,original_text
0,1280308075279286273,Tue Jul 07 01:10:06 +0000 2020,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,0,,rt @katecallen: story: clearview ai to pull ou...,RT @katecallen: Story: Clearview AI to pull ou...
1,1280190767731036160,Mon Jul 06 17:23:58 +0000 2020,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,0,,rt @tdatascience: understanding word2vec throu...,RT @TDataScience: Understanding Word2Vec throu...
2,1280173907119616000,Mon Jul 06 16:16:58 +0000 2020,"<a href=""https://mobile.twitter.com"" rel=""nofo...",1,0,,new article in @tdatascience discussing #word...,New article in @TDataScience discussing #Word...
3,1278296070855045120,Wed Jul 01 11:55:07 +0000 2020,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,0,,rt @schock: this is huge. the acm is calling f...,RT @schock: This is huge. The ACM is calling f...
4,1278050691488059393,Tue Jun 30 19:40:04 +0000 2020,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,0,,@asifrazzaq1988 glad you liked it :),@asifrazzaq1988 Glad you liked it :)
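The `text_cleaning` module is not shown in this notebook. Judging only from the output above, `clean_text` at least lowercases the text while leaving mentions, hashtags, and punctuation intact. The function below is a hypothetical stand-in with that behaviour, not the project's implementation; collapsing repeated whitespace is an added assumption.

```python
import re

def clean_text_sketch(text):
    """Lowercase the text and collapse runs of whitespace (assumed behaviour)."""
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

print(clean_text_sketch("@asifrazzaq1988 Glad you liked it :)"))
```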
