# Data Mining ⛏

**Purpose:** Collect all relevant tweets pertaining to the reopening of schools during the COVID-19 pandemic between Jan. 1, 2020 and Sept. 15, 2020.

**Pipeline:**
1. Connect to Twitter's Search Tweets API, to the `full archive` endpoint
2. Go province by province<sup>1</sup> and:
    1. Collect all tweets that mention that province's education minister
    2. Collect all tweets that contain a dedicated list of keywords/hashtags
3. Store the collected tweets in a Pandas dataframe and keep only the relevant features (date, geocode, text, author, *etc.*)
4. Add an extra column that is the cleaned tweet text.
5. Save dataframe to CSV
6. Solve the pandemic 🎊


<sup>1</sup> For more information on what tweets are geocoded, see [Twitter's geofiltering guide](https://developer.twitter.com/en/docs/tutorials/filtering-tweets-by-location)
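
Step 4 of the pipeline (the cleaned-text column) is not spelled out in this notebook, so here is a minimal sketch of what it could look like. The cleaning rules are assumptions chosen for illustration: drop the retweet prefix, URLs and @-mentions, keep hashtag words, collapse whitespace and lowercase. `clean_tweet_text` is a hypothetical helper, not part of `searchtweets`.

```python
import re

# Illustrative cleaning rules (assumptions, not a fixed specification).
RT_PREFIX_RE = re.compile(r"^RT\s+@\w+:\s*")  # leading "RT @user: "
URL_RE = re.compile(r"https?://\S+")          # http(s) links
MENTION_RE = re.compile(r"@\w+")              # @-mentions
WHITESPACE_RE = re.compile(r"\s+")            # runs of whitespace


def clean_tweet_text(text: str) -> str:
    """Return a normalised copy of a tweet's text (hypothetical helper)."""
    text = RT_PREFIX_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = text.replace("#", "")  # keep the hashtag word, drop the symbol
    return WHITESPACE_RE.sub(" ", text).strip().lower()


# Made-up example tweet, only to exercise the helper.
example = "RT @someone: Schools reopen in #Ontario this fall!  https://t.co/abc123"
cleaned = clean_tweet_text(example)
print(cleaned)
```

With a dataframe that has a `text` column, the extra column would then be something like `df["clean_text"] = df["text"].map(clean_tweet_text)`.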

In [23]:
import pandas as pd
import numpy as np
from searchtweets import collect_results, gen_rule_payload, load_credentials, ResultStream

premium_search_args = load_credentials(filename="../secrets/secret.yaml",
                                       yaml_key="search_tweets_api",
                                       env_overwrite=False)

## Location Filtering Rules

**IMPORTANT:** These location operators do not work with the `sandbox` API tier, so we need to pony up for `premium` first.

To collect tweets from province $X$, search for tweets where the account profile's location contains $X$ **OR** the tweet is geocoded to a place in $X$.

Note: the `geo` attribute is deprecated, so it is ignored. For geocoded tweets, only the `place` attribute is used.

In [30]:
# Need to validate that these work
places = {
    "AB": 'place_contains:", AB" OR place_contains:"Alberta" OR (profile_region:alberta) OR (bio_location:alberta OR bio_location:",AB")',
    "BC": 'place_contains:", BC" OR place_contains:"British Columbia" OR (profile_region:"british columbia") OR (bio_location:"british columbia" OR bio_location:",BC")',
    "MB": 'place_contains:", MB" OR place_contains:"Manitoba" OR (profile_region:manitoba) OR (bio_location:manitoba OR bio_location:",MB")',
    "NB": 'place_contains:", NB" OR place_contains:"New Brunswick" OR (profile_region:"new brunswick") OR (bio_location:"new brunswick" OR bio_location:",NB")',
    "NL": 'place_contains:", NL" OR place_contains:"Newfoundland and Labrador" OR (profile_region:"newfoundland and labrador") OR (bio_location:"newfoundland and labrador" OR bio_location:",NL")',
    "NT": 'place_contains:", NT" OR place_contains:"Northwest Territories" OR (profile_region:"northwest territories") OR (bio_location:"northwest territories" OR bio_location:",NT")',
    "NS": 'place_contains:", NS" OR place_contains:"Nova Scotia" OR (profile_region:"nova scotia") OR (bio_location:"nova scotia" OR bio_location:",NS")',
    "NU": 'place_contains:", NU" OR place_contains:"Nunavut" OR (profile_region:nunavut) OR (bio_location:nunavut OR bio_location:",NU")',
    "ON": 'place_contains:", ON" OR place_contains:"Ontario" OR (profile_region:ontario) OR (bio_location:ontario OR bio_location:",ON")',
    "PEI": 'place_contains:", PEI" OR place_contains:"Prince Edward Island" OR (profile_region:"prince edward island") OR (bio_location:"prince edward island" OR bio_location:",PEI")',
    "QC": 'place_contains:", QC" OR place_contains:"Quebec" OR (profile_region:qu\u00e9bec) OR (bio_location:qu\u00e9bec OR bio_location:",QC")',
    "SK": 'place_contains:", SK" OR place_contains:"Saskatchewan" OR (profile_region:saskatchewan) OR (bio_location:saskatchewan OR bio_location:",SK")',
    "YT": 'place_contains:", YT" OR place_contains:"Yukon" OR (profile_region:yukon) OR (bio_location:yukon OR bio_location:",YT")',
}
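
The thirteen rules above all follow one pattern, so they could be generated rather than typed by hand. The sketch below mirrors that pattern exactly, which means it inherits the same "need to validate" caveat. `build_place_clause` and `build_query` are hypothetical helpers and the keywords are made-up placeholders. The parentheses in `build_query` are there because, to my understanding, the search syntax treats a space as AND and applies it before OR, so an unparenthesised keyword would only bind to the first location term.

```python
def build_place_clause(name: str, abbrev: str) -> str:
    """Build the location clause for one province (hypothetical helper).

    Reproduces the hand-written pattern used in `places`; the operators
    themselves are unvalidated, as noted in that cell.
    """
    lowered = name.lower()
    # Multi-word values must be quoted to be treated as a single phrase.
    region = f'"{lowered}"' if " " in lowered else lowered
    return (
        f'place_contains:", {abbrev}" OR place_contains:"{name}" '
        f"OR (profile_region:{region}) "
        f'OR (bio_location:{region} OR bio_location:",{abbrev}")'
    )


def build_query(keywords: list, place_clause: str) -> str:
    """AND a group of OR-ed keywords with a location clause."""
    keyword_clause = " OR ".join(keywords)
    return f"({keyword_clause}) ({place_clause})"


alberta = build_place_clause("Alberta", "AB")
# Placeholder keywords; the real keyword/hashtag list is not defined here.
query = build_query(["#schoolreopening", '"back to school"'], alberta)
print(query)
```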

In [29]:
# Ontario test query: one hashtag plus a single place filter
rule = gen_rule_payload('#COVID-19 place_contains:", ON"',
                        from_date="2019-10-21",  # UTC 2019-10-21 00:00
                        to_date="2020-07-15",  # UTC 2020-07-15 00:00
                        results_per_call=100)
rs = ResultStream(rule_payload=rule,
                  max_pages=1,
                  max_results=10**10,
                  **premium_search_args)
cov_tweets = pd.DataFrame(rs.stream())
cov_tweets.head()

HTTP Error code: 422: {"error":{"message":"There were errors processing your request: Reference to invalid field 'place_contains' (at position 11), Reference to invalid operator 'place_contains'. Operator is not available in current product or product packaging. Please refer to complete available operator list at http://t.co/operators. (at position 11)","sent":"2020-07-16T15:20:44+00:00","transactionId":"00e3407c00f8706e"}}
Request payload: {'query': '#COVID-19 place_contains:", ON"', 'maxResults': 100, 'toDate': '202007150000', 'fromDate': '201910210000'}


HTTPError: 
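
The 422 above only surfaces after a request has been sent. A cheap local check before calling the API could flag rules that use an operator the current tier rejects. This is a sketch under one assumption: the only operator known to be rejected is `place_contains`, taken from the error message above. `UNAVAILABLE_OPERATORS` and `find_unavailable_operators` are hypothetical names, and the set would need extending as other operators turn out to be unavailable.

```python
import re

# Assumption: populated only from the 422 error seen in this notebook.
UNAVAILABLE_OPERATORS = {"place_contains"}

QUOTED_RE = re.compile(r'"[^"]*"')        # quoted phrases, ignored when scanning
OPERATOR_RE = re.compile(r"\b([a-z_]+):")  # tokens of the form `operator:`


def find_unavailable_operators(rule, unavailable=UNAVAILABLE_OPERATORS):
    """Return the sorted list of rejected operators used in `rule`."""
    unquoted = QUOTED_RE.sub('""', rule)  # blank out phrase contents
    used = set(OPERATOR_RE.findall(unquoted))
    return sorted(used & set(unavailable))


rule_text = '#COVID-19 place_contains:", ON"'
problems = find_unavailable_operators(rule_text)
if problems:
    print("Rule uses operators not available on this tier:", problems)
```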

In [17]:
rule = gen_rule_payload("from:CJKRaymond",
                        from_date="2019-10-21",  # UTC 2019-10-21 00:00
                        to_date="2020-07-15",  # UTC 2020-07-15 00:00
                        results_per_call=100)
rs = ResultStream(rule_payload=rule,
                  max_pages=1,
                  max_results=10**10,
                  **premium_search_args)
print(rs.max_results)
tweets = pd.DataFrame(rs.stream())
tweets.head()

10000000000


In [24]:
tweets.columns
# tweets["geo"]

Index(['created_at', 'id', 'id_str', 'text', 'source', 'truncated',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'retweeted_status', 'is_quote_status', 'quote_count',
       'reply_count', 'retweet_count', 'favorite_count', 'entities',
       'favorited', 'retweeted', 'possibly_sensitive', 'filter_level', 'lang',
       'matching_rules', 'extended_tweet', 'display_text_range',
       'quoted_status_id', 'quoted_status_id_str', 'quoted_status',
       'quoted_status_permalink', 'extended_entities'],
      dtype='object')
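
Several of these columns (`user`, `place`, `extended_tweet`) hold nested dictionaries, so keeping "only the relevant features" means flattening each tweet first. Below is a minimal sketch of that step. The choice of output fields is an assumption, `flatten_tweet` is a hypothetical helper, and the sample tweet is a trimmed, made-up payload in the same shape as the API response.

```python
from datetime import datetime

# Format of the `created_at` field, e.g. "Tue Jul 07 01:10:06 +0000 2020".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def flatten_tweet(tweet: dict) -> dict:
    """Pick a flat set of features out of one tweet payload (hypothetical helper)."""
    user = tweet.get("user") or {}
    place = tweet.get("place") or {}          # None for non-geocoded tweets
    extended = tweet.get("extended_tweet") or {}
    return {
        "id_str": tweet.get("id_str"),
        "created_at": datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT),
        # Truncated tweets carry their full text under `extended_tweet`.
        "text": extended.get("full_text", tweet.get("text")),
        "author": user.get("screen_name"),
        "author_location": user.get("location"),
        "place_name": place.get("full_name"),
    }


# Trimmed, illustrative payload.
sample = {
    "created_at": "Tue Jul 07 01:10:06 +0000 2020",
    "id_str": "1280308075279286273",
    "text": "An example tweet",
    "user": {"screen_name": "CJKRaymond", "location": None},
    "place": None,
}
flat = flatten_tweet(sample)
print(flat)
```

From here the dataframe and CSV steps would be roughly `pd.DataFrame([flatten_tweet(t) for t in rs.stream()])` followed by `.to_csv(...)`, with the output path left to choose.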

In [22]:
t = list(rs.stream())
t

[{'created_at': 'Tue Jul 07 01:10:06 +0000 2020',
  'id': 1280308075279286273,
  'id_str': '1280308075279286273',
  'text': 'RT @katecallen: Story: Clearview AI to pull out of Canada and stop working with RCMP amid privacy investigation.  https://t.co/nF7KONZRXb',
  'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>',
  'truncated': False,
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'in_reply_to_screen_name': None,
  'user': {'id': 1171043334929997824,
   'id_str': '1171043334929997824',
   'name': 'Cameron Raymond',
   'screen_name': 'CJKRaymond',
   'location': None,
   'url': 'https://cameronraymond.me/',
   'description': 'Incoming Social Data Science MSc at the @UniofOxford and @oiioxford 🚀.  @queensu CS and Political Science alum 👨\u200d🎓',
   'translator_type': 'none',
   'protected': False,
   'verified': False,
   'followers_count': 17,
   'friends_coun