# Historic and Live Twitter Scraper

In this notebook, we use the [Notable Fires data](../data/notable_fires_data.csv) scraped from Wikipedia in notebook [00_Wiki_Scrapes_Notable_Fires](./00_Wiki_Scrapes_Notable_Fires.ipynb). The county and date range of each notable California fire determine which Caltrans (California Department of Transportation) Twitter accounts we pull from and over which date ranges.

This notebook also uses [**Tweepy**](https://www.tweepy.org/), a Python wrapper for the Twitter API, and [GetOldTweets3 by Jefferson-Henrique](https://github.com/Jefferson-Henrique/GetOldTweets-python/tree/master/got3), a Python library for retrieving older tweets.

To run this notebook with Tweepy, the user needs a Twitter handle and at least one Twitter list whose members are the accounts to scrape. In this case, our Twitter handle is [@giffordtompkins](https://twitter.com/giffordtompkins) and our list is called [cal-trans-official](https://twitter.com/giffordtompkins/lists/cal-trans-official). Users may specify multiple lists in the `"TWITTER_LISTS"` field of the [credential file](../creds/samples/twitter-credentials-sample.json).
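
As a rough illustration of the credential file, the sketch below builds a JSON document with the keys this notebook reads and checks that none are missing. The placeholder values and the `check_credentials` helper are assumptions made for the example; they are not part of the project, and the real sample file may be laid out differently.

```python
import json

# Keys this notebook reads from the credential file.
REQUIRED_KEYS = [
    "TWITTER_HANDLE", "CONSUMER_KEY", "CONSUMER_SECRET",
    "ACCESS_TOKEN", "ACCESS_SECRET", "TWITTER_LISTS",
]

# Placeholder values only; substitute your own handle, keys, and list names.
sample_text = json.dumps({
    "TWITTER_HANDLE": "your_handle",
    "CONSUMER_KEY": "xxxx",
    "CONSUMER_SECRET": "xxxx",
    "ACCESS_TOKEN": "xxxx",
    "ACCESS_SECRET": "xxxx",
    "TWITTER_LISTS": ["cal-trans-official"],  # assumed to be a list of list names
}, indent=2)


def check_credentials(info):
    """Hypothetical helper: return the required keys missing from `info`."""
    return [key for key in REQUIRED_KEYS if key not in info]


info = json.loads(sample_text)
missing = check_credentials(info)
print(missing)
```

Running a check like this before authenticating gives a clearer message than a `KeyError` partway through the loading cell.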

## Objective
The goal of this notebook is to pull a historical body of tweets from Caltrans Twitter accounts on which to build our model.

In [2]:
import tweepy
import jsonpickle
import json
import datetime
import GetOldTweets3 as got
import time
import regex as re
import pandas as pd

## Load Tweepy Credentials
You can acquire a Twitter API Credential to be used with Tweepy [here](https://developer.twitter.com/en/docs/basics/getting-started).

*-- The following code is adapted from [Temple Moore](https://github.com/templecm4y/GA-Twitter-Road-Closures-Client-Project/blob/master/code/1_Tweepy-Historical-Tweets-Temple.ipynb).*

In [69]:
# load Twitter API credentials
with open('../creds/samples/twitter-credentials.json') as cred_data:
    info = json.load(cred_data)
    handle = info['TWITTER_HANDLE']
    consumer_key = info['CONSUMER_KEY']
    consumer_secret = info['CONSUMER_SECRET']
    access_token = info['ACCESS_TOKEN']
    access_secret = info['ACCESS_SECRET']
    twitter_lists = info['TWITTER_LISTS']

In [28]:
# Authenticate Tweepy connection with Twitter
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)


api = tweepy.API(auth)
user = api.get_user(handle)

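try:
    api.verify_credentials()
    print("Authentication OK")
except Exception as err:
    # Catch Exception instead of using a bare except so that an interrupt still stops the cell.
    print("Error during authentication:", err)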

Authentication OK


In [29]:
# Confirm our connection and our access bandwidth.
user = api.me()
username = user.screen_name
print(user.name)
print(username)
print(api.rate_limit_status()['resources']['search'])

Gifford Tompkins III
giffordtompkins
{'/search/tweets': {'limit': 180, 'remaining': 180, 'reset': 1573666005}}
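
The `reset` value in the rate-limit output is a Unix timestamp in seconds. A small sketch of how that entry could be read follows; `describe_limit` is a hypothetical helper written for this example and is not part of Tweepy.

```python
from datetime import datetime, timezone

# The search entry printed above.
search_limits = {'/search/tweets': {'limit': 180, 'remaining': 180, 'reset': 1573666005}}


def describe_limit(limits, endpoint='/search/tweets'):
    """Hypothetical helper: return (calls used, reset time in UTC) for one endpoint."""
    entry = limits[endpoint]
    used = entry['limit'] - entry['remaining']
    reset_at = datetime.fromtimestamp(entry['reset'], tz=timezone.utc)
    return used, reset_at


used, reset_at = describe_limit(search_limits)
print(f"{used} calls used; window resets at {reset_at.isoformat()}")
```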


In [30]:
# Instantiate list of twitter accounts to scrape for historical tweets.
my_accounts = []

# Pull account names from twitter lists.
for twitter_list in twitter_lists:
    for member in tweepy.Cursor(api.list_members, username, twitter_list).items():
        my_accounts.append(member.screen_name)

# Show list
my_accounts

['CaltransDist6',
 'CaltransHQ',
 'CaltransDist3',
 'Caltrans9',
 'CaltransD4',
 'CaltransDist1',
 'CaltransD2',
 'CaltransD5',
 'Caltrans8mtn',
 'CaltransDist12',
 'CaltransDist10',
 'CaltransDist7',
 'Caltrans8']

In [31]:
# Keep only the accounts whose names contain a district number (this excludes CaltransHQ).
cal_trans_accts = []
for account in my_accounts:
    match = re.search(r'\d+', account)
    if match:
        cal_trans_accts.append((int(match.group()), account))
cal_trans_df = pd.DataFrame(cal_trans_accts, columns=['district', 'account'])
cal_trans_df.sort_values(by='district')

Unnamed: 0,district,account
4,1,CaltransDist1
5,2,CaltransD2
1,3,CaltransDist3
3,4,CaltransD4
6,5,CaltransD5
0,6,CaltransDist6
10,7,CaltransDist7
7,8,Caltrans8mtn
11,8,Caltrans8
2,9,Caltrans9
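
To make the district extraction concrete, here is a standalone sketch of the same idea using the standard-library `re` module (the notebook itself imports `regex`). The `district_number` name is introduced only for this example.

```python
import re


def district_number(account):
    """Return the first run of digits in an account name as an int, or None."""
    match = re.search(r'\d+', account)
    return int(match.group()) if match else None


accounts = ['CaltransDist6', 'CaltransHQ', 'Caltrans8mtn', 'CaltransDist12']
numbers = {name: district_number(name) for name in accounts}
print(numbers)
```

Accounts without digits, such as `CaltransHQ`, map to `None` and can be filtered out before building the dataframe.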


# Establishing GetOldTweets3 Parameters
Using the [CalTrans districts data](../data/scrapes/cal-trans-districts.csv) and [notable fires](../data/scrapes/notable_fires_data.csv) datasets from notebooks [00 WikiScrapes CalTrans Districts](00_Wiki_Scrapes_Caltrans_Districts.ipynb) and [00 WikiScrapes Notable Fires](00_Wiki_Scrapes_Notable_Fires.ipynb), we can establish the Twitter search parameters (account, start date, and end date) needed to pull the tweets for our historical data set.

In [23]:
districts = pd.read_csv('../data/scrapes/counties_district_scrape.csv')
districts = districts.iloc[13:25,0:2]
districts.columns = ['district', 'counties']
districts

Unnamed: 0,district,counties
13,1,"Del Norte, Humboldt, Lake, Mendocino"
14,2,"Lassen, Modoc, Plumas, Shasta, Siskiyou, Teham..."
15,3,"Butte, Colusa, El Dorado, Glenn, Nevada, Place..."
16,4,"Alameda, Contra Costa, Marin, Napa, San Franci..."
17,5,"Monterey, San Benito, San Luis Obispo, Santa B..."
18,6,"Madera, Fresno, Tulare, Kings, Kern"
19,7,"Los Angeles, Ventura"
20,8,"Riverside, San Bernardino"
21,9,"Inyo, Mono"
22,10,"Alpine, Amador, Calaveras, Mariposa, Merced, S..."


In [24]:
# Split each comma-separated counties string into a list of county names.
districts['counties'] = districts['counties'].map(lambda x: [s.strip() for s in x.split(',')])

# Build a lookup from county name to Caltrans district number (many counties map to one district).
counties = {}
for district, county_names in zip(districts['district'], districts['counties']):
    for county in county_names:
        counties[county] = int(district)

print(counties)

{'Del Norte': 1, 'Humboldt': 1, 'Lake': 1, 'Mendocino': 1, 'Lassen': 2, 'Modoc': 2, 'Plumas': 2, 'Shasta': 2, 'Siskiyou': 2, 'Tehama': 2, 'Trinity; portions of Butte and Sierra': 2, 'Butte': 3, 'Colusa': 3, 'El Dorado': 3, 'Glenn': 3, 'Nevada': 3, 'Placer': 3, 'Sacramento': 3, 'Sierra': 3, 'Sutter': 3, 'Yolo': 3, 'Yuba': 3, 'Alameda': 4, 'Contra Costa': 4, 'Marin': 4, 'Napa': 4, 'San Francisco': 4, 'San Mateo': 4, 'Santa Clara': 4, 'Solano': 4, 'Sonoma': 4, '': 4, 'Monterey': 5, 'San Benito': 5, 'San Luis Obispo': 5, 'Santa Barbara': 5, 'Santa Cruz': 5, 'Madera': 6, 'Fresno': 6, 'Tulare': 6, 'Kings': 6, 'Kern': 6, 'Los Angeles': 7, 'Ventura': 7, 'Riverside': 8, 'San Bernardino': 8, 'Inyo': 9, 'Mono': 9, 'Alpine': 10, 'Amador': 10, 'Calaveras': 10, 'Mariposa': 10, 'Merced': 10, 'San Joaquin': 10, 'Stanislaus': 10, 'Tuolumne': 10, 'Imperial': 11, 'San Diego': 11, 'Orange': 12}
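
The printed dictionary contains two keys that are not plain county names: an empty string and `'Trinity; portions of Butte and Sierra'`. One possible cleanup is sketched below. It is not a step the notebook performs, and `clean_county_lookup` is a hypothetical helper.

```python
def clean_county_lookup(raw):
    """Hypothetical cleanup: drop blank keys and keep only the name before any ';' note."""
    cleaned = {}
    for name, district in raw.items():
        county = name.split(';')[0].strip()
        if county:
            # setdefault keeps the first district seen for a county name.
            cleaned.setdefault(county, district)
    return cleaned


# A few entries copied from the dictionary printed above.
raw = {'Tehama': 2, 'Trinity; portions of Butte and Sierra': 2, 'Butte': 3, '': 4, 'Sonoma': 4}
cleaned = clean_county_lookup(raw)
print(cleaned)
```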


In [5]:
! ls ../data/scrapes

cal-trans-districts_list.csv county_districts_list.csv
california_street_names.csv  notable_fires_data.csv
counties_district_scrape.csv


In [11]:
# Import fire data.
fires = pd.read_csv('../data/scrapes/notable_fires_data.csv')
fires.head()

Unnamed: 0,name,county,acres,hectares,start,contained,notes
0,Rumsey,Yolo,39138,15838.6,October 10 2004,October 16 2004,5 structures destroyed
1,Old,San Bernardino,91281,36940.1,October 21 2003,November 25 2003,975 structures destroyed
2,Simi,Ventura,108204,43788.6,October 25 2003,November 5 2003,315 structures destroyed
3,Topanga,Los Angeles,24175,9783.3,September 28 2005,October 6 2005,
4,Esperanza,Riverside,41173,16662.1,October 26 2006,November 1 2006,"5 fatalities, 54 structures destroyed"


In [12]:
fires['district'] = fires['county'].map(counties)
fires

Unnamed: 0,name,county,acres,hectares,start,contained,notes,district
0,Rumsey,Yolo,39138,15838.6,October 10 2004,October 16 2004,5 structures destroyed,3.0
1,Old,San Bernardino,91281,36940.1,October 21 2003,November 25 2003,975 structures destroyed,8.0
2,Simi,Ventura,108204,43788.6,October 25 2003,November 5 2003,315 structures destroyed,7.0
3,Topanga,Los Angeles,24175,9783.3,September 28 2005,October 6 2005,,7.0
4,Esperanza,Riverside,41173,16662.1,October 26 2006,November 1 2006,"5 fatalities, 54 structures destroyed",8.0
5,Island,Los Angeles,4750,1922.3,May 10 2007,May 15 2007,6 structures destroyed,7.0
6,Zaca,Santa Barbara,240207,97208.3,July 4 2007,September 4 2007,1 structure destroyed,5.0
7,Witch,San Diego,197990,80123.7,October 21 2007,November 6 2007,"1,650 structures destroyed",11.0
8,Harris,San Diego,90440,36599.8,October 21 2007,November 5 2007,472 structures destroyed; 1 fatality,11.0
9,Santiago,Orange,28400,11493.1,October 21 2007,November 9 2007,24 structures destroyed,12.0


In [13]:
# Rows that list multiple counties did not match the lookup and need to be fixed individually.
fires[fires['district'].isnull()]['county']

32               Amador Calaveras
33               Lake Napa Sonoma
43          Ventura Santa Barbara
47    Mendocino Lake Colusa Glenn
49            Los Angeles Ventura
Name: county, dtype: object
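
These rows fail the lookup because several county names are joined by spaces. The notebook fixes them by hand; as an alternative, the sketch below splits such a string by greedily matching known county names and then collects the distinct districts. Both helper names are assumptions, and the lookup is a small excerpt of the `counties` dictionary.

```python
def split_counties(text, known_counties):
    """Greedily split a space-joined string into known county names."""
    words = text.split()
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so "Los Angeles" is matched as one name.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in known_counties:
                found.append(candidate)
                i = j
                break
        else:
            i += 1  # skip a word that matches no known county
    return found


def districts_for(text, lookup):
    """Return the sorted distinct districts for every county named in `text`."""
    return sorted({lookup[name] for name in split_counties(text, lookup)})


# Excerpt of the county-to-district lookup built earlier.
lookup = {'Amador': 10, 'Calaveras': 10, 'Lake': 1, 'Napa': 4, 'Sonoma': 4,
          'Ventura': 7, 'Santa Barbara': 5, 'Mendocino': 1, 'Colusa': 3,
          'Glenn': 3, 'Los Angeles': 7}

print(split_counties('Los Angeles Ventura', lookup))
print(districts_for('Mendocino Lake Colusa Glenn', lookup))
```

A fire that maps to more than one district could then be expanded into one row per district, rather than appending the extra rows manually.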

In [17]:
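# Fill each missing district with the district of one county from that row.
fires['district'] = fires['district'].fillna({
    32: counties['Amador'],
    33: counties['Napa'],
    43: counties['Ventura'],
    47: counties['Glenn'],
    49: counties['Los Angeles']
})

# The Valley fire (row 33) also spans Lake County, so add a copy of the row for that district.
extra_Lake = dict(fires.iloc[33])
print(extra_Lake)
extra_Lake['district'] = counties['Lake']
print(extra_Lake)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
fires = pd.concat([fires, pd.DataFrame([extra_Lake])], ignore_index=True)

# The Mendocino Complex fire (row 47) also spans Mendocino County, so add a copy for that district.
extra_Mendocino = dict(fires.iloc[47])
print(extra_Mendocino)
extra_Mendocino['district'] = counties['Mendocino']
print(extra_Mendocino)
fires = pd.concat([fires, pd.DataFrame([extra_Mendocino])], ignore_index=True)

fires;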

{'name': 'Valley', 'county': 'Lake Napa Sonoma', 'acres': 76067, 'hectares': 30783.2, 'start': 'September 12 2015', 'contained': 'October 15 2015', 'notes': '1,955 structures destroyed; 4 fatalities', 'district': 4.0}
{'name': 'Valley', 'county': 'Lake Napa Sonoma', 'acres': 76067, 'hectares': 30783.2, 'start': 'September 12 2015', 'contained': 'October 15 2015', 'notes': '1,955 structures destroyed; 4 fatalities', 'district': 1}
{'name': 'Mendocino Complex', 'county': 'Mendocino Lake Colusa Glenn', 'acres': 459102, 'hectares': 185792.0, 'start': 'July 27 2018', 'contained': 'September 18 2018', 'notes': '277 structures destroyed, 1 fatality', 'district': 3.0}
{'name': 'Mendocino Complex', 'county': 'Mendocino Lake Colusa Glenn', 'acres': 459102, 'hectares': 185792.0, 'start': 'July 27 2018', 'contained': 'September 18 2018', 'notes': '277 structures destroyed, 1 fatality', 'district': 1}


In [32]:
fires = fires.merge(cal_trans_df, how='inner', on='district')
fires.head()

Unnamed: 0,name,county,acres,hectares,start,contained,notes,district,account
0,Rumsey,Yolo,39138,15838.6,October 10 2004,October 16 2004,5 structures destroyed,3.0,CaltransDist3
1,King,El Dorado,97717,39544.7,September 13 2014,October 9 2014,80 structures destroyed,3.0,CaltransDist3
2,Mendocino Complex,Mendocino Lake Colusa Glenn,459102,185792.0,July 27 2018,September 18 2018,"277 structures destroyed, 1 fatality",3.0,CaltransDist3
3,Camp,Butte,153336,62050.0,November 8 2018,November 25 2018,"18,804 structures destroyed, 86 fatalities",3.0,CaltransDist3
4,Old,San Bernardino,91281,36940.1,October 21 2003,November 25 2003,975 structures destroyed,8.0,Caltrans8mtn


In [41]:
# Convert datetimes to twitter-friendly formats.
fires['start'] = pd.to_datetime(fires['start']).dt.strftime("%Y-%m-%d")
fires['contained'] = pd.to_datetime(fires['contained']).dt.strftime("%Y-%m-%d")
fires.head()

Unnamed: 0,name,county,acres,hectares,start,contained,notes,district,account
0,Rumsey,Yolo,39138,15838.6,2004-10-10,2004-10-16,5 structures destroyed,3.0,CaltransDist3
1,King,El Dorado,97717,39544.7,2014-09-13,2014-10-09,80 structures destroyed,3.0,CaltransDist3
2,Mendocino Complex,Mendocino Lake Colusa Glenn,459102,185792.0,2018-07-27,2018-09-18,"277 structures destroyed, 1 fatality",3.0,CaltransDist3
3,Camp,Butte,153336,62050.0,2018-11-08,2018-11-25,"18,804 structures destroyed, 86 fatalities",3.0,CaltransDist3
4,Old,San Bernardino,91281,36940.1,2003-10-21,2003-11-25,975 structures destroyed,8.0,Caltrans8mtn
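
For a single value, the same conversion can be sketched with the standard library alone. This assumes every date follows the "Month day year" pattern seen in the table and that month names are in English; `to_iso_date` is a name chosen for this example.

```python
from datetime import datetime


def to_iso_date(text):
    """Convert a date such as 'October 10 2004' to 'YYYY-MM-DD'."""
    return datetime.strptime(text, "%B %d %Y").strftime("%Y-%m-%d")


print(to_iso_date("October 10 2004"))
print(to_iso_date("November 5 2003"))
```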


In [42]:
# Create a list of dictionaries to pass into the GOT3 query fields.
tweet_params = fires[['account', 'start', 'contained']].to_dict('records')
tweet_params[0:2]

[{'account': 'CaltransDist3',
  'start': '2004-10-10',
  'contained': '2004-10-16'},
 {'account': 'CaltransDist3',
  'start': '2014-09-13',
  'contained': '2014-10-09'}]
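
Some of these date ranges, such as the 2004 Rumsey fire, fall before Twitter existed and so cannot return any tweets. A possible pre-filter is sketched below. The cutoff date (the first tweet, in March 2006) and the `usable` helper are assumptions added for illustration; the notebook itself queries every range.

```python
from datetime import date

# Assumption: no tweets exist before the first tweet on 21 March 2006.
EARLIEST_USEFUL = date(2006, 3, 21)


def usable(param):
    """Return True if the range is well ordered and ends on or after the cutoff."""
    start = date.fromisoformat(param['start'])
    end = date.fromisoformat(param['contained'])
    return start <= end and end >= EARLIEST_USEFUL


# The two parameter dictionaries shown above.
params = [
    {'account': 'CaltransDist3', 'start': '2004-10-10', 'contained': '2004-10-16'},
    {'account': 'CaltransDist3', 'start': '2014-09-13', 'contained': '2014-10-09'},
]
kept = [p for p in params if usable(p)]
print(kept)
```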

## Create Functions for Pulling Tweets
We define two functions here: one converts a tweet object into a dictionary, and the other pulls a batch of tweets from GetOldTweets3.

In [43]:
def create_tweet(tweet):
    '''
    Extract relevant information from a tweet collected by GetOldTweets3 and convert it into a dictionary.
    
    Parameters
    ----------
    tweet (GetOldTweets3 tweet object):
        The tweet object returned from GetOldTweets3
        
    Returns
    -------
    tweet_dict (dict):
        dictionary object with: 
            `id`, 
            `username`, 
            `date`, 
            `text`, 
            `hashtags`, 
            `geo` (geolocation of tweet, to be used when live civilian tweets are incorporated in future iterations), 
            `type` ("official", to be distinguished from civilian tweets in future iterations)
    '''
    tweet_dict = {}
    tweet_dict['id'] = tweet.id
    tweet_dict['username'] = tweet.username
    tweet_dict['date'] = tweet.date
    tweet_dict['text'] = tweet.text
    tweet_dict['hashtags'] = tweet.hashtags
    tweet_dict['geo'] = tweet.geo
    tweet_dict['type'] = 'official'
    return tweet_dict
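
To try the conversion without querying Twitter, a stand-in object with the same attribute names can be used. The sketch below is a hypothetical variant of `create_tweet` called `tweet_to_dict`; the values in the fake tweet are placeholders, not real data.

```python
from types import SimpleNamespace

# Attributes copied from the tweet object, in the order used by create_tweet.
FIELDS = ('id', 'username', 'date', 'text', 'hashtags', 'geo')


def tweet_to_dict(tweet, tweet_type='official'):
    """Hypothetical variant of create_tweet: copy the listed attributes, then tag the type."""
    record = {field: getattr(tweet, field) for field in FIELDS}
    record['type'] = tweet_type
    return record


# Stand-in for a GetOldTweets3 tweet object, with placeholder values.
fake_tweet = SimpleNamespace(id='1', username='CaltransDist3', date='2014-10-08',
                             text='placeholder text', hashtags='', geo='')
record = tweet_to_dict(fake_tweet)
print(record)
```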

In [51]:
def grab_batch(batch_params, max_tweets=1000):
    '''
    Grab a batch of up to `max_tweets` tweets matching `batch_params` and return them as a list of dictionaries.
    
    Parameters
    ----------
    batch_params (dict):
        dictionary of query parameters for the batch pull, with keys `account`, `start`, and `contained`.
    max_tweets (int): default=1000
        maximum number of tweets to pull in each batch.
        
    Returns
    -------
    tweet_list (list):
        list of tweet dictionaries created by `create_tweet`.
    '''
    # Instantiate list of tweets to return
    tweet_list = []
    # establish our twitter criteria
    tweetCriteria = (
        got.manager.TweetCriteria()
        .setUsername(batch_params['account'])
        .setSince(batch_params['start'])
        .setUntil(batch_params['contained'])
        .setMaxTweets(max_tweets)
    )

    # query GetOldTweets3
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)

    # create dictionary from tweet object and append to list
    for tweet in tweets:
        tweet_list.append(create_tweet(tweet))
    # pause for 3 seconds after each batch before the next query
    time.sleep(3)

    return tweet_list
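
Because each batch is capped at `max_tweets`, a long-burning fire could exceed the cap in a single query. One option, sketched here as an assumption and not as something the notebook does, is to split a date range into shorter windows and query each window separately. `split_range` is a hypothetical helper.

```python
from datetime import date, timedelta


def split_range(start, end, days=7):
    """Split the ISO date range from `start` to `end` into consecutive windows of at most `days` days."""
    current = date.fromisoformat(start)
    stop = date.fromisoformat(end)
    windows = []
    while current < stop:
        nxt = min(current + timedelta(days=days), stop)
        windows.append((current.isoformat(), nxt.isoformat()))
        current = nxt
    return windows


windows = split_range('2014-09-13', '2014-10-09')
for window in windows:
    print(window)
```

Adjacent windows share a boundary date, so if the end date of a query turns out to be inclusive, the combined results may need de-duplicating by tweet `id`.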

In [52]:
historical_tweets = []
for tweet_param in tweet_params:
    historical_tweets += grab_batch(tweet_param)

Unnamed: 0,id,username,date,text,hashtags,geo,type
0,519967790208266240,CaltransDist3,2014-10-08 21:49:16+00:00,Update: ETO is 7 p.m. to reopen EB I-80 betwee...,,,official
1,519952600498593792,CaltransDist3,2014-10-08 20:48:54+00:00,Eastbound Interstate 80 closed between Auburn ...,,,official
2,519937459191578625,CaltransDist3,2014-10-08 19:48:44+00:00,Hwy 113 closed at George Washington in Sutter ...,,,official
3,519877652728274945,CaltransDist3,2014-10-08 15:51:05+00:00,@D3PIO update hwy 162 OT big rig is near Butte...,,,official
4,519876511751749632,CaltransDist3,2014-10-08 15:46:33+00:00,Eastbound 80 at Penryn Rd #2&3 lanes blocked d...,#2,,official
...,...,...,...,...,...,...,...
128,511175131591606272,CaltransDist3,2014-09-14 15:30:23+00:00,US 50 Pioneer bridge working in the slow lanes...,,,official
129,511020410985406464,CaltransDist3,2014-09-14 05:15:34+00:00,US 50 Pioneer bridge work resumed at 10 pm in ...,#3 #4,,official
130,510947877385170944,CaltransDist3,2014-09-14 00:27:21+00:00,US 50 Pioneer bridge is now open and work will...,,,official
131,510934051721867265,CaltransDist3,2014-09-13 23:32:25+00:00,US 50 at the Pioneer Bridge will reopen at 5:3...,,,official


In [None]:
# Write historical tweets dataframe into a csv
pd.DataFrame(historical_tweets).to_csv('../data/tweets/historical_tweets.csv', index=False)
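
Two fires in the same district can share overlapping dates (the Witch and Harris fires both started on October 21 2007 in San Diego County), so the same tweet may be pulled more than once. A minimal sketch of removing repeats by tweet `id` before saving follows; `dedupe_by_id` is a hypothetical helper, and the sample records are placeholders.

```python
def dedupe_by_id(tweets):
    """Keep the first occurrence of each tweet id, preserving order."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet['id'] not in seen:
            seen.add(tweet['id'])
            unique.append(tweet)
    return unique


# Placeholder records standing in for the output of grab_batch.
sample = [
    {'id': 'a1', 'text': 'first'},
    {'id': 'b2', 'text': 'second'},
    {'id': 'a1', 'text': 'first'},
]
unique_tweets = dedupe_by_id(sample)
print(len(unique_tweets))
```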

# Grab Live Tweets
This pulls up to 200 of the most recent tweets from each Twitter account in `my_accounts`.

In [58]:
def grab_current_tweets(twitter_account):
    '''
    Grab up to 200 of the most recent tweets from `twitter_account` and return them as a list of dictionaries.
    '''
    tweets_list = []
    # establish our twitter criteria
    tweetCriteria = got.manager.TweetCriteria().setUsername(twitter_account).setMaxTweets(200)

    # query GOT
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)

    # create dictionary from tweet object and add it to the list
    for tweet in tweets:
        tweets_list.append(create_tweet(tweet))
    # pause for 3 seconds after each account before the next query
    time.sleep(3)

    return tweets_list

In [60]:
current_tweets = []
for account in my_accounts:
    current_tweets += grab_current_tweets(account)

In [63]:
pd.DataFrame(current_tweets).to_csv('../data/tweets/current_tweets.csv', index=False)

# Summary
In this notebook we established a historical corpus as well as an up-to-date library of Caltrans tweets.