# 1. Data Collection Trial and Error
_**Author**: [Boom Devahastin Na Ayudhya](https://linkedin.com/in/boom-devahastin)_

### Contents
1. [TwitterScraper: Webscraping Tweets](#TwitterScraper:-Webscraping-Tweets)
2. [Geo Data for Cities](#Geo-Data-for-Cities)

### TwitterScraper: Webscraping Tweets
`twitterscraper` is a package we can use to scrape historical tweets. Ideally, with funding from clients, the tool would use live Twitter data, but this notebook is a proof of concept. Using historical tweets, we can identify posts that can be confirmed to be about power outages.

In [1]:
# Install Twitter Scraper
!pip install twitterscraper
import twitterscraper





In [2]:
# Load Packages
from twitterscraper import query_tweets
import datetime
import pandas as pd

# Create empty template tweet dataframe to later populate
tweet_df = pd.DataFrame(columns=["id", "text", "timestamp"])

# Check template
tweet_df

Unnamed: 0,id,text,timestamp


This initial version scrapes posts that contain "blackout(s)", "power outage", or "outage(s)". Our ideal version of this query would also include an AND clause with a full list of US cities, so that we might infer the location from the body of the tweet text. However, the API locked us out because we attempted to pull too many posts at once, so we were unable to do this within the project's timeframe.
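A minimal sketch of how such a combined query string might be assembled, assuming Twitter search treats a space between parenthesised groups as AND and double quotes as an exact phrase. The helper names `quote`, `build_query`, and `batched` are hypothetical, and the city names are only examples taken from the geo data used later in this notebook.

```python
# Keyword terms taken from the query used in this notebook.
KEYWORDS = ["blackout", "blackouts", "outage", "outages", "power outage"]


def quote(term):
    """Wrap multi-word terms in double quotes so they are matched as a phrase."""
    return f'"{term}"' if " " in term else term


def build_query(keywords, cities):
    """Return '(kw1 OR kw2 ...) (city1 OR city2 ...)'.

    Assumption: a space between the two groups acts as AND in Twitter search.
    """
    keyword_clause = " OR ".join(quote(k) for k in keywords)
    city_clause = " OR ".join(quote(c) for c in cities)
    return f"({keyword_clause}) ({city_clause})"


def batched(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


print(build_query(KEYWORDS, ["New London", "Peterborough"]))
# (blackout OR blackouts OR outage OR outages OR "power outage") ("New London" OR Peterborough)
```

Splitting the city list with `batched` would keep each query string short, at the cost of one request per batch.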

Since this is a proof-of-concept model, we will work around the missing location information in Section 2 (Data Cleaning) by randomly assigning a city to each post using the `NE_cities` dataframe built in the Geo Data for Cities section below. This gives us the chance to work with geospatial data visualization.
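The random assignment described above could look roughly like the sketch below. It is not the Section 2 implementation: `tweets` and `cities_sample` are small hypothetical stand-ins for the scraped tweets and `NE_cities`, and `assign_random_cities` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical stand-in for the scraped tweets (texts are made up for illustration).
tweets = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "text": ["power outage on our street", "blackout downtown",
             "outage again", "another outage tonight"],
})

# Hypothetical stand-in for NE_cities, using a few rows of the geo data.
cities_sample = pd.DataFrame({
    "city": ["New London", "Peterborough", "Union"],
    "state_id": ["NH", "NH", "NH"],
    "lat": [43.4139, 42.879, 43.4913],
    "lng": [-71.9844, -71.9593, -71.0233],
})


def assign_random_cities(tweets, cities, seed=42):
    """Attach one randomly drawn city row to each tweet (sampling with replacement)."""
    drawn = cities.sample(n=len(tweets), replace=True, random_state=seed)
    drawn = drawn.reset_index(drop=True)
    return pd.concat([tweets.reset_index(drop=True), drawn], axis=1)


tweets_with_city = assign_random_cities(tweets, cities_sample)
print(tweets_with_city)
```

Fixing `seed` keeps the assignment reproducible between runs, which makes downstream maps easier to compare.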

In [3]:
# Query Tweets between specified dates
list_of_tweets = query_tweets("blackout OR blackouts OR outage OR outages OR power outage",
                              begindate = datetime.date(2018,1,1),
                              enddate = datetime.date(2018,12,31),
                              poolsize = 1)
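Because the lockout mentioned earlier followed an attempt to pull too many posts at once, one option would be to split the year into smaller date windows and query them one at a time, saving or pausing between windows. The sketch below only builds the windows; `month_windows` is a hypothetical helper, and whether this would actually avoid a lockout is an untested assumption.

```python
import datetime


def month_windows(begin, end):
    """Split the range from begin to end into consecutive per-month (start, stop) date pairs."""
    windows = []
    start = begin
    while start < end:
        # First day of the month after `start`.
        if start.month == 12:
            next_month = datetime.date(start.year + 1, 1, 1)
        else:
            next_month = datetime.date(start.year, start.month + 1, 1)
        windows.append((start, min(next_month, end)))
        start = next_month
    return windows


windows = month_windows(datetime.date(2018, 1, 1), datetime.date(2018, 12, 31))
print(len(windows))

# Each pair could then be passed to query_tweets, for example:
# for window_start, window_end in windows:
#     batch = query_tweets(query, begindate=window_start, enddate=window_end, poolsize=1)
```

Adjacent windows share a boundary date, so if the end date turns out to be inclusive, duplicate tweets would need to be dropped by `id` afterwards.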

In [4]:
# Extract features of tweets to populate dataframe:
# build all rows first, then create the dataframe in one step
rows = [{"id": tweet.id, "text": tweet.text, "timestamp": tweet.timestamp}
        for tweet in list_of_tweets]
tweet_df = pd.DataFrame(rows, columns=tweet_df.columns)

In [5]:
# Examine
print(tweet_df.shape)

In [6]:
# Save as .csv
tweet_df.to_csv("./dirty_tweets_20180101-20181231.csv")

### Geo Data for Cities
We read in a file that contains geographical data (including latitude and longitude) for all cities in the United States. Since this is a proof of concept, we restrict our visualizations to the New England region.

In [7]:
# Read in .csv
cities = pd.read_csv('./datasets/USCFinal.csv', index_col=0)

# Keep only New England cities
NE_cities = cities[cities['state_id'].isin(["ME", "VT", "NH", "MA", "CT", "RI"])]

# Extract only city, state abbreviation, state name, county name, latitude, and longitude
NE_cities = NE_cities.loc[:,['city', 'state_id', 'state_name', 'county_name', 'lat', 'lng']]

# View
NE_cities.head()

Unnamed: 0,city,state_id,state_name,county_name,lat,lng
4681,New London,NH,New Hampshire,Merrimack,43.4139,-71.9844
4682,Peterborough,NH,New Hampshire,Hillsborough,42.879,-71.9593
4683,Union,NH,New Hampshire,Carroll,43.4913,-71.0233
4684,Walpole,NH,New Hampshire,Cheshire,43.0792,-72.4236
4685,Enfield,NH,New Hampshire,Grafton,43.6448,-72.148


In [8]:
# Save as .csv
NE_cities.to_csv("./datasets/new_england_cities_geo-data.csv")