## Instructions for Twitter Scraping

### 1. Twitter Scraping with the Python Library GetOldTweets3
#### Some reminders:
#### 1) Run "pip install GetOldTweets3" in the Command Prompt to install the package
#### 2) Meaningful keywords related to power outages: power outage, power outages, #poweroutage, #poweroutages, power is out, power is down, no electricity, #noelectricity, outage, outages, #outages
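
Since the cells below handle one search query at a time, it can help to list every keyword together with the file it would be saved to before starting. The following is a minimal sketch, not part of the original workflow: `KEYWORDS` and `build_jobs` are hypothetical names, and the file name pattern is assumed to match the one used when the tweets are saved.

```python
# Keywords copied from reminder 2) above.
KEYWORDS = [
    "power outage", "power outages", "#poweroutage", "#poweroutages",
    "power is out", "power is down", "no electricity", "#noelectricity",
    "outage", "outages", "#outages",
]

def build_jobs(keywords, from_date, to_date):
    """Return one (search query, output file name) pair per keyword.

    Hypothetical helper: it only prepares names and does not download anything.
    """
    return [
        (kw, "{}_{}_to_{}.csv".format(kw, from_date, to_date))
        for kw in keywords
    ]

jobs = build_jobs(KEYWORDS, "2020-04-24", "2020-04-25")
for query, filename in jobs:
    print(query, "->", filename)
```

Each pair could then be fed to the download cell in turn by setting `searchQuery` to the query.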

In [1]:
## Importing libraries 
import os
import pandas as pd 
import GetOldTweets3 as got 

In [7]:
## Set the working directory for saving tweet files 
os.chdir("E:/Alliant/multipleRunning/no electricity/2018")
## Set parameters
from_date = "2020-04-24"
to_date = "2020-04-25"
maxTweets = 14000
searchQuery = "#poweroutage" 

In [8]:
### Download tweets 
tweetCriteria = got.manager.TweetCriteria().setQuerySearch(searchQuery)\
                                            .setSince(from_date)\
                                            .setUntil(to_date)\
                                            .setMaxTweets(maxTweets) 

tweets = got.manager.TweetManager.getTweets(tweetCriteria) 

## Retrieve information 
author_id = [] 
date = []  
text = [] 
username = [] 

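for tweet in tweets: 
    # Store every field as a string so that each CSV column has a uniform type
    # (appending directly also avoids shadowing the built-in name "id")
    author_id.append(str(tweet.author_id)) 
    date.append(str(tweet.date)) 
    text.append(str(tweet.text)) 
    username.append(str(tweet.username)) 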
     
## Aggregate into file 
author_id = pd.Series(author_id) 
date = pd.Series(date)  
text = pd.Series(text) 
username = pd.Series(username) 

tweet_data = pd.concat([author_id, username, date, text], axis = 1) 
tweet_data.columns = ['author id', 'username', 'date', 'text'] 
tweet_data.to_csv("{}_{}_to_{}.csv".format(searchQuery,from_date,to_date))

print("Downloaded {} from {} to {}".format(searchQuery, from_date, to_date))

Downloaded #poweroutage from 2020-04-24 to 2020-04-25
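
The aggregation step above builds four separate lists and then concatenates them. As an alternative, the table can be built in one step from a list of row dictionaries. This is only a sketch under stated assumptions: `tweets_to_frame` is a hypothetical helper, and `fake_tweets` are stand-in objects that carry the same four attributes read in the loop above, so the block runs without any scraping.

```python
from types import SimpleNamespace
import pandas as pd

# Stand-in objects with the same four attributes the loop above reads;
# a real run would pass the objects returned by the scraper instead.
fake_tweets = [
    SimpleNamespace(author_id=1, username="user_a",
                    date="2020-04-24 10:00:00", text="power is out"),
    SimpleNamespace(author_id=2, username="user_b",
                    date="2020-04-24 11:30:00", text="no electricity"),
]

COLUMNS = ["author id", "username", "date", "text"]

def tweets_to_frame(tweets):
    """Build the four-column table in one step, one row per tweet."""
    rows = [
        {"author id": str(t.author_id), "username": str(t.username),
         "date": str(t.date), "text": str(t.text)}
        for t in tweets
    ]
    # Passing columns explicitly keeps the layout stable even for an empty result
    return pd.DataFrame(rows, columns=COLUMNS)

frame = tweets_to_frame(fake_tweets)
print(frame)
```

The resulting frame has the same column names as `tweet_data`, so it could be written out with the same `to_csv` call.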


### 2. Retrieve user location with username

#### Some reminders:
#### 1) Run "pip install tweepy" in the Command Prompt to install the package
#### 2) This method requires Twitter API keys. To obtain a Twitter Developer account and API keys, go to: https://developer.twitter.com/en
#### 3) Not all users list a profile location: only about 1/3 of tweets come with a user profile location

In [9]:
## Importing libraries 
import tweepy   

In [10]:
## Fill in credentials 
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''
auth = tweepy.OAuthHandler(consumer_key, consumer_secret) 
auth.set_access_token(access_token, access_token_secret) 
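## Note: this cell targets tweepy 3.x; the wait_on_rate_limit_notify argument was removed in tweepy 4.0
api = tweepy.API(auth, wait_on_rate_limit=True,wait_on_rate_limit_notify=True) 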

if (not api): 
    print ("Problem Connecting to API") 

In [11]:
### Retrieve user location
tweetCount=0 
user_loc = []
for i in range(len(author_id)): 
    
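    user = str(username[i]) 
 
    try: 
        # Look up the profile by screen name (the keyword form needs no "@" prefix)
        user_json = api.get_user(screen_name=user) 
        userloc = user_json.location 
        # Record "N/A" when the profile lists no location
        if userloc: 
            user_loc.append(userloc) 
        else: 
            user_loc.append("N/A")
            
    except tweepy.TweepError as e: 
        # The lookup failed, for example because the account no longer exists
        user_loc.append("N/A") 
        print("some error : " + str(e))         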

     
    tweetCount += 1 
    print("Downloaded {}/{} tweets".format(tweetCount, len(author_id))) 

user_loc = pd.Series(user_loc)    
tweet_data_with_loc = pd.concat([author_id, username, user_loc, date, text], axis = 1) 
tweet_data_with_loc.columns = ['author id', 'username', 'user location', 'date', 'text'] 
tweet_data_with_loc.to_csv("{}_{}_to_{}_with_loc.csv".format(searchQuery,from_date,to_date))

print("Downloaded {} from {} to {} with location".format(searchQuery, from_date, to_date))

Downloaded 1/22 tweets
Downloaded 2/22 tweets
Downloaded 3/22 tweets
Downloaded 4/22 tweets
Downloaded 5/22 tweets
Downloaded 6/22 tweets
Downloaded 7/22 tweets
Downloaded 8/22 tweets
Downloaded 9/22 tweets
Downloaded 10/22 tweets
Downloaded 11/22 tweets
Downloaded 12/22 tweets
Downloaded 13/22 tweets
Downloaded 14/22 tweets
Downloaded 15/22 tweets
Downloaded 16/22 tweets
Downloaded 17/22 tweets
Downloaded 18/22 tweets
Downloaded 19/22 tweets
Downloaded 20/22 tweets
Downloaded 21/22 tweets
Downloaded 22/22 tweets
Downloaded #poweroutage from 2020-04-24 to 2020-04-25 with location
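
The loop above makes one lookup per tweet, so a user who posted several tweets is looked up several times. A possible refinement is to remember each answer and reuse it. The sketch below is an illustration rather than the method used in this notebook: `lookup_locations` and `fake_fetch` are hypothetical names, and the API call is replaced by a stand-in function so the block runs without credentials.

```python
def lookup_locations(usernames, fetch_location):
    """Return one location per username, calling fetch_location once per distinct name."""
    cache = {}
    locations = []
    for name in usernames:
        if name not in cache:
            try:
                # An empty or missing location is recorded as "N/A", as in the loop above
                cache[name] = fetch_location(name) or "N/A"
            except Exception as e:
                # Broad on purpose: this sketch does not assume a specific client library
                print("some error : " + str(e))
                cache[name] = "N/A"
        locations.append(cache[name])
    return locations

# Stand-in for the real API call; it records every lookup it receives.
calls = []

def fake_fetch(name):
    calls.append(name)
    profiles = {"user_a": "Madison, WI", "user_b": ""}
    if name not in profiles:
        raise LookupError("user not found: " + name)
    return profiles[name]

result = lookup_locations(["user_a", "user_b", "user_a", "user_c"], fake_fetch)
print(result)
print("lookups made:", len(calls))
```

With tweepy, `fetch_location` might be something like `lambda name: api.get_user(screen_name=name).location`, assuming `api` is configured as in the credentials cell.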


### 3. Identify the US state for each user location

#### 1) User locations are entered manually, so the data is not always clean.
#### 2) Filtering on the capital letters “WI” or “IA” also catches unrelated locations that contain them, such as “NIGERIA”.
#### 3) A CSV file containing all the state names and abbreviations is used for the matching.
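
One way to reduce the false matches described in reminder 2) is to require the state name or code to appear as a whole word. The sketch below illustrates that idea with Python's `re` module; it is not the filtering used in the cells that follow. `STATES` is a two-row stand-in for the full state list and `match_state` is a hypothetical helper.

```python
import re

# Two-row stand-in for the state list; the real list comes from the CSV file.
STATES = [("Wisconsin", "WI"), ("Iowa", "IA")]

def match_state(location, states=STATES):
    """Return the first state code whose name or code appears as a whole word."""
    for name, code in states:
        # Full names: whole word, any letter case
        if re.search(r"\b" + re.escape(name) + r"\b", location, flags=re.IGNORECASE):
            return code
        # Codes: whole word, capitals only, so "IA" inside "NIGERIA" is ignored
        if re.search(r"\b" + re.escape(code) + r"\b", location):
            return code
    return None

for loc in ["Madison, WI", "des moines, iowa", "LAGOS, NIGERIA", "N/A"]:
    print(loc, "->", match_state(loc))
```

Whole-word matching still returns only the first hit, so a location that names two states would need extra handling.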

In [12]:
#Read a CSV file containing all the state names from GitHub
stateList = pd.read_csv('https://raw.githubusercontent.com/dengziqian/msba_AlliantEnergy/master/stateList.csv')
stateList_lowcase = pd.DataFrame({"State Lowercase":stateList["State"].str.lower(), "Abbreviation":stateList["Abbreviation"]})
new_stateList = stateList.join(stateList_lowcase.set_index("Abbreviation"), on="Abbreviation")

In [15]:
#Create an empty dataframe to save state information later
authorid_State=pd.DataFrame({"username":[], "state":[]})
ae = tweet_data_with_loc
for i in range(len(new_stateList)): 
    #Look up the fields by column name rather than by position
    row = new_stateList.iloc[i]
    
    #Select the tweets whose "user location" contains the state name
    tweets_with_state_name = ae[ae["user location"].str.contains(str(row["State"]), na=False)]
    
    #Select the tweets whose "user location" contains the state code
    tweets_with_state_code = ae[ae["user location"].str.contains(str(row["Abbreviation"]), na=False)]
    
    #Select the tweets whose "user location" contains the lowercase state name
    tweets_with_state_name_lowercase = ae[ae["user location"].str.contains(str(row["State Lowercase"]), na=False)]
    
    #Combine these dataframes into one
    tweets_with_state = pd.concat([tweets_with_state_name, tweets_with_state_code, tweets_with_state_name_lowercase])
    
    #Create another dataframe to record the state code for the users
    state = pd.DataFrame({"username": tweets_with_state["username"], "state":str(row["Abbreviation"])})
    
    #Append the state information to the running dataframe
    authorid_State = pd.concat([authorid_State, state])

#One user may match more than one state, so keep the first match and drop the rest
authorid_State.drop_duplicates(subset = "username", keep='first', inplace=True)

#Join the original tweets file with authorid_State file on username
ae_join = ae.join(authorid_State.set_index('username'), on='username')
ae_join.to_csv("{}_{}_to_{}_with_loc_with_State.csv".format(searchQuery,from_date,to_date), index=False)
print("Downloaded {} from {} to {} with location and state".format(searchQuery, from_date, to_date))

#Extract tweets from WI and IA
ae_wi = ae_join[ae_join["state"]=="WI"]
ae_ia = ae_join[ae_join["state"]=="IA"]
ae_wi_ia = pd.concat([ae_wi,ae_ia])

ae_wi_ia.to_csv("{}_{}_to_{}_with_loc_in_WI&IA.csv".format(searchQuery,from_date,to_date))
print("Downloaded {} from {} to {} with location in WI & IA".format(searchQuery, from_date, to_date))

Downloaded #poweroutage from 2020-04-24 to 2020-04-25 with location and state
Downloaded #poweroutage from 2020-04-24 to 2020-04-25 with location in WI & IA
