## Problem Statement

While traditional methods for alerting on events such as hurricanes and tornadoes rely on information derived from official sources (e.g. USGS), this project aims to utilize Twitter activity to identify such an event. In practice, once the event is predicted, an alert can then be sent out across social media. The outcome of this project will be a binary classification model that can analyze tweets and use them to predict whether a disaster is present and a warning must be sent. As a proof of concept, this project will use archived tweets collected during the most dangerous days of Hurricane Sandy in 2012. The project's terminology will center around that of hurricanes specifically. In this situation, predicting no emergency while a hurricane approaches (false negative) is a much more dangerous outcome than predicting a hurricane when there is none (false positive). Models will therefore be evaluated on recall as well as accuracy.
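To make the evaluation criteria concrete, here is a minimal sketch of how recall and accuracy can be computed from binary labels. The function name `recall_and_accuracy` and the label lists are made up for illustration and are not taken from the project's notebooks; the point is that accuracy can look acceptable while recall, the metric that tracks missed hurricanes, is poor.

```python
def recall_and_accuracy(y_true, y_pred):
    """Return (recall, accuracy) for binary labels where 1 = hurricane present."""
    pairs = list(zip(y_true, y_pred))
    # True positives: hurricane present and predicted
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    # False negatives: hurricane present but missed (the dangerous case)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return recall, accuracy


# Made-up labels for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
recall, accuracy = recall_and_accuracy(y_true, y_pred)
print(f"recall={recall:.2f}, accuracy={accuracy:.2f}")  # recall=0.33, accuracy=0.75
```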

## Executive Summary

### Data Acquisition

I initially attempted to use Twitter's API to collect live data. I also tried to use tweet IDs from archived datasets to obtain exact data on the location and datetime of each tweet. This yielded only a small amount of data, not enough for prediction. For the scope of this project, I needed a time-efficient solution. I decided to take CrisisLex's dataset of archived tweets during Hurricane Sandy and use them to predict the presence of a hurricane. CrisisLex is a repository of social media data on various crises and natural disasters. This dataset consists of tweets taken from late October 2012, posted by users in coastal New York and New Jersey, and based on four keywords: hurricane, hurricane sandy, frankenstorm, and #sandy. One column lists tweets as "on-topic" or "off-topic", indicating their relevance or irrelevance to the subject of the hurricane. This column is the basis of my binary classification.

### Data Cleaning and EDA

In this dataset, duplicated rows in the "tweets" column would refer to retweets. These rows are dropped so that tweets are not counted more than once, which would give some predictive words too much weight, leading to potential bias in the model. The positive class is the presence of a hurricane, so I set "on-topic" to 1 and "off-topic" to 0. I vectorized the corpus of tweets, and observed the most common words across the entire corpus and between the two classes. Irrelevant words from the positive class were added to the list of stop words to ensure that it was distinct from the negative class. I then used a clustering model to view the overlap between the two classes.
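The deduplication and label-encoding steps described above can be sketched with pandas as follows. The toy rows and the column name `label` are assumptions for illustration (only the "tweets" column is named in the text), so this is not the project's actual cleaning code.

```python
import pandas as pd

# Toy rows standing in for the CrisisLex data; the "label" column name is assumed
raw = pd.DataFrame({
    "tweets": [
        "Hurricane Sandy is flooding our street",
        "Hurricane Sandy is flooding our street",  # duplicate text, e.g. a retweet
        "Stay safe everyone, stock up on water",
        "Great pizza tonight",
    ],
    "label": ["on-topic", "on-topic", "on-topic", "off-topic"],
})

# Keep only the first occurrence of each tweet text
clean = raw.drop_duplicates(subset="tweets").reset_index(drop=True)

# Encode the target: on-topic is the positive class (1), off-topic is 0
clean["label"] = clean["label"].map({"on-topic": 1, "off-topic": 0})

print(clean)
```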

### Modeling

I started with logistic regression models to see if a simpler model would suffice. Following that, I used random forest models to determine if more complexity would lead to more accurate predictions, but these models underperformed. I also experimented with two different methods of vectorizing the tweets. As I conducted EDA, I added words to the list of stop words and re-examined the four models with various versions of the list. This did not have a marked effect on the models' performance. I selected the logistic regression model with CountVectorizer for its accuracy and low variance compared to the random forest models. The recall score of this model is high, meaning that false negatives are kept to a minimum. The strongest feature coefficients of this model correspond with words that are highly relevant to a hurricane. This includes direct references to the event and also to safety precautions taken during a disaster.
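As a rough illustration of the selected approach, the sketch below chains scikit-learn's `CountVectorizer` and `LogisticRegression` in a pipeline and lists the words with the largest positive coefficients. The toy tweets, the built-in English stop word list, and `max_iter=1000` are assumptions for this example; the project used its own data and a custom stop word list, and scores computed on training data as done here are not a proper evaluation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.pipeline import make_pipeline

# Toy corpus for illustration only (1 = on-topic, 0 = off-topic)
texts = [
    "hurricane sandy flooding the coast",
    "evacuate now the storm surge is rising",
    "power outage after the hurricane hit",
    "stock up on water before the storm",
    "great pizza for dinner tonight",
    "watching the game with friends",
    "new phone arrived today",
    "traffic on the bridge is slow",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words counts feeding a logistic regression classifier
model = make_pipeline(
    CountVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Scored on the training data only to show the API, not as a real evaluation
preds = model.predict(texts)
recall = recall_score(labels, preds)
accuracy = accuracy_score(labels, preds)
print(f"recall={recall:.2f}, accuracy={accuracy:.2f}")

# Words with the largest positive coefficients push predictions toward class 1
vectorizer = model.named_steps["countvectorizer"]
clf = model.named_steps["logisticregression"]
top_words = sorted(
    vectorizer.vocabulary_.items(),
    key=lambda item: clf.coef_[0][item[1]],
    reverse=True,
)[:5]
print([word for word, _ in top_words])
```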

## Potential Use of the Twitter API

This notebook demonstrates one possible method of pulling live tweets using Twitter's API for developers. Developing this method did not fit within the scope of this project, but may be an avenue to explore in the future. It may require some amount of funding to pull tweets freely for specialized projects.

In [1]:
!pip install pylast

Collecting pylast
  Downloading https://files.pythonhosted.org/packages/f5/d9/7ca6f3f9f5687e3f5ae03bf60e502a8a154409b04f4edcfc34b618ca485e/pylast-3.2.0-py2.py3-none-any.whl
Installing collected packages: pylast
Successfully installed pylast-3.2.0


## Loading Libraries

In [84]:
import tweepy
from tweepy import OAuthHandler
import numpy as np
import pandas as pd
from tweepy.streaming import StreamListener
from tweepy import Stream
import json
import sys
import csv

## Authentication Using Access Keys and Tokens

In [45]:
from IPython.core.display import HTML
import pylast

API_KEY = 'EXAMPLE'
API_SECRET = 'EXAMPLE'
ACCESS_TOKEN = 'EXAMPLE'
ACCESS_SECRET = 'EXAMPLE'

auth = OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

api = tweepy.API(auth)

The above code takes the keys and access tokens from a user's developer account. This is how a developer gains authentication to pull data directly from Twitter.

## Creating a Class and Necessary Functions

In [None]:
df = pd.DataFrame()

In [57]:
class StreamListener(tweepy.StreamListener):
    
    # Called for each incoming tweet: save the tweet's text and the author's
    # profile location into the global df
    def on_status(self, status):
        if status.user.location is not None:
            global df
            
            # Append the new row to the global df created initially
            # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
            row = pd.DataFrame([{"text": status.text, "locations": status.user.location}])
            df = pd.concat([df, row], ignore_index=True)
            print(status.text)
        
    # Return False to disconnect the stream on error 420 (rate limiting)
    def on_error(self, status_code):
        if status_code == 420:
            return False

This class adds the text of a tweet and its author's location to the empty dataframe created above.

In [60]:
stream_Listener = StreamListener()

stream_Listener.on_status

<bound method StreamListener.on_status of <__main__.StreamListener object at 0x11738d910>>

In [65]:
stream = tweepy.Stream(auth=api.auth, listener=stream_Listener)
stream.filter(track=["hurricane"], languages=["en"])
pass 

KeyboardInterrupt: 

The above code starts the scrape of Twitter using the keywords and languages specified. At the moment, this runs indefinitely with no sign of returning a usable result. In order for this method to work, I would need to find a way to halt the scrape and return a dataframe with the tweets collected thus far.
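One possible way to halt the scrape is to stop after a fixed number of tweets. The sketch below is untested against the live API and uses a hypothetical `LimitedCollector` class with an assumed `limit` parameter. It relies on my understanding that in tweepy 3.x a stream disconnects when the listener's `on_status` returns `False`. To keep the example runnable without credentials, the class does not inherit from `tweepy.StreamListener` and is exercised with fake status objects.

```python
from types import SimpleNamespace


class LimitedCollector:
    """Collect tweet text and user location until `limit` rows are stored.

    Hypothetical helper: with tweepy 3.x this would subclass
    tweepy.StreamListener so the stream sees the return values below.
    """

    def __init__(self, limit=100):
        self.limit = limit
        self.rows = []

    def on_status(self, status):
        if status.user.location is not None:
            self.rows.append(
                {"text": status.text, "locations": status.user.location}
            )
        # Assumption: returning False asks a tweepy 3.x stream to disconnect
        if len(self.rows) >= self.limit:
            return False
        return True

    def on_error(self, status_code):
        # Disconnect on error 420 (rate limiting), as in the class above
        if status_code == 420:
            return False


# Simulate a stream with fake status objects instead of live tweets
fake_statuses = [
    SimpleNamespace(text=f"tweet {i}", user=SimpleNamespace(location="NJ"))
    for i in range(5)
]
collector = LimitedCollector(limit=3)
for status in fake_statuses:
    if collector.on_status(status) is False:
        break

print(len(collector.rows))  # 3
```

Once the stream has stopped, `pd.DataFrame(collector.rows)` would turn the collected rows into a dataframe. Building the dataframe once at the end also avoids growing it row by row inside the listener.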