# Proof of Concept

## Contents
-  [Configure Postgres Server with Docker](#Configure-Postgres-Server-with-Docker)
-  [Application Token](#Application-Token)
-  [Collect Tweets](#Collect-Tweets)
-  [Retrieve Data from PostgreSQL Database](#Retrieve-Data-from-PostgreSQL-Database)
-  [Clean Text](#Clean-Text)
-  [Model Prep](#Model-Prep)
-  [Logistic Regression Classification](#Logistic-Regression-Classification)
-  [Save Data](#Save-Data)

In [1]:
# !pip install python-twitter

In [2]:
# !pip install psycopg2-binary

In [3]:
import twitter, json, time, re, nltk, pickle

import psycopg2 as pg2
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from psycopg2.extras import RealDictCursor, Json

In [4]:
%run ../assets/sql_test.py
%run ../assets/twitter_credentials.py

## Configure Postgres Server with Docker

Define helper functions to connect to the database and insert data programmatically:
-  **con_cur_to_db**: returns both a connection and a cursor object for the database
-  **execute_query**: executes a query against the database without having to create a connection and cursor each time
-  **insert_entry_json**: inserts JSON records into a table

In [5]:
def con_cur_to_db(dbname=DBNAME, dict_cur=None):
    con = pg2.connect(host=IP_ADDRESS,
                  dbname=dbname,
                  user=USER,
                  password=PASSWORD)
    if dict_cur:
        cur = con.cursor(cursor_factory=RealDictCursor)
    else:
        cur = con.cursor()
    return con, cur
    
def execute_query(query, dbname=DBNAME, dict_cur=None, command=False):
    con, cur = con_cur_to_db(dbname, dict_cur)
    cur.execute(query)
    if not command:
        data = cur.fetchall()  # SELECT-style query: return the rows
        con.close()
        return data
    con.commit()  # command (e.g. CREATE, INSERT): send the change to the server
    con.close()   # close the server connection

def insert_entry_json(data, tablename=None):
    con, cur = con_cur_to_db()
    for x in data:
        # Pass the JSON value as a query parameter so psycopg2 handles the quoting
        cur.execute(f'INSERT INTO {tablename} (data) VALUES (%s);', (Json(x),))
    con.commit()
    con.close()
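
The insert pattern can be tried without a Postgres server. The sketch below uses Python's built-in `sqlite3` module as a stand-in (an assumption for illustration; the notebook itself uses `psycopg2`) and stores each record as JSON text through a query placeholder rather than string formatting. The table name `raw_tweets_demo` and the sample records are made up.

```python
import json
import sqlite3

def insert_json_rows(con, rows):
    """Insert each dict as JSON text using a parameterized query."""
    cur = con.cursor()
    for row in rows:
        # sqlite3 uses '?' placeholders; psycopg2 uses '%s'
        cur.execute('INSERT INTO raw_tweets_demo (data) VALUES (?);',
                    (json.dumps(row),))
    con.commit()

# In-memory database, so nothing is written to disk
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE raw_tweets_demo (id INTEGER PRIMARY KEY, data TEXT);')

sample = [{'text': "it's raining"}, {'text': 'storm warning'}]
insert_json_rows(con, sample)

# Read the rows back and decode the JSON text
stored = [json.loads(d) for (d,) in
          con.execute('SELECT data FROM raw_tweets_demo ORDER BY id;')]
print(stored)
```

The apostrophe in the first record is there on purpose: with a placeholder, the driver takes care of quoting it.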

Create the `raw_tweets` table to store the collected data.

query = '''CREATE TABLE raw_tweets
(id SERIAL,
data JSONB);'''

execute_query(query, command=True)

## Application Token

Define the API keys and instantiate the Twitter API client.

In [6]:
twitter_keys = {
    'consumer_key':        CONSUMER_KEY,
    'consumer_secret':     CONSUMER_SECRET,
    'access_token_key':    ACCESS_TOKEN,
    'access_token_secret': ACCESS_TOKEN_SECRET
}

api = twitter.Api(consumer_key         =   twitter_keys['consumer_key'],
                  consumer_secret      =   twitter_keys['consumer_secret'],
                  access_token_key     =   twitter_keys['access_token_key'],
                  access_token_secret  =   twitter_keys['access_token_secret'],
)

## Collect Tweets

Collect tweets and store them in the database:
-  `term`: term to search by
-  `geocode`: geolocation within which to search for tweets
-  `since`: reference date for the search, formatted `YYYY-MM-DD`
-  `count`: number of results returned per search (100 max)
-  `sql_db`: name of the table to save tweets to

In [7]:
from datetime import timedelta

def streamTweets(term, geocode, since, count, sql_db):
    # Walk backward one day at a time over the week ending at `since`
    before = datetime.strptime(since, '%Y-%m-%d')
    for i in range(1,8):
        after = before - timedelta(days=1)
        
        results = api.GetSearch(
            term = term,
            geocode = geocode,
            since = after.strftime('%Y-%m-%d'),
            until = before.strftime('%Y-%m-%d'),
            count = count,
            return_json = True
        )

        insert_entry_json(data = results['statuses'], 
                          tablename = sql_db)
        before = after  # the next window ends where this one began

Define a function that calls `streamTweets` in a loop to collect tweets programmatically:
-  Repeat the call 15 times by default (`repeats`), requesting up to 100 tweets (`count`) per search
-  Pause for 40 seconds between calls to avoid exceeding the rate limit

In [8]:
def tweet_repeater(term, geocode, since, sql_db, repeats=15, count=100):
    for i in range(repeats):
        streamTweets(term, geocode, since, count, sql_db)
        print(f'Loop {i+1} complete. Raw tweets pushed to {sql_db}.')
        time.sleep(40)
        
    print('All tweets pulled.')
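
As a rough check on the pause length, the sketch below estimates an upper bound on how many search requests the loop can issue in one rate-limit window. It assumes 7 searches per loop (one per day of the week), ignores the time each request takes, and uses a 15-minute window with a limit of 180 requests. Both limit figures are assumptions and should be checked against the current Twitter API documentation.

```python
import math

def max_requests_per_window(calls_per_loop, pause_s, window_s):
    """Upper bound on requests in one window, ignoring request latency."""
    # Number of loops that can start inside the window
    loops_started = math.ceil(window_s / pause_s)
    return loops_started * calls_per_loop

ASSUMED_WINDOW_S = 15 * 60   # assumed rate-limit window, in seconds
ASSUMED_LIMIT = 180          # assumed requests allowed per window

estimate = max_requests_per_window(calls_per_loop=7, pause_s=40,
                                   window_s=ASSUMED_WINDOW_S)
print(estimate, estimate <= ASSUMED_LIMIT)
```

Under these assumptions a 40 second pause stays below the limit with some headroom; a shorter pause can be checked the same way before changing `time.sleep`.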

Collect the most recent tweets:
-  that contain the term `storm` (terms were determined by natural disasters during the time of the search)
-  within a 15 mile radius of the location
-  starting from 2019-01-13
-  run the function 100 times, collecting up to 700 tweets (1 week x 100 tweets) each time
-  save into the `raw_tweets` table
-  sample output is displayed 

We searched over two locations and used the following terms for each:

|Location|Latitude|Longitude|Search Terms|Since Date|
|---|---|---|---|---|
|Malibu, CA|34.0249999|-118.773830238|flood, mudslide, landslide, rain, storm|2019-01-06|
|Riverside, CA|33.9806|-117.3755|flood|2019-01-13|
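
For reference, a search over one of these locations might be set up as below. The helper `make_geocode` is hypothetical; it builds the `latitude,longitude,radius` string form that the Twitter search API accepts for `geocode`, using the 15 mile radius mentioned above. The call to `tweet_repeater` is left commented out because it needs live credentials and a database.

```python
def make_geocode(lat, lon, radius_mi):
    """Format a 'latitude,longitude,radius' geocode string (hypothetical helper)."""
    return f'{lat},{lon},{radius_mi}mi'

malibu = make_geocode(34.0249999, -118.773830238, 15)
riverside = make_geocode(33.9806, -117.3755, 15)
print(malibu)
print(riverside)

# Example call for the Riverside row of the table (not executed here):
# tweet_repeater(term='flood', geocode=riverside, since='2019-01-13',
#                sql_db='raw_tweets', repeats=100)
```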

## Retrieve Data from PostgreSQL Database

Run `SELECT *` to inspect the data structure and find the information most relevant to us.

query = '''SELECT * FROM raw_tweets;'''
response = execute_query(query, dict_cur=True)
print(type(response))
print(type(response[0]))

### Text (Tweets)

Our data is stored as a list of nested dictionaries. We want to retrieve the tweet text (`text`), nested under `data`, and put it in a dataframe (`df_text`).

In [9]:
query = """SELECT data ->> 'text'
FROM raw_tweets;
"""
response = execute_query(query, dict_cur=True)
df_text = pd.DataFrame(response)

### Geo-Coordinates
We then retrieve the geo-coordinates for each tweet so that we can map its location and allocate resources there. These are stored in the dataframe `df_geo`.

In [10]:
query = """SELECT data#>'{place,bounding_box,coordinates}'
FROM raw_tweets;
"""
response = execute_query(query, dict_cur=True)
df_geo = pd.DataFrame(response).dropna()
df_geo.head()

Unnamed: 0,?column?
89,"[[[-118.668404, 33.704538], [-118.155409, 33.7..."
190,"[[[-118.668404, 33.704538], [-118.155409, 33.7..."
191,"[[[-118.668404, 33.704538], [-118.155409, 33.7..."
192,"[[[-118.668404, 33.704538], [-118.155409, 33.7..."
193,"[[[-118.668404, 33.704538], [-118.155409, 33.7..."


Tweets from users who do not have location enabled return `NaN`, so we drop these rows.

In [11]:
df_geo.dropna(inplace = True)
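
To make the two JSON operators used in the queries concrete: `->>` returns a field as text, while `#>` follows a path of keys and yields NULL when any step along the path is missing. The plain-Python sketch below mimics that path lookup on made-up records; `get_path` is a hypothetical helper, not part of psycopg2 or pandas.

```python
def get_path(record, path):
    """Follow a sequence of keys through nested dicts; return None if a step is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

# Made-up records: one with a place attached, one without
with_place = {
    'text': 'storm rolling in',
    'place': {'bounding_box': {'coordinates': [[[-118.7, 33.7], [-118.1, 33.7],
                                                [-118.1, 34.3], [-118.7, 34.3]]]}},
}
without_place = {'text': 'hey bed', 'place': None}

path = ('place', 'bounding_box', 'coordinates')
print(get_path(with_place, path))
print(get_path(without_place, path))  # missing path, so the lookup gives None
```

Rows of the second kind are the ones that `dropna` removes.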

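The bounding box for the geo-fence is a nested list of vertices, each stored as `[longitude, latitude]`. We key into the list, take two opposite corners, and average them to approximate the location of a given tweet. Note that the code stores the index-0 average (the longitude) in the `lat` column and the index-1 average (the latitude) in the `long` column, so the column names are swapped relative to their contents; the values near -118 under `lat` are longitudes.

In [12]:
latitude = []
longitude = []

for tweet in df_geo['?column?']:
    inside = tweet[0][1]   # one corner of the box, as [longitude, latitude]
    outside = tweet[0][3]  # the opposite corner
    lat = (inside[0] + outside[0])/2   # index 0: average longitude
    long = (inside[1] + outside[1])/2  # index 1: average latitude
    latitude.append(lat)
    longitude.append(long)
    
df_geo['lat'] = latitude
df_geo['long'] = longitude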
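
As a cross-check on the averaging, the sketch below computes the centre of a bounding box by averaging all four vertices, treating each vertex as `[longitude, latitude]`. The box values are made up for illustration and `bbox_center` is a hypothetical helper, not the notebook's own code.

```python
def bbox_center(coordinates):
    """Return (latitude, longitude) of the box centre.

    `coordinates` is the nested list found under place.bounding_box.coordinates,
    where each vertex is [longitude, latitude].
    """
    vertices = coordinates[0]
    lon = sum(v[0] for v in vertices) / len(vertices)
    lat = sum(v[1] for v in vertices) / len(vertices)
    return lat, lon

# Made-up box roughly around the Los Angeles area
box = [[[-118.7, 33.7], [-118.1, 33.7], [-118.1, 34.3], [-118.7, 34.3]]]
center_lat, center_lon = bbox_center(box)
print(round(center_lat, 4), round(center_lon, 4))
```

A quick sanity test for any such helper is that the latitude falls between -90 and 90.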

Checking that the `lat` and `long` columns were added to `df_geo`.

In [13]:
df_geo.head()

Unnamed: 0,?column?,lat,long
89,"[[[-118.668404, 33.704538], [-118.155409, 33.7...",-118.411907,34.020789
190,"[[[-118.668404, 33.704538], [-118.155409, 33.7...",-118.411907,34.020789
191,"[[[-118.668404, 33.704538], [-118.155409, 33.7...",-118.411907,34.020789
192,"[[[-118.668404, 33.704538], [-118.155409, 33.7...",-118.411907,34.020789
193,"[[[-118.668404, 33.704538], [-118.155409, 33.7...",-118.411907,34.020789


In [14]:
df_geo.shape

(32490, 3)

We then drop the nested list stored in `?column?`.

In [15]:
df_geo.drop(columns=['?column?'], inplace=True)

### All Data

We merge our text and geo-coordinates data into one dataframe, `df`.

In [16]:
df = pd.merge(df_text, df_geo, left_index=True, right_index=True)

Every Twitter query returns a random subset of tweets containing the specified search term, so repeated queries can return the same tweet more than once. We drop duplicates to ensure every tweet is unique.

In [17]:
df.drop_duplicates(keep='first', inplace=True)

Checking the merged dataframe to ensure the correct output.

In [18]:
df.head()

Unnamed: 0,?column?,lat,long
89,ENCINO UPDATE: LAFD @PIOErikScott says evacuat...,-118.411907,34.020789
191,@twdandmetal @Axis7173 @nearly_departed @unkno...,-118.411907,34.020789
193,Fire crews keeping an eye on hillside that hav...,-118.411907,34.020789
194,@twdandmetal @Axis7173 @nearly_departed @unkno...,-118.411907,34.020789
579,@darreng60 👍🙏☺️,-118.411907,34.020789


Renaming the `?column?` column to prepare for text cleaning.

In [19]:
df.rename(columns={'?column?':'text'}, inplace=True)

In [20]:
df.shape

(917, 3)

## Clean Text

In [21]:
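def processTweet(tweet):
    tweet = tweet.lower()                                              # lowercase
    tweet = re.sub(r'[\s]+', ' ', tweet)                               # collapse runs of whitespace
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', tweet)    # remove URLs
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)                         # keep hashtag text, drop '#'
    tweet = re.sub('@', '', tweet)                                     # drop '@' from mentions
    # Remove the retweet marker; this also strips 'rt' inside words ('departed' -> 'depaed')
    tweet = re.sub('rt', '', tweet)
    return tweet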
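
One detail worth seeing in isolation is the retweet-marker step. The sketch below contrasts a plain substring removal of `rt` with a word-boundary pattern. The sample tweet is made up, and whether to switch patterns depends on how the training data for the downstream vectorizer was cleaned, since both must be processed the same way.

```python
import re

sample = 'rt nearly_departed storm alert in malibu'

# Plain substring removal also deletes 'rt' inside other words
naive = re.sub('rt', '', sample).strip()

# \b anchors the match to a standalone 'rt' token
bounded = re.sub(r'\brt\b', '', sample).strip()

print(naive)
print(bounded)
```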

In [22]:
df['processed'] = [processTweet(i) for i in df['text']]

### Tokenize

In [23]:
tokenizer = RegexpTokenizer(r'\w+')

In [24]:
df['tokenized'] = df['processed'].map(lambda x: tokenizer.tokenize(x))
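
The pattern `\w+` keeps runs of letters, digits, and underscores and discards everything else, which is why punctuation and emoji vanish from the token lists. A minimal sketch using the standard `re` module, assumed here to behave like `RegexpTokenizer(r'\w+')` for this pattern:

```python
import re

def tokenize(text):
    """Return every run of word characters in the text."""
    return re.findall(r'\w+', text)

print(tokenize('encino update: lafd pioerikscott says'))
print(tokenize('darreng60 👍🙏'))
```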

### Lemmatize

In [25]:
lemmatizer = WordNetLemmatizer()

In [26]:
df['lemmatized'] = df['tokenized'].map(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x]))

In [27]:
df.head()

Unnamed: 0,text,lat,long,processed,tokenized,lemmatized
89,ENCINO UPDATE: LAFD @PIOErikScott says evacuat...,-118.411907,34.020789,encino update: lafd pioerikscott says evacuati...,"[encino, update, lafd, pioerikscott, says, eva...",encino update lafd pioerikscott say evacuation...
191,@twdandmetal @Axis7173 @nearly_departed @unkno...,-118.411907,34.020789,twdandmetal axis7173 nearly_depaed unknown_meu...,"[twdandmetal, axis7173, nearly_depaed, unknown...",twdandmetal axis7173 nearly_depaed unknown_meu...
193,Fire crews keeping an eye on hillside that hav...,-118.411907,34.020789,fire crews keeping an eye on hillside that hav...,"[fire, crews, keeping, an, eye, on, hillside, ...",fire crew keeping an eye on hillside that have...
194,@twdandmetal @Axis7173 @nearly_departed @unkno...,-118.411907,34.020789,twdandmetal axis7173 nearly_depaed unknown_meu...,"[twdandmetal, axis7173, nearly_depaed, unknown...",twdandmetal axis7173 nearly_depaed unknown_meu...
579,@darreng60 👍🙏☺️,-118.411907,34.020789,darreng60 👍🙏☺️,[darreng60],darreng60


## Model Prep

### TF-IDF

Since we are primarily interested in tweets related to flooding, we vectorize these samples with the TF-IDF vectorizer that was fit on the training data from the combined flood datasets.

In [28]:
with open(f'./assets/tfidf_flood.pkl', 'rb') as f:
    tfidf_flood = pickle.load(f)

In [29]:
with open(f'./assets/tfidf_flood_col.pkl', 'rb') as f:
    cols = pickle.load(f)

In [30]:
tfidf_df = tfidf_flood.transform(df['lemmatized'])
tfidf_df = pd.DataFrame(tfidf_df.toarray(), columns=cols)
tfidf_df.head()

Unnamed: 0,00,000,10,100,11,12,13,14,15,1800,...,yycflood,yycflood abflood,yycflood relief,yycflood yyc,yycflood yychelps,yycfloods,yychelps,yychelps yycflood,zone,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.446169,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
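
Because the vectorizer was fit earlier, new tweets are only scored on terms that already exist in its vocabulary, which is one reason the matrix above is mostly zeros. The pure-Python sketch below illustrates this with a tiny made-up corpus. It uses a smoothed IDF and L2 normalization, which are assumed to mirror common TF-IDF defaults; it is not the pickled `tfidf_flood` object, and it ignores n-grams and stop words.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn a vocabulary and smoothed IDF weights from training documents."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def transform(doc, idf):
    """Weight in-vocabulary terms by count * IDF, then L2-normalize."""
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Made-up training corpus and new tweets
idf = fit_idf(['flood warning issued', 'flood water rising', 'nice sunny day'])
seen = transform('flood warning in malibu', idf)  # 'in' and 'malibu' are ignored
unseen = transform('hello world', idf)            # nothing in vocabulary
print(seen)
print(unseen)
```

A tweet with no vocabulary overlap becomes an all-zero row, so the classifier has only its intercept to go on for that sample.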


## Logistic Regression Classification

Loading the trained Logistic Regression model.

In [31]:
with open(f'assets/lr_flood.pkl', 'rb') as f:
    lr = pickle.load(f)

Predicting classes for the unlabeled data.

In [32]:
preds = lr.predict(tfidf_df)
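
For intuition about what `predict` does with each TF-IDF row: a binary logistic regression computes a weighted sum of the features, passes it through the sigmoid function, and picks the positive class when the resulting probability exceeds 0.5. The weights, intercept, and feature values below are made up for illustration; they are not taken from `lr_flood.pkl`.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_positive(features, weights, intercept):
    """Probability of the positive (on-topic) class for one sample."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for three terms: 'flood', 'storm', 'game'
weights = [2.5, 1.8, -0.4]
intercept = -1.5

on_topic_row = [0.8, 0.6, 0.0]   # made-up TF-IDF values
off_topic_row = [0.0, 0.0, 1.0]

p_on = predict_proba_positive(on_topic_row, weights, intercept)
p_off = predict_proba_positive(off_topic_row, weights, intercept)
print(round(p_on, 3), p_on > 0.5)
print(round(p_off, 3), p_off > 0.5)
```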

Adding the predicted classes back into the dataframe so we can reference the original text.

In [33]:
df['label_on-topic'] = preds
df.head()

Unnamed: 0,text,lat,long,processed,tokenized,lemmatized,label_on-topic
89,ENCINO UPDATE: LAFD @PIOErikScott says evacuat...,-118.411907,34.020789,encino update: lafd pioerikscott says evacuati...,"[encino, update, lafd, pioerikscott, says, eva...",encino update lafd pioerikscott say evacuation...,False
191,@twdandmetal @Axis7173 @nearly_departed @unkno...,-118.411907,34.020789,twdandmetal axis7173 nearly_depaed unknown_meu...,"[twdandmetal, axis7173, nearly_depaed, unknown...",twdandmetal axis7173 nearly_depaed unknown_meu...,False
193,Fire crews keeping an eye on hillside that hav...,-118.411907,34.020789,fire crews keeping an eye on hillside that hav...,"[fire, crews, keeping, an, eye, on, hillside, ...",fire crew keeping an eye on hillside that have...,False
194,@twdandmetal @Axis7173 @nearly_departed @unkno...,-118.411907,34.020789,twdandmetal axis7173 nearly_depaed unknown_meu...,"[twdandmetal, axis7173, nearly_depaed, unknown...",twdandmetal axis7173 nearly_depaed unknown_meu...,False
579,@darreng60 👍🙏☺️,-118.411907,34.020789,darreng60 👍🙏☺️,[darreng60],darreng60,False


Counting the number of samples assigned to each class.

In [34]:
df['label_on-topic'].value_counts()

False    896
True      21
Name: label_on-topic, dtype: int64

Most of the tweets collected are off-topic, a result we would expect given that there is currently no state of emergency. We sample 10 tweets from each class to see representative examples.

In [35]:
df[df['label_on-topic'].astype(str) == 'False'][['lemmatized', 'label_on-topic']].sample(n=10, axis=0)

Unnamed: 0,lemmatized,label_on-topic
64052,devondstewa don t sta with me logically it doe...,False
166766,honestly uber pool wouldn t be a thing if ther...,False
122410,we really got 7 sub at our school like what s ...,False
29834,the way these clipper fan is booing j30_randle...,False
19753,kaydontplay_ girl it s your year all about you...,False
50763,this is why the medium asks absurd question so...,False
86063,hey bed,False
57757,manilaluzon,False
16499,thatssocute we are aren t we beau,False
17221,they have slowly added in culture under the wh...,False


Tweets classified as off-topic appear to be correctly labeled.

In [36]:
df[df['label_on-topic'].astype(str) == 'True'][['lemmatized', 'label_on-topic']].sample(n=10, axis=0)

Unnamed: 0,lemmatized,label_on-topic
155326,damien flood crest 2017 oil on canvas 30 x 24 ...,True
9187,heathdwilliams lol triple storm front rolling ...,True
145488,malibu storm prep video foxnews bethannstyne n...,True
27986,at citymalibu council meeting city manager rev...,True
7532,little man think it s going to flood whenever ...,True
9184,awoodruff the massive mannasas molasses flood,True
7301,am i crazy to drive into malibu during a rain ...,True
130927,los angeles ca tue jan 15th am forecast today ...,True
13826,a an adult driving through paially flooded str...,True
147075,nike adapt bb sneaker the first shoe of 2019 t...,True


Tweets labeled as on-topic appear to have relevance to a potential weather event. Words such as flood, storm, rain, accident, and safety align with terms identified by our modeling pipeline. Next, we look at the tweets that our model labeled with high and low certainty.

In [37]:
proba = lr.predict_proba(tfidf_df)
proba

array([[0.62675569, 0.37324431],
       [0.99175131, 0.00824869],
       [0.55118037, 0.44881963],
       ...,
       [0.97606998, 0.02393002],
       [0.99631707, 0.00368293],
       [0.99175131, 0.00824869]])

The above shows the class prediction probabilities for each sample. Below, we will assign them into high certainty and low certainty groups so we can review the original text.

In [38]:
high_cert = []
low_cert = []

for i, prob in enumerate(proba):
    # prob[0] is the probability of the first class
    if prob[0] > .75 or prob[0] < .25:
        high_cert.append(i)
    
    elif .40 < prob[0] < .60:
        low_cert.append(i)
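
The thresholds leave two gaps: a first-class probability between 0.25 and 0.40, or between 0.60 and 0.75, lands in neither group. The sketch below applies the same rules to the first three probabilities from the `proba` output above; `certainty_bucket` is a hypothetical helper written for this illustration.

```python
def certainty_bucket(p0):
    """Classify a first-class probability as 'high', 'low', or neither (None)."""
    if p0 > 0.75 or p0 < 0.25:
        return 'high'
    if 0.40 < p0 < 0.60:
        return 'low'
    return None

for p0 in (0.62675569, 0.99175131, 0.55118037):
    print(p0, certainty_bucket(p0))
```

The first sample falls in the gap, so it appears in neither `high_cert` nor `low_cert`.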

In [39]:
df.iloc[low_cert, [5,6]]

Unnamed: 0,lemmatized,label_on-topic
193,fire crew keeping an eye on hillside that have...,False
7320,that s more like it theandykatz stjohnsbball t...,False
7322,good chance of some thunderstorm high wind and...,False
7532,little man think it s going to flood whenever ...,True
9187,heathdwilliams lol triple storm front rolling ...,True
25462,area of malibu will be under mandatory evacuat...,False
28258,mandatory evacuation b c of mudslide debris fl...,False
35108,2011 2019,False
119932,help u fight eating disorder here by donating ...,False


The low-certainty predictions are of mixed accuracy, with tweets such as 7322, 25462, and 28258 incorrectly labeled. Interestingly, the term 'evacuation' appears in two of the incorrectly labeled tweets, which suggests room for improvement in our model parameters.

In [40]:
df.iloc[high_cert, [5,6]].head(20)

Unnamed: 0,lemmatized,label_on-topic
191,twdandmetal axis7173 nearly_depaed unknown_meu...,False
194,twdandmetal axis7173 nearly_depaed unknown_meu...,False
579,darreng60,False
700,i m planning on playing how to survive in sout...,False
977,witch hunt la tonight 1 15 8pm aaannnd we re b...,False
1597,dooooyouuuubooo espn go run in traffic,False
2001,okay this pose is becoming a problem i need to...,False
2004,ponglizardo gillette that s not what i took aw...,False
2193,nevergillette,False
2450,michaelfinleyl1 zeromc we re riding rain or sh...,False


Some of the high-certainty tweets mention weather-related terms but are correctly labeled off-topic, suggesting our model performs well when judging the context of certain terms.

## Save Data

In [41]:
df.to_csv('../data/processed_tweets')