# Project 4 - Client Problem #1: 
# Leveraging Social Media to Map Disasters
# Modelling the hurricane tweets

### This model identifies hurricane impact without classifying individual tweets.

### Each observation is a period of time (one hour here) for a given geographical location. Historical data is used to label each observation according to whether a hurricane was impacting the location at that time.

### During training, the model identifies the words in the text that most strongly indicate a natural disaster impact.
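
The hourly observations described above could be built along the following lines. This is only a sketch under assumptions: the raw `tweets` frame, its `text` column, and the sample rows are hypothetical, and the aggregation actually used in the Exploratory Data Analysis notebook is not shown here. The ` ~|~ ` separator is the one visible in `hourly_text` further down.

```python
import pandas as pd

# Hypothetical raw tweets with UTC timestamps (illustrative data only)
tweets = pd.DataFrame(
    {"text": ["first tweet", "second tweet", "third tweet"]},
    index=pd.to_datetime(
        ["2018-10-01 00:05", "2018-10-01 00:40", "2018-10-01 02:10"]
    ),
)

# Group tweets by the hour they fall in; hours with no tweets produce no row
hourly = tweets.groupby(tweets.index.floor("h"))["text"].agg(
    tweet_count="count",
    hourly_text=" ~|~ ".join,  # same separator that appears in hourly_text
)
print(hourly)
```

Grouping on the floored timestamp, rather than resampling, leaves out empty hours, which would be consistent with the gaps visible in the index of `df` below.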

In [43]:
import json
import pandas as pd
from pprint import pprint
import datetime

import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

### Read in the file that was saved after the Exploratory Data Analysis was complete

In [2]:
df = pd.read_json('tweets_df_michael_time_series_oct.json')

In [3]:
df.head()

Unnamed: 0,tweet_count,hourly_text
2018-10-01 00:00:00,3,"Sorry fellas, she’s unaVEILable! #HechtYeaIDo ..."
2018-10-01 01:00:00,2,Happy birthday to my little princess!!! Love ...
2018-10-01 02:00:00,4,"At 4:24 PM EDT, 2 S Panama City [Bay Co, FL] O..."
2018-10-01 03:00:00,4,"#old pic, and yeah I think I forgot how to whe..."
2018-10-01 09:00:00,1,"We saved carbs all week for this, and it was w..."


In [4]:
df.shape # (341, 2)

(341, 2)

In [7]:
df.iloc[0]

tweet_count                                                    3
hourly_text    Sorry fellas, she’s unaVEILable! #HechtYeaIDo ...
Name: 2018-10-01 00:00:00, dtype: object

In [6]:
df.iloc[0].hourly_text

"Sorry fellas, she’s unaVEILable! #HechtYeaIDo #joiningjordan\n\n👰🏽\nShop my look by clicking this link —&gt; https://t.co/JKFHxk2xOQ \nOr screenshot to shop on the https://t.co/lgaMntMElS app! #liketkit… https://t.co/MLeLYTc1lT ~|~ These two did it ALL this weekend at the beach. #Brothers @ Shell Island White Sand Beach https://t.co/FtMa1krCAo ~|~ 🍍🍍🍍🍍🍍🍍 @ Pineapple Willy's Restaurant https://t.co/pZILxSXKLj"

### Get the count of all the tweets on the day Hurricane Michael made landfall

In [18]:
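# Use .loc for partial-string row selection on the DatetimeIndex;
# plain df['2018-10-10'] is no longer supported in recent pandas versions
df.loc['2018-10-10', 'tweet_count'].sum() # tweet_count: 149
df.loc['2018-10-10'].shape # (23, 2)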

(23, 2)

### Create a target column as the label for prediction training. Initially default it to zero.

In [24]:
df['target'] = 0

In [25]:
df.head()

Unnamed: 0,tweet_count,hourly_text,target
2018-10-01 00:00:00,3,"Sorry fellas, she’s unaVEILable! #HechtYeaIDo ...",0
2018-10-01 01:00:00,2,Happy birthday to my little princess!!! Love ...,0
2018-10-01 02:00:00,4,"At 4:24 PM EDT, 2 S Panama City [Bay Co, FL] O...",0
2018-10-01 03:00:00,4,"#old pic, and yeah I think I forgot how to whe...",0
2018-10-01 09:00:00,1,"We saved carbs all week for this, and it was w...",0


## Hurricane Michael made landfall at Mexico Beach on Oct 10, 2018 at 2pm EDT. That was 6pm UTC, which is the time zone the Twitter timestamps use.

#### We arbitrarily set the target window (the period in which we try to detect the presence of the hurricane) to run from 24 hours before landfall to 72 hours after it, i.e. from Oct 9, 6pm UTC to Oct 13, 6pm UTC.
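
The window arithmetic can be checked with a short sketch. The `demo` frame is a hypothetical, gap-free stand-in for `df`, so the number of labelled rows it reports will differ from the real data, which has missing hours.

```python
import pandas as pd

# Landfall as stated above: Oct 10, 2018 at 2pm EDT
landfall_local = pd.Timestamp("2018-10-10 14:00", tz="America/New_York")
# Convert to UTC and drop the timezone to match a naive UTC index
landfall_utc = landfall_local.tz_convert("UTC").tz_localize(None)

window_start = landfall_utc - pd.Timedelta(hours=24)
window_end = landfall_utc + pd.Timedelta(hours=72)

# Hypothetical gap-free hourly frame standing in for df
demo = pd.DataFrame(
    {"target": 0},
    index=pd.date_range("2018-10-01", "2018-10-20 23:00", freq="h"),
)
# Label-based slicing with .loc includes both endpoints
demo.loc[window_start:window_end, "target"] = 1
print(window_start, window_end, demo["target"].sum())
```

Because both endpoints are included, a 96-hour window over a complete hourly index labels 97 rows.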

In [31]:
df.loc['2018-10-09 18:00:00':'2018-10-13 18:00:00'] # inspect the rows in the window
# Set the label with a single .loc[rows, column] assignment. Chained forms such as
# df[...].target = 1 may assign to a copy and trigger SettingWithCopyWarning.
df.loc['2018-10-09 18:00:00':'2018-10-13 18:00:00', 'target'] = 1


In [34]:
df[df.target==1]
df.head()

Unnamed: 0,tweet_count,hourly_text,target
2018-10-01 00:00:00,3,"Sorry fellas, she’s unaVEILable! #HechtYeaIDo ...",0
2018-10-01 01:00:00,2,Happy birthday to my little princess!!! Love ...,0
2018-10-01 02:00:00,4,"At 4:24 PM EDT, 2 S Panama City [Bay Co, FL] O...",0
2018-10-01 03:00:00,4,"#old pic, and yeah I think I forgot how to whe...",0
2018-10-01 09:00:00,1,"We saved carbs all week for this, and it was w...",0


In [35]:
df.shape # (341, 3)

(341, 3)

### Set the columns of X to be the tweet count, and the hourly text i.e. the combined text of the tweets in that one hour period.

In [41]:
X = df[['tweet_count', 'hourly_text']]
X.head()

Unnamed: 0,tweet_count,hourly_text
2018-10-01 00:00:00,3,"Sorry fellas, she’s unaVEILable! #HechtYeaIDo ..."
2018-10-01 01:00:00,2,Happy birthday to my little princess!!! Love ...
2018-10-01 02:00:00,4,"At 4:24 PM EDT, 2 S Panama City [Bay Co, FL] O..."
2018-10-01 03:00:00,4,"#old pic, and yeah I think I forgot how to whe..."
2018-10-01 09:00:00,1,"We saved carbs all week for this, and it was w..."


In [42]:
y = df.target

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42)

In [48]:
X_train.shape # (255, 2)
X_test.shape # (86, 2)
y_train.shape # (255,)
y_test.shape # (86,)

(86,)

In [51]:
# cvect = CountVectorizer(lowercase=True, stop_words='english',max_df=1.0, min_df=1, 
#                         max_features=None)
cvect = CountVectorizer()

In [63]:
cvect.fit(X_train.hourly_text)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [69]:
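# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
len(cvect.get_feature_names()) # 5085 when fit on X_train (6187 when fit on all of X)
cvect.get_feature_names()[:5]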

['00', '000', '00pm', '03j9p2auhc', '04']
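
To make the vectorizing step more concrete, here is a minimal sketch of what a default `CountVectorizer` does to two made-up texts. The `docs` list and the name `demo_vect` are hypothetical. The sketch reads the vocabulary from the fitted `vocabulary_` attribute, which is available regardless of which feature-name method the installed scikit-learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical hourly texts (illustrative only)
docs = ["Hurricane Michael rain", "rain rain 80 mph a"]

demo_vect = CountVectorizer()
counts = demo_vect.fit_transform(docs)

# vocabulary_ maps each token to its column index; columns are assigned
# in sorted token order, so sorting the keys gives the column order
vocab = sorted(demo_vect.vocabulary_)
print(vocab)
print(counts.toarray())
```

With the default settings, text is lowercased and single-character tokens such as "a" are dropped, which is why numeric tokens like '00' and '000' appear in the real vocabulary while one-character tokens do not.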

In [77]:
X_train_vect = cvect.transform(X_train.hourly_text)

In [88]:
X_train_vect
# <255x5085 sparse matrix of type '<class 'numpy.int64'>'
# 	with 16165 stored elements in Compressed Sparse Row format>
X_train_vect.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [74]:
X_train.shape # (255, 2)
X.shape # (341, 2)
X_test.shape # (86, 2)

(86, 2)

In [80]:
lr = LogisticRegressionCV()

In [81]:
lr.fit(X_train_vect, y_train)



LogisticRegressionCV(Cs=10, class_weight=None, cv='warn', dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='warn', n_jobs=None, penalty='l2',
           random_state=None, refit=True, scoring=None, solver='lbfgs',
           tol=0.0001, verbose=0)

In [82]:
lr.score(X_train_vect, y_train)
# 0.9882352941176471

0.9882352941176471

### So the training score was 0.988

In [83]:
X_test_vect = cvect.transform(X_test.hourly_text)

In [84]:
lr.score(X_test_vect, y_test)
# 0.8255813953488372

0.8255813953488372

### The test score was 0.826, well below the training score (0.988), which indicates the model is overfit.
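
Accuracy on its own can be misleading when the classes are imbalanced, so it may help to compare the test score with a trivial baseline. The sketch below uses only numbers reported in this notebook: 86 test rows, of which 21 are labelled as hurricane (that count comes from the results section further down). The variable names are illustrative.

```python
# Counts and scores reported in this notebook
n_test = 86          # rows in the test set
n_positive = 21      # test rows labelled as hurricane (actual == 1)
train_accuracy = 0.9882352941176471
test_accuracy = 0.8255813953488372

# Accuracy of always predicting "no hurricane"
baseline_accuracy = (n_test - n_positive) / n_test
# Gap between training and test accuracy
gap = train_accuracy - test_accuracy
print(round(baseline_accuracy, 3), round(gap, 3))
```

Under these numbers the majority-class baseline is about 0.756, so the model improves on it by roughly seven percentage points on the test set.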

### Now we will identify the words that were considered most significant in identifying the target variable.

In [85]:
lr.coef_

array([[ 0.1396091 ,  0.02024425, -0.0139385 , ..., -0.00707096,
        -0.00707096, -0.00707096]])

In [117]:
coefs = lr.coef_[0]
# Assumed definition: features_array is not defined anywhere in this notebook
features_array = np.array(cvect.get_feature_names())
top_ten = np.argpartition(coefs, -10)[-10:] # indices of the ten largest coefficients (unordered)

top_ten[np.argsort(coefs[top_ten])] # sort the top ten, ascending by coefficient
# array([ 393,   41, 4579, 2900, 3288,   74,   28, 3654,  337, 2267])
# last one is most important

# In order of importance - most important first
# features_array[2267] # 'hurricanemichael'
# features_array[337] # '80'
# features_array[3654] # 'rain'
# features_array[28] # '100'
# features_array[74] # '13mph'
# features_array[3288] # 'panhandle'
# features_array[2900] # 'michael'
# features_array[4579] # 'tyndall'
# features_array[41] # '1027mb'
# features_array[393] # '9mph'

'100'

### The ten words with the largest positive coefficients, most important first, were

### hurricanemichael, 80, rain, 100, 13mph, panhandle, michael, tyndall, 1027mb, 9mph
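
The index-to-word lookup used above can be sketched on a toy example. The coefficients and words below are made up for illustration, and `demo_coefs`, `demo_words`, and `k` are hypothetical names; only the `argpartition` and `argsort` pattern mirrors the cell above.

```python
import numpy as np

# Hypothetical coefficients and vocabulary (illustrative only)
demo_coefs = np.array([0.10, -0.30, 0.90, 0.05, 0.40])
demo_words = np.array(["beach", "birthday", "hurricane", "lunch", "rain"])

k = 2
# argpartition places the k largest values in the last k slots, unordered
top_k = np.argpartition(demo_coefs, -k)[-k:]
# Order those k indices from largest to smallest coefficient
top_k = top_k[np.argsort(demo_coefs[top_k])[::-1]]

print(demo_words[top_k])
```

Indexing the word array with the sorted indices returns the words directly, which avoids looking up each index by hand.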

In [91]:
lr.coef_[0]

array([ 0.1396091 ,  0.02024425, -0.0139385 , ..., -0.00707096,
       -0.00707096, -0.00707096])

### Now we will look at how many false positives and false negatives we had.

In [122]:
preds_array = lr.predict(X_test_vect)
#y_test
#y_test_preds = y_test.assign(target = preds_array)
type(y_test) # pandas.core.series.Series

pandas.core.series.Series

In [139]:
#cvect.get_feature_names()
type(preds_array) # numpy.ndarray

numpy.ndarray

In [142]:
X_test.shape # (86, 2)
X_test.head()

Unnamed: 0,tweet_count,hourly_text
2018-10-17 01:00:00,5,#hurricanemichael photos @rickcyoung @ Bay Cou...
2018-10-07 01:00:00,4,Kisses for the couple #itskloiberingtime #rosi...
2018-10-06 22:00:00,7,Panama City FL Sat Oct 6th PM Forecast: TONIGH...
2018-10-03 04:00:00,1,light intensity drizzle -&gt; clear sky\ntempe...
2018-10-07 16:00:00,5,I'm at Grayton Beach State Park in Santa Rosa ...


In [147]:
X_test_actual = X_test.assign(actual=y_test)

In [149]:
X_test_actual.head()

Unnamed: 0,tweet_count,hourly_text,actual
2018-10-17 01:00:00,5,#hurricanemichael photos @rickcyoung @ Bay Cou...,0
2018-10-07 01:00:00,4,Kisses for the couple #itskloiberingtime #rosi...,0
2018-10-06 22:00:00,7,Panama City FL Sat Oct 6th PM Forecast: TONIGH...,0
2018-10-03 04:00:00,1,light intensity drizzle -&gt; clear sky\ntempe...,0
2018-10-07 16:00:00,5,I'm at Grayton Beach State Park in Santa Rosa ...,0


In [154]:
X_test_actual_preds = X_test_actual.assign(predicted=preds_array)

### Results:
### Of the 14 hours predicted as hurricane, 10 were correct (4 false positives).
### Of the 21 hours actually labelled as hurricane, 11 were incorrectly predicted as not hurricane (11 false negatives).
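
These counts translate into precision and recall as sketched below. Only the four input counts come from the notebook; the variable names are illustrative, and the true-negative count is derived from the others.

```python
# Counts reported in the results above
n_test = 86       # rows in the test set
true_pos = 10     # predicted hurricane and actually hurricane
pred_pos = 14     # all rows predicted as hurricane
actual_pos = 21   # all rows actually labelled as hurricane

false_pos = pred_pos - true_pos
false_neg = actual_pos - true_pos
true_neg = n_test - true_pos - false_pos - false_neg

precision = true_pos / pred_pos
recall = true_pos / actual_pos
accuracy = (true_pos + true_neg) / n_test
print(round(precision, 3), round(recall, 3), round(accuracy, 3))
```

The derived accuracy matches the test score of 0.826 reported earlier, and the recall of about 0.476 shows that the model misses more than half of the hurricane hours in the test set.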

In [164]:
X_test_actual_preds[X_test_actual_preds.actual==1].shape # (21, 4)
X_test_actual_preds[X_test_actual_preds.predicted==1] 
# 10 are correctly predicted as hurricane out of 14 total that are predicted as hurricane

X_test_actual_preds[X_test_actual_preds.actual==1] 
# 11 are incorrectly marked as not hurricane out of 21 actual hurricane flags

#X_test.shape # (86, 2)

Unnamed: 0,tweet_count,hourly_text,actual,predicted
2018-10-12 05:00:00,2,🌀 #TeamTrimew #prayforflorida #panamabeach #pa...,1,0
2018-10-10 10:00:00,2,WALL TO WALL COVERAGE!!!!!!!! TUNE INTO ALL OF...,1,0
2018-10-10 06:00:00,2,"🌤 @ Panama City, Florida https://t.co/SRo9kDNU...",1,0
2018-10-12 04:00:00,9,St. John the Evangelist middle school @ St. An...,1,1
2018-10-11 06:00:00,1,moderate rain -&gt; light rain\ntemperature up...,1,1
2018-10-09 23:00:00,5,Where did everyone go? Beautiful day out here....,1,1
2018-10-10 04:00:00,4,Livvie loving the waves! #goldendoodle #waves ...,1,0
2018-10-11 14:00:00,3,"Disgruntled tourists. @ Rosemary Beach, Florid...",1,0
2018-10-11 03:00:00,2,Our house! But we are safe and our house has m...,1,0
2018-10-12 14:00:00,4,I’ve seen pictures and videos of the damage in...,1,0
