### In this notebook, to further explore our predictive model's abilities, we ran predictions on additional pre-labeled disaster Twitter datasets that the model had not seen before. We found datasets from CrisisLEX containing labeled tweets from the 2015 Nepal earthquake and the 2013 Queensland floods. These datasets were used solely to test the trained model's ability to make predictions.

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, VotingClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from nltk.corpus import stopwords
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt 
import seaborn as sns 
import pickle 

In [43]:
#reading in dataset focusing on the earthquake.
testing_df = pd.read_csv('2015_Nepal_Earthquake_train.tsv', sep='\t', encoding="ISO-8859-1")

In [4]:
testing_df.head()

Unnamed: 0,tweet_id,text,label
0,591902739002560512,RT @AnupKaphle: #Nepal's prime minister addres...,relevant
1,592939706788216832,@jonsnowC4 So have we; read our friends blog f...,relevant
2,592591542168252416,Lend a helping hand if you can #Nepal https://...,relevant
3,591903009279385600,@shilpaanand they've managed to reach Kathmand...,relevant
4,592099765271199744,Israel Sending Aid Teams to Nepal After Quake:...,relevant


In [5]:
testing_df['label'].value_counts()

not_relevant    3606
relevant        3293
Name: label, dtype: int64

In [11]:
testing_df.isna().sum()

tweet_id    0
text        0
label       0
dtype: int64

In [6]:
# map the true "label" column to 1s and 0s for ease of reading

testing_df['label'] = testing_df['label'].map({'not_relevant': 0, 'relevant': 1})

In [44]:
#baseline score

testing_df['label'].value_counts(normalize=True)

not_relevant    0.522684
relevant        0.477316
Name: label, dtype: float64
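
The baseline here is the share of the majority class, that is, the accuracy of always predicting the most common label. A small sketch reproducing it from the label counts shown earlier (the variable names are illustrative, not from the notebook):

```python
# Label counts taken from the value_counts() output above.
label_counts = {"not_relevant": 3606, "relevant": 3293}

total = sum(label_counts.values())

# The majority class is the label with the highest count.
majority_label = max(label_counts, key=label_counts.get)

# Always predicting the majority label is correct this often.
baseline_accuracy = label_counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 6))
```

Any model worth keeping should score above this figure on the same data.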

In [10]:
#loaded in the ultimate choice saved voting classifier model.

with open('vote_model_save2', 'rb') as f:
    trained_model3 = pickle.load(f)

In [13]:
# use regex to strip URLs, r/ references, and other extraneous symbols from the text

testing_df['text'] = testing_df['text'].map(lambda x: x.lower())
# note: [^s] matches any character except the letter "s"; pattern kept as written
testing_df['text'] = testing_df['text'].map(lambda x: re.sub(r'\s[\/]?r\/[^s]+', ' ', x))
testing_df['text'] = testing_df['text'].map(lambda x: re.sub(r'http[s]?:\/\/[^\s]*', ' ', x))
# regex=True is required on recent pandas versions, where str.replace defaults to literal matching
testing_df['text'] = testing_df['text'].str.replace(r'[^\w\s#@/:%.,_-]', '', flags=re.UNICODE, regex=True)
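
The same four cleaning steps are applied to every dataset in this notebook, so they could be gathered into one helper. A minimal sketch using only the standard `re` module; the name `clean_text` and the constant names are hypothetical, and the patterns are copied unchanged from the cell above:

```python
import re

# Patterns copied from the notebook cell, written as raw strings.
REDDIT_REF = re.compile(r'\s[\/]?r\/[^s]+')
URL = re.compile(r'http[s]?:\/\/[^\s]*')
DISALLOWED = re.compile(r'[^\w\s#@/:%.,_-]', flags=re.UNICODE)


def clean_text(text):
    """Lowercase a tweet, blank out r/ references and URLs, drop other symbols."""
    text = text.lower()
    text = REDDIT_REF.sub(' ', text)
    text = URL.sub(' ', text)
    return DISALLOWED.sub('', text)


print(clean_text("Lend a HELPING hand! #Nepal https://t.co/abc123"))
print(clean_text("NO! we've suffered enough!"))
```

Assuming the DataFrames in this notebook, `testing_df['text'].map(clean_text)` would apply the helper to a whole column in one step.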

In [14]:
testing_df['predictions'] = trained_model3.predict(testing_df['text'])

In [16]:
testing_df['sum'] = testing_df['label'] + testing_df['predictions']

### Created a 'sum' column in order to compare the 'label' column to the 'predictions' column. In doing this:
- "2" means the tweet had previously been marked as relevant to the disaster and the model **correctly** predicted it was relevant.
- "1" means the label and the prediction disagree (marked not relevant but predicted relevant, or marked relevant but predicted not relevant), therefore an **incorrect** prediction.
- "0" means the tweet had previously been marked not relevant and the model **correctly** predicted it to be not relevant.
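
To make the encoding concrete, here is a small sketch on made-up labels and predictions (the values are illustrative, not taken from the datasets). It shows that accuracy is the share of 0s and 2s, and that a sum of 1 combines two different kinds of error, which can be separated by pairing each label with its prediction:

```python
from collections import Counter

# Toy data: 1 = relevant, 0 = not relevant (illustrative values only).
labels      = [1, 1, 0, 0, 1, 0]
predictions = [1, 0, 1, 0, 1, 0]

# Same encoding as the 'sum' column: label + prediction.
sums = [y + p for y, p in zip(labels, predictions)]
counts = Counter(sums)

# 0 and 2 are correct predictions, 1 is any disagreement.
accuracy = (counts[0] + counts[2]) / len(sums)

# Pairing label and prediction separates the two error types hidden in sum == 1.
pairs = Counter(zip(labels, predictions))
false_negatives = pairs[(1, 0)]  # labeled relevant, predicted not relevant
false_positives = pairs[(0, 1)]  # labeled not relevant, predicted relevant

print(sums, accuracy, false_negatives, false_positives)
```

The `confusion_matrix` and `accuracy_score` functions imported at the top of the notebook report the same quantities directly from the label and prediction columns.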

In [19]:
testing_df['sum'].value_counts(normalize=True)

1    0.389332
0    0.313959
2    0.296710
Name: sum, dtype: float64

#### We can see that the model scored about 61% accuracy on this earthquake dataset (0.314 + 0.297 of the tweets have a sum of 0 or 2). Not great, but still above the roughly 52% baseline.

In [20]:
#importing the first flood dataset

flood_test_df = pd.read_csv('2013_Queensland_Floods_dev.tsv', sep='\t', encoding="ISO-8859-1")

In [21]:
flood_test_df

Unnamed: 0,tweet_id,text,label
0,295530424854269952,Fuck It.. Chelsea should have been all over Br...,not_relevant
1,297305363668140032,Hey Dana Does @Alistairovereem gets the title ...,not_relevant
2,296215855602204672,@mimstacey @janecaro game over. In most states...,not_relevant
3,297199390697852928,I just made a new word: Awkwul . A mix between...,not_relevant
4,295358655036010497,Nothing like stifling heat to make me want to ...,not_relevant
...,...,...,...
998,296115486503075840,Grafton Queensland Flood Peaks at 10.7 Meters:...,relevant
999,295937544577748993,Helicopters Deployed to Rescue Flood Victims i...,relevant
1000,296180025714155520,RT @maltesemanor: NO! we've suffered enough! M...,relevant
1001,296223927305383937,"@skeletonunicorn the waters upto the door, my ...",relevant


In [45]:
#baseline score

flood_test_df['label'].value_counts(normalize = True)

1    0.539382
0    0.460618
Name: label, dtype: float64

In [24]:
#mapped "label" column to 1's and 0's for ease of reading

flood_test_df['label'] = flood_test_df['label'].map({'not_relevant':0, 'relevant': 1,})

In [25]:
# use regex to strip URLs, r/ references, and other extraneous symbols from the text

flood_test_df['text'] = flood_test_df['text'].map(lambda x: x.lower())
# note: [^s] matches any character except the letter "s"; pattern kept as written
flood_test_df['text'] = flood_test_df['text'].map(lambda x: re.sub(r'\s[\/]?r\/[^s]+', ' ', x))
flood_test_df['text'] = flood_test_df['text'].map(lambda x: re.sub(r'http[s]?:\/\/[^\s]*', ' ', x))
# regex=True is required on recent pandas versions, where str.replace defaults to literal matching
flood_test_df['text'] = flood_test_df['text'].str.replace(r'[^\w\s#@/:%.,_-]', '', flags=re.UNICODE, regex=True)

In [26]:
#model predictions

flood_test_df['predictions'] = trained_model3.predict(flood_test_df['text'])

In [27]:
flood_test_df

Unnamed: 0,tweet_id,text,label,predictions
0,295530424854269952,fuck it.. chelsea should have been all over br...,0,0
1,297305363668140032,hey dana does @alistairovereem gets the title ...,0,0
2,296215855602204672,@mimstacey @janecaro game over. in most states...,0,0
3,297199390697852928,i just made a new word: awkwul . a mix between...,0,0
4,295358655036010497,nothing like stifling heat to make me want to ...,0,0
...,...,...,...,...
998,296115486503075840,grafton queensland flood peaks at 10.7 meters:...,1,1
999,295937544577748993,helicopters deployed to rescue flood victims i...,1,1
1000,296180025714155520,rt @maltesemanor: no weve suffered enough mt @...,1,1
1001,296223927305383937,"@skeletonunicorn the waters upto the door, my ...",1,0


### Created a 'sum' column in order to compare the 'label' column to the 'predictions' column. In doing this:
- "2" means the tweet had previously been marked as relevant to the disaster and the model **correctly** predicted it was relevant.
- "1" means the label and the prediction disagree (marked not relevant but predicted relevant, or marked relevant but predicted not relevant), therefore an **incorrect** prediction.
- "0" means the tweet had previously been marked not relevant and the model **correctly** predicted it to be not relevant.

In [28]:
flood_test_df['sum'] = flood_test_df['label'] + flood_test_df['predictions']

In [30]:
# scores for the predictions on this test set
flood_test_df['sum'].value_counts(normalize=True)

2    0.440678
0    0.433699
1    0.125623
Name: sum, dtype: float64

#### We can see above that the model was about 87% correct (0.441 + 0.434 of the tweets have a sum of 2 or 0) when predicting whether a tweet was relevant for this flood dataset.

In [31]:
#importing a different flood dataset to test on.

flood_test_df_2 = pd.read_csv('2013_Queensland_Floods_train.tsv', sep='\t', encoding="ISO-8859-1")

In [35]:
flood_test_df_2['label'].value_counts(normalize=True)

relevant        0.539625
not_relevant    0.460375
Name: label, dtype: float64

In [36]:
flood_test_df_2['label'] = flood_test_df_2['label'].map({'not_relevant':0, 'relevant': 1,})

In [38]:
# use regex to strip URLs, r/ references, and other extraneous symbols from the text

flood_test_df_2['text'] = flood_test_df_2['text'].map(lambda x: x.lower())
# note: [^s] matches any character except the letter "s"; pattern kept as written
flood_test_df_2['text'] = flood_test_df_2['text'].map(lambda x: re.sub(r'\s[\/]?r\/[^s]+', ' ', x))
flood_test_df_2['text'] = flood_test_df_2['text'].map(lambda x: re.sub(r'http[s]?:\/\/[^\s]*', ' ', x))
# regex=True is required on recent pandas versions, where str.replace defaults to literal matching
flood_test_df_2['text'] = flood_test_df_2['text'].str.replace(r'[^\w\s#@/:%.,_-]', '', flags=re.UNICODE, regex=True)

In [39]:
flood_test_df_2['predictions'] = trained_model3.predict(flood_test_df_2['text'])

In [41]:
flood_test_df_2['sum'] = flood_test_df_2['label'] + flood_test_df_2['predictions']

In [42]:
flood_test_df_2['sum'].value_counts(normalize=True)

2    0.462701
0    0.434790
1    0.102509
Name: sum, dtype: float64

#### Similar to above, the scoring shows that the trained model was about 90% accurate (0.463 + 0.435 of the tweets have a sum of 2 or 0) on this testing dataset.
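
Because a sum of 0 or 2 marks a correct prediction, accuracy on each dataset is the combined share of those two values. The sketch below recomputes accuracy from the proportions printed in this notebook and compares each with its majority-class baseline. The dictionary layout and the function name `accuracy_from_sum` are illustrative assumptions, not part of the original notebook:

```python
# Proportions copied from the value_counts(normalize=True) outputs in this notebook.
reported = {
    "Nepal earthquake (train file)": {
        "sum": {0: 0.313959, 1: 0.389332, 2: 0.296710},
        "baseline": 0.522684,
    },
    "Queensland floods (dev file)": {
        "sum": {0: 0.433699, 1: 0.125623, 2: 0.440678},
        "baseline": 0.539382,
    },
    "Queensland floods (train file)": {
        "sum": {0: 0.434790, 1: 0.102509, 2: 0.462701},
        "baseline": 0.539625,
    },
}


def accuracy_from_sum(shares):
    """Accuracy is the share of rows where label + prediction is 0 or 2."""
    return shares[0] + shares[2]


for name, row in reported.items():
    acc = accuracy_from_sum(row["sum"])
    lift = acc - row["baseline"]
    print(f"{name}: accuracy {acc:.3f}, baseline {row['baseline']:.3f}, lift {lift:+.3f}")
```

The lift column makes the gap between the earthquake and flood results easy to see at a glance.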