# CS4447 Final Project - Predicting Real Disasters from Tweets
## Hafez Gharbiah, Tyler Christeson
## data: https://www.kaggle.com/vbmokin/nlp-with-disaster-tweets-cleaning-data

## Rubric + Guidelines
1. Proper tagging of Github repository for final report as per deadlines (0.5 = 0.25 + 0.25 points)
1. Dataset and motivation slide (1 point)
    - How/why the dataset was collected and a description of the metadata of your dataset.
1. Actual task definition/research question (2 points)
    - What real-world problem are you trying to solve? What are the input and output of your analysis?
1. Literature review (2 points)
    - What other work has been done in this area, and how is your work novel compared to others?
1. Quality of cleaning (6 points, 2 points each) 
    - Data cleaning and type conversion activity. Please share anything unusual you faced during this activity.
    - What did you do about missing values and why? Handling missing values properly is very important.
    - New feature/attribute creation and data summary statistics and interpretation.
1. Visualization (8 points, 2 points each)
    - Data visualization activity (box plot, bar plot, violin plot, and pairplot to see relationships and distribution, etc.).
    - Describe anything you find in the data after each visualization.
    - What the data visualizations helped you understand about the data distribution.
    - What you did about possible outliers suggested by the data distribution visualizations. (Did you confirm with your client whether each is actually an outlier, or put a disclosure statement in your notebook if you decided to remove it?)

- The problem we're trying to solve is predicting whether a tweet is about a real disaster, which could help determine whether emergency services need to be sent. The input is the tweet and its metadata; the output is a binary label.

- We have a collection of 10,000 tweets. Each record has a unique identifier, the text of the tweet, the location the tweet was sent from, a keyword that could be used to identify disasters, and a target label saying whether or not the tweet is about a real disaster (the label is present on only some of the tweets).

- Examples of records:
    - "Heard about # earthquake is different cities, stay safe everyone ." 
    - "Please like and share our new page for our Indoor Trampoline Park Aftershock opening this fall !" 
    - " nowplaying Alfons - Ablaze 2015 on Puls Radio pulsradio" 
    - "Coincidence Or # Curse ? Still # Unresolved Secrets From Past # accident"

- This is a noisy dataset because the tweets are not all about disasters, and certain disaster keywords are used in contexts that have nothing to do with disasters. For example, while "ablaze" is used in several real disaster tweets about ongoing fires, in the above example it is a song title. The same is true for many keywords, like "accident" and "aftershock" above.

- Feature engineering can be used in this dataset to:
    - extract years, to see whether a tweet refers to an event that isn't current
    - extract news network names, to determine whether the event is being reported on
    - apply topic modeling, to extract relevant topics as features
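
The first idea in the list above, extracting years, could be sketched with a regular expression. This is only an illustration: `extract_years`, `mentions_past_year`, and the `reference_year` parameter are hypothetical names that do not appear in the notebook, and the pattern only recognizes four-digit years starting with 19 or 20.

```python
import re

# Four-digit years from 1900 to 2099, bounded so that longer numbers are not matched.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")

def extract_years(text):
    """Return every four-digit year found in the text, as integers."""
    return [int(match) for match in YEAR_PATTERN.findall(text)]

def mentions_past_year(text, reference_year):
    """True if the text names a year earlier than reference_year (an assumed cutoff)."""
    return any(year < reference_year for year in extract_years(text))

# One of the example records quoted above.
song_tweet = " nowplaying Alfons - Ablaze 2015 on Puls Radio pulsradio"
print(extract_years(song_tweet))
print(mentions_past_year(song_tweet, reference_year=2016))
```

The resulting list or boolean could then be stored as an extra column next to the bag-of-words features; which reference year makes sense depends on when the tweets were collected, which this notebook does not state.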

In [148]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import pandas as pd
import numpy as np
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

from sklearn.feature_extraction.text import CountVectorizer
import sklearn
# from sklearn.naive_bayes import GaussianNB
# from sklearn.metrics import confusion_matrix
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier

In [149]:
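# Load the cleaned train and test sets, using the first CSV column as the index
traindf = pd.read_csv('train_data_cleaning.csv', index_col=0)
testdf = pd.read_csv('test_data_cleaning.csv', index_col=0)
# Keyword frequencies among tweets labeled as not about a real disaster (target == 0)
traindf.keyword[traindf.target == 0].value_counts()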

body%20bags          40
armageddon           37
harm                 37
deluge               36
wrecked              36
                     ..
oil%20spill           1
outbreak              1
typhoon               1
suicide%20bombing     1
suicide%20bomber      1
Name: keyword, Length: 218, dtype: int64
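
The counts above show that multi-word keywords are stored with URL-encoded spaces (`%20`). If readable keywords are wanted, one possible approach is the standard library's `urllib.parse.unquote`. The helper name `decode_keyword` is hypothetical, and the sketch assumes missing keywords should simply pass through unchanged.

```python
from urllib.parse import unquote

def decode_keyword(value):
    """Decode percent-encoded characters; leave non-string values (e.g. missing keywords) untouched."""
    if isinstance(value, str):
        return unquote(value)
    return value

# Sample values taken from the keyword counts above, plus a missing value.
for raw in ["body%20bags", "oil%20spill", "typhoon", None]:
    print(repr(raw), "->", repr(decode_keyword(raw)))
```

On the DataFrame this would presumably be applied with something like `traindf.keyword.map(decode_keyword)`; that call is not run here.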

In [150]:
# Quality of cleaning (6 points, 2 points each)
### Data cleaning and type conversion activity. Please share anything unusual you faced during this activity.
### What did you do about missing values and why? Handling missing values properly is very important.
### New feature/attribute creation and data summary statistics and interpretation. 
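train_text = traindf.text
stopwords = nltk.corpus.stopwords.words('english')
# Split each tweet into word tokens
train_text = [nltk.word_tokenize(tweet) for tweet in train_text]
# Drop stopwords, then lowercase. The stopword list is lowercase and the check uses the
# original casing, so capitalized stopwords such as "Our" or "I" are kept.
train_text = [[w.lower() for w in tokens if w not in stopwords] for tokens in train_text]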

wnetl = WordNetLemmatizer()

def nltk_tag_pos(tag):
    # Map a Penn Treebank POS tag (as returned by nltk.pos_tag) to the matching WordNet POS constant.
    # Adapted from https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python
    if tag[0] == 'J':
        return wordnet.ADJ
    elif tag[0] == 'V':
        return wordnet.VERB
    elif tag[0] == 'N':
        return wordnet.NOUN
    elif tag[0] == 'R':
        return wordnet.ADV
    else:
        return wordnet.NOUN  # noun is the lemmatizer's default POS
    
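# POS-tag each tokenized tweet, lemmatize every token with its mapped POS, then rejoin into strings
train_text_POS = [nltk.pos_tag(tokens) for tokens in train_text]
train_text = [[wnetl.lemmatize(word, nltk_tag_pos(tag)) for word, tag in tagged] for tagged in train_text_POS]
train_text = [' '.join(tokens) for tokens in train_text]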

traindf.text = train_text
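
In the cleaning cell above, the stopword test runs before lowercasing, so a capitalized stopword is not removed. The sketch below contrasts that order with a lowercase-first alternative. It uses a tiny hand-written stopword set and made-up tokens instead of NLTK's list and the real tweets, and the function names are hypothetical.

```python
# Tiny stand-in for NLTK's English stopword list (assumption, for illustration only).
STOPWORDS = {"our", "are", "the", "of", "this"}

def filter_then_lower(tokens):
    """Order used in the cell above: test the original casing, then lowercase."""
    return [w.lower() for w in tokens if w not in STOPWORDS]

def lower_then_filter(tokens):
    """Alternative order: lowercase first, so capitalized stopwords are also removed."""
    lowered = (w.lower() for w in tokens)
    return [w for w in lowered if w not in STOPWORDS]

tokens = ["Our", "Deeds", "are", "the", "Reason"]
print(filter_then_lower(tokens))   # "our" survives because "Our" is not in the set
print(lower_then_filter(tokens))   # "our" is removed
```

Switching the order would change the cleaned text and therefore every downstream number in this notebook, so the results shown here would need to be regenerated.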


In [151]:
train_text

['our deed reason # earthquake may allah forgive u',
 'forest fire near la ronge sask . canada',
 "all resident ask ' shelter place ' notified officer . no evacuation shelter place order expect",
 '13,000 people receive # wildfire evacuation order california',
 'just get sent photo ruby # alaska smoke # wildfire pour school',
 '# rocky fire update = > california hwy . 20 closed direction due lake county fire - # cafire # wildfire',
 '# flood # disaster heavy rain cause flash flood street manitou , colorado spring area',
 'i top hill i see fire wood .',
 'there emergency evacuation happen building across street',
 'i afraid tornado come area .',
 'three people die heat wave far',
 'haha south tampa get flood hah - wait a second i live in south tampa what be i gon na do what be i gon na do fvck # flood',
 '# rain # flood # florida # tampabay # tampa 18 19 day . i lose count',
 '# flood bago myanmar # we arrive bago',
 'damage school bus 80 multi car crash # break',
 'what man ?',
 'i lov

In [152]:
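# Bag-of-words model: keep only the 1,000 most frequent terms across the corpus
cv = CountVectorizer(max_features=1000)
# Convert the sparse count matrix to a dense array, since GaussianNB does not accept sparse input
x = cv.fit_transform(train_text).toarray()
y = traindf.target.values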
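
To make the bag-of-words step concrete, here is a simplified pure-Python sketch of the idea behind `CountVectorizer(max_features=...)`. It is not scikit-learn's implementation: it splits on whitespace only, breaks frequency ties alphabetically (an assumption made for determinism), and the function name `bag_of_words` is hypothetical.

```python
from collections import Counter

def bag_of_words(docs, max_features):
    """Return (vocabulary, count rows) using the max_features most frequent tokens."""
    tokenized = [doc.lower().split() for doc in docs]
    totals = Counter(tok for doc in tokenized for tok in doc)
    # Rank by descending corpus frequency, ties broken alphabetically, then keep the top terms.
    top_terms = sorted(totals, key=lambda t: (-totals[t], t))[:max_features]
    vocab = sorted(top_terms)
    index = {term: i for i, term in enumerate(vocab)}
    rows = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for tok in doc:
            if tok in index:
                row[index[tok]] += 1
        rows.append(row)
    return vocab, rows

# Shortened, illustrative documents loosely based on the cleaned tweets above.
docs = ["forest fire near la ronge", "rocky fire update california fire", "flood street colorado"]
vocab, rows = bag_of_words(docs, max_features=3)
print(vocab)
for row in rows:
    print(row)
```

Each row is one document and each column one vocabulary term, which is the shape of the `x` matrix passed to the classifier.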


In [153]:
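#70:30 train-test split
# Import the submodules explicitly; a bare "import sklearn" does not guarantee they are loaded
import sklearn.model_selection
import sklearn.naive_bayes
import sklearn.metrics

xtrain, xtest, ytrain, ytest = sklearn.model_selection.train_test_split(x, y, test_size=0.3, random_state=1234)

# Gaussian naive Bayes on the dense bag-of-words counts
classifier = sklearn.naive_bayes.GaussianNB()
classifier.fit(xtrain, ytrain)

ypred = classifier.predict(xtest)

# Rows are true classes, columns are predicted classes
confusionMatrix = sklearn.metrics.confusion_matrix(ytest, ypred)
print(confusionMatrix)
tn, fp, fn, tp = confusionMatrix.ravel()
accuracy = (tp + tn) / confusionMatrix.sum()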
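print(f'Model accuracy is: {accuracy}')
# Baseline for comparison: target.mean() is the share of tweets labeled as real disasters
# (about 0.43), so always predicting "not a disaster" would be right about 57% of the time.
print(f'Share of training tweets labeled as real disasters: {traindf.target.mean()}')

[[1209   79]
 [ 435  561]]
Model accuracy is: 0.7749562171628721
Share of training tweets labeled as real disasters: 0.4296597924602653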
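
Accuracy alone hides how the errors are distributed. The short calculation below derives precision, recall, F1, and a majority-class baseline from the four counts in the confusion matrix printed above. The counts are copied by hand rather than read from `confusionMatrix`, and the variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix printed above
# (rows: true class, columns: predicted class).
tn, fp = 1209, 79
fn, tp = 435, 561

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of tweets predicted as disasters, how many really are
recall = tp / (tp + fn)      # of real disaster tweets, how many were caught
f1 = 2 * precision * recall / (precision + recall)
# Accuracy of always predicting the more common class in this test split
majority_baseline = max(tn + fp, fn + tp) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"majority-class baseline on this test split={majority_baseline:.3f}")
```

Read this way, the model is fairly precise (about 0.88) but misses a large share of real disaster tweets (recall about 0.56), which matters if the goal is deciding when to send emergency services.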

