# Disaster Tweets

This notebook contains the final models I used for the Kaggle competition "Real or Not? NLP with Disaster Tweets" (https://www.kaggle.com/c/nlp-getting-started/overview).

In [1]:
# imports

# data
import pandas as pd
import numpy as np

# modeling
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import f1_score, make_scorer
from sklearn.naive_bayes import MultinomialNB
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, RandomizedSearchCV
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier

from nltk.corpus import stopwords

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')


## Read the training and testing data from files.
### Read the training data from file.

In [2]:
train_df = pd.read_csv('/Users/davidwalkup/ds-course/projects/Mod4/disaster_tweet_prediction/data/cleaned_train.csv')

In [3]:
train_df.head()

Unnamed: 0,id,keyword,text,target
0,1,nokeyword,our deeds are the reason of this earthquake ma...,1
1,4,nokeyword,forest fire near la ronge sask canada,1
2,5,nokeyword,all residents asked to shelter in place are be...,1
3,6,nokeyword,people receive wildfires evacuation orders in ...,1
4,7,nokeyword,just got sent this photo from ruby alaska as s...,1


### Create a DataFrame for lemmatized text.

In [4]:
lemmatized_train_df = pd.read_csv('/Users/davidwalkup/ds-course/projects/Mod4/disaster_tweet_prediction/data/lemmatized_train.csv')

In [5]:
lemmatized_train_df.head()

Unnamed: 0,id,keyword,text,target
0,1,nokeyword,our deed are the reason of this earthquake may...,1
1,4,nokeyword,forest fire near la ronge sask canada,1
2,5,nokeyword,all resident asked to shelter in place are bei...,1
3,6,nokeyword,people receive wildfire evacuation order in ca...,1
4,7,nokeyword,just got sent this photo from ruby alaska a sm...,1


### Read the testing data from file.

In [6]:
test_df = pd.read_csv('/Users/davidwalkup/ds-course/projects/Mod4/disaster_tweet_prediction/data/cleaned_test.csv')

In [7]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,nokeyword,nolocation,just happened a terrible car crash
1,2,nokeyword,nolocation,heard about earthquake is different city stay ...
2,3,nokeyword,nolocation,there is a forest fire at spot pond goose are ...
3,9,nokeyword,nolocation,apocalypse lighting spokane wildfire
4,11,nokeyword,nolocation,typhoon soudelor kill in china and taiwan


## How good does my model have to be to outperform the naive approach (i.e., predicting that no tweet is about a disaster)?

In [8]:
p_classes = dict(train_df['target'].value_counts(normalize=True))
naive_approach = p_classes[0]
print('Class probabilities: ', p_classes,
      '\nChance tweet is not about a real disaster: ', np.round(naive_approach, decimals = 4))

Class probabilities:  {0: 0.5737136763529725, 1: 0.42628632364702745} 
Chance tweet is not about a real disaster:  0.5737
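
The 0.5737 figure is an accuracy baseline, while the models below are scored with F1. It may help to see what the two constant predictors score in F1 terms. This is a minimal sketch that assumes only the class proportion printed above; the helper name `f1_from_counts` is hypothetical and not part of the original notebook.

```python
# Share of positive (disaster) tweets, taken from the output above.
p_pos = 0.42628632364702745


def f1_from_counts(tp, fp, fn):
    """F1 for the positive class; treated as 0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Always predict 0 ("not a disaster"): no true positives at all.
f1_all_negative = f1_from_counts(tp=0.0, fp=0.0, fn=p_pos)

# Always predict 1 ("disaster"): every positive is found,
# and every negative becomes a false positive.
f1_all_positive = f1_from_counts(tp=p_pos, fp=1 - p_pos, fn=0.0)

print("Accuracy, always predict 0:", round(1 - p_pos, 4))
print("F1, always predict 0:", round(f1_all_negative, 4))
print("F1, always predict 1:", round(f1_all_positive, 4))
```

Under these assumptions, always predicting 1 gives an F1 of roughly 0.60 on the training distribution, which is arguably the more relevant floor for the F1 scores reported in the rest of the notebook.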


## Final models
I used the highest-scoring version of each model type (Logistic Regression, Multinomial Naive Bayes, Support Vector Machine) to create submissions for the Kaggle contest.

### Logistic Regression Model
This model used lemmatized training data and term frequency-inverse document frequency (TF-IDF) vectorization. Stop words were not removed.

In [9]:
tf_idf = TfidfVectorizer(ngram_range=(1, 1),
                         max_df=0.5,
                         min_df=2)

In [10]:
lr_train_tfidf_lemma = tf_idf.fit_transform(lemmatized_train_df['text'])

In [11]:
clf_lr = LogisticRegressionCV(class_weight = 'balanced')

scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

0.6741968290118585 +/- 0.04037677659541278
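
The cell above cross-validates on a matrix that was vectorized using every training tweet, so each validation fold has already influenced the vocabulary and IDF weights. One common way to avoid that is to put the vectorizer inside a pipeline so it is refit on each training fold. The sketch below uses a made-up toy corpus (the texts and labels are hypothetical), with plain `LogisticRegression` and default vectorizer settings standing in for the notebook's configuration to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not the competition tweets.
texts = [
    "forest fire spreading near the town",
    "earthquake damage reported in the city",
    "flood water rising near the river",
    "wildfire evacuation ordered for the town",
    "love the new song on the radio",
    "great pizza in the city tonight",
    "watching a movie near the river",
    "new phone arrived in the mail",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is a pipeline step, so it only ever sees the training folds.
pipe = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced"),
)

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1")
print(scores.mean(), "+/-", scores.std())
```

With a pipeline like this, `cross_val_score` is given the raw text column rather than a precomputed matrix. The scores on such a tiny corpus are not meaningful; only the structure is.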


Transform the test data with the fitted TF-IDF vectorizer so that its columns match the training matrix.

In [12]:
test_tfidf_lemma = tf_idf.transform(test_df['text'])

Confirm that the training and test matrices have the same number of columns, so the model can predict the test set target.

In [13]:
lr_train_tfidf_lemma.shape

(7502, 5578)

In [14]:
test_tfidf_lemma.shape

(3263, 5578)

In [15]:
clf_lr.fit(lr_train_tfidf_lemma, train_df["target"])
lr_preds = clf_lr.predict(test_tfidf_lemma)
lr_preds

array([1, 1, 1, ..., 1, 1, 0])

### Multinomial Naive Bayes Model
This model used lemmatized training data and a binary bag-of-words representation (unigrams and bigrams) built with CountVectorizer. Stop words were not removed.

In [16]:
count_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                   ngram_range = (1, 2),
                                   binary = True)
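
The vectorizer settings above produce a binary matrix over unigrams and bigrams. The small sketch below shows what that means on a single hypothetical sentence, not taken from the data set, in which one word is repeated.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(strip_accents="unicode",
                             ngram_range=(1, 2),
                             binary=True)

# Hypothetical one-tweet corpus in which "fire" appears twice.
X = vectorizer.fit_transform(["fire fire near forest"])

# vocabulary_ maps each unigram or bigram to its column index.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(features)
print(X.toarray())
```

Because `binary = True`, the repeated word "fire" is recorded as 1 rather than 2: each column only indicates whether the term occurs in the tweet. Including bigrams is also why the notebook's matrix has far more columns than the unigram TF-IDF matrix.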

In [17]:
mnb_train_vector_lemma = count_vectorizer.fit_transform(lemmatized_train_df['text'])

In [18]:
# this was the best model (Kaggle score: 0.80777)
clf_mnb = MultinomialNB()

scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vector_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

0.6838672747769288 +/- 0.04188014881745761


In [19]:
test_vector_lemma = count_vectorizer.transform(test_df['text'])

In [20]:
mnb_train_vector_lemma.shape

(7502, 68135)

In [21]:
test_vector_lemma.shape

(3263, 68135)

In [22]:
clf_mnb.fit(mnb_train_vector_lemma, train_df["target"])
mnb_preds = clf_mnb.predict(test_vector_lemma)
mnb_preds

array([1, 1, 1, ..., 1, 1, 1])

### Support Vector Machine Model
This model used lemmatized training data, term frequency-inverse document frequency weighting, and latent semantic analysis.

#### LSA (Latent Semantic Analysis)

In [23]:
svd = decomposition.TruncatedSVD(n_components = 100, random_state = 42)
normalizer = preprocessing.Normalizer()
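
LSA here means a truncated SVD applied to the TF-IDF matrix, followed by row normalization. The sketch below runs those steps on a hypothetical six-document corpus with `n_components = 2` (the notebook uses 100 on the real data) to show the shape of the reduced matrix and that each row ends up with unit length.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy documents, not the competition tweets.
docs = [
    "forest fire spreads near town",
    "fire crews fight forest fire",
    "flood water rises near town",
    "flood warning for river town",
    "new pizza place in town",
    "best pizza and music tonight",
]

# TF-IDF -> truncated SVD -> L2 row normalization.
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    Normalizer(),
)

Z = lsa.fit_transform(docs)
print(Z.shape)                     # one row per document, one column per component
print(np.linalg.norm(Z, axis=1))   # row lengths after normalization
```

The classifier then works on the dense, low-dimensional matrix `Z` instead of the sparse TF-IDF matrix.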

In [24]:
svc_train_tfidf_lemma = tf_idf.fit_transform(lemmatized_train_df['text'])

In [25]:
clf_svc = svm.SVC(C = 0.7,
                  class_weight = 'balanced',
                  coef0 = 1,
                  degree = 2,  # used only by the 'poly' kernel; ignored with kernel = 'sigmoid'
                  gamma = 'scale',
                  kernel = 'sigmoid',
                  random_state = 42)

pipe = pipeline.make_pipeline(svd, normalizer, clf_svc)

scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

0.6447315143896916 +/- 0.01868945409182802


In [26]:
# tf_idf was already fit on this text in In [24]; refitting yields the same vocabulary
tf_idf.fit(lemmatized_train_df['text'])
test_tfidf_svc_lemma = tf_idf.transform(test_df['text'])

In [27]:
svc_train_tfidf_lemma.shape

(7502, 5578)

In [28]:
test_tfidf_svc_lemma.shape

(3263, 5578)

In [29]:
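# Note: this fits the bare SVC on the TF-IDF matrix, so the SVD and Normalizer
# steps that were cross-validated in `pipe` above are not applied here.
# To predict with the LSA model, fit and predict with `pipe` instead of `clf_svc`.
clf_svc.fit(svc_train_tfidf_lemma, train_df["target"])
svc_preds = clf_svc.predict(test_tfidf_svc_lemma)
svc_preds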

array([1, 1, 1, ..., 1, 1, 0])

---

### Create submission files.

In [30]:
# submission for Logistic Regression predictions
# model_sub = pd.read_csv('../data/sample_submission.csv')
# model_sub['target'] = lr_preds
# model_sub.to_csv('../data/lr_prediction_submission.csv', index = False)

In [31]:
# submission for Multinomial Naive Bayes predictions
# this one got the best Kaggle score: 0.80777
# model_sub = pd.read_csv('../data/sample_submission.csv')
# model_sub['target'] = mnb_preds
# model_sub.to_csv('../data/mnb_prediction_submission.csv', index = False)

In [32]:
# submission for SVM predictions
# model_sub = pd.read_csv('../data/sample_submission.csv')
# model_sub['target'] = svc_preds
# model_sub.to_csv('../data/svc_prediction_submission.csv', index = False)
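
The commented-out cells above overwrite the `target` column of the sample submission file. An equivalent submission can be built directly from the test ids and a prediction array, which is sketched below under the assumption that the competition expects exactly the columns `id` and `target`. The ids are the first five from `test_df.head()`; the predictions are hypothetical placeholders.

```python
import io

import numpy as np
import pandas as pd

# First five test ids from test_df.head(); predictions are made-up placeholders.
ids = [0, 2, 3, 9, 11]
preds = np.array([1, 1, 1, 0, 1])

submission = pd.DataFrame({"id": ids, "target": preds})

# Basic sanity checks before writing anything to disk.
assert len(submission) == len(ids)
assert set(submission["target"].unique()) <= {0, 1}

# Write to an in-memory buffer here; a real run would pass a file path instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index = False`, as the original cells do, keeps pandas from writing an extra unnamed index column like the `Unnamed: 0` column visible in the cleaned CSV files.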

### Next Steps
There are still a few things I can try to improve these models:
* Vary the number of components in the LSA models
* More/better tuning of hyperparameters in all models
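
Both bullets could be handled in one search by putting the LSA steps and the classifier in a single pipeline and letting `GridSearchCV` vary the number of SVD components together with the SVM's `C`. The sketch below is only an illustration: the corpus, labels, and parameter grid are hypothetical, and the values are kept tiny so it runs on twelve sentences.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC

# Hypothetical toy data, not the competition tweets.
texts = [
    "forest fire spreading near the town",
    "earthquake damage reported in the city",
    "flood water rising near the river",
    "wildfire evacuation ordered for the town",
    "storm damage closes the city bridge",
    "fire crews rescue people from the flood",
    "love the new song on the radio",
    "great pizza in the city tonight",
    "my exam was a total disaster lol",
    "watching a movie near the river",
    "this traffic is killing me today",
    "new phone arrived in the mail",
]
labels = [1] * 6 + [0] * 6

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(random_state=42),
    Normalizer(),
    SVC(kernel="sigmoid", class_weight="balanced"),
)

# Illustrative grid only; real values would be much larger for n_components.
param_grid = {
    "truncatedsvd__n_components": [2, 4],
    "svc__C": [0.3, 0.7, 1.0],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data, the same structure would take the lemmatized text column and a wider grid; `RandomizedSearchCV`, which the notebook already imports, is a cheaper alternative when the grid gets large.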

I would also like to try TensorFlow and BERT in a Kaggle notebook with the GPU enabled to see how they perform on this problem.