# Disaster Tweets

This notebook models the cleaned and preprocessed dataset from the Kaggle competition "Real or Not? NLP with Disaster Tweets": https://www.kaggle.com/c/nlp-getting-started/overview.

In [1]:
# imports

# data
import pandas as pd
import numpy as np

# modeling
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import f1_score, make_scorer
from sklearn.naive_bayes import MultinomialNB
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, RandomizedSearchCV
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier

from nltk.corpus import stopwords

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')


### Read the training and testing data from files.
#### Read the training data from file.

In [2]:
train_df = pd.read_csv('/Users/davidwalkup/ds-course/projects/Mod4/disaster_tweet_prediction/data/cleaned_train.csv')

In [3]:
train_df.head()

Unnamed: 0,id,keyword,text,target
0,1,nokeyword,our deeds are the reason of this earthquake ma...,1
1,4,nokeyword,forest fire near la ronge sask canada,1
2,5,nokeyword,all residents asked to shelter in place are be...,1
3,6,nokeyword,people receive wildfires evacuation orders in ...,1
4,7,nokeyword,just got sent this photo from ruby alaska as s...,1


#### Create a DataFrame for stemmed text.

In [4]:
stemmed_train_df = pd.read_csv('/Users/davidwalkup/ds-course/projects/Mod4/disaster_tweet_prediction/data/stemmed_train.csv')

In [5]:
stemmed_train_df.head()

Unnamed: 0,id,keyword,text,target
0,1,nokeyword,our deed are the reason of thi earthquak may a...,1
1,4,nokeyword,forest fire near la rong sask canada,1
2,5,nokeyword,all resid ask to shelter in place are be notif...,1
3,6,nokeyword,peopl receiv wildfir evacu order in california,1
4,7,nokeyword,just got sent thi photo from rubi alaska as sm...,1


#### Create a DataFrame for lemmatized text.

In [6]:
lemmatized_train_df = pd.read_csv('/Users/davidwalkup/ds-course/projects/Mod4/disaster_tweet_prediction/data/lemmatized_train.csv')

In [7]:
lemmatized_train_df.head()

Unnamed: 0,id,keyword,text,target
0,1,nokeyword,our deed are the reason of this earthquake may...,1
1,4,nokeyword,forest fire near la ronge sask canada,1
2,5,nokeyword,all resident asked to shelter in place are bei...,1
3,6,nokeyword,people receive wildfire evacuation order in ca...,1
4,7,nokeyword,just got sent this photo from ruby alaska a sm...,1


#### Read the testing data from file.

In [8]:
test_df = pd.read_csv('/Users/davidwalkup/ds-course/projects/Mod4/disaster_tweet_prediction/data/cleaned_test.csv')

In [9]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,nokeyword,nolocation,just happened a terrible car crash
1,2,nokeyword,nolocation,heard about earthquake is different city stay ...
2,3,nokeyword,nolocation,there is a forest fire at spot pond goose are ...
3,9,nokeyword,nolocation,apocalypse lighting spokane wildfire
4,11,nokeyword,nolocation,typhoon soudelor kill in china and taiwan


### How good does my model have to be to outperform the naive approach (i.e., no tweet is about a disaster)?

In [10]:
p_classes = dict(train_df['target'].value_counts(normalize=True))
naive_approach = p_classes[0]
print('Class probabilities: ', p_classes,
      '\nChance tweet is not about a real disaster: ', np.round(naive_approach, decimals = 4))

Class probabilities:  {0: 0.5737136763529725, 1: 0.42628632364702745} 
Chance tweet is not about a real disaster:  0.5737
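
The 0.5737 figure above is an accuracy baseline, while the experiments below are scored with F1 on the positive class. As a rough sketch of the matching F1 baselines, the snippet below plugs the class shares printed above into the F1 formula. It assumes a constant classifier evaluated on the full training set; the variable names are illustrative and not part of the notebook.

```python
# Class shares copied from the value_counts output above.
p_not_disaster = 0.5737136763529725
p_disaster = 0.42628632364702745

# Always predicting 0: there are no positive predictions, hence no true
# positives, and F1 for the positive class is taken as 0.
f1_always_0 = 0.0

# Always predicting 1: every real disaster is found (recall = 1), and the
# share of correct positive predictions equals the positive class share.
precision, recall = p_disaster, 1.0
f1_always_1 = 2 * precision * recall / (precision + recall)

print("F1 if every tweet is labelled 0:", f1_always_0)
print("F1 if every tweet is labelled 1:", round(f1_always_1, 4))
```

Under these assumptions, always predicting "not a disaster" scores an F1 of 0, so the more demanding constant baseline for F1 is the always-positive one, at roughly 0.60.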


#### Set up a DataFrame to hold scoring information, for final model selection.

In [11]:
scoring_df = pd.DataFrame(columns = ['Model', 'Vectorizer', 'Text_Treatment', 'Mean_F1_Score', 'F1_Std_Dev'])

### Bag-of-words features using sklearn CountVectorizer

The first set of experiments includes stopwords.

In [12]:
count_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                   ngram_range = (1, 2),
                                   binary = True)

#### Logistic Regression on CountVectorizer treated training data

In [13]:
lr_train_vectors = count_vectorizer.fit_transform(train_df['text'])

In [14]:
# LogReg, raw
clf_lr = LogisticRegressionCV(class_weight = 'balanced')

scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_vectors, train_df["target"],
                                         cv=5,
                                         scoring="f1")

mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)
# TODO: write function for scoring rows
score_row = pd.DataFrame.from_dict({'Model' : ['LogisticRegression'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['None'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6382107580107526 +/- 0.05755892959272567
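
One possible shape for the scoring helper mentioned in the TODO above is sketched here. `make_score_row` and `rows` are hypothetical names, and the fold scores are made up for illustration. Collecting plain dicts in a list and building the table once at the end (for example with `pd.DataFrame(rows)`) also avoids growing `scoring_df` one row at a time.

```python
from statistics import fmean, pstdev


def make_score_row(model, vectorizer, text_treatment, scores):
    """Summarise one cross-validation run as a dict keyed by scoring_df's columns."""
    return {
        "Model": model,
        "Vectorizer": vectorizer,
        "Text_Treatment": text_treatment,
        "Mean_F1_Score": fmean(scores),
        # Population standard deviation, matching NumPy's default .std().
        "F1_Std_Dev": pstdev(scores),
    }


rows = []  # hypothetical accumulator for all experiment results

# Invented fold scores, only to show the call.
example_scores = [0.60, 0.70]
rows.append(make_score_row("LogisticRegression", "CountVectorizer", "None",
                           example_scores))
print(rows[0])
```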


In [15]:
lr_train_vector_stem = count_vectorizer.fit_transform(stemmed_train_df['text'])

In [16]:
# LogReg, stemmed
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_vector_stem, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['LogisticRegression'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Stemmed'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.647272547233625 +/- 0.05912689198763571


In [17]:
lr_train_vector_lemma = count_vectorizer.fit_transform(lemmatized_train_df['text'])

In [18]:
# LogReg, lemmatized
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_vector_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['LogisticRegression'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Lemmatized'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6554592142449882 +/- 0.048466482009136516
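
The cells above fit the vectorizer on every training row before cross-validating, so each validation fold has already contributed to the vocabulary. A minimal sketch of one way to keep the vectorizer inside the folds is to wrap it in a scikit-learn pipeline. The toy tweets and labels are invented stand-ins for `train_df['text']` and `train_df['target']`, MultinomialNB is used only to keep the example fast, and the shuffled `StratifiedKFold` is an assumption rather than what the notebook ran.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented stand-ins for train_df['text'] and train_df['target'].
texts = [
    "forest fire near la ronge",
    "wildfire evacuation orders in california",
    "earthquake shakes the city",
    "flood warning for the river valley",
    "i love this song so much",
    "what a great day at the beach",
    "my phone battery is on fire lol",
    "this new movie is a disaster",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is refit on the training part of every fold, so the
# validation tweets never influence the vocabulary.
model = make_pipeline(
    CountVectorizer(strip_accents="unicode", ngram_range=(1, 2), binary=True),
    MultinomialNB(),
)

# Shuffled, stratified folds with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1")
print(scores.mean(), "+/-", scores.std())
```

With the real data, the same pipeline object could be passed to `cross_val_score` in place of the pre-vectorized matrix, with `cv=5` as in the cells above.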


The second set of experiments removes stopwords to see whether that improves performance.

In [19]:
english_stops = stopwords.words('english')
count_vectorizer_no_stops = CountVectorizer(strip_accents = 'unicode',
                                            stop_words = english_stops,
                                            ngram_range = (1, 2),
                                            binary = True)

In [20]:
lr_train_vector_no_stops = count_vectorizer_no_stops.fit_transform(train_df['text'])

In [21]:
# LogReg, no stopwords
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_vector_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['LogisticRegression'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Removed stopwords'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.5998213558149595 +/- 0.07525380294906855


In [22]:
lr_train_vector_stem_no_stops = count_vectorizer_no_stops.fit_transform(stemmed_train_df['text'])

In [23]:
# LogReg, stemmed, no stopwords
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_vector_stem_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['LogisticRegression'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Removed stopwords, stemmed'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6073351791553622 +/- 0.0706929906869057


In [24]:
lr_train_vector_lemma_no_stops = count_vectorizer_no_stops.fit_transform(lemmatized_train_df['text'])

In [25]:
# LogReg, lemmatized, no stopwords
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_vector_lemma_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['LogisticRegression'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Removed stopwords, lemmatized'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6080859128072598 +/- 0.06530945915867539
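
A quick tally of the six logistic-regression runs above shows the effect of stopword removal per text treatment. The numbers are copied from the printed outputs; the dictionary layout is just for illustration.

```python
# Mean F1 copied from the outputs above: (with stopwords, without stopwords).
mean_f1 = {
    "None": (0.6382107580107526, 0.5998213558149595),
    "Stemmed": (0.647272547233625, 0.6073351791553622),
    "Lemmatized": (0.6554592142449882, 0.6080859128072598),
}

# Change in mean F1 when stopwords are removed (negative means worse).
deltas = {treatment: without - with_stops
          for treatment, (with_stops, without) in mean_f1.items()}

for treatment, delta in deltas.items():
    print(f"{treatment:<11} {delta:+.4f}")
```

In each pairing the mean F1 drops by roughly 0.04 to 0.05 once stopwords are removed, although each drop is smaller than the fold standard deviations reported for those runs.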


#### Multinomial Naive Bayes on CountVectorizer treated training data

The first set of experiments includes stopwords.

In [26]:
mnb_train_vectors = count_vectorizer.fit_transform(train_df['text'])

In [27]:
# Multinomial Naive Bayes
clf_mnb = MultinomialNB()
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vectors, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['MultinomialNB'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['None'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6749567532885051 +/- 0.0388777822155517


In [28]:
mnb_train_vector_stem = count_vectorizer.fit_transform(stemmed_train_df['text'])

In [29]:
# Multinomial Naive Bayes, stemmed
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vector_stem, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['MultinomialNB'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Stemmed'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.679516177528353 +/- 0.04514210200517351


In [30]:
mnb_train_vector_lemma = count_vectorizer.fit_transform(lemmatized_train_df['text'])

In [31]:
# Multinomial Naive Bayes, lemmatized
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vector_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['MultinomialNB'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Lemmatized'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6838672747769288 +/- 0.04188014881745761
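
Before reading much into the ordering of the three Multinomial Naive Bayes runs above, it can help to compare the gap between their means with the variation across folds. The sketch below does that with the numbers copied from the printed outputs; the variable names are illustrative.

```python
# (mean F1, fold std dev) copied from the three runs above.
results = {
    "None": (0.6749567532885051, 0.0388777822155517),
    "Stemmed": (0.679516177528353, 0.04514210200517351),
    "Lemmatized": (0.6838672747769288, 0.04188014881745761),
}

means = [mean for mean, _ in results.values()]
stds = [std for _, std in results.values()]

# Gap between the best and worst text treatment.
spread = max(means) - min(means)

print(f"spread between treatments: {spread:.4f}")
print(f"smallest fold std dev:     {min(stds):.4f}")
print("spread < smallest std dev:", spread < min(stds))
```

Because the gap between the best and worst treatment is several times smaller than the variation across folds, these three results are hard to tell apart on this evidence alone.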


The second set of experiments excludes stopwords.

In [32]:
mnb_train_vector_no_stops = count_vectorizer_no_stops.fit_transform(train_df['text'])

In [33]:
# Multinomial Naive Bayes, no stopwords
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vector_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['MultinomialNB'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Removed stopwords'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6600585905094238 +/- 0.04066299988918634


In [34]:
mnb_train_vector_stem_no_stops = count_vectorizer_no_stops.fit_transform(stemmed_train_df['text'])

In [35]:
# Multinomial Naive Bayes, stemmed, no stopwords
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vector_stem_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['MultinomialNB'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Removed stopwords, stemmed'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6593484695899581 +/- 0.04559206635857508


In [36]:
mnb_train_vector_lemma_no_stops = count_vectorizer_no_stops.fit_transform(lemmatized_train_df['text'])

In [37]:
# Multinomial Naive Bayes, lemmatized, no stopwords
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vector_lemma_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['MultinomialNB'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Removed stopwords, lemmatized'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6607353953919963 +/- 0.04315073975283043


### Bag-of-words features with term-frequency weighting using TF-IDF vectorization

For the first set of experiments, I kept the stopwords in the tweets to establish a baseline for comparison.

In [38]:
tf_idf = TfidfVectorizer(ngram_range=(1, 1),
                         max_df=0.5,
                         min_df=2)

For the second set of experiments using TF-IDF term weighting, I removed the stopwords.

In [39]:
tf_idf_no_stops = TfidfVectorizer(stop_words = english_stops,
                                  ngram_range=(1, 1),
                                  max_df=0.5,
                                  min_df=2)

#### Logistic Regression on TF-IDF treated training data
The first set of experiments includes stopwords.

In [40]:
lr_train_tfidf = tf_idf.fit_transform(train_df['text'])

In [41]:
# Logistic Regression
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['LogisticRegression'],
                                    'Vectorizer' : ['TfidfVectorizer'],
                                    'Text_Treatment' : ['None'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6661062126913281 +/- 0.036940176833570974


In [42]:
lr_train_tfidf_stem = tf_idf.fit_transform(stemmed_train_df['text'])

In [43]:
# Logistic Regression, stemmed
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf_stem, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['LogisticRegression'],
                                    'Vectorizer' : ['TfidfVectorizer'],
                                    'Text_Treatment' : ['Stemmed'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.672782226487904 +/- 0.043425305750678725


In [44]:
lr_train_tfidf_lemma = tf_idf.fit_transform(lemmatized_train_df['text'])

In [45]:
# Logistic Regression, lemmatized
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['LogisticRegression'],
                                    'Vectorizer' : ['TfidfVectorizer'],
                                    'Text_Treatment' : ['Lemmatized'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6741968290118585 +/- 0.04037677659541278


The second set of experiments excludes stopwords.

In [46]:
lr_train_tfidf_no_stops = tf_idf_no_stops.fit_transform(train_df['text'])

In [47]:
# Logistic Regression, no stopwords
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['LogisticRegression'],
                                    'Vectorizer' : ['TfidfVectorizer'],
                                    'Text_Treatment' : ['Removed stopwords'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6286092341566316 +/- 0.054729023589583974


In [48]:
lr_train_tfidf_stem_no_stops = tf_idf_no_stops.fit_transform(stemmed_train_df['text'])

In [49]:
# Logistic Regression, stemmed, no stopwords
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf_stem_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['LogisticRegression'],
                                    'Vectorizer' : ['TfidfVectorizer'],
                                    'Text_Treatment' : ['Removed stopwords, stemmed'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6350515980527547 +/- 0.0538273658163454


In [50]:
lr_train_tfidf_lemma_no_stops = tf_idf_no_stops.fit_transform(lemmatized_train_df['text'])

In [51]:
# Logistic Regression, lemmatized, no stopwords
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf_lemma_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['LogisticRegression'],
                                    'Vectorizer' : ['TfidfVectorizer'],
                                    'Text_Treatment' : ['Removed stopwords, lemmatized'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6253084385157027 +/- 0.05239536927812601


#### Multinomial Naive Bayes on TF-IDF treated training data
The first set of experiments includes stopwords.

In [52]:
mnb_train_tfidf = tf_idf.fit_transform(train_df['text'])

In [53]:
# Multinomial Naive Bayes
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_tfidf, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['MultinomialNB'],
                                    'Vectorizer' : ['TfidfVectorizer'],
                                    'Text_Treatment' : ['None'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6554771472834363 +/- 0.05697076934045421


In [54]:
mnb_train_tfidf_stem = tf_idf.fit_transform(stemmed_train_df['text'])

In [55]:
# Multinomial Naive Bayes, stemmed
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_tfidf_stem, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['MultinomialNB'],
                                    'Vectorizer' : ['TfidfVectorizer'],
                                    'Text_Treatment' : ['Stemmed'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6601653543759294 +/- 0.06010001540846695


In [56]:
mnb_train_tfidf_lemma = tf_idf.fit_transform(lemmatized_train_df['text'])

In [57]:
# Multinomial Naive Bayes, lemmatized
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_tfidf_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['MultinomialNB'],
                                    'Vectorizer' : ['TfidfVectorizer'],
                                    'Text_Treatment' : ['Lemmatized'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6617217901213649 +/- 0.055086839634722246


The second set of experiments excludes stopwords.

In [58]:
mnb_train_tfidf_no_stops = tf_idf_no_stops.fit_transform(train_df['text'])

In [59]:
# Multinomial Naive Bayes, no stopwords
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_tfidf_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['MultinomialNB'],
                                    'Vectorizer' : ['TfidfVectorizer'],
                                    'Text_Treatment' : ['Removed stopwords'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6478170987485605 +/- 0.054678210710489006


In [60]:
mnb_train_tfidf_stem_no_stops = tf_idf_no_stops.fit_transform(stemmed_train_df['text'])

In [61]:
# Multinomial Naive Bayes, stemmed, no stopwords
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_tfidf_stem_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['MultinomialNB'],
                                    'Vectorizer' : ['TfidfVectorizer'],
                                    'Text_Treatment' : ['Removed stopwords, stemmed'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6466233450531924 +/- 0.05253238685817721


In [62]:
mnb_train_tfidf_lemma_no_stops = tf_idf_no_stops.fit_transform(lemmatized_train_df['text'])

In [63]:
# Multinomial Naive Bayes, lemmatized, no stopwords
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_tfidf_lemma_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['MultinomialNB'],
                                    'Vectorizer' : ['TfidfVectorizer'],
                                    'Text_Treatment' : ['Removed stopwords, lemmatized'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6514290195889771 +/- 0.05372415943963887


#### Support Vector Machine Models
I used a cross-validated grid search (GridSearchCV) to choose the hyperparameters for all SVM models.

The grid-search cells are commented out so that subsequent runs of the notebook skip the search.

In [64]:
# Set up parameter grid for GridSearchCV testing

# my_params = {'C': [0.1, 0.3, 0.5, 0.7],
#              'kernel': ['rbf', 'poly', 'sigmoid'],
#              'degree': [2, 3],
#              'gamma' : ['auto', 'scale'],
#              'class_weight' : ['balanced'],
#              'random_state' : [42],
#              'probability' : [False, True],
# #              'shrinking' : [False, True],
#              'coef0' : [1e2, 0.1, 1, 10]}

In [65]:
# GridSearchCV testing to find best parameters for SVM model

# scorer = make_scorer(f1_score)
# gs_clf = GridSearchCV(svm.SVC(),
#                       param_grid = my_params,
#                       scoring = scorer,
#                       verbose = 1,
#                       n_jobs = -1)
# gs_clf.fit(train_tfidf_lemmatized_df, train_df["target"])
# print(gs_clf.best_params_, gs_clf.best_score_)

# results:
# {'C': 0.7,
#  'class_weight': 'balanced',
#  'coef0': 1,
#  'degree': 2,
#  'gamma': 'scale',
#  'kernel': 'sigmoid',
#  'probability': False,
#  'random_state': 42}
# 0.6660730647063914
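
For a sense of how large the commented-out search is, the sketch below counts the parameter combinations in `my_params`. It also counts a hypothetical per-kernel grid (`reduced_grid`, not in the notebook) that relies on two documented SVC behaviours: `degree` only affects the `poly` kernel, and `coef0` only matters for `poly` and `sigmoid`. Dropping `probability` is likewise an assumption, based on it not changing the labels returned by `predict`. The fold count of 5 assumes GridSearchCV's default `cv`.

```python
from math import prod

# The grid from the commented-out cell above (values copied as written).
full_grid = {
    "C": [0.1, 0.3, 0.5, 0.7],
    "kernel": ["rbf", "poly", "sigmoid"],
    "degree": [2, 3],
    "gamma": ["auto", "scale"],
    "class_weight": ["balanced"],
    "random_state": [42],
    "probability": [False, True],
    "coef0": [1e2, 0.1, 1, 10],
}

# Hypothetical alternative: one sub-grid per kernel, listing only the
# parameters that kernel actually uses.
shared = {"C": full_grid["C"], "gamma": full_grid["gamma"],
          "class_weight": ["balanced"], "random_state": [42]}
reduced_grid = [
    {**shared, "kernel": ["rbf"]},
    {**shared, "kernel": ["poly"], "degree": [2, 3],
     "coef0": full_grid["coef0"]},
    {**shared, "kernel": ["sigmoid"], "coef0": full_grid["coef0"]},
]


def grid_size(grid):
    """Number of parameter combinations in a single dict-style grid."""
    return prod(len(values) for values in grid.values())


n_folds = 5  # assumed: GridSearchCV's default when cv is not given
n_full = grid_size(full_grid)
n_reduced = sum(grid_size(grid) for grid in reduced_grid)

print("full grid:   ", n_full, "combinations,", n_full * n_folds, "fits")
print("reduced grid:", n_reduced, "combinations,", n_reduced * n_folds, "fits")
```

GridSearchCV accepts a list of dicts such as `reduced_grid` as its `param_grid`, so the smaller search could be run in the same way as the original one.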

The first set of experiments includes stopwords.

In [66]:
svc_train_vectors = count_vectorizer.fit_transform(train_df['text'])

In [67]:
# SVM: CountVectorizer, raw
# Note: the grid search above reported C = 0.7 as best; this cell uses C = 0.5.
clf_svc = svm.SVC(C = 0.5,
              kernel = 'sigmoid',
              degree = 2,
              gamma = 'scale',
              class_weight = 'balanced',
              random_state = 42)

scores = model_selection.cross_val_score(clf_svc,
                                         svc_train_vectors, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['SVM'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['None'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6187119538548428 +/- 0.049889153215654146


In [68]:
svc_train_vector_stem = count_vectorizer.fit_transform(stemmed_train_df['text'])

In [69]:
# SVM: CountVectorizer, stemmed
scores = model_selection.cross_val_score(clf_svc,
                                         svc_train_vector_stem, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['SVM'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Stemmed'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6275031285870213 +/- 0.05665626546700774


In [70]:
svc_train_vector_lemma = count_vectorizer.fit_transform(lemmatized_train_df['text'])

In [71]:
# SVM: CountVectorizer, lemmatized
scores = model_selection.cross_val_score(clf_svc,
                                         svc_train_vector_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['SVM'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Lemmatized'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.628212870572273 +/- 0.053337266693638356


The second set of experiments excluded stopwords.
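As a small illustration of what excluding stopwords does to the vocabulary, the sketch below fits two `CountVectorizer` instances on a toy corpus. The two-word `toy_stopwords` list and the example sentences are hypothetical; the `count_vectorizer_no_stops` object used in this notebook may have been configured differently.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus and stopword list, both hypothetical (not the notebook's data).
docs = ["forest fire near the town", "the fire is out"]
toy_stopwords = ["the", "is"]

# One vectorizer keeps every token, the other drops the listed stopwords.
with_stops = CountVectorizer().fit(docs)
no_stops = CountVectorizer(stop_words=toy_stopwords).fit(docs)

vocab_with = sorted(with_stops.vocabulary_)
vocab_without = sorted(no_stops.vocabulary_)

print(vocab_with)
print(vocab_without)
```

Only the listed words disappear from the second vocabulary, so the feature matrix loses exactly those columns.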

In [72]:
svc_train_vector_no_stops = count_vectorizer_no_stops.fit_transform(train_df['text'])

In [73]:
# SVM: CountVectorizer, no stopwords
scores = model_selection.cross_val_score(clf_svc,
                                         svc_train_vector_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['SVM'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Removed stopwords'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.5400199107702773 +/- 0.07443757239006776


In [74]:
svc_train_vector_stem_no_stops = count_vectorizer_no_stops.fit_transform(stemmed_train_df['text'])

In [75]:
# SVM: CountVectorizer, stemmed, no stopwords
scores = model_selection.cross_val_score(clf_svc,
                                         svc_train_vector_stem_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['SVM'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Stemmed, removed stopwords'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.5690484866165505 +/- 0.07519960383938466


In [76]:
svc_train_vector_lemma_no_stops = count_vectorizer_no_stops.fit_transform(lemmatized_train_df['text'])

In [77]:
# SVM: CountVectorizer, lemmatized, no stopwords
scores = model_selection.cross_val_score(clf_svc,
                                         svc_train_vector_lemma_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['SVM'],
                                    'Vectorizer' : ['CountVectorizer'],
                                    'Text_Treatment' : ['Lemmatized, removed stopwords'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.5567032812309256 +/- 0.08343611635643196


#### I did an additional set of experiments with SVM, using LSA

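LSA (Latent Semantic Analysis) reduces the TF-IDF matrix to a small number of dense components with truncated SVD (100 components here); each row is then normalized to unit length before it reaches the classifier.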
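A minimal sketch of the LSA transformation on a toy corpus, assuming the usual scikit-learn recipe of TF-IDF followed by `TruncatedSVD` and `Normalizer`. The four sentences and the choice of two components are hypothetical and only keep the example small.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy corpus (hypothetical); every document shares at least one word.
docs = [
    "forest fire near the town",
    "fire crews fight the forest fire",
    "flood water near the town",
    "flood warning for the river town",
]

# TF-IDF -> truncated SVD -> L2 normalization of each document vector.
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    Normalizer(),
)

doc_topics = lsa.fit_transform(docs)
row_norms = np.linalg.norm(doc_topics, axis=1)

print(doc_topics.shape)
print(row_norms)
```

Each document ends up as a short dense vector of unit length, which is the input the SVM and KNN classifiers see in the experiments that follow.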

In [78]:
svd = decomposition.TruncatedSVD(n_components = 100, random_state = 42)
normalizer = preprocessing.Normalizer()

In [79]:
svc_train_tfidf = tf_idf.fit_transform(train_df['text'])

As usual, the first set of experiments included stopwords.

In [80]:
# SVM: LSA, TF-IDF, raw
pipe = pipeline.make_pipeline(svd, normalizer, clf_svc)

scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['SVM'],
                                    'Vectorizer' : ['LSA'],
                                    'Text_Treatment' : ['None'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6422426503451424 +/- 0.027939842727464533


In [81]:
svc_train_tfidf_stem = tf_idf.fit_transform(stemmed_train_df['text'])

In [82]:
# SVM: LSA, TF-IDF, stemmed
scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf_stem, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['SVM'],
                                    'Vectorizer' : ['LSA'],
                                    'Text_Treatment' : ['Stemmed'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6516378542457344 +/- 0.030410924927612156


In [83]:
svc_train_tfidf_lemma = tf_idf.fit_transform(lemmatized_train_df['text'])

In [84]:
# SVM: LSA, TF-IDF, lemmatized
scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['SVM'],
                                    'Vectorizer' : ['LSA'],
                                    'Text_Treatment' : ['Lemmatized'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.653453506250657 +/- 0.029915676849887145


The second set of experiments excluded stopwords.

In [85]:
svc_train_tfidf_no_stops = tf_idf_no_stops.fit_transform(train_df['text'])

In [86]:
# SVM: LSA, TF-IDF, raw, no stopwords
scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['SVM'],
                                    'Vectorizer' : ['LSA'],
                                    'Text_Treatment' : ['Removed stopwords'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6462697248132283 +/- 0.03027529085221108


In [87]:
svc_train_tfidf_stem_no_stops = tf_idf_no_stops.fit_transform(stemmed_train_df['text'])

In [88]:
# SVM: LSA, TF-IDF, stemmed, no stopwords
scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf_stem_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['SVM'],
                                    'Vectorizer' : ['LSA'],
                                    'Text_Treatment' : ['Removed stopwords, stemmed'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6382746296233381 +/- 0.02709286556980339


In [89]:
svc_train_tfidf_lemma_no_stops = tf_idf_no_stops.fit_transform(lemmatized_train_df['text'])

In [90]:
# SVM: LSA, TF-IDF, lemmatized, no stopwords
scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf_lemma_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['SVM'],
                                    'Vectorizer' : ['LSA'],
                                    'Text_Treatment' : ['Removed stopwords, lemmatized'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.6357309894803729 +/- 0.03266720780230194


#### KNN Models

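I also tried two KNN models on the LSA features, but the results weren't promising: the mean F1 scores (about 0.52 with stopwords and 0.56 without) were lower than those of every SVM LSA variant above.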

In [91]:
knn_train_tfidf = tf_idf.fit_transform(train_df['text'])

In [92]:
# KNN: LSA, TF-IDF, raw
clf_knn = KNeighborsClassifier(n_neighbors=5,
                               algorithm='brute',
                               metric='cosine')

pipe = pipeline.make_pipeline(svd, normalizer, clf_knn)

scores = model_selection.cross_val_score(pipe,
                                         knn_train_tfidf, train_df['target'],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['KNN'],
                                    'Vectorizer' : ['LSA'],
                                    'Text_Treatment' : ['None'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.5198641468784797 +/- 0.04564133695115445


In [93]:
knn_train_tfidf_no_stops = tf_idf_no_stops.fit_transform(train_df['text'])

In [94]:
# KNN: LSA, TF-IDF, no stopwords
scores = model_selection.cross_val_score(pipe,
                                         knn_train_tfidf_no_stops, train_df['target'],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

score_row = pd.DataFrame.from_dict({'Model' : ['KNN'],
                                    'Vectorizer' : ['LSA'],
                                    'Text_Treatment' : ['Removed stopwords'],
                                    'Mean_F1_Score' : [mean_score],
                                    'F1_Std_Dev' : [stability]})
scoring_df = pd.concat([scoring_df, score_row], ignore_index = True)

0.5589305021316724 +/- 0.041042963510106854


---

### Table of Model Scores
This table is sorted by the mean F1 score of each model, to help me select the final model(s) for submission to the contest.

In [95]:
scoring_df.sort_values(by = ['Mean_F1_Score'], ascending = False)

Unnamed: 0,Model,Vectorizer,Text_Treatment,Mean_F1_Score,F1_Std_Dev
8,MultinomialNB,CountVectorizer,Lemmatized,0.683867,0.04188
7,MultinomialNB,CountVectorizer,Stemmed,0.679516,0.045142
6,MultinomialNB,CountVectorizer,,0.674957,0.038878
14,LogisticRegression,TfidfVectorizer,Lemmatized,0.674197,0.040377
13,LogisticRegression,TfidfVectorizer,Stemmed,0.672782,0.043425
12,LogisticRegression,TfidfVectorizer,,0.666106,0.03694
20,MultinomialNB,TfidfVectorizer,Lemmatized,0.661722,0.055087
11,MultinomialNB,CountVectorizer,"Removed stopwords, lemmatized",0.660735,0.043151
19,MultinomialNB,TfidfVectorizer,Stemmed,0.660165,0.0601
9,MultinomialNB,CountVectorizer,Removed stopwords,0.660059,0.040663
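
The score-and-record cells above repeat the same steps for every experiment. As a rough sketch (not the code used in this notebook), the bookkeeping could be collected in one helper that returns a row per experiment, with the table built once at the end. The `score_model` helper and the synthetic data from `make_classification` are hypothetical stand-ins for the real vectorized tweets.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def score_model(estimator, X, y, model, vectorizer, treatment):
    """Cross-validate one estimator and return a row for the scoring table."""
    scores = cross_val_score(estimator, X, y, cv=5, scoring="f1")
    return {
        "Model": model,
        "Vectorizer": vectorizer,
        "Text_Treatment": treatment,
        "Mean_F1_Score": scores.mean(),
        "F1_Std_Dev": scores.std(),
    }


# Synthetic binary classification data (hypothetical stand-in).
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

rows = [
    score_model(LogisticRegression(C=1.0, max_iter=1000), X, y,
                "LogisticRegression (C=1.0)", "synthetic", "None"),
    score_model(LogisticRegression(C=0.01, max_iter=1000), X, y,
                "LogisticRegression (C=0.01)", "synthetic", "None"),
]

# Build the table in one step and sort it by mean F1, best first.
scoring_df = pd.DataFrame(rows).sort_values(by="Mean_F1_Score", ascending=False)
print(scoring_df)
```

Collecting plain dictionaries and constructing the DataFrame once avoids growing the table row by row.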


---

### Final models
I used the highest-scoring version of each model type (Logistic Regression, Multinomial Naive Bayes, Support Vector Machine) to create submissions for the Kaggle contest.
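
One way to keep the vectorizer and classifier in step between training and test data is to wrap both in a single pipeline, so the vocabulary is learned only from the text the model is fitted on (including inside each cross-validation fold). The sketch below assumes this approach; the toy tweets, labels, and the `text_clf` name are hypothetical and it is not the procedure used in the cells that follow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy training and test tweets (hypothetical).
train_texts = [
    "forest fire spreading near the town",
    "flood water rising in the valley",
    "earthquake damage reported downtown",
    "wildfire evacuation orders issued",
    "storm destroyed several homes",
    "i love this new song",
    "great dinner with friends tonight",
    "watching a movie at home",
    "my exam went really well",
    "coffee and a good book today",
]
train_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
test_texts = ["fire near the valley", "dinner and a movie tonight"]

# Vectorizer and classifier are fitted together on the same training text.
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# The vectorizer is refit inside every fold, so held-out text never shapes the vocabulary.
scores = cross_val_score(text_clf, train_texts, train_labels, cv=2, scoring="f1")

text_clf.fit(train_texts, train_labels)
test_preds = text_clf.predict(test_texts)
print(scores, test_preds)
```

With this layout the test matrix always has the same columns as the training matrix, so no separate shape check is needed.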

In [97]:
# Final models - Logistic Regression, lemmatized
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

0.6741968290118585 +/- 0.04037677659541278


In [102]:
tf_idf.fit(lemmatized_train_df['text'])
test_tfidf_lemma = tf_idf.transform(test_df['text'])

In [103]:
lr_train_tfidf_lemma.shape

(7502, 5578)

In [104]:
test_tfidf_lemma.shape

(3263, 5578)

In [105]:
clf_lr.fit(lr_train_tfidf_lemma, train_df["target"])
lr_preds = clf_lr.predict(test_tfidf_lemma)
lr_preds

array([1, 1, 1, ..., 1, 1, 0])

In [106]:
# Final models - Multinomial Naive Bayes, lemmatized
# this was the best model (Kaggle score: 0.80777)

scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vector_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

0.6838672747769288 +/- 0.04188014881745761


In [107]:
count_vectorizer.fit(lemmatized_train_df['text'])
test_vector_lemma = count_vectorizer.transform(test_df['text'])

In [108]:
mnb_train_vector_lemma.shape

(7502, 68135)

In [109]:
test_vector_lemma.shape

(3263, 68135)

In [110]:
clf_mnb.fit(mnb_train_vector_lemma, train_df["target"])
mnb_preds = clf_mnb.predict(test_vector_lemma)
mnb_preds

array([1, 1, 1, ..., 1, 1, 1])

In [111]:
# Final models - SVM: LSA, TFIDF, lemmatized
pipe = pipeline.make_pipeline(svd, normalizer, clf_svc)

scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

0.653453506250657 +/- 0.029915676849887145


In [112]:
tf_idf.fit(lemmatized_train_df['text'])
test_tfidf_svc_lemma = tf_idf.transform(test_df['text'])

In [113]:
svc_train_tfidf_lemma.shape

(7502, 5578)

In [114]:
test_tfidf_svc_lemma.shape

(3263, 5578)

In [115]:
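# Note: this fits clf_svc directly on the TF-IDF matrix, without the SVD and
# normalizer steps that were cross-validated in In [111]. To submit the LSA
# pipeline itself, call pipe.fit(...) and pipe.predict(...) instead.
clf_svc.fit(svc_train_tfidf_lemma, train_df["target"])
svc_preds = clf_svc.predict(test_tfidf_svc_lemma)
svc_preds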

array([1, 1, 1, ..., 1, 1, 0])

---

### Create submission files.

In [None]:
# submission for Logistic Regression predictions
# model_sub = pd.read_csv('../data/sample_submission.csv')
# model_sub['target'] = lr_preds
# model_sub.to_csv('../data/lr_prediction_submission.csv', index = False)

In [None]:
# submission for Multinomial Naive Bayes predictions
# this one got the best Kaggle score: 0.80777
# model_sub = pd.read_csv('../data/sample_submission.csv')
# model_sub['target'] = mnb_preds
# model_sub.to_csv('../data/mnb_prediction_submission.csv', index = False)

In [None]:
# submission for SVM predictions
# model_sub = pd.read_csv('../data/sample_submission.csv')
# model_sub['target'] = svc_preds
# model_sub.to_csv('../data/svc_prediction_submission.csv', index = False)

### Next Steps
There are still a few things I can try to improve these models:
* Vary the number of components in the LSA models
* More/better tuning of hyperparameters in all models
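
Both items could be explored together with a grid search over the pipeline. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` in place of the TF-IDF matrix, `SVC` as a stand-in for `clf_svc`, and hypothetical grid values for `n_components` and `C`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC

# Synthetic data (hypothetical stand-in for the vectorized tweets).
X, y = make_classification(n_samples=200, n_features=40,
                           n_informative=10, random_state=42)

# Same three stages as the LSA experiments: SVD, normalization, classifier.
lsa_svc = Pipeline([
    ("svd", TruncatedSVD(random_state=42)),
    ("norm", Normalizer()),
    ("svc", SVC()),
])

# Hypothetical grid: number of LSA components and SVM regularization strength.
param_grid = {
    "svd__n_components": [5, 10, 20],
    "svc__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(lsa_svc, param_grid, scoring="f1", cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```

On the real data the grid for `n_components` would need larger values, since the experiments above used 100 components.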

I would also like to test TensorFlow and BERT in a Kaggle notebook with the GPU enabled to see how they perform on this problem.