# Disaster Tweets

This notebook models the cleaned and preprocessed dataset from the Kaggle competition "Real or Not? NLP with Disaster Tweets": https://www.kaggle.com/c/nlp-getting-started/overview.

In [1]:
# imports

# data
import pandas as pd
import numpy as np

# modeling
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import f1_score, make_scorer
from sklearn.naive_bayes import MultinomialNB
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, RandomizedSearchCV
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier

from nltk.corpus import stopwords

# helper functions
import helpers

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')


### Read the training and testing data from files.
#### Read the training data from file.

In [2]:
train_df = pd.read_csv('/Users/davidwalkup/ds-course/projects/Mod4/disaster_tweet_prediction/data/cleaned_train.csv')

In [3]:
train_df.head()

Unnamed: 0,id,keyword,text,target
0,1,nokeyword,our deeds are the reason of this earthquake ma...,1
1,4,nokeyword,forest fire near la ronge sask canada,1
2,5,nokeyword,all residents asked to shelter in place are be...,1
3,6,nokeyword,people receive wildfires evacuation orders in ...,1
4,7,nokeyword,just got sent this photo from ruby alaska as s...,1


#### Create a DataFrame for stemmed text.

In [4]:
stemmed_train_df = pd.read_csv('/Users/davidwalkup/ds-course/projects/Mod4/disaster_tweet_prediction/data/stemmed_train.csv')

In [5]:
stemmed_train_df.head()

Unnamed: 0,id,keyword,text,target
0,1,nokeyword,our deed are the reason of thi earthquak may a...,1
1,4,nokeyword,forest fire near la rong sask canada,1
2,5,nokeyword,all resid ask to shelter in place are be notif...,1
3,6,nokeyword,peopl receiv wildfir evacu order in california,1
4,7,nokeyword,just got sent thi photo from rubi alaska as sm...,1


#### Create a DataFrame for lemmatized text.

In [6]:
lemmatized_train_df = pd.read_csv('/Users/davidwalkup/ds-course/projects/Mod4/disaster_tweet_prediction/data/lemmatized_train.csv')

In [7]:
lemmatized_train_df.head()

Unnamed: 0,id,keyword,text,target
0,1,nokeyword,our deed are the reason of this earthquake may...,1
1,4,nokeyword,forest fire near la ronge sask canada,1
2,5,nokeyword,all resident asked to shelter in place are bei...,1
3,6,nokeyword,people receive wildfire evacuation order in ca...,1
4,7,nokeyword,just got sent this photo from ruby alaska a sm...,1


#### Read the testing data from file.

In [8]:
test_df = pd.read_csv('/Users/davidwalkup/ds-course/projects/Mod4/disaster_tweet_prediction/data/cleaned_test.csv')

In [9]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,nokeyword,nolocation,just happened a terrible car crash
1,2,nokeyword,nolocation,heard about earthquake is different city stay ...
2,3,nokeyword,nolocation,there is a forest fire at spot pond goose are ...
3,9,nokeyword,nolocation,apocalypse lighting spokane wildfire
4,11,nokeyword,nolocation,typhoon soudelor kill in china and taiwan


### How good does my model have to be to outperform the naive approach (i.e., no tweet is about a disaster)?

In [10]:
p_classes = dict(train_df['target'].value_counts(normalize=True))
naive_approach = p_classes[0]
print('Class probabilities: ', p_classes,
      '\nChance tweet is not about a real disaster: ', np.round(naive_approach, decimals = 4))

Class probabilities:  {0: 0.5737136763529725, 1: 0.42628632364702745} 
Chance tweet is not about a real disaster:  0.5737
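
The 0.5737 figure is an accuracy baseline, while every model below is scored with F1 on the disaster class. A minimal sketch of how the two constant predictions compare under F1, assuming a rounded 57/43 class split (illustrative labels, not the real data):

```python
def f1_for_constant_prediction(labels, predicted_class):
    """F1 for the positive class (1) when every tweet gets the same prediction."""
    tp = sum(1 for y in labels if y == 1 and predicted_class == 1)
    fp = sum(1 for y in labels if y == 0 and predicted_class == 1)
    fn = sum(1 for y in labels if y == 1 and predicted_class == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is conventionally reported as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative labels: 57 non-disaster and 43 disaster tweets (rounded from the output above)
labels = [0] * 57 + [1] * 43

accuracy_all_negative = labels.count(0) / len(labels)
f1_all_negative = f1_for_constant_prediction(labels, 0)
f1_all_positive = f1_for_constant_prediction(labels, 1)

print(accuracy_all_negative, f1_all_negative, round(f1_all_positive, 4))
```

Under these assumptions, predicting "no disaster" everywhere scores an F1 of 0, and predicting "disaster" everywhere scores roughly 0.60, so the latter is the more demanding trivial baseline for the F1 scores recorded below.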


#### Set up a DataFrame to hold scoring information, for final model selection.

In [11]:
scoring_df = pd.DataFrame(columns = ['Model', 'Vectorizer', 'Text_Treatment', 'Mean_F1_Score', 'F1_Std_Dev'])

### Bag-of-words using sklearn CountVectorizer

The first set of experiments includes stop words.

In [12]:
count_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                   ngram_range = (1, 2),
                                   binary = True)
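
To make the vectorizer settings concrete, here is a rough pure-Python imitation of what `ngram_range = (1, 2)` with `binary = True` produces for one tweet. It assumes the default CountVectorizer tokenization (lowercasing, tokens of two or more word characters) and skips accent stripping; `binary_ngram_features` is a hypothetical helper, not part of sklearn.

```python
import re

def binary_ngram_features(text, max_n=2):
    """Return the set of word n-grams (n = 1..max_n) present in text."""
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    features = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    return features

features = binary_ngram_features("forest fire near la ronge sask canada")
print(sorted(features))

# With binary features a repeated word counts once: presence, not frequency
repeated = binary_ngram_features("fire fire fire")
print(sorted(repeated))
```

The seven-word tweet yields seven unigrams and six bigrams, and the repeated tweet collapses to just `fire` and `fire fire`.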

#### Logistic Regression on CountVectorizer treated training data

In [13]:
lr_train_vectors = count_vectorizer.fit_transform(train_df['text'])

In [14]:
# LogReg, raw
clf_lr = LogisticRegressionCV(class_weight = 'balanced',
                              random_state = 42)

scores = cross_val_score(clf_lr,
                         lr_train_vectors, train_df["target"],
                         cv=5,
                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'LogisticRegression', 'CountVectorizer',
                                     'None', clf_score[0], clf_score[1])

Mean score:  0.6382107580107526 +/- 0.05755892959272567
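
The `helpers` module is not shown in this notebook. Judging only from how its two functions are called and from the printed output, they might look something like the sketch below; the bodies are my assumption, not the actual implementation.

```python
import numpy as np
import pandas as pd

def model_scoring(scores):
    """Hypothetical stand-in: print and return the mean and std of the CV scores."""
    mean_score = float(np.mean(scores))
    std_dev = float(np.std(scores))
    print('Mean score: ', mean_score, '+/-', std_dev)
    return mean_score, std_dev

def score_recording(df, model, vectorizer, treatment, mean_score, std_dev):
    """Hypothetical stand-in: return a copy of df with one more result row."""
    out = df.copy()
    out.loc[len(out)] = [model, vectorizer, treatment, mean_score, std_dev]
    return out

# Usage with made-up fold scores
demo_df = pd.DataFrame(columns=['Model', 'Vectorizer', 'Text_Treatment',
                                'Mean_F1_Score', 'F1_Std_Dev'])
demo_mean, demo_std = model_scoring([0.6, 0.7])
demo_df = score_recording(demo_df, 'LogisticRegression', 'CountVectorizer',
                          'None', demo_mean, demo_std)
```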


In [15]:
lr_train_vector_stem = count_vectorizer.fit_transform(stemmed_train_df['text'])

In [16]:
# LogReg, stemmed
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_vector_stem, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'LogisticRegression', 'CountVectorizer',
                                     'Stemmed', clf_score[0], clf_score[1])

Mean score:  0.647272547233625 +/- 0.05912689198763571


In [17]:
lr_train_vector_lemma = count_vectorizer.fit_transform(lemmatized_train_df['text'])

In [18]:
# LogReg, lemmatized
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_vector_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'LogisticRegression', 'CountVectorizer',
                                     'Lemmatized', clf_score[0], clf_score[1])

Mean score:  0.6554592142449882 +/- 0.048466482009136516
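
In the cells above, the vectorizer is fitted on all training tweets before `cross_val_score` splits them, so each fold's vocabulary also contains words from its held-out tweets. One possible way to avoid that, sketched here on a made-up toy corpus, is to put the vectorizer inside a pipeline. The sketch uses plain `LogisticRegression` rather than `LogisticRegressionCV` to keep the toy example small; that swap is my simplification.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in for train_df['text'] and train_df['target'] (made up for illustration)
texts = [
    "forest fire near the town", "earthquake shakes the city",
    "flood water rising fast", "wildfire evacuation ordered",
    "storm damage reported downtown", "fire crews battle blaze",
    "i love this song", "what a great movie",
    "my day was fine", "coffee with friends today",
    "new phone arrived today", "this game is fun",
]
targets = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

cv_pipe = make_pipeline(
    CountVectorizer(strip_accents='unicode', ngram_range=(1, 2), binary=True),
    LogisticRegression(class_weight='balanced', random_state=42),
)

# The vectorizer is refitted on the training part of every fold
cv_scores = cross_val_score(cv_pipe, texts, targets, cv=3, scoring='f1')
print(cv_scores.mean(), '+/-', cv_scores.std())
```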


The second set of experiments removes stop words to see whether that improves performance.

In [19]:
english_stops = stopwords.words('english')
count_vectorizer_no_stops = CountVectorizer(strip_accents = 'unicode',
                                            stop_words = english_stops,
                                            ngram_range = (1, 2),
                                            binary = True)

In [20]:
lr_train_vector_no_stops = count_vectorizer_no_stops.fit_transform(train_df['text'])

In [21]:
# LogReg, no stopwords
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_vector_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'LogisticRegression', 'CountVectorizer',
                             'Removed stopwords', clf_score[0], clf_score[1])

Mean score:  0.5998213558149595 +/- 0.07525380294906855


In [22]:
lr_train_vector_stem_no_stops = count_vectorizer_no_stops.fit_transform(stemmed_train_df['text'])

In [None]:
# LogReg, stemmed, no stopwords
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_vector_stem_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'LogisticRegression', 'CountVectorizer',
                                     'Removed stopwords, stemmed', clf_score[0], clf_score[1])

In [None]:
lr_train_vector_lemma_no_stops = count_vectorizer_no_stops.fit_transform(lemmatized_train_df['text'])

In [None]:
# LogReg, lemmatized, no stopwords
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_vector_lemma_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'LogisticRegression', 'CountVectorizer',
                                     'Removed stopwords, lemmatized',
                                     clf_score[0], clf_score[1])

#### Multinomial Naive Bayes on CountVectorizer treated training data

First set of experiments includes stopwords.

In [None]:
mnb_train_vectors = count_vectorizer.fit_transform(train_df['text'])

In [None]:
# Multinomial Naive Bayes
clf_mnb = MultinomialNB()
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vectors, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'MultinomialNB', 'CountVectorizer',
                                     'None',
                                     clf_score[0], clf_score[1])

In [None]:
mnb_train_vector_stem = count_vectorizer.fit_transform(stemmed_train_df['text'])

In [None]:
# Multinomial Naive Bayes, stemmed
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vector_stem, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'MultinomialNB', 'CountVectorizer',
                                     'Stemmed',
                                     clf_score[0], clf_score[1])

In [None]:
mnb_train_vector_lemma = count_vectorizer.fit_transform(lemmatized_train_df['text'])

In [None]:
# Multinomial Naive Bayes, lemmatized
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vector_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'MultinomialNB', 'CountVectorizer',
                                     'Lemmatized',
                                     clf_score[0], clf_score[1])

The second set of experiments excludes stopwords.

In [None]:
mnb_train_vector_no_stops = count_vectorizer_no_stops.fit_transform(train_df['text'])

In [None]:
# Multinomial Naive Bayes, no stopwords
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vector_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'MultinomialNB', 'CountVectorizer',
                                     'Removed stopwords',
                                     clf_score[0], clf_score[1])

In [None]:
mnb_train_vector_stem_no_stops = count_vectorizer_no_stops.fit_transform(stemmed_train_df['text'])

In [None]:
# Multinomial Naive Bayes, stemmed, no stopwords
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vector_stem_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'MultinomialNB', 'CountVectorizer',
                                     'Removed stopwords, stemmed',
                                     clf_score[0], clf_score[1])

In [None]:
mnb_train_vector_lemma_no_stops = count_vectorizer_no_stops.fit_transform(lemmatized_train_df['text'])

In [None]:
# Multinomial Naive Bayes, lemmatized, no stopwords
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vector_lemma_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'MultinomialNB', 'CountVectorizer',
                                     'Removed stopwords, lemmatized',
                                     clf_score[0], clf_score[1])

### Bag-of-words and term frequency weighting using TF-IDF vectorization

For the first set of experiments, I kept the stopwords in the tweets to get a baseline for comparison.

In [None]:
tf_idf = TfidfVectorizer(ngram_range=(1, 1),
                         max_df=0.5,
                         min_df=2)

For the second set of experiments using TF-IDF term weighting, I removed the stopwords.

In [None]:
tf_idf_no_stops = TfidfVectorizer(stop_words = english_stops,
                                  ngram_range=(1, 1),
                                  max_df=0.5,
                                  min_df=2)
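
The `min_df` and `max_df` settings prune the vocabulary by document frequency before any weighting happens. A small pure-Python sketch of that filtering rule, using whitespace tokenization and a made-up four-document corpus (`prune_vocabulary` is a hypothetical helper for illustration):

```python
from collections import Counter

def prune_vocabulary(docs, min_df=2, max_df=0.5):
    """Keep terms found in at least min_df documents and in at most a max_df fraction of them."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    max_count = max_df * len(docs)
    return {term for term, df in doc_freq.items() if min_df <= df <= max_count}

docs = ["the fire spread", "the flood rose", "the fire crew", "the storm hit"]
kept_terms = prune_vocabulary(docs)
print(sorted(kept_terms))
```

Here `the` appears in every document and is dropped by `max_df`, the one-off words are dropped by `min_df`, and only `fire` survives.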

#### Logistic Regression on TF-IDF treated training data
The first set of experiments includes stopwords.

In [None]:
lr_train_tfidf = tf_idf.fit_transform(train_df['text'])

In [None]:
# Logistic Regression
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'LogisticRegression', 'TfidfVectorizer',
                                     'None',
                                     clf_score[0], clf_score[1])

In [None]:
lr_train_tfidf_stem = tf_idf.fit_transform(stemmed_train_df['text'])

In [None]:
# Logistic Regression, stemmed
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf_stem, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'LogisticRegression', 'TfidfVectorizer',
                                     'Stemmed',
                                     clf_score[0], clf_score[1])

In [None]:
lr_train_tfidf_lemma = tf_idf.fit_transform(lemmatized_train_df['text'])

In [None]:
# Logistic Regression, lemmatized
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'LogisticRegression', 'TfidfVectorizer',
                                     'Lemmatized',
                                     clf_score[0], clf_score[1])

The second set of experiments excludes stopwords.

In [None]:
lr_train_tfidf_no_stops = tf_idf_no_stops.fit_transform(train_df['text'])

In [None]:
# Logistic Regression, no stopwords
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'LogisticRegression', 'TfidfVectorizer',
                                     'Removed stopwords',
                                     clf_score[0], clf_score[1])

In [None]:
lr_train_tfidf_stem_no_stops = tf_idf_no_stops.fit_transform(stemmed_train_df['text'])

In [None]:
# Logistic Regression, stemmed, no stopwords
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf_stem_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'LogisticRegression', 'TfidfVectorizer',
                                     'Removed stopwords, stemmed',
                                     clf_score[0], clf_score[1])

In [None]:
lr_train_tfidf_lemma_no_stops = tf_idf_no_stops.fit_transform(lemmatized_train_df['text'])

In [None]:
# Logistic Regression, lemmatized, no stopwords
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf_lemma_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'LogisticRegression', 'TfidfVectorizer',
                                     'Removed stopwords, lemmatized',
                                     clf_score[0], clf_score[1])

#### Multinomial Naive Bayes on TF-IDF treated training data
First set of experiments includes stopwords.

In [None]:
mnb_train_tfidf = tf_idf.fit_transform(train_df['text'])

In [None]:
# Multinomial Naive Bayes
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_tfidf, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'MultinomialNB', 'TfidfVectorizer',
                                     'None',
                                     clf_score[0], clf_score[1])

In [None]:
mnb_train_tfidf_stem = tf_idf.fit_transform(stemmed_train_df['text'])

In [None]:
# Multinomial Naive Bayes, stemmed
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_tfidf_stem, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'MultinomialNB', 'TfidfVectorizer',
                                     'Stemmed',
                                     clf_score[0], clf_score[1])

In [None]:
mnb_train_tfidf_lemma = tf_idf.fit_transform(lemmatized_train_df['text'])

In [None]:
# Multinomial Naive Bayes, lemmatized
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_tfidf_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'MultinomialNB', 'TfidfVectorizer',
                                     'Lemmatized',
                                     clf_score[0], clf_score[1])

The second set of experiments excludes stopwords.

In [None]:
mnb_train_tfidf_no_stops = tf_idf_no_stops.fit_transform(train_df['text'])

In [None]:
# Multinomial Naive Bayes, no stopwords
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_tfidf_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'MultinomialNB', 'TfidfVectorizer',
                                     'Removed stopwords',
                                     clf_score[0], clf_score[1])

In [None]:
mnb_train_tfidf_stem_no_stops = tf_idf_no_stops.fit_transform(stemmed_train_df['text'])

In [None]:
# Multinomial Naive Bayes, stemmed, no stopwords
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_tfidf_stem_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'MultinomialNB', 'TfidfVectorizer',
                                     'Removed stopwords, stemmed',
                                     clf_score[0], clf_score[1])

In [None]:
mnb_train_tfidf_lemma_no_stops = tf_idf_no_stops.fit_transform(lemmatized_train_df['text'])

In [None]:
# Multinomial Naive Bayes, lemmatized, no stopwords
scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_tfidf_lemma_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'MultinomialNB', 'TfidfVectorizer',
                                     'Removed stopwords, lemmatized',
                                     clf_score[0], clf_score[1])

#### Support Vector Machine Models
I used a cross-validated grid search (GridSearchCV) to choose the hyperparameters for all SVM models.

The grid-search cells are commented out so that subsequent runs of the notebook skip the search.

In [None]:
# Set up parameter grid for GridSearchCV testing

# my_params = {'C': [0.1, 0.3, 0.5, 0.7],
#              'kernel': ['rbf', 'poly', 'sigmoid'],
#              'degree': [2, 3],
#              'gamma' : ['auto', 'scale'],
#              'class_weight' : ['balanced'],
#              'random_state' : [42],
#              'probability' : [False, True],
# #              'shrinking' : [False, True],
#              'coef0' : [1e2, 0.1, 1, 10]}

In [None]:
# GridSearchCV testing to find best parameters for SVM model

# scorer = make_scorer(f1_score)
# gs_clf = GridSearchCV(svm.SVC(),
#                       param_grid = my_params,
#                       scoring = scorer,
#                       verbose = 1,
#                       n_jobs = -1)
# gs_clf.fit(lr_train_tfidf_lemma, train_df["target"])  # TF-IDF matrix of the lemmatized tweets
# print(gs_clf.best_params_, gs_clf.best_score_)

# results:
# {'C': 0.7,
#  'class_weight': 'balanced',
#  'coef0': 1,
#  'degree': 2,
#  'gamma': 'scale',
#  'kernel': 'sigmoid',
#  'probability': False,
#  'random_state': 42}
# 0.6660730647063914
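
The single grid above crosses every parameter with every kernel, although `degree` only affects the `poly` kernel and `coef0` only matters for `poly` and `sigmoid`. A sketch of a per-kernel grid with the same values, which `GridSearchCV` accepts as a list of dicts. Leaving out `probability` is my assumption, on the expectation that it does not change what `predict` returns for F1 scoring.

```python
from math import prod

# Same values as the commented-out grid, split by kernel so that each kernel
# only varies the parameters it actually uses.
common = {'C': [0.1, 0.3, 0.5, 0.7],
          'gamma': ['auto', 'scale'],
          'class_weight': ['balanced'],
          'random_state': [42]}

per_kernel_grid = [
    {**common, 'kernel': ['rbf']},
    {**common, 'kernel': ['poly'], 'degree': [2, 3], 'coef0': [1e2, 0.1, 1, 10]},
    {**common, 'kernel': ['sigmoid'], 'coef0': [1e2, 0.1, 1, 10]},
]

def grid_size(grid):
    """Number of parameter combinations in one grid dict."""
    return prod(len(values) for values in grid.values())

single_grid_size = 4 * 3 * 2 * 2 * 2 * 4  # C, kernel, degree, gamma, probability, coef0
split_grid_size = sum(grid_size(g) for g in per_kernel_grid)
print(single_grid_size, split_grid_size)
```

Under these assumptions the candidate count drops from 384 to 104 before multiplying by the number of folds.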

The first set of experiments includes stopwords.

In [None]:
svc_train_vectors = count_vectorizer.fit_transform(train_df['text'])

In [None]:
# SVM: CountVectorizer, raw
clf_svc = svm.SVC(C = 0.7,
              kernel = 'sigmoid',
              degree = 2,
              gamma = 'scale',
              class_weight = 'balanced',
              random_state = 42)

scores = model_selection.cross_val_score(clf_svc,
                                         svc_train_vectors, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'SVM', 'CountVectorizer',
                                     'None',
                                     clf_score[0], clf_score[1])

In [None]:
svc_train_vector_stem = count_vectorizer.fit_transform(stemmed_train_df['text'])

In [None]:
# SVM: CountVectorizer, stemmed
scores = model_selection.cross_val_score(clf_svc,
                                         svc_train_vector_stem, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'SVM', 'CountVectorizer',
                                     'Stemmed',
                                     clf_score[0], clf_score[1])

In [None]:
svc_train_vector_lemma = count_vectorizer.fit_transform(lemmatized_train_df['text'])

In [None]:
# SVM: CountVectorizer, lemmatized
scores = model_selection.cross_val_score(clf_svc,
                                         svc_train_vector_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'SVM', 'CountVectorizer',
                                     'Lemmatized',
                                     clf_score[0], clf_score[1])

The second set of experiments excludes stopwords.

In [None]:
svc_train_vector_no_stops = count_vectorizer_no_stops.fit_transform(train_df['text'])

In [None]:
# SVM: CountVectorizer, no stopwords
scores = model_selection.cross_val_score(clf_svc,
                                         svc_train_vector_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'SVM', 'CountVectorizer',
                                     'Removed stopwords',
                                     clf_score[0], clf_score[1])

In [None]:
svc_train_vector_stem_no_stops = count_vectorizer_no_stops.fit_transform(stemmed_train_df['text'])

In [None]:
# SVM: CountVectorizer, stemmed, no stopwords
scores = model_selection.cross_val_score(clf_svc,
                                         svc_train_vector_stem_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'SVM', 'CountVectorizer',
                                     'Removed stopwords, stemmed',
                                     clf_score[0], clf_score[1])

In [None]:
svc_train_vector_lemma_no_stops = count_vectorizer_no_stops.fit_transform(lemmatized_train_df['text'])

In [None]:
# SVM: CountVectorizer, lemmatized, no stopwords
scores = model_selection.cross_val_score(clf_svc,
                                         svc_train_vector_lemma_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'SVM', 'CountVectorizer',
                                     'Removed stopwords, lemmatized',
                                     clf_score[0], clf_score[1])

#### I did an additional set of experiments with SVM, using LSA

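LSA (Latent Semantic Analysis) reduces the TF-IDF matrix to a small number of dense components with truncated SVD; each row is then normalized to unit length.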

In [None]:
svd = decomposition.TruncatedSVD(n_components = 100, random_state = 42)
normalizer = preprocessing.Normalizer()
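
A toy sketch of the LSA steps on a made-up four-document corpus, with 2 components instead of 100 because the toy vocabulary is tiny. It only illustrates the shape of the transformation, not the notebook's actual results.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Made-up documents for illustration
toy_docs = [
    "forest fire near city",
    "forest fire spreads",
    "earthquake hits city",
    "earthquake damage reported",
]

lsa = make_pipeline(
    TfidfVectorizer(),                              # sparse term weights
    TruncatedSVD(n_components=2, random_state=42),  # dense low-rank projection
    Normalizer(),                                   # rescale each row to unit length
)
doc_vectors = lsa.fit_transform(toy_docs)
row_norms = np.linalg.norm(doc_vectors, axis=1)
print(doc_vectors.shape, row_norms)
```

Each document ends up as a short dense vector of unit length, which is the input the SVM and KNN pipelines below receive.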

In [None]:
svc_train_tfidf = tf_idf.fit_transform(train_df['text'])

As before, the first set of experiments included stopwords.

In [None]:
# SVM: LSA, TF-IDF, raw
pipe = pipeline.make_pipeline(svd, normalizer, clf_svc)

scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'SVM', 'LSA',
                                     'None',
                                     clf_score[0], clf_score[1])

In [None]:
svc_train_tfidf_stem = tf_idf.fit_transform(stemmed_train_df['text'])

In [None]:
# SVM: LSA, TF-IDF, stemmed
scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf_stem, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'SVM', 'LSA',
                                     'Stemmed',
                                     clf_score[0], clf_score[1])

In [None]:
svc_train_tfidf_lemma = tf_idf.fit_transform(lemmatized_train_df['text'])

In [None]:
# SVM: LSA, TF-IDF, lemmatized
scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'SVM', 'LSA',
                                     'Lemmatized',
                                     clf_score[0], clf_score[1])

The second set of experiments excluded stopwords.

In [None]:
svc_train_tfidf_no_stops = tf_idf_no_stops.fit_transform(train_df['text'])

In [None]:
# SVM: LSA, TF-IDF, raw, no stopwords
scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'SVM', 'LSA',
                                     'Removed stopwords',
                                     clf_score[0], clf_score[1])

In [None]:
svc_train_tfidf_stem_no_stops = tf_idf_no_stops.fit_transform(stemmed_train_df['text'])

In [None]:
# SVM: LSA, TF-IDF, stemmed, no stopwords
scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf_stem_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'SVM', 'LSA',
                                     'Removed stopwords, stemmed',
                                     clf_score[0], clf_score[1])

In [None]:
svc_train_tfidf_lemma_no_stops = tf_idf_no_stops.fit_transform(lemmatized_train_df['text'])

In [None]:
# SVM: LSA, TF-IDF,lemmatized, no stopwords
scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf_lemma_no_stops, train_df["target"],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'SVM', 'LSA',
                                     'Removed stopwords, lemmatized',
                                     clf_score[0], clf_score[1])

#### KNN Models

I also tried a couple of KNN models, but the results weren't promising.

In [None]:
knn_train_tfidf = tf_idf.fit_transform(train_df['text'])

In [None]:
# LSA -> KNN: TF-IDF, raw
clf_knn = KNeighborsClassifier(n_neighbors=5,
                               algorithm='brute',
                               metric='cosine')

pipe = pipeline.make_pipeline(svd, normalizer, clf_knn)

scores = model_selection.cross_val_score(pipe,
                                         knn_train_tfidf, train_df['target'],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'KNN', 'LSA',
                                     'None',
                                     clf_score[0], clf_score[1])

In [None]:
knn_train_tfidf_no_stops = tf_idf_no_stops.fit_transform(train_df['text'])

In [None]:
# KNN: LSA, TF-IDF, no stopwords
scores = model_selection.cross_val_score(pipe,
                                         knn_train_tfidf_no_stops, train_df['target'],
                                         cv=5,
                                         scoring="f1")

clf_score = helpers.model_scoring(scores)

scoring_df = helpers.score_recording(scoring_df, 'KNN', 'LSA',
                                     'Removed stopwords',
                                     clf_score[0], clf_score[1])

---

### Table of Model Scores
This table is sorted by the mean F1 score of each model, to help me select the final model(s) for submission to the contest.

In [None]:
scoring_df.sort_values(by = 'Mean_F1_Score', ascending = False)

---

### Final models
I used the highest-scoring version of each model type (Logistic Regression, Multinomial Naive Bayes, Support Vector Machine) to create submissions for the Kaggle contest.

In [None]:
# Final models - Logistic Regression, lemmatized
scores = model_selection.cross_val_score(clf_lr,
                                         lr_train_tfidf_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

In [None]:
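# Note: test_df holds the cleaned test tweets (cleaned_test.csv), which are not lemmatized,
# while the vectorizer vocabulary comes from the lemmatized training tweets.
tf_idf.fit(lemmatized_train_df['text'])
test_tfidf_lemma = tf_idf.transform(test_df['text'])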

In [None]:
lr_train_tfidf_lemma.shape

In [None]:
test_tfidf_lemma.shape
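
The two shape checks confirm that the training and test matrices have the same number of columns. A small sketch of why that holds, using made-up texts: `fit_transform` learns the vocabulary from the training texts, and `transform` reuses it for the test texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts for illustration
toy_train_texts = ["forest fire near town", "earthquake damage in town", "fire and smoke"]
toy_test_texts = ["tsunami warning issued", "fire in town"]

toy_vectorizer = TfidfVectorizer()
toy_train_matrix = toy_vectorizer.fit_transform(toy_train_texts)  # learns vocabulary and IDF weights
toy_test_matrix = toy_vectorizer.transform(toy_test_texts)        # reuses them; unseen words are dropped

print(toy_train_matrix.shape, toy_test_matrix.shape)
print(toy_test_matrix[0].nnz)  # the first test text shares no words with the training texts
```

A test tweet made up entirely of unseen words becomes an all-zero row, so the classifier has nothing to go on for it.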

In [None]:
clf_lr.fit(lr_train_tfidf_lemma, train_df["target"])
lr_preds = clf_lr.predict(test_tfidf_lemma)
lr_preds

In [None]:
# Final models - Multinomial Naive Bayes, lemmatized
# this was the best model (Kaggle score: 0.80777)

scores = model_selection.cross_val_score(clf_mnb,
                                         mnb_train_vector_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

In [None]:
count_vectorizer.fit(lemmatized_train_df['text'])
test_vector_lemma = count_vectorizer.transform(test_df['text'])

In [None]:
mnb_train_vector_lemma.shape

In [None]:
test_vector_lemma.shape

In [None]:
clf_mnb.fit(mnb_train_vector_lemma, train_df["target"])
mnb_preds = clf_mnb.predict(test_vector_lemma)
mnb_preds

In [None]:
# Final models - SVM: LSA, TFIDF, lemmatized
pipe = pipeline.make_pipeline(svd, normalizer, clf_svc)

scores = model_selection.cross_val_score(pipe,
                                         svc_train_tfidf_lemma, train_df["target"],
                                         cv=5,
                                         scoring="f1")
mean_score = scores.mean()
stability = scores.std()
print(mean_score, '+/-', stability)

In [None]:
tf_idf.fit(lemmatized_train_df['text'])
test_tfidf_svc_lemma = tf_idf.transform(test_df['text'])

In [None]:
svc_train_tfidf_lemma.shape

In [None]:
test_tfidf_svc_lemma.shape

In [None]:
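# Fit the same LSA pipeline that was cross-validated above, so the test tweets
# go through the SVD and normalization steps as well
pipe.fit(svc_train_tfidf_lemma, train_df["target"])
svc_preds = pipe.predict(test_tfidf_svc_lemma)
svc_preds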

---

### Create submission files.

In [None]:
# submission for Logistic Regression predictions
# model_sub = pd.read_csv('../data/sample_submission.csv')
# model_sub['target'] = lr_preds
# model_sub.to_csv('../data/lr_prediction_submission.csv', index = False)

In [None]:
# submission for Multinomial Naive Bayes predictions
# this one got the best Kaggle score: 0.80777
# model_sub = pd.read_csv('../data/sample_submission.csv')
# model_sub['target'] = mnb_preds
# model_sub.to_csv('../data/mnb_prediction_submission.csv', index = False)

In [None]:
# submission for SVM predictions
# model_sub = pd.read_csv('../data/sample_submission.csv')
# model_sub['target'] = svc_preds
# model_sub.to_csv('../data/svc_prediction_submission.csv', index = False)
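
An alternative sketch that builds the submission from the test ids directly instead of reading `sample_submission.csv`. It assumes the submission needs exactly an `id` and a `target` column; `build_submission` is a hypothetical helper, and the ids and predictions are made up.

```python
import io
import pandas as pd

def build_submission(test_ids, predictions):
    """Pair each test id with its predicted target (hypothetical helper)."""
    if len(test_ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    return pd.DataFrame({'id': list(test_ids),
                         'target': [int(p) for p in predictions]})

# Made-up ids and predictions standing in for test_df['id'] and mnb_preds
submission = build_submission([0, 2, 3], [1, 1, 0])

buffer = io.StringIO()  # stands in for a path such as '../data/mnb_prediction_submission.csv'
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

The length check guards against pairing predictions with the wrong rows, which would otherwise fail silently.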

### Next Steps
There are still some things I can try with these models in order to improve them:
* Vary the number of components in the LSA models
* More/better tuning of hyperparameters in all models
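
As a starting point for the first item, a sketch of searching over the number of LSA components with the whole chain in one pipeline. The toy corpus and the candidate values 2 to 4 are made up to keep the example small; a search on the real data would presumably use much larger values.

```python
from sklearn import svm
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy stand-in for the training tweets and targets (made up for illustration)
texts = [
    "forest fire near the town", "earthquake shakes the city",
    "flood water rising fast", "wildfire evacuation ordered",
    "storm damage reported downtown", "fire crews battle blaze",
    "i love this song", "what a great movie",
    "my day was fine", "coffee with friends today",
    "new phone arrived today", "this game is fun",
]
targets = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

lsa_svm = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(random_state=42),
    Normalizer(),
    svm.SVC(C=0.7, kernel='sigmoid', gamma='scale',
            class_weight='balanced', random_state=42),
)

# make_pipeline names each step after its lowercased class name
param_grid = {'truncatedsvd__n_components': [2, 3, 4]}

search = GridSearchCV(lsa_svm, param_grid=param_grid, scoring='f1', cv=3)
search.fit(texts, targets)
print(search.best_params_, search.best_score_)
```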

I would also like to test TensorFlow and BERT in a Kaggle notebook with the GPU enabled to see how they perform on this problem.