# "A Comparison of TF-IDF, Word2Vec, and Transfer Learning for Text Classification"

- toc: true
- author: David Byron
- comments: true
- categories: [tf-idf, word2vec, neural networks, text classification, natural language processing]

In [1]:
#hide
import warnings
warnings.filterwarnings('ignore')

Text classification is the task of assigning a label to a text based on its content. In modern Natural Language Processing (NLP), there are many different algorithms and techniques used to achieve high accuracy in text classification tasks.

In this notebook, we will cover three of the most popular methods for text classification: TF-IDF, Word2Vec, and transfer learning. For each of the three methods, we will also show their effectiveness based on the amount of preprocessing that is done to the text beforehand, leaving us with a total of nine measurements at the end.

We will see that transfer learning is by far the superior method for the task in terms of ease of use and accuracy.

The data that we will be using comes from [Kaggle's "Real or Not? NLP with Disaster Tweets"](https://www.kaggle.com/c/nlp-getting-started) competition, where the user is tasked with predicting which tweets are about real disasters, and which ones are not.

In the competition, leaderboard position is based on the model's F1 score. Therefore, for clarity, we will provide both the accuracy and F1 score for each output below.

Let's start with some data analysis and augmentation:

## Data Analysis, Augmentation, and Splitting

### Light Analysis

In [1]:
#collapse
import pandas as pd

First, let's take a look at the training data we're given:

In [2]:
data = pd.read_csv('./data/disaster_tweets/train.csv')
data.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


To find out a bit more information about the data, we can use the `.info()` and `.nunique()` methods on our DataFrame:

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [4]:
data.nunique()

id          7613
keyword      221
location    3341
text        7503
target         2
dtype: int64

Interesting! The `text` column has only 7,503 unique values across 7,613 rows, so 110 rows repeat a tweet that already appears elsewhere in the data.
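
To make that count concrete, here is a minimal sketch using only the standard library. The `texts` list is a made-up stand-in for the `text` column, not the real data. It shows that "rows minus unique values" counts the surplus copies, which is not the same as the number of rows that belong to a duplicated group.

```python
from collections import Counter

# Hypothetical stand-in for the `text` column (not the real tweets).
texts = [
    "fire downtown",
    "sunny day",
    "fire downtown",
    "fire downtown",
    "flood warning",
]

counts = Counter(texts)
n_rows = len(texts)
n_unique = len(counts)

# Copies beyond the first occurrence of each text (rows - unique values).
surplus_rows = n_rows - n_unique

# Rows whose text appears more than once anywhere in the data.
rows_in_duplicate_groups = sum(c for c in counts.values() if c > 1)

print(surplus_rows, rows_in_duplicate_groups)  # 2 3
```

On the real DataFrame, `data['text'].duplicated().sum()` should give the same "surplus" figure, since `duplicated()` marks every repeat after the first occurrence.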

### Data Augmentation

#### Cleaning

In [5]:
#collapse
import re
import spacy

As mentioned above, I will preprocess the data in different ways to see whether the changes have a positive or negative effect on our evaluation metrics. The three versions of the data I'll be using are:

1. Unprocessed - the data as it is given to us.
2. "Simply" cleaned - the data lowercased and stripped of hashtag and @ symbols, website links, and punctuation.
3. SpaCy cleaned - the data lemmatized and without any stop words according to SpaCy's pretrained English language model (which we'll get to in a moment).

The unprocessed data is already done for us in the `text` column of our DataFrame.

Moving on to the second preprocessing method: "simply" cleaned data. By "simply" I mean cleaned explicitly by me using [regular expressions](https://docs.python.org/3/library/re.html) with prior assumptions about the data. The data we're using here is a collection of tweets. Therefore, it makes sense to me to remove things like hashtag symbols, @-symbols, and websites, since those don't intuitively seem like they contribute to a tweet's disaster level (though this isn't necessarily true, just an assumption!).

To achieve this "simple" cleaning of the data, we can use the following three functions I've created:

In [6]:
def remove_at_hash(sent):
    """ Returns a lowercased string with the '@' and '#' characters removed. """
    return re.sub(r'@|#', r'', sent.lower())

def remove_sites(sent):
    """ Returns a lowercased string with everything from the first 'http' to the end of the line removed. """
    return re.sub(r'http.*', r'', sent.lower())

def remove_punct(sent):
    """ Returns a lowercased string of word characters only (letters, digits, underscores), joined by single spaces. """
    return ' '.join(re.findall(r'\w+', sent.lower()))

Now we can create a new column in our `data` DataFrame that represents the "simply" cleaned tweets. I'll call this column `text_simple`.

In [7]:
data['text_simple'] = data['text'].apply(lambda x: remove_punct(remove_sites(remove_at_hash(x))))
data.head()

Unnamed: 0,id,keyword,location,text,target,text_simple
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,our deeds are the reason of this earthquake ma...
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada
2,5,,,All residents asked to 'shelter in place' are ...,1,all residents asked to shelter in place are be...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,13 000 people receive wildfires evacuation ord...
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,just got sent this photo from ruby alaska as s...
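
To see what the three helpers do to a single tweet, here is a small self-contained sketch. The tweet is made up, and `remove_sites_only_url` is a hypothetical variant that is not used anywhere in this notebook. I include it only to highlight that the `http.*` pattern also drops any words that follow a link on the same line.

```python
import re

def remove_at_hash(sent):
    return re.sub(r'@|#', r'', sent.lower())

def remove_sites(sent):
    return re.sub(r'http.*', r'', sent.lower())

def remove_punct(sent):
    return ' '.join(re.findall(r'\w+', sent.lower()))

def remove_sites_only_url(sent):
    """Hypothetical variant: drop only the URL token, keep the words after it."""
    return re.sub(r'http\S+', r'', sent.lower())

# Made-up example tweet.
tweet = "Fire near @CityHall! #wildfire http://t.co/abc stay safe"

as_in_notebook = remove_punct(remove_sites(remove_at_hash(tweet)))
url_only = remove_punct(remove_sites_only_url(remove_at_hash(tweet)))

print(as_in_notebook)  # fire near cityhall wildfire
print(url_only)        # fire near cityhall wildfire stay safe
```

Whether the trailing words matter depends on how often tweets put text after a link, so treat the variant as something to experiment with rather than a recommendation.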


Moving now to the last preprocessing method: SpaCy. [SpaCy](https://spacy.io/) is a great open-source software library for NLP. It includes pretrained language models in a range of sizes for a number of different languages, allowing you to quickly perform routine NLP tasks. Here, we're going to use SpaCy to [lemmatize](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) each tweet in the data and remove any [stop words](https://en.wikipedia.org/wiki/Stop_word).

Below, we first load SpaCy's (full) English model (note that, for speed, I disable some features that we won't need here). Then, we create a function that returns a string lemmatized by SpaCy and stripped of stop words.

In [8]:
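# Note: the 'en' shortcut link was removed in spaCy 3; newer versions need a full package name such as 'en_core_web_sm'.
nlp = spacy.load('en', disable=['ner', 'parser'])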

def spacy_cleaning(doc):
    """ Returns a string that has been lemmatized and rid of stop words via SpaCy. """
    doc = nlp(doc.lower())
    text = [token.lemma_ for token in doc if not token.is_stop]
    return ' '.join(text)

Using our new function, we can again create a new column in our `data` DataFrame with the SpaCy-cleaned tweets. I'll call this column `text_spacy`.

In [9]:
data['text_spacy'] = data['text'].apply(spacy_cleaning)
data.head()

Unnamed: 0,id,keyword,location,text,target,text_simple,text_spacy
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,our deeds are the reason of this earthquake ma...,deed reason # earthquake allah forgive
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,forest fire near la ronge sask . canada
2,5,,,All residents asked to 'shelter in place' are ...,1,all residents asked to shelter in place are be...,resident ask ' shelter place ' notify officer ...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,13 000 people receive wildfires evacuation ord...,"13,000 people receive # wildfire evacuation or..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,just got sent this photo from ruby alaska as s...,get send photo ruby # alaska smoke # wildfire ...


#### n-grams

In [10]:
#collapse
from gensim.models.phrases import Phrases, Phraser

> Important: I'll only be applying what we learn in the n-gram section to the Word2Vec model. If you'd like to skip this section, and come back when you get to Word2Vec, feel free to do so.

An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a contiguous sequence of *n* items from a given sample of text or speech. This turns out to be quite useful in NLP. Consider the phrase "New York Times". When all three words are together, the phrase is understood to mean the widely read newspaper of that name based in New York. However, if we split the phrase up (while maintaining the original order), we get: "New York", "York Times", "New", "York", and "Times". These separate words and phrases can occur in many contexts other than those in which the full phrase "New York Times" is found, obscuring the phrase's true meaning in the data. N-gram models allow us to join these commonly occurring multi-word phrases into single tokens, allowing their true meaning to shine through.
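
As a quick illustration of the definition, here is a minimal sketch that lists every contiguous n-gram in a tokenized sentence. The `ngrams` helper and the example sentence are my own for illustration. This is plain enumeration, not the frequency-based phrase detection that gensim performs.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I read the New York Times".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[-2:])  # [('New', 'York'), ('York', 'Times')]
print(trigrams[-1])  # ('New', 'York', 'Times')
```

Most of these n-grams are not meaningful phrases on their own, which is why a phrase model keeps only those that co-occur often enough.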

Thankfully, we can use the `Phraser` and `Phrases` classes provided by [gensim](https://radimrehurek.com/gensim/) in order to easily find n-grams in our data.

Let's start by getting trigrams found in the unprocessed data.

First, we extract the tweets and split them by whitespace characters.

In [11]:
text = [re.split(r'\s+', tweet) for tweet in data['text']]

Then, we find bigrams throughout our data. Here we use a parameter of `min_count=30` for our `Phrases` class. This ensures that only bigrams that occur more than 30 times in the data can be found (a pair must also score above the `threshold` parameter of `Phrases` to be merged). Many combinations of words occur side by side only a few times and don't contribute much additional knowledge to our model, so this is important.

In [12]:
bigram_phrases = Phrases(text, min_count=30)
bigram = Phraser(bigram_phrases)
bigram_text = bigram[text]
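
For some intuition about how `min_count` enters the picture, here is a rough sketch of the default scoring rule as I understand it from gensim's documentation (it follows Mikolov et al.'s phrase-detection formula). The helper name `default_phrase_score`, the counts, and the vocabulary size are all made up for illustration, and the threshold of `10.0` is what I believe the default to be.

```python
def default_phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Discounted bigram count relative to the counts of its two words,
    scaled by the vocabulary size."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

# Made-up numbers for illustration only.
vocab_size = 20000
min_count = 30
threshold = 10.0  # assumed default threshold

# Two words that almost always appear together.
tight = default_phrase_score(60, 55, 50, vocab_size, min_count)

# Two very common words that happen to be adjacent just as often.
loose = default_phrase_score(3000, 2500, 50, vocab_size, min_count)

# A pair seen exactly min_count times scores zero.
borderline = default_phrase_score(60, 55, 30, vocab_size, min_count)

print(tight > threshold, loose > threshold, borderline > threshold)  # True False False
```

Under this rule, a pair has to appear more than `min_count` times and be unusually "sticky" relative to how common its individual words are.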

Next, we can use the bigrams we just made to search for trigrams in the exact same way.

> Note: This is repeatable! Keep going to find n-grams of size 5 if you wanted!

In [13]:
trigram_phrases = Phrases(bigram_text, min_count=30)
trigram = Phraser(trigram_phrases)
trigram_text = trigram[bigram_text]

That's it! Now we can pop this list back into our `data` DataFrame to be used later.

In [14]:
data['text_trigram'] = [' '.join(tweet) for tweet in trigram_text]
data.head()

Unnamed: 0,id,keyword,location,text,target,text_simple,text_spacy,text_trigram
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,our deeds are the reason of this earthquake ma...,deed reason # earthquake allah forgive,Our Deeds are the Reason of this #earthquake M...
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,forest fire near la ronge sask . canada,Forest fire near La Ronge Sask. Canada
2,5,,,All residents asked to 'shelter in place' are ...,1,all residents asked to shelter in place are be...,resident ask ' shelter place ' notify officer ...,All residents asked to 'shelter in place' are ...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,13 000 people receive wildfires evacuation ord...,"13,000 people receive # wildfire evacuation or...","13,000 people receive #wildfires evacuation or..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,just got sent this photo from ruby alaska as s...,get send photo ruby # alaska smoke # wildfire ...,Just got sent this photo from Ruby #Alaska as ...


Great work! Now let's do the same for the `text_simple` and `text_spacy` columns.

In [15]:
#collapse
text_simple = [re.split(r'\s+', tweet) for tweet in data['text_simple']]

bigram_phrases = Phrases(text_simple, min_count=30)
bigram = Phraser(bigram_phrases)
bigram_text_simple = bigram[text_simple]

trigram_phrases = Phrases(bigram_text_simple, min_count=30)
trigram = Phraser(trigram_phrases)
trigram_text_simple = trigram[bigram_text_simple]

data['text_trigram_simple'] = [' '.join(tweet) for tweet in trigram_text_simple]

In [16]:
#collapse
text_spacy = [re.split(r'\s+', tweet) for tweet in data['text_spacy']]

bigram_phrases = Phrases(text_spacy, min_count=30)
bigram = Phraser(bigram_phrases)
bigram_text_spacy = bigram[text_spacy]

trigram_phrases = Phrases(bigram_text_spacy, min_count=30)
trigram = Phraser(trigram_phrases)
trigram_text_spacy = trigram[bigram_text_spacy]

data['text_trigram_spacy'] = [' '.join(tweet) for tweet in trigram_text_spacy]

In [17]:
data.head()

Unnamed: 0,id,keyword,location,text,target,text_simple,text_spacy,text_trigram,text_trigram_simple,text_trigram_spacy
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,our deeds are the reason of this earthquake ma...,deed reason # earthquake allah forgive,Our Deeds are the Reason of this #earthquake M...,our deeds are the reason of this earthquake ma...,deed reason # earthquake allah forgive
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,forest fire near la ronge sask . canada,Forest fire near La Ronge Sask. Canada,forest fire near la ronge sask canada,forest fire near la ronge sask . canada
2,5,,,All residents asked to 'shelter in place' are ...,1,all residents asked to shelter in place are be...,resident ask ' shelter place ' notify officer ...,All residents asked to 'shelter in place' are ...,all residents asked to shelter in place are be...,resident ask ' shelter place ' notify officer ...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,13 000 people receive wildfires evacuation ord...,"13,000 people receive # wildfire evacuation or...","13,000 people receive #wildfires evacuation or...",13 000 people receive wildfires evacuation ord...,"13,000 people receive # wildfire evacuation or..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,just got sent this photo from ruby alaska as s...,get send photo ruby # alaska smoke # wildfire ...,Just got sent this photo from Ruby #Alaska as ...,just got sent this photo from ruby alaska as s...,get send photo ruby # alaska smoke # wildfire ...


Fantastic! We've found all of the trigrams and bigrams in each of our three datasets that occur more than 30 times. This data will prove to be very useful when we reach Word2Vec.

Now that we've got the three separately preprocessed sets of tweets in neat columns in our dataset, it's time to split our data into training and validation data and begin our testing!

### Splitting

In [18]:
#collapse
from sklearn.model_selection import train_test_split

In order to properly evaluate our models, we'll need to split the data into training and validation sets. To do this, we simply pass our `data` DataFrame to sklearn's `train_test_split`. We reset the index of each newly-created DataFrame to avoid complications with indexing later on (note that `reset_index()` keeps the old index as a new `index` column, which is why the split DataFrames have 11 columns instead of 10). Then, we check the shapes to make sure everything adds up.

In [19]:
train, valid = train_test_split(data, random_state=24)

train = train.reset_index()
valid = valid.reset_index()

train.shape, valid.shape, data.shape

((5709, 11), (1904, 11), (7613, 10))
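
The row counts can be checked by hand. The sketch below assumes scikit-learn's default of holding out 25% of the rows when no size is given, and that the validation size is rounded up. Both assumptions match the shapes printed above.

```python
import math

n_rows = 7613
test_fraction = 0.25  # assumed default when neither test_size nor train_size is passed

# The held-out share is rounded up; the training set gets the remainder.
n_valid = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_valid

print(n_train, n_valid, n_train + n_valid)  # 5709 1904 7613
```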

Things are looking good! One last preprocessing step is in order, and that is dividing our newly-created `train` data by their target labels, thereby giving us two new DataFrames representing disaster tweets and non-disaster tweets.

When we call `.nunique()` on both `disasters` and `not_disasters`, we can see that the unique number of `target`s in each DataFrame is 1, indicating we split the data properly.

In [20]:
disasters = train[train['target'] == 1].reset_index()
not_disasters = train[train['target'] == 0].reset_index()

disasters.nunique(), not_disasters.nunique()

(level_0                2450
 index                  2450
 id                     2450
 keyword                 220
 location               1197
 text                   2418
 target                    1
 text_simple            2130
 text_spacy             2417
 text_trigram           2417
 text_trigram_simple    2130
 text_trigram_spacy     2416
 dtype: int64,
 level_0                3259
 index                  3259
 id                     3259
 keyword                 216
 location               1668
 text                   3239
 target                    1
 text_simple            3069
 text_spacy             3237
 text_trigram           3239
 text_trigram_simple    3069
 text_trigram_spacy     3237
 dtype: int64)

Awesome! We're all set and we can begin to train our models.

Let's start with TF-IDF.

## TF-IDF

In [21]:
#collapse
from collections import defaultdict
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import MatrixSimilarity
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

TF-IDF is an incredibly straightforward way to analyze document similarity. It involves no fancy machine learning, just how often each term appears within a document and across the corpus! For this reason, we will begin by trying to use TF-IDF to determine if a tweet is about a disaster or not.

From [tfidf.com](http://www.tfidf.com/):
> Tf-idf stands for *term frequency-inverse document frequency*, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

You can learn more about the mathematical foundations of TF-IDF [here](https://rare-technologies.com/pivoted-document-length-normalisation/).
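
To make the weighting concrete, here is a tiny pure-Python sketch of one common TF-IDF formulation: raw term count times `log2(N / document frequency)`, followed by L2 normalization. The three toy documents are made up, and libraries differ in their exact weighting and smoothing choices, so treat this as an illustration rather than a reimplementation of any particular library.

```python
import math
from collections import Counter

# Made-up toy corpus: each document is a list of tokens.
docs = [
    ["forest", "fire", "near", "town"],
    ["fire", "crews", "contain", "fire"],
    ["traffic", "near", "town", "centre"],
]

n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Map each word in `doc` to count * log2(n_docs / doc_freq), L2-normalized."""
    counts = Counter(doc)
    weights = {w: c * math.log2(n_docs / doc_freq[w]) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()} if norm else weights

vec = tfidf(docs[0])

# "forest" appears in one document, "fire" in two, so "forest" is weighted higher.
print(vec["forest"] > vec["fire"])  # True
```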

We'll start by analyzing the unprocessed tweets.

### TF-IDF with Unprocessed Tweets

In order to calculate the similarity between two tweets (namely, a tweet in the validation set with a tweet in the training set) without having to do all the math out ourselves, we'll use [gensim](https://radimrehurek.com/gensim/), a free Python library that provides a lot of great NLP functionality.

Gensim expects a list of *documents*, where each document is itself a list of *tokens*. For us, that's a list of *tweets*, where each tweet is a list of *words*. So let's make that now.

> Note: We're using the unprocessed tweets in the `text` column of our data this time around. We'll be using the other two preprocessed tweets in a bit!

In [22]:
disaster_tweets = disasters['text'].tolist()
not_disaster_tweets = not_disasters['text'].tolist()

disaster_tweets_split = [
    [word for word in tweet.split()]
    for tweet in disaster_tweets
]
not_disaster_tweets_split = [
    [word for word in tweet.split()]
    for tweet in not_disaster_tweets
]

Thinking about a TF-IDF model, words that only occur once throughout the entire corpus will not provide any noteworthy advantage to the model. Therefore, in the next step, we remove words that only occur once from `disaster_tweets_split` and `not_disaster_tweets_split`.

In [23]:
disaster_tweets_word_frequency = defaultdict(int)
for tweet in disaster_tweets_split:
    for word in tweet:
        disaster_tweets_word_frequency[word] += 1
        
not_disaster_tweets_word_frequency = defaultdict(int)
for tweet in not_disaster_tweets_split:
    for word in tweet:
        not_disaster_tweets_word_frequency[word] += 1

disaster_tweets_split = [
    [word for word in tweet if disaster_tweets_word_frequency[word] > 1]
    for tweet in disaster_tweets_split
]

not_disaster_tweets_split = [
    [word for word in tweet if not_disaster_tweets_word_frequency[word] > 1]
    for tweet in not_disaster_tweets_split
]

Next, we create a Dictionary object with gensim, which is a mapping between words and their integer ids. With this Dictionary object we can create a "corpus" for disaster tweets and non-disaster tweets by converting each document (i.e., tweet) in each set to a Bag of Words format (that is, a list of `(token_id, token_count)` tuples).

In [24]:
disaster_tweets_dct = Dictionary(disaster_tweets_split)
not_disaster_tweets_dct = Dictionary(not_disaster_tweets_split)

disaster_tweets_corpus = [disaster_tweets_dct.doc2bow(tweet) for tweet in disaster_tweets_split]
not_disaster_tweets_corpus = [not_disaster_tweets_dct.doc2bow(tweet) for tweet in not_disaster_tweets_split]
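
If the Bag of Words format is new to you, here is a minimal pure-Python sketch of the idea. The `token2id` mapping and `to_bow` helper are hypothetical stand-ins that I wrote for illustration. Gensim's `Dictionary` may assign different ids, but the shape of the result is the same.

```python
from collections import Counter

# Made-up toy "training" documents.
docs = [["fire", "near", "fire"], ["flood", "near", "river"]]

# Give each distinct token an integer id in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, token_count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bow = to_bow(["fire", "fire", "river", "tornado"])
print(bow)  # [(0, 2), (3, 1)]
```

Note that "tornado" silently disappears because it is not in the mapping. By default, `doc2bow` behaves the same way with words it has never seen, which matters later when we convert validation tweets using the training dictionaries.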

Fit TF-IDF models for our two sets of tweets.

In [25]:
disaster_tweets_tfidf = TfidfModel(disaster_tweets_corpus)
not_disaster_tweets_tfidf = TfidfModel(not_disaster_tweets_corpus)

Apply the models to our corpora to get vectors for each tweet.

In [26]:
disaster_tweets_tfidf_vectors = disaster_tweets_tfidf[disaster_tweets_corpus]
not_disaster_tweets_tfidf_vectors = not_disaster_tweets_tfidf[not_disaster_tweets_corpus]

Create an index for each set that we can query with another vector to compute similarities.

In [27]:
disaster_tweets_similarity = MatrixSimilarity(disaster_tweets_tfidf_vectors)
not_disaster_tweets_similarity = MatrixSimilarity(not_disaster_tweets_tfidf_vectors)
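
The similarity measure used here is cosine similarity. As a rough sketch of what querying such an index amounts to, the snippet below compares one sparse query vector against a small list of sparse vectors. The `cosine` helper and all of the numbers are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {id: weight} dicts."""
    dot = sum(w * v[i] for i, w in u.items() if i in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Made-up TF-IDF-style vectors.
query = {0: 0.8, 3: 0.6}
index = [{0: 1.0}, {3: 0.6, 5: 0.8}, {7: 1.0}]

sims = [cosine(query, doc) for doc in index]
print([round(s, 2) for s in sims])  # [0.8, 0.36, 0.0]
```

Vectors that share no terms score 0, and the score grows as more of the weight falls on shared terms.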

Now we can compare each tweet in the validation set to each set of tweets (disaster and non-disaster) in the training set. Whichever set contains a greater number of "similar enough" tweets (to be determined by a threshold) determines how the validation tweet will be labeled.

First, configure the validation tweets in the same way that we did for the training tweets:

In [28]:
valid_tweets = valid['text'].tolist()

valid_tweets_split = [
    [word for word in tweet.split()]
    for tweet in valid_tweets
]

valid_tweets_word_frequency = defaultdict(int)
for tweet in valid_tweets_split:
    for word in tweet:
        valid_tweets_word_frequency[word] += 1
    
valid_tweets_split = [
    [word for word in tweet if valid_tweets_word_frequency[word] > 1]
    for tweet in valid_tweets_split
]

We now have all the information we need to make our predictions! We can store our predictions in the `valid` DataFrame. This will make for easier access when comparing target to prediction.

To do that, we need to initialize a new column in the DataFrame. Let's call it `prediction`:

In [29]:
valid['prediction'] = np.zeros(len(valid)).astype('int')

In order to make predictions using the model we just created, we have to compare each tweet in the validation data with each tweet in both the `disasters` DataFrame and the `not_disasters` DataFrame.

Therefore, for each tweet, we:
1. Turn it into a BoW according to each set of tweets' Dictionary object.
2. Get a vector for it using each set's TF-IDF model.
3. Compare its vector with each set's full set of tweets using the MatrixSimilarity object we created earlier.
4. Tally up the total number of disaster and non-disaster tweets whose cosine similarity is greater than 0.1.
5. If the disaster tally is greater than the non-disaster tally, we change the value of the prediction column for this tweet in the `valid` DataFrame to 1 (otherwise, it stays 0, indicating a non-disastrous guess).

This is exemplified below:

In [30]:
for row in range(len(valid)):
    tweet = valid_tweets_split[row]
    
    tweet_bow_with_disasters_dct = disaster_tweets_dct.doc2bow(tweet)
    tweet_bow_with_not_disasters_dct = not_disaster_tweets_dct.doc2bow(tweet)
    
    tweet_tfidf_vector_with_disasters_tfidf = disaster_tweets_tfidf[tweet_bow_with_disasters_dct]
    tweet_tfidf_vector_with_not_disasters_tfidf = not_disaster_tweets_tfidf[tweet_bow_with_not_disasters_dct]
    
    disaster_similarity_vector = disaster_tweets_similarity[tweet_tfidf_vector_with_disasters_tfidf]
    not_disaster_similarity_vector = not_disaster_tweets_similarity[tweet_tfidf_vector_with_not_disasters_tfidf]
    
    disaster_tally = np.where(disaster_similarity_vector > 0.1)[0].size # np.where() returns a tuple, so we have to index into [0] to get what we want
    not_disaster_tally = np.where(not_disaster_similarity_vector > 0.1)[0].size
    
    if disaster_tally > not_disaster_tally:
        valid.loc[row, 'prediction'] = 1
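
Stripped of the gensim plumbing, the decision rule in the loop above boils down to a few lines. Here is a minimal sketch on made-up similarity scores. `predict_label` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def predict_label(disaster_sims, not_disaster_sims, threshold=0.1):
    """Return 1 if more disaster tweets than non-disaster tweets clear the threshold."""
    disaster_tally = sum(1 for s in disaster_sims if s > threshold)
    not_disaster_tally = sum(1 for s in not_disaster_sims if s > threshold)
    return int(disaster_tally > not_disaster_tally)

# Made-up similarity scores for three validation tweets.
print(predict_label([0.5, 0.2, 0.05], [0.3, 0.0]))  # 1 (two matches vs one)
print(predict_label([0.5], [0.3, 0.2]))             # 0 (one match vs two)
print(predict_label([0.5], [0.3]))                  # 0 (ties stay non-disaster)
```

One thing to keep in mind is that the tallies are raw counts. Our training split has more non-disaster tweets (3,259) than disaster tweets (2,450), so the larger class has more chances to clear the threshold.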

If all went well, we should be able to see our predictions in the `valid` DataFrame...

In [31]:
valid.head()

Unnamed: 0,index,id,keyword,location,text,target,text_simple,text_spacy,text_trigram,text_trigram_simple,text_trigram_spacy,prediction
0,3068,4402,electrocute,,Kids got Disney version of the game Operation ...,0,kids got disney version of the game operation ...,kid get disney version game operation 2 aa bat...,Kids got Disney version of the game Operation ...,kids got disney version of the game operation ...,kid get disney version game operation 2 aa bat...,1
1,3148,4522,emergency,"Indianapolis, IN",UPDATE: Indiana State Police reopening I-65 ne...,1,update indiana state police reopening i 65 nea...,update : indiana state police reopen i-65 near...,UPDATE: Indiana State Police reopening I-65 ne...,update indiana state police reopening i 65 nea...,update : indiana state police reopen i-65 near...,1
2,3139,4511,emergency,Phoenix,God forbid anyone in my family knows how to an...,0,god forbid anyone in my family knows how to an...,god forbid family know answer phone . need new...,God forbid anyone in my family knows how to an...,god forbid anyone in my family knows how to an...,god forbid family know answer phone . need new...,0
3,7485,10707,wreck,"Alabama, USA",First wreck today. So so glad me and mom are o...,0,first wreck today so so glad me and mom are ok...,wreck today . glad mom okay . lot bad . happy ...,First wreck today. So so glad me and mom are o...,first wreck today so so glad me and mom are ok...,wreck today . glad mom okay . lot bad . happy ...,0
4,6023,8608,seismic,Somalia,Exploration takes seismic shift in Gabon to So...,0,exploration takes seismic shift in gabon to so...,exploration take seismic shift gabon somalia -...,Exploration takes seismic shift in Gabon to So...,exploration takes seismic shift in gabon to so...,exploration take seismic shift gabon somalia -...,0


Look at that! Seems we've made some predictions! But how well did we do?

Let's take a look at both the accuracy and F1 score:

In [32]:
accuracy = accuracy_score(valid['target'], valid['prediction'])
F1 = f1_score(valid['target'], valid['prediction'])
accuracy, F1

(0.6407563025210085, 0.5302197802197802)
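
Since we report both metrics throughout, it may help to see how they differ. The sketch below computes accuracy and F1 by hand on made-up labels. `accuracy_and_f1` is a hypothetical helper written for illustration, while the notebook itself relies on sklearn's `accuracy_score` and `f1_score`.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and the F1 score of the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Made-up labels: three disaster tweets, five non-disaster tweets.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(round(accuracy, 3), round(f1, 3))  # 0.625 0.4
```

Accuracy rewards every correct guess, including the easy non-disaster ones, while F1 only looks at how well the disaster class is found. That is why F1 can sit well below accuracy, as it does in our results.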

`64.08%` accuracy! That's not too shabby for just looking at word frequencies...

But what happens if we calculate tweet similarities using TF-IDF again, but this time using the preprocessed data that we prepared in the last section?

Let's start by seeing how our scores improve with the "simply" cleaned tweets.

### TF-IDF with "Simple" Tweets

Before we go any further, we'll need to get rid of the predictions we just made in `valid`.

In [33]:
valid = valid.drop(columns=['prediction'])
valid.head()

Unnamed: 0,index,id,keyword,location,text,target,text_simple,text_spacy,text_trigram,text_trigram_simple,text_trigram_spacy
0,3068,4402,electrocute,,Kids got Disney version of the game Operation ...,0,kids got disney version of the game operation ...,kid get disney version game operation 2 aa bat...,Kids got Disney version of the game Operation ...,kids got disney version of the game operation ...,kid get disney version game operation 2 aa bat...
1,3148,4522,emergency,"Indianapolis, IN",UPDATE: Indiana State Police reopening I-65 ne...,1,update indiana state police reopening i 65 nea...,update : indiana state police reopen i-65 near...,UPDATE: Indiana State Police reopening I-65 ne...,update indiana state police reopening i 65 nea...,update : indiana state police reopen i-65 near...
2,3139,4511,emergency,Phoenix,God forbid anyone in my family knows how to an...,0,god forbid anyone in my family knows how to an...,god forbid family know answer phone . need new...,God forbid anyone in my family knows how to an...,god forbid anyone in my family knows how to an...,god forbid family know answer phone . need new...
3,7485,10707,wreck,"Alabama, USA",First wreck today. So so glad me and mom are o...,0,first wreck today so so glad me and mom are ok...,wreck today . glad mom okay . lot bad . happy ...,First wreck today. So so glad me and mom are o...,first wreck today so so glad me and mom are ok...,wreck today . glad mom okay . lot bad . happy ...
4,6023,8608,seismic,Somalia,Exploration takes seismic shift in Gabon to So...,0,exploration takes seismic shift in gabon to so...,exploration take seismic shift gabon somalia -...,Exploration takes seismic shift in Gabon to So...,exploration takes seismic shift in gabon to so...,exploration take seismic shift gabon somalia -...


The process this time around will, in fact, be exactly the same as last time! The only change we need to make is that we index into the `text_simple` column of the `disasters`, `not_disasters`, and `valid` DataFrames.

Since the procedure is the same, let's skip to the metrics! (You can still expand the code below if you need a closer look.)

In [34]:
#collapse
disaster_tweets = disasters['text_simple'].tolist()
not_disaster_tweets = not_disasters['text_simple'].tolist()

disaster_tweets_split = [
    [word for word in tweet.split()]
    for tweet in disaster_tweets
]
not_disaster_tweets_split = [
    [word for word in tweet.split()]
    for tweet in not_disaster_tweets
]

disaster_tweets_word_frequency = defaultdict(int)
for tweet in disaster_tweets_split:
    for word in tweet:
        disaster_tweets_word_frequency[word] += 1
        
not_disaster_tweets_word_frequency = defaultdict(int)
for tweet in not_disaster_tweets_split:
    for word in tweet:
        not_disaster_tweets_word_frequency[word] += 1

disaster_tweets_split = [
    [word for word in tweet if disaster_tweets_word_frequency[word] > 1]
    for tweet in disaster_tweets_split
]

not_disaster_tweets_split = [
    [word for word in tweet if not_disaster_tweets_word_frequency[word] > 1]
    for tweet in not_disaster_tweets_split
]

disaster_tweets_dct = Dictionary(disaster_tweets_split)
not_disaster_tweets_dct = Dictionary(not_disaster_tweets_split)

disaster_tweets_corpus = [disaster_tweets_dct.doc2bow(tweet) for tweet in disaster_tweets_split]
not_disaster_tweets_corpus = [not_disaster_tweets_dct.doc2bow(tweet) for tweet in not_disaster_tweets_split]

disaster_tweets_tfidf = TfidfModel(disaster_tweets_corpus)
not_disaster_tweets_tfidf = TfidfModel(not_disaster_tweets_corpus)

disaster_tweets_tfidf_vectors = disaster_tweets_tfidf[disaster_tweets_corpus]
not_disaster_tweets_tfidf_vectors = not_disaster_tweets_tfidf[not_disaster_tweets_corpus]

disaster_tweets_similarity = MatrixSimilarity(disaster_tweets_tfidf_vectors)
not_disaster_tweets_similarity = MatrixSimilarity(not_disaster_tweets_tfidf_vectors)

valid_tweets = valid['text_simple'].tolist()

valid_tweets_split = [
    [word for word in tweet.split()]
    for tweet in valid_tweets
]

valid_tweets_word_frequency = defaultdict(int)
for tweet in valid_tweets_split:
    for word in tweet:
        valid_tweets_word_frequency[word] += 1
    
valid_tweets_split = [
    [word for word in tweet if valid_tweets_word_frequency[word] > 1]
    for tweet in valid_tweets_split
]

valid['prediction'] = np.zeros(len(valid)).astype('int')

for row in range(len(valid)):
    tweet = valid_tweets_split[row]
    
    tweet_bow_with_disasters_dct = disaster_tweets_dct.doc2bow(tweet)
    tweet_bow_with_not_disasters_dct = not_disaster_tweets_dct.doc2bow(tweet)
    
    tweet_tfidf_vector_with_disasters_tfidf = disaster_tweets_tfidf[tweet_bow_with_disasters_dct]
    tweet_tfidf_vector_with_not_disasters_tfidf = not_disaster_tweets_tfidf[tweet_bow_with_not_disasters_dct]
    
    disaster_similarity_vector = disaster_tweets_similarity[tweet_tfidf_vector_with_disasters_tfidf]
    not_disaster_similarity_vector = not_disaster_tweets_similarity[tweet_tfidf_vector_with_not_disasters_tfidf]
    
    disaster_tally = np.where(disaster_similarity_vector > 0.1)[0].size # np.where() returns a tuple, so we have to index into [0] to get what we want
    not_disaster_tally = np.where(not_disaster_similarity_vector > 0.1)[0].size
    
    if disaster_tally > not_disaster_tally:
        valid.loc[row, 'prediction'] = 1

In [35]:
accuracy = accuracy_score(valid['target'], valid['prediction'])
F1 = f1_score(valid['target'], valid['prediction'])
accuracy, F1

(0.6659663865546218, 0.5702702702702702)

`66.60%` accuracy; we've gotten better! Notice that our F1 score has gone up also, from `0.53` to `0.57`.

For the last of the TF-IDF similarities, let's see how things go if we use the tweets that were preprocessed with SpaCy:

### TF-IDF with SpaCy Tweets

The process is the same as before, so let's clear the old predictions from `valid` and skip to the metrics!

In [36]:
valid = valid.drop(columns=['prediction'])

In [37]:
#collapse
disaster_tweets = disasters['text_spacy'].tolist()
not_disaster_tweets = not_disasters['text_spacy'].tolist()

disaster_tweets_split = [
    [word for word in tweet.split()]
    for tweet in disaster_tweets
]
not_disaster_tweets_split = [
    [word for word in tweet.split()]
    for tweet in not_disaster_tweets
]

disaster_tweets_word_frequency = defaultdict(int)
for tweet in disaster_tweets_split:
    for word in tweet:
        disaster_tweets_word_frequency[word] += 1
        
not_disaster_tweets_word_frequency = defaultdict(int)
for tweet in not_disaster_tweets_split:
    for word in tweet:
        not_disaster_tweets_word_frequency[word] += 1

disaster_tweets_split = [
    [word for word in tweet if disaster_tweets_word_frequency[word] > 1]
    for tweet in disaster_tweets_split
]

not_disaster_tweets_split = [
    [word for word in tweet if not_disaster_tweets_word_frequency[word] > 1]
    for tweet in not_disaster_tweets_split
]

disaster_tweets_dct = Dictionary(disaster_tweets_split)
not_disaster_tweets_dct = Dictionary(not_disaster_tweets_split)

disaster_tweets_corpus = [disaster_tweets_dct.doc2bow(tweet) for tweet in disaster_tweets_split]
not_disaster_tweets_corpus = [not_disaster_tweets_dct.doc2bow(tweet) for tweet in not_disaster_tweets_split]

disaster_tweets_tfidf = TfidfModel(disaster_tweets_corpus)
not_disaster_tweets_tfidf = TfidfModel(not_disaster_tweets_corpus)

disaster_tweets_tfidf_vectors = disaster_tweets_tfidf[disaster_tweets_corpus]
not_disaster_tweets_tfidf_vectors = not_disaster_tweets_tfidf[not_disaster_tweets_corpus]

disaster_tweets_similarity = MatrixSimilarity(disaster_tweets_tfidf_vectors)
not_disaster_tweets_similarity = MatrixSimilarity(not_disaster_tweets_tfidf_vectors)

valid_tweets = valid['text_spacy'].tolist()

valid_tweets_split = [
    [word for word in tweet.split()]
    for tweet in valid_tweets
]

valid_tweets_word_frequency = defaultdict(int)
for tweet in valid_tweets_split:
    for word in tweet:
        valid_tweets_word_frequency[word] += 1
    
valid_tweets_split = [
    [word for word in tweet if valid_tweets_word_frequency[word] > 1]
    for tweet in valid_tweets_split
]

valid['prediction'] = np.zeros(len(valid)).astype('int')

for row in range(len(valid)):
    tweet = valid_tweets_split[row]
    
    tweet_bow_with_disasters_dct = disaster_tweets_dct.doc2bow(tweet)
    tweet_bow_with_not_disasters_dct = not_disaster_tweets_dct.doc2bow(tweet)
    
    tweet_tfidf_vector_with_disasters_tfidf = disaster_tweets_tfidf[tweet_bow_with_disasters_dct]
    tweet_tfidf_vector_with_not_disasters_tfidf = not_disaster_tweets_tfidf[tweet_bow_with_not_disasters_dct]
    
    disaster_similarity_vector = disaster_tweets_similarity[tweet_tfidf_vector_with_disasters_tfidf]
    not_disaster_similarity_vector = not_disaster_tweets_similarity[tweet_tfidf_vector_with_not_disasters_tfidf]
    
    disaster_tally = np.where(disaster_similarity_vector > 0.1)[0].size # np.where() returns a tuple, so we have to index into [0] to get what we want
    not_disaster_tally = np.where(not_disaster_similarity_vector > 0.1)[0].size
    
    if disaster_tally > not_disaster_tally:
        valid.loc[row, 'prediction'] = 1

In [38]:
accuracy = accuracy_score(valid['target'], valid['prediction'])
F1 = f1_score(valid['target'], valid['prediction'])
accuracy, F1

(0.6402310924369747, 0.4891871737509322)

With SpaCy lemmatization and removal of stop words, we've actually gotten the worst results of the three datasets, with an accuracy of `64.02%` and an F1 score of `0.49`.

So it seems that of the three preprocessing techniques used in a TF-IDF model, in this case, "simple" cleaning worked the best with an accuracy of `66.60%` and an F1 score of `0.57`.

Let's now move forward with Word2Vec.

## Word2Vec

In [39]:
#collapse
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
import time

The second method for text classification that we'll use is **word vectors**.

Word2Vec word vectors were introduced by Mikolov et al.[[1]](https://arxiv.org/pdf/1301.3781.pdf)[[2]](https://arxiv.org/pdf/1310.4546.pdf) and provide highly accurate results in word similarity tasks at relatively low computational cost. You can think of a word vector as a one-dimensional array of numbers of some chosen length, learned by a shallow neural network. Word similarity is then determined by the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between two vectors.

Word vectors, interestingly, can encode linguistic regularities and patterns, and many of these patterns can be represented as linear translations. For example, `vector(king) - vector(man) + vector(woman)` is going to be very close to `vector(queen)`. This is surprising!
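
As a rough illustration of both ideas, the sketch below computes cosine similarity by hand and checks the king/queen analogy on a handful of made-up two-dimensional vectors. The numbers are invented for the example (one axis loosely standing for "royalty", the other for "gender"); real Word2Vec vectors have many more dimensions and are learned from data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors: [royalty, gender]. Not real Word2Vec output.
toy_vectors = {
    'king':  [0.9,  0.8],
    'queen': [0.9, -0.8],
    'man':   [0.1,  0.8],
    'woman': [0.1, -0.8],
    'apple': [-0.7, 0.1],
}

# vector(king) - vector(man) + vector(woman)
target = [
    k - m + w
    for k, m, w in zip(toy_vectors['king'], toy_vectors['man'], toy_vectors['woman'])
]

# Rank the remaining words by cosine similarity to the target vector.
candidates = {w: v for w, v in toy_vectors.items() if w not in ('king', 'man', 'woman')}
closest_word = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))
print(closest_word)
```

With these toy numbers the closest remaining word is `queen`, because the arithmetic cancels the "gender" component of `king` and adds the one from `woman`.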

Let's see how word vectors do at predicting disaster tweets.

### Word2Vec Preprocessing

We'll be using gensim's Word2Vec module, which filters its vocabulary with a `min_count` parameter: words that occur fewer than `min_count` times in the training input are ignored. This will cause problems later on when trying to classify the tweets in the validation set, because any word that did not make it into the vocabulary has no vector, and looking it up throws an "out-of-vocabulary" (OOV) error.

In order to remedy this, we have two options:
1. Train the Word2Vec model and then remove the words from the validation tweets that are not in the trained vocabulary.
2. Preemptively replace the words in our corpus that occur fewer than the expected `min_count` number of times with some sort of "unknown" token.

Both of these methods alter the original tweet that we'll be classifying, but the latter option seems to adhere closer to the original meaning of the tweet. If we drop words, we could make an entirely new sentence with an entirely new grammatical structure and meaning. Whereas if we replace the words that occur fewer than `min_count` times with an unknown token, the original grammatical structure of each sentence is kept intact, creating a closer tie to the tweet's original meaning.
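
To make the difference between the two options concrete, here is a minimal sketch on a made-up corpus. The texts, the `min_count` value, and the variable names are invented for illustration, and the sketch keeps words seen at least `min_count` times, which is a simplification rather than the exact rule of the function defined further down.

```python
from collections import Counter

# Hypothetical toy corpus; the notebook itself uses the disaster tweets.
train_texts = [
    'fire near the forest',
    'fire near the school',
    'the forest fire spreads',
]
min_count = 2

counts = Counter(word for text in train_texts for word in text.split())
# Words seen at least `min_count` times make up the vocabulary.
vocab = {word for word, n in counts.items() if n >= min_count}

new_tweet = 'huge fire near the lake'.split()

# Option 1: drop out-of-vocabulary words.
dropped = [word for word in new_tweet if word in vocab]

# Option 2: replace out-of-vocabulary words with a placeholder token.
replaced = [word if word in vocab else 'UNK' for word in new_tweet]

print(dropped)
print(replaced)
```

Option 1 shortens the tweet to `fire near the`, while option 2 keeps all five positions and only masks the two rare words, so the shape of the sentence survives.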

To do this efficiently, I've created a function `replace_unknowns()` that replaces the rare words in a list of texts with `'UNK'`. Note that the comparison is strict: a word has to occur *more than* `min_count` times across the texts to be kept. We can use this to alter the preprocessed columns that we made earlier and store them in our original `data` DataFrame.

In [40]:
#collapse
import re
from collections import defaultdict

def replace_unknowns(search_texts, min_count):
    """
    Replaces rare words in a list of strings with 'UNK'.
    
    Parameters
    ----------
    search_texts : list
        A list of input strings to iterate over.
    min_count : int
        Count threshold. A word must occur more than `min_count`
        times across all of `search_texts` to be kept; otherwise
        it is replaced with 'UNK'.
    
    Returns
    -------
    list
        The lowercased texts with rare words replaced by 'UNK'.
    
    """
    
    # Get all tweets lowered and tokenized.
    # This makes sense because we'd never want to
    # treat an 'a' differently from an 'A'.
    # (Capitalization is just an orthographic convention.)
    texts = [re.split(r'\s+', text.lower()) for text in search_texts]

    # Count how often each word occurs across all of the texts.
    vocab_counts = defaultdict(int)
    for text in texts:
        for word in text:
            vocab_counts[word] += 1

    # Keep only the words that occur more than `min_count` times.
    # A set keeps the membership checks below fast.
    vocab = {word for word, count in vocab_counts.items() if count > min_count}

    # Go through each text and replace every word that is
    # not in the vocabulary with 'UNK'.
    out = []
    for text in texts:
        text_replaced = [word if word in vocab else 'UNK' for word in text]
        out.append(' '.join(text_replaced))
        
    return out

Below, we'll use `min_count=5` as one of our [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) in our Word2Vec model, so let's replace the rare words in all three of our trigram tweet columns (`text_trigram`, `text_trigram_simple`, and `text_trigram_spacy`) in each DataFrame with `'UNK'`.

> Important: We're using the tweets with trigrams that we built in the Data Augmentation section for our Word2Vec model. If you skipped it, go back now!

> Note: Normally this would happen during the initial preprocessing stage, allowing us to only need to call `replace_unknowns()` on our initial `data` DataFrame. Because we're calling `replace_unknowns()` after we've already split our data into training and validation sets, we need to call the function on all of the DataFrames we've already created.

In [41]:
#collapse
data['text_count_5'] = replace_unknowns(data['text_trigram'], 5)
data['text_simple_5'] = replace_unknowns(data['text_trigram_simple'], 5)
data['text_spacy_5'] = replace_unknowns(data['text_trigram_spacy'], 5)
data.head()

Unnamed: 0,id,keyword,location,text,target,text_simple,text_spacy,text_trigram,text_trigram_simple,text_trigram_spacy,text_count_5,text_simple_5,text_spacy_5
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,our deeds are the reason of this earthquake ma...,deed reason # earthquake allah forgive,Our Deeds are the Reason of this #earthquake M...,our deeds are the reason of this earthquake ma...,deed reason # earthquake allah forgive,our UNK are the reason of this #earthquake may...,our UNK are the reason of this earthquake may ...,UNK reason # earthquake allah UNK
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,forest fire near la ronge sask . canada,Forest fire near La Ronge Sask. Canada,forest fire near la ronge sask canada,forest fire near la ronge sask . canada,forest fire near la UNK UNK canada,forest fire near la UNK UNK canada,forest fire near la UNK UNK . canada
2,5,,,All residents asked to 'shelter in place' are ...,1,all residents asked to shelter in place are be...,resident ask ' shelter place ' notify officer ...,All residents asked to 'shelter in place' are ...,all residents asked to shelter in place are be...,resident ask ' shelter place ' notify officer ...,all residents asked to UNK in UNK are being UN...,all residents asked to shelter in place are be...,resident ask ' shelter place ' UNK officer . e...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,13 000 people receive wildfires evacuation ord...,"13,000 people receive # wildfire evacuation or...","13,000 people receive #wildfires evacuation or...",13 000 people receive wildfires evacuation ord...,"13,000 people receive # wildfire evacuation or...",UNK people UNK UNK evacuation orders in califo...,13 UNK people UNK wildfires evacuation orders ...,UNK people UNK # wildfire evacuation order cal...
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,just got sent this photo from ruby alaska as s...,get send photo ruby # alaska smoke # wildfire ...,Just got sent this photo from Ruby #Alaska as ...,just got sent this photo from ruby alaska as s...,get send photo ruby # alaska smoke # wildfire ...,just got sent this photo from UNK UNK as smoke...,just got sent this photo from UNK alaska as sm...,get send photo UNK # alaska smoke # wildfire U...


In [42]:
#collapse
valid['text_count_5'] = replace_unknowns(valid['text_trigram'], 5)
valid['text_simple_5'] = replace_unknowns(valid['text_trigram_simple'], 5)
valid['text_spacy_5'] = replace_unknowns(valid['text_trigram_spacy'], 5)

disasters['text_count_5'] = replace_unknowns(disasters['text_trigram'], 5)
disasters['text_simple_5'] = replace_unknowns(disasters['text_trigram_simple'], 5)
disasters['text_spacy_5'] = replace_unknowns(disasters['text_trigram_spacy'], 5)

not_disasters['text_count_5'] = replace_unknowns(not_disasters['text_trigram'], 5)
not_disasters['text_simple_5'] = replace_unknowns(not_disasters['text_trigram_simple'], 5)
not_disasters['text_spacy_5'] = replace_unknowns(not_disasters['text_trigram_spacy'], 5)

Great, now our data is set up and ready to be used with a Word2Vec model!

### Word2Vec with Unprocessed Tweets

First and foremost, let's get rid of the `valid['prediction']` column that we made using TF-IDF.

In [43]:
valid = valid.drop(columns=['prediction'])

Initialize our Word2Vec model.
> Note: I'm splitting up the training of the model into three steps. See [this notebook](https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial/comments) for more details on why (and Word2Vec in general).

In [44]:
model = Word2Vec(min_count=5, sample=1e-3, workers=4, seed=24)

Build the vocab for our model.

The `.build_vocab()` method expects an iterable of lists of strings as its input (one list of tokens per tweet), so first we split our tweets to match that. Notice that we're looping through all of the tweets in our original `data` DataFrame rather than the `train` DataFrame we created. This is because we need the vocabulary of *all* tweets (in both the training and validation data) in order to properly compare tweets in the training data to tweets in the validation data. If we just built our model on the training data, many of the words in the validation tweets would throw OOV errors!

In [45]:
tweets = [
    [wd for wd in tweet.split(' ')]
    for tweet in data['text_count_5']
]

model.build_vocab(tweets)

Now we can train the model over 30 epochs (cycles).

In [46]:
model.train(tweets, total_examples=model.corpus_count, epochs=30)

(2006344, 3383040)

Now we normalize the vectors in the vocabulary for consistency.
> Important: You wouldn't do this if you were going to train further down the line. See [this notebook](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/online_w2v_tutorial.ipynb) for more information about expanding your model's vocabulary.

In [47]:
model.wv.init_sims(replace=True)

Now we can make our predictions.

Just as when we were doing TF-IDF, we need to initialize a `prediction` column in the `valid` DataFrame to store our predictions.

In [48]:
valid['prediction'] = np.zeros(len(valid)).astype('int')

Similar to how we predicted whether a tweet was a disaster or not with TF-IDF, we have to compare each tweet in the validation data with each tweet in both the `disasters` DataFrame and the `not_disasters` DataFrame.

So, this time, for each tweet, we:
1. Split the validation tweet on all whitespace characters.
2. Calculate the similarity between the validation tweet and each disaster and non-disaster tweet (also split on whitespace characters).
3. If the similarity between the two tweets is greater than 0.7, add to that tweet set's tally.
4. If the disaster tally is greater than the non-disaster tally, we change the value of the prediction column for the validation tweet to 1 (otherwise, it remains 0, indicating a non-disastrous guess).
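
The steps above can be sketched on toy data before running them on the real tweets. As I understand it, gensim's `n_similarity` compares two token lists by taking the cosine similarity between the means of their word vectors, so the sketch does the same by hand. The vectors, the example tweets, and the helper names (`mean_vector`, `set_similarity`, `predict`) are all made up for illustration.

```python
import math

# Hypothetical 2-d word vectors; a trained model would supply these.
word_vectors = {
    'fire':  [1.0, 0.1],
    'flood': [0.9, 0.2],
    'smoke': [0.8, 0.3],
    'pizza': [0.1, 1.0],
    'party': [0.2, 0.9],
    'UNK':   [0.5, 0.5],
}

def mean_vector(tokens):
    """Average the vectors of all tokens, dimension by dimension."""
    vectors = [word_vectors[t] for t in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def set_similarity(tokens_a, tokens_b):
    """Cosine similarity between the mean vectors of two token lists."""
    u, v = mean_vector(tokens_a), mean_vector(tokens_b)
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def predict(tweet, disaster_tweets, not_disaster_tweets, threshold=0.7):
    """Return 1 if more disaster tweets than non-disaster tweets clear the threshold."""
    tokens = tweet.split()
    disaster_tally = sum(
        set_similarity(tokens, t.split()) > threshold for t in disaster_tweets
    )
    not_disaster_tally = sum(
        set_similarity(tokens, t.split()) > threshold for t in not_disaster_tweets
    )
    return 1 if disaster_tally > not_disaster_tally else 0

disaster_tweets = ['fire smoke', 'flood UNK', 'fire flood']
not_disaster_tweets = ['pizza party', 'party UNK']

print(predict('smoke fire flood', disaster_tweets, not_disaster_tweets))
print(predict('pizza party', disaster_tweets, not_disaster_tweets))
```

A tie between the two tallies falls back to 0, which matches the default value of the `prediction` column.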

The full prediction loop is below:

> Note: This loop takes a while to run. It took almost 18 minutes on my machine.

In [49]:
start_time = time.time()

for valid_row in range(len(valid)):
    valid_tweet = valid.loc[valid_row, 'text_count_5']
    tokenized_valid_tweet = re.split(r'\s+', valid_tweet) # split on all whitespace characters
    
    disaster_count = 0
    not_disaster_count = 0
    
    # we can just reuse "disasters" and
    # "not_disasters" from earlier!
    for disaster_row in range(len(disasters)):
        disaster_tweet = disasters.loc[disaster_row, 'text_count_5']
        tokenized_disaster_tweet = re.split(r'\s+', disaster_tweet)
        if model.wv.n_similarity(tokenized_valid_tweet, tokenized_disaster_tweet) > 0.7:
            disaster_count += 1
        
    for not_disaster_row in range(len(not_disasters)):
        not_disaster_tweet = not_disasters.loc[not_disaster_row, 'text_count_5']
        tokenized_not_disaster_tweet = re.split(r'\s+', not_disaster_tweet)
        if model.wv.n_similarity(tokenized_valid_tweet, tokenized_not_disaster_tweet) > 0.7:
            not_disaster_count += 1
            
    if disaster_count > not_disaster_count:
        valid.loc[valid_row, 'prediction'] = 1
        
end_time = time.time()
print(f'Runtime: {(end_time - start_time) / 60.0} mins')

Runtime: 17.70957545042038 mins


Now let's take another look at the `valid` DataFrame to see if we've got some predictions...

In [50]:
valid.head()

Unnamed: 0,index,id,keyword,location,text,target,text_simple,text_spacy,text_trigram,text_trigram_simple,text_trigram_spacy,text_count_5,text_simple_5,text_spacy_5,prediction
0,3068,4402,electrocute,,Kids got Disney version of the game Operation ...,0,kids got disney version of the game operation ...,kid get disney version game operation 2 aa bat...,Kids got Disney version of the game Operation ...,kids got disney version of the game operation ...,kid get disney version game operation 2 aa bat...,UNK got UNK UNK of the game UNK only 2 UNK UNK...,kids got UNK UNK of the game UNK only 2 UNK UN...,kid get UNK version game UNK 2 UNK UNK ? UNK o...,0
1,3148,4522,emergency,"Indianapolis, IN",UPDATE: Indiana State Police reopening I-65 ne...,1,update indiana state police reopening i 65 nea...,update : indiana state police reopen i-65 near...,UPDATE: Indiana State Police reopening I-65 ne...,update indiana state police reopening i 65 nea...,update : indiana state police reopen i-65 near...,update: UNK state police UNK UNK near UNK UNK ...,update UNK state police UNK i UNK near UNK UNK...,update : UNK state police UNK UNK near UNK UNK...,0
2,3139,4511,emergency,Phoenix,God forbid anyone in my family knows how to an...,0,god forbid anyone in my family knows how to an...,god forbid family know answer phone . need new...,God forbid anyone in my family knows how to an...,god forbid anyone in my family knows how to an...,god forbid family know answer phone . need new...,god UNK UNK in my family UNK how to UNK a UNK ...,god UNK UNK in my family UNK how to UNK a phon...,god UNK family know UNK phone . need new emerg...,0
3,7485,10707,wreck,"Alabama, USA",First wreck today. So so glad me and mom are o...,0,first wreck today so so glad me and mom are ok...,wreck today . glad mom okay . lot bad . happy ...,First wreck today. So so glad me and mom are o...,first wreck today so so glad me and mom are ok...,wreck today . glad mom okay . lot bad . happy ...,first wreck UNK so so UNK me and UNK are UNK U...,first wreck today so so UNK me and UNK are UNK...,wreck today . UNK UNK UNK . lot bad . UNK UNK ...,0
4,6023,8608,seismic,Somalia,Exploration takes seismic shift in Gabon to So...,0,exploration takes seismic shift in gabon to so...,exploration take seismic shift gabon somalia -...,Exploration takes seismic shift in Gabon to So...,exploration takes seismic shift in gabon to so...,exploration take seismic shift gabon somalia -...,UNK UNK seismic UNK in UNK to UNK - UNK UNK UN...,UNK UNK seismic UNK in UNK to UNK UNK UNK,UNK take seismic UNK UNK UNK - UNK ( UNK ) UNK...,0


Seems to have worked!

Now let's find out the accuracy and F1 score of our Word2Vec model using the unprocessed tweet data.

In [51]:
accuracy = accuracy_score(valid['target'], valid['prediction'])
F1 = f1_score(valid['target'], valid['prediction'])
accuracy, F1

(0.6313025210084033, 0.26875)

`63.13%` accuracy! That's about the same as the TF-IDF model. The F1 score on the other hand... yikes! `0.27`. Horrible!
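
The gap between the two metrics is easier to see on a small made-up example. The labels below are invented, not taken from the validation set: the classifier rarely predicts the positive class, so it is right most of the time while still missing most of the disasters.

```python
# Hypothetical labels: 10 tweets, 4 of them real disasters (1).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A cautious classifier that flags only one tweet as a disaster.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

toy_accuracy = (tp + tn) / len(y_true)
toy_precision = tp / (tp + fp)
toy_recall = tp / (tp + fn)
toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)

print(toy_accuracy, toy_precision, toy_recall, toy_f1)
```

Here the accuracy is 0.7 while the F1 score is only 0.4, because recall is just 0.25. A low F1 next to a middling accuracy usually points to the same pattern: too few tweets are being flagged as disasters.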

Can we improve that with either of the preprocessed tweets?

### Word2Vec with "Simple" Tweets

Once again, we clear out the predictions we've just made from `valid`.

In [52]:
valid = valid.drop(columns=['prediction'])

Just like with TF-IDF (seeing a trend here?), the process this time around will be exactly the same as before. The only change we need to make is that we are now using the `text_simple_5` column of the `data`, `valid`, `disasters`, and `not_disasters` DataFrames.

Since the procedures are the same, let's skip to the metrics! (You can still expand the code below if you need a closer look.)

In [53]:
#collapse
start_time = time.time()

model = Word2Vec(min_count=5, sample=1e-3, workers=4, seed=24)

tweets = [
    [wd for wd in tweet.split(' ')]
    for tweet in data['text_simple_5']
]

model.build_vocab(tweets)

model.train(tweets, total_examples=model.corpus_count, epochs=30)

model.wv.init_sims(replace=True)

valid['prediction'] = np.zeros(len(valid)).astype('int')

for valid_row in range(len(valid)):
    valid_tweet = valid.loc[valid_row, 'text_simple_5']
    tokenized_valid_tweet = re.split(r'\s+', valid_tweet) # split on all whitespace characters
    
    disaster_count = 0
    not_disaster_count = 0
    
    # we can just reuse "disasters" and
    # "not_disasters" from earlier!
    for disaster_row in range(len(disasters)):
        disaster_tweet = disasters.loc[disaster_row, 'text_simple_5']
        tokenized_disaster_tweet = re.split(r'\s+', disaster_tweet)
        if model.wv.n_similarity(tokenized_valid_tweet, tokenized_disaster_tweet) > 0.7:
            disaster_count += 1
        
    for not_disaster_row in range(len(not_disasters)):
        not_disaster_tweet = not_disasters.loc[not_disaster_row, 'text_simple_5']
        tokenized_not_disaster_tweet = re.split(r'\s+', not_disaster_tweet)
        if model.wv.n_similarity(tokenized_valid_tweet, tokenized_not_disaster_tweet) > 0.7:
            not_disaster_count += 1
            
    if disaster_count > not_disaster_count:
        valid.loc[valid_row, 'prediction'] = 1
        
end_time = time.time()
print(f'Runtime: {(end_time - start_time) / 60.0} mins')

Runtime: 15.660316868623097 mins


In [54]:
accuracy = accuracy_score(valid['target'], valid['prediction'])
F1 = f1_score(valid['target'], valid['prediction'])
accuracy, F1

(0.6754201680672269, 0.44821428571428573)

Quite an improvement! Our accuracy and F1 score went up to `67.54%` and `0.45`, respectively.

Now let's see how the SpaCy tweets perform in our Word2Vec model.

### Word2Vec with SpaCy Tweets

Same process as before: let's clear the old predictions from `valid` and skip to the metrics!

In [55]:
valid = valid.drop(columns=['prediction'])

In [56]:
#collapse
start_time = time.time()

model = Word2Vec(min_count=5, sample=1e-3, workers=4, seed=24)

tweets = [
    [wd for wd in tweet.split(' ')]
    for tweet in data['text_spacy_5']
]

model.build_vocab(tweets)

model.train(tweets, total_examples=model.corpus_count, epochs=30)

model.wv.init_sims(replace=True)

valid['prediction'] = np.zeros(len(valid)).astype('int')

for valid_row in range(len(valid)):
    valid_tweet = valid.loc[valid_row, 'text_spacy_5']
    tokenized_valid_tweet = re.split(r'\s+', valid_tweet) # split on all whitespace characters
    
    disaster_count = 0
    not_disaster_count = 0
    
    # we can just reuse "disasters" and
    # "not_disasters" from earlier!
    for disaster_row in range(len(disasters)):
        disaster_tweet = disasters.loc[disaster_row, 'text_spacy_5']
        tokenized_disaster_tweet = re.split(r'\s+', disaster_tweet)
        if model.wv.n_similarity(tokenized_valid_tweet, tokenized_disaster_tweet) > 0.7:
            disaster_count += 1
        
    for not_disaster_row in range(len(not_disasters)):
        not_disaster_tweet = not_disasters.loc[not_disaster_row, 'text_spacy_5']
        tokenized_not_disaster_tweet = re.split(r'\s+', not_disaster_tweet)
        if model.wv.n_similarity(tokenized_valid_tweet, tokenized_not_disaster_tweet) > 0.7:
            not_disaster_count += 1
            
    if disaster_count > not_disaster_count:
        valid.loc[valid_row, 'prediction'] = 1
        
end_time = time.time()
print(f'Runtime: {(end_time - start_time) / 60.0} mins')

Runtime: 14.212949315706888 mins


In [57]:
accuracy = accuracy_score(valid['target'], valid['prediction'])
F1 = f1_score(valid['target'], valid['prediction'])
accuracy, F1

(0.648109243697479, 0.33399602385685884)

SpaCy, this time, comes in the middle of our three tests with an accuracy of `64.81%` and F1 score of `0.33`.

Among the three datasets trained with a Word2Vec model, the "simple" tweets seem to have it again with an accuracy of `67.54%` and an F1 score of `0.45`.

Lastly, let's turn to transfer learning.

## Transfer Learning with fastai

In [58]:
from fastai.text.all import *

Rather than train our own neural network from scratch to compete with something like Word2Vec, we can use transfer learning: we take a model that's already been trained on far more data than we have and quickly adapt it to our own text.

From [Jason Brownlee](https://machinelearningmastery.com/transfer-learning-for-deep-learning/):
> Transfer learning is a machine learning method where a model developed for a task is reused as the starting point for a model on a second task.

In order to perform transfer learning, we'll be using [fastai](https://docs.fast.ai/). Fastai is great because it really simplifies the training procedure, thereby making it super easy to perform an array of deep learning tasks.

We'll need two functions from fastai to conduct transfer learning with text: `language_model_learner` and `text_classifier_learner`. The former will allow us to shape the pretrained model with our own data to make a new language model, while the latter will allow us to create a classifier model for the tweets we have (the same task we've been doing above).

Let's start, per usual, with the unprocessed tweets.

### Transfer Learning with Unprocessed Tweets

Fastai uses PyTorch under the hood, which requires our data to be formatted in [a specific way](https://pytorch.org/docs/stable/data.html). In order to do this most efficiently, we can use fastai's `DataBlock` object and `.dataloaders()` method. With `DataBlock`, we can:
1. Directly pull our columns from the dataframe that we'd like to train *and* test on.
2. Split the data however we'd like.
3. [And more!](https://docs.fast.ai/data.block#DataBlock)

Let's start by creating a `DataBlock` that we'll pass to `language_model_learner` to create a new language model tailored to our data.

In [59]:
dls_lm = DataBlock(
    blocks=(TextBlock.from_df('text', is_lm=True)),
    get_items=ColReader('text'),
    splitter=RandomSplitter(0.1)
).dataloaders(data, bs=128, seq_len=80)

Note that there is only one block in the `DataBlock` we just created: a `TextBlock`. All we need to create a language model is the text (we don't care about the categories yet), so we only need one block in the `DataBlock`. We also need to specify the parameter `is_lm=True` when creating the `TextBlock`, to specify that this is our language model.

Now we can use `.show_batch()` to take a look at our newly formatted data:

In [60]:
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,xxbos xxmaj student electrocuted to death in school campus http : / / t.co / xxunk xxbos xxunk xxunk xxunk xxmaj i 'm aware that not all xxup as are from countries we have bombed but a lot are xxunk conflict xxbos xxmaj i 'm about to be obliterated xxbos xxmaj full xxmaj episode : xxup xxunk 08 / 02 / 15 : xxmaj california xxmaj wild xxmaj fires xxmaj force 12 xxrep 3 0 to xxmaj evacuate # xxmaj,xxmaj student electrocuted to death in school campus http : / / t.co / xxunk xxbos xxunk xxunk xxunk xxmaj i 'm aware that not all xxup as are from countries we have bombed but a lot are xxunk conflict xxbos xxmaj i 'm about to be obliterated xxbos xxmaj full xxmaj episode : xxup xxunk 08 / 02 / 15 : xxmaj california xxmaj wild xxmaj fires xxmaj force 12 xxrep 3 0 to xxmaj evacuate # xxmaj worldnews
1,xxup xxunk : xxunk in yr xxunk election xxunk w / landslide win for xxunk ' http : / / t.co / xxunk xxbos xxup put xxup sandstorm xxup down xxrep 4 ! https : / / t.co / xxunk xxbos xxunk xxmaj enjoy the xxunk landslide xxmaj todd . xxmaj xxunk . xxbos … / / .. / / whao .. 12 xxrep 3 0 xxmaj nigerian refugees repatriated from xxmaj cameroon http : / / t.co / xxunk,xxunk : xxunk in yr xxunk election xxunk w / landslide win for xxunk ' http : / / t.co / xxunk xxbos xxup put xxup sandstorm xxup down xxrep 4 ! https : / / t.co / xxunk xxbos xxunk xxmaj enjoy the xxunk landslide xxmaj todd . xxmaj xxunk . xxbos … / / .. / / whao .. 12 xxrep 3 0 xxmaj nigerian refugees repatriated from xxmaj cameroon http : / / t.co / xxunk xxbos
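
In the batch shown above, the `text_` column is the `text` column shifted one token to the left: it is the target the language model is asked to predict at each position. A minimal sketch of that pairing, using a made-up token stream and a hypothetical `lm_pairs` helper rather than fastai's actual batching code:

```python
# Hypothetical token stream; xxbos and xxmaj mimic fastai's special tokens.
tokens = ['xxbos', 'xxmaj', 'forest', 'fire', 'near', 'la', 'ronge', 'xxbos', 'xxmaj', 'flood']
seq_len = 4

def lm_pairs(stream, seq_len):
    """Yield (input, target) windows where the target is the input shifted by one token."""
    for start in range(0, len(stream) - seq_len, seq_len):
        x = stream[start:start + seq_len]
        y = stream[start + 1:start + seq_len + 1]
        yield x, y

pairs = list(lm_pairs(tokens, seq_len))
for x, y in pairs:
    print(x, '->', y)
```

Each target window starts one token later than its input, so predicting `y` from `x` amounts to predicting the next token at every position.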


We can now instantiate our `language_model_learner` using the `DataLoaders` we just created (`dls_lm`) and `AWD_LSTM`, an architecture for which fastai provides pretrained weights. You can learn more about `AWD_LSTM` [here](https://arxiv.org/pdf/1708.02182.pdf).

In [61]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()])

All that's left to do is fit our language model!

You'll note that fastai also provides super clear, customizable output for each training cycle.

In [62]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.49352,3.626821,0.39302,37.593121,01:38


In [63]:
learn.fit_one_cycle(10,2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.756268,3.606677,0.39541,36.84343,01:37
1,3.722715,3.534695,0.399615,34.284554,01:38
2,3.670052,3.445324,0.408519,31.353439,01:36
3,3.608591,3.37717,0.415497,29.287779,01:36
4,3.545309,3.328276,0.421299,27.890228,01:37
5,3.48371,3.294219,0.425069,26.956341,01:37
6,3.428636,3.272936,0.427655,26.388706,01:37
7,3.388489,3.262204,0.429664,26.107006,01:37
8,3.352031,3.257177,0.430366,25.976101,01:37
9,3.335237,3.256474,0.430461,25.957844,01:37


The accuracy above represents the model's ability to predict the next word in a sequence from our disaster tweets data. `43.05%`! That's pretty dang good for something that took about the same time as our Word2Vec models.
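
The table also reports perplexity. As far as I know, fastai's `Perplexity` metric is simply the exponential of the validation cross-entropy loss, which can be checked against the last row of the table:

```python
import math

valid_loss = 3.256474  # valid_loss from the final epoch in the table
perplexity = math.exp(valid_loss)
print(round(perplexity, 2))
```

The result agrees with the reported perplexity of about `25.96`. Loosely speaking, the model is about as uncertain as if it were choosing uniformly among 26 tokens at each step.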

But we're not after text prediction, we're after text classification. Let's turn to that now.

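First, let's create a `DataBlock` that we'll pass to `text_classifier_learner`. Notice that now we're passing two blocks to the `blocks` parameter: `TextBlock` and `CategoryBlock`. The `get_x` and `get_y` parameters tell each block which column to read from. It is also important to note the new `TextBlock` parameter `vocab`. Without this, the token indices from the language model fitting above would not line up with the classifier's! Keep in mind that the vocabulary only aligns the token indices; in fastai's usual workflow the fine-tuned weights are carried over separately, with `learn.save_encoder()` after the language model fit and `learn.load_encoder()` on the classifier, a step these cells don't show.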

In [64]:
dls_clas = DataBlock(
    blocks=(TextBlock.from_df('text', vocab=dls_lm.vocab, seq_len=80), CategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('target'),
    splitter=RandomSplitter()
).dataloaders(data, bs=128, seq_len=80)


Check to see that our data is how we want it.

In [65]:
dls_clas.show_batch(max_n=3)

Unnamed: 0,text,category
0,xxbos _ xxunk xxrep 5 ? xxup xxunk xxunk xxrep 7 ? xxunk xxrep 5 ? xxup follow xxup all xxup who xxup rt xxunk xxrep 7 ? xxunk xxrep 5 ? xxup xxunk xxunk xxrep 7 ? xxunk xxrep 5 ? xxup xxunk xxup with xxunk xxrep 7 ? xxunk xxrep 5 ? xxup follow ? xxunk # xxup xxunk xxunk # xxup ty,0
1,xxbos xxup info xxup u. xxup xxunk : xxup xxunk xxup xxunk . xxup exp xxup inst xxup apch . xxup rwy 05 . xxup curfew xxup in xxup oper xxup until 2030 xxup z. xxup taxiways xxup foxtrot 5 & & xxup foxtrot 6 xxup navbl . xxup tmp : 10 . xxup wnd : xxunk / 6 . xxpad xxpad xxpad xxpad xxpad,0
2,xxbos xxup info xxup r. xxup curfew xxup in xxup oper xxup until 2030 xxup z. xxup taxiways xxup foxtrot 5 & & xxup foxtrot 6 xxup navbl . xxup wnd : xxunk / 5 . xxup exp xxup inst xxup apch . xxup rwy 05 . xxup xxunk . xxup tmp : 10 . xxup xxunk : xxunk . xxpad xxpad xxpad xxpad xxpad,0
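
The `xx`-prefixed tokens in the batch above are special markers added during tokenization: `xxbos` marks the beginning of a text, `xxup` flags that the next word was written in all caps, `xxrep 5 ?` stands for a character repeated five times, `xxunk` replaces a word that isn't in the vocabulary, and `xxpad` pads shorter texts so that everything in a batch has the same length. To make a few of these rules concrete, here's a toy sketch. The function `toy_special_tokens` is hypothetical and heavily simplified; it only mimics the idea and is not fastai's tokenizer.

```python
import re

def toy_special_tokens(text):
    """Simplified imitation of a few fastai-style special-token rules."""
    # Collapse a character repeated 3+ times: "?????" -> " xxrep 5 ? "
    text = re.sub(
        r"(\S)\1{2,}",
        lambda m: f" xxrep {len(m.group(0))} {m.group(1)} ",
        text,
    )
    tokens = ["xxbos"]  # every text starts with a beginning-of-stream marker
    for word in text.split():
        if word.isupper() and len(word) > 1:
            # ALL-CAPS word: emit a flag, then the lowercased word
            tokens += ["xxup", word.lower()]
        elif word[0].isupper() and word[1:].islower():
            # Capitalized word: emit a flag, then the lowercased word
            tokens += ["xxmaj", word.lower()]
        else:
            tokens.append(word)
    return tokens

print(toy_special_tokens("FOLLOW ALL who RT ?????"))
```

Lowercasing everything while keeping a flag token lets the model share one vocabulary entry for `follow` and `FOLLOW` without losing the information that the tweet was shouting.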


Now it's time to create our text classifier model, again using transfer learning from the `AWD_LSTM` model provided by fastai. This time we want to see the accuracy and F1 score when testing on the validation set.

In [66]:
learn = text_classifier_learner(
    dls_clas, AWD_LSTM, drop_mult=0.5,
    metrics=[accuracy, F1Score()])
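
As a reminder of what the two metrics passed to the learner measure, here's a minimal sketch that computes accuracy and F1 by hand for a binary task. The helper `accuracy_and_f1` and the toy labels are made up for illustration; they are not the notebook's data or fastai's implementation.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1

# Toy labels: 1 = real disaster tweet, 0 = not a disaster
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, f1={f1:.3f}")
```

Accuracy counts every correct prediction equally, while F1 only rewards getting the positive class right, which is why the two numbers can drift apart when a model leans toward predicting "not a disaster".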

Now we can fit:

In [67]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,f1_score,time
0,0.758835,0.490689,0.770696,0.712284,00:44


And that's. It.

Crazy, right?! One last step that we need to take care of to inch our model's accuracy up further is [gradual unfreezing](https://stats.stackexchange.com/questions/393168/what-does-it-mean-to-freeze-or-unfreeze-a-model). Unfreezing a few layers at a time seems to make a meaningful difference in NLP, so we'll do that here (in computer vision, the model will often be unfrozen all at once).

In [68]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,f1_score,time
0,0.650367,0.483046,0.781209,0.740047,00:53


In [69]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

epoch,train_loss,valid_loss,accuracy,f1_score,time
0,0.559126,0.457762,0.800263,0.743243,01:29


In [70]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,f1_score,time
0,0.504711,0.439913,0.803548,0.766224,02:07
1,0.486418,0.433688,0.808147,0.769716,02:06
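
The `slice(lo, hi)` learning rates used in the unfreezing cells above give different layer groups different learning rates: the earliest group trains at `lo`, the last group (the classifier head) at `hi`, and the groups in between are spaced out multiplicatively. Here's a minimal sketch of that spread. The helper `discriminative_lrs` is our own stand-in, and the number of layer groups (five) is an assumption chosen because it makes the `2.6**4` divisor work out to a factor of 2.6 between neighbouring groups.

```python
def discriminative_lrs(lowest, highest, n_groups):
    """Spread learning rates geometrically from `lowest` to `highest` over n_groups."""
    if n_groups == 1:
        return [highest]
    step = (highest / lowest) ** (1 / (n_groups - 1))  # constant ratio between groups
    return [lowest * step**i for i in range(n_groups)]

N_GROUPS = 5  # assumption: number of layer groups in the classifier

# The three upper learning rates used while gradually unfreezing
for top_lr in (1e-2, 5e-3, 1e-3):
    lrs = discriminative_lrs(top_lr / 2.6**4, top_lr, N_GROUPS)
    print(", ".join(f"{lr:.2e}" for lr in lrs))
```

The idea is that the early layers already hold general language knowledge and should change slowly, while the later layers need larger updates to adapt to our classification task.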


After fully unfreezing and fitting our model, our accuracy is... `80.81%`! Over 11 percentage points better than our best Word2Vec! Impressive. Our F1 score of `0.770` also blows away our best Word2Vec F1 score of `0.450`. Impressive, indeed.

But how will transfer learning perform with the preprocessed tweets? Let's find out!

### Transfer Learning with "Simple" Tweets

In order to repeat the same process for transfer learning on the preprocessed tweets, we'll need to create a whole new language model for each set. This is done in almost exactly the same way as above. The two differences are:
1. The column that you're selecting from will change from `text` to `text_simple` or `text_spacy`.
2. The `get_x` parameter when creating the `DataBlock` for the `text_classifier_learner`, `dls_clas`, must *remain* `text`, no matter the name of the column in the DataFrame that you are using as the independent variable. [[1]](https://forums.fast.ai/t/issue-with-textblock-from-df-dataloaders-only-accepting-one-column-name/77467)

Knowing this, let's fit our language model!

In [71]:
#collapse
dls_lm = DataBlock(
    blocks=(TextBlock.from_df('text_simple', is_lm=True)),
    get_items=ColReader('text_simple'),
    splitter=RandomSplitter(0.1)
).dataloaders(data, bs=128, seq_len=80)

learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1, 2e-2)
learn.fit_one_cycle(10,2e-3)

dls_clas = DataBlock(
    blocks=(TextBlock.from_df('text_simple', vocab=dls_lm.vocab, seq_len=80), CategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('target'),
    splitter=RandomSplitter()
).dataloaders(data, bs=128, seq_len=80)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,6.019077,5.163853,0.198145,174.836746,00:56


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,5.322052,5.14282,0.199809,171.197922,00:56
1,5.295057,5.063235,0.211279,158.101196,00:56
2,5.253794,4.969788,0.215545,143.996292,00:56
3,5.187418,4.889296,0.228875,132.859955,00:56
4,5.118052,4.82938,0.229359,125.133362,00:56
5,5.050055,4.783056,0.237052,119.468918,00:56
6,4.993558,4.753863,0.240452,116.031631,00:56
7,4.942537,4.738311,0.240891,114.241066,00:56
8,4.901583,4.73207,0.242556,113.53038,00:56
9,4.869204,4.731,0.242702,113.408905,00:56




Fit our text classifier:

In [72]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=[accuracy, F1Score()])
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,f1_score,time
0,0.737682,0.509384,0.750986,0.726354,00:27


Now gradually unfreeze:

In [73]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,f1_score,time
0,0.617487,0.482516,0.769382,0.729375,00:32


In [74]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

epoch,train_loss,valid_loss,accuracy,f1_score,time
0,0.534399,0.475102,0.775296,0.717355,00:55


In [75]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,f1_score,time
0,0.489055,0.448742,0.799606,0.764842,01:19
1,0.463102,0.445263,0.802234,0.766847,01:18


Accuracy: `80.22%`. F1 score: `0.767`.

Nearly the same as the unprocessed tweets, but not quite as good. This is the opposite of what happened with TF-IDF and Word2Vec, where the "Simple" preprocessing helped.

Let's see how the SpaCy tweets perform:

### Transfer Learning with SpaCy Tweets

Let's do the same thing with our tweets preprocessed with SpaCy.

First, the language model:

In [76]:
#collapse
dls_lm = DataBlock(
    blocks=(TextBlock.from_df('text_spacy', is_lm=True)),
    get_items=ColReader('text_spacy'),
    splitter=RandomSplitter(0.1)
).dataloaders(data, bs=128, seq_len=80)

learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1, 2e-2)
learn.fit_one_cycle(10,2e-3)

dls_clas = DataBlock(
    blocks=(TextBlock.from_df('text_spacy', vocab=dls_lm.vocab, seq_len=80), CategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('target'),
    splitter=RandomSplitter()
).dataloaders(data, bs=128, seq_len=80)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,5.713386,4.250023,0.369869,70.107018,01:03


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.653237,4.210376,0.370858,67.381859,01:02
1,4.582531,4.049045,0.403041,57.342648,01:02
2,4.480067,3.906515,0.421333,49.725338,01:01
3,4.384799,3.809164,0.434499,45.112709,01:01
4,4.299644,3.742372,0.440083,42.197945,01:02
5,4.220522,3.691609,0.44446,40.109318,01:03
6,4.16679,3.660479,0.448433,38.879967,01:04
7,4.110455,3.642861,0.450266,38.20097,01:02
8,4.065619,3.636134,0.450706,37.944855,01:01
9,4.030597,3.635027,0.450852,37.902893,01:01




Then, fit the text classifier:

In [77]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=[accuracy, F1Score()])
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,f1_score,time
0,0.792957,0.534716,0.754271,0.710078,00:30


Gradually unfreeze the model:

In [78]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,f1_score,time
0,0.677191,0.511442,0.761498,0.712133,00:36


In [79]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

epoch,train_loss,valid_loss,accuracy,f1_score,time
0,0.589788,0.478241,0.780552,0.710069,01:02


In [80]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,f1_score,time
0,0.545371,0.475847,0.784494,0.697417,01:27
1,0.531533,0.459217,0.787122,0.710714,01:28


Et voilà! Accuracy: `78.71%`. F1 score: `0.711`.

Whoa! These are the lowest scores of our three transfer learning models. How unexpected.

So between the three datasets used in transfer learning, the unprocessed dataset seemed to perform the best! Unexpected, indeed.

# Conclusion

Now that we've gone through each model: TF-IDF, Word2Vec, and transfer learning, it's time to compare the results:

Model | Dataset | Accuracy | F1 Score
---------- | ----------- | ------------- | -----------
**TF-IDF** | Unprocessed | 64.08% | 0.530
'' | "Simple" | 66.60% | 0.570
'' | SpaCy | 64.02% | 0.489
**Word2Vec** | Unprocessed | 63.13% | 0.269
'' | "Simple" | 67.54% | 0.448
'' | SpaCy | 64.81% | 0.334
**Transfer Learning** | **Unprocessed** | **80.81%** | **0.770**
'' | "Simple" | 80.22% | 0.767
'' | SpaCy | 78.71% | 0.711

And the winner is, unsurprisingly, transfer learning! What is surprising, however, is that of the three datasets that we used for transfer learning, the unprocessed dataset yielded the best results. This provides strong support for transfer learning, as it is able to pick up on the nuances of natural language rather than relying on augmented, unnatural language.

If you're interested in getting more involved with transfer learning, I strongly recommend Jeremy Howard and Rachel Thomas' course [Deep Learning for Coders](https://youtu.be/_QUEXsHfsA0). At the time of writing, this is an excellent resource for getting a really good, modern grasp of deep learning, provided you've got some basic Python programming experience. And it's all free!

With that, I'll leave the reader to experiment further with text classification and language modeling.

Questions I'm now asking myself:
* What other preprocessing methods or data augmentation techniques could we have used?
* What's a transformer?
* How does BERT work?
* Where else can we apply text classification to somehow learn something meaningful?