In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Brief Description of Problem

This is an introductory competition on Kaggle for learning Natural Language Processing (NLP) techniques.  We are given tweets and have to determine whether or not they are actually announcing a disaster.  This is a simple binary classification problem (only 2 categories).

# Exploratory Data Analysis

#### Load Dataset

In [2]:
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')

In [3]:
train.head()

In [4]:
test.head()

In [5]:
len(test)

Since we only want to train on the text, I will remove the columns 'id', 'keyword', and 'location'.  The target column will be our labels.

In [6]:
train = train.drop(['id', 'keyword', 'location'], axis=1)
test = test.drop(['id', 'keyword', 'location'], axis=1)

In [7]:
print(train.shape)
print(test.shape)

In [8]:
train['target'].value_counts()

# Preprocessing Data

We see the training dataset is somewhat imbalanced.  I will further clean the training data before handling this imbalance.

I want to look at more entries to see what characters/cleaning may be necessary.  I will select 50 random entries from the training set.

In [9]:
import random

rand_idx = random.sample(list(train.index),50)

for idx in rand_idx:
    print(train.iloc[idx,0])


So, we see there are a lot of links (http://...) and tweets directed at certain users.  Removing URLs and strings that begin with @ will be the initial step in cleaning.

In [10]:
import re

In [11]:
def remove_links(sentence):
    link = re.compile(r'https?://\S+')
    return link.sub(r'', sentence)

def remove_targeted_tweets(sentence):
    tgt_twt = re.compile(r'@\S+')
    return tgt_twt.sub(r'', sentence)

def clean_data(data):
    data['text'] = data['text'].apply(lambda x : remove_links(x))
    data['text'] = data['text'].apply(lambda x : remove_targeted_tweets(x))    
    return data

In [12]:
train_cleaned = clean_data(train)
test_cleaned = clean_data(test)

Now I will look at the same subset again.

In [13]:
for idx in rand_idx:
    print(train_cleaned.iloc[idx,0])

So, we see the links and targets of tweets have been removed, leaving just the information from the "body" of the tweet.
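To make the effect of the two patterns easier to see in isolation, here is a minimal sketch that applies them to a single made-up tweet.  The helper name `strip_links_and_mentions` and the sample text are assumptions for illustration only; the extra whitespace collapse at the end is also an addition that the cleaning functions above do not perform.

```python
import re

# Same regular expressions as in the cleaning functions above
LINK = re.compile(r'https?://\S+')
MENTION = re.compile(r'@\S+')

def strip_links_and_mentions(sentence):
    """Remove URLs and @-mentions, then collapse leftover whitespace."""
    sentence = LINK.sub('', sentence)
    sentence = MENTION.sub('', sentence)
    return ' '.join(sentence.split())

# Hypothetical tweet, not taken from the dataset
sample = "@user123 Flooding on Main St right now http://t.co/abc123 stay safe"
cleaned = strip_links_and_mentions(sample)
print(cleaned)  # Flooding on Main St right now stay safe
```

Note that `@\S+` matches from any `@` character, so it would also trim the domain part of an e-mail address if one appeared in a tweet.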

Next, I will check for duplicates and attempt to remove them as well.

In [14]:
train_cleaned = train_cleaned.drop_duplicates(subset='text', keep="first")

In [15]:
print(train_cleaned.shape)
print(test_cleaned.shape)

Ok, so we have removed duplicate entries and now we can check the class balances again.  If they are imbalanced, I will undersample the majority class.

In [16]:
train_cleaned['target'].value_counts()

In [17]:
# Undersample the majority class
class0 = train_cleaned[train_cleaned['target']==0]
class1 = train_cleaned[train_cleaned['target']==1]

# Fixed seed so the sampled subset is reproducible
class0_sample = class0.sample(n=class1.shape[0], random_state=12345)


In [18]:
train_cleaned_balanced = pd.concat([class0_sample,class1]).sample(frac=1, random_state=12345).reset_index(drop=True)
train_cleaned_balanced.head()

In [19]:
train_cleaned_balanced['target'].value_counts()

Next I will remove "stop words".  These are words in the English language that typically provide little information (such as a, an, the, etc.).

In [20]:
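from nltk.corpus import stopwords

# Build the stop word set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words('english'))

def remove_stop_words(sentence):
    # Keep only the words that are not in the stop word set (comparison is case-sensitive)
    words = [word for word in sentence.split() if word not in STOP_WORDS]
    return ' '.join(words)

def clean_stop_words(data):
    data['text'] = data['text'].apply(remove_stop_words)
    return data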




In [21]:
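# Apply the stop word removal defined above (links and mentions were already removed earlier)
train_cleaned_balanced = clean_stop_words(train_cleaned_balanced)
test_cleaned = clean_stop_words(test_cleaned)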

In [22]:
train_cleaned_balanced.head()
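One detail worth checking when removing stop words: NLTK's English list is all lower case, so an exact-match comparison leaves capitalized words such as "The" in place.  Below is a minimal sketch of a case-insensitive variant.  The tiny stop list `DEMO_STOP_WORDS` and the helper name `remove_stop_words_ci` are assumptions made up for this illustration so that it runs without the NLTK corpus.

```python
# Tiny hypothetical stop list for illustration; NLTK's English list is much longer
DEMO_STOP_WORDS = {"a", "an", "the", "is", "in", "on"}

def remove_stop_words_ci(sentence, stop_words=DEMO_STOP_WORDS):
    """Drop stop words, comparing in lower case so 'The' is treated like 'the'."""
    return ' '.join(w for w in sentence.split() if w.lower() not in stop_words)

sample = "The fire is spreading in the valley"
result = remove_stop_words_ci(sample)
print(result)  # fire spreading valley
```

Whether this matters here depends on how the tokenizer is configured later, so treat it as something to verify rather than a required change.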

The final step in preprocessing the data is to convert the text into a format the model can understand.  This is called tokenization: each word is mapped to an integer index.  I fit the tokenizer on both the training and test sets so that every word in either set is in the vocabulary.  In a real-world setting, where the test set is not available in advance, only words seen during training could be mapped.  Padding appends zeros to the end of each tokenized tweet so that all observations have the same length.
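As a rough picture of what tokenization and post-padding do, here is a simplified pure-Python stand-in.  It is not the Keras implementation (which, for example, also strips punctuation and has its own truncation rules); the function names `build_word_index` and `texts_to_padded` and the example sentences are assumptions for illustration.

```python
def build_word_index(texts):
    """Give each word an integer id starting at 1, most frequent first; 0 is kept for padding."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Stable sort: ties keep the order in which words were first seen
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ordered)}

def texts_to_padded(texts, word_index, max_len):
    """Map words to ids (skipping unknown words) and pad with zeros at the end."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[:max_len]
        rows.append(ids + [0] * (max_len - len(ids)))
    return rows

corpus = ["storm warning issued", "storm hits coast"]
word_index = build_word_index(corpus)
padded = texts_to_padded(corpus + ["unknown storm"], word_index, max_len=4)
print(word_index)
print(padded)  # [[1, 2, 3, 0], [1, 4, 5, 0], [1, 0, 0, 0]]
```

The last row shows why the vocabulary matters: "unknown" was never seen when the index was built, so it is silently dropped and only "storm" survives.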

I used the built-in Keras tokenizer instead of other options.  There were pros and cons to this decision.  The primary pro is that I get to learn a bit more about TensorFlow and tensors, which I want to understand better as I study deep learning.  The major con, and something that is likely affecting the results, is the lack of n-grams in this tokenizer.  "n-grams" are combinations of adjacent words which together may be more informative than the same words found apart.  For instance, the bigram "Hurricane Warning" may immediately indicate a pending disaster, while "That restaurant's Hurricane cocktail should have come with a warning" does not, even though it uses both the words hurricane and warning.

In hindsight I probably should have used bigrams and trigrams, but for this assignment I really wanted to focus on understanding the model architecture rather than trying to perfect my score.
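For reference, a minimal sketch of what extracting n-grams looks like, using a hypothetical helper `ngrams` and made-up sentences based on the hurricane example.  This is only an illustration of the idea, not something the models in this notebook use.

```python
def ngrams(sentence, n):
    """Return all runs of n adjacent lower-cased words in the sentence."""
    words = sentence.lower().split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

alert = "Hurricane warning issued for the coast"
joke = "That Hurricane cocktail should have come with a warning"

print(ngrams(alert, 2)[:2])                      # ['hurricane warning', 'warning issued']
print('hurricane warning' in ngrams(alert, 2))   # True
print('hurricane warning' in ngrams(joke, 2))    # False
```

Both sentences contain the individual words, but only the first contains them as a bigram, which is the extra signal a unigram tokenizer cannot see.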

In [23]:
import tensorflow as tf
# Import from the public tensorflow.keras path rather than the private tensorflow.python one
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

text_corpus = pd.concat([train_cleaned_balanced['text'],test_cleaned['text']])
    
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(text_corpus)

In [24]:
max_len = max(len(x.split()) for x in text_corpus)
max_len

In [25]:
train_features = train_cleaned_balanced.iloc[:,0]
train_labels = train_cleaned_balanced.iloc[:,1]
test_features = test_cleaned.iloc[:,0]

In [26]:
train_token = tokenizer.texts_to_sequences(train_features)
test_token = tokenizer.texts_to_sequences(test_features)

train_pad = pad_sequences(train_token, maxlen=max_len, padding='post')
test_pad = pad_sequences(test_token, maxlen=max_len, padding='post')

In [27]:
train_labels = np.array(train_labels)

# Model Architecture

I will be using an LSTM, one of the more advanced architectures in the RNN family.  LSTMs let the model remember inputs over longer spans of a sequence.  I felt this was useful since most people on Twitter do not write in full sentences but rather in snippets.  By remembering how to treat these smaller snippets and improper grammar, the model may be able to perform a bit better.  I will follow the LSTM with some dense (fully connected) layers, as I find those generally aid with classification.
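To make the "remembering" idea a little more concrete, here is a sketch of a single LSTM time step written in NumPy.  It follows the standard textbook gate equations with randomly initialized, hypothetical weights (`W`, `U`, `b`); it is an illustration only and not how Keras implements or trains its LSTM layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x: (input_dim,), h_prev and c_prev: (hidden,)
    W: (4*hidden, input_dim), U: (4*hidden, hidden), b: (4*hidden,)
    """
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * hidden:1 * hidden])   # input gate: how much new information to write
    f = sigmoid(z[1 * hidden:2 * hidden])   # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate values for the cell state
    o = sigmoid(z[3 * hidden:4 * hidden])   # output gate: how much of the cell to expose
    c = f * c_prev + i * g                  # the cell state carries the long-term memory
    h = o * np.tanh(c)                      # the hidden state is passed to the next step
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden = 3, 2
W = rng.normal(size=(4 * hidden, input_dim))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

# Run a hypothetical sequence of 5 input vectors through the cell
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x, h, c, W, U, b)

print(h.shape, c.shape)  # (2,) (2,)
```

The forget gate `f` is what allows information in `c` to persist across many steps, which is the property that motivated choosing an LSTM here.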

In [28]:
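# Use the Keras that ships with TensorFlow so layers, losses, and optimizers come from one package
from tensorflow import keras
from tensorflow.keras.layers import LSTM, Dense, Embedding, Bidirectional, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras import optimizers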

In [29]:
all_words = len(tokenizer.word_index)+1
embedding_units = 100
hidden_units = 64

model0 = Sequential()
model0.add(Embedding(all_words, embedding_units, input_length = max_len))
model0.add(Bidirectional(LSTM(hidden_units)))
model0.add(Dense(128, activation='tanh'))
model0.add(Dense(64, activation='tanh'))
model0.add(Dense(1, activation='sigmoid'))

model0.summary()

In [30]:
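# Name the metrics explicitly so the history keys are 'precision' and 'recall'
model0.compile(loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.00001),
    metrics=['accuracy',
             tf.keras.metrics.Precision(name='precision'),
             tf.keras.metrics.Recall(name='recall')])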

In [31]:
# Stop training early if val_loss has not improved for 5 epochs (a sign of overfitting)
call_backs = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, verbose=1)]
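The patience rule used by the callback can be sketched in plain Python.  This is a simplified approximation of the idea (stop once the monitored loss has failed to improve for `patience` consecutive epochs), not the Keras source; the function name `early_stop_epoch` and the loss values are made up for illustration.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it runs to the end."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement: count the epoch and stop once patience runs out
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: best value at epoch 3, then five epochs without improvement
losses = [0.70, 0.65, 0.60, 0.61, 0.62, 0.63, 0.64, 0.66, 0.59]
stop_at = early_stop_epoch(losses, patience=5)
print(stop_at)  # 8
```

In this made-up run the loss would have improved again at epoch 9, which shows the trade-off in choosing `patience`: a small value saves time but can stop before a late improvement.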

In [32]:
history0 = model0.fit(train_pad, train_labels,epochs=50,validation_split=0.2, callbacks = call_backs)

In [33]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 4, figsize=(20, 5))

axs[0].set_title('Loss')
axs[0].plot(history0.history['loss'], label='train')
axs[0].plot(history0.history['val_loss'], label='val')
axs[0].legend()

axs[1].set_title('Accuracy')
axs[1].plot(history0.history['accuracy'], label='train')
axs[1].plot(history0.history['val_accuracy'], label='val')
axs[1].legend()

axs[2].set_title('Precision')
axs[2].plot(history0.history['precision'], label='train')
axs[2].plot(history0.history['val_precision'], label='val')
axs[2].legend()

axs[3].set_title('Recall')
axs[3].plot(history0.history['recall'], label='train')
axs[3].plot(history0.history['val_recall'], label='val')
axs[3].legend()

This model did not use any dropout for regularization, and there appears to be some overfitting around the 15th epoch.  The early stopping callback ended training before 25 epochs were even completed.  Next, I will add some dropout to help the model generalize.
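As a reminder of what dropout does during training, here is a small NumPy sketch of the common "inverted dropout" formulation: a random fraction `rate` of the activations is zeroed, and the survivors are scaled up so the expected value is unchanged.  The function name `dropout` and the test values are assumptions for illustration; this is not the Keras layer itself.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Zero a random fraction `rate` of x and rescale the rest; do nothing at inference time."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10000)
dropped = dropout(activations, rate=0.2, rng=rng)

print((dropped == 0).mean())  # roughly 0.2 of the units are zeroed
print(dropped.mean())         # the mean stays roughly 1.0 because of the rescaling
```

Because a different random subset of units is removed on every batch, the network cannot rely on any single unit, which tends to reduce overfitting.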

In [34]:
model1 = Sequential()
model1.add(Embedding(all_words, embedding_units, input_length = max_len))
model1.add(Bidirectional(LSTM(hidden_units)))
model1.add(Dropout(0.2))
model1.add(Dense(128, activation='tanh'))
model1.add(Dropout(0.2))
model1.add(Dense(64, activation='tanh'))
model1.add(Dropout(0.2))
model1.add(Dense(1, activation='sigmoid'))

model1.summary()

In [35]:
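# Name the metrics explicitly; otherwise Keras may auto-name them 'precision_1'/'recall_1' for a second model
model1.compile(loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.00001),
    metrics=['accuracy',
             tf.keras.metrics.Precision(name='precision'),
             tf.keras.metrics.Recall(name='recall')])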

In [36]:
history1 = model1.fit(train_pad, train_labels,epochs=50,validation_split=0.2, callbacks = call_backs)

In [37]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 4, figsize=(20, 5))

axs[0].set_title('Loss')
axs[0].plot(history1.history['loss'], label='train')
axs[0].plot(history1.history['val_loss'], label='val')
axs[0].legend()

axs[1].set_title('Accuracy')
axs[1].plot(history1.history['accuracy'], label='train')
axs[1].plot(history1.history['val_accuracy'], label='val')
axs[1].legend()

axs[2].set_title('Precision')
axs[2].plot(history1.history['precision'], label='train')
axs[2].plot(history1.history['val_precision'], label='val')
axs[2].legend()

axs[3].set_title('Recall')
axs[3].plot(history1.history['recall'], label='train')
axs[3].plot(history1.history['val_recall'], label='val')
axs[3].legend()

We see that adding dropout helped the model learn a bit better: the loss was not so clearly overfitting to the training set, and the accuracy on the validation set improved somewhat as well.  In fact, all of the metrics graphed above are better than for the first model.  Next, I will change the activation from 'tanh' to 'relu' and see if adjusting that hyperparameter further improves the model.
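Since the comparison between models rests on accuracy, precision, and recall, here is a short sketch of how those three numbers are computed from binary labels and predictions.  The helper name `binary_metrics` and the example labels are made up for illustration and are unrelated to the actual model outputs.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # disasters correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed disasters
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly ignored
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Precision answers "of the tweets flagged as disasters, how many really were?", while recall answers "of the real disaster tweets, how many were flagged?".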

In [38]:
model2 = Sequential()
model2.add(Embedding(all_words, embedding_units, input_length = max_len))
model2.add(Bidirectional(LSTM(hidden_units)))
model2.add(Dropout(0.2))
model2.add(Dense(128, activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(64, activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(1, activation='sigmoid'))

model2.summary()

In [39]:
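# Name the metrics explicitly; otherwise Keras may auto-name them 'precision_2'/'recall_2' for a later model
model2.compile(loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.00001),
    metrics=['accuracy',
             tf.keras.metrics.Precision(name='precision'),
             tf.keras.metrics.Recall(name='recall')])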

In [40]:
history2 = model2.fit(train_pad, train_labels,epochs=50,validation_split=0.2, callbacks = call_backs)

In [41]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 4, figsize=(20, 5))

axs[0].set_title('Loss')
axs[0].plot(history2.history['loss'], label='train')
axs[0].plot(history2.history['val_loss'], label='val')
axs[0].legend()

axs[1].set_title('Accuracy')
axs[1].plot(history2.history['accuracy'], label='train')
axs[1].plot(history2.history['val_accuracy'], label='val')
axs[1].legend()

axs[2].set_title('Precision')
axs[2].plot(history2.history['precision'], label='train')
axs[2].plot(history2.history['val_precision'], label='val')
axs[2].legend()

axs[3].set_title('Recall')
axs[3].plot(history2.history['recall'], label='train')
axs[3].plot(history2.history['val_recall'], label='val')
axs[3].legend()

So, we see the ReLU activations didn't really change the performance at all.  Finally, to see how performance changes with more LSTM layers, I will add a couple more and compare the results.

In [50]:
model3 = Sequential()
model3.add(Embedding(all_words, embedding_units, input_length = max_len))
model3.add(Bidirectional(LSTM(hidden_units,return_sequences=True)))
model3.add(Bidirectional(LSTM(hidden_units,return_sequences=True)))
model3.add(Bidirectional(LSTM(hidden_units)))
model3.add(Dropout(0.2))
model3.add(Dense(128, activation='relu'))
model3.add(Dropout(0.2))
model3.add(Dense(64, activation='relu'))
model3.add(Dropout(0.2))
model3.add(Dense(1, activation='sigmoid'))

model3.summary()

In [51]:
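# Name the metrics explicitly; otherwise Keras may auto-name them 'precision_3'/'recall_3' for a later model
model3.compile(loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.00001),
    metrics=['accuracy',
             tf.keras.metrics.Precision(name='precision'),
             tf.keras.metrics.Recall(name='recall')])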

In [52]:
history3 = model3.fit(train_pad, train_labels,epochs=50,validation_split=0.2, callbacks = call_backs)

In [49]:
fig, axs = plt.subplots(1, 4, figsize=(20, 5))

axs[0].set_title('Loss')
axs[0].plot(history3.history['loss'], label='train')
axs[0].plot(history3.history['val_loss'], label='val')
axs[0].legend()

axs[1].set_title('Accuracy')
axs[1].plot(history3.history['accuracy'], label='train')
axs[1].plot(history3.history['val_accuracy'], label='val')
axs[1].legend()

axs[2].set_title('Precision')
axs[2].plot(history3.history['precision'], label='train')
axs[2].plot(history3.history['val_precision'], label='val')
axs[2].legend()

axs[3].set_title('Recall')
axs[3].plot(history3.history['recall'], label='train')
axs[3].plot(history3.history['val_recall'], label='val')
axs[3].legend()

We see adding extra layers of LSTMs does not actually improve the results and appears to make them less consistent on the validation set.

# Results and Analysis

We observed similar results across different architectures and hyperparameters.  I will use model2 above as the final model for submission as it was arguably the best of the models I compared.

In [53]:
preds = model2.predict(test_pad)

In [54]:
len(preds)

In [55]:
# Threshold the sigmoid outputs at 0.5 to get 0/1 class labels
predictions = (preds >= 0.5).astype(int).ravel()

predictions[:10]

In [56]:
submission = pd.read_csv("../input/nlp-getting-started/sample_submission.csv")
submission

In [57]:
submission['target']=predictions
submission

In [58]:
submission.to_csv("submission.csv", index=False)

# Conclusion

So, across all of the different architectures and parameters, my accuracy hovered around 75%.  I think different methods for cleaning and pre-processing the data may help improve that, as would further study and a better understanding of how best to improve the architecture of the neural network.

#### References

Along with the documentation for TensorFlow, Keras, and NumPy, I also used these resources found on Kaggle:

https://www.kaggle.com/code/msondkar/disaster-tweets-classification-with-an-rnn/notebook

https://www.kaggle.com/code/mattbast/rnn-and-nlp-detect-a-disaster-in-tweets/notebook#Encode-sentences