## LSTM with GloVe Embeddings

##### Load the data

In [1]:
import pandas as pd
import os

# Set the working directory for the project
os.chdir('C://Users/dane.arnesen/Documents/Projects/kaggle/toxic_comments_challenge/')

# Development sample
dev = pd.read_csv('data/raw/train.csv')

# Validation sample
val = pd.read_csv('data/raw/test.csv')

print(dev.shape)
print(val.shape)
print(dev.columns)

(95851, 8)
(226998, 2)
Index(['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate'],
      dtype='object')


##### Isolate the target attribute

In [2]:
y_cols = [c for c in dev.columns if c not in ['id','comment_text']]
y_vals = dev[y_cols].values
print(y_vals.shape)
print()
print(dev[y_cols].sum())

(95851, 6)

toxic            9237
severe_toxic      965
obscene          5109
threat            305
insult           4765
identity_hate     814
dtype: int64
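
The counts above show how rare each label is. As a rough check, the sketch below recomputes the share of positive rows per label from the printed counts, along with the element-wise accuracy that an "all zeros" prediction would reach on the full development sample. The numbers are copied from the output above; the variable names are only illustrative.

```python
# Row count and label counts copied from the output above
n_rows = 95851
label_counts = {
    'toxic': 9237,
    'severe_toxic': 965,
    'obscene': 5109,
    'threat': 305,
    'insult': 4765,
    'identity_hate': 814,
}

# Share of rows that carry each label
prevalence = {label: count / n_rows for label, count in label_counts.items()}
for label, rate in prevalence.items():
    print('%-14s %.2f%%' % (label, 100 * rate))

# Element-wise accuracy of predicting 0 for every label of every row
n_cells = n_rows * len(label_counts)
all_zero_accuracy = 1 - sum(label_counts.values()) / n_cells
print('All-zero accuracy: %.4f' % all_zero_accuracy)
```

Because positives are this rare, an accuracy in the high 90s is not by itself evidence of a good model.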


##### Check the distribution of the comment text length

In [3]:
dev['char_length'] = dev['comment_text'].str.len()
print(dev['char_length'].describe())

count    95851.000000
mean       395.341864
std        595.102072
min          6.000000
25%         96.000000
50%        206.000000
75%        435.000000
max       5000.000000
Name: char_length, dtype: float64


In [4]:
val['char_length'] = val['comment_text'].str.len()
print(val['char_length'].describe())

count    2.269970e+05
mean     4.737824e+02
std      4.445610e+03
min      1.000000e+00
25%      6.800000e+01
50%      2.180000e+02
75%      5.290000e+02
max      2.003165e+06
Name: char_length, dtype: float64


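Note that the validation sample contains some texts that are far longer than 5,000 characters (the maximum is about 2 million), whereas the development sample tops out at 5,000. One comment is also missing (the count is 226,997 for 226,998 rows). So let's fill the missing value with a placeholder, trim every comment to 5,000 characters, and re-check the distribution.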

In [5]:
val = val.fillna('unknown')
val['comment_text'] = val['comment_text'].apply(lambda x: x[:5000])
val['char_length'] = val['comment_text'].str.len()
print(val['char_length'].describe())

count    226998.000000
mean        438.502930
std         643.798386
min           1.000000
25%          68.000000
50%         218.000000
75%         529.000000
max        5000.000000
Name: char_length, dtype: float64


Now the distributions look a lot more similar between the development and validation samples.

##### Text preprocessing

We want to combine (stack) the comments from the dev and val samples before we start any NLP preprocessing, so that the vocabulary and the tokenizer cover both. But when we actually train our model, we can only train it on the labeled portion of the data. Therefore, we record the number of dev rows so that we can separate the dev data from the val data later.

In [6]:
# Number of rows in the dev sample
nrows = dev.shape[0]

# IDs in the val sample
vids = val['id'].values

# Combine the text from both the dev and val samples
df_txt = pd.concat([dev['comment_text'], val['comment_text']], axis=0)
print(df_txt.shape)

(322849,)


Here we will define a function to do some text cleansing. It will perform the following steps:
- Split each comment into individual tokens (words)
- Remove all punctuation
- Set all tokens to lowercase
- Remove tokens that are not entirely alphabetic (for example, tokens that contain digits)
- Remove stop words
- Remove tokens that aren't at least 2 characters in length
- Remove morphological affixes from words (stemming)

In [7]:
import string

# Function that turns a doc into clean tokens
def clean_doc(doc, stemmer, stop_words):
    # Split into individual tokens by white space
    tokens = doc.split()
    # Remove punctuation and set to lowercase
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table).lower() for w in tokens]
    # Remove words that are not entirely alphabetical
    tokens = [w for w in tokens if w.isalpha()]
    # Remove all known stop words
    tokens = [w for w in tokens if w not in stop_words]
    # Remove tokens that aren't at least two characters in length
    tokens = [w for w in tokens if len(w) > 1]
    # Stem the remaining tokens
    tokens = [stemmer.stem(w) for w in tokens]
    return tokens
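
To see what the first cleaning steps do, here is a small sketch that applies the split, punctuation, lowercase, alphabetic and length filters to a made-up comment. It leaves out stop-word removal and stemming so that it runs without NLTK; the sample text is invented for illustration.

```python
import string

# A made-up comment, used only for illustration
sample = "You're WRONG!!! See rule 3b, e.g. http://example.com a"

# Split on white space, strip punctuation and lowercase
table = str.maketrans('', '', string.punctuation)
tokens = [w.translate(table).lower() for w in sample.split()]
print(tokens)

# Keep purely alphabetic tokens that are at least two characters long
alpha_tokens = [w for w in tokens if w.isalpha()]
kept = [w for w in alpha_tokens if len(w) > 1]
print(kept)
```

Note how stripping punctuation glues the URL into one long token and turns "You're" into "youre"; tokens like these may not match stop-word lists or pretrained embeddings.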

We apply our function to the entire collection of comments in order to define our working vocabulary.

In [9]:
import string
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from collections import Counter

# Get a distinct list of stop words
stop_words = set(stopwords.words('english'))

# Initialize a stemmer
stemmer = SnowballStemmer('english')

# Define vocab
vocab = Counter()

# Iterate over each of the comments in the combined dev and val samples
for text in df_txt:
    # Create a list of tokens
    tokens = clean_doc(text, stemmer, stop_words)
    # Add tokens to vocab
    vocab.update(tokens)

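Now that we've defined our vocabulary, we go back and actually cleanse the comment text, keeping only the words in our defined vocabulary. Because the vocabulary was built from these same texts with no minimum-count cutoff, the filter currently keeps every token; it only has an effect if rare tokens are pruned from vocab first.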

In [10]:
# A container object that will hold the words of each individual document
lines = list()

# Iterate over each of the comments in the combined dev and val samples
for text in df_txt:
    # Create a list of tokens
    tokens = clean_doc(text, stemmer, stop_words)
    # Filter the words in the document by our defined vocabulary
    tokens = [w for w in tokens if w in vocab]
    # Concatenate the words in the document with a single space and append to our lines container
    lines.append(' '.join(tokens))

Leverage Keras to fit a tokenizer, then use the tokenizer to turn our comment text into sequences.

In [12]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_SEQUENCE_LENGTH = 100
MAX_NB_WORDS = 100000

# Fit a tokenizer. Keep only the most frequent words (those with an index below MAX_NB_WORDS)
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(lines)

# Turn the tokens into sequences
sequences = tokenizer.texts_to_sequences(lines)

# Ensure all of the sequences are the same length. The pad function will truncate sequences longer than 100 tokens
# and it will pad sequences that are shorter than 100 tokens (by default, both happen at the front)
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
print(data.shape)

(322849, 100)
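
By default, pad_sequences pads and truncates at the front of each sequence, so a long comment keeps its last 100 tokens. The following NumPy sketch imitates that default behavior on toy input; pad_pre is a hypothetical helper written for illustration, not part of Keras.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Imitate the default 'pre' padding and 'pre' truncating with zeros."""
    out = np.zeros((len(seqs), maxlen), dtype='int32')
    for row, seq in enumerate(seqs):
        # Truncate at the front: keep only the last maxlen tokens
        kept = seq[-maxlen:]
        # Pad at the front: write the kept tokens into the right-hand side
        if kept:
            out[row, -len(kept):] = kept
    return out

toy_sequences = [[1, 2], [1, 2, 3, 4, 5, 6], []]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```

If the start of a comment matters more than its end, pad_sequences also accepts padding='post' and truncating='post'.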


We need to further split the dev sample into train and test samples. Holding out a test sample lets us monitor performance on unseen data, which helps us detect overfitting when we train our model.

In [14]:
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
x_train, x_test, y_train, y_test = train_test_split(data[:nrows], y_vals, test_size=0.3, random_state=52)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(67095, 100)
(67095, 6)
(28756, 100)
(28756, 6)


Use GloVe embeddings to create a word embedding index.

In [18]:
import numpy as np

# Create an embedding index from the GloVe data.
# Each line holds a word followed by its vector components
embeddings_index = {}
with open('data/glove/glove.6B.200d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Total %s word vectors.' % len(embeddings_index))

Total 400000 word vectors.


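Prepare the embedding matrix. Row i holds the GloVe vector for the word that the tokenizer assigned index i; words without a GloVe vector keep an all-zero row.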

In [25]:
# The dimensions of our embedding vector. In this case the file is 200d
EMBEDDING_DIM = 200

# The word index created by our tokenizer
word_index = tokenizer.word_index

# The number of rows in the embedding matrix. Keras word indices start at 1
# (index 0 is reserved for padding), so a vocabulary smaller than
# MAX_NB_WORDS needs len(word_index) + 1 rows
nb_words = min(MAX_NB_WORDS, len(word_index) + 1)

# Init the embedding matrix which is nb_words x embedding dim
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    # Skip words that fall outside the matrix
    if i >= nb_words:
        continue
    embedding_vector = embeddings_index.get(word)
    # Words not found in the embedding index stay all-zeros
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

print('Total number of tokens in vocabulary: %d' % nb_words)
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

Total number of tokens in vocabulary: 100000
Null word embeddings: 54593
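
More than half of the rows are all-zero, so it can be worth checking which tokens miss a GloVe vector. The sketch below lists hits and misses for a toy word index; embedding_coverage and the toy dictionaries are hypothetical, and in the notebook you would pass tokenizer.word_index and embeddings_index instead. One plausible cause to look for is stemming, since stems such as "happi" (from "happy") are not ordinary words.

```python
def embedding_coverage(word_index, embeddings_index, max_words):
    """Return (found, missing) word lists for indices below max_words."""
    found, missing = [], []
    # Walk the words in index order, i.e. most frequent first
    for word, i in sorted(word_index.items(), key=lambda kv: kv[1]):
        if i >= max_words:
            continue
        if word in embeddings_index:
            found.append(word)
        else:
            missing.append(word)
    return found, missing

# Toy data: stemmed tokens on the left, unstemmed embedding keys on the right
toy_word_index = {'articl': 1, 'page': 2, 'happi': 3, 'edit': 4}
toy_embeddings = {'page': [0.1, 0.2], 'edit': [0.3, 0.4], 'article': [0.5, 0.6]}

found, missing = embedding_coverage(toy_word_index, toy_embeddings, max_words=10)
print('Found: %s' % found)
print('Missing: %s' % missing)
```

Also keep in mind that row 0 of the matrix is reserved for padding, so it is always counted among the null embeddings.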


##### Train the LSTM model

In [27]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Dropout, BatchNormalization
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Initialize the model
model = Sequential()
model.add(Embedding(nb_words, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False))
model.add(LSTM(100, dropout=0.25, recurrent_dropout=0.25))
model.add(Dropout(0.25))
model.add(BatchNormalization())
model.add(Dense(50, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(6, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])

# Early stopping
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=2, verbose=True)

# Checkpoint
checkpoint = ModelCheckpoint(filepath='models/best_weights.h5', monitor='val_loss', save_best_only=True)

# Fit the model
model.fit(x_train, 
          y_train, 
          validation_data=(x_test, y_test), 
          epochs=50, 
          batch_size=200, 
          callbacks=[early_stopping, checkpoint], 
          verbose=2,
          shuffle=True
          )

Train on 67095 samples, validate on 28756 samples
Epoch 1/50
 - 285s - loss: 0.2011 - acc: 0.9269 - val_loss: 0.0687 - val_acc: 0.9778
Epoch 2/50
 - 282s - loss: 0.0667 - acc: 0.9780 - val_loss: 0.0624 - val_acc: 0.9789
Epoch 3/50
 - 278s - loss: 0.0611 - acc: 0.9790 - val_loss: 0.0591 - val_acc: 0.9795
Epoch 4/50
 - 280s - loss: 0.0575 - acc: 0.9800 - val_loss: 0.0577 - val_acc: 0.9801
Epoch 5/50
 - 280s - loss: 0.0554 - acc: 0.9805 - val_loss: 0.0569 - val_acc: 0.9800
Epoch 6/50
 - 285s - loss: 0.0533 - acc: 0.9810 - val_loss: 0.0600 - val_acc: 0.9782
Epoch 00006: early stopping


<keras.callbacks.History at 0x65d15e80>
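
Accuracy is easy to inflate when most labels are 0, so it can help to look at the loss label by label on the held-out sample. The sketch below computes binary log loss per column with NumPy on toy arrays; columnwise_log_loss is a hypothetical helper, and in the notebook it would be called with y_test and model.predict(x_test).

```python
import numpy as np

def columnwise_log_loss(y_true, y_pred, eps=1e-7):
    """Binary log loss computed separately for each label column."""
    y_true = np.asarray(y_true, dtype='float64')
    # Clip the predictions so that log(0) never occurs
    y_pred = np.clip(np.asarray(y_pred, dtype='float64'), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # Average over rows, keeping one value per label
    return losses.mean(axis=0)

# Toy labels and predictions for two labels, the second one rare
toy_true = np.array([[1, 0], [0, 0], [1, 0], [0, 1]])
toy_pred = np.array([[0.9, 0.1], [0.2, 0.1], [0.6, 0.1], [0.1, 0.1]])

per_label = columnwise_log_loss(toy_true, toy_pred)
print(per_label)
print(per_label.mean())
```

In the toy data the second label is predicted as 0.1 everywhere. Its accuracy at a 0.5 threshold would be 75%, yet its log loss is clearly worse than the first label's because the single positive row is missed.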

##### Make final predictions

First we need to load the best model, which the checkpoint callback saved during training.

In [28]:
from keras.models import load_model

# Load the best model
model = load_model('models/best_weights.h5')

Make predictions on the validation sample. These predictions will be submitted to Kaggle for scoring.

In [39]:
preds = model.predict(data[nrows:])
print(preds.shape)

(226998, 6)


Format the predictions and output them to a csv file.

In [41]:
ids = pd.DataFrame({'id': vids})
sub1 = pd.concat([ids, pd.DataFrame(preds, columns=y_cols)], axis=1)
sub1.to_csv('data/submissions/lstm_glove.csv', index=False)
print(sub1.shape)

(226998, 7)


Now let's take a weighted average of these predictions (weight 0.1) and our previous best predictions (weight 0.9), which received a 0.47 on the leaderboard (LB).

In [47]:
sub2 = pd.read_csv('data/submissions/wtd_avg_4.csv')
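# Both files must list the ids in the same order. Drop the id column, then blend.
# (as_matrix() was removed in pandas 1.0, so we use .values instead)
sub3 = (sub1.iloc[:, 1:].values * 0.1) + (sub2.iloc[:, 1:].values * 0.9)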
sub3 = pd.concat([ids, pd.DataFrame(sub3, columns=y_cols)], axis=1)
sub3.to_csv('data/submissions/wtd_avg_5.csv', index=False)
print(sub3.shape)

(226998, 7)
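
The weighted average above relies on both submissions listing the ids in the same order. A slightly more defensive sketch aligns the two frames on id before blending; blend_submissions and the toy frames are hypothetical, and the 0.1 / 0.9 weights mirror the ones used above.

```python
import pandas as pd

def blend_submissions(sub_a, sub_b, weight_a, label_cols):
    """Blend two submissions after aligning them on the id column."""
    if len(sub_a) != len(sub_b):
        raise ValueError('The two submissions have different row counts')
    # An inner merge keeps the row order of sub_a; validate rejects duplicate ids
    merged = sub_a.merge(sub_b, on='id', suffixes=('_a', '_b'),
                         validate='one_to_one')
    if len(merged) != len(sub_a):
        raise ValueError('The two submissions do not contain the same ids')
    blended = pd.DataFrame({'id': merged['id']})
    for col in label_cols:
        blended[col] = (weight_a * merged[col + '_a']
                        + (1 - weight_a) * merged[col + '_b'])
    return blended

# Toy submissions whose rows are in a different order
toy_a = pd.DataFrame({'id': ['x', 'y'], 'toxic': [0.2, 0.8]})
toy_b = pd.DataFrame({'id': ['y', 'x'], 'toxic': [0.6, 0.4]})

blended = blend_submissions(toy_a, toy_b, weight_a=0.1, label_cols=['toxic'])
print(blended)
```

A purely positional blend of these toy frames would silently pair id x with the predictions for id y.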


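Finally, let's create predictions on our development sample. We could use these predictions in an ensemble method down the road. Keep in mind that 70% of these rows were used to train the model, so the predictions for those rows will be optimistic.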

In [45]:
preds_dev = model.predict(data[:nrows])
print(preds_dev.shape)

(95851, 6)


In [46]:
pd.DataFrame(preds_dev, columns=y_cols).to_csv('data/raw/lstm_glove_preds.csv')