# LSTM: GloVe + Dropout Code Along

---

This notebook is a codealong of the [Improved LSTM baseline: GloVe + dropout](https://www.kaggle.com/jhoward/improved-lstm-baseline-glove-dropout) Kaggle kernel by Jeremy Howard.

-- Wayne Nixalo - 30/3/2018

This kernel is a somewhat improved version of [Keras - Bidirectional LSTM baseline](https://www.kaggle.com/CVxTz/keras-bidirectional-lstm-baseline-lb-0-051), along with some additional documentation of the steps. (NB: this notebook has been re-run on the new test set.)

The GloVe embeddings used can be found [here](https://www.kaggle.com/yliu9999/glove6b50d).

In [3]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd
import pathlib

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

In [4]:
path = pathlib.Path('../../data/')
comp = pathlib.Path('competitions/jigsaw-toxic-comment-classification-challenge')
EMBEDDING_FILE  = pathlib.Path('glove/glove.6B.50d.txt')
TRAIN_DATA_FILE = pathlib.Path(path/comp/'train.csv')
TEST_DATA_FILE  = pathlib.Path(path/comp/'test.csv')

Some basic config parameters:

In [5]:
embed_size = 50 # how big is each word vector
max_features = 20000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 100 # max number of words in a comment to use

Read in our data and replace missing values:

In [6]:
train = pd.read_csv(TRAIN_DATA_FILE)
test  = pd.read_csv(TEST_DATA_FILE)

list_sentences_train = train["comment_text"].fillna("_na_").values
list_classes = [col for col in train.columns[2:]]
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("_na_").values

Standard Keras preprocessing, to turn each comment into a list of word indices of equal length (with truncation or padding as needed).

In [7]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test  = tokenizer.texts_to_sequences(list_sentences_test)
X_t  = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test,  maxlen=maxlen)
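
To make the padding and truncation step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`). The helper `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation:

```python
def pad_pre(seqs, maxlen, value=0):
    """Hypothetical stand-in for the default behaviour of Keras `pad_sequences`."""
    out = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]                        # keep only the last `maxlen` tokens
        out.append([value] * (maxlen - len(seq)) + seq)  # left-pad short sequences with `value`
    return out

toy = [[5, 8], [1, 2, 3, 4, 5, 6]]   # made-up word-index sequences
padded = pad_pre(toy, maxlen=4)
print(padded)
```

Short comments are padded with zeros on the left, and long comments lose their earliest words, so every row ends up with exactly `maxlen` entries.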

Read the GloVe word vectors (space delimited strings) into a dictionary from word->vector:

In [7]:
def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(path/EMBEDDING_FILE))
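
As a small illustration of the parsing idiom above, the sketch below runs the same `get_coefs(*line.strip().split())` pattern on an in-memory sample instead of the real GloVe file. It is a simplified variant: it stores plain lists of floats rather than NumPy arrays, uses 3 dimensions instead of 50, and the `cat` row is made up for the example.

```python
import io

def get_coefs(word, *arr):
    # `word` receives the first token; `*arr` collects all remaining tokens
    return word, [float(v) for v in arr]

# Two lines in GloVe's "word v1 v2 ..." text format (shortened, partly made up)
sample = io.StringIO("the 0.418 0.24968 -0.41242\ncat 0.1 -0.2 0.3\n")

# Each line is split on whitespace and unpacked into get_coefs' arguments
index = dict(get_coefs(*line.strip().split()) for line in sample)
print(index["the"])
```

Because `dict()` accepts an iterable of `(key, value)` pairs, the generator expression builds the whole word-to-vector mapping in one pass over the file.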

We'll use these vectors to create our embedding matrix, with random initialization for words that aren't in GloVe. We'll use the same mean and standard deviation as the GloVe embeddings when generating the random init.

In [8]:
# np.stack needs a sequence, so materialize the dict view as a list first
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
emb_mean, emb_std

(0.020940498, 0.6441043)

In [9]:
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
# emb matrix is initialized first
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # GloVe emb vector used if exists for word
        embedding_matrix[i] = embedding_vector
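
The same "random init, then overwrite known words" idea can be sketched on a toy vocabulary. This is a dependency-free illustration that uses the standard library in place of NumPy; the vectors, the words, and the sizes are all made up, and the matrix is sized by `max_features` here as a simplifying assumption. `statistics.pstdev` is used because NumPy's `.std()` computes the population standard deviation by default.

```python
import random
import statistics

random.seed(0)

# Toy pretrained vectors (made-up values, 2 dims) standing in for GloVe
pretrained = {"good": [0.5, -0.1], "bad": [-0.4, 0.3]}
# Toy tokenizer vocabulary; Keras word indices start at 1 (0 is reserved for padding)
word_index = {"good": 1, "bad": 2, "zzyzx": 3}
embed_size, max_features = 2, 4

# Mean and population std over every value in the pretrained vectors
flat = [v for vec in pretrained.values() for v in vec]
emb_mean, emb_std = statistics.fmean(flat), statistics.pstdev(flat)

# Start with every row drawn from N(emb_mean, emb_std) ...
matrix = [[random.gauss(emb_mean, emb_std) for _ in range(embed_size)]
          for _ in range(max_features)]
# ... then overwrite the rows of words that have a pretrained vector
for word, i in word_index.items():
    if i < max_features and word in pretrained:
        matrix[i] = list(pretrained[word])

print(matrix[1], matrix[2])
```

Rows 1 and 2 end up holding the pretrained vectors, while row 3 (`zzyzx`, unknown to the toy "GloVe") keeps its random draw.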

Simple bidirectional LSTM with two fully-connected layers. We add some dropout to the LSTM since even 2 epochs is enough to overfit.

In [13]:
inp = Input(shape=(maxlen,)) # the comma sets it as a tuple
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(LSTM(embed_size, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dense(embed_size, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
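
As a rough sanity check on the size of this architecture, the trainable parameters can be counted by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias) and the layer sizes used above; the variable names are introduced here for illustration only.

```python
embed_size, max_features, n_classes = 50, 20000, 6

# Embedding: one embed_size-dimensional row per word index
embedding = max_features * embed_size
# One LSTM direction: 4 gates x (input weights + recurrent weights + bias)
lstm_one_direction = 4 * (embed_size * (embed_size + embed_size) + embed_size)
bi_lstm = 2 * lstm_one_direction
# GlobalMaxPool1D has no weights; it outputs 2 * embed_size features (both directions)
dense_hidden = (2 * embed_size) * embed_size + embed_size
dense_out = embed_size * n_classes + n_classes

total = embedding + bi_lstm + dense_hidden + dense_out
print(embedding, bi_lstm, dense_hidden, dense_out, total)
```

Almost all of the weights sit in the embedding layer, which is why starting it from GloVe rather than from scratch matters. The per-layer numbers can be compared against `model.summary()`.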

Now we're ready to fit our model! Use `validation_split` when not submitting.

In [11]:
model.fit(X_t, y, batch_size=32, epochs=2, validation_split=0.1)

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f92045035f8>
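
The sample counts in the log above follow from `validation_split=0.1`. Keras takes the validation data from the last samples of the arrays, before shuffling. The arithmetic can be reproduced with the sketch below, which assumes the split point is obtained by truncating `n * (1 - validation_split)` to an integer:

```python
n_samples = 143613 + 15958          # train + validation counts reported in the log
validation_split = 0.1

# Assumed split rule: everything before `split_at` trains, the rest validates
split_at = int(n_samples * (1.0 - validation_split))
n_train, n_val = split_at, n_samples - split_at
print(n_train, n_val)
```

Since the held-out rows are simply the tail of `X_t` and `y`, the validation set is only representative if the training file is not ordered in some meaningful way.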

And finally, get predictions for the test set and prepare a submission CSV:

In [12]:
y_test = model.predict([X_te], batch_size=1024, verbose=1)
submission = pd.read_csv(path/comp/'sample_submission.csv')
submission[list_classes] = y_test
submission.to_csv(path/comp/'submission_LSTM_glove_00.csv')



## Ranking

---

LSTM trained on 90% of the training set (10% held out via `validation_split=0.1`):

`0.9754`/**`0.9764`** -- 2969/4551: top 65.2%

LSTM trained with entire training set:

`0.9775`/**`0.9783`** -- 2746/4551: top 60.3%

*(1st place is 0.9885 private)*

## Misc

---

Submissions were accidentally saved with their pandas index, which caused errors during submission. Both submission files are reloaded here, stripped of the index column, and resaved.

[Removing index column in pandas](https://stackoverflow.com/questions/20107570/removing-index-column-in-pandas) | [Delete column from pandas DataFrame](https://stackoverflow.com/a/18145399)

In [51]:
for i in range(2):
    sub_name = f'submission_LSTM_glove_0{i}.csv'
    sub = pd.read_csv(path/comp/sub_name)
    sub = sub.drop('Unnamed: 0', axis=1)
    sub.to_csv(path/comp/sub_name, index = False)
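
To see where the stray column comes from: by default `DataFrame.to_csv` writes the index as a first column with a blank header, and `pd.read_csv` labels a blank header `Unnamed: 0` when the file is read back. The sketch below imitates that round trip with the standard `csv` module only (no pandas), on two made-up rows in the submission layout:

```python
import csv
import io

# Simulate a CSV saved with an index column: the index header is blank
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["", "id", "toxic"])
writer.writerow([0, "00001cee341fdb12", 0.5])
writer.writerow([1, "0000247867823ef7", 0.5])

# Read it back and drop the first (index) column from every row
buf.seek(0)
rows = [row[1:] for row in csv.reader(buf)]
print(rows[0])
```

Passing `index=False` to `to_csv` when the file is first written avoids the extra column altogether.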

Testing:

In [33]:
sub = pd.read_csv(path/comp/'sample_submission.csv')

In [34]:
sub.head()

| | id | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| 1 | 0000247867823ef7 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| 2 | 00013b17ad220c46 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| 3 | 00017563c3f7919a | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| 4 | 00017695ad8997eb | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |


In [46]:
sub = pd.read_csv(path/comp/'submission_LSTM_glove_00.csv')

In [47]:
sub.columns

Index(['Unnamed: 0', 'id', 'toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate'],
      dtype='object')

In [48]:
sub["Unnamed: 0"].head()

0    0
1    1
2    2
3    3
4    4
Name: Unnamed: 0, dtype: int64

In [45]:
sub = sub.drop('Unnamed: 0', axis=1)
sub.head()

| | id | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | 0.987479 | 0.3560071 | 0.934285 | 0.05253358 | 0.815466 | 0.362029 |
| 1 | 0000247867823ef7 | 0.000545 | 9.122054e-08 | 0.00023 | 7.357243e-08 | 7e-05 | 7e-06 |
| 2 | 00013b17ad220c46 | 0.004593 | 3.531618e-06 | 0.0012 | 4.098626e-06 | 0.000518 | 5.7e-05 |
| 3 | 00017563c3f7919a | 0.002925 | 8.145207e-07 | 0.000561 | 9.42739e-07 | 0.000283 | 1.4e-05 |
| 4 | 00017695ad8997eb | 0.006123 | 4.912997e-06 | 0.001186 | 7.123228e-06 | 0.000647 | 5.6e-05 |


---

The model was rebuilt and recompiled (see the model cell above), then refit on the full training set, this time for submission:

In [14]:
model.fit(X_t, y, batch_size=32, epochs=2)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f9204089128>

In [15]:
y_test = model.predict([X_te], batch_size=1024, verbose=1)
submission = pd.read_csv(path/comp/'sample_submission.csv')
submission[list_classes] = y_test
submission.to_csv(path/comp/'submission_LSTM_glove_01.csv')



---

Checking the label columns (classes):

In [8]:
[col for col in train.columns[2:]]

['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

---

Looking at GloVe embeddings

In [9]:
for o in open(path/EMBEDDING_FILE):
    print(o)
    break

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581



In [10]:
for o in open(path/EMBEDDING_FILE):
    temp = o
    break
type(temp)

str

In [11]:
temp.strip() # strips leading & trailing whitespace
temp.strip().split() # split by whitespace & remove empty strings

['the',
 '0.418',
 '0.24968',
 '-0.41242',
 '0.1217',
 '0.34527',
 '-0.044457',
 '-0.49688',
 '-0.17862',
 '-0.00066023',
 '-0.6566',
 '0.27843',
 '-0.14767',
 '-0.55677',
 '0.14658',
 '-0.0095095',
 '0.011658',
 '0.10204',
 '-0.12792',
 '-0.8443',
 '-0.12181',
 '-0.016801',
 '-0.33279',
 '-0.1552',
 '-0.23131',
 '-0.19181',
 '-1.8823',
 '-0.76746',
 '0.099051',
 '-0.42125',
 '-0.19526',
 '4.0071',
 '-0.18594',
 '-0.52287',
 '-0.31681',
 '0.00059213',
 '0.0074449',
 '0.17778',
 '-0.15897',
 '0.012041',
 '-0.054223',
 '-0.29871',
 '-0.15749',
 '-0.34758',
 '-0.045637',
 '-0.44251',
 '0.18785',
 '0.0027849',
 '-0.18411',
 '-0.11514',
 '-0.78581']

In [12]:
# https://stackoverflow.com/a/2921893
# * unpacks sequence into positional arguments
# ** does the same with a dict, unpacking it into keyword (named) arguments