# Improved LSTM baseline

This kernel is a somewhat improved version of [Keras - Bidirectional LSTM baseline](https://www.kaggle.com/CVxTz/keras-bidirectional-lstm-baseline-lb-0-051) along with some additional documentation of the steps. (NB: this notebook has been re-run on the new test set.)

In [5]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

We include the GloVe word vectors in our input files. To include these in your kernel, simply click 'input files' at the top of the notebook, and search 'glove' in the 'datasets' section.

In [13]:
path = '../input/'
EMBEDDING_FILE=f'{path}glove.6B.50d.txt'
TRAIN_DATA_FILE=f'{path}train.csv'
TEST_DATA_FILE=f'{path}test.csv'

Set some basic config parameters:

In [14]:
embed_size = 50 # how big is each word vector
max_features = 20000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 100 # max number of words in a comment to use

Read in our data and replace missing values:

In [11]:
train = pd.read_csv(TRAIN_DATA_FILE)
test = pd.read_csv(TEST_DATA_FILE)
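train.head  # without parentheses this shows the bound method (and the whole frame); use train.head() for the first five rows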

<bound method NDFrame.head of                       id                                       comment_text  \
0       0000997932d777bf  Explanation\nWhy the edits made under my usern...   
1       000103f0d9cfb60f  D'aww! He matches this background colour I'm s...   
2       000113f07ec002fd  Hey man, I'm really not trying to edit war. It...   
3       0001b41b1c6bb37e  "\nMore\nI can't make any real suggestions on ...   
4       0001d958c54c6e35  You, sir, are my hero. Any chance you remember...   
5       00025465d4725e87  "\n\nCongratulations from me as well, use the ...   
6       0002bcb3da6cb337       COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK   
7       00031b1e95af7921  Your vandalism to the Matt Shirvington article...   
8       00037261f536c51d  Sorry if the word 'nonsense' was offensive to ...   
9       00040093b2687caa  alignment on this subject and which are contra...   
10      0005300084f90edc  "\nFair use rationale for Image:Wonju.jpg\n\nT...   
11      00054a5e18b50d

In [155]:

# A comment counts as "not clean" if any of these labels is set (severe_toxic is not part of this OR).
train["notClean"] = 1 * ((train["toxic"] == 1) | (train["obscene"] == 1) | (train["threat"] == 1) | (train["insult"] == 1) | (train["identity_hate"] == 1))
just_toxic = train[train["notClean"] == 1]
list_sentences_train = train["comment_text"].fillna("_na_").values

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train["notClean"].values
# The "test" sentences here are the not-clean training comments, not test.csv.
list_sentences_test = just_toxic["comment_text"].fillna("_na_").values
print(y)



[0 0 0 ..., 0 0 0]
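As a rough illustration of collapsing several label columns into one binary target, here is a minimal sketch on a made-up label matrix. The array `labels` and its values are hypothetical, and unlike the cell above it ORs all six columns, including `severe_toxic`.

```python
import numpy as np

# Hypothetical toy label matrix: one row per comment, one column per class
# (toxic, severe_toxic, obscene, threat, insult, identity_hate).
labels = np.array([
    [0, 0, 0, 0, 0, 0],   # clean comment
    [1, 0, 1, 0, 0, 0],   # toxic and obscene
    [0, 0, 0, 0, 1, 0],   # insult only
])

# A comment is "not clean" if any of its class flags is set.
not_clean = labels.any(axis=1).astype(int)
print(not_clean)  # [0 1 1]
```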



Standard Keras preprocessing turns each comment into a list of word indexes of equal length (with truncation or padding as needed).

In [126]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
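To make the padding and truncation step concrete, the sketch below mimics in plain NumPy what `pad_sequences` does with its default settings, as far as I understand them: zeros are added on the left, and over-long sequences lose tokens from the front. The helper `pad_pre` and the toy sequences are hypothetical and only stand in for the Keras call.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` tokens of each sequence."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for row, seq in enumerate(seqs):
        if not seq:
            continue                      # empty comment stays all zeros
        trunc = seq[-maxlen:]             # drop tokens from the front if too long
        out[row, -len(trunc):] = trunc    # right-align, zeros on the left
    return out

toy_sequences = [[5, 2], [7, 1, 9, 4, 3]]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```

The short sequence becomes `[0, 0, 5, 2]` and the long one keeps its last four tokens, `[1, 9, 4, 3]`.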

Read the GloVe word vectors (space-delimited strings) into a dictionary from word->vector.

In [137]:
def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE, encoding='utf-8'))


[-0.95897001  0.86149001 -0.53064001 -0.19908001  0.42945001  0.93177003
  0.067319   -0.21413     0.39488    -0.53561002  0.42881    -1.33340001
 -0.038192   -0.15667     0.94351    -0.21873    -0.15586001  0.084439
 -0.058604   -0.55145001 -0.53280997  1.24339998  0.63441002  0.79233998
  0.0097936  -1.71239996 -0.77291    -1.00240004 -0.69471997 -0.50487
  3.05170012  1.49810004 -0.32957    -0.53871    -0.21201    -0.14259
 -0.02706     0.58579999 -0.56642002 -0.55984002 -0.60904998 -0.57062
  1.33379996  0.67097002  1.06429994 -0.4181     -0.44273001 -1.0158
 -0.35795    -0.31110999]
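The parsing step can be tried without the real file. The sketch below runs the same `get_coefs` pattern over a tiny in-memory stand-in; the two words and their 3-dimensional vectors are invented for illustration and are not taken from GloVe.

```python
import io
import numpy as np

def get_coefs(word, *arr):
    # First token is the word, the remaining tokens are the vector components.
    return word, np.asarray(arr, dtype="float32")

# Hypothetical 3-dimensional vectors standing in for the real 50-d file.
fake_glove = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.0 -1.0\n")

toy_index = dict(get_coefs(*line.strip().split()) for line in fake_glove)
print(toy_index["cat"])
```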


Use these vectors to create our embedding matrix, with random initialization for words that aren't in GloVe. When generating the random init, we'll use the same mean and standard deviation as the GloVe embeddings.

In [119]:
all_embs = np.stack(list(embeddings_index.values()))  # np.stack needs a sequence, not a dict view
emb_mean,emb_std = all_embs.mean(), all_embs.std()
emb_mean,emb_std

(0.020940498, 0.6441043)

In [120]:
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= nb_words: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
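The same fill-in logic can be checked on a toy vocabulary. This is a minimal sketch under made-up assumptions: the names `toy_vectors`, `toy_word_index` and `toy_max_features` and all values are hypothetical, and the vectors are 2-dimensional rather than 50-dimensional.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained vectors (2-d instead of 50-d).
toy_vectors = {"the": np.array([0.1, 0.2], dtype="float32"),
               "cat": np.array([0.3, 0.4], dtype="float32")}
# Tokenizer-style word index: indices start at 1, 0 is reserved for padding.
toy_word_index = {"the": 1, "zzyzx": 2, "cat": 3, "rare": 4}
toy_max_features = 4   # keep rows 0..3, so "rare" is skipped
dim = 2

stacked = np.stack(list(toy_vectors.values()))
# Start from random rows with the same mean and std as the known vectors.
matrix = rng.normal(stacked.mean(), stacked.std(), (toy_max_features, dim))
for word, i in toy_word_index.items():
    if i >= toy_max_features:
        continue                 # outside the vocabulary budget
    vec = toy_vectors.get(word)
    if vec is not None:
        matrix[i] = vec          # overwrite the random row with the known vector

print(matrix.shape)  # (4, 2)
```

Rows 1 and 3 now hold the known vectors, while rows 0 and 2 (padding and the unknown word) keep their random initialization.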

Simple bidirectional LSTM with two fully connected layers. We add some dropout to the LSTM since even 2 epochs is enough to overfit.

In [121]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
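Because the LSTM returns one output per time step, the pooling layer has to reduce the time axis before the dense layers. The NumPy sketch below shows the kind of reduction a global max pool performs, on an invented array; it is an illustration of the idea, not the Keras implementation.

```python
import numpy as np

# Hypothetical sequence output: 2 comments, 3 time steps, 4 features each.
seq_out = np.array([
    [[0.1, 0.9, -0.3, 0.0],
     [0.4, 0.2,  0.5, 0.1],
     [0.3, 0.7, -0.1, 0.6]],
    [[-1.0, 0.0, 0.2, 0.3],
     [-0.5, 0.1, 0.2, 0.9],
     [-0.2, 0.0, 0.8, 0.4]],
])

# Take the maximum of each feature over the time axis (axis 1).
pooled = seq_out.max(axis=1)
print(pooled.shape)  # (2, 4)
```

Each comment is reduced to a single fixed-length vector regardless of how many time steps it had.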

Now we're ready to fit our model! Use `validation_split` when not submitting.

In [122]:
model.fit(X_t, y, batch_size=32, epochs=2, validation_split=0.1);

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2


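And finally, get predictions for the held-out sentences (here, the not-clean subset of the training comments) and check how many of them the model flags. The lines that would prepare a submission CSV are left commented out: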

In [127]:
y_test = model.predict([X_te], batch_size=1024, verbose=1)



In [134]:

sample_submission = pd.read_csv(f'{path}sample_submission.csv')
#print(just_toxic)
# Threshold the predicted probabilities at 0.5 to get 0/1 labels.
toxic_predicts = np.array(y_test > 0.5) * 1
toxic_actual = np.array(just_toxic['notClean']).reshape(len(toxic_predicts), 1)
print(toxic_predicts)
print(toxic_actual)
print(abs(toxic_predicts - toxic_actual))
# 1 where prediction and label disagree, 0 where they agree.
misclassified_vector = abs(toxic_predicts - toxic_actual)

# Fraction of not-clean comments that the model labels correctly.
print(np.sum(misclassified_vector == 0, axis = 0) * 1 / len(toxic_actual))

#sample_submission[list_classes] = y_test
#sample_submission.to_csv('../output/LSTM_submission.csv', index=True)

[[1]
 [0]
 [0]
 ..., 
 [1]
 [1]
 [1]]
[[1]
 [1]
 [1]
 ..., 
 [1]
 [1]
 [1]]
[[0]
 [1]
 [1]
 ..., 
 [0]
 [0]
 [0]]
[ 0.75124807]
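Since every comment in this subset is labelled not clean, the printed fraction is the share of them that the model flags. A minimal sketch of that calculation on invented probabilities (the array `probs` is hypothetical, not model output):

```python
import numpy as np

# Hypothetical predicted probabilities for four not-clean comments.
probs = np.array([[0.9], [0.2], [0.4], [0.7]])

preds = (probs > 0.5).astype(int)   # threshold at 0.5
actual = np.ones_like(preds)        # every comment in the subset is not clean

# Share of comments where prediction and label agree.
hit_rate = float((preds == actual).mean())
print(hit_rate)  # 0.5
```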


0.06505819275532888