# Improved LSTM baseline

This kernel is a somewhat improved version of [Keras - Bidirectional LSTM baseline](https://www.kaggle.com/CVxTz/keras-bidirectional-lstm-baseline-lb-0-051) along with some additional documentation of the steps. (NB: this notebook has been re-run on the new test set.)

In [24]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, GRU
from keras.layers import Bidirectional, GlobalMaxPool1D
import keras.layers
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

We include the GloVe word vectors in our input files. To include these in your kernel, simply click 'input files' at the top of the notebook and search for 'glove' in the 'datasets' section.

In [25]:
path = '../input/'
EMBEDDING_FILE=f'{path}glove6b50d/glove.6B.50d.txt'
TRAIN_DATA_FILE=f'{path}train.csv'
TRAIN2_DATA_FILE=f'{path}train2.csv'
TRAIN3_DATA_FILE=f'{path}train3.csv'
TEST_DATA_FILE=f'{path}test.csv'
TRAIN4_DATA_FILE=f'{path}train4.csv'
TEST4_DATA_FILE=f'{path}test4.csv'

Set some basic config parameters:

In [26]:
embed_size = 50 # how big is each word vector
max_features = 20000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 100 # max number of words in a comment to use

Read in our data and replace missing values:

In [27]:
train = pd.read_csv(TRAIN2_DATA_FILE)
test = pd.read_csv(TEST_DATA_FILE)
train3 = pd.read_csv(TRAIN3_DATA_FILE)
test4 = pd.read_csv(TEST4_DATA_FILE)
train4 = pd.read_csv(TRAIN4_DATA_FILE)

In [52]:
train4[['comment_text','cleaned']]

Unnamed: 0,comment_text,cleaned
0,Explanation\r\nWhy the edits made under my use...,explanation why the edit make under my usernam...
1,D'aww! He matches this background colour I'm s...,d aww he match this background colour i am see...
2,"Hey man, I'm really not trying to edit war. It...",hey man i am really not try to edit war it is ...
3,"""\r\nMore\r\nI can't make any real suggestions...",more i cannot make any real suggestions on imp...
4,"You, sir, are my hero. Any chance you remember...",you sir be my hero any chance you remember wha...
5,"""\r\n\r\nCongratulations from me as well, use ...",congratulations from me as well use the tool w...
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,cocksucker before you piss around on my work
7,Your vandalism to the Matt Shirvington article...,your vandalism to the matt shirvington article...
8,Sorry if the word 'nonsense' was offensive to ...,sorry if the word nonsense be offensive to you...
9,alignment on this subject and which are contra...,alignment on this subject and which be contrar...


In [29]:
train = train4
test = test4

In [48]:

# notClean_array = 1 * ((train["toxic"] == 1) | (train["obscene"] == 1) | (train["threat"] == 1) | (train['insult'] == 1) | (train['identity_hate'] == 1))
# clean = train[notClean_array == 0]
# just_toxic = train[notClean_array == 1]
# new_train = pd.concat([clean[:14334], just_toxic])
new_train = train
list_sentences_train = new_train["cleaned"].fillna("_na_").values

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = new_train[list_classes].values


list_extra = ["count_sent", "count_word", "count_char", "count_unique_word", "count_punctuations", "count_words_upper", "count_words_title", "count_stopwords", "mean_word_len", "word_unique_percent", "punct_percent","neg_polarity","neutral_polarity","positive_polarity", "compound_polarity","misspelled_prop","has_profanity","profane_count","profane_prop"]
x_extra = new_train[list_extra].values




In [49]:
list_sentences_test = test["cleaned"].fillna("_na_").values
binary_y = new_train[["toxic"]].values

In [32]:
notClean_array = 1 * ((train["toxic"] == 1) | (train["obscene"] == 1) | (train["threat"] == 1) | (train['insult'] == 1) | (train['identity_hate'] == 1))
clean = train[notClean_array == 0]
just_toxic = train[notClean_array == 1]

Standard Keras preprocessing turns each comment into a list of word indexes of equal length (with truncation or padding as needed).

In [33]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
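To make the padding step concrete, here is a minimal NumPy sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): short sequences get zeros on the left, and long ones keep only their last `maxlen` tokens. The helper `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Illustrative stand-in for 'pre' padding and 'pre' truncation."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # empty comment: the row stays all padding
        trunc = seq[-maxlen:]            # keep the last maxlen tokens
        out[row, -len(trunc):] = trunc   # right-align, padding on the left
    return out

toy_sequences = [[5, 8], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```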

In [34]:
X_t.shape

(159571, 100)

In [35]:
x_extra.shape

(159571, 19)

Read the GloVe word vectors (space-delimited strings) into a dictionary mapping word -> vector.

In [36]:
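def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')
# each line is "<word> <v1> <v2> ..."; the context manager closes the file when done
with open(EMBEDDING_FILE, encoding='utf8') as f:
    embeddings_index = dict(get_coefs(*o.strip().split()) for o in f)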
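The parsing logic is easier to follow on a tiny example. The sketch below feeds `get_coefs` two made-up lines in the GloVe text layout from an in-memory buffer; the words, the values, and the 3-dimensional size are invented for illustration (the real file has 50 values per line).

```python
import io
import numpy as np

def get_coefs(word, *arr):
    # first token is the word, the rest are the vector components
    return word, np.asarray(arr, dtype="float32")

# Made-up vectors in the "<word> <v1> <v2> ..." layout
fake_glove = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")

toy_index = dict(get_coefs(*line.strip().split()) for line in fake_glove)
print(toy_index["cat"])
```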


Use these vectors to create our embedding matrix, with random initialization for words that aren't in GloVe. We draw the random init from a normal distribution with the same mean and standard deviation as the GloVe embeddings.

In [37]:
all_embs = np.stack(list(embeddings_index.values()))  # np.stack needs a sequence, not a dict view
emb_mean,emb_std = all_embs.mean(), all_embs.std()
emb_mean,emb_std

(0.020940498, 0.6441043)

In [38]:
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
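Here is the same fill-in logic on a toy vocabulary, as a sketch under made-up assumptions: `toy_vectors`, `toy_word_index`, and the 2-dimensional size are hypothetical. Rows for words with a pretrained vector are overwritten, while row 0 (unused, since Tokenizer indices start at 1) and the row for the unknown word keep their random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "pretrained" vectors and tokenizer vocabulary
toy_vectors = {"good": np.array([1.0, 2.0], dtype="float32"),
               "bad": np.array([-1.0, -2.0], dtype="float32")}
toy_word_index = {"good": 1, "zzzunknown": 2, "bad": 3}
toy_max_features, toy_embed_size = 4, 2

# Random init with the mean and std of the known vectors
stacked = np.stack(list(toy_vectors.values()))
matrix = rng.normal(stacked.mean(), stacked.std(),
                    (toy_max_features, toy_embed_size))

for word, i in toy_word_index.items():
    if i >= toy_max_features:
        continue                 # word falls outside the kept vocabulary
    vec = toy_vectors.get(word)
    if vec is not None:
        matrix[i] = vec          # overwrite the random row with the known vector
print(matrix)
```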

Simple bidirectional LSTM with two fully connected layers. We add some dropout to the LSTM since even 2 epochs is enough to overfit.

In [None]:
inp = Input(shape=(maxlen,))

inp2 = Input(shape=(19,))  # one value per column in list_extra (x_extra has 19 columns)
x2 = Dense(50, activation='relu')(inp2)

x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
added = keras.layers.concatenate([x, x2])
x = Dense(100, activation="relu")(added)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=[inp, inp2], outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
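To see how the two branches meet, here is a NumPy sketch of the tensor shapes only; there are no trained weights and the random arrays are just placeholders. It assumes the default `Bidirectional` behaviour of concatenating the forward and backward outputs, so 50 units become 100 features per timestep.

```python
import numpy as np

batch, maxlen, units = 2, 100, 50
rng = np.random.default_rng(0)

# Stand-in for the Bidirectional(..., return_sequences=True) output
rnn_out = rng.normal(size=(batch, maxlen, 2 * units))

# GlobalMaxPool1D takes the max over the time axis
pooled = rnn_out.max(axis=1)

# Stand-in for the Dense(50) applied to the extra features
extra_hidden = rng.normal(size=(batch, 50))

# concatenate joins the two branches along the feature axis
merged = np.concatenate([pooled, extra_hidden], axis=1)
print(pooled.shape, merged.shape)
```

The dense layers after the merge therefore see 150 features per comment: 100 from the pooled recurrent output and 50 from the extra-feature branch.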

Now we're ready to fit our model! Use `validation_split` when not submitting.

In [None]:
model.fit([X_t, x_extra], y, batch_size=32, epochs=2, validation_split=0.1);
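A quick arithmetic sketch of what `validation_split=0.1` means for this data, assuming Keras holds out the last fraction of the rows (before any shuffling) and rounds the split index down. The row count is taken from `X_t.shape` above; the variable names are only for illustration.

```python
n_samples = 159571        # rows in X_t, from X_t.shape
validation_split = 0.1

# index where the training rows end and the held-out rows begin
split_at = int(n_samples * (1 - validation_split))
n_train, n_val = split_at, n_samples - split_at
print(n_train, n_val)
```

Because the held-out rows come from the end of the arrays, any ordering in the training file carries over into the validation set.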

And finally, get predictions for the test set and prepare a submission CSV:

In [None]:
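# the model has two inputs, so pass the extra test features alongside the padded text
y_test = model.predict([X_te, test[list_extra].values], batch_size=1024, verbose=1)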

In [None]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(GRU(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.fit(X_t, y, batch_size=32, epochs=2, validation_split=0.1);

In [None]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(GRU(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)

x = Bidirectional(GRU(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dense(100, activation="relu")(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [39]:
inp = Input(shape=(maxlen,))

inp2 = Input(shape=(19,))
x2 = Dense(50, activation='relu')(inp2)

x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(GRU(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
added = keras.layers.concatenate([x, x2])
x = Dense(100, activation="relu")(added)
x = Dropout(0.1)(x)
x = Dense(50, activation="relu")(x)  # chain from the previous layer; feeding `added` here would bypass Dense(100)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=[inp, inp2], outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [40]:
model.fit([X_t, x_extra], y, batch_size=32, epochs=2, validation_split=0.1);

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2


In [42]:
x_extra_test = test[list_extra].values
y_test = model.predict([X_te, x_extra_test], batch_size=1024, verbose=1)




In [44]:
y_test.shape

(153164, 6)

In [47]:
sample_submission[list_classes].shape

(153164, 1)

In [22]:
inp = Input(shape=(maxlen,))

inp2 = Input(shape=(19,))
x2 = Dense(50, activation='relu')(inp2)

x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(GRU(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
added = keras.layers.concatenate([x, x2])
x = Dense(100, activation="relu")(added)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
binary_model = Model(inputs=[inp, inp2], outputs=x)
binary_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [23]:
binary_model.fit([X_t, x_extra], binary_y, batch_size=32, epochs=2, validation_split=0.1);

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2


In [50]:
sample_submission = pd.read_csv(f'{path}sample_submission.csv')
sample_submission[list_classes] = y_test
sample_submission.to_csv('submission.csv', index=False)

In [46]:

sample_submission = pd.read_csv(f'{path}/dssb/sample_submission.csv')
#print(just_toxic)
toxic_predicts = np.array(y_test > 0.5) * 1
toxic_actual = np.array(just_toxic['notClean']).reshape(len(toxic_predicts), 1)
print(toxic_predicts)
print(toxic_actual)
print(abs(toxic_predicts - toxic_actual))
misclassified_vector = abs(toxic_predicts - toxic_actual)

print(np.sum(misclassified_vector == 0, axis = 0) * 1 / len(toxic_actual))

#sample_submission[list_classes] = y_test
#sample_submission.to_csv('../output/LSTM_submission.csv', index=True)

FileNotFoundError: File b'../input//dssb/sample_submission.csv' does not exist
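Independently of the missing file, the thresholding and agreement check that the cell above attempts can be sketched on toy arrays. This assumes predictions and labels share the same shape (one column per class); `toy_probs` and `toy_labels` are made up for illustration.

```python
import numpy as np

# Made-up predicted probabilities and ground-truth labels for two classes
toy_probs = np.array([[0.9, 0.2],
                      [0.4, 0.7],
                      [0.1, 0.6]])
toy_labels = np.array([[1, 0],
                       [1, 1],
                       [0, 0]])

toy_preds = (toy_probs > 0.5).astype(int)                    # threshold at 0.5
per_class_accuracy = (toy_preds == toy_labels).mean(axis=0)  # fraction correct per class
print(per_class_accuracy)
```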