# Improved LSTM baseline

This kernel is a somewhat improved version of [Keras - Bidirectional LSTM baseline](https://www.kaggle.com/CVxTz/keras-bidirectional-lstm-baseline-lb-0-051) along with some additional documentation of the steps. (NB: this notebook has been re-run on the new test set.)

In [1]:
# Fast Text
# Increase the glove Embedding
# Use Fast Text to generate the embedding

In [1]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd

import fastText

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

import matplotlib.pyplot as plt
%matplotlib inline  

Using TensorFlow backend.


In [28]:
ft = fastText.load_model('wv/wiki.en.bin')
# Predict Language
# lg = fastText.load_model('wv/lid.176.bin')

In [29]:
def normalize(s):
    """
    Given a text, cleans and normalizes it. Feel free to add your own stuff.
    """
    s = s.lower()
    # Replace ips
    s = re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', ' _ip_ ', s)
    # Isolate punctuation
    s = re.sub(r'([\'\"\.\(\)\!\?\-\\\/\,])', r' \1 ', s)
    # Remove some special characters
    s = re.sub(r'([\;\:\|•«\n])', ' ', s)
    
    return s
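
To see what `normalize` does to a comment, here is a small usage sketch. The function body is copied from the cell above so the snippet runs on its own; the two sample strings are made up for illustration.

```python
import re

def normalize(s):
    s = s.lower()
    s = re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', ' _ip_ ', s)
    s = re.sub(r'([\'\"\.\(\)\!\?\-\\\/\,])', r' \1 ', s)
    s = re.sub(r'([\;\:\|•«\n])', ' ', s)
    return s

# Made-up sample comments; split() collapses the extra spaces that normalize inserts
tokens_a = normalize("Hello, World! 192.168.1.1").split()
tokens_b = normalize("Note: don't\nstop").split()
print(tokens_a)
print(tokens_b)
```

Because the IP pattern runs before punctuation is isolated, the dots inside an address are consumed as part of the `_ip_` token rather than being split off.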

We include the GloVe word vectors in our input files. To include these in your kernel, simply click 'input files' at the top of the notebook and search for 'glove' in the 'datasets' section.

In [5]:
path = 'data/'
TRAIN_DATA_FILE=f'{path}train.csv'
TEST_DATA_FILE=f'{path}test.csv'

Set some basic config parameters:

In [6]:
embed_size = 300 # how big is each word vector
max_features = 500000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 500 # max number of words in a comment to use
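
These settings determine how large a dense batch of word vectors is. A back-of-the-envelope sketch, assuming float32 values (4 bytes each) and a batch size of 32; both are assumptions at this point, chosen to match the generator and `fit_generator` call further down.

```python
embed_size = 300
maxlen = 500
batch_size = 32      # assumption: matches the bs=32 used later in the notebook
bytes_per_value = 4  # assumption: float32

bytes_per_comment = maxlen * embed_size * bytes_per_value
bytes_per_batch = batch_size * bytes_per_comment
print(f"{bytes_per_comment / 1e6:.1f} MB per comment, {bytes_per_batch / 1e6:.1f} MB per batch")
```

At roughly 0.6 MB per embedded comment, materializing a whole dataset up front gets expensive quickly, which is one reason to embed batch by batch instead.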

Read in our data and replace missing values:

In [7]:
train = pd.read_csv(TRAIN_DATA_FILE)
test = pd.read_csv(TEST_DATA_FILE)
# Replace missing comments, then clean the text (assign back rather than relying on inplace=True)
train["comment_text"] = train["comment_text"].fillna("_empty_").apply(normalize)
test["comment_text"] = test["comment_text"].fillna("_empty_").apply(normalize)
list_sentences_train = train["comment_text"].values

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_test = test["comment_text"].values

In [90]:
# class threadsafe_iter:
#     """Takes an iterator/generator and makes it thread-safe by
#     serializing call to the `next` method of given iterator/generator.
#     """
#     def __init__(self, it):
#         self.it = it
#         self.lock = threading.Lock()

#     def __iter__(self):
#         return self

#     def next(self):
#         with self.lock:
#             return self.it.next()


# def threadsafe_generator(f):
#     """A decorator that takes a generator function and makes it thread-safe.
#     """
#     def g(*a, **kw):
#         return threadsafe_iter(f(*a, **kw))
#     return g

# @threadsafe_generator
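def trainGenerator(X,y,bs=32):
    counter = 0
    sample = len(X) // bs  # number of full batches per epoch
    while 1:
        # Wrap around after the last full batch so every slice has exactly bs rows
        if counter >= sample:
            counter = 0
        raw = X[counter*bs:(counter+1)*bs]
        # Allocate the batch once; allocating inside the row loop would wipe earlier rows
        process_x = np.zeros([bs,maxlen,embed_size], dtype='float32')
        for r in range(bs):
            raw_to_list = raw[r].split(" ")
            if len(raw_to_list) < maxlen:
                # Pad short comments with " " tokens up to maxlen
                to_append = maxlen - len(raw_to_list)
                raw_to_list.extend([" "]*to_append)
            # Longer comments are truncated to their first maxlen tokens
            for i in range(maxlen):
                process_x[r,i] = ft.get_word_vector(raw_to_list[i])
        yield process_x , y[counter*bs:(counter+1)*bs]
        counter += 1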

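def testGenerator(X,bs=32):
    counter = 0
    sample = -(-len(X) // bs)  # ceiling division, so the final partial batch is included
    while 1:
        # Wrap around after the last batch instead of yielding an empty slice
        if counter >= sample:
            counter = 0
        yield X[counter*bs:(counter+1)*bs]
        counter += 1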

        
# nex = trainGenerator(list_sentences_train,y,bs=2)
# t1,t2 = next(nex)
# t1.shape,t2.shape
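
A minimal, self-contained sketch of the batching idea, with a hypothetical `fake_word_vector` standing in for `ft.get_word_vector` and tiny made-up sizes so that it runs without the fastText model. It illustrates the technique under those assumptions and is not the notebook's exact generator.

```python
import numpy as np

MAXLEN, EMBED = 4, 3  # toy sizes (assumptions), not the notebook's 500 and 300

def fake_word_vector(word):
    # Hypothetical stand-in for ft.get_word_vector: a constant vector based on word length
    return np.full(EMBED, len(word), dtype='float32')

def batch_generator(texts, labels, bs):
    n_batches = len(texts) // bs  # full batches only
    i = 0
    while True:
        batch = np.zeros((bs, MAXLEN, EMBED), dtype='float32')  # allocated once per batch
        for r, text in enumerate(texts[i*bs:(i+1)*bs]):
            words = (text.split(" ") + [" "] * MAXLEN)[:MAXLEN]  # pad, then truncate
            for j, w in enumerate(words):
                batch[r, j] = fake_word_vector(w)
        yield batch, labels[i*bs:(i+1)*bs]
        i = (i + 1) % n_batches  # wrap around for the next epoch

texts = ["a bb", "ccc dddd e f g", "hh", "i"]  # made-up comments
labels = np.array([[0], [1], [0], [1]])
gen = batch_generator(texts, labels, bs=2)
x1, y1 = next(gen)
x2, y2 = next(gen)
x3, y3 = next(gen)  # wraps back to the first batch
print(x1.shape, y1.shape)
```

Each yielded pair has a fixed shape of `(bs, MAXLEN, EMBED)` for the inputs, which is what a Keras `Input(shape=(maxlen, embed_size))` layer expects.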

In [None]:
# raw = 'this is the day'
# raw_list = raw.split(" ")
# if len(raw_list) < 20:
#     to_append = 20 - len(raw_list)
#     raw_list.extend(["x"]*to_append)
# raw_list
# np.zeros([2,5]).shape
# test = np.zeros([2,5])
# test[0] = np.array([1,2,3,4,5])
# test

In [None]:
# train['lang'] = train.comment_text.apply(lambda x:lg.predict(x)[0][0][-2:])
# test['lang'] = test.comment_text.apply(lambda x:lg.predict(x)[0][0][-2:])

Standard Keras preprocessing to turn each comment into a list of word indexes of equal length (with truncation or padding as needed).

In [None]:
# tokenizer = Tokenizer(num_words=max_features)
# tokenizer = Tokenizer(oov_token='_oov_')
# tokenizer.fit_on_texts(list(list_sentences_train))
# list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
# list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
# X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
# X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
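
A plain-Python sketch of what equal-length padding and truncation means. It mirrors the default behaviour of Keras `pad_sequences` (zeros added at the front, excess tokens dropped from the front); `pad_or_truncate` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_or_truncate(seq, maxlen, value=0):
    seq = list(seq)[-maxlen:]                    # keep the last maxlen items ("pre" truncation)
    return [value] * (maxlen - len(seq)) + seq   # left-pad with `value` ("pre" padding)

# Made-up index sequences: one too short, one too long
padded = [pad_or_truncate(s, maxlen=4) for s in [[5, 8], [1, 2, 3, 4, 5, 6]]]
print(padded)
```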

Read the GloVe word vectors (space-delimited strings) into a dictionary from word -> vector.

Use these vectors to create our embedding matrix, with random initialization for words that aren't in GloVe. We'll use the same mean and stdev of embeddings that GloVe has when generating the random init.

In [None]:
# word_index = tokenizer.word_index
# nb_words = min(max_features, len(word_index))+1
# nb_words = len(word_index)+1
# embedding_matrix = np.random.normal(-0.0039050116, 0.38177028, (nb_words, embed_size))
# for word, i in word_index.items():
#     embedding_matrix[i] = ft.get_word_vector(word).astype('float32') # out of word vocabulary
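
The embedding-matrix construction can be sketched with toy data. `toy_vectors`, `word_index`, and the 3-dimensional size are made up for illustration; the mean and standard deviation are the values from the commented cell.

```python
import numpy as np

embed_dim = 3  # toy size (assumption); the notebook uses 300
toy_vectors = {  # made-up pretrained vectors
    "cat": np.array([0.1, 0.2, 0.3]),
    "dog": np.array([0.4, 0.5, 0.6]),
}
word_index = {"cat": 1, "dog": 2, "zzz": 3}  # index 0 is left for padding

rng = np.random.RandomState(0)
# Start from random rows drawn with the pretrained vectors' mean and stdev
embedding_matrix = rng.normal(-0.0039050116, 0.38177028, (len(word_index) + 1, embed_dim))
for word, i in word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:  # known word: overwrite the random row
        embedding_matrix[i] = vec
print(embedding_matrix.shape)
```

With a fastText `.bin` model, vectors for unseen words are composed from character n-grams, so a lookup through `ft.get_word_vector` would overwrite every row and the random initialization would only matter for index 0.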

In [None]:
# embedding_matrix.shape

Simple bidirectional LSTM with two fully connected layers. We add some dropout to the LSTM since even 2 epochs is enough to overfit.

In [86]:
inp = Input(shape=(maxlen,embed_size))
# x = Embedding(nb_words, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(inp)
x = GlobalMaxPool1D()(x)
x = Dense(100, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 500, 300)          0         
_________________________________________________________________
bidirectional_4 (Bidirection (None, 500, 100)          140400    
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 100)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 100)               10100     
_________________________________________________________________
dropout_3 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 6)                 606       
Total params: 151,106
Trainable params: 151,106
Non-trainable params: 0
_________________________________________________________________
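
The parameter counts in the summary can be checked by hand. A sketch of the arithmetic, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias:

```python
input_dim, units = 300, 50

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias)
lstm_one_direction = 4 * (units * (input_dim + units) + units)
bidirectional = 2 * lstm_one_direction

dense_hidden = 100 * 100 + 100  # 100 pooled features -> 100 units, plus biases
dense_out = 100 * 6 + 6         # 100 units -> 6 classes, plus biases
total = bidirectional + dense_hidden + dense_out
print(bidirectional, dense_hidden, dense_out, total)
```

Pooling and dropout contribute no parameters, so the three figures add up to the total reported above.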


In [88]:
len(list_sentences_train) // 32

4986

Now we're ready to fit our model! Use `validation_split` when not submitting.

In [None]:
# model.fit(X_t, y, batch_size=32, epochs=2, validation_split=0.1);
model.fit_generator(trainGenerator(list_sentences_train,y,bs=32),steps_per_epoch=4986, epochs=1);

Epoch 1/1
  54/4986 [..............................] - ETA: 1:31:15 - loss: 0.3624 - acc: 0.9606
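
`validation_split` is an argument of `model.fit`; `fit_generator` instead takes `validation_data` (a generator or a tuple of arrays). A minimal sketch of carving out a hold-out set by hand, assuming the last 10% of rows are held out, which is how `validation_split` picks its samples. `holdout_split` is a hypothetical helper and the data is made up.

```python
def holdout_split(X, y, frac=0.1):
    # Hold out the final `frac` of the rows for validation
    n_val = int(len(X) * frac)
    cut = len(X) - n_val
    return X[:cut], y[:cut], X[cut:], y[cut:]

X_demo = [f"comment {i}" for i in range(20)]  # made-up data
y_demo = list(range(20))
X_tr, y_tr, X_val, y_val = holdout_split(X_demo, y_demo)
print(len(X_tr), len(X_val))
```

The training part could then feed `trainGenerator`, and the held-out part a second generator passed as `validation_data`, with `validation_steps` set to its number of batches.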

And finally, get predictions for the test set and prepare a submission CSV:

In [13]:
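def embed(s):  # same " " padding and maxlen truncation as trainGenerator
    return [ft.get_word_vector(w) for w in (s.split(" ") + [" "]*maxlen)[:maxlen]]
# The model takes fastText vectors, not the token-index X_te, so embed and predict batch by batch
y_test = np.vstack([model.predict_on_batch(np.array([embed(s) for s in list_sentences_test[i:i+1024]], dtype='float32'))
                    for i in range(0, len(list_sentences_test), 1024)])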
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission[list_classes] = y_test
sample_submission.to_csv('fast_text_baseline_3_norm.csv', index=False)



In [4]:
# sample_submission.to_csv('base_test.csv',index=False)

In [19]:
# test_submission = pd.read_csv('data/sample_submission.csv')
# len(test_submission)

In [None]:
# Baseline Score
# loss: 0.0417 - acc: 0.9840 - val_loss: 0.0451 - val_acc: 0.9829 --> AUC : 0.9787
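
The AUC quoted above is presumably the mean of the per-class ROC AUCs (an assumption; the notebook does not say how it was computed). A dependency-free sketch of that metric on made-up scores, where `roc_auc` and `mean_columnwise_auc` are hypothetical helpers:

```python
def roc_auc(labels, scores):
    # Probability that a random positive outranks a random negative (ties count half)
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_columnwise_auc(y_true, y_pred):
    # y_true, y_pred: lists of rows, one column per class
    n_classes = len(y_true[0])
    aucs = [roc_auc([r[c] for r in y_true], [r[c] for r in y_pred])
            for c in range(n_classes)]
    return sum(aucs) / n_classes

# Made-up labels and predictions for two classes
y_true = [[0, 1], [0, 0], [1, 1], [1, 0]]
y_pred = [[0.1, 0.9], [0.4, 0.2], [0.35, 0.8], [0.8, 0.3]]
auc_first = roc_auc([r[0] for r in y_true], [r[0] for r in y_pred])
auc_mean = mean_columnwise_auc(y_true, y_pred)
print(auc_first, auc_mean)
```

The pairwise comparison is quadratic in the number of rows, so it is only suitable for illustration; on real data a library routine such as `sklearn.metrics.roc_auc_score` is the practical choice.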

