# Hierarchal RNN

This kernel is a somewhat improved version of [Keras - Bidirectional LSTM baseline](https://www.kaggle.com/CVxTz/keras-bidirectional-lstm-baseline-lb-0-051) along with some additional documentation of the steps. (NB: this notebook has been re-run on the new test set.)

In [1]:
# Fast Text
# Increase the glove Embedding
# Use Fast Text to generate the embedding

In [1]:
# https://github.com/richliao/textClassifier/blob/master/textClassifierHATT.py
# https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf
# https://github.com/EdGENetworks/attention-networks-for-classification

In [1]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.layers import Conv1D, MaxPooling1D,Merge, GRU, TimeDistributed
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

from nltk import tokenize
from keras.preprocessing.text import text_to_word_sequence

from keras.engine.topology import Layer
from keras import initializers
from keras import backend as K
from keras.engine import InputSpec
from keras.initializers import zero
from keras.initializers import RandomNormal
import tensorflow as tf

import matplotlib.pyplot as plt
%matplotlib inline  

Using TensorFlow backend.


We include the GloVe word vectors in our input files. To include these in your kernel, simple click 'input files' at the top of the notebook, and search 'glove' in the 'datasets' section.

In [3]:
path = 'data/'
EMBEDDING_FILE=f'wv/glove.twitter.27B.25d.txt'
TRAIN_DATA_FILE=f'{path}train.csv'
TEST_DATA_FILE=f'{path}test.csv'

Set some basic config parameters:

In [4]:
MAX_SENT_LENGTH = 500
MAX_SENTS = 15
EMBEDDING_DIM = 25


Read in our data and replace missing values:

In [5]:
reviews = []
labels = []
texts = []

In [6]:
def normalize(s):
    """
    Given a text, cleans and normalizes it. Feel free to add your own stuff.
    """
    s = s.lower()
    # Replace ips
    s = re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', ' _ip_ ', s)
    # Isolate punctuation
    s = re.sub(r'([\'\"\.\(\)\!\?\-\\\/\,])', r' \1 ', s)
    # Remove some special characters
    
    s = re.sub(r'([\;\:\|•«\n「」¤]\xa0)', ' ', s)
    # Replace numbers and symbols with language
#     s = s.replace('&', ' and ')
#     s = s.replace('@', ' at ')
#     s = s.replace('0', ' zero ')
#     s = s.replace('1', ' one ')
#     s = s.replace('2', ' two ')
#     s = s.replace('3', ' three ')
#     s = s.replace('4', ' four ')
#     s = s.replace('5', ' five ')
#     s = s.replace('6', ' six ')
#     s = s.replace('7', ' seven ')
#     s = s.replace('8', ' eight ')
#     s = s.replace('9', ' nine ')
    return s

In [7]:
train = pd.read_csv(TRAIN_DATA_FILE)
test = pd.read_csv(TEST_DATA_FILE)

train["comment_text"].fillna("_empty_",inplace=True)
list_sentences_train = train["comment_text"].apply(lambda x:normalize(x)).values
test["comment_text"].fillna("_empty_",inplace=True)
list_sentences_test = test["comment_text"].apply(lambda x:normalize(x)).values

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values


In [8]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(list(list_sentences_train))

# list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
# list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)

# X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
# X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)

In [9]:
MAX_NB_WORDS = len(tokenizer.word_index)+1

In [10]:
for i in list_sentences_train:
    sentences = tokenize.sent_tokenize(i)
    reviews.append(sentences)

In [11]:
len(reviews),len(list_sentences_train)

(159571, 159571)

In [12]:
# Zero paddings 
data = np.zeros((len(list_sentences_train), MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
data.shape

(159571, 15, 500)

In [13]:
for i, sentences in enumerate(reviews):
    for j, sent in enumerate(sentences):
        if j< MAX_SENTS:
            # same as split + lower + punctuation removal
#             wordTokens = text_to_word_sequence(sent)
            wordTokens = sent.lower().split(' ')
#             k=0
            for k , word in enumerate(wordTokens):
                if k<MAX_SENT_LENGTH :
                    try :
                        data_i = tokenizer.word_index[word]
                    except KeyError:
#                         print(word)
                        data_i = 0
                    data[i,j,k] = data_i
#                     k=k+1                    
                

In [14]:
# from nltk.corpus import stopwords
# cachedStop =  stopwords.words('english')
# pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
# def cleanwords(sent):
#     return ' '.join([word.lower() for word in sent.lower().split() if word not in cachedStop ])
    # return pattern.sub('', sent.lower())

# def cleanchars(sent):
#     return sent.translate(translator)


Standard keras preprocessing, to turn each comment into a list of word indexes of equal length (with truncation or padding as needed).

Read the glove word vectors (space delimited strings) into a dictionary from word->vector.

In [15]:
def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE))

In [15]:
len(embeddings_index.values())

1193514

In [16]:
set([e.shape for e in embeddings_index.values()])
print(len([e.shape for e in embeddings_index.values() if e.shape[0] == 199]))

1


Use these vectors to create our embedding matrix, with random initialization for words that aren't in GloVe. We'll use the same mean and stdev of embeddings the GloVe has when generating the random init.

In [16]:
embed_i = [e for e in embeddings_index.values() if e.shape[0] == 25]
all_embs = np.stack(embed_i)
emb_mean,emb_std = all_embs.mean(), all_embs.std()
emb_mean,emb_std

(0.024764072, 0.98633558)

In [17]:
word_index = tokenizer.word_index
embedding_matrix = np.random.normal(emb_mean, emb_std, (MAX_NB_WORDS, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NB_WORDS: continue # greater than max word features
    embedding_vector = embeddings_index.get(word) # out of word vocabulary
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

Simple bidirectional LSTM with two fully connected layers. We add some dropout to the LSTM since even 2 epochs is enough to overfit.

In [18]:
VALIDATION_SPLIT = 0.1

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = y[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]

print('Number of positive and negative reviews in traing and validation set')
# print y_train.sum(axis=0)
# print y_val.sum(axis=0)


Number of positive and negative reviews in traing and validation set


In [19]:
x_train.shape,y_train.shape,x_val.shape,y_val.shape

((143614, 15, 500), (143614, 6), (15957, 15, 500), (15957, 6))

In [60]:
# from keras.layers import Conv1D, MaxPooling1D,Merge, GRU, RNN
# RNN??

## Option 1

In [19]:
embedding_layer = Embedding(MAX_NB_WORDS,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SENT_LENGTH,
                            trainable=True)

sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(LSTM(100))(embedded_sequences)
sentEncoder = Model(sentence_input, l_lstm)

review_input = Input(shape=(MAX_SENTS,MAX_SENT_LENGTH), dtype='int32')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(LSTM(100))(review_encoder)
preds = Dense(6, activation='sigmoid')(l_lstm_sent)
model = Model(review_input, preds)

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print("model fitting - Hierachical LSTM")
print(model.summary())


model fitting - Hierachical LSTM
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 15, 500)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 15, 200)           38832800  
_________________________________________________________________
bidirectional_2 (Bidirection (None, 200)               240800    
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 1206      
Total params: 39,074,806
Trainable params: 39,074,806
Non-trainable params: 0
_________________________________________________________________
None


In [23]:
model.fit(x_train, y_train, validation_data=(x_val, y_val),epochs=1, batch_size=32)

  from ipykernel import kernelapp as app


Train on 127657 samples, validate on 31914 samples
Epoch 1/1

KeyboardInterrupt: 

In [20]:
# Experiment
# x = K.placeholder(shape=(2, 3))
# y = K.placeholder(shape=(3, 4))
# xy = tf.keras.backend.dot(x, y)
# xy

# import numpy as np
# x = np.zeros([500,200])
# x.shape[-1]

# init = initializers.get('normal')
# w = init((200,))
# K.expand_dims(w).shape

# init

#batch, time(max_len),word_dim
# x = tf.placeholder(np.float32,(16,500,200))
# W1 = tf.placeholder(np.float32,(200,500))
# y = tf.keras.backend.dot(x,W1)
# y.shape

<tf.Tensor 'MatMul:0' shape=(2, 4) dtype=float32>

## Option 2

In [20]:
# building Hierachical Attention network

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SENT_LENGTH,
                            trainable=True)

class AttLayer(Layer):
    def __init__(self, **kwargs):
        self.init = initializers.get('normal')
        self.input_spec = [InputSpec(ndim=3)]
        self.attention_size = 50
        super(AttLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape)==3
        self.W = tf.Variable(tf.random_normal([input_shape[-1], self.attention_size], stddev=0.1))
        self.B = tf.Variable(tf.random_normal([self.attention_size], stddev=0.1))
        self.U = tf.Variable(tf.random_normal([self.attention_size], stddev=0.1))
        self.trainable_weights = [self.W,self.B,self.U]
        super(AttLayer, self).build(input_shape)  # be sure you call this somewhere!

    def call(self, x, mask=None):
        #  the shape of `v` is (B,T,D)*(D,A)=(B,T,A), where A=attention_size
        v = tf.tanh(tf.tensordot(x, self.W, axes=1) + self.B)
        vu = tf.tensordot(v, self.U, axes=1)  # (B,T) shape
        alphas = tf.nn.softmax(vu)         # (B,T) shape
        output = tf.reduce_sum(x * tf.expand_dims(alphas, -1), 1)
        
        return output
#         eij = tf.squeeze(tf.keras.backend.dot(x, tf.keras.backend.expand_dims(self.W,-1)), axis=-1)
        
#         ai = tf.exp(eij)
#         weights = ai/K.sum(ai, axis=1).dimshuffle(0,'x')
#         weights = tf.keras.backend.expand_dims(ai/tf.keras.backend.sum(ai, axis=1),-1)
        # replace dimshuffle with tf.expand_dims()
        
#         weighted_input = x*weights
#         return tf.keras.backend.sum(weighted_input,axis=1)
#         return weighted_input.sum(axis=1)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])

sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(GRU(100, return_sequences=True))(embedded_sequences)
l_dense = TimeDistributed(Dense(200))(l_lstm)
l_att = AttLayer()(l_dense)
sentEncoder = Model(sentence_input, l_att)

review_input = Input(shape=(MAX_SENTS,MAX_SENT_LENGTH), dtype='int32')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(GRU(50, return_sequences=True))(review_encoder)
l_dense_sent = TimeDistributed(Dense(50))(l_lstm_sent)
l_att_sent = AttLayer()(l_dense_sent)
preds = Dense(6, activation='sigmoid')(l_att_sent)
model = Model(review_input, preds)

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print("model fitting - Hierachical attention network")
model.summary()

model fitting - Hierachical attention network
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 15, 500)           0         
_________________________________________________________________
time_distributed_2 (TimeDist (None, 15, 200)           4949900   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 15, 100)           75300     
_________________________________________________________________
time_distributed_3 (TimeDist (None, 15, 50)            5050      
_________________________________________________________________
att_layer_2 (AttLayer)       (None, 50)                2600      
_________________________________________________________________
dense_3 (Dense)              (None, 6)                 306       
Total params: 5,033,156
Trainable params: 5,033,156
Non-trainable params: 0
____________________

In [None]:
# model.fit(x_val, y_val,epochs=1, batch_size=16)
model.fit(x_train, y_train, validation_data=(x_val, y_val),epochs=1, batch_size=32)

Train on 143614 samples, validate on 15957 samples
Epoch 1/1
   512/143614 [..............................] - ETA: 52:03 - loss: 0.2283 - acc: 0.9196

And finally, get predictions for the test set and prepare a submission CSV:

In [16]:
y_test = model.predict([X_te], batch_size=1024, verbose=1)
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission[list_classes] = y_test
sample_submission.to_csv('glove300.csv', index=False)



In [4]:
# sample_submission.to_csv('base_test.csv',index=False)

In [19]:
# test_submission = pd.read_csv('data/sample_submission.csv')
# len(test_submission)

In [None]:
# Baseline Score
# loss: 0.0417 - acc: 0.9840 - val_loss: 0.0451 - val_acc: 0.9829 --> AUC : 0.9787

