# Implementing Different Meta-Embeddings

The dataset used here is ***jigsaw-toxic-comment-classification-challenge***. It is a multi-label problem: a comment can belong to one or more of the 6 classes (or to none of them).

Meta-embeddings covered:
1. Concatenation
2. Averaging
3. DME (Dynamic Meta-Embedding)
4. PME (Projection Meta-Embedding)


GloVe and Paragram embeddings are used for demonstration purposes.


With the final implementation (concatenation + attention) I was able to get 98% on the test set.


Note: I have not used any special preprocessing steps here, just the tokenizer, because I only wanted to show code for implementing different meta-embeddings.

In [21]:

# imports
import sys, os, re, csv, codecs, gc
import numpy as np
import pandas as pd
from tqdm import tqdm

import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Reshape, multiply, Lambda
from keras.layers import Bidirectional, GlobalMaxPool1D, Concatenate, Layer, InputSpec
from keras.layers import SpatialDropout1D  # used by the DME and PME models
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from keras import backend as K



We include the GloVe word vectors in our input files. To include these in your kernel, simply click 'input files' at the top of the notebook and search for 'glove' in the 'datasets' section.

In [2]:
# specify the input paths

path = '../input/'
comp = 'jigsaw-toxic-comment-classification-challenge/'
TRAIN_DATA_FILE=f'{path}{comp}train.csv.zip'
TEST_DATA_FILE=f'{path}{comp}test.csv.zip'

Set some basic config parameters:

In [3]:
embed_size = 300 # how big is each word vector
max_features = 20000 # how many unique words to use (i.e num rows in embedding vector)
maxlen = 100 # max number of words in a comment to use

Read in our data and replace missing values:

In [4]:
train = pd.read_csv(TRAIN_DATA_FILE)
test = pd.read_csv(TEST_DATA_FILE)

list_sentences_train = train["comment_text"].fillna("_na_").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("_na_").values
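
Because this is a multi-label problem, each row of `y` is a 6-element binary vector rather than a single class index, which is why the models below end in six sigmoid units trained with binary cross-entropy. A minimal NumPy sketch of that loss, using made-up labels and probabilities (illustrative values only, not data from the competition):

```python
import numpy as np

# Toy targets for two comments over the six classes (illustrative values only).
y_true = np.array([[1, 0, 1, 0, 1, 0],   # toxic + obscene + insult
                   [0, 0, 0, 0, 0, 0]],  # clean comment
                  dtype="float32")

# Hypothetical sigmoid outputs of a model for the same two comments.
y_prob = np.array([[0.9, 0.1, 0.8, 0.1, 0.7, 0.2],
                   [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]],
                  dtype="float32")

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy, treating every label as its own yes/no task."""
    p = np.clip(y_prob, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

loss = binary_crossentropy(y_true, y_prob)
labels_per_comment = y_true.sum(axis=1)  # a comment may carry several labels or none
```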

Standard keras preprocessing, to turn each comment into a list of word indexes of equal length (with truncation or padding as needed).

In [5]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
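
By default `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`), so short comments are left-padded with zeros and long comments keep their last `maxlen` tokens. A small NumPy sketch that mimics this default behavior; `pad_pre` is a hypothetical helper written for illustration and is not part of Keras:

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Mimic the default pad_sequences behavior: left-pad and left-truncate."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                      # empty comment -> all padding
        trunc = seq[-maxlen:]             # keep the last `maxlen` tokens
        out[row, -len(trunc):] = trunc    # right-align, padding on the left
    return out

toy_sequences = [[5, 8], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy_sequences, maxlen=4)
# padded[0] -> [0, 0, 5, 8]; padded[1] -> [3, 4, 5, 6]
```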

I am only going to use GloVe and Paragram for creating the meta-embeddings. An additional way to increase performance is to preprocess the data to increase the coverage of the vector embeddings.

# Common method to get embedding weights 

In [6]:

# Build the embedding matrix for the tokenizer's vocabulary from a pretrained vector file.
def get_matrix(path, tokenizer, type="glove"):

    # parse one line of the vector file into (word, vector)
    def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')

    # loading the embeddings from the file
    with open(path, encoding="utf8", errors='ignore') as f:
        if type == "glove":
            embeddings_index = dict(get_coefs(*o.split(" ")) for o in tqdm(f))
        else:
            # skip lines of 100 characters or fewer (e.g. a header row)
            embeddings_index = dict(get_coefs(*o.split(" ")) for o in tqdm(f) if len(o) > 100)

    print('done reading vectors....')

    # calculating the mean and std
    all_embs = np.stack(list(embeddings_index.values()))
    emb_mean, emb_std = all_embs.mean(), all_embs.std()

    # calculating the shape
    embed_size = all_embs.shape[1]
    word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index) + 1)

    # defining the embedding matrix with random values that match the pretrained statistics
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

    # loading the vectors
    for word, i in tqdm(word_index.items()):
        if i >= nb_words: continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector

    del all_embs, embeddings_index

    return embedding_matrix


Use these vectors to create our embedding matrix, with random initialization for words that aren't in the pretrained vocabulary (GloVe or Paragram). We use the mean and standard deviation of the pretrained embeddings when generating the random initial values.
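
The same initialization scheme can be sketched on a toy vocabulary. Everything below (the 3-dimensional vectors, the word list, the names `toy_vectors` and `toy_word_index`) is made up for illustration; only the procedure mirrors `get_matrix`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pretrained vectors (3-dimensional, made up for illustration).
toy_vectors = {
    "good": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "bad": np.array([-0.1, -0.2, -0.3], dtype="float32"),
}
# Toy tokenizer vocabulary; index 0 is left for padding, as in Keras.
toy_word_index = {"good": 1, "bad": 2, "zzzunknown": 3}

all_embs = np.stack(list(toy_vectors.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()

# Start from random rows that match the pretrained statistics...
nb_words, dim = len(toy_word_index) + 1, all_embs.shape[1]
toy_matrix = rng.normal(emb_mean, emb_std, (nb_words, dim))

# ...then overwrite the rows of words that have a pretrained vector.
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[i] = vec
```

Rows 1 and 2 end up holding the pretrained vectors, while row 0 and the row of the unknown word keep their random values.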

In [7]:
gc.collect()

463

# Loading the GloVe Matrix

In [8]:
glove_embedding = get_matrix('../input/embeddings/glove-840B-300d.txt',tokenizer,type = 'glove')


2196017it [05:16, 6935.72it/s]


done reading vectors....


100%|██████████| 210337/210337 [00:00<00:00, 694256.97it/s]


In [9]:
glove_embedding.shape

(20000, 300)

# Loading the Paragram Embeddings

In [10]:
para_embedding = get_matrix('../input/embeddings/paragram-300-sl999.txt',tokenizer,type = 'glove')


1703756it [03:59, 7101.51it/s]


done reading vectors....


100%|██████████| 210337/210337 [00:00<00:00, 729967.91it/s]


In [11]:
para_embedding.shape

(20000, 300)

Different meta-embeddings:

1. Concatenation matrix
2. Average matrix
3. DME block (`DME_Block`)
4. PME

# Concatenation

In [13]:
cancatnation_matrix = np.concatenate([glove_embedding,para_embedding],axis = -1)
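
Concatenation places the two vectors for the same word side by side, so row `i` of the result holds the GloVe vector followed by the Paragram vector of word `i`. This relies on both matrices being indexed by the same tokenizer. A toy sketch with made-up 2-word, 3-dimensional matrices (`toy_glove` and `toy_para` are stand-ins, not the real embeddings):

```python
import numpy as np

toy_glove = np.array([[1., 2., 3.], [4., 5., 6.]])
toy_para = np.array([[10., 20., 30.], [40., 50., 60.]])

# Join along the feature axis: 2 words x (3 + 3) dimensions.
toy_concat = np.concatenate([toy_glove, toy_para], axis=-1)
```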



In [26]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size * 2 , weights=[cancatnation_matrix])(inp)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.3, recurrent_dropout=0.3))(x)
x1 = GlobalMaxPool1D()(x)
x2 = GlobalMaxPool1D()(x)
x = Concatenate(axis=-1)([x1, x2])
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [27]:
model.fit(X_t, y, batch_size=64, epochs=1, validation_split=0.1);

Train on 143613 samples, validate on 15958 samples
Epoch 1/1


# Averaging 
For the average meta-embedding, the embedding size stays the same (`embed_size = 300`).
This technique reduces the training time but also reduces the accuracy.

In [32]:
average_matrix = (cancatnation_matrix[:,:300] + cancatnation_matrix[:,300:]) / 2
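
Averaging the two halves of the concatenated matrix is equivalent to averaging the two source matrices directly, which only works because both embeddings have the same dimensionality. A toy check with made-up values (the `toy_*` names are assumptions for illustration):

```python
import numpy as np

toy_glove = np.array([[1., 2., 3.], [4., 5., 6.]])
toy_para = np.array([[10., 20., 30.], [40., 50., 60.]])
toy_concat = np.concatenate([toy_glove, toy_para], axis=-1)

dim = toy_glove.shape[1]
avg_from_concat = (toy_concat[:, :dim] + toy_concat[:, dim:]) / 2
avg_direct = (toy_glove + toy_para) / 2
```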

In [33]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size , weights=[average_matrix])(inp)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.3, recurrent_dropout=0.3))(x)
x1 = GlobalMaxPool1D()(x)
x2 = GlobalMaxPool1D()(x)
x = Concatenate(axis=-1)([x1, x2])
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.fit(X_t, y, batch_size=64, epochs=2, validation_split=0.1);

# DME
For DME we make use of a separate block, which takes three parameters (`inp`, `maxlen` and `n_emb`), where:

1. `inp` = the stacked embeddings, a tensor of shape (?, maxlen, embedding dim, n_emb)
2. `n_emb` = the number of embeddings used (in our case 2: GloVe and Paragram)
3. `maxlen` = the maximum length of each input sample
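
The block computes one gate per source embedding at every timestep, scales each embedding by its gate, and sums over the embedding axis. The NumPy sketch below reproduces only that weighting-and-summing arithmetic: the gate values are random stand-ins for what the LSTM plus sigmoid would produce, and all sizes are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, toy_maxlen, dim, n_emb = 2, 4, 3, 2

# Stacked embeddings: the last axis indexes the source (e.g. GloVe, Paragram).
stacked = rng.normal(size=(batch, toy_maxlen, dim, n_emb))

# One gate in (0, 1) per timestep and per source; random here, learned in the model.
gates = 1.0 / (1.0 + np.exp(-rng.normal(size=(batch, toy_maxlen, n_emb))))

# Broadcast the gates over the embedding dimension and sum out the source axis.
weighted = stacked * gates[:, :, None, :]
meta = weighted.sum(axis=-1)            # shape (batch, toy_maxlen, dim)

# The same result written out explicitly for the two sources.
manual = (stacked[..., 0] * gates[..., 0:1]
          + stacked[..., 1] * gates[..., 1:2])
```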
        
    



In [28]:
# dme
from keras.layers import Activation

def DME_Block(inp, maxlen, n_emb):
    """
    inp = tensor of shape (?, maxlen, embedding dim, n_emb), where n_emb is the number of embedding matrices
    out = tensor of shape (?, maxlen, embedding dim)
    """
    # flatten the embedding and source axes so the LSTM sees one vector per timestep
    x = Reshape((maxlen, -1))(inp)
    # one sigmoid gate per source embedding at every timestep
    x = LSTM(n_emb, return_sequences=True)(x)
    x = Activation('sigmoid')(x)

    # reshape the gates so they broadcast over the embedding dimension
    x = Reshape((maxlen, 1, n_emb))(x)

    x = multiply([inp, x])

    # sum the weighted embeddings over the source axis
    out = Lambda(lambda x: K.sum(x, axis=-1))(x)
    return out

In [29]:
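# random stand-in for three stacked 600-dimensional embeddings (avoids shadowing the built-in `input`)
dummy = tf.constant(np.random.rand(32, 100, 600, 3), dtype='float32')

DME_Block(dummy, 100, 3)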

<tf.Tensor 'lambda_1/Sum:0' shape=(32, 100, 600) dtype=float32>

In [1]:
from keras.layers import Permute

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size * 2, weights=[cancatnation_matrix])(inp)
# DME_Block expects (?, maxlen, embedding dim, n_emb), so split the concatenated
# vector into its two sources: (maxlen, 600) -> (maxlen, 2, 300) -> (maxlen, 300, 2)
x = Reshape((maxlen, 2, embed_size))(x)
x = Permute((1, 3, 2))(x)
x = DME_Block(x, maxlen, 2)
x = SpatialDropout1D(0.3)(x)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.3, recurrent_dropout=0.3))(x)
x1 = GlobalMaxPool1D()(x)
x2 = GlobalMaxPool1D()(x)
x = Concatenate(axis=-1)([x1, x2])
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
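
`DME_Block` documents its input as `(?, maxlen, embedding dim, n_emb)`, whereas an `Embedding` layer built on the concatenated matrix emits `(?, maxlen, 600)`, with the GloVe values in the first 300 positions and the Paragram values in the last 300. The NumPy sketch below shows one way to split such a vector into the stacked layout. A plain reshape to `(dim, n_emb)` would interleave the two sources, so the sketch reshapes to `(n_emb, dim)` first and then swaps the axes (toy sizes, not the real ones):

```python
import numpy as np

dim, n_emb = 3, 2
glove_vec = np.array([1., 2., 3.])
para_vec = np.array([10., 20., 30.])
concat_vec = np.concatenate([glove_vec, para_vec])   # [1, 2, 3, 10, 20, 30]

# Split into (n_emb, dim) blocks, then move the source axis last.
stacked = concat_vec.reshape(n_emb, dim).T            # shape (dim, n_emb)

# Reshaping straight to (dim, n_emb) mixes values from the two sources.
naive = concat_vec.reshape(dim, n_emb)
```

In Keras the same idea could be expressed as `Reshape((maxlen, n_emb, embed_size))` followed by `Permute((1, 3, 2))`.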

# PME: Projection Meta-Embedding

1. Concatenate the different embedding matrices and feed the result through the embedding layer.
2. Use a linear layer to project the vectors to a lower-dimensional space.
3. Apply a ReLU. That's it; the output then goes through an LSTM or GRU model (the classifier).
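
Steps 2 and 3 amount to a per-timestep matrix multiplication followed by a ReLU. A NumPy sketch under toy assumptions (the sizes are made up, and `W` and `b` are random stand-ins for the weights a `Dense` layer would learn):

```python
import numpy as np

rng = np.random.default_rng(0)
toy_maxlen, concat_dim, proj_dim = 4, 6, 3

# Concatenated meta-embeddings for one toy comment, one row per timestep.
x = rng.normal(size=(toy_maxlen, concat_dim))

# Projection parameters (random here, learned by the Dense layer in the model).
W = rng.normal(size=(concat_dim, proj_dim))
b = np.zeros(proj_dim)

# Linear layer followed by ReLU, applied to every timestep independently.
projected = np.maximum(x @ W + b, 0.0)
```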

In [None]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size * 2 , weights=[cancatnation_matrix])(inp)
x = Dense(300,activation = 'relu')(x)
x = SpatialDropout1D(0.3)(x)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.3, recurrent_dropout=0.3))(x)
x1 = GlobalMaxPool1D()(x)
x2 = GlobalMaxPool1D()(x)
x = Concatenate(axis=-1)([x1, x2])
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [25]:
model.fit(X_t, y, batch_size=64, epochs=2, validation_split=0.1);

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2


# Using Attention + Concatenation Meta-Embedding

This model gave me the best result.

In [None]:
class AttentionWeightedAverage(Layer):
    """
    Computes a weighted average of the different channels across timesteps.
    Uses 1 parameter per channel to compute the attention value for a single timestep.
    """

    def __init__(self, return_attention=False, **kwargs):
        self.init = initializers.get('uniform')
        self.supports_masking = True
        self.return_attention = return_attention
        super(AttentionWeightedAverage, self).__init__(** kwargs)
        

    def build(self, input_shape):
        self.input_spec = [InputSpec(ndim=3)]
        assert len(input_shape) == 3

        self.W = self.add_weight(shape=(input_shape[2], 1),
                                 name='{}_W'.format(self.name),
                                 initializer=self.init)
        
        # add_weight already registers self.W as a trainable weight
        
        super(AttentionWeightedAverage, self).build(input_shape)

    def call(self, x, mask=None):
        # computes a probability distribution over the timesteps
        # uses 'max trick' for numerical stability
        # reshape is done to avoid issue with Tensorflow
        # and 1-dimensional weights
        logits = K.dot(x, self.W)
        x_shape = K.shape(x)
        logits = K.reshape(logits, (x_shape[0], x_shape[1]))
        
        ai = K.exp(logits - K.max(logits, axis=-1, keepdims=True))

        # masked timesteps have zero weight
        if mask is not None:
            mask = K.cast(mask, K.floatx())
            ai = ai * mask
            
        att_weights = ai / (K.sum(ai, axis=1, keepdims=True) + K.epsilon())
        
        weighted_input = x * K.expand_dims(att_weights)
        
        result = K.sum(weighted_input, axis=1)
        
        if self.return_attention:
            return [result, att_weights]
        return result
    
    
    

    def get_output_shape_for(self, input_shape):
        return self.compute_output_shape(input_shape)

    def compute_output_shape(self, input_shape):
        output_len = input_shape[2]
        if self.return_attention:
            return [(input_shape[0], output_len), (input_shape[0], input_shape[1])]
        return (input_shape[0], output_len)

    def compute_mask(self, input, input_mask=None):
        if isinstance(input_mask, list):
            return [None] * len(input_mask)
        else:
            return None
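
The arithmetic inside `call` can be followed on small arrays. The NumPy sketch below mirrors the same steps (logits, max trick, optional mask, normalization, weighted sum); the function name, the random inputs and the toy mask are assumptions for illustration, and `1e-7` stands in for `K.epsilon()`:

```python
import numpy as np

def attention_weighted_average(x, w, mask=None):
    """x: (batch, timesteps, channels), w: (channels, 1), mask: (batch, timesteps) or None."""
    logits = (x @ w)[..., 0]                                   # (batch, timesteps)
    ai = np.exp(logits - logits.max(axis=-1, keepdims=True))   # max trick
    if mask is not None:
        ai = ai * mask                                         # masked steps get zero weight
    att = ai / (ai.sum(axis=1, keepdims=True) + 1e-7)
    return (x * att[..., None]).sum(axis=1), att

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 3))
w = rng.normal(size=(3, 1))
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype="float64")

pooled, att = attention_weighted_average(x, w, mask)
```

`pooled` has one vector per sample, and the attention weights of the masked timesteps in the first sample are exactly zero.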


In [None]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size * 2, weights=[cancatnation_matrix])(inp)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
# pool the LSTM outputs over time with learned attention weights instead of max pooling
x = AttentionWeightedAverage()(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
model.fit(X_t, y, batch_size=64, epochs=2, validation_split=0.1);

In [None]:
y_test = model.predict([X_te], batch_size=1024, verbose=1)
sample_submission = pd.read_csv(f'{path}{comp}sample_submission.csv.zip')
sample_submission[list_classes] = y_test
sample_submission.to_csv('only_attention.csv', index=False)

In [None]:
sample_submission