This notebook is intended as a starter script for using pre-trained word embeddings with Keras.

**Word embedding:**

[Word embedding][1] is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers. These vectors are also called word vectors.
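
To make the idea of "words as vectors" concrete, here is a minimal sketch. The three 4-dimensional vectors are made up for illustration (they are not taken from GloVe or word2vec), and `cosine_similarity` is a helper defined only for this example. The point is simply that related words are expected to have vectors pointing in similar directions.

```python
import numpy as np

# Hypothetical 4-dimensional vectors, invented for illustration only;
# real GloVe vectors have 50 to 300 dimensions and are learned from text.
toy_vectors = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.7, 0.7, 0.1, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9, 0.4]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_fruit = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(sim_royal > sim_fruit)
```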

Two commonly used word embeddings are:

1. [Google word2vec][2]
2. [Stanford GloVe][3]

In this notebook, we will use the GloVe word vectors, which can be downloaded from [this link][4].

Let us first import the necessary packages.


  [1]: https://en.wikipedia.org/wiki/Word_embedding
  [2]: https://code.google.com/archive/p/word2vec/
  [3]: https://nlp.stanford.edu/projects/glove/
  [4]: http://nlp.stanford.edu/data/glove.6B.zip

In [1]:
import os
import csv
import codecs
import numpy as np
import pandas as pd
import keras
from sklearn.model_selection import train_test_split

np.random.seed(1337)

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.layers import Dense, Input, Flatten, merge, LSTM, Lambda, Dropout
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model
from keras.layers.wrappers import TimeDistributed, Bidirectional
from keras.layers.normalization import BatchNormalization
from keras import backend as K
import sys
from keras.layers.merge import concatenate

Using TensorFlow backend.


Let us specify the constants that are needed for the model.

 1. MAX_SEQUENCE_LENGTH : number of words from the question to be used
 2. MAX_NB_WORDS : maximum size of the vocabulary
 3. EMBEDDING_DIM : dimension of the word embeddings

In [2]:
BASE_DIR = 'data/'
GLOVE_DIR ='/Users/tom/Msc Data Science/Machine Learning/Assignments/Quora/Glove.6B'
TRAIN_DATA_FILE = BASE_DIR + 'train_data.csv'
TEST_DATA_FILE = BASE_DIR + 'test_data.csv'
MAX_SEQUENCE_LENGTH = 30
MAX_NB_WORDS = 200000
EMBEDDING_DIM = 300
VALIDATION_SPLIT = 0.01

In [3]:
###APPEND LABELS TO TRAINING DATA 

df_train = pd.read_csv('data/train_data.csv')
df_train.drop(['is_duplicate'], axis= 1, inplace = True)
df_labels = pd.read_csv('data/train_labels.csv')
df_train = df_train.merge(df_labels)



In [4]:
###Create datasets
#train, CV = train_test_split(df_train, train_size = 0.8, random_state = 49)
train, CV = train_test_split(df_train, train_size = 0.95, random_state = 49)

test = pd.read_csv('data/test_data.csv')

As the first step, let us read the word vectors text file into a dictionary where the word is the key and the 300-dimensional vector is its corresponding value.

Note: This cell will throw an error here because the word vector file is not available in the Kaggle environment.
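
Since the real file is not available here, the parsing logic can still be tried on a tiny in-memory sample. This is only a sketch: the two sample lines are made up (3 dimensions instead of 300), and `load_vectors` is a hypothetical helper name, but the layout (a word followed by space-separated coefficients) is the one the loading cell relies on.

```python
import io
import numpy as np

# Made-up sample in the GloVe text layout: a word followed by its
# space-separated coefficients (3 dimensions here instead of 300).
sample = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "cat -0.5 0.4 0.0\n"
)

def load_vectors(handle):
    """Return a dict mapping each word to a float32 vector."""
    index = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        index[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return index

toy_index = load_vectors(sample)
print(len(toy_index), toy_index['cat'].shape)
```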

In [75]:
print('Indexing word vectors.')
embeddings_index = {}
# Each line holds a word followed by its 300 space-separated coefficients.
with codecs.open(os.path.join(GLOVE_DIR, 'glove.6B.300d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.rstrip().split(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

Indexing word vectors.



Now read the train, validation and test questions into lists of questions.

In [86]:
print('Processing text dataset')
texts_1 = pd.Series(train['question1']) 
texts_2 = pd.Series(train['question2']) 
labels = pd.Series(train['is_duplicate'])   # list of label ids
train_ids = pd.Series(train['id'])
print('Found %s texts.' % len(texts_1))

Processing text dataset
Found 307005 texts.


In [78]:
CV_texts_1 = pd.Series(CV['question1']) 
CV_texts_2 = pd.Series(CV['question2']) 
CV_labels = pd.Series(CV['is_duplicate'])   # list of label ids
print('Found %s texts.' % len(CV_texts_1))

Found 16159 texts.


In [79]:
test_texts_1 = pd.Series(test['question1']) 
test_texts_2 = pd.Series(test['question2']) 
test_ids = pd.Series(test['test_id']) 

print('Found %s texts.' % len(test_texts_1))

Found 81126 texts.


In [9]:
all_texts = texts_1.astype(str).tolist() + texts_2.astype(str).tolist() + CV_texts_1.astype(str).tolist() + CV_texts_2.astype(str).tolist()+ test_texts_1.astype(str).tolist() + test_texts_2.astype(str).tolist()

texts_1 = texts_1.astype(str).tolist()
texts_2 = texts_2.astype(str).tolist()

CV_texts_1 = CV_texts_1.astype(str).tolist()
CV_texts_2 = CV_texts_2.astype(str).tolist()

test_texts_1 = test_texts_1.astype(str).tolist()
test_texts_2 = test_texts_2.astype(str).tolist()

We use the Keras tokenizer to tokenize the text and then pad each question to 30 words.
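
To see what padding to a fixed length does, here is a small NumPy sketch. `pad_to_length` is a hypothetical stand-in written for this example, not the Keras function; it mirrors what I understand to be the defaults of `pad_sequences` (`padding='pre'`, `truncating='pre'`): short sequences get zeros on the left, long ones lose their first words.

```python
import numpy as np

def pad_to_length(sequences, maxlen):
    """Left-pad with 0 and keep only the last `maxlen` ids of each sequence."""
    padded = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]                 # truncate from the front
        if kept:
            padded[row, -len(kept):] = kept  # zeros stay on the left
    return padded

# Hypothetical word ids for two short "questions".
toy_sequences = [[4, 7, 2], [9, 1, 5, 3, 8, 6]]
toy_padded = pad_to_length(toy_sequences, maxlen=5)
print(toy_padded)
```

With `MAX_SEQUENCE_LENGTH = 30`, a question longer than 30 tokens would keep only its last 30 under these defaults.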

In [2]:
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(all_texts)

sequences_1 = tokenizer.texts_to_sequences(texts_1)
sequences_2 = tokenizer.texts_to_sequences(texts_2)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

CV_sequences_1 = tokenizer.texts_to_sequences(CV_texts_1)
CV_sequences_2 = tokenizer.texts_to_sequences(CV_texts_2)

test_sequences_1 = tokenizer.texts_to_sequences(test_texts_1)
test_sequences_2 = tokenizer.texts_to_sequences(test_texts_2)

data_1 = pad_sequences(sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
data_2 = pad_sequences(sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
labels = np.array(labels)
print('Shape of data tensor:', data_1.shape)
print('Shape of label tensor:', labels.shape)

CV_data_1 = pad_sequences(CV_sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
CV_data_2 = pad_sequences(CV_sequences_2, maxlen=MAX_SEQUENCE_LENGTH)

test_data_1 = pad_sequences(test_sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
test_data_2 = pad_sequences(test_sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
#test_labels = np.array(test_labels)

del CV_sequences_1
del CV_sequences_2
del test_sequences_1
del test_sequences_2
del sequences_1
del sequences_2

import gc
gc.collect()

test_data_1


Now let us create the embedding matrix where each row corresponds to a word.

In [11]:
print('Preparing embedding matrix.')
# prepare embedding matrix
# Tokenizer indices start at 1 (0 is reserved for padding), so the matrix
# needs len(word_index) + 1 rows for the largest index to be valid.
nb_words = min(MAX_NB_WORDS, len(word_index) + 1)

embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= nb_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    # words not found in the embedding index keep their all-zero row
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

Preparing embedding matrix.
Null word embeddings: 35541
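
The row layout of the embedding matrix is easier to see on a toy vocabulary. In this sketch the word index and the 2-dimensional vectors are invented for illustration. It assumes, as the Keras tokenizer does, that word indices start at 1 and index 0 is reserved for padding, so the matrix gets one more row than there are words.

```python
import numpy as np

# Hypothetical tokenizer output: indices start at 1, index 0 is left for padding.
toy_word_index = {"how": 1, "learn": 2, "keras": 3, "qzxv": 4}
# Hypothetical pre-trained vectors (2 dimensions); "qzxv" has none.
toy_embeddings = {
    "how": np.array([0.1, 0.2]),
    "learn": np.array([0.3, 0.4]),
    "keras": np.array([0.5, 0.6]),
}

num_rows = len(toy_word_index) + 1          # one extra row for index 0
toy_matrix = np.zeros((num_rows, 2))
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = vector              # row i holds the vector of word i

# Rows that are still all zero: the padding row and the unknown word.
null_rows = int(np.sum(~toy_matrix.any(axis=1)))
print(toy_matrix.shape, null_rows)
```

The "null" count therefore includes the padding row as well as every vocabulary word that has no pre-trained vector.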


In [108]:
# Augment the data so each pair appears in both orders: (Q1, Q2) and (Q2, Q1)
data_1_aug = np.concatenate([data_1,data_2], axis = 0)
data_2_aug = np.concatenate([data_2,data_1], axis = 0)
labels_aug = np.concatenate([labels,labels], axis = 0)
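
The augmentation relies on the duplicate relation being symmetric: if Q1 duplicates Q2, then Q2 duplicates Q1. A minimal sketch on made-up arrays (the names `q1`, `q2`, `y` are illustrative only) shows that the second half of the augmented data is the first half with the two sides swapped.

```python
import numpy as np

# Hypothetical padded id sequences for three question pairs.
q1 = np.array([[0, 1, 2], [0, 3, 4], [5, 6, 7]])
q2 = np.array([[0, 2, 1], [0, 0, 9], [7, 6, 5]])
y = np.array([1, 0, 1])

# Stack the original order on top of the swapped order.
q1_aug = np.concatenate([q1, q2], axis=0)
q2_aug = np.concatenate([q2, q1], axis=0)
y_aug = np.concatenate([y, y], axis=0)

n = len(y)
print(q1_aug.shape, y_aug.shape)
```

One caveat, as far as I know: `validation_split` in Keras takes the last rows of the arrays, so if the augmented arrays are passed to `fit`, the validation pairs are swapped copies of pairs that are also in the training portion, and the validation score will likely look optimistic.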

Now it's time to build the model. Let us specify the model architecture. The first layer is the embedding layer.

In the embedding layer, 'trainable' is set to False so that the word embeddings are not updated during backpropagation.

The neural net architecture is as follows:

1. The word embeddings of each question are passed through the same (shared) LSTM layer.
2. The two LSTM outputs are concatenated, followed by dropout and batch normalization.
3. This is followed by two dense layers, each with dropout and batch normalization.
4. The final layer is a single sigmoid unit.

In [110]:
embedding_layer = Embedding(nb_words,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
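
Conceptually, a frozen embedding layer is just a row lookup into the embedding matrix. The sketch below uses plain NumPy indexing on a made-up 5-word, 2-dimensional matrix to show the shape change from `(batch, length)` to `(batch, length, dim)`; it illustrates the idea and is not how Keras implements the layer internally.

```python
import numpy as np

# Hypothetical frozen embedding matrix: 5 word ids, 2 dimensions each.
toy_matrix = np.array([
    [0.0, 0.0],   # id 0: padding
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
])

# A batch of two padded questions, three word ids each.
toy_batch = np.array([[0, 1, 3], [2, 2, 4]])

# Looking up a batch is plain row indexing: (batch, length) -> (batch, length, dim).
embedded = toy_matrix[toy_batch]
print(embedded.shape)
```

Every id in the batch must be smaller than the number of rows in the matrix, otherwise the lookup is out of range.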

In [3]:
num_lstm = np.random.randint(175, 275)
num_dense = np.random.randint(100, 150)
rate_drop_lstm = 0.15 + np.random.rand() * 0.25
rate_drop_dense = 0.15 + np.random.rand() * 0.25
act = 'relu'
re_weight = True

lstm_layer = LSTM(num_lstm, dropout=rate_drop_lstm, recurrent_dropout=rate_drop_lstm)
#lstm_layer = Bidirectional(LSTM(num_lstm, dropout=rate_drop_lstm, recurrent_dropout=rate_drop_lstm))

sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences_1 = embedding_layer(sequence_1_input)
x1 = lstm_layer(embedded_sequences_1)

sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences_2 = embedding_layer(sequence_2_input)
y1 = lstm_layer(embedded_sequences_2)


merged = concatenate([x1, y1])
merged = Dropout(rate_drop_dense)(merged)
merged = BatchNormalization()(merged)

merged = Dense(num_dense, activation=act)(merged)
merged = Dropout(rate_drop_dense)(merged)
merged = BatchNormalization()(merged)

merged = Dense(num_dense, activation=act)(merged)
merged = Dropout(rate_drop_dense)(merged)
merged = BatchNormalization()(merged)

preds = Dense(1, activation='sigmoid')(merged)

if re_weight:
    class_weight = {0: 1.309028344, 1: 0.472001959}
else:
    class_weight = None

model = Model(inputs=[sequence_1_input, sequence_2_input],
              outputs=preds)
model.compile(loss='binary_crossentropy',
        optimizer='nadam',
        metrics=['acc'])
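
The `class_weight` dictionary is meant to be passed to `model.fit`, where it scales each sample's contribution to the loss by the weight of its class. How the two values were derived is not stated here. The sketch below only illustrates the general idea on made-up labels and probabilities; `weighted_binary_crossentropy` is a hypothetical helper, not a Keras function, and it does not claim to reproduce Keras's internal computation exactly.

```python
import numpy as np

def weighted_binary_crossentropy(y_true, y_prob, class_weight):
    """Mean binary cross-entropy with each sample scaled by its class weight."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), 1e-7, 1 - 1e-7)
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    return float(np.mean(weights * per_sample))

# Made-up labels and predicted probabilities.
toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.6]

weights_from_cell = {0: 1.309028344, 1: 0.472001959}
plain = weighted_binary_crossentropy(toy_labels, toy_probs, {0: 1.0, 1: 1.0})
reweighted = weighted_binary_crossentropy(toy_labels, toy_probs, weights_from_cell)
print(plain, reweighted)
```

With these weights, errors on non-duplicates (class 0) cost more than errors on duplicates (class 1).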


**Model training and predictions :**

Uncomment the cell below and run it locally, as it exceeds the time limits here.

In [131]:
#model.fit([data_1,data_2], labels, validation_split=VALIDATION_SPLIT, epochs=1, batch_size=1024, shuffle=True)


Train on 607869 samples, validate on 6141 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x19738fa50>

In [132]:
from helpers import save_model
save_model(model, BASE_DIR)

In [133]:
preds = model.predict([CV_data_1, CV_data_2])
print(preds.shape)
print(preds)

(16159, 1)
[[ 0.00449069]
 [ 0.01784229]
 [ 0.30560076]
 ..., 
 [ 0.69071585]
 [ 0.0028491 ]
 [ 0.71348763]]


In [134]:
out_df = pd.DataFrame({"is_duplicate":CV_labels, "pred_is_duplicate":np.round(preds.ravel()).astype(int)})

out_df['correct_pred'] = out_df['is_duplicate'] == out_df['pred_is_duplicate']

np.sum(out_df['correct_pred']).astype(float)/len(out_df['correct_pred'])

#out_df.to_csv("test_predictions.csv", index=False)

#pd.DataFrame({"is_duplicate":preds.ravel()}).to_csv("keras_LSTM_predictions.csv", index=False)

0.8274026858097655
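
The accuracy computation above boils down to thresholding the sigmoid outputs at 0.5 and comparing them with the true labels. A minimal sketch on made-up values (the array names are illustrative only):

```python
import numpy as np

# Made-up sigmoid outputs with shape (n, 1), like the result of model.predict.
toy_preds = np.array([[0.004], [0.31], [0.69], [0.71]])
toy_truth = np.array([0, 1, 1, 1])

# Threshold at 0.5, flatten to 1-D, and compare with the true labels.
toy_hard = (toy_preds.ravel() > 0.5).astype(int)
toy_accuracy = float(np.mean(toy_hard == toy_truth))
print(toy_accuracy)
```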

In [135]:
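# Prediction fails when row 81114 is included (cause not yet identified),
# so only the first 81113 test rows are scored.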

final_preds = model.predict([test_data_1[:81113], test_data_2[:81113]])
print(final_preds.shape)
print(final_preds)

(81113, 1)
[[ 0.99865049]
 [ 0.63790494]
 [ 0.07568309]
 ..., 
 [ 0.04909017]
 [ 0.09136886]
 [ 0.13600092]]


In [136]:
final_df = pd.DataFrame({"test_id":test_ids[:81113], "nn_out":final_preds.ravel()})
final_df.to_csv("test_preds_for_logreg.csv", index=False)


In [137]:
train_preds = model.predict([data_1, data_2])

In [138]:
xtra_df = pd.DataFrame({"test_id":train_ids, "nn_out":train_preds.ravel()})
xtra_df.to_csv("preds_for_logreg.csv", index=False)

This scores about 0.55 when run locally with the pre-trained word embeddings. I got better scores using LSTM and TimeDistributed layers.

Try different architectures, and happy learning.

I hope this helps you get started with Keras and word embeddings in this competition.

**References :**

 1. [On word embeddings - part 1][1] by Sebastian Ruder
 2. [Blog post][2] by fchollet
 3. [Code][3] by Abhishek Thakur
 4. [Code][4] by Bradley Pallen


  [1]: http://sebastianruder.com/word-embeddings-1/
  [2]: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
  [3]: https://github.com/abhishekkrthakur/is_that_a_duplicate_quora_question
  [4]: https://github.com/bradleypallen/keras-quora-question-pairs