## BabI Exploration and Simple Baseline

This notebook provides a brief overview of the BabI dataset. Also, we implement a simple two layer deep neural network for one of the 20 tasks in Babi dataset.

### Before running this notebook, make sure to download the data+glove embeddings(run download_data.py).

In [1]:
import tensorflow as tf
import numpy as np
import os
import keras
import re
from functools import reduce
import sys
import urllib.request
import tarfile
import zipfile
import csv

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


The Facebook bAbI-10k dataset consists of 20 tasks and has been used as a benchmark in many question answering papers. Each task has a different type of question such as single supporting fact
questions, two supporting fact questions, yes no questions, counting questions, etc.
Here we are using the English version of the dataset with 10,000 training examples and 1000 test examples.
All examples consist of an input-question-answer tuple. The input is a variable length passage of
text. The type of question and answer depends on the task. For example, some tasks have yes/no
answers while others are focused on positional reasoning or counting. For each question-answer
pair, the dataset also gives the line numbers of the input passage that is relevant to the answer. Every
answer in the bAbI dataset is one word. Examples from the dataset can be seen below.

In [2]:
DATASET_PATH = 'data/tasks_1-20_v1-2/en-10k/'
GLOVE_PATH = 'data/glove.6B.50d.txt'
task_number = 6

In [3]:
def get_current_task_file(directory, task_number, type='train'):
    '''Returns task file name
    :param String directory
    :param int task
    :return String current_task_file 
    '''
    all_paths = os.listdir(directory)
    all_files = [os.path.join(directory, folder) for folder in all_paths]
    t = 'qa{}_'.format(task_number)
    current_task_file = [f for f in all_files if t in f and type in f][0]
    return current_task_file

In [4]:
current_task_file = get_current_task_file(DATASET_PATH, task_number)
print (current_task_file)

data/tasks_1-20_v1-2/en-10k/qa6_yes-no-questions_train.txt


### Example of a yes no question answering task

In [5]:
examples = [line.strip() for line in open(current_task_file).read().split('\n')[:-1]]
examples[:15]

['1 Mary moved to the bathroom.',
 '2 Sandra journeyed to the bedroom.',
 '3 Is Sandra in the hallway? \tno\t2',
 '4 Mary went back to the bedroom.',
 '5 Daniel went back to the hallway.',
 '6 Is Daniel in the bathroom? \tno\t5',
 '7 Sandra went to the kitchen.',
 '8 Daniel went back to the bathroom.',
 '9 Is Daniel in the office? \tno\t8',
 '10 Daniel picked up the football there.',
 '11 Daniel went to the bedroom.',
 '12 Is Daniel in the bedroom? \tyes\t11',
 '13 John travelled to the office.',
 '14 Sandra went to the garden.',
 '15 Is Daniel in the bedroom? \tyes\t11']

### Load the full dataset as a single python list

In [6]:
def concat(inputs):
    return reduce(lambda x,y:x+y, inputs)

In [7]:
def load_dataset(current_task_file):
    '''Returns a list of tuples consisting of
       input, question and answer pairs
    param: String current_task_file
    '''

    data = open(current_task_file).read()
    data = data.split('\n')[:-1]
    data = [line.strip() for line in data]
    
    new = False
    dataset = list()
    inputs = list()
    
    for line in data:
        idx, line = line.strip().split(' ', 1)
        if int(idx) == 1:
            new = True
            inputs = []
        if '\t' in line:
            question, answer, _ = line.split('\t')
            new_inputs = [i for i in inputs if i]
            question = [s.strip() for s in re.split('(\W+)?', question) if s.strip()]
            dataset.append((new_inputs, question, answer))
            inputs.append('')
        else:
            inputs.append([s.strip() for s in re.split('(\W+)?', line) if s.strip()])
            
    return [(concat(i), q, a) for i, q, a in dataset]

### Splitting the dataset into train and validation

In [8]:
def train_test_split(data, split_ratio, shuffle=True):
    '''Returns the training and validation set
    :param data
    :param split_ration
    :param shuffle
    :Returns train_data, test_data
    '''
    if shuffle == True:
        np.random.shuffle(data)

    idx = int(len(data) * split_ratio)	
    train_data, test_data = data[:idx], data[idx:]	

    return train_data, test_data

### Loading the Glove embedding vectors

In [9]:
def load_embedding_file(path_to_file):
    '''Loads the glove embedding file
    :param path_to_file
    :Returns file
    '''
    file = csv.reader(open(path_to_file), delimiter=' ', quoting=csv.QUOTE_NONE)
    file = {line[0]: np.array(list(map(float, line[1:]))) for line in file}

    return file

### Converting the inputs and questions to vectors using pre-trained glove embeddings.

In [10]:
def convert_to_vector(dataset, embedding_file):
    '''Returns the training and validation set
    :param dataset
    :param embedding_file
    :Returns inputs, questions, answers
    '''
    inputs, questions, answers = list(), list(), list()
    for i, q, a in dataset:
        vec_i = list()
        for word in i:
            word = word.lower()
            if word in embedding_file:
                vi = embedding_file[word]
            else:
                vi = np.random.choice(np.random.uniform(0, 1, 50))
                vi /= np.sum(np.random.uniform(0, 1, 50))

            vec_i.append(vi)
        inputs.append(vec_i)

        vec_q = list()
        for word in q:
            word = word.lower()
            if word in embedding_file:
                vq = embedding_file[word]
            else:
                vq = np.random.choice(np.random.uniform(0, 1, 50))
                vq /= np.sum(np.random.uniform(0, 1, 50))

            vec_q.append(vq)
        questions.append(vec_q)

        if a == 'yes':
            ans = np.array([1])
            answers.append(ans)
        else:
            ans = np.array([0])
            answers.append(ans)

    return inputs, questions, answers

Little exploration on preprocessed dataset

In [11]:
dataset = load_dataset(current_task_file)
embedding_file = load_embedding_file(GLOVE_PATH)
print (dataset[0])
print (embedding_file['nature'])

  return _compile(pattern, flags).split(string, maxsplit)


(['Mary', 'moved', 'to', 'the', 'bathroom', '.', 'Sandra', 'journeyed', 'to', 'the', 'bedroom', '.'], ['Is', 'Sandra', 'in', 'the', 'hallway', '?'], 'no')
[ 0.69917   0.64303  -1.439     0.24653   0.5186    0.074237  0.14794
 -0.53328   0.42334   0.52874  -0.16226   0.39229   0.2929   -0.049215
 -0.22758   0.0278    0.81454   0.085199 -0.084373 -0.017206 -0.50549
  0.56167   0.043883 -0.11787   0.72582  -0.97883  -0.92599   0.31475
  0.65451   0.43346   2.9418   -0.86313  -0.097198 -1.5291   -0.39909
  0.22361  -0.55267  -0.24998  -0.60628  -0.14298  -0.24146  -0.57472
  0.19233   0.94781   0.075044 -0.085379  0.086206  0.24632   0.52126
  0.11655 ]


In [12]:
EMBED_DIM = 50

def concatenate(inputs, questions):
    inp_que = []

    examples = len(inputs)
    for i in range(examples):
        inp_vec = np.zeros((1, EMBED_DIM))
        for c in inputs[i]:
            inp_vec += c

        que_vec = np.zeros((1, EMBED_DIM))
        for c in questions[i]:
            que_vec += c

        v = np.concatenate((inp_vec, que_vec), 1)
        inp_que.append(v[0])

    return inp_que

In [13]:
def batches(X, Y, batch_size):
    counts = len(X)

    for n in range(0, counts, batch_size):
        x = X[n:min(n+batch_size, counts)]
        y = Y[n:min(n+batch_size, counts)]
        yield x, y

In [14]:
PATH = 'data/tasks_1-20_v1-2/en-10k'
task_number = 6
current_task_file = get_current_task_file(PATH, task_number)    
dataset = load_dataset(current_task_file)
train_data, test_data = train_test_split(dataset, split_ratio=0.8)
embedding_file = load_embedding_file('data/glove.6B.50d.txt')

train_inputs, train_questions, train_answers = convert_to_vector(train_data, embedding_file)
test_inputs, test_questions, test_answers = convert_to_vector(test_data, embedding_file)

X_train = np.array(concatenate(train_inputs, train_questions))
X_test = np.array(concatenate(test_inputs, test_questions))
y_train = np.array(train_answers)
y_test = np.array(test_answers)

print (X_train.shape, X_test.shape, y_train.shape, y_test.shape)

  return _compile(pattern, flags).split(string, maxsplit)


(8000, 100) (2000, 100) (8000, 1) (2000, 1)


## Model

The model for this notebook(baseline) is a simple two layer Neural Network. Model -> All of the word vectors for each word in the input text are summed up. Similarily, sum all of the word vectors for each word in the question text, Then concatenate the both of them together. Using this as input to the 2 layer network, we train the model using gradient descent. We use a batch size of 64, learning rate decay, Adam optimization. The hidden layer size is 200 and we use Relu Activation function

### Tensorflow

In [15]:
import math
import time 
%timeit

input_placeholder = tf.placeholder(tf.float32, (None, 100))
target_placeholder = tf.placeholder(tf.float32, (None, 1))
lr = tf.placeholder(tf.float32)
pkeep = tf.placeholder(tf.float32)

W1 = tf.Variable(tf.truncated_normal(shape=(100, 200), dtype=tf.float32))
b1 = tf.Variable(tf.zeros(200))

W2 = tf.Variable(tf.truncated_normal(shape=(200, 200), dtype=tf.float32))
b2 = tf.Variable(tf.zeros(200))

W3 = tf.Variable(tf.truncated_normal(shape=(200, 1), dtype=tf.float32))
b3 = tf.Variable(tf.zeros(1))

x = tf.nn.relu(tf.add(tf.matmul(input_placeholder, W1), b1))
x = tf.nn.relu(tf.add(tf.matmul(x, W2), b2))
y = tf.add(tf.matmul(x, W3), b3)
y_ = tf.nn.sigmoid(y)

loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=target_placeholder, logits=y))
opt = tf.train.AdamOptimizer(lr).minimize(loss)

correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(target_placeholder, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) * 100

sess = tf.Session()
init = tf.global_variables_initializer()

epochs = 500

with sess:
    sess.run(init)
    for epoch in range(epochs+1):
        start = time.time()
        for x, y in batches(X_train, y_train, 64):
            max_learning_rate = 0.003
            min_learning_rate = 0.0001
            decay_speed = 2000.0
            learn = min_learning_rate + (max_learning_rate - min_learning_rate) * math.exp(-epoch/decay_speed)
            
            _, l, acc  = sess.run([opt, loss, accuracy], feed_dict={input_placeholder:x, target_placeholder:y, lr:learn, pkeep:0.8})
        
        end = time.time()    
        
        if epoch % 100 == 0:
            print ('Epoch: {}, Loss: {}, Accuracy: {}, Time Taken: {}'.format(epoch, l, acc, (end-start)))                   

Epoch: 0, Loss: 159.65216064453125, Accuracy: 100.0, Time Taken: 0.36307215690612793
Epoch: 100, Loss: 45.659088134765625, Accuracy: 100.0, Time Taken: 0.2920949459075928
Epoch: 200, Loss: 3.1209959983825684, Accuracy: 100.0, Time Taken: 0.287182092666626
Epoch: 300, Loss: 1.8143832683563232, Accuracy: 100.0, Time Taken: 0.28829407691955566
Epoch: 400, Loss: 0.5801846981048584, Accuracy: 100.0, Time Taken: 0.2913849353790283
Epoch: 500, Loss: 0.3909987509250641, Accuracy: 100.0, Time Taken: 0.2959451675415039


### Keras

In [16]:
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD
import csv

model = Sequential()
model.add(Dense(200, input_dim = 2*EMBED_DIM, activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100, batch_size=64, validation_data=(X_test, y_test),verbose=1)

Train on 8000 samples, validate on 2000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100


Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


<keras.callbacks.History at 0x10a520518>

In [17]:
model_directory = 'saved_models/'
if not os.path.exists(model_directory):
    os.mkdir(model_directory)
model.save_weights(model_directory +'model.h5')

### Radnom Evaluation using keras trained model

In [18]:
load_model = model.load_weights(model_directory + 'model.h5')
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 200)               20200     
_________________________________________________________________
dense_2 (Dense)              (None, 200)               40200     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 201       
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0         
Total params: 60,601
Trainable params: 60,601
Non-trainable params: 0
_________________________________________________________________


In [66]:
i = 'Mary is in the building . John goes to the market . Mary leaves the building . Carol went to the office . Jim travelled to the building .'
q = 'Is John in the market ?'

In [67]:
inp = [x.lower() for x in i.split()]
que = [x.lower() for x in q.split()]

In [68]:
print (inp)
print (que)

['mary', 'is', 'in', 'the', 'building', '.', 'john', 'goes', 'to', 'the', 'market', '.', 'mary', 'leaves', 'the', 'building', '.', 'carol', 'went', 'to', 'the', 'office', '.', 'jim', 'travelled', 'to', 'the', 'building', '.']
['is', 'john', 'in', 'the', 'market', '?']


### Finding closest word to a given word for unknown words encountered

In [60]:
def cosine(word1, word2):
    dot = np.dot(word1, word2)
    norm_u = np.sqrt(np.sum(np.power(word1, 2)))
    norm_v = np.sqrt(np.sum(np.power(word2, 2)))
    cosine_similarity = dot/ (norm_u*norm_v)
    
    return cosine_similarity

def find_closest(word):
    dist = -100.0
    for w, vec in embedding_file.items():
        if word == w:
            continue
            
        cosine_sim = cosine(embedding_file[word], vec)
        if cosine_sim > dist:
            dist = cosine_sim
            closest = w
    
    return closest

In [52]:
print (find_closest('Carol'.lower()))

susan


In [69]:
inp_vec = np.array([embedding_file[w] if w in embedding_file else embedding_file[find_closest[w]] for w in inp])
que_vec = np.array([embedding_file[w] if w in embedding_file else embedding_file[find_closest[w]] for w in que])

In [70]:
print (inp_vec.shape, que_vec.shape)

(29, 50) (6, 50)


In [71]:
i_v = np.zeros((1, 50))
q_v = np.zeros((1, 50))

for c in inp_vec:
    i_v += c

for c in que_vec:
    q_v += c  
    
pred_x = np.concatenate((i_v, q_v), 1)    

In [72]:
pred_x.shape

(1, 100)

In [73]:
p = np.squeeze(model.predict(pred_x))
print (p)
'Yes' if p > 0.5 else 'NO'

0.98502076


'Yes'