Details about the dataset are available here: [The bAbI Project](https://research.fb.com/downloads/babi/)



For this project, we will be using the pickle library.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Importing Libraries

In [2]:
import pickle
import numpy as np

# Unpickling the Data

In [3]:
with open("/content/drive/MyDrive/1stop.ai/Basic ChatBot/train_qa.txt","rb") as fp:
    train_data = pickle.load(fp) # Obtaining the training data

In [4]:
train_data

[(['Mary',
   'moved',
   'to',
   'the',
   'bathroom',
   '.',
   'Sandra',
   'journeyed',
   'to',
   'the',
   'bedroom',
   '.'],
  ['Is', 'Sandra', 'in', 'the', 'hallway', '?'],
  'no'),
 (['Mary',
   'moved',
   'to',
   'the',
   'bathroom',
   '.',
   'Sandra',
   'journeyed',
   'to',
   'the',
   'bedroom',
   '.',
   'Mary',
   'went',
   'back',
   'to',
   'the',
   'bedroom',
   '.',
   'Daniel',
   'went',
   'back',
   'to',
   'the',
   'hallway',
   '.'],
  ['Is', 'Daniel', 'in', 'the', 'bathroom', '?'],
  'no'),
 (['Mary',
   'moved',
   'to',
   'the',
   'bathroom',
   '.',
   'Sandra',
   'journeyed',
   'to',
   'the',
   'bedroom',
   '.',
   'Mary',
   'went',
   'back',
   'to',
   'the',
   'bedroom',
   '.',
   'Daniel',
   'went',
   'back',
   'to',
   'the',
   'hallway',
   '.',
   'Sandra',
   'went',
   'to',
   'the',
   'kitchen',
   '.',
   'Daniel',
   'went',
   'back',
   'to',
   'the',
   'bathroom',
   '.'],
  ['Is', 'Daniel', 'in', 'the', '

In the data, each record begins with a story made up of sentences (two in the first record, more in later ones), and the full stop is treated as a separate token. We need to make sure this tokenization is preserved. After the story come the question and the answer to that question.
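
Because the full stop is its own token, a story can be split back into sentences by cutting after each `'.'`. The following is a minimal sketch; `split_sentences` is a hypothetical helper (not part of the notebook), and the story is the first training record copied from the output above.

```python
def split_sentences(tokens):
    """Group a flat token list into sentences, ending each one at a '.' token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token == '.':
            sentences.append(current)
            current = []
    if current:  # keep any trailing tokens that lack a final '.'
        sentences.append(current)
    return sentences

# First training story, copied from the output above
story = ['Mary', 'moved', 'to', 'the', 'bathroom', '.',
         'Sandra', 'journeyed', 'to', 'the', 'bedroom', '.']

sentences = split_sentences(story)
print(len(sentences))          # 2
print(' '.join(sentences[1]))  # Sandra journeyed to the bedroom .
```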

In [5]:
with open("/content/drive/MyDrive/1stop.ai/Basic ChatBot/test_qa.txt","rb") as fp:
    test_data = pickle.load(fp)

In [6]:
test_data

[(['Mary',
   'got',
   'the',
   'milk',
   'there',
   '.',
   'John',
   'moved',
   'to',
   'the',
   'bedroom',
   '.'],
  ['Is', 'John', 'in', 'the', 'kitchen', '?'],
  'no'),
 (['Mary',
   'got',
   'the',
   'milk',
   'there',
   '.',
   'John',
   'moved',
   'to',
   'the',
   'bedroom',
   '.',
   'Mary',
   'discarded',
   'the',
   'milk',
   '.',
   'John',
   'went',
   'to',
   'the',
   'garden',
   '.'],
  ['Is', 'John', 'in', 'the', 'kitchen', '?'],
  'no'),
 (['Mary',
   'got',
   'the',
   'milk',
   'there',
   '.',
   'John',
   'moved',
   'to',
   'the',
   'bedroom',
   '.',
   'Mary',
   'discarded',
   'the',
   'milk',
   '.',
   'John',
   'went',
   'to',
   'the',
   'garden',
   '.',
   'Daniel',
   'moved',
   'to',
   'the',
   'bedroom',
   '.',
   'Daniel',
   'went',
   'to',
   'the',
   'garden',
   '.'],
  ['Is', 'John', 'in', 'the', 'garden', '?'],
  'yes'),
 (['Mary',
   'got',
   'the',
   'milk',
   'there',
   '.',
   'John',
   'moved',


# Exploring the Data

## Checking the Type of Data

In [7]:
type(train_data)

list

In [8]:
type(test_data)

list

## Checking the Length of Data

In [9]:
len(train_data)

10000

In [10]:
len(test_data)

1000

## Checking first record of Data
In both the training and the testing data, we can see that each record contains a story, a question and an answer.

In [11]:
train_data[0]

(['Mary',
  'moved',
  'to',
  'the',
  'bathroom',
  '.',
  'Sandra',
  'journeyed',
  'to',
  'the',
  'bedroom',
  '.'],
 ['Is', 'Sandra', 'in', 'the', 'hallway', '?'],
 'no')

In [12]:
test_data[0]

(['Mary',
  'got',
  'the',
  'milk',
  'there',
  '.',
  'John',
  'moved',
  'to',
  'the',
  'bedroom',
  '.'],
 ['Is', 'John', 'in', 'the', 'kitchen', '?'],
 'no')

We can see the story as:

In [13]:
' '.join(train_data[0][0])

'Mary moved to the bathroom . Sandra journeyed to the bedroom .'

In [14]:
' '.join(test_data[0][0])

'Mary got the milk there . John moved to the bedroom .'

The corresponding question is:

In [15]:
' '.join(train_data[0][1])

'Is Sandra in the hallway ?'

In [16]:
' '.join(test_data[0][1])

'Is John in the kitchen ?'

The answers are single words and need not be joined, so we can read them directly:

In [17]:
train_data[0][2]

'no'

In [18]:
test_data[0][2]

'no'
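
Since every answer is either `'yes'` or `'no'`, it can be useful to check how balanced the two classes are. The sketch below uses `collections.Counter` on a small toy list named `sample`, which is an assumption for illustration and not the real dataset; with the notebook's data, `train_data` could be passed in its place.

```python
from collections import Counter

# Toy records in the same (story, question, answer) layout as the dataset.
# These are illustrative only, not actual records.
sample = [
    (['Mary', 'got', 'the', 'milk', 'there', '.'],
     ['Is', 'Mary', 'in', 'the', 'kitchen', '?'], 'no'),
    (['John', 'moved', 'to', 'the', 'bedroom', '.'],
     ['Is', 'John', 'in', 'the', 'kitchen', '?'], 'no'),
    (['John', 'went', 'to', 'the', 'garden', '.'],
     ['Is', 'John', 'in', 'the', 'garden', '?'], 'yes'),
]

# Count how often each answer occurs
answer_counts = Counter(answer for _, _, answer in sample)
print(answer_counts['no'], answer_counts['yes'])  # 2 1
```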

# Setting Up Vocabulary
Here, we will set up the vocabulary: the set of all unique words in the data, each of which will later be assigned an integer index.

In [19]:
# Creating an Empty Set
vocab = set()

In [20]:
# Combining the training and testing data into a single list
all_data = train_data + test_data

In [21]:
all_data

[(['Mary',
   'moved',
   'to',
   'the',
   'bathroom',
   '.',
   'Sandra',
   'journeyed',
   'to',
   'the',
   'bedroom',
   '.'],
  ['Is', 'Sandra', 'in', 'the', 'hallway', '?'],
  'no'),
 (['Mary',
   'moved',
   'to',
   'the',
   'bathroom',
   '.',
   'Sandra',
   'journeyed',
   'to',
   'the',
   'bedroom',
   '.',
   'Mary',
   'went',
   'back',
   'to',
   'the',
   'bedroom',
   '.',
   'Daniel',
   'went',
   'back',
   'to',
   'the',
   'hallway',
   '.'],
  ['Is', 'Daniel', 'in', 'the', 'bathroom', '?'],
  'no'),
 (['Mary',
   'moved',
   'to',
   'the',
   'bathroom',
   '.',
   'Sandra',
   'journeyed',
   'to',
   'the',
   'bedroom',
   '.',
   'Mary',
   'went',
   'back',
   'to',
   'the',
   'bedroom',
   '.',
   'Daniel',
   'went',
   'back',
   'to',
   'the',
   'hallway',
   '.',
   'Sandra',
   'went',
   'to',
   'the',
   'kitchen',
   '.',
   'Daniel',
   'went',
   'back',
   'to',
   'the',
   'bathroom',
   '.'],
  ['Is', 'Daniel', 'in', 'the', '

In the above output, each tuple has three elements: the story as a list of tokens, the question as a list of tokens, and the answer as a single word.

In [22]:
type(all_data)

list

In [23]:
len(all_data)

11000

In [24]:
for story, question, answer in all_data:
    vocab = vocab.union(set(story))
    vocab = vocab.union(set(question))

In [25]:
vocab.add('yes')
vocab.add('no')
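
As a side note, `vocab.union(...)` builds a new set on every iteration, whereas `set.update` adds to the existing set in place. The sketch below shows the in-place variant on a single toy record (`sample` is an assumption for illustration); the resulting vocabulary should be the same as with `union`.

```python
# One record in the (story, question, answer) layout, copied from the first
# training example shown earlier
sample = [
    (['Mary', 'moved', 'to', 'the', 'bathroom', '.',
      'Sandra', 'journeyed', 'to', 'the', 'bedroom', '.'],
     ['Is', 'Sandra', 'in', 'the', 'hallway', '?'],
     'no'),
]

vocab = set()
for story, question, _ in sample:
    vocab.update(story)     # add story tokens in place
    vocab.update(question)  # add question tokens in place

# Make sure both possible answers are in the vocabulary
vocab.update(['yes', 'no'])

print(len(vocab))  # 15
```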

After this, we can see that the vocab contains all the words:

In [26]:
vocab

{'.',
 '?',
 'Daniel',
 'Is',
 'John',
 'Mary',
 'Sandra',
 'apple',
 'back',
 'bathroom',
 'bedroom',
 'discarded',
 'down',
 'dropped',
 'football',
 'garden',
 'got',
 'grabbed',
 'hallway',
 'in',
 'journeyed',
 'kitchen',
 'left',
 'milk',
 'moved',
 'no',
 'office',
 'picked',
 'put',
 'the',
 'there',
 'to',
 'took',
 'travelled',
 'up',
 'went',
 'yes'}

In [27]:
type(vocab)

set

In [28]:
len(vocab)

37

So, in total there are 37 unique words. We need one more slot because index 0 is reserved for the padding value used by Keras `pad_sequences`. So,

In [29]:
vocab_length = len(vocab) + 1

In [30]:
vocab_length

38
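
To illustrate why index 0 must stay free: by default, Keras `pad_sequences` fills short sequences with 0 at the front and trims long ones from the front. The pure-Python function below, `pad_left`, is a hypothetical stand-in that mimics that default behaviour for illustration; it is not the Keras implementation and assumes `maxlen >= 1`.

```python
def pad_left(sequences, maxlen, value=0):
    """Pad each sequence on the left with `value`, or keep its last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # trim from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# One sequence longer than maxlen and one shorter
result = pad_left([[30, 2, 5, 10, 7, 3], [18, 4]], maxlen=4)
print(result)  # [[5, 10, 7, 3], [0, 0, 18, 4]]
```

Because 0 marks padding, no real word may use index 0, which is why the vocabulary size is 37 + 1.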

Now, we will find out the maximum story length and the maximum question length as:

In [31]:
# For maximum story length
for data in all_data:
    print(data[0]) # With this, we will be able to see all the stories in the data
    print(len(data[0]))
max_story_length = max([len(data[0]) for data in all_data])

Streaming output truncated to the last 5000 lines.
48
['Sandra', 'travelled', 'to', 'the', 'bathroom', '.', 'Mary', 'travelled', 'to', 'the', 'garden', '.', 'John', 'went', 'to', 'the', 'bathroom', '.', 'Daniel', 'moved', 'to', 'the', 'bathroom', '.', 'Sandra', 'travelled', 'to', 'the', 'bedroom', '.', 'Daniel', 'went', 'to', 'the', 'kitchen', '.', 'John', 'journeyed', 'to', 'the', 'hallway', '.', 'Mary', 'journeyed', 'to', 'the', 'bathroom', '.', 'Daniel', 'went', 'back', 'to', 'the', 'bathroom', '.', 'Sandra', 'got', 'the', 'football', 'there', '.']
61
['Mary', 'travelled', 'to', 'the', 'bedroom', '.', 'John', 'went', 'back', 'to', 'the', 'bathroom', '.']
13
['Mary', 'travelled', 'to', 'the', 'bedroom', '.', 'John', 'went', 'back', 'to', 'the', 'bathroom', '.', 'Daniel', 'went', 'to', 'the', 'kitchen', '.', 'Sandra', 'went', 'back', 'to', 'the', 'kitchen', '.']
26
['Mary', 'travelled', 'to', 'the', 'bedroom', '.', 'John', 'went', 'back', 'to', 'the', 'bathroom', '.', 'D




In [32]:
max_story_length

156
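
The maximum lengths can also be computed without printing every record, which avoids flooding the notebook output. This is a minimal sketch on a toy list (`sample` is an assumption for illustration); with the notebook's data, `all_data` would take its place.

```python
# Toy records in the (story, question, answer) layout; illustrative only
sample = [
    (['Mary', 'moved', 'to', 'the', 'bathroom', '.',
      'Sandra', 'journeyed', 'to', 'the', 'bedroom', '.'],
     ['Is', 'Sandra', 'in', 'the', 'hallway', '?'], 'no'),
    (['John', 'went', 'to', 'the', 'garden', '.'],
     ['Is', 'John', 'in', 'the', 'garden', '?'], 'yes'),
]

# Generator expressions avoid building intermediate lists
max_story_length = max(len(story) for story, _, _ in sample)
max_question_length = max(len(question) for _, question, _ in sample)

print(max_story_length, max_question_length)  # 12 6
```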

In [33]:
# For Maximum Question Length
for data in all_data:
    print(data[1])
    print(len(data[1]))
max_question_length = max([len(data[1]) for data in all_data])

Streaming output truncated to the last 5000 lines.
['Is', 'John', 'in', 'the', 'garden', '?']
6
['Is', 'Daniel', 'in', 'the', 'garden', '?']
6
['Is', 'Daniel', 'in', 'the', 'hallway', '?']
6
['Is', 'Daniel', 'in', 'the', 'hallway', '?']
6
['Is', 'Mary', 'in', 'the', 'bedroom', '?']
6
['Is', 'Mary', 'in', 'the', 'garden', '?']
6
['Is', 'Sandra', 'in', 'the', 'kitchen', '?']
6
['Is', 'Mary', 'in', 'the', 'bedroom', '?']
6
['Is', 'John', 'in', 'the', 'garden', '?']
6
['Is', 'Sandra', 'in', 'the', 'office', '?']
6
['Is', 'Sandra', 'in', 'the', 'garden', '?']
6
['Is', 'Sandra', 'in', 'the', 'kitchen', '?']
6
['Is', 'Sandra', 'in', 'the', 'bedroom', '?']
6
['Is', 'Sandra', 'in', 'the', 'office', '?']
6
['Is', 'Mary', 'in', 'the', 'kitchen', '?']
6
['Is', 'Sandra', 'in', 'the', 'kitchen', '?']
6
['Is', 'Sandra', 'in', 'the', 'kitchen', '?']
6
['Is', 'Daniel', 'in', 'the', 'hallway', '?']
6
['Is', 'Sandra', 'in', 'the', 'garden', '?']
6
['Is', 'Mary', 'in', 'the', 'office', '?']


In [34]:
max_question_length

6

# Vectorizing the Data
In this section, we convert the data to numerical form.

In [35]:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

In [36]:
# Creating an object of Tokenizer
tokenizer = Tokenizer(filters = [])

In [37]:
tokenizer.fit_on_texts(vocab)

In [38]:
# Obtaining the Word Indices
tokenizer.word_index

{'.': 26,
 '?': 3,
 'apple': 35,
 'back': 37,
 'bathroom': 27,
 'bedroom': 8,
 'daniel': 33,
 'discarded': 21,
 'down': 20,
 'dropped': 17,
 'football': 6,
 'garden': 15,
 'got': 16,
 'grabbed': 34,
 'hallway': 7,
 'in': 5,
 'is': 30,
 'john': 14,
 'journeyed': 28,
 'kitchen': 9,
 'left': 23,
 'mary': 18,
 'milk': 24,
 'moved': 4,
 'no': 11,
 'office': 12,
 'picked': 13,
 'put': 36,
 'sandra': 2,
 'the': 10,
 'there': 32,
 'to': 29,
 'took': 22,
 'travelled': 19,
 'up': 31,
 'went': 25,
 'yes': 1}
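
Two things stand out in this output: the tokenizer has lowercased every word (its default), and the indices follow no alphabetical pattern. The second point is most likely because the tokenizer was fitted on a Python set, whose iteration order is not guaranteed to be the same between sessions. If a reproducible mapping is wanted, one option (an assumption, not what the notebook does) is to build the index from the sorted vocabulary, starting at 1 so that 0 stays free for padding.

```python
# A small subset of the vocabulary, for illustration only
vocab = {'.', '?', 'Daniel', 'Is', 'Mary', 'bathroom', 'in', 'no', 'the', 'yes'}

# Lowercase, de-duplicate, sort, then number from 1 (0 is reserved for padding)
word_index = {
    word: index
    for index, word in enumerate(sorted({w.lower() for w in vocab}), start=1)
}

print(word_index['.'], word_index['daniel'], word_index['yes'])  # 1 4 10
```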

Next, we collect the stories, questions and answers from the training dataset into separate lists.

In [39]:
train_story_text = []
train_question_text = []
train_answers = []

for story, question, answer in train_data:
    train_story_text.append(story)
    train_question_text.append(question)
    train_answers.append(answer)

In [40]:
train_story_sequence = tokenizer.texts_to_sequences(train_story_text)

In [41]:
train_question_sequence = tokenizer.texts_to_sequences(train_question_text)

In [42]:
train_answers_sequence = tokenizer.texts_to_sequences(train_answers)
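
Conceptually, converting a token list to a sequence amounts to lowercasing each token and looking it up in the word index. The sketch below illustrates the idea in plain Python; `tokens_to_sequence` is a hypothetical helper, not the Keras implementation, and the index values are copied from the `tokenizer.word_index` output above.

```python
# Subset of the word index shown above
word_index = {'mary': 18, 'moved': 4, 'to': 29, 'the': 10,
              'bathroom': 27, '.': 26}

def tokens_to_sequence(tokens, word_index):
    """Lowercase each token and map it to its index, skipping unknown words."""
    lowered = (token.lower() for token in tokens)
    return [word_index[token] for token in lowered if token in word_index]

sequence = tokens_to_sequence(['Mary', 'moved', 'to', 'the', 'bathroom', '.'],
                              word_index)
print(sequence)  # [18, 4, 29, 10, 27, 26]
```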

In [49]:
train_story_text

[['Mary',
  'moved',
  'to',
  'the',
  'bathroom',
  '.',
  'Sandra',
  'journeyed',
  'to',
  'the',
  'bedroom',
  '.'],
 ['Mary',
  'moved',
  'to',
  'the',
  'bathroom',
  '.',
  'Sandra',
  'journeyed',
  'to',
  'the',
  'bedroom',
  '.',
  'Mary',
  'went',
  'back',
  'to',
  'the',
  'bedroom',
  '.',
  'Daniel',
  'went',
  'back',
  'to',
  'the',
  'hallway',
  '.'],
 ['Mary',
  'moved',
  'to',
  'the',
  'bathroom',
  '.',
  'Sandra',
  'journeyed',
  'to',
  'the',
  'bedroom',
  '.',
  'Mary',
  'went',
  'back',
  'to',
  'the',
  'bedroom',
  '.',
  'Daniel',
  'went',
  'back',
  'to',
  'the',
  'hallway',
  '.',
  'Sandra',
  'went',
  'to',
  'the',
  'kitchen',
  '.',
  'Daniel',
  'went',
  'back',
  'to',
  'the',
  'bathroom',
  '.'],
 ['Mary',
  'moved',
  'to',
  'the',
  'bathroom',
  '.',
  'Sandra',
  'journeyed',
  'to',
  'the',
  'bedroom',
  '.',
  'Mary',
  'went',
  'back',
  'to',
  'the',
  'bedroom',
  '.',
  'Daniel',
  'went',
  'back',
  'to

In [43]:
len(train_story_text)

10000

In [50]:
train_story_sequence

[[18, 4, 29, 10, 27, 26, 2, 28, 29, 10, 8, 26],
 [18,
  4,
  29,
  10,
  27,
  26,
  2,
  28,
  29,
  10,
  8,
  26,
  18,
  25,
  37,
  29,
  10,
  8,
  26,
  33,
  25,
  37,
  29,
  10,
  7,
  26],
 [18,
  4,
  29,
  10,
  27,
  26,
  2,
  28,
  29,
  10,
  8,
  26,
  18,
  25,
  37,
  29,
  10,
  8,
  26,
  33,
  25,
  37,
  29,
  10,
  7,
  26,
  2,
  25,
  29,
  10,
  9,
  26,
  33,
  25,
  37,
  29,
  10,
  27,
  26],
 [18,
  4,
  29,
  10,
  27,
  26,
  2,
  28,
  29,
  10,
  8,
  26,
  18,
  25,
  37,
  29,
  10,
  8,
  26,
  33,
  25,
  37,
  29,
  10,
  7,
  26,
  2,
  25,
  29,
  10,
  9,
  26,
  33,
  25,
  37,
  29,
  10,
  27,
  26,
  33,
  13,
  31,
  10,
  6,
  32,
  26,
  33,
  25,
  29,
  10,
  8,
  26],
 [18,
  4,
  29,
  10,
  27,
  26,
  2,
  28,
  29,
  10,
  8,
  26,
  18,
  25,
  37,
  29,
  10,
  8,
  26,
  33,
  25,
  37,
  29,
  10,
  7,
  26,
  2,
  25,
  29,
  10,
  9,
  26,
  33,
  25,
  37,
  29,
  10,
  27,
  26,
  33,
  13,
  31,
  10,
  6,
  32,
  26,


In [44]:
len(train_story_sequence)

10000

In [51]:
train_question_text

[['Is', 'Sandra', 'in', 'the', 'hallway', '?'],
 ['Is', 'Daniel', 'in', 'the', 'bathroom', '?'],
 ['Is', 'Daniel', 'in', 'the', 'office', '?'],
 ['Is', 'Daniel', 'in', 'the', 'bedroom', '?'],
 ['Is', 'Daniel', 'in', 'the', 'bedroom', '?'],
 ['Is', 'Mary', 'in', 'the', 'bedroom', '?'],
 ['Is', 'Sandra', 'in', 'the', 'office', '?'],
 ['Is', 'Sandra', 'in', 'the', 'bathroom', '?'],
 ['Is', 'Sandra', 'in', 'the', 'bathroom', '?'],
 ['Is', 'Mary', 'in', 'the', 'kitchen', '?'],
 ['Is', 'Sandra', 'in', 'the', 'office', '?'],
 ['Is', 'Mary', 'in', 'the', 'hallway', '?'],
 ['Is', 'Mary', 'in', 'the', 'hallway', '?'],
 ['Is', 'Mary', 'in', 'the', 'hallway', '?'],
 ['Is', 'Mary', 'in', 'the', 'garden', '?'],
 ['Is', 'Sandra', 'in', 'the', 'office', '?'],
 ['Is', 'Sandra', 'in', 'the', 'bathroom', '?'],
 ['Is', 'Sandra', 'in', 'the', 'kitchen', '?'],
 ['Is', 'Mary', 'in', 'the', 'bedroom', '?'],
 ['Is', 'Mary', 'in', 'the', 'kitchen', '?'],
 ['Is', 'Daniel', 'in', 'the', 'bedroom', '?'],
 ['Is', '

In [45]:
len(train_question_text)

10000

In [52]:
train_question_sequence

[[30, 2, 5, 10, 7, 3],
 [30, 33, 5, 10, 27, 3],
 [30, 33, 5, 10, 12, 3],
 [30, 33, 5, 10, 8, 3],
 [30, 33, 5, 10, 8, 3],
 [30, 18, 5, 10, 8, 3],
 [30, 2, 5, 10, 12, 3],
 [30, 2, 5, 10, 27, 3],
 [30, 2, 5, 10, 27, 3],
 [30, 18, 5, 10, 9, 3],
 [30, 2, 5, 10, 12, 3],
 [30, 18, 5, 10, 7, 3],
 [30, 18, 5, 10, 7, 3],
 [30, 18, 5, 10, 7, 3],
 [30, 18, 5, 10, 15, 3],
 [30, 2, 5, 10, 12, 3],
 [30, 2, 5, 10, 27, 3],
 [30, 2, 5, 10, 9, 3],
 [30, 18, 5, 10, 8, 3],
 [30, 18, 5, 10, 9, 3],
 [30, 33, 5, 10, 8, 3],
 [30, 2, 5, 10, 27, 3],
 [30, 2, 5, 10, 8, 3],
 [30, 33, 5, 10, 12, 3],
 [30, 33, 5, 10, 9, 3],
 [30, 2, 5, 10, 27, 3],
 [30, 2, 5, 10, 12, 3],
 [30, 14, 5, 10, 12, 3],
 [30, 2, 5, 10, 12, 3],
 [30, 2, 5, 10, 7, 3],
 [30, 14, 5, 10, 27, 3],
 [30, 14, 5, 10, 8, 3],
 [30, 18, 5, 10, 7, 3],
 [30, 14, 5, 10, 8, 3],
 [30, 33, 5, 10, 27, 3],
 [30, 2, 5, 10, 7, 3],
 [30, 18, 5, 10, 9, 3],
 [30, 18, 5, 10, 27, 3],
 [30, 2, 5, 10, 12, 3],
 [30, 18, 5, 10, 27, 3],
 [30, 2, 5, 10, 15, 3],
 [30, 18, 5,

In [46]:
len(train_question_sequence)

10000

In [53]:
train_answers

['no',
 'no',
 'no',
 'yes',
 'yes',
 'yes',
 'no',
 'no',
 'no',
 'yes',
 'yes',
 'yes',
 'yes',
 'no',
 'yes',
 'yes',
 'no',
 'yes',
 'yes',
 'yes',
 'no',
 'yes',
 'no',
 'no',
 'no',
 'no',
 'yes',
 'no',
 'no',
 'no',
 'yes',
 'yes',
 'no',
 'yes',
 'no',
 'no',
 'no',
 'no',
 'yes',
 'no',
 'no',
 'yes',
 'no',
 'no',
 'no',
 'no',
 'yes',
 'yes',
 'no',
 'no',
 'yes',
 'yes',
 'yes',
 'yes',
 'no',
 'yes',
 'no',
 'yes',
 'no',
 'yes',
 'no',
 'no',
 'no',
 'no',
 'yes',
 'yes',
 'yes',
 'no',
 'yes',
 'no',
 'no',
 'yes',
 'yes',
 'yes',
 'no',
 'no',
 'no',
 'no',
 'no',
 'no',
 'yes',
 'yes',
 'yes',
 'yes',
 'no',
 'yes',
 'yes',
 'no',
 'yes',
 'no',
 'no',
 'yes',
 'no',
 'no',
 'yes',
 'yes',
 'no',
 'yes',
 'yes',
 'no',
 'no',
 'no',
 'yes',
 'yes',
 'no',
 'no',
 'no',
 'yes',
 'no',
 'no',
 'yes',
 'yes',
 'no',
 'no',
 'yes',
 'no',
 'yes',
 'yes',
 'yes',
 'no',
 'no',
 'yes',
 'yes',
 'no',
 'no',
 'no',
 'yes',
 'no',
 'no',
 'yes',
 'no',
 'yes',
 'no',
 'yes',


In [47]:
len(train_answers)

10000

In [54]:
train_answers_sequence

[[11],
 [11],
 [11],
 [1],
 [1],
 [1],
 [11],
 [11],
 [11],
 [1],
 [1],
 [1],
 [1],
 [11],
 [1],
 [1],
 [11],
 [1],
 [1],
 [1],
 [11],
 [1],
 [11],
 [11],
 [11],
 [11],
 [1],
 [11],
 [11],
 [11],
 [1],
 [1],
 [11],
 [1],
 [11],
 [11],
 [11],
 [11],
 [1],
 [11],
 [11],
 [1],
 [11],
 [11],
 [11],
 [11],
 [1],
 [1],
 [11],
 [11],
 [1],
 [1],
 [1],
 [1],
 [11],
 [1],
 [11],
 [1],
 [11],
 [1],
 [11],
 [11],
 [11],
 [11],
 [1],
 [1],
 [1],
 [11],
 [1],
 [11],
 [11],
 [1],
 [1],
 [1],
 [11],
 [11],
 [11],
 [11],
 [11],
 [11],
 [1],
 [1],
 [1],
 [1],
 [11],
 [1],
 [1],
 [11],
 [1],
 [11],
 [11],
 [1],
 [11],
 [11],
 [1],
 [1],
 [11],
 [1],
 [1],
 [11],
 [11],
 [11],
 [1],
 [1],
 [11],
 [11],
 [11],
 [1],
 [11],
 [11],
 [1],
 [1],
 [11],
 [11],
 [1],
 [11],
 [1],
 [1],
 [1],
 [11],
 [11],
 [1],
 [1],
 [11],
 [11],
 [11],
 [1],
 [11],
 [11],
 [1],
 [11],
 [1],
 [11],
 [1],
 [1],
 [1],
 [1],
 [1],
 [11],
 [11],
 [11],
 [1],
 [1],
 [1],
 [1],
 [1],
 [1],
 [11],
 [1],
 [11],
 [1],
 [1],
 [11],
 [11

In [48]:
len(train_answers_sequence)

10000
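
For training a classifier, answers are often turned into one-hot vectors rather than kept as single-element index lists. The following is a sketch of one possible encoding, offered as an assumption rather than as the notebook's method; it reuses the indices for `'yes'` and `'no'` and the vocabulary length of 38 shown above.

```python
import numpy as np

# Indices copied from the tokenizer.word_index output above
word_index = {'yes': 1, 'no': 11}
vocab_length = 38  # 37 words plus the reserved padding index 0

answers = ['no', 'yes', 'no']  # toy answers for illustration

# One row per answer, with a single 1 at the answer's word index
Y = np.zeros((len(answers), vocab_length))
for row, answer in enumerate(answers):
    Y[row, word_index[answer]] = 1

print(Y.shape)          # (3, 38)
print(Y[0].argmax())    # 11
```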

# Functionalising the Vectorization

In [48]:
# Creating the Function
def vectorize_stories(data, 
                      work_index = tokenizer.word_index,
                      max_story_length = max_story_length, 
                      max_question_length = max_question_length):
    