# A Seq2seq Chatbot utilizing an Attention-based Encoder-Decoder Architecture


## Environment

We uploaded all the local data to our Google Drive. 

We will use Google Colaboratory because it provides a free GPU accelerator along with about 26GB of RAM, both of which will help with our computations.

Let's mount our Drive in Google Colaboratory so that the notebook can access the data.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
!ls "/content/drive/My Drive/Colab Notebooks/ncsr/"


chatbot-tf-notebook.ipynb  old			 version-keras-char-level
data			   training_checkpoints  version-keras-word-level


In [0]:

import os
!pwd
path_to_mount = '/content/drive/My Drive/Colab Notebooks/ncsr/'
os.chdir(path_to_mount)
!ls


/content
chatbot-tf-notebook.ipynb  old			 version-keras-char-level
data			   training_checkpoints  version-keras-word-level


## JSON Parsing

Great, we can now see our notebook and the data folder.

By exploring our data, we can see that we are interested in the text files inside the *dialogues* folder, and more specifically in the '*turns*' value.

If we open the files, we can see that they are JSON formatted.

However, a file as a whole is not a single valid JSON document.

Instead, **each line is a JSON object itself** (a layout commonly known as JSON Lines).

Let's list all the files with the *glob* library and parse them line by line with the *json* library.
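
The line-by-line idea can be sketched on a small in-memory sample before touching the real files. This is a minimal sketch: the helper name `parse_json_lines` and the two sample records are made up for illustration, and only the `id` and `turns` keys are mimicked.

```python
import io
import json

def parse_json_lines(handle):
    """Parse a file-like object that holds one JSON object per line."""
    records = []
    for line in handle:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        records.append(json.loads(line))
    return records

# Hypothetical two-line sample in the same shape as the dialogue files
sample = io.StringIO(
    '{"id": "a1", "turns": ["Hello how may I help you?", "hi"]}\n'
    '{"id": "b2", "turns": ["Hello how may I help you?", "bye", "Goodbye"]}\n'
)
records = parse_json_lines(sample)
print(len(records), records[1]["turns"][-1])
```

Calling `json.load` on the whole file would fail on such input, because the parser finds extra data after the first object.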

In [0]:
# Import libraries

# Parsing 
import glob
import json
import random 
import numpy as np
import pandas as pd 

# Preprocessing & NNs
from keras.models import Sequential, Model, load_model
from keras.layers import LSTM, Dense, Dropout, Embedding, CuDNNLSTM, Bidirectional, Input, TimeDistributed
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

import re
import tensorflow as tf
tf.enable_eager_execution() # evaluates operations immediately without building graphs
# The above does not work with placeholders

# Etc
from tqdm import tqdm
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import os
import time

%matplotlib inline

Using TensorFlow backend.


In [0]:
# Glob pattern (relative to the mounted folder) matching every dialogue file
dialogues_regex_folder_path = "data/dialogues/*.txt"

# Get the absolute path of each file 
list_of_files = glob.glob(path_to_mount + dialogues_regex_folder_path)
print(list_of_files[:3]) # Visualize the first 3
print(len(list_of_files)) # 47; if the notebook crashes, load fewer files

# Optional: keep only a random subset of the files to save memory
#list_of_files = random.sample(list_of_files, k=5)

print(len(list_of_files))

['/content/drive/My Drive/Colab Notebooks/ncsr/data/dialogues/AGREEMENT_BOT.txt', '/content/drive/My Drive/Colab Notebooks/ncsr/data/dialogues/APARTMENT_FINDER.txt', '/content/drive/My Drive/Colab Notebooks/ncsr/data/dialogues/CHECK_STATUS.txt']
47
47


In [0]:
# Parsing
list_of_dicts = [] # Init

# Loop for each file
for filename in list_of_files:
  with open(filename) as f:
      for line in f: # Loop for each line (inside each file)
          list_of_dicts.append(json.loads(line)) # parse the line and append the resulting dict to the list


In [0]:
# Visualization
print(list_of_dicts[0])
print(list_of_dicts[1].keys())
print(list_of_dicts[332])
print(list_of_dicts[:3])

{'id': 'c399a493', 'user_id': 'c05f0462', 'bot_id': 'c96edf42', 'domain': 'AGREEMENT_BOT', 'task_id': 'a9203a2c', 'turns': ['Hello how may I help you?', 'i am awesome', 'of course you are', 'and i own rental properties on the moon', 'i doubt you own a property in the moon', 'just kidding. i own them on Earth', "that's a nice joke", 'because i am a billionaire!', "i don't seem to know you", 'and i programmed you', 'i am the programmer']}
dict_keys(['id', 'user_id', 'bot_id', 'domain', 'task_id', 'turns'])
{'id': '77d8f493', 'user_id': '3205aff7', 'bot_id': 'f3420389', 'domain': 'AGREEMENT_BOT', 'task_id': 'd47b54df', 'turns': ['Hello how may I help you?', 'I must say that an agreement bot is really useless', 'I have never heard a more correct statement in my entire life!', 'yes what useless trip agreement bots are *tripe', "I couldn't have said it better myself. Is there anything else I can enthusiastically agree with today?", 'Well I can repeat my sentiment 4 times about agreement bots being utterl

Great! We now have a list of dictionaries built from our raw text dataset.

As we can see, we are only interested in the 'turns' value, since it contains the array of alternating bot and user sentences (the QAs).

So, we will drop all the other properties and create a new list of dictionaries containing only the useful data.

In [0]:
# Create a new dict containing only useful data
new_list_of_dicts = [] 

for old_dict in list_of_dicts:
  foodict = {k: v for k, v in old_dict.items() if (k == 'turns')} 
  new_list_of_dicts.append(foodict)

In [0]:
print(new_list_of_dicts[0])
print(new_list_of_dicts[332])
print(len(new_list_of_dicts))

# To avoid accidentally reusing the old variable,
# we rebind list_of_dicts to the new, trimmed list.
list_of_dicts = new_list_of_dicts 

print(list_of_dicts[:2])

{'turns': ['Hello how may I help you?', 'i am awesome', 'of course you are', 'and i own rental properties on the moon', 'i doubt you own a property in the moon', 'just kidding. i own them on Earth', "that's a nice joke", 'because i am a billionaire!', "i don't seem to know you", 'and i programmed you', 'i am the programmer']}
{'turns': ['Hello how may I help you?', 'I must say that an agreement bot is really useless', 'I have never heard a more correct statement in my entire life!', 'yes what useless trip agreement bots are *tripe', "I couldn't have said it better myself. Is there anything else I can enthusiastically agree with today?", 'Well I can repeat my sentiment 4 times about agreement bots being utterly useless', 'That sounds like a fantastic idea!', 'I see you agree wholeheartedly!', 'Yes, absolutely. Every circuit in my mechanical brain agrees with every single thing you say.', 'ok glad you agree', 'Likewise']}
37884
[{'turns': ['Hello how may I help you?', 'i am awesome', 'of

## Preprocessing

As we can see, we now have a list of 37884 dictionaries (if all 47 text files are loaded). Each dictionary contains a single property ('turns'), which holds an array with the sentences of one dialog.

Notice that the bot initiates each dialog, and then the human responds. The two keep alternating, like ping pong, until the dialog is over.

**If we observe the data more carefully, we can see that the last sentence may come from either the bot or the user!**

This will come in handy when preparing our input and target datasets.

Let's assign the dialogs into 2 matrices:
- questions 
- answers

But first, some corner cases need to be defined:
- A first 'artificial' user 'greeting' to the bot
- An 'artificial' bot 'bye' to the user, if the user was the last one in the dialogue.
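
The two corner cases can be sketched for a single dialog as follows. This is a simplified illustration, not the notebook's own cell: the helper name `split_turns` and its fixed `greeting` and `bye` defaults are assumptions, whereas the notebook picks them at random.

```python
def split_turns(turns, greeting="Hi", bye="Bye"):
    """Split one bot-first dialog into aligned (questions, answers) lists."""
    questions = [greeting]       # artificial user greeting before the bot's opener
    answers = []
    for i, sentence in enumerate(turns):
        if i % 2 == 0:           # even positions: the bot speaks
            answers.append(sentence)
        else:                    # odd positions: the user speaks
            questions.append(sentence)
    if len(turns) % 2 == 0:      # the user spoke last: add an artificial bot reply
        answers.append(bye)
    return questions, answers

# The bot speaks last: no artificial reply is needed
print(split_turns(["Hello how may I help you?", "i am awesome", "of course you are"]))
# The user speaks last: an artificial 'Bye' keeps the lists the same length
print(split_turns(["Hello how may I help you?", "i am awesome"]))
```

Either way, both lists end up with the same length, which is what the pairing of each question with an answer relies on.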

In [0]:
# Init matrices
questions = []
answers = []

# We assume that the first answer by the bot (aka "Hello, how may I help you?") 
# is returned after a user greeting.
# This is used in order to ensure that the dataset will be even 
# and each question is paired with an answer.
# That's why we create a mini random catalog 
# of artificial 'ghost' user greetings.
#matrix_greetings = ["Hey", "Hi", " "]
matrix_greetings = ["Hey", "Hi"]

# A similar situation happens in the corner case 
# when the last sentence is from the user.
# As said, each sentence from the user should be paired
# with a sentence from the bot.
# That's why we will in this case add an artificial one.
#matrix_byes = ["Ok", "Okie", " ", " ", "Bye"]
matrix_byes = ["Ok", "Okie", "Bye"]

# For each dictionary in the list
for dictionary in list_of_dicts:
  matrix_QA = dictionary['turns']
  
  # Append a first random greeting, as explained above
  questions.append(random.choice(matrix_greetings))
    
  # In order to split the QAs to 2 matrices (questions & answers),
  # we will use a flag to indicate if the sentence 
  # is given from the bot or from the user
  bot_flag = True # Init

  # For each Q/A in the matrix
  for sentence in matrix_QA:

    if bot_flag:
      answers.append(sentence) # Used for bot's answers
    else:
      questions.append(sentence) # Used for user's questions
    bot_flag = not bot_flag # Switch speaker

  # Ideally, the loop ends with a bot's answer,
  # which leaves bot_flag equal to False.
  # However, exploring the data shows
  # that this does not happen all the time.

  # Corner case: If the last sentence was from the user, 
  # then we need to add one artificial 'ghost' response 
  # from the bot to make the dataset even.
  if bot_flag: 
    answers.append(random.choice(matrix_byes))


In [0]:
assert len(questions) == len(answers), "ERROR: The questions and answers matrices have different lengths."
# If no AssertionError is raised, then everything is good.

print(len(questions)) # We have 238051 QAs (if we load all 47 texts)

238051


In [0]:
# Due to really high memory usage during TensorFlow training,
# we need to keep a lower number of QA pairs (see the next cell).
# Shuffling them first would keep our bot from overfitting on a few
# goal-oriented domains, like setting an alarm or explaining a catalogue,
# and would also enrich the vocabulary of our bot.
# The shuffle is left disabled here, so the pairs stay in file order.

# questions, answers = shuffle(np.array(questions), np.array(answers))

print(questions[:3])
print(answers[:3])

['Hey', 'i am awesome', 'and i own rental properties on the moon']
['Hello how may I help you?', 'of course you are', 'i doubt you own a property in the moon']


In [0]:
NUM_DIALOGS = 5000 # Number of QA pairs (not whole dialogs) to keep
questions = questions[:NUM_DIALOGS]
answers = answers[:NUM_DIALOGS]

print(len(answers))

5000


## Tokenizing
Let's tokenize our data using Keras's preprocessing modules (`Tokenizer` and `pad_sequences`).
We will also add special *start* and *end* tokens.

The *end* token is useful for the decoder to know when to stop predicting words.

The *start* token is also important for the decoder, since the decoder will progress by taking the tokens that it emits as inputs. So, before it has emitted anything it needs a special token to start with. 
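
The role of the two tokens can be sketched with a toy decoding loop. In this sketch the "decoder" is just a hypothetical lookup table standing in for a trained network, and `greedy_decode` is an illustrative name, not a function of this notebook.

```python
def greedy_decode(next_token, max_len=10):
    """Generate tokens one at a time, feeding each output back as the next input."""
    token = "<start>"            # nothing emitted yet, so begin from the special token
    output = []
    for _ in range(max_len):     # max_len guards against never predicting <end>
        token = next_token(token)
        if token == "<end>":     # the decoder signals that the sentence is complete
            break
        output.append(token)
    return output

# Hypothetical stand-in for a trained decoder: a fixed lookup table
table = {"<start>": "how", "how": "are", "are": "you", "you": "<end>"}
print(greedy_decode(table.get))
```

Without the *end* token the loop would only stop at `max_len`, and without the *start* token there would be nothing to feed in at the first step.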

Later, we could also import pretrained GloVe embeddings to make use of transfer learning, which can speed up training and improve the quality of the model.




In [0]:
# Input: questions or answers matrix
#
# Returns: Modified matrix, with special tokens appended 
# in the start/end of each string
def add_extra_tokens(matrix):

  new_matrix = []
  for sequence in matrix:
    sequence = "<start>" + " " + sequence + " " + "<end>"
    new_matrix.append(sequence)

  return new_matrix


In [0]:
questions = add_extra_tokens(questions) # Optional
answers = add_extra_tokens(answers)


In [0]:
print(questions[5:15])
print(answers[5:15])

['<start> and i programmed you <end>', '<start> Hey <end>', '<start> I am the king of the world <end>', '<start> I can have any woman I want! <end>', '<start> Even you bot, if I were in to AIs <end>', "<start> Really? you're awfully agreeable aren't you <end>", '<start> Having an agreement bot seems like a useless thing to have. I need some spice in my life! <end>', '<start> Hi <end>', '<start> Do you that I am a great person? <end>', '<start> I am only 6 inches tall. <end>']
['<start> i am the programmer <end>', '<start> Hello how may I help you? <end>', '<start> I agree that you are the king of the world <end>', '<start> I agree that you can have any woman you desire. <end>', '<start> Agreed. <end>', '<start> I agree that I am awfully agreeable, yes. <end>', '<start> I really agree with that. I am rather useles. <end>', '<start> Hello how may I help you? <end>', '<start> Yes! <end>', "<start> That's correct! <end>"]


In [0]:
# References: 
#
# - TensorFlow Text preprocessing (https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/text_to_word_sequence)
# - TensorFlow Sequence preprocessing (https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences)
# - TensorFlow Advanced Tutorials (https://www.tensorflow.org/beta/tutorials/text/nmt_with_attention)


# Returns the max length of a vector or a tensor
# This will be used for padding
def max_length(tensor):
    return max(len(t) for t in tensor)

# Returns the padded text and the tokenizer, based on a text corpus
def tokenize(lang):
  
  lang_tokenizer = Tokenizer(filters = '', lower=True) # Lowercase
  lang_tokenizer.fit_on_texts(lang) # Fit

  tensor = lang_tokenizer.texts_to_sequences(lang) 

  tensor = pad_sequences(tensor, padding='post') # Pad to a fixed length

  return tensor, lang_tokenizer

# Returns encoder and decoder padded texts & tokenizer,
# given the input and target text
def load_dataset(input_text, target_text):

  input_tensor, inp_lang_tokenizer = tokenize(input_text)
  target_tensor, targ_lang_tokenizer = tokenize(target_text)

  return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer
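
To make the two steps above concrete, here is a rough pure-Python sketch of what fitting a word index and post-padding amount to. It is an approximation for illustration only: the helper names `fit_word_index` and `to_padded_sequences` are made up, and the sketch ignores the other options of the real `Tokenizer`.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each lowercased, whitespace-separated word to an id, most frequent first.
    Id 0 is left unused so that it can serve as the padding value."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequences(texts, word_index):
    """Convert texts to id lists and right-pad them with 0 to the longest length."""
    seqs = [[word_index[w] for w in t.lower().split()] for t in texts]
    longest = max(len(s) for s in seqs)
    return [s + [0] * (longest - len(s)) for s in seqs]

texts = ["<start> hi <end>", "<start> how are you? <end>"]
word_index = fit_word_index(texts)
padded = to_padded_sequences(texts, word_index)
print(word_index["<start>"], word_index["<end>"])
print(padded)
```

Because no characters are filtered out, punctuation stays attached to its word (`you?` is one token), which is also what `filters=''` implies in the cell above.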

In [0]:
# Load texts as tensors & tokenizers for encoder/decoder
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(questions, answers)

# Calculate the max length of the target and input tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)

# These will be used in the NN Design part

In [0]:
# Creating training and validation sets using a 90%-10% split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.1)

# Show lengths
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))

4500 4500 500 500


In [0]:
# Index to word mapping for input and target text, just like the tutorial
def convert(lang, tensor):
  for t in tensor:
    if t!=0:
      print("%d ----> %s" % (t, lang.index_word[t]))

In [0]:
print("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print()
print("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

Input Language; index to word mapping
1 ----> <start>
803 ----> funny.
32 ----> thanks
9 ----> for
4 ----> the
34 ----> help
2 ----> <end>

Target Language; index to word mapping
1 ----> <start>
599 ----> anytime.
511 ----> agreed
26 ----> and
284 ----> take
687 ----> care
2 ----> <end>


## Neural Network Parameters

As always, there are no ideal values for the parameters. 

When tuning hyperparameters, we test values by trial and error.
Performance depends on many things and there is no one-size-fits-all approach.

As noted in the description, there is no single accurate metric for evaluating a chatbot.

Of course, we can get a sense of the loss from the cost function, or by using Keras/TF's history callbacks for validation, but text data is a bit different.
Other criteria, like User Experience (UX) or context, must be taken into account. One well-known metric for such tasks is BLEU, but it is outside the scope of this notebook.

(For example, a bot with stemming and lemmatization could perform better metrics-wise, but its outputs would look really weird to the user and would not seem so 'humanish' after all.)

The parameters below are the final ones, chosen after many training and inference iterations. Of course, they are subject to change, and the available RAM/GPU is a major factor in choosing them.

In [0]:
print(len(input_tensor_train))

4500


In [0]:
# Batch size:
# Typically, a power of 2 is a good choice for memory access reasons.
# With a higher batch size, there is a really high chance
# of the notebook crashing due to OOM (out-of-memory).
# To avoid RAM crashes, we set it to 8,
# which seemed OK after many experiments.
BATCH_SIZE = 8 

BUFFER_SIZE = len(input_tensor_train)  

# Use training steps
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE # Integer

# Neurons
units = 16 # Instead of 512

# Vocabulary sizes - used in Embeddings layers
vocab_inp_size = len(inp_lang.word_index) + 1 # Input
vocab_tar_size = len(targ_lang.word_index) + 1 # Target
embedding_dim = 50 # Instead of 256

# Create a TF dataset
# Shuffle method uses a buffer of buffer_size elements,
# and randomly selects the next element from that buffer 

dataset = tf.data.Dataset.from_tensor_slices((
                            input_tensor_train, 
                            target_tensor_train)).shuffle(BUFFER_SIZE) 

dataset = dataset.batch(BATCH_SIZE, drop_remainder=True) # Drop the last, smaller batch so that every batch has exactly BATCH_SIZE rows
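
The batching arithmetic behind `steps_per_epoch` and `drop_remainder` can be checked with a few lines of plain Python. This is only a sketch; `batch_plan` is a hypothetical helper, not part of TensorFlow.

```python
def batch_plan(num_examples, batch_size, drop_remainder=True):
    """Return (batches per pass over the data, examples left out of the pass)."""
    full, leftover = divmod(num_examples, batch_size)
    if drop_remainder or leftover == 0:
        return full, leftover
    return full + 1, 0  # the final, smaller batch is kept

# 4500 training pairs with a batch size of 8, as in this notebook
print(batch_plan(4500, 8))
print(batch_plan(4500, 8, drop_remainder=False))
```

Dropping the remainder matters here because the encoder's initial hidden state and the decoder's first input are both built with a fixed `BATCH_SIZE`, so a smaller final batch would not match their shapes.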


In [0]:
example_input_batch, example_target_batch = next(iter(dataset)) 
example_input_batch.shape, example_target_batch.shape 

(TensorShape([Dimension(8), Dimension(40)]),
 TensorShape([Dimension(8), Dimension(47)]))

Let's visualize the padded tokenized sentences that will be fed to our encoder-decoder neural network architecture.

In [0]:
print(example_input_batch[:2])
print(example_target_batch[:2])

tf.Tensor(
[[  1 152   3  43 217   3 166  16   4  33   2   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0]
 [  1  69   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0]], shape=(2, 40), dtype=int32)
tf.Tensor(
[[  1   4  51  30  79  19   7  44   2   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0]
 [  1   3  41 177   7 273  25  28   4  13  57  17  13  45 239 203   2   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0]], shape=(2, 47), dtype=int32)


## Neural Network Design

Next, we will implement a sequence-to-sequence (seq2seq) model, like the one proposed by [Sutskever et al. in 2014](https://arxiv.org/abs/1409.3215).

Such models are composed of an encoder and a decoder.

The encoder builds an embedding vector, i.e. a sequence of numbers that represents the meaning of the input sentence.

The decoder then processes this sentence vector to emit a translation or, in our case, to generate a reply.

Each of them uses a type of RNN, such as a vanilla RNN, GRU, or LSTM, since we deal with sequential data.

----

Here is an example of applying Encoder-Decoder to a QA problem.

![alt text](https://image.slidesharecdn.com/sequencelearningandmodernrnns-source-170603220404/95/sequence-learning-and-modern-rnns-46-638.jpg?cb=1496527568)


### Encoder

The encoder takes the source language as an input - in our case the *questions* matrix.

In [0]:
# Based on TensorFlow's tutorial: https://github.com/tensorflow/nmt/tree/master/nmt

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_size):
    super(Encoder, self).__init__()

    # Batch size & no. of units
    self.batch_size = batch_size
    self.enc_units = enc_units
    
    # Embedding layers convert the padded sentences into appropriate vectors
    #self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim) #TODO: change
    self.embedding = tf.keras.layers.Embedding(vocab_size,
                                               embedding_dim,
                                               mask_zero=True) # Original

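    # We could use an LSTM or a simple RNN here.
    # However, a GRU is simpler because it returns only 1 state instead of 2.
    # We need to return the state in order to pass it to the decoder.
    # NOTE: with return_sequences disabled, the GRU returns only its last
    # output, of shape (batch_size, enc_units). The attention layer expects
    # one output per input position, (batch_size, sequence_length, enc_units),
    # which requires return_sequences=True.

    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   #return_sequences=True,
                                   return_state = True,
                                   recurrent_initializer='glorot_uniform')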

  # Forward pass 
  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden)
    return output, state

  # Initialize the hidden state to zeros
  def initialize_hidden_state(self):
    return tf.zeros((self.batch_size, self.enc_units))

In [0]:
# Instantiate the encoder with the 
# corresponding vocabulary input size, embedding dimensions, units and batch size
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE) 

# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print ('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print ('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Encoder output shape: (batch size, sequence length, units) (8, 16)
Encoder Hidden state shape: (batch size, units) (8, 16)


### Attention mechanism

We also add an attention mechanism as proposed by [Bahdanau, Cho, Bengio](https://arxiv.org/abs/1409.0473) in 2014.

As seen in the following picture, the attention mechanism computes a weight for each word of the encoder's input. The higher the weight, the more that word contributes to the decoder's current step.

Note that attention mechanisms are mostly useful when dealing with long sentences, as in applications like *text summarization, speech recognition,* etc.

In our case, it may not be so useful, but let's keep it to experiment with it.

Here's what the weights look like in a Machine Translation problem.

![alt text](https://miro.medium.com/max/1200/0*VrRTrruwf2BtW4t5.)
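
The weighting can be sketched for a single example in plain Python. This is a minimal illustration of additive (Bahdanau-style) attention under simplifying assumptions: there are no bias terms, the matrices `W1`, `W2` and the vector `v` are toy values rather than learned weights, and the function names are made up.

```python
import math

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_attention(query, values, W1, W2, v):
    """Additive attention for one example.

    query:  decoder state, length H
    values: encoder outputs, one length-H vector per input position
    W1, W2: (A x H) projection matrices; v: scoring vector of length A
    """
    projected_query = matvec(W2, query)
    scores = []
    for value in values:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, value), projected_query)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Softmax over the input positions (the time axis)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the encoder outputs
    context = [sum(w * value[i] for w, value in zip(weights, values))
               for i in range(len(query))]
    return context, weights

# Toy numbers, chosen only for illustration
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 input positions, H = 2
query = [1.0, 0.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
context, weights = additive_attention(query, values, W1, W2, v)
print(len(weights), len(context))
```

There is one weight per input position and the weights sum to 1, so the context vector is a weighted average of the encoder outputs with the same size as a single encoder output.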

In [0]:
# As in the repo explained above
class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # hidden shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # we are doing this to perform addition to calculate the score
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights

In [0]:
attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)

print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

Attention result shape: (batch size, units) (8, 16)
Attention weights shape: (batch_size, sequence_length, 1) (8, 8, 1)


### Decoder
The encoded representation is then used by the **decoder** network to generate an output sequence.

In [0]:
# As in the repo
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_size):
    super(Decoder, self).__init__()
    self.batch_size = batch_size
    self.dec_units = dec_units
    #self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim) # TODO: Change
    self.embedding = tf.keras.layers.Embedding(vocab_size,
                                               embedding_dim,
                                               mask_zero = True) # Original
    
    # The decoder must return its output sequence as well as its state
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    
    # A final Dense layer with as many units as the vocabulary size.
    # Its output is followed by an argmax to pick a word index,
    # which index2word maps back to a word.
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

In [0]:
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

sample_decoder_output, _, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden,
                                      sample_output)

print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))

Decoder output shape: (batch_size, vocab size) (8, 3491)


### Loss Function & Optimizer
Let's define the loss function and the optimizer that will be used during backpropagation.


In [0]:
# Define optimizer and loss object

# RMSprop is a widely used adaptive optimizer, like Adam.
#
# Sparse categorical cross-entropy is the standard loss for
# multiclass classification when the targets are integer class ids.
optimizer = tf.keras.optimizers.RMSprop() 
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                            reduction='none')

# Define the cost/loss function
def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  # Return the mean loss over the batch
  return tf.reduce_mean(loss_)
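
What the masking does can be sketched without TensorFlow for a single decoder step. This is an illustrative re-implementation under stated assumptions, not the notebook's function: `masked_batch_loss` is a made-up name and the logits are toy numbers.

```python
import math

def masked_batch_loss(targets, logits, pad_id=0):
    """Sparse cross-entropy from logits for one decoder step, ignoring padded targets.

    targets: one target id per example
    logits:  one list of unnormalised scores per example
    """
    losses = []
    for target, row in zip(targets, logits):
        if target == pad_id:
            losses.append(0.0)  # a padded position contributes nothing
            continue
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[target])
    # Like tf.reduce_mean in the cell above, this divides by the batch size,
    # so padded positions still count in the denominator.
    return sum(losses) / len(losses)

logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
print(masked_batch_loss([1, 0], logits))  # the second example is padding
print(masked_batch_loss([1, 2], logits))  # both examples count
```

Without the mask, the model would also be rewarded for predicting the padding id after the *end* token, which would dominate the loss on short sentences.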

### Training & Checkpoint

Let's save checkpoints while training.

In [0]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer, encoder=encoder, decoder=decoder)

In [0]:
 # Reference: https://www.tensorflow.org/beta/tutorials/text/nmt_with_attention#training
@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0

  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden

    dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

    # Teacher forcing - feeding the target as the next input
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables))

  return batch_loss
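
Teacher forcing, as used in the loop above, boils down to shifting the target sequence by one position. The sketch below shows the pairing for one sequence; `teacher_forcing_pairs` is a hypothetical helper, and the ids in the middle of the example are made up.

```python
def teacher_forcing_pairs(target_ids):
    """Return (decoder input, expected output) pairs for one target sequence."""
    return list(zip(target_ids[:-1], target_ids[1:]))

# Ids follow the notebook's convention: 1 = <start>, 2 = <end>, 0 = padding
target = [1, 57, 13, 2, 0]
for dec_input, expected in teacher_forcing_pairs(target):
    print(dec_input, "->", expected)
```

At each step the decoder is fed the true previous token rather than its own prediction, and the final pair, whose expected output is padding, is the kind of position that the masked loss zeroes out.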

Time for training.

In [0]:
EPOCHS = 5 # Increase for better results, but beware of overfitting

for epoch in range(EPOCHS):

  start = time.time()
  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0

  # Run steps_per_epoch batches; the weights are updated after every batch
  for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):

    # Memory usage keeps growing here; a memory leak may be taking place
    batch_loss = train_step(inp, targ, enc_hidden) 
    total_loss += batch_loss
    
    # .numpy() works only with eager execution
    if batch % 100 == 0:
        print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch,
                                                     batch_loss.numpy())) 

  # saving (checkpoint) the model every 2 epochs
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix = checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1, total_loss / steps_per_epoch))
  print('Time taken for one epoch {} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 1.4971
Epoch 1 Batch 100 Loss 1.3308
Epoch 1 Batch 200 Loss 1.3033
Epoch 1 Batch 300 Loss 1.2638
Epoch 1 Batch 400 Loss 0.3558
Epoch 1 Batch 500 Loss 1.1568
Epoch 1 Loss 0.9669
Time taken for one epoch 611.4609444141388 sec

Epoch 2 Batch 0 Loss 0.8932
Epoch 2 Batch 100 Loss 1.1612
Epoch 2 Batch 200 Loss 1.2574
Epoch 2 Batch 300 Loss 1.2185
Epoch 2 Batch 400 Loss 0.3464
Epoch 2 Batch 500 Loss 1.1229
Epoch 2 Loss 0.8612
Time taken for one epoch 316.60665464401245 sec

Epoch 3 Batch 0 Loss 0.8322
Epoch 3 Batch 100 Loss 1.1040
Epoch 3 Batch 200 Loss 1.2000
Epoch 3 Batch 300 Loss 1.1746
Epoch 3 Batch 400 Loss 0.3069
Epoch 3 Batch 500 Loss 1.0879
Epoch 3 Loss 0.8081
Time taken for one epoch 317.1367449760437 sec

Epoch 4 Batch 0 Loss 0.7681
Epoch 4 Batch 100 Loss 1.0556
Epoch 4 Batch 200 Loss 1.1470
Epoch 4 Batch 300 Loss 1.1360
Epoch 4 Batch 400 Loss 0.2885
Epoch 4 Batch 500 Loss 1.0539
Epoch 4 Loss 0.7669
Time taken for one epoch 314.5347857475281 sec

Epoch 5 Batch 0

Training is slow: around 40 minutes for only 4000 dialogs over 10 epochs. In the log above, each epoch after the first takes a little over 5 minutes.


## Inference
Now we need to take some test sentences, preprocess them accordingly, pass them through the NN, and map the predicted indices back to words ('index2word') so that we can print the chatbot's predicted sentence to the user.

In [0]:
def evaluate(sentence):

  # Preprocess the test sentence (preprocess_sentence is defined in the next cell)
  sentence = preprocess_sentence(sentence)

  # Replace the sentence with the word indices inside it
  sentence_array = sentence.split(' ')
  inputs = []

  for i in sentence_array:
    if i in inp_lang.word_index: # Won't throw error if key isn't inside the vocab
      inputs.append(inp_lang.word_index[i])

  # Pad it, just like before
  inputs = pad_sequences([inputs], maxlen = max_length_inp, padding='post')

  # Convert to tensor
  inputs = tf.convert_to_tensor(inputs)

  # Init 
  result = '' # Empty string

  hidden = [tf.zeros((1, units))] # Init hidden states to be passed to the encoder
  enc_out, enc_hidden = encoder(inputs, hidden) # Encoder output & hidden states

  dec_hidden = enc_hidden # The encoder's hidden state is passed to the decoder

  # Decoder should know how to start
  dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

  # Loop for maximum length of target
  for t in range(max_length_targ):
    predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)

    # Call argmax for the last Dense layer
    predicted_id = tf.argmax(predictions[0]).numpy()

    result += targ_lang.index_word[predicted_id] + ' '

    # Stop when you predict the *end* token
    if targ_lang.index_word[predicted_id] == '<end>':
      result = result.replace('<end>', '') # Trim the end token for better UX
      return result, sentence#, attention_plot

    # The predicted ID is fed back into the model
    dec_input = tf.expand_dims([predicted_id], 0)

  return result, sentence#, attention_plot
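
The loop above is a greedy decoder: at every step it keeps only the most likely token and feeds it back in. Here is a minimal, framework-free sketch of that idea. The vocabulary and `toy_decoder_step` are hypothetical stand-ins for the trained decoder; they only exist so the loop can run on its own.

```python
# Hypothetical target vocabulary (index 0 is reserved for padding).
index_word = {1: "<start>", 2: "<end>", 3: "how", 4: "can", 5: "i", 6: "help"}
word_index = {w: i for i, w in index_word.items()}

def toy_decoder_step(prev_id):
    """Stand-in for the trained decoder: returns one score per vocabulary index.

    It favours a fixed successor for each token so the example is deterministic.
    """
    successor = {1: 3, 3: 4, 4: 5, 5: 6, 6: 2}
    scores = [0.0] * (len(index_word) + 1)
    scores[successor.get(prev_id, 2)] = 1.0
    return scores

def greedy_decode(step_fn, max_len):
    """Feed the argmax of each step back in until <end> or max_len is reached."""
    token_id = word_index["<start>"]
    words = []
    for _ in range(max_len):
        scores = step_fn(token_id)
        token_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if index_word[token_id] == "<end>":
            break  # stop before adding the end token to the reply
        words.append(index_word[token_id])
    return " ".join(words)

print(greedy_decode(toy_decoder_step, max_len=10))
```

Checking for the end token before appending avoids having to strip it from the result afterwards.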

In [0]:
import re

def preprocess_sentence(w):
    w = w.lower() # Convert to lowercase

    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ."
    # Reference: https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)

    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

    w = w.strip() # Remove leading and trailing whitespace

    # adding a start and an end token to the sentence
    # so that the model knows when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w
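
To see what each cleaning step does, we can trace a sentence through the same regular expressions. The helper `trace_preprocess` below is only an illustration (it is not part of the notebook's pipeline); it applies the substitutions from the cell above and keeps every intermediate string.

```python
import re

def trace_preprocess(text):
    """Return the intermediate strings produced by each cleaning step."""
    steps = []
    w = text.lower()
    steps.append(w)
    w = re.sub(r"([?.!,¿])", r" \1 ", w)      # pad punctuation with spaces
    steps.append(w)
    w = re.sub(r'[" "]+', " ", w)             # collapse runs of spaces (and double quotes)
    steps.append(w)
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)    # drop everything else, digits included
    steps.append(w)
    w = "<start> " + w.strip() + " <end>"     # add the start and end tokens
    steps.append(w)
    return steps

for step in trace_preprocess("Whatever..what time is it now?"):
    print(repr(step))

# Digits are removed by the last substitution, so "7am" becomes "am".
print(trace_preprocess("set an alarm for 7am")[-1])
```

One consequence worth keeping in mind: numbers such as alarm times never reach the model with this preprocessing.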

In [0]:
def generate(sentence):
    result, sentence = evaluate(sentence)

    print('Input: %s' % (sentence))
    print('Predicted Bot Output: {}'.format(result))


In [0]:
# Restore the latest checkpoint in checkpoint_dir
#checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))


In [0]:
text = "hello"
#text = "set an alarm for me"
generate(text)

text = "yes, you might help me. set an alarm for me"
generate(text)

text = "whatever..what time is it now?"
generate(text)

text = "bye"
generate(text)

Input: <start> hello <end>
Predicted Bot Output: i help you?  
Input: <start> yes , you might help me . set an alarm for me <end>
Predicted Bot Output: i agree  
Input: <start> whatever . . what time is it now ? <end>
Predicted Bot Output: i help you?  
Input: <start> bye <end>
Predicted Bot Output: i agree  


In [0]:
'''

print("Enter /q to quit")
while True:

  user_input = input("User: ")  # input() already returns a string

  if user_input == '/q':
    print("Quitting chat..")
    break
  else:
    # generate() prints the input and the bot's reply itself and returns None
    generate(user_input)
'''


-----------------

In [0]:
'''
This class allows to vectorize a text corpus, 
by turning each text into either a sequence of integers
(each integer being the index of a token in a dictionary) or 
into a vector where the coefficient for each token could be binary, 
based on word count, based on tf-idf...


token = Tokenizer() # Lower-case letters
token.fit_on_texts(full_text) # Fit on the full text, not only questions or answers
# seq = token.texts_to_sequences(full_text) # Transform each text in texts to a sequence of integers

word_index = token.word_index
print(len(word_index)) # Must be same as dictionary # TODO
'''

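
As a rough illustration of what such a tokenizer does, here is a simplified pure-Python sketch. The function names mirror the Keras methods, but this is not the Keras implementation: it only lowercases and splits on whitespace, and the corpus is made up.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign indices 1..N to words, most frequent first (0 stays free for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep their first-seen order.
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ordered, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]

corpus = ["set an alarm", "set a reminder", "what time is it"]  # toy data
word_index = fit_word_index(corpus)
vocab_size = len(word_index) + 1  # +1 because index 0 is never assigned to a word

print(word_index["set"], vocab_size)
print(texts_to_sequences(["set an alarm now"], word_index))
```

The most frequent word gets index 1, and the vocabulary size is one larger than the number of distinct words because index 0 is left unused.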

In [0]:
'''
# We add 1 because the Tokenizer's word indices start at 1; index 0 is reserved (used for padding)
vocab_size = len(token.word_index) + 1 
# num_words = vocab_size #TODO
print(vocab_size) # Must be 35418 = 35417 + 1
'''


Let us load the GloVe embeddings.

Note: Reading the embeddings file can take some time.

Reference:  [Blog post from www.mc.ai](https://mc.ai/glove-word-embeddings-with-keras-python-code/)

In [0]:
'''
embeddings_index = {}
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    # No explicit f.close() is needed: the with-block closes the file

print("GloVe loaded!")
'''

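
The GloVe text files store one word per line, followed by its coefficients separated by spaces. The sketch below parses that layout from an in-memory string, so it runs without the real file. The two lines of data and the helper name `load_vectors` are made up for illustration.

```python
import io

# Two made-up lines in the GloVe text layout: a word followed by its coefficients.
fake_glove = io.StringIO("hello 0.1 -0.2 0.3\nalarm 0.5 0.0 -1.5\n")

def load_vectors(handle):
    """Parse 'word c1 c2 ...' lines into a {word: [floats]} dictionary."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

vectors = load_vectors(fake_glove)
print(len(vectors), len(vectors["hello"]))
```

With the real file, the same function could be called on a handle opened inside a `with open(...)` block, so that the file is closed automatically.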

In [0]:
'''
embedding_vector = {} # Init

# Local path to uploaded GloVe embeddings
# glove_path_300d1 = "/content/drive/My Drive/Colab Notebooks/ncsr/data/embeddings/glove.6B.300d.txt"

glove_path = "/content/drive/My Drive/Colab Notebooks/ncsr/data/embeddings/glove.6B.50d.txt"
# f = open('./data/embeddings/glove.840B.300d.txt')
f = open(glove_path)

# We create a dictionary called embedding_vector
# Each word in the vocabulary is a key and 
# each key has a value that represents the word embeddings
for line in tqdm(f):
  value = line.split(' ')
  word = value[0]
  coef = np.array(value[1:], dtype = 'float32')
  embedding_vector[word] = coef

# Close the file handle
f.close()

print('Found %s word vectors.' % len(embedding_vector))
'''


Let's create a matrix that holds, for each word in our vocabulary, its corresponding embedding vector.

In [0]:
'''
embedding_matrix = np.zeros((vocab_size, embeddings_dim)) # Change dimension accordingly, as said above

for word, i in tqdm(token.word_index.items()):
  embedding_value = embedding_vector.get(word)
  if embedding_value is not None:
    embedding_matrix[i] = embedding_value
    '''

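
Here is the same idea on toy data, using plain lists so that it runs without NumPy. The word index, the vectors and the helper name `build_embedding_matrix` are hypothetical; the point is only to show that row `i` belongs to the word with index `i`, and that words without a pre-trained vector keep an all-zero row.

```python
# Hypothetical inputs: a tokenizer-style word index and a few pre-trained vectors.
word_index = {"set": 1, "alarm": 2, "zorblax": 3}
pretrained = {"set": [0.1, 0.2], "alarm": [0.3, 0.4]}
dim = 2

def build_embedding_matrix(word_index, pretrained, dim):
    """Row i holds the vector of the word with index i; row 0 and unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[i] = list(vector)
    return matrix

matrix = build_embedding_matrix(word_index, pretrained, dim)
covered = sum(1 for w in word_index if w in pretrained)

print(matrix)
print(f"{covered}/{len(word_index)} vocabulary words have a pre-trained vector")
```

Counting the covered words is a quick sanity check: if only a few words get a pre-trained vector, the embeddings will not help much.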

In [0]:
'''
embed_layer = Embedding(input_dim = vocab_size, output_dim = embeddings_dim, trainable=True) # input_dim must match the number of rows in embedding_matrix
embed_layer.build((None,))
embed_layer.set_weights([embedding_matrix])
'''


End

-----------

## GloVe Embeddings
Let's try to import our GloVe embeddings so we have more meaningful representations of our words.

In [0]:
'''
import numpy as np # TODO: Transfer all imports in the first level
 
GLOVE_PATH =  "/content/drive/My Drive/Colab Notebooks/ncsr/data/embeddings/glove.6B.50d.txt"
GLOBE_VECTOR_LENGTH = 50 # Change accordingly - depends on which GloVe version you use

# Read from the GloVe file
embedding_vector = {}
f = open(GLOVE_PATH)
for line in tqdm(f):
  value = line.split(' ')
  word = value[0]
  coef = np.array(value[1:],dtype = 'float32')
  embedding_vector[word] = coef


# Init embedding matrices
embedding_matrix_encoder = np.zeros((vocab_inp_size, embedding_dim))
embedding_matrix_decoder = np.zeros((vocab_tar_size, embedding_dim))

# Populate
for word,i in tqdm(inp_lang.word_index.items()):
  embedding_value = embedding_vector.get(word)
  if embedding_value is not None:
    embedding_matrix_encoder[i] = embedding_value

for word,i in tqdm(targ_lang.word_index.items()):
  embedding_value = embedding_vector.get(word)
  if embedding_value is not None:
    embedding_matrix_decoder[i] = embedding_value
'''

'\nimport numpy as np # TODO: Transfer all imports in the first level\n \nGLOVE_PATH =  "/content/drive/My Drive/Colab Notebooks/ncsr/data/embeddings/glove.6B.50d.txt"\nGLOBE_VECTOR_LENGTH = 50 # Change accordingly - depends on which GloVe version you use\n\n# Read from the GloVe file\nembedding_vector = {}\nf = open(GLOVE_PATH)\nfor line in tqdm(f):\n  value = line.split(\' \')\n  word = value[0]\n  coef = np.array(value[1:],dtype = \'float32\')\n  embedding_vector[word] = coef\n\n\n# Init embedding matrices\nembedding_matrix_encoder = np.zeros((vocab_inp_size, embedding_dim))\nembedding_matrix_decoder = np.zeros((vocab_tar_size, embedding_dim))\n\n# Populate\nfor word,i in tqdm(inp_lang.word_index.items()):\n  embedding_value = embedding_vector.get(word)\n  if embedding_value is not None:\n    embedding_matrix_encoder[i] = embedding_value\n\nfor word,i in tqdm(targ_lang.word_index.items()):\n  embedding_value = embedding_vector.get(word)\n  if embedding_value is not None:\n    embedd