<a href="https://colab.research.google.com/github/mehdi-karimi-math/QAN-TF-SQuAD/blob/master/QAN_TF_SQuAD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Question-Answering Network (QAN) designed in TensorFlow 2, based on SQuAD**


---
In this Colab notebook, I show how to design a simple question-answering (QA) system entirely in TensorFlow 2 and Keras. My goal is to cover the basics: how to handle and process text data in TensorFlow, how to use embeddings, and how to define the layers of a simple deep-learning QA model. There are many other resources online; here I use only the TensorFlow APIs and stick to the basics. This notebook can serve as a framework for more complicated and efficient QANs. 

To start learning about Natural Language Processing (NLP), I recommend the following courses from Coursera and Stanford. I learned a great deal from them.

[Coursera's Natural Language Processing in TensorFlow](https://www.coursera.org/learn/natural-language-processing-tensorflow)

[Stanford's NLP with Deep Learning](http://onlinehub.stanford.edu/cs224)

If you have any comments, do not hesitate to contact me (Mehdi Karimi) at mahdikarimi1982 at gmail. 



The machine learning platform I am using is [TensorFlow 2.0](https://www.tensorflow.org/). To install this version, we can use the following lines of code and then check if it is correctly installed. 

In [0]:
!pip install tensorflow==2.0.0

In [1]:
import tensorflow as tf
print(tf.__version__)

2.0.0


# Dataset

---

Keras is a high-level API that runs on top of TensorFlow. We first import Keras and NumPy. For handling data, we use tensorflow_datasets ([tfds](https://www.tensorflow.org/datasets/api_docs/python/tfds)). tfds is a collection of modules for dealing with data, and it also contains a collection of datasets ready to use with TensorFlow. These datasets vary in difficulty and serve both practice and research. We are going to load [squad](https://www.tensorflow.org/datasets/catalog/squad). From the website, SQuAD is a "reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage". So the answer to each question is a span of the text and, as we will see, we only care about the start and end positions of the answer. The following commands load this dataset as a dictionary, where the train and validation parts are already separated. 

In [3]:
from tensorflow import keras
import tensorflow_datasets as tfds

import numpy as np
import matplotlib.pyplot as plt 

squad_data, info = tfds.load("squad", with_info=True)

squad_train = squad_data['train']
squad_valid = squad_data['validation']

print(info.features)



FeaturesDict({
    'answers': Sequence({
        'answer_start': Tensor(shape=(), dtype=tf.int32),
        'text': Text(shape=(), dtype=tf.string),
    }),
    'context': Text(shape=(), dtype=tf.string),
    'id': Tensor(shape=(), dtype=tf.string),
    'question': Text(shape=(), dtype=tf.string),
    'title': Text(shape=(), dtype=tf.string),
})


We see that each training example is a dictionary with 5 items. The "answers" value is itself a dictionary with two items: the first is an array of the start positions of the answers in the context (as character counts), and the second is the text of the answers. The main dictionary also has items for "context" and "question", which we are going to use. Let us read the data we need into separate lists. 

Note that we convert the tensors into NumPy arrays so that we can later use the text processing of Keras. Also pay attention to the slicing: converting a bytes object to a string yields text of the form "b'...'", and we strip the leading "b'" and the trailing quote, which we do not need. It is useful to play with the data before and after the processing. 



In [0]:
context=[] # This list contains the contexts we are asking questions about. 
question=[] # This list contains the text of the questions. 
answer_t=[] # This list contains the text of the answers. 
answer_i=[] # This list contains the start positions (character counts) of the answers. 

# The same lists for the validation data. 
context_val=[]
question_val=[]
answer_t_val=[]
answer_i_val=[]

for ques in squad_train:
  context.append(str(ques['context'].numpy())[2:-1])
  question.append(str(ques['question'].numpy())[2:-1])
  answer_t.append(str(ques['answers']['text'].numpy())[3:-2])
  answer_i.append(ques['answers']['answer_start'].numpy())

for ques in squad_valid:
  context_val.append(str(ques['context'].numpy())[2:-1])
  question_val.append(str(ques['question'].numpy())[2:-1])
  answer_t_val.append(str(ques['answers']['text'].numpy())[3:-2])
  answer_i_val.append(ques['answers']['answer_start'].numpy())

In [5]:
# we can check the content of the lists. 
index = 0
print(context[index])
print(question[index])
print(answer_t[index], answer_i[index])

The difference in the above factors for the case of \xce\xb8=0 is the reason that most broadcasting (transmissions intended for the public) uses vertical polarization. For receivers near the ground, horizontally polarized transmissions suffer cancellation. For best reception the receiving antennas for these signals are likewise vertically polarized. In some applications where the receiving antenna must work in any position, as in mobile phones, the base station antennas use mixed polarization, such as linear polarization at an angle (with both vertical and horizontal components) or circular polarization.
What is one use that would require an antenna to receive signals in various ways at once?
mobile phones [427]
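
The `\xce\xb8` in the context printed above is the escaped UTF-8 byte sequence of the Greek letter θ. It survives because converting a bytes object with `str` and slicing off the `b'...'` wrapper keeps escape sequences as literal characters. The following is a minimal sketch of an alternative, decoding the bytes instead, where the made-up byte string `raw` stands in for `ques['context'].numpy()`. Since the escaped form is longer than the decoded text, character offsets such as "answer_start" may no longer line up after an escaped character.

```python
# `raw` is a made-up stand-in for ques['context'].numpy(), which is a bytes object.
raw = "θ=0 is the case".encode("utf-8")

sliced = str(raw)[2:-1]        # slicing approach: escape sequences stay as literal text
decoded = raw.decode("utf-8")  # decoding: recovers the original characters

print(sliced)
print(decoded)
print(len(sliced), len(decoded))  # the two versions differ in length
```

If the decoded version is used, the slicing of the answers needs the same treatment, for example by decoding each entry of `ques['answers']['text'].numpy()`.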


# Text Preprocessing
---
For processing the text, we use Keras Preprocessing, the data preprocessing module of Keras. We use the [Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) class, which lets us build a dictionary of all the words in the context list and convert text into a sequence of integers. We also use padding: for our deep learning model, we want all the context vectors (and likewise all the question vectors) to have the same size, so we set that size to the maximum length and pad the shorter vectors with zeros. 

In [6]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

padding_type='post' # This means we pad with zeros at the end of each vector. 
oov_tok = "<OOV>"  # This token is used for any word that is not in the vocabulary. 


tokenizer = Tokenizer( oov_token=oov_tok)
tokenizer.fit_on_texts(context)
word_index = tokenizer.word_index  # This will be our dictionary of all the words. 
num_words = len(word_index.keys())
print("number of words in our dictionary is {}".format(num_words))

number of words in our dictionary is 82780


In [7]:
# Convert the context texts into integer sequences and pad them. 
sequences = tokenizer.texts_to_sequences(context)
con_len = max(map(len, sequences))
print("max length of a context vector is {}".format(con_len))
context_padded = pad_sequences(sequences, maxlen=con_len, padding=padding_type)

# Convert the question texts into integer sequences and pad them. 
sequences = tokenizer.texts_to_sequences(question)
que_len = max(map(len, sequences))
print("max length of a question vector is {}".format(que_len))
question_padded = pad_sequences(sequences, maxlen=que_len, padding=padding_type)

# Convert the answer texts into integer sequences (no padding needed). 
answer_token = tokenizer.texts_to_sequences(answer_t)

### We can do the same for the validation data. 

# sequences_val = tokenizer.texts_to_sequences(context_val)
# context_val_padded = pad_sequences(sequences_val, maxlen=con_len, padding=padding_type)
# sequences_val = tokenizer.texts_to_sequences(question_val)
# question_val_padded = pad_sequences(sequences_val, maxlen=que_len, padding=padding_type)
# answer_token_val = tokenizer.texts_to_sequences(answer_t_val)

max length of a context vector is 718
max length of a question vector is 40
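
To make the tokenizing and padding steps concrete, here is a toy sketch in plain Python. It is not the Keras implementation: the helper names `build_word_index`, `to_sequence` and `pad_post` are hypothetical, and the real Tokenizer also lowercases and strips punctuation in ways this sketch only approximates. It does follow the same conventions as the cells above: index 1 is reserved for the OOV token, frequent words get small indices, and 0 is used only for padding.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Toy word index: 1 is the OOV token, then words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    index = {oov_token: 1}
    for word, _ in counts.most_common():
        index[word] = len(index) + 1
    return index

def to_sequence(text, index, oov_token="<OOV>"):
    """Map each word to its integer, falling back to the OOV index."""
    return [index.get(word, index[oov_token]) for word in text.lower().split()]

def pad_post(seq, maxlen):
    """Append zeros so that the sequence has length maxlen."""
    return seq + [0] * (maxlen - len(seq))

toy_index = build_word_index(["the cat sat", "the dog"])
toy_seq = to_sequence("the bird sat", toy_index)  # "bird" is out of vocabulary
toy_padded = pad_post(toy_seq, 5)

print(toy_index)
print(toy_seq)
print(toy_padded)
```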


The following step is important in our processing. As mentioned, we are interested in the start and end positions of the answers, but what we have is the character count for the start of each answer. We can use the "split" method of Python to count the words before that position, but this count does not necessarily match our tokenized vector. Hence, we search in a window of about 10 positions on either side of the estimate to find the exact position, and if we do not find it, we discard the question. At the end, we get our "cleaned" data. You can perform a more detailed cleaning.  

In [8]:
y_train_i = [] # This list contains tuples of length two, (start,end), for the answers. 

context_padded_clean=[]
question_padded_clean=[]
answer_t_clean=[]

for i in range(len(answer_i)):
  temp = answer_i[i][0]  # character count of the start of the answer
  # Estimate the word position of the answer by counting the words before it. 
  start = len(context[i][0:temp-1].replace('-',' ').split())+1
  temp_l = len(answer_t[i].replace('-',' ').split())  # number of words in the answer
  end = start+temp_l-1
  flag = False
  # Search around the estimate for the tokens of the answer. 
  for j in range(start-10,start+10):
    if np.array_equal(context_padded[i][j:j+temp_l], answer_token[i]):
      start = j
      end = j+temp_l-1
      flag = True
  # Keep the sample only if the answer was found in the tokenized context. 
  if flag:
    y_train_i.append((start,end))
    context_padded_clean.append(context_padded[i])
    question_padded_clean.append(question_padded[i])
    answer_t_clean.append(answer_t[i])

context_padded_clean=np.array(context_padded_clean)
question_padded_clean=np.array(question_padded_clean)
answer_t_clean=np.array(answer_t_clean)

num_train_data = context_padded_clean.shape[0]
print("number of training samples after cleaning is = {}".format(num_train_data))

number of training samples after cleaning is = 74590
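
The estimate-then-search idea can be sketched on plain whitespace tokens. This is a simplified illustration, not the cell above: the helpers `estimate_start` and `find_span` are hypothetical, the positions are 0-based, and the tokens are words rather than the integers produced by the tokenizer.

```python
def estimate_start(context, char_start):
    """Estimate the word position by counting the words before the character offset."""
    return len(context[:char_start].split())

def find_span(tokens, answer_tokens, guess, window=10):
    """Search around the guess for the answer; return (start, end) or None."""
    n = len(answer_tokens)
    for j in range(max(guess - window, 0), guess + window):
        if tokens[j:j + n] == answer_tokens:
            return j, j + n - 1
    return None

toy_context = "Base station antennas in mobile phones use mixed polarization"
toy_answer = "mobile phones"
char_start = toy_context.find(toy_answer)

guess = estimate_start(toy_context, char_start)
span = find_span(toy_context.split(), toy_answer.split(), guess)
print(guess, span)
```

A sample for which `find_span` returns `None` would be dropped, which mirrors how questions are discarded in the cleaning step.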


Now we create the label vector for our training. Since the context vectors have size 718, this label vector has size 2*718. The first half is a one-hot vector for the start position of the answer, and the second half is a one-hot vector for the end position. 

In [0]:
y_train = []

for i in range(len(context_padded_clean)):
    
    s_ = np.zeros(con_len,dtype = "float32")
    e_ = np.zeros(con_len,dtype = "float32")
    
    s_[y_train_i[i][0]] = 1
    e_[y_train_i[i][1]] = 1
    y_train.append(np.concatenate((s_,e_)))
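
As a small worked example of this label layout, the sketch below builds the label for a toy context of length 6 with an answer spanning positions 2 to 4, and then recovers the two positions with argmax. The helper `make_label` is hypothetical and the numbers are made up for illustration.

```python
import numpy as np

def make_label(start, end, length):
    """Concatenate two one-hot vectors: start position, then end position."""
    label = np.zeros(2 * length, dtype="float32")
    label[start] = 1.0
    label[length + end] = 1.0
    return label

toy_label = make_label(start=2, end=4, length=6)
print(toy_label)

# Recover the positions from the two halves.
toy_start = int(toy_label[:6].argmax())
toy_end = int(toy_label[6:].argmax())
print(toy_start, toy_end)
```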

# Embedding 
---

Embedding is an interesting concept in NLP. You can find a short [tutorial](https://www.tensorflow.org/tutorials/text/word_embeddings) on the TensorFlow website. After encoding the words as numbers, we want to assign a vector to each word. The naive way is to assign one-hot vectors to the words. The problem is that these vectors are too large, and they carry no connection between similar words. With an embedding, we give each word a vector in $R^m$, where $m$ is a reasonable number, in such a way that similar words are close to each other. The good news is that there are pretrained embeddings that we can load and use in our models. We use the [GloVe](https://nlp.stanford.edu/projects/glove/) embedding of size 50. Using larger sizes can improve the performance, at the cost of a longer training time. 

There are [different ways](https://colab.research.google.com/notebooks/io.ipynb) to load a file into a Colab notebook. An efficient way is to mount your Google Drive in Colab, which we also use later for training. Then we can put our file (in this case, you can search for 'glove.6B.50d.txt') into the 'My Drive' directory and read it into Colab. 

In [0]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

In [0]:
embeddings_index = dict()
glove = 50 # the dimension of the embedding

# Each line of the GloVe file holds a word followed by the components of its vector.
with open('/content/drive/My Drive/glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

Now we create an embedding matrix for the words in our dictionary. This matrix is used in our embedding layer. 

In [0]:
embedding_matrix = np.zeros((num_words+1, glove))
for word, index in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
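
The following toy sketch shows what the embedding matrix does once it is built: looking up a sequence of integers is just row indexing. The three-dimensional vectors are made up for illustration and are not GloVe values. Words without a pretrained vector, as well as the padding index 0, keep an all-zero row.

```python
import numpy as np

# Made-up toy data; "zzyzx" has no pretrained vector.
toy_word_index = {"<OOV>": 1, "cat": 2, "dog": 3, "zzyzx": 4}
toy_vectors = {
    "cat": np.array([0.9, 0.1, 0.0], dtype="float32"),
    "dog": np.array([0.8, 0.2, 0.1], dtype="float32"),
}
toy_dim = 3

# Row i holds the vector of the word with index i; row 0 is for padding.
toy_matrix = np.zeros((len(toy_word_index) + 1, toy_dim), dtype="float32")
for word, idx in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[idx] = vec

toy_sequence = np.array([2, 4, 0])    # "cat", "zzyzx", padding
toy_embedded = toy_matrix[toy_sequence]
print(toy_embedded.shape)
print(toy_embedded)
```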

# Defining the layers and the model
---

At this point, we have processed our data and are ready to design a model. First note that there are two input streams of data, one for the context and one for the question. To answer a question about a text, we need information about the whole sequence of words. In other words, we want to connect previous information to the present task. Recurrent neural networks address this goal. We are going to use LSTM layers to learn long-term dependencies. 

In the following, we define two classes. One is "Embedding", which wraps the embedding matrix and is not trainable. The other is "Bi_LSTM", which wraps the built-in bidirectional LSTM layer of Keras. Bidirectional means that information can flow both forward and backward. A parameter we pass to the LSTM is UNITS, which is the dimensionality of the output space of the LSTM layer in each direction. 

The layer that we customize is the "BiLinear_Layer", which models the **Attention**. [Here](https://www.tensorflow.org/guide/keras/custom_layers_and_models) is a guide on how to write custom layers in TensorFlow. Attention is an important concept in NLP for question-answering and translation. We have two input streams for context and question, and attention tells us how to connect these two. The beginning of the input for our model is as follows:

$$
\text{Context} \rightarrow \text{Embedding} \rightarrow \text{LSTM}  \rightarrow H_c
$$
$$
\text{Question} \rightarrow \text{Embedding} \rightarrow \text{LSTM}  \rightarrow H_q
$$

We use a simple bilinear attention. In our model, the length of a context vector is 718 and the length of a question vector is 40. Let $U$ be the output dimension of the bidirectional LSTM, which is twice the number of LSTM UNITS because the forward and backward outputs are concatenated. Then $H_c$ is a matrix of size $718 \times U$ and $H_q$ is a matrix of size $40 \times U$. For the start position of the answer, we define a weight matrix $W_S$ of size $U \times U$, which gives us

$$
H := H_c W_S H_q^\top.
$$

Now $H$ is a matrix of size $718 \times 40$. We apply another weight vector $F_S$ to compute $H F_S$, which is a vector of length 718. We pass this vector through a softmax function to create an output vector of probabilities for the start position of the answer. $W_E$ and $F_E$ are also defined similarly for the end position of the answer. The output of this layer is the concatenation of these two probability vectors. 

**Important Note:** To improve the model for this task, a more complicated attention must be used. This bilinear attention we use here is a very simple one. There are many resources in the literature about attention. 
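
To check the shapes in the formulas above, here is a minimal NumPy sketch of the start-position branch for a single example. The sizes (7 context positions, 3 question positions, $U = 4$) and the random matrices are made up for illustration; in the model these are the LSTM outputs and trainable weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_c, n_q, U = 7, 3, 4           # toy sizes instead of 718, 40 and 2*UNITS

H_c = rng.normal(size=(n_c, U))  # stand-in for the context LSTM output
H_q = rng.normal(size=(n_q, U))  # stand-in for the question LSTM output
W_S = rng.normal(size=(U, U))    # bilinear weight matrix
F_S = rng.normal(size=(n_q,))    # weight vector over the question positions

H = H_c @ W_S @ H_q.T            # shape (n_c, n_q)
scores = H @ F_S                 # shape (n_c,)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()

p_start = softmax(scores)        # probabilities over the context positions
print(H.shape, p_start.shape, p_start.sum())
```

The end-position branch has the same form with $W_E$ and $F_E$, and the layer output is the concatenation of the two probability vectors.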

In [0]:
from tensorflow.keras.layers import Bidirectional,LSTM,Dense,Input
from tensorflow.keras.models import Model


class Embedding(tf.keras.Model):
    """
    This class defines the embedding layer. The weights come from the embedding matrix and are not trainable. 
    """

    def __init__(self, num_words , embedding_matrix , embedding_dim = glove):
        
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(num_words+1, embedding_dim, weights=[embedding_matrix], trainable=False, mask_zero=True)

    def call(self, x):
        
        x = self.embedding(x)
        return x


class Bi_LSTM(tf.keras.Model):
    """
    This class defines bidirectional LSTM we are going to use for both the context and questions. 
    """
    
    def __init__(self, UNITS):
        
        super().__init__()
        self.bilstm = Bidirectional(LSTM(UNITS, return_sequences=True, recurrent_initializer='glorot_uniform', dropout = 0.4))
        
    def call(self,x):
        
        output = self.bilstm(x)
        return output
  


class BiLinear_Layer(tf.keras.Model):
     def __init__(self, UNITS, que_len):
        """
        This class defines the bilinear layer, which performs the attention. It combines the outputs of the LSTM layers from the context and question sides. 

        """
        super().__init__()

        w_init = tf.random_normal_initializer()
        self.WS = tf.Variable(initial_value=w_init(shape=(UNITS, UNITS), dtype='float32'), trainable=True)
        # self.FS = tf.Variable(initial_value=np.ones((que_len, 1), dtype='float32'), trainable=True)
        self.FS = tf.Variable(initial_value=w_init(shape=(que_len, 1), dtype='float32') , trainable=True)
        

        self.WE = tf.Variable(initial_value=w_init(shape=(UNITS, UNITS), dtype='float32'), trainable=True)
        # self.FE = tf.Variable(initial_value=np.ones((que_len, 1), dtype='float32') , trainable=True)
        self.FE = tf.Variable(initial_value=w_init(shape=(que_len, 1), dtype='float32') , trainable=True)


     def call(self,con_mat,que_mat):
        
        start_temp = con_mat @ self.WS
        start_temp = start_temp @ tf.transpose(que_mat, [0,2,1])
        start_temp = start_temp @ self.FS
        start_temp = tf.nn.softmax(start_temp, axis = 1)


        end_temp = con_mat @ self.WE
        end_temp = end_temp @ tf.transpose(que_mat,[0,2,1])
        end_temp = end_temp @ self.FE
        end_temp = tf.nn.softmax(end_temp, axis = 1)

        prob = tf.concat([start_temp,end_temp],axis=1)
        return prob

To design our model, as we discussed, both context and question vectors go through embedding and LSTM layers. The outputs of these layers go into the bilinear layer. We can see a summary of our model in the following. 

In [13]:
UNITS = 128

c_input = Input(shape=(con_len,))
c_emb = Embedding(num_words, embedding_matrix)(c_input)
c_lstm = Bi_LSTM(UNITS)(c_emb)

q_input = Input(shape=(que_len,))
q_emb = Embedding(num_words, embedding_matrix)(q_input)
q_lstm = Bi_LSTM(UNITS)(q_emb)


y_prob = BiLinear_Layer(2*UNITS, que_len)(c_lstm, q_lstm)

model = Model(inputs = [c_input, q_input],outputs =y_prob)
model.summary()


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 718)]        0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 40)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 718, 50)      4139050     input_1[0][0]                    
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 40, 50)       4139050     input_2[0][0]                    
______________________________________________________________________________________________

# Loss function and custom accuracy
---
For the loss function, we use the categorical cross-entropy loss, applied to both the start position and the end position. The final loss is the sum of these two. 

In [0]:
def Loss(y_true, prob):

    """
    This function calculates the loss for our model. We calculate the losses for the start and end positions and add them together. 
    """
    
    
    # Split the vectors into two halves for the start and end positions.
    start_label = y_true[:,:con_len]
    end_label = y_true[:,con_len:]
    
    start_logit = prob[:,:con_len]
    end_logit = prob[:,con_len:]
    
    start_loss = tf.keras.backend.categorical_crossentropy(start_label,start_logit)
    end_loss = tf.keras.backend.categorical_crossentropy(end_label,end_logit)
    
    return start_loss + end_loss
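
For a single sample with one-hot labels, this loss reduces to minus the log of the probability assigned to the true start position plus minus the log of the probability assigned to the true end position. The NumPy sketch below checks this on made-up numbers. The helper `span_loss` is hypothetical, and its `eps` clipping is an assumption added here only to guard against taking the log of zero.

```python
import numpy as np

def span_loss(y_true, prob, length, eps=1e-7):
    """Sum of the cross-entropies for the start half and the end half."""
    start_label, end_label = y_true[:length], y_true[length:]
    start_prob = np.clip(prob[:length], eps, 1.0)
    end_prob = np.clip(prob[length:], eps, 1.0)
    start_loss = -(start_label * np.log(start_prob)).sum()
    end_loss = -(end_label * np.log(end_prob)).sum()
    return start_loss + end_loss

# Toy context of length 4; the true answer starts at 1 and ends at 2.
toy_true = np.array([0, 1, 0, 0, 0, 0, 1, 0], dtype="float32")
toy_prob = np.array([0.1, 0.7, 0.1, 0.1, 0.25, 0.25, 0.25, 0.25])

toy_loss = span_loss(toy_true, toy_prob, length=4)
print(toy_loss, -np.log(0.7) - np.log(0.25))
```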


The slightly more complicated part is defining an accuracy measure. There are several built-in [metrics](https://keras.io/metrics/) in Keras (you can also see the TensorFlow ones [here](https://www.tensorflow.org/api_docs/python/tf/keras/metrics)). But for this model, we have to define a specific one. There is a template [here](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Metric) on how to write a custom metric using the Metric class. In the following "Custom_Accuracy" class, we check which start and end positions are predicted by the model's output. We calculate the accuracy as the fraction of start and end positions that are predicted correctly. 




In [0]:
class Custom_Accuracy(keras.metrics.Metric):

    def __init__(self, nm, name='costum_accuracy',  **kwargs):
      super().__init__(name=name, **kwargs)
      self.nm = nm
      self.accuracy = self.add_weight(name='tp', initializer='zeros')

    def update_state(self, y_true, y_pred,  sample_weight=None):
      len_ = np.shape(y_pred)[1]
      
      y_1 = y_pred[:,:len_ //2]
      y_2 = y_pred[:,len_ //2:]
      # the predicted labels are the ones with the highest probability. 
      y_1 = tf.reshape(tf.argmax(y_1, axis=1), shape=(-1, 1))
      y_2 = tf.reshape(tf.argmax(y_2, axis=1), shape=(-1, 1))

      y_1t = y_true[:,:len_ // 2]
      y_2t = y_true[:,len_ // 2:]
      y_1t = tf.reshape(tf.argmax(y_1t, axis=1), shape=(-1, 1))
      y_2t = tf.reshape(tf.argmax(y_2t, axis=1), shape=(-1, 1))

      values1 = tf.cast(y_1, 'int32') == tf.cast(y_1t, 'int32')
      values1 = tf.cast(values1, 'float32')

      values2 = tf.cast(y_2, 'int32') == tf.cast(y_2t, 'int32')
      values2 = tf.cast(values2, 'float32')

      self.accuracy.assign_add( (tf.reduce_sum(values1+values2)) / (tf.dtypes.cast(2*self.nm, tf.float32)))

    def result(self):
      return self.accuracy
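
The same computation can be sketched in NumPy for one batch. The helper `span_accuracy` is hypothetical and the numbers are made up. One difference from the class above: this sketch divides by the batch size, whereas "Custom_Accuracy" divides every batch by the total number of training samples and accumulates, so its value only reaches the epoch accuracy after the last batch.

```python
import numpy as np

def span_accuracy(y_true, y_pred):
    """Fraction of start and end positions whose argmax matches the label."""
    half = y_pred.shape[1] // 2
    start_hit = y_pred[:, :half].argmax(axis=1) == y_true[:, :half].argmax(axis=1)
    end_hit = y_pred[:, half:].argmax(axis=1) == y_true[:, half:].argmax(axis=1)
    return (start_hit.sum() + end_hit.sum()) / (2 * len(y_true))

# Two toy samples with context length 3.
toy_true = np.array([[0, 1, 0, 0, 0, 1],
                     [1, 0, 0, 0, 1, 0]], dtype="float32")
toy_pred = np.array([[0.1, 0.8, 0.1, 0.2, 0.2, 0.6],   # start and end correct
                     [0.6, 0.3, 0.1, 0.5, 0.4, 0.1]])  # start correct, end wrong

toy_acc = span_accuracy(toy_true, toy_pred)
print(toy_acc)
```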


# Training the Model
---

We use the Dataset class and the function "from_tensor_slices" to convert our NumPy arrays into a more efficient TensorFlow dataset. Pay attention to how we do it for multiple inputs: the keys of the dictionary are the names of the input layers of the model. 

In [0]:
question_padded_ = np.array(question_padded_clean)
context_padded_ = np.array(context_padded_clean)
y_train_ = np.array(y_train)

train_dataset = tf.data.Dataset.from_tensor_slices(( {"input_1" : context_padded_ , "input_2" : question_padded_}, y_train_))

BATCH_SIZE = 256
# SHUFFLE_BUFFER_SIZE = 100

train_dataset = train_dataset.batch(BATCH_SIZE)

model.compile(optimizer="adam", loss=Loss, metrics=[Custom_Accuracy(num_train_data)])
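
To see what such a dataset yields, the sketch below builds one from small made-up arrays and inspects the first batch. The shapes are toy values chosen for illustration; the dictionary keys match the input layer names used above.

```python
import numpy as np
import tensorflow as tf

# Made-up toy arrays: 4 samples, context length 3, question length 2.
toy_context = np.arange(12).reshape(4, 3)
toy_question = np.arange(8).reshape(4, 2)
toy_labels = np.zeros((4, 6), dtype="float32")

toy_dataset = tf.data.Dataset.from_tensor_slices(
    ({"input_1": toy_context, "input_2": toy_question}, toy_labels)
).batch(2)

# Each element is a pair (dictionary of inputs, labels).
toy_features, toy_batch_labels = next(iter(toy_dataset))
print(toy_features["input_1"].shape)
print(toy_features["input_2"].shape)
print(toy_batch_labels.shape)
```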



Now we are ready to train our model. Unfortunately, training this model takes time, and Colab will most likely get disconnected before the end of training, in which case you lose your progress. One way to solve this problem is to use [callbacks](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks). Using callbacks, we store the weights of the model at the end of each epoch. If Colab gets disconnected, we can load the latest weights and resume training from the last epoch. 

We have trained for only 15 epochs, reaching an accuracy of 30%, which is low. We can train for more epochs to improve this accuracy. However, there are more important ways to improve the model, such as using a more complicated attention. 

In [0]:
from tensorflow.keras.callbacks import ModelCheckpoint
filepath = "/content/drive/My Drive/epochs:{epoch:03d}"
checkpoint = ModelCheckpoint(filepath, save_weights_only=True)
callbacks_list = [checkpoint]

### Here we load the weights of an already completed epoch, for example epoch 10. 
# model.load_weights('/content/drive/My Drive/epochs:010')

# model.fit([context_padded_, question_padded_], y_train_, epochs=num_epochs)

num_epochs = 15
in_ip = 0 # You must change this number if you are loading weights from previous epochs. 

history = model.fit(train_dataset, initial_epoch = in_ip , epochs=num_epochs, callbacks=callbacks_list)


Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
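
When resuming, the value of `in_ip` can be read off the saved file names instead of being set by hand. The sketch below is one possible way to do this. The helper `latest_epoch` is hypothetical, and the file names in `toy_files` are made-up examples that assume the "epochs:{epoch:03d}" pattern used above.

```python
import re

def latest_epoch(filenames, prefix="epochs:"):
    """Return the largest epoch number found in the file names, or 0 if none."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)")
    epochs = []
    for name in filenames:
        match = pattern.match(name)
        if match:
            epochs.append(int(match.group(1)))
    return max(epochs, default=0)

# Made-up directory listing for illustration.
toy_files = ["epochs:001.index", "epochs:002.index", "epochs:010.index", "glove.6B.50d.txt"]
toy_in_ip = latest_epoch(toy_files)
print(toy_in_ip)
```

In the notebook, the list could come from `os.listdir('/content/drive/My Drive')`. The saved files are numbered from 1, while `initial_epoch` counts the epochs already completed, so loading 'epochs:010' and passing `initial_epoch=10` continues with "Epoch 11", as in the output above.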
