<h1 style="color:orange;text-align:center;font-family:courier;font-size:280%">Chatbot From Scratch Using Transformers (seq-to-seq)</h1>
<p style="color:orange;text-align:center;font-family:courier">The objective is to understand how to build a seq-to-seq model from scratch and use it to power an interactive chatbot.</p>

### Objectives
* Understand the theory and building blocks of an NLP (Natural Language Processing) pipeline.
* Build a basic understanding of how to use the TensorFlow Keras API to create custom layers and utilities.
* Simplify the way NLP topics are taught, especially encoder-decoder models.
* As an example, we will use a **dialog generation** dataset to train our interactive chatbot.


<p style="text-align:center"><img src="assets/Chatbot.png" alt="Chatbot" width="540"/></p>

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf

from utils import *

from tensorflow.keras import layers
from tensorflow.keras.optimizers import Nadam

from sklearn.model_selection import train_test_split

In [3]:
MAX_LENGTH=25
BATCH_SIZE = 14
BUFFER_SIZE = 2000
EMBEDDING_DIM=512
DENSE_DIM=1024
HEADS=4
EPOCHS=50
AUTO = tf.data.AUTOTUNE

### Loading data and pre-processing

In [4]:
dataset = pd.read_csv("dataset_conver.csv",nrows=4000)
source = "question"
target = "answer"
dataset[source] = dataset[source].map(preprocess_text)
dataset[target] = dataset[target].map(lambda x:preprocess_text(x,block="target"))
dataset = dataset.drop([2],axis=0)
dataset.reset_index(drop=True,inplace=True)

# Flatten all questions and answers into word lists to estimate the vocabulary sizes.
question = [word for sentence in dataset[source] for word in sentence.split()]

answers = [word for sentence in dataset[target] for word in sentence.split()]


source_vocab = len(set(question))
target_vocab = len(set(answers))

source_vectorizer = layers.TextVectorization(max_tokens=source_vocab,output_mode="int")
source_vectorizer.adapt(dataset[source])
target_vectorizer = layers.TextVectorization(max_tokens=target_vocab,output_mode="int")
target_vectorizer.adapt(dataset[target])

train,val = train_test_split(dataset,test_size=0.1,shuffle=True)
source_train_data,target_train_data = dataset_prep(source_vectorizer(train[source]),
                                                    target_vectorizer(train[target]),maxlen=MAX_LENGTH)
source_val_data,target_val_data = dataset_prep(source_vectorizer(val[source]),
                                               target_vectorizer(val[target]),maxlen=MAX_LENGTH)

### Process train and validation for the pipeline

In [5]:
# Shuffle individual examples before batching, then batch, and prefetch last.
train_set = tf.data.Dataset.from_tensor_slices((source_train_data,target_train_data))
train_set = train_set.shuffle(BUFFER_SIZE).map(pipeline,num_parallel_calls=AUTO)
train_set = train_set.batch(BATCH_SIZE,drop_remainder=True).prefetch(AUTO)

val_set = tf.data.Dataset.from_tensor_slices((source_val_data,target_val_data))
val_set = val_set.map(pipeline,num_parallel_calls=AUTO)
val_set = val_set.batch(BATCH_SIZE,drop_remainder=True).prefetch(AUTO)
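
The `pipeline` helper is imported from `utils` and is not shown in this notebook. As a rough sketch of what such a mapping function typically does for an encoder-decoder model, the hypothetical `pipeline_sketch` below shifts the target sequence by one step so the decoder sees the tokens up to position `t` and learns to predict the token at `t + 1` (teacher forcing). The function name, the token ids, and the exact return structure are assumptions for illustration, not the author's implementation.

```python
def pipeline_sketch(source, target):
    """Hypothetical stand-in for `pipeline` from utils (an assumption)."""
    decoder_input = target[:-1]   # what the decoder is fed: "sos w1 w2 ..."
    decoder_target = target[1:]   # what it must predict:   "w1 w2 ... eos"
    inputs = {"encoder_input": source, "decoder_input": decoder_input}
    return inputs, decoder_target


# Made-up ids: 2 stands for "sos", 3 for "eos", 0 for padding.
source_ids = [12, 7, 45, 0, 0]
target_ids = [2, 9, 31, 3, 0, 0]

inputs, labels = pipeline_sketch(source_ids, target_ids)
print(inputs["decoder_input"])  # [2, 9, 31, 3, 0]
print(labels)                   # [9, 31, 3, 0, 0]
```

The dictionary keys match the `encoder_input` and `decoder_input` layer names used when the model is built, which is how Keras can route each tensor to the right input. The same slicing works on tensors inside `tf.data.Dataset.map`.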

### Building layers
<p style="text-align:center"><img src="assets/arch.jpg" alt="Architecture" width="240"/></p>

* The following components are required to build the above architecture:
    * Positional Encoding.
    * Transformer Encoder.
    * Transformer Decoder.

To learn about the internal mechanisms of these layers, see this <a href="https://medium.com/@joshanish/dissecting-transformers-part1-2df55e234b9a">blogpost</a>.
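
Before wrapping it in a Keras layer, the positional encoding can be tried on its own. The NumPy sketch below builds the standard sinusoidal table, where even feature indices use a sine and odd indices use a cosine of the position scaled by `1 / 10000^(2i/depth)`. The function name `sinusoidal_positions` is chosen for illustration only.

```python
import numpy as np


def sinusoidal_positions(sequence_length, depth):
    """Return a (sequence_length, depth) table of sinusoidal position codes."""
    positions = np.arange(sequence_length)[:, np.newaxis]   # (seq, 1)
    dims = np.arange(depth)[np.newaxis, :]                  # (1, depth)
    # Each pair of feature indices (2i, 2i+1) shares one frequency.
    rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(depth))
    angles = positions * rates                              # (seq, depth)
    angles[:, 0::2] = np.sin(angles[:, 0::2])               # even indices
    angles[:, 1::2] = np.cos(angles[:, 1::2])               # odd indices
    return angles.astype(np.float32)


table = sinusoidal_positions(sequence_length=25, depth=512)
print(table.shape)   # (25, 512)
print(table[0, :4])  # position 0 gives sin(0)=0 and cos(0)=1: [0. 1. 0. 1.]
```

Because the table depends only on the sequence length and the embedding depth, it can be computed once and added to every batch of token embeddings by broadcasting.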

In [5]:
class Postional_Encoding(tf.keras.layers.Layer):
    def __init__(self,embedding_depth,vocab,sequence_length):
        super(Postional_Encoding,self).__init__()
        self.embedding_depth = embedding_depth
        self.sequence_length = sequence_length
        self.embed = tf.keras.layers.Embedding(vocab,embedding_depth)
        
    def call(self,data):
        # Sinusoidal table: even feature indices use sin, odd indices use cos.
        embeds = np.arange(self.embedding_depth)[np.newaxis,:]
        embeds = 1 / np.power(10000, (2 * (embeds//2)) / np.float32(self.embedding_depth))
        location_id = np.arange(self.sequence_length)[:,np.newaxis]
        pos = embeds*location_id
        pos[:,::2] = np.sin(pos[:,::2])
        pos[:,1::2] = np.cos(pos[:,1::2])
        # Shape (1, sequence_length, embedding_depth); broadcasts over the batch.
        pos = tf.cast(pos[np.newaxis,:,:],tf.float32)
        embed = self.embed(data)
        return embed+pos

    def compute_mask(self,data,mask=None):
        return tf.not_equal(0,data)
    
    
class Transformer_Encoder(tf.keras.layers.Layer):
    def __init__(self,embedding_depth,dense_dim,heads=2,**kwargs):
        super(Transformer_Encoder,self).__init__(**kwargs)
        self.attention = tf.keras.layers.MultiHeadAttention(num_heads=heads,key_dim=embedding_depth)
        self.dense_proj = tf.keras.Sequential([tf.keras.layers.Dense(dense_dim, activation="relu"), 
                                               tf.keras.layers.Dense(embedding_depth),])
        self.layernorm_1 = tf.keras.layers.LayerNormalization()
        self.layernorm_2 = tf.keras.layers.LayerNormalization()
        self.supports_masking = True
        
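    def call(self,x,mask=None):
        # Expand the (batch, seq) padding mask to (batch, 1, seq) so that padded
        # keys are ignored by every query position.
        padding_mask = None
        if mask is not None:
            padding_mask = tf.cast(mask[:,tf.newaxis,:],tf.int32)
        attention_out = self.attention(x,x,x,attention_mask=padding_mask)
        layernorm1 = self.layernorm_1(attention_out+x)
        denseproj  = self.dense_proj(layernorm1)
        layernorm2 = self.layernorm_2(denseproj+layernorm1)
        return layernorm2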

    
class Transformer_Decoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super(Transformer_Decoder, self).__init__(**kwargs)
        self.attention_1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = tf.keras.Sequential(
            [layers.Dense(latent_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm_1 = tf.keras.layers.LayerNormalization()
        self.layernorm_2 = tf.keras.layers.LayerNormalization()
        self.layernorm_3 = tf.keras.layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.generate_causal_mask(inputs)
        # Default to no padding mask so the layer also works when no mask is passed.
        padding_mask = None
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        # Masked self-attention: each position attends only to itself and earlier positions.
        attention_output_1 = self.attention_1(query=inputs, value=inputs, key=inputs, attention_mask=causal_mask)
        out_1 = self.layernorm_1(inputs + attention_output_1)

        # Cross-attention over the encoder outputs; the attention weights are returned for inspection.
        attention_output_2,att_weights = self.attention_2(query=out_1,value=encoder_outputs,key=encoder_outputs,attention_mask=padding_mask,
                                                         return_attention_scores=True)
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        # Position-wise feed-forward block with a residual connection.
        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output),att_weights
    
    def generate_causal_mask(self,inputs):
        batch_size,seq_length = tf.shape(inputs)[0],tf.shape(inputs)[1]
        x = tf.range(seq_length)
        y = tf.range(seq_length)[:,tf.newaxis]
        causal_mask = tf.cast(y>=x,dtype="int32")[tf.newaxis,:,:]
        causal_mask = tf.tile(causal_mask,(batch_size,1,1))
        return causal_mask
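
To see what the masks look like, here is a small NumPy sketch of the same idea used in `generate_causal_mask`, together with the `tf.minimum` step that merges it with a padding mask. The helper names and the token ids are made up for illustration; a `1` means "may attend" and a `0` means "blocked".

```python
import numpy as np


def causal_mask(seq_length):
    """Lower-triangular mask: row i may attend to columns 0..i."""
    rows = np.arange(seq_length)[:, np.newaxis]
    cols = np.arange(seq_length)[np.newaxis, :]
    return (rows >= cols).astype("int32")


def combine_with_padding(token_ids):
    """Merge the causal mask with a padding mask (token id 0 = padding)."""
    token_ids = np.asarray(token_ids)
    padding = (token_ids != 0).astype("int32")[:, np.newaxis, :]  # (batch, 1, seq)
    causal = causal_mask(token_ids.shape[1])[np.newaxis, :, :]    # (1, seq, seq)
    return np.minimum(padding, causal)                            # (batch, seq, seq)


print(causal_mask(4))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

mask = combine_with_padding([[5, 8, 0, 0]])
print(mask[0])
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 0 0]
#  [1 1 0 0]]
```

In the combined mask the last two columns are zero in every row, so no position can attend to the padded tokens, while the triangular shape still prevents looking ahead.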

### Chaining our layers to build our complete model.

In [6]:
def Transformer_Model(EMBEDDING_DEPTH,DENSE,VOCAB,LENGTH,HEADS=2):
    encoder_inp = layers.Input(shape=(None,),dtype=tf.int32,name="encoder_input")
    decoder_inp = layers.Input(shape=(None,),dtype=tf.int32,name="decoder_input")
    encoder_pos_embed = Postional_Encoding(EMBEDDING_DEPTH,VOCAB,LENGTH)(encoder_inp)
    encoder_attention1 = Transformer_Encoder(EMBEDDING_DEPTH,DENSE,heads=HEADS)(encoder_pos_embed)
    
    decoder_pos_embed = Postional_Encoding(EMBEDDING_DEPTH,VOCAB,LENGTH)(decoder_inp)
    decoder_attention1,_ = Transformer_Decoder(EMBEDDING_DEPTH,DENSE,HEADS)(decoder_pos_embed,encoder_attention1)
    decoder_attention2,attention_weights = Transformer_Decoder(EMBEDDING_DEPTH,DENSE,HEADS)(decoder_attention1,encoder_attention1)
    decoder_attention2 = layers.Dropout(0.3)(decoder_attention2)
    output = tf.keras.layers.Dense(VOCAB,activation="softmax")(decoder_attention2)

    model = tf.keras.Model((encoder_inp,decoder_inp),output)
    return model

In [7]:
vocab = len(target_vectorizer.get_vocabulary())
transformer = Transformer_Model(EMBEDDING_DEPTH=EMBEDDING_DIM,
                                DENSE=DENSE_DIM,VOCAB=vocab,
                                HEADS=HEADS,LENGTH=MAX_LENGTH,)

transformer.summary()
transformer.compile(
    tf.keras.optimizers.Adam(1e-4), loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit(train_set, epochs=EPOCHS,validation_data=val_set)
transformer.save_weights("model1.h5")

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input (InputLayer)      [(None, None)]       0                                            
__________________________________________________________________________________________________
decoder_input (InputLayer)      [(None, None)]       0                                            
__________________________________________________________________________________________________
postional__encoding (Postional_ (None, 25, 512)      1997312     encoder_input[0][0]              
__________________________________________________________________________________________________
postional__encoding_1 (Postiona (None, 25, 512)      1997312     decoder_input[0][0]              
______________________________________________________________________________________________

#### Note:
Accuracy is not the right metric for tracking the progress of a seq-to-seq model unless the model is restricted to a very limited set of intents. The BLEU score is better suited here: it measures how closely the model's responses overlap with reference responses. The following explanation by Andrew Ng covers it well.<br>
https://www.youtube.com/watch?v=9ZvTxChwg9A&list=PL1w8k37X_6L_s4ncq-swTBvKDWnRSrinI&index=29
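
To make the idea concrete, here is a simplified, single-reference BLEU sketch in plain Python. It combines clipped (modified) n-gram precisions with a brevity penalty, using only unigrams and bigrams because chatbot replies are short. The function names and the example sentences are illustrative assumptions; for real evaluation an established implementation with smoothing and multiple references would be a better choice.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1 - r / c)  # penalise short candidates
    return brevity * geo_mean


reference = "you know what you did".split()
candidate = "you know what i did".split()
score = simple_bleu(candidate, reference)
print(round(score, 3))  # 0.632
```

In this example four of the five unigrams and two of the four bigrams match the reference, so the score is the square root of 0.8 × 0.5. A word-level accuracy would only tell us that one position is wrong, whereas BLEU also reflects how much of the phrasing is preserved.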

### Building our inference pipeline
Yes, we can chat now!

In [23]:
questions = ["a stranger followed me into a dark tunnel","I was having a good time","his friend acted stupid at that very situation","are you funny"]

In [24]:
# Build the id -> word lookup once; it does not change between questions.
starter,ender = "sos","eos"
tar_vocab = target_vectorizer.get_vocabulary()
target_inv = dict(zip(range(len(tar_vocab)),tar_vocab))

for input_text in questions:
    input_text = preprocess_text(input_text)

    print(f" Question : {input_text}")
    vectorize_sentence = source_vectorizer([input_text])
    encode_sent = tf.keras.preprocessing.sequence.pad_sequences(vectorize_sentence,
                                                                maxlen=MAX_LENGTH,padding="post")

    # Greedy decoding: start from "sos" and append the most likely next word at each step.
    output_seq = starter
    for i in range(MAX_LENGTH):
        decode_sent = tf.keras.preprocessing.sequence.pad_sequences(target_vectorizer([output_seq]),
                                                                    maxlen=MAX_LENGTH,padding="post")
        pred = transformer.predict((encode_sent,decode_sent))

        # Position i holds the prediction for the word that follows the i-th decoder token.
        word = target_inv[np.argmax(pred[0,i,:])]
        if word == ender:
            break
        output_seq += " "+word

    # Drop the leading "sos" marker before printing.
    response = " ".join(output_seq.split()[1:])
    print(f"Response : {response}")
    print("-"*80)

 Question : a stranger followed me into a dark tunnel
Response : i think you dont know that easy but you did
--------------------------------------------------------------------------------
 Question : i was having a good time
Response : you know what you did
--------------------------------------------------------------------------------
 Question : his friend acted stupid at that very situation
Response : i dont know what the world is he are with the night
--------------------------------------------------------------------------------
 Question : are you funny
Response : no no i thought you want to say
--------------------------------------------------------------------------------
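
The loop above is a form of greedy decoding: at every step the single most probable next word is appended and the extended sequence is fed back into the decoder. The sketch below isolates that control flow from the Keras model. `greedy_decode` and `toy_model` are hypothetical names, and the toy model simply predicts "previous id + 1" so that the result is easy to follow.

```python
import numpy as np


def greedy_decode(next_token_probs, start_id, end_id, max_length):
    """Repeatedly pick the most likely next token until end_id or max_length."""
    tokens = [start_id]
    for _ in range(max_length):
        probs = next_token_probs(tokens)   # distribution over the vocabulary
        next_id = int(np.argmax(probs))    # greedy choice
        if next_id == end_id:
            break
        tokens.append(next_id)
    return tokens[1:]                      # drop the start marker


def toy_model(tokens, vocab_size=6):
    """Hypothetical stand-in for the transformer: always predicts last id + 1."""
    probs = np.zeros(vocab_size)
    probs[min(tokens[-1] + 1, vocab_size - 1)] = 1.0
    return probs


print(greedy_decode(toy_model, start_id=1, end_id=5, max_length=25))  # [2, 3, 4]
```

Greedy decoding is simple and deterministic, but it can lock onto generic replies. Sampling or beam search are common alternatives when more varied responses are wanted.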


### Conclusion:
As we can see, the chatbot is able to hold a rudimentary conversation. It can be tuned with different datasets and hyperparameter settings to make it more suitable for real-world applications. I hope this walkthrough has covered the most critical parts of building an end-to-end pipeline and helps readers explore the NLP world in more detail.