# **ChatBOT**

## Objective
---
*Chatbots are versatile tools that can be used for a wide range of applications, and the choice of technology depends on the specific use case. The integration of AI technologies, like NLP and ML, can significantly enhance a chatbot's ability to understand and respond to user queries effectively.*

## Aim
---
*We build a chatbot from the ground up using a Transformer model.*

In [None]:
import numpy as np
import pandas as pd
import tensorflow_datasets as tfds
import tensorflow as tf

## Loading and cleaning Dialogs dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/ChatBot/dialogs_expanded.csv', encoding='latin-1')
dataset = df[['question','answer']]
dataset.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


| | question | answer |
|---|---|---|
| 0 | Well, I thought we'd start with pronunciation,... | Not the hacking and gagging and spitting part.... |
| 1 | Not the hacking and gagging and spitting part.... | Okay... then how 'bout we try out some French ... |
| 2 | You're asking me out. That's so cute. What's ... | Forget it. |
| 3 | No, no, it's my fault -- we didn't have a prop... | Cameron. |
| 4 | Gosh, if only we could find Kat a boyfriend... | Let me see what I can do. |


## Splitting into train and validation sets

In [None]:
from sklearn.model_selection import train_test_split
train, validation = train_test_split(dataset, test_size=0.2, random_state=4)

In [None]:
vocab_ans = list(set(" ".join(train['answer'].values).split()))
vocab_ques = list(set(" ".join(train['question'].values).split()))
vocab_size_ans, vocab_size_ques = len(vocab_ans), len(vocab_ques)
print(f"vocab_size_ans, vocab_size_ques:{vocab_size_ans},{ vocab_size_ques}")

vocab_size_ans, vocab_size_ques:69033,69156
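
The counts above come from splitting on whitespace only, so punctuation stays attached to words and case is preserved. The sketch below applies the same recipe to a few made-up sentences (`toy_answers` is hypothetical, not taken from the dialogs dataset) to show how one word can end up as several vocabulary entries:

```python
# Hypothetical toy sentences, not from the dialogs dataset.
toy_answers = ["Forget it.", "forget it", "Forget it!"]

# Same recipe as the cell above: join, split on whitespace, deduplicate.
toy_vocab = set(" ".join(toy_answers).split())

# "it", "it." and "it!" are counted as three different words,
# and "Forget" / "forget" as two.
print(sorted(toy_vocab))
print(len(toy_vocab))
```

This inflation is one reason a word-level vocabulary of this kind can grow large, which motivates the subword approach in the next section.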


## Creating subword tokens with tfds SubwordTextEncoder

**Example: Multiplication -> Multi, pli, cat, i, on**

**Advantages:**
1. Reduces the vocabulary size, which speeds up learning
2. Reduces the chance of hitting an unknown word in the test data
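
To make the idea concrete, here is a minimal sketch of greedy longest-match subword splitting. It is a simplified illustration and not the library's actual implementation; the function `greedy_subword_split` and the vocabulary `toy_subwords` are hypothetical names introduced only for this example.

```python
def greedy_subword_split(word, subwords):
    """Split `word` into the longest matching pieces from `subwords`.

    Falls back to a single character when no known piece matches,
    so any word can always be encoded.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one matches.
        for end in range(len(word), start, -1):
            candidate = word[start:end]
            if candidate in subwords or end == start + 1:
                pieces.append(candidate)
                start = end
                break
    return pieces


# Hypothetical subword vocabulary, chosen to reproduce the example above.
toy_subwords = {"Multi", "pli", "cat", "on", "code", "de"}

print(greedy_subword_split("Multiplication", toy_subwords))
print(greedy_subword_split("decode", toy_subwords))
```

The single-character fallback is what keeps unseen words encodable: a word missing from the vocabulary is still covered by shorter pieces.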

In [None]:
tokenizer_a = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    train['answer'], target_vocab_size=2**17)

tokenizer_q = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    train['question'], target_vocab_size=2**17)

In [None]:
print(f"tokenizer_q:{tokenizer_q.vocab_size}")
print(f"tokenizer_a:{tokenizer_a.vocab_size}")

tokenizer_q:43958
tokenizer_a:44082


**Examples of subword tokenization in action!**

In [None]:
sample_string = 'Encoder decoder'

tokenized_string = tokenizer_a.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer_a.decode(tokenized_string)
print ('The original string: {}'.format(original_string))

for token in tokenized_string:
    print(str(token) + "---->" + tokenizer_a.decode([token]))

print("="*80)
tokenized_string = tokenizer_q.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer_q.decode(tokenized_string)
print ('The original string: {}'.format(original_string))

for token in tokenized_string:
    print(str(token) + "---->" + tokenizer_q.decode([token]))

Tokenized string is [43895, 43936, 3495, 12064, 5439, 3495, 43940]
The original string: Encoder decoder
43895---->E
43936---->n
3495---->code
12064---->r 
5439---->de
3495---->code
43940---->r
Tokenized string is [43771, 43812, 3050, 12091, 30962, 43816]
The original string: Encoder decoder
43771---->E
43812---->n
3050---->code
12091---->r 
30962---->decode
43816---->r


In [None]:
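def encode(ques, ans):
    # Wrap each sequence with a start id (vocab_size) and an end id (vocab_size + 1).
    # Both ids lie just past the tokenizer's own id range, so they cannot clash with subwords.
    ques = [tokenizer_q.vocab_size] + tokenizer_q.encode(ques.numpy()) + [tokenizer_q.vocab_size+1]
    ans = [tokenizer_a.vocab_size] + tokenizer_a.encode(ans.numpy()) + [tokenizer_a.vocab_size+1]
    return ques, ans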

def tf_encode(ques, ans):
    result_ques, result_ans = tf.py_function(encode, [ques, ans], [tf.int64, tf.int64])
    result_ques.set_shape([None])
    result_ans.set_shape([None])
    return result_ques, result_ans

In [None]:
print(train['question'].values[0],"\n",train['answer'].values[0])
question, answer = tf_encode(train['question'].values[0],train['answer'].values[0])
print(question)
print(answer)

And the fifty's all gone, huh? Who's the ten for? 
 The Websters.
tf.Tensor(
[43958    69     5  1383 43741     6    60   630     1   318    35   342
 43741     6     5   590   275 43765 43959], shape=(19,), dtype=int64)
tf.Tensor([44082    54 18800 43872 44083], shape=(5,), dtype=int64)
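
The first and last ids in each tensor above are the start and end markers added by `encode`. The sketch below reproduces that wrapping in plain Python, using the ids from the printed answer tensor; `add_boundary_ids` is a hypothetical helper written for this illustration, not part of the notebook.

```python
def add_boundary_ids(token_ids, vocab_size):
    """Wrap token ids with a start id (vocab_size) and an end id (vocab_size + 1)."""
    start_id = vocab_size
    end_id = vocab_size + 1
    return [start_id] + list(token_ids) + [end_id]


# Values copied from the answer tensor printed above.
answer_vocab_size = 44082
answer_ids = [54, 18800, 43872]

wrapped = add_boundary_ids(answer_ids, answer_vocab_size)
print(wrapped)

# Two extra ids are in use, so anything indexed by token id
# (for example an embedding table) needs vocab_size + 2 rows.
print(max(wrapped) + 1)
```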


## Creating train_dataset/val_dataset objects from the DataFrame, with padding

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices(dict(train))
train_dataset = train_dataset.map(lambda x:tf_encode(x['question'], x['answer']))
train_dataset = train_dataset.cache()
train_dataset = train_dataset.shuffle(20000).padded_batch(64, padded_shapes=([None],[None]))
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)

In [None]:
val_dataset = tf.data.Dataset.from_tensor_slices(dict(validation))
val_dataset = val_dataset.map(lambda x:tf_encode(x['question'], x['answer']))
val_dataset = val_dataset.padded_batch(64, padded_shapes=([None],[None]))

In [None]:
question, answer = next(iter(train_dataset))
question

<tf.Tensor: shape=(64, 28), dtype=int64, numpy=
array([[43958,    81,     4, ...,     0,     0,     0],
       [43958,   164,   105, ...,     0,     0,     0],
       [43958,     2,   177, ...,     0,     0,     0],
       ...,
       [43958,   211,  1144, ...,     0,     0,     0],
       [43958,    80,     1, ...,     0,     0,     0],
       [43958,     2,   218, ...,     0,     0,     0]])>
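
The trailing zeros in the batch above are padding: with `padded_shapes=([None],[None])`, each batch is padded to the length of its longest sequence, which is 28 in this batch. A plain-Python sketch of the same idea follows. The helpers `pad_batch` and `padding_mask` are hypothetical, and the mask is an assumption about how the padding would typically be used later by a Transformer, not something shown in this section.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with `pad_id` up to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]


def padding_mask(padded, pad_id=0):
    """Return 1 where a position holds padding and 0 where it holds a real token."""
    return [[1 if tok == pad_id else 0 for tok in row] for row in padded]


# Toy batch: start id 43958 and end id 43959 as in the question tensors above.
toy_batch = [[43958, 81, 4, 43959], [43958, 164, 43959]]

padded = pad_batch(toy_batch)
print(padded)
print(padding_mask(padded))
```

Using 0 as the pad value is safe here only because no real token is assigned id 0 in the tensors shown above.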