# Developing a Simple Chatbot Using TensorFlow
## A Step-by-Step Tutorial

source: 
- (04/2023) https://handsonai.medium.com/build-a-chat-bot-from-scratch-using-python-and-tensorflow-fd189bcfae45
- (12/2023) https://handsonai.medium.com/developing-a-simple-chatbot-with-python-and-tensorflow-a-step-by-step-tutorial-0d35767e113b


In [1]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input
from tensorflow.keras.optimizers import Adam

### 1. Prepare data

In [2]:
# 1. Create sample dataset
conversations = [
    ("Hello", "Hi there!"),
    ("How are you?", "I'm doing well. Thank you."),
    ("What's your name?", "I'm Richard Wyckoff."),
    ("What do you do for a living?", "I'm a successful stock trader."),
    # Add more conversational pairs as needed
]

# 2. Extract inputs and outputs
inputs = [conversation[0] for conversation in conversations]  
outputs = [conversation[1] for conversation in conversations]

`inputs` is a list with 4 elements, one user utterance per conversation pair:
```Python
inputs = ['Hello', 'How are you?', "What's your name?", 'What do you do for a living?']
```
`outputs` is a list with 4 elements, the matching bot replies:
```Python
outputs = ['Hi there!', "I'm doing well. Thank you.", "I'm Richard Wyckoff.", "I'm a successful stock trader."]
```

In [3]:
# 3. Tokenizer for input and output texts
input_tokenizer = Tokenizer()
input_tokenizer.fit_on_texts(inputs)
input_sequences = input_tokenizer.texts_to_sequences(inputs)
# input_sequences is a list with 4 elements. Each element is a list. 
# input_sequences = [[3], [4, 5, 1], [6, 7, 8], [9, 2, 1, 2, 10, 11, 12]]
# e.g. you -> 1, do -> 2

output_tokenizer = Tokenizer()
output_tokenizer.fit_on_texts(outputs)
output_sequences = output_tokenizer.texts_to_sequences(outputs)
# output_sequences is a list of 4 lists. Each inner list represents one sentence.
# output_sequences = [[2, 3], [1, 4, 5, 6, 7], [1, 8, 9], [1, 10, 11, 12, 13]]

In [4]:
output_tokenizer.word_index

{"i'm": 1,
 'hi': 2,
 'there': 3,
 'doing': 4,
 'well': 5,
 'thank': 6,
 'you': 7,
 'richard': 8,
 'wyckoff': 9,
 'a': 10,
 'successful': 11,
 'stock': 12,
 'trader': 13}

In the step above, each sentence is lowercased, stripped of punctuation, and split into words, and each word is mapped to an integer through the tokenizer's `word_index`. Indices start at 1 and are assigned in order of decreasing word frequency; 0 is reserved for padding. You can inspect the mapping through the `word_index` attribute of each tokenizer:
``` Python
input_tokenizer.word_index
# {'you': 1, 'do': 2, 'hello': 3, 'how': 4, 'are': 5, "what's": 6, 'your': 7, 'name': 8, 'what': 9, 'for': 10, 'a': 11, 'living': 12}
output_tokenizer.word_index
# {"i'm": 1, 'hi': 2, 'there': 3, 'doing': 4, 'well': 5, 'thank': 6, 'you': 7, 'richard': 8, 'wyckoff': 9, 'a': 10, 'successful': 11, 'stock': 12, 'trader': 13}
```
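
To make the mapping less of a black box, here is a plain-Python sketch that approximates what `fit_on_texts` and `texts_to_sequences` do with their default settings. The helper names (`split_words`, `build_word_index`, `to_sequences`) are made up for this illustration and are not part of Keras; the sketch assumes the default filter characters and ignores options such as `num_words` and `oov_token`.

```python
# Plain-Python approximation of the Keras Tokenizer defaults (illustrative only).
# FILTERS mirrors the default `filters` argument. The apostrophe is not in it,
# which is why "what's" and "i'm" survive as single tokens.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TO_SPACES = str.maketrans(FILTERS, " " * len(FILTERS))


def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    return text.lower().translate(_TO_SPACES).split()


def build_word_index(texts):
    """Rank words by frequency (ties keep first-seen order), starting at index 1."""
    counts = {}
    for text in texts:
        for word in split_words(text):
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so equally frequent words stay in order of appearance.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in split_words(t) if w in word_index] for t in texts]


inputs = ["Hello", "How are you?", "What's your name?", "What do you do for a living?"]
word_index = build_word_index(inputs)
sequences = to_sequences(inputs, word_index)
print(word_index)
print(sequences)
```

For the four sample inputs this reproduces the `input_tokenizer.word_index` mapping and the `input_sequences` shown earlier: `you` and `do` occur twice, so they receive the lowest indices.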

In [5]:
# 4. Find maximum sequence lengths
max_input_length = max(len(seq) for seq in input_sequences) # 7 
max_output_length = max(len(seq) for seq in output_sequences) # 5
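# Note: this model expects inputs and outputs padded to the same length,
# so use the larger of the two. Assign it unconditionally so that
# max_length is also defined when both lengths happen to be equal.
max_length = max(max_input_length, max_output_length) # 7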

In [6]:
# 5. Pad sequences
X = pad_sequences(input_sequences, maxlen = max_length, padding='post')
y = pad_sequences(output_sequences, maxlen = max_length, padding='post')
# Padding appends zeros to each sequence so that every sentence becomes a row of length 7:
# Given input_sequences = [[3], [4, 5, 1], [6, 7, 8], [9, 2, 1, 2, 10, 11, 12]], 
# X = array([[ 3,  0,  0,  0,  0,  0,  0],
#       [ 4,  5,  1,  0,  0,  0,  0],
#       [ 6,  7,  8,  0,  0,  0,  0],
#       [ 9,  2,  1,  2, 10, 11, 12]])
# Given output_sequences = [[2, 3], [1, 4, 5, 6, 7], [1, 8, 9], [1, 10, 11, 12, 13]],
# y = array([[ 2,  3,  0,  0,  0,  0,  0],
#       [ 1,  4,  5,  6,  7,  0,  0],
#       [ 1,  8,  9,  0,  0,  0,  0],
#       [ 1, 10, 11, 12, 13,  0,  0]])
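
As a rough illustration of what `padding='post'` does, the sketch below pads the same sequences by hand using plain lists. `pad_post` is a hypothetical helper written for this example; unlike `pad_sequences`, it does not truncate sequences longer than `maxlen` and simply raises an error instead.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` until it has `maxlen` items."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            # Truncation is deliberately left out of this sketch.
            raise ValueError("sequence is longer than maxlen")
        padded.append(list(seq) + [value] * (maxlen - len(seq)))
    return padded


input_sequences = [[3], [4, 5, 1], [6, 7, 8], [9, 2, 1, 2, 10, 11, 12]]
output_sequences = [[2, 3], [1, 4, 5, 6, 7], [1, 8, 9], [1, 10, 11, 12, 13]]

# Use one shared length for both sides, as in the tutorial.
max_length = max(len(seq) for seq in input_sequences + output_sequences)

X_rows = pad_post(input_sequences, max_length)
y_rows = pad_post(output_sequences, max_length)
for row in X_rows + y_rows:
    print(row)
```

The printed rows should line up with the `X` and `y` arrays listed in the comments above, with zeros filling the unused positions on the right.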

In [7]:
# 6. Define model parameters
vocab_size_input = len(input_tokenizer.word_index) + 1 # 13: 12 distinct input words + 1 for the padding index 0
vocab_size_output = len(output_tokenizer.word_index) + 1 # 14: 13 distinct output words + 1 for the padding index 0
embedding_dim = 64
hidden_units = 100
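
These four numbers determine how many weights the model will have. The sketch below works the arithmetic out by hand, assuming the standard parameter-count formulas for embedding, LSTM (with bias), and dense layers; the function names are invented for this example. The two embedding results match the `embedding` and `embedding_1` rows of the model summary printed further down, while the LSTM and dense figures are not visible in that truncated summary and are therefore only estimates from the formulas.

```python
embedding_dim = 64
hidden_units = 100
vocab_size_input = 12 + 1   # 12 input words + padding index 0
vocab_size_output = 13 + 1  # 13 output words + padding index 0


def embedding_params(vocab_size, dim):
    """One vector of length `dim` per vocabulary index."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per output unit."""
    return input_dim * units + units


print("encoder embedding:", embedding_params(vocab_size_input, embedding_dim))
print("decoder embedding:", embedding_params(vocab_size_output, embedding_dim))
print("each LSTM layer:  ", lstm_params(embedding_dim, hidden_units))
print("output dense:     ", dense_params(hidden_units, vocab_size_output))
```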

### 2. Build and train an LSTM encoder-decoder model

In [8]:
from tensorflow.keras.models import Model

# Model 2
# Encoder
encoder_inputs = Input(shape=(max_length,))
encoder_embedding = Embedding(vocab_size_input, embedding_dim, mask_zero=True)(encoder_inputs)
encoder_lstm, state_h, state_c = LSTM(hidden_units, return_state=True)(encoder_embedding)
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(max_length,))
decoder_embedding = Embedding(vocab_size_output, embedding_dim, mask_zero=True)(decoder_inputs)
decoder_lstm = LSTM(hidden_units, return_sequences=True, return_state=False)(decoder_embedding, initial_state=encoder_states)
decoder_dense = Dense(vocab_size_output, activation='softmax')
decoder_outputs = decoder_dense(decoder_lstm)

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model
model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Print model summary
model.summary()

# Prepare the target data for training
# The output needs to be reshaped for the decoder's output. We need to expand the dimensions of `y` for sparse categorical crossentropy.
y = np.expand_dims(y, -1)

# Train the model
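# Note: X (the encoder input) is also fed to the decoder here. This is not teacher
# forcing in the usual sense, which would feed the target sequence shifted right by one step.
model.fit([X, X], y, batch_size=32, epochs=100)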

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 7)]          0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 7)]          0           []                               
                                                                                                  
 embedding (Embedding)          (None, 7, 64)        832         ['input_1[0][0]']                
                                                                                                  
 embedding_1 (Embedding)        (None, 7, 64)        896         ['input_2[0][0]']                
                                                                                              

<keras.callbacks.History at 0x1f651cc0f10>
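
The training call above passes `X` as the decoder input. In teacher forcing as it is usually described, the decoder instead receives the target sequence shifted right by one position, beginning with a start-of-sequence token, so that at each step it sees the previous correct word. The sketch below builds such a decoder input from the padded targets using plain lists. `START_ID` is a hypothetical token id chosen for this example (the tutorial's tokenizers define no start token), and `shift_right` is an illustrative helper, not a Keras function.

```python
def shift_right(target_rows, start_id):
    """Prepend `start_id` and drop the last item so the length is unchanged."""
    return [[start_id] + row[:-1] for row in target_rows]


# Padded targets, copied from the padding step of the tutorial.
y_rows = [
    [2, 3, 0, 0, 0, 0, 0],
    [1, 4, 5, 6, 7, 0, 0],
    [1, 8, 9, 0, 0, 0, 0],
    [1, 10, 11, 12, 13, 0, 0],
]

START_ID = 14  # hypothetical: the first id not used by the output vocabulary

decoder_input_rows = shift_right(y_rows, START_ID)
for target, shifted in zip(y_rows, decoder_input_rows):
    print("decoder input:", shifted, "-> target:", target)
```

Using this in the tutorial's model would require a few further changes that are not shown here: the decoder `Embedding` would need one extra row for the new id, the `fit` call would take `[X, decoder_input]` as a NumPy array, and prediction would have to generate the reply one token at a time.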

In [9]:
def predict_response(input_text):
    # Preprocess input text
    input_seq = input_tokenizer.texts_to_sequences([input_text])
    # Pad to max_length, the sequence length the model was built with
    input_seq = pad_sequences(input_seq, maxlen=max_length, padding='post')
    
    # Predict response
    # prediction = model.predict(input_seq) # model 1
    prediction = model.predict([input_seq, input_seq]) # model 2
    
    # Convert prediction to text
    predicted_seq = np.argmax(prediction, axis=-1)
    response_words = [output_tokenizer.index_word.get(idx, '') for idx in predicted_seq[0]]
    response_text = ' '.join(response_words)
    
    return response_text
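
The conversion step can be pictured without a trained model. The model returns one probability distribution over the output vocabulary per time step; `np.argmax(..., axis=-1)` picks the most likely index at each step, and `index_word` turns indices back into words. Below is a minimal plain-Python sketch of that idea on made-up probabilities. The `decode` helper and the toy numbers are assumptions for illustration, and unlike `predict_response` this version drops the padding index 0 instead of emitting an empty string for it.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def decode(prob_rows, index_word):
    """Pick the most likely id per time step and join the matching words."""
    ids = [argmax(row) for row in prob_rows]
    words = [index_word[i] for i in ids if i != 0 and i in index_word]
    return ids, " ".join(words)


# Toy vocabulary: index 0 is padding, 1..3 are words.
index_word = {1: "i'm", 2: "hi", 3: "there"}

# One made-up probability row per time step (columns = ids 0..3).
toy_probs = [
    [0.10, 0.20, 0.60, 0.10],  # most likely id: 2
    [0.10, 0.10, 0.10, 0.70],  # most likely id: 3
    [0.90, 0.05, 0.03, 0.02],  # most likely id: 0 (padding)
]

ids, text = decode(toy_probs, index_word)
print(ids, repr(text))
```

Because every time step is decoded independently, nothing stops the same word from being chosen at several consecutive steps.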

#### Test 1

In [10]:
# Test the chatbot
test_input = "What's your name?"
response = predict_response(test_input)
print(f"Bot: {response}")

Bot: i'm richard wyckoff wyckoff wyckoff wyckoff wyckoff


#### Test 2

In [11]:
# Start chatbot
# while True:
test_input = input('You: ')
response = predict_response(test_input)
print(f"Bot: {response}")

You:  What is today's date


Bot: i'm i'm i'm i'm i'm i'm i'm
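
One reason the second test goes wrong is visible before the model is even called: most words in the question never appeared in the training inputs. A Keras `Tokenizer` created without an `oov_token` silently skips unknown words, so very little of the question reaches the model. The sketch below imitates that behaviour in plain Python using the input vocabulary listed earlier; `encode` is a hypothetical helper for this example and only approximates the tokenizer's default filtering.

```python
# Default Keras filter characters (the apostrophe is kept).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TO_SPACES = str.maketrans(FILTERS, " " * len(FILTERS))

# Input vocabulary as printed earlier in the tutorial.
input_word_index = {
    'you': 1, 'do': 2, 'hello': 3, 'how': 4, 'are': 5, "what's": 6,
    'your': 7, 'name': 8, 'what': 9, 'for': 10, 'a': 11, 'living': 12,
}


def encode(text, word_index):
    """Return (known ids, skipped words) for one sentence."""
    words = text.lower().translate(_TO_SPACES).split()
    known = [word_index[w] for w in words if w in word_index]
    skipped = [w for w in words if w not in word_index]
    return known, skipped


seen_ids, seen_skipped = encode("What's your name?", input_word_index)
new_ids, new_skipped = encode("What is today's date", input_word_index)

print("training question:", seen_ids, "skipped:", seen_skipped)
print("unseen question:  ", new_ids, "skipped:", new_skipped)
```

Only `what` survives from the unseen question, so after padding the model receives a single known token followed by zeros, an input it never saw during training on four sentence pairs.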
