<h1 style="text-align: center;text-transform: uppercase;">Conversational Based Agent</h1>

<br>

In this project, you will build an end-to-end voice conversational agent that takes spoken audio as input and synthesizes a spoken response. The agent runs locally on your computer.

<img style="width:550px; height:300px;" src="assets/intro.png">

This project consists of the following parts:
1. __Speech Recognition:__ <br>In this part, you will build a speech recognizer that converts your voice into text.<br><br>
2. __Chatbot:__ <br>This is the core of your conversational agent. You will build a chatbot that answers your questions. <br><br>
3. __Text to Speech:__ <br>In this part, you will convert the chatbot's answer into speech. <br><br>
4. __Finalize your Conversational Based Agent:__ <br>In the final step, you will put everything together to create your Conversational Based Agent.

<br>

# 2. Chatbot

---


Now that you have a trained Seq2Seq model, it is time to implement the prediction module.

Prediction with a Seq2Seq model is more complicated than with a typical machine learning model, because the encoder states must be handed to the decoder and the output is generated sequentially, one token at a time.
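
To make the state hand-off concrete, here is a minimal sketch of greedy decoding with toy stand-ins for the encoder and decoder. The functions `toy_encoder` and `toy_decoder_step`, the token ids, and the arithmetic rule are all hypothetical; only the shape of the loop mirrors what the real model does.

```python
from typing import Callable, List, Tuple

State = Tuple[int, int]  # stand-in for the LSTM's (hidden state, cell state)

START, END = 1, 2  # hypothetical token ids for the start and end markers


def toy_encoder(question: List[int]) -> State:
    """Summarize the whole question into a fixed-size state (here: two integers)."""
    return (sum(question), len(question))


def toy_decoder_step(token: int, state: State) -> Tuple[int, State]:
    """Consume one token plus the previous state; return the next token and the new state."""
    h, c = state
    if c == 0:
        return END, state
    next_token = 3 + (h + token) % 5  # arbitrary deterministic rule, never equal to END
    return next_token, (h + token, c - 1)


def greedy_decode(question: List[int],
                  encoder: Callable[[List[int]], State],
                  decoder_step: Callable[[int, State], Tuple[int, State]],
                  max_len: int = 10) -> List[int]:
    """Encode once, then decode one token at a time, feeding each output back in."""
    state = encoder(question)   # the encoder runs a single time
    token = START               # decoding always begins with the start marker
    answer: List[int] = []
    for _ in range(max_len):
        token, state = decoder_step(token, state)  # states carry over between steps
        if token == END:
            break
        answer.append(token)
    return answer


print(greedy_decode([4, 7, 9], toy_encoder, toy_decoder_step))
```

The two things to notice are that the encoder is called once per question, while the decoder is called once per output token with the states it returned on the previous call.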

Let's work through this task.

In [119]:
# Import the libraries
import pickle
import numpy as np
import tensorflow.keras
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.activations import softmax
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.preprocessing.sequence import pad_sequences

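Below you can play around with the hyperparameters to improve the model's accuracy. Note that the layer sizes (`embed_dim`, `num_lstm`) must match the values used during training for the saved weights to load, so changing them means retraining the model.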

First, we need to recreate the model graph that we used for training and load the trained weights. Then we will `rearrange` some of the layers to produce the prediction model.

In [126]:
### Encoder Input
embed_dim = 200
num_lstm = 200

# Input for encoder
encoder_inputs = Input(shape = (None, ), name='encoder_inputs')

# Embedding layer
# Why mask_zero = True? https://www.tensorflow.org/guide/keras/masking_and_padding
encoder_embedding = Embedding(input_dim = VOCAB_SIZE, output_dim = embed_dim, mask_zero = True, name='encoder_embedding')(encoder_inputs)

# LSTM layer (that returns states in addition to output)
encoder_outputs, state_h, state_c = LSTM(units = num_lstm, return_state = True, name='encoder_lstm')(encoder_embedding)

# Get the states for encoder
encoder_states = [state_h, state_c]

### Decoder

# Input for decoder
decoder_inputs = Input(shape = (None,  ), name='decoder_inputs')

# Embedding layer
decoder_embedding = Embedding(input_dim = VOCAB_SIZE, output_dim = embed_dim , mask_zero = True, name='decoder_embedding')(decoder_inputs)

# LSTM layer (that returns states and sequences as well)
decoder_lstm = LSTM(units = num_lstm , return_state = True , return_sequences = True, name='decoder_lstm')

# Get the output of LSTM layer, using the initial states from the encoder
decoder_outputs, _, _ = decoder_lstm(inputs = decoder_embedding, initial_state = encoder_states)

# Dense layer
decoder_dense = Dense(units = VOCAB_SIZE, activation = softmax, name='output') 

# Get the output of Dense layer
output = decoder_dense(decoder_outputs)

# Create the model
model = Model([encoder_inputs, decoder_inputs], output)

In [131]:
# Summary
model.summary()

Model: "model_10"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     [(None, None)]       0                                            
__________________________________________________________________________________________________
decoder_inputs (InputLayer)     [(None, None)]       0                                            
__________________________________________________________________________________________________
encoder_embedding (Embedding)   (None, None, 200)    387800      encoder_inputs[0][0]             
__________________________________________________________________________________________________
decoder_embedding (Embedding)   (None, None, 200)    387800      decoder_inputs[0][0]             
___________________________________________________________________________________________

In [70]:
# Load the final model
model.load_weights('saved_models/final_weight_2020-08-01-16-28-47.h5')
print("Model Weight Loaded!")

Model Weight Loaded!


<br>

### 3.6. Inference

---

Now it's time to use our model for inference. In other words, we will ask a question to our chatbot and it will answer us.

In [143]:
# Function for making inference
def make_inference_models():
    
    # Create a model that takes encoder's input and outputs the states for encoder
    encoder_model = Model(encoder_inputs, encoder_states)
    
    # Create two inputs for decoder which are hidden state (or state h) and cell state (or state c)
    decoder_state_input_h = Input(shape = (num_lstm, ))
    decoder_state_input_c = Input(shape = (num_lstm, ))
    
    # Store the two inputs for decoder inside a list
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    
    # Pass the inputs through LSTM layer you have created before
    decoder_outputs, state_h, state_c = decoder_lstm(decoder_embedding, initial_state = decoder_states_inputs)
    
    # Store the outputted hidden state and cell state from LSTM inside a list
    decoder_states = [state_h, state_c]

    # Pass the output from LSTM layer through the dense layer you have created before
    decoder_outputs = decoder_dense(decoder_outputs)

    # Create a model that takes decoder_inputs and decoder_states_inputs as inputs and outputs decoder_outputs and decoder_states
    decoder_model = Model([decoder_inputs] + decoder_states_inputs,
                          [decoder_outputs] + decoder_states)
    
    return encoder_model , decoder_model

In [1]:
# Function for converting strings to tokens
def str_to_tokens(sentence: str, maxlen_questions=22):

    # Lowercase the sentence and split it into words
    words = sentence.lower().split()

    # Initialize a list for tokens
    tokens_list = list()

    # Iterate through words
    for word in words:

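        # Append the word's index, skipping words the tokenizer has never seen
        # (indexing word_index directly would raise a KeyError for them)
        if word in tokenizer.word_index:
            tokens_list.append(tokenizer.word_index[word])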

    # Pad the sequences to be the same length
    return pad_sequences([tokens_list] , maxlen = maxlen_questions, padding = 'post')
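
As a rough illustration of what the function above returns, here is a pure-Python sketch of the same idea: look up each word's index, then pad with zeros at the end. The vocabulary `word_index` below is hypothetical, and skipping out-of-vocabulary words is an assumption about how unknown words should be handled. The truncation line imitates the `pad_sequences` default (`truncating='pre'`), which keeps the last `maxlen` tokens.

```python
from typing import Dict, List


def encode_and_pad(sentence: str, word_index: Dict[str, int], maxlen: int) -> List[int]:
    """Convert a sentence to word indices and post-pad with zeros to length maxlen."""
    # Lowercase, split on whitespace, and keep only words the vocabulary knows
    ids = [word_index[w] for w in sentence.lower().split() if w in word_index]
    # Drop tokens from the front if the sentence is too long
    ids = ids[-maxlen:]
    # Index 0 is reserved for padding, which is why the embedding uses mask_zero=True
    return ids + [0] * (maxlen - len(ids))


# Hypothetical three-word vocabulary
word_index = {"how": 1, "are": 2, "you": 3}
print(encode_and_pad("How are you today", word_index, maxlen=5))
```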

In [145]:
# Initialize the model for inference
enc_model , dec_model = make_inference_models()

In [None]:
# Load the tokenizer that was saved alongside the sequence model so we can use it here
with open(f'saved_models/tokenizer_{timestamp}.pkl', 'rb') as f:
    tokenizer = pickle.load(f)

In [39]:
# Ask up to five questions; interrupt the kernel (Ctrl+C) to stop early
try:
    for _ in range(5):

        # Encode the question; the encoder returns its final states [h, c]
        states_values = enc_model.predict(str_to_tokens(preprocess_text(input('Enter question : '))))

        # Start the target sequence with the index of "starttoken"
        target_seq = np.zeros(shape = (1, 1))
        target_seq[0, 0] = tokenizer.word_index["starttoken"]

        # Collect the decoded words in a list
        decoded_translation = []
        stop_condition = False

        while not stop_condition:

            # Predict the next word from the previous word and the current states
            dec_outputs, h, c = dec_model.predict([target_seq] + states_values)

            # Pick the most probable word (greedy decoding)
            sampled_word_index = int(np.argmax(dec_outputs[0, -1, :]))

            # Map the index back to its word (None for the padding index 0)
            sampled_word = tokenizer.index_word.get(sampled_word_index)
            if sampled_word is not None:
                decoded_translation.append(sampled_word)

            # Stop at "endtoken" or when the answer is longer than allowed
            if sampled_word == 'endtoken' or len(decoded_translation) > maxlen_answers:
                stop_condition = True

            # Feed the sampled word and the updated states into the next step
            target_seq = np.zeros(shape = (1, 1))
            target_seq[0, 0] = sampled_word_index
            states_values = [h, c]

        # Print the decoded words without the final token
        print(' '.join(decoded_translation[:-1]))
except KeyboardInterrupt:
    print('Ending conversational agent')

Ending conversational agent
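
The loop above prints `decoded_translation[:-1]`, which assumes the last word is always the end marker. When decoding stops because the length limit was reached instead, that slice drops a real word. A small sketch of a more careful alternative follows; `finish_reply` is a hypothetical helper name, not part of the project code.

```python
from typing import List


def finish_reply(words: List[str], end_token: str = "endtoken") -> str:
    """Join decoded words into a sentence, removing the end marker only if present."""
    if words and words[-1] == end_token:
        words = words[:-1]
    return " ".join(words)


print(finish_reply(["i", "am", "fine", "endtoken"]))  # stopped at the end marker
print(finish_reply(["i", "am", "fine"]))              # stopped at the length limit
```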


<br>

# 4. Text to Speech

---

In this section, we will use a library called pyttsx3 which performs text-to-speech conversion. Unlike alternative libraries, this works offline and is compatible with both Python 2 and 3.

State-of-the-art text-to-speech systems (such as Tacotron 2 + WaveNet) are more difficult to set up and use, and are outside the scope of this course. Interested students can look them up!

<a href="https://ai.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html">Tacotron 2</a> <br>
https://arxiv.org/abs/1712.05884

Check out some demos using WaveNet.
https://cloud.google.com/text-to-speech

For a _relatively simple_ implementation of a high-quality TTS system that you can run, check out the Mozilla <a href="https://github.com/mozilla/TTS">TTS</a> project.

In [40]:
# Import the libraries
import pyttsx3

In [41]:
# Construct a new TTS engine instance
engine = pyttsx3.init()

In [42]:
# Get all of the voices
voices = engine.getProperty('voices')

# Loop over voices and print their descriptions
for index, voice in enumerate(voices):
    print("Voice {}: ".format(index))
    print(" - ID: %s" % voice.id)
    print(" - Name: %s" % voice.name)
    print(" - Languages: %s" % voice.languages)
    print(" - Gender: %s" % voice.gender)
    print(" - Age: %s" % voice.age)
    print("")

Voice 0: 
 - ID: com.apple.speech.synthesis.voice.Alex
 - Name: Alex
 - Languages: ['en_US']
 - Gender: VoiceGenderMale
 - Age: 35

Voice 1: 
 - ID: com.apple.speech.synthesis.voice.alice
 - Name: Alice
 - Languages: ['it_IT']
 - Gender: VoiceGenderFemale
 - Age: 35

Voice 2: 
 - ID: com.apple.speech.synthesis.voice.alva
 - Name: Alva
 - Languages: ['sv_SE']
 - Gender: VoiceGenderFemale
 - Age: 35

Voice 3: 
 - ID: com.apple.speech.synthesis.voice.amelie
 - Name: Amelie
 - Languages: ['fr_CA']
 - Gender: VoiceGenderFemale
 - Age: 35

Voice 4: 
 - ID: com.apple.speech.synthesis.voice.anna
 - Name: Anna
 - Languages: ['de_DE']
 - Gender: VoiceGenderFemale
 - Age: 35

Voice 5: 
 - ID: com.apple.speech.synthesis.voice.carmit
 - Name: Carmit
 - Languages: ['he_IL']
 - Gender: VoiceGenderFemale
 - Age: 35

Voice 6: 
 - ID: com.apple.speech.synthesis.voice.damayanti
 - Name: Damayanti
 - Languages: ['id_ID']
 - Gender: VoiceGenderFemale
 - Age: 35

Voice 7: 
 - ID: com.apple.speech.synthesis.

In [43]:
### Voice properties    

# Speaking rate in words per minute (pyttsx3 defaults to 200)
engine.setProperty(name = 'rate', value = 180)    

# Volume 0-1
engine.setProperty(name = 'volume', value = 0.9)

# Voice ID (this one is macOS-specific; use an ID from the voices listed on your machine)
en_voice_id = "com.apple.speech.synthesis.voice.daniel.premium"
engine.setProperty('voice', en_voice_id)
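
Because voice IDs differ between operating systems, a hard-coded ID may not exist on your machine. Below is a minimal sketch that selects a voice by language instead. `pick_voice_id` is a hypothetical helper, and the sample voices are stand-in objects that only imitate the `id` and `languages` attributes printed above. The handling of byte strings is a precaution, on the assumption that some speech drivers report languages as bytes.

```python
import re
from types import SimpleNamespace
from typing import Iterable, Optional


def pick_voice_id(voices: Iterable, language_prefix: str) -> Optional[str]:
    """Return the id of the first voice whose language starts with language_prefix."""
    wanted = language_prefix.lower()
    for voice in voices:
        for lang in voice.languages or []:
            if isinstance(lang, bytes):
                lang = lang.decode("utf-8", errors="ignore")
            # Normalize: drop leading non-letters, use lowercase and underscores
            lang = re.sub(r"^[^A-Za-z]+", "", str(lang)).lower().replace("-", "_")
            if lang.startswith(wanted):
                return voice.id
    return None


# Stand-in voices copied from the listing above
sample_voices = [
    SimpleNamespace(id="com.apple.speech.synthesis.voice.alice", languages=["it_IT"]),
    SimpleNamespace(id="com.apple.speech.synthesis.voice.Alex", languages=["en_US"]),
    SimpleNamespace(id="com.apple.speech.synthesis.voice.amelie", languages=["fr_CA"]),
]
print(pick_voice_id(sample_voices, "en"))
```

With a real engine, you would pass `engine.getProperty('voices')` in place of `sample_voices` and call `engine.setProperty('voice', ...)` only when the result is not `None`.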

In [44]:
# Convert the text to speech
engine.say("You've got mail!")
engine.say("The pyttsx3 module supports native Windows and Mac speech APIs but also supports espeak, making it the best available text-to-speech package.")
engine.runAndWait() 

<br>

# 5. Finalize your Conversational Based Agent

---

Now it's time to put everything together so that speech-to-text, text-to-text, and text-to-speech run as one pipeline. For this, you will create a button: after clicking it, you can speak, and your model will speak back to you.
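
Before wiring up the real components, it can help to see the pipeline as three functions called in order. The sketch below uses hypothetical stand-ins (`run_turn` and the lambdas) so that it runs without a microphone, a trained model, or a speech engine.

```python
from typing import Callable, List


def run_turn(listen: Callable[[], str],
             reply: Callable[[str], str],
             speak: Callable[[str], None]) -> str:
    """Run one conversation turn: speech-to-text, text-to-text, text-to-speech."""
    text = listen()          # speech recognition
    response = reply(text)   # chatbot
    speak(response)          # speech synthesis
    return response


# Stand-ins: a fixed "recognized" sentence, an echoing chatbot, and a list as the speaker
spoken: List[str] = []
result = run_turn(
    listen=lambda: "how are you",
    reply=lambda text: "you said: " + text,
    speak=spoken.append,
)
print(result)
```

Passing the three stages in as arguments keeps each one replaceable, which makes it easier to test the chatbot alone before adding audio.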

In [52]:
# Import the libraries 
import ipywidgets as widgets
from IPython.display import display
from text_to_text import text_to_text

In [146]:
# Conversational based agent 
def agent():
    speak_button = widgets.Button(description="Click and speak!")
    speak_output = widgets.Output()
    display(speak_button, speak_output)
    
    def on_button_clicked(b):
        with speak_output:
            # Speech recognition
            print("Say Something...")
            text = streaming_recognition()
            # Text-to-text
            response = text_to_text(text, enc_model, dec_model, str_to_tokens, preprocess_text, tokenizer, maxlen_answers)
            print(" + AGENT: ", response)
            # Text to speech
            engine.say(response)
            engine.runAndWait() 
            print("")

    
    speak_button.on_click(on_button_clicked)

In [149]:
# Talk to your agent
agent()

Button(description='Click and speak!', style=ButtonStyle())

Output()

# Good Job!