<h1 style="text-align: center;text-transform: uppercase;">Conversational Based Agent</h1>

<br>

In this project, you will build an end-to-end voice conversational agent that takes spoken input and synthesizes a spoken response. The chatbot agent runs locally on your computer.

<img style="width:550px; height:300px;" src="assets/intro.png">

This Jupyter notebook consists of the following parts:
1. __Speech Recognition:__ <br>In this part, you will build a speech recognizer that converts your voice into text.<br><br>
2. __Chatbot:__ <br>This is the core of your conversational agent. You will build a chatbot that answers your questions. <br><br>
3. __Text to Speech:__ <br>The chatbot's answer must be converted back into speech, which is what you will build in this part. <br><br>
4. __Finalize your Conversational Based Agent:__ <br>In the final step, you will put everything together to create your Conversational Based Agent.

<br>

# 1. Speech Recognition

---

We will use Mozilla's <a href="https://github.com/mozilla/DeepSpeech">DeepSpeech</a>, an open-source implementation of the DeepSpeech architecture originally developed by Baidu. It runs speech recognition directly on your computer, so you need neither an internet connection nor a cloud account.

While DeepSpeech is not the state-of-the-art speech recognizer (there are now Deep Speech 2, wav2letter by Facebook, and the RNN Transducer by Google), it is a fast, lightweight implementation that is suitable for real-time transcription with high accuracy. Its code is also well maintained, with new features being added regularly.


In this project, we will not train our own speech recognition model (a fairly challenging project), but will use an open-sourced pre-trained model.
<br>


In [92]:
import deepspeech
import sounddevice as sd
import soundfile as sf
from scipy.io.wavfile import write
from time import sleep
import numpy as np
from tqdm import tqdm
import random
from datetime import datetime
import queue
import pickle

In [2]:
ds = deepspeech.Model('speech_recognizer/deepspeech-0.7.4-models.pbmm')
ds.enableExternalScorer('speech_recognizer/deepspeech-0.7.4-models.scorer')
_ = ds.setScorerAlphaBeta(0.75, 1.85)

### 1.1 Speech-recognition on single audio file

In this section, let's set up the basic functionality of running speech recognition on a single audio file. 

1. Record a .wav audio file with a fixed length (say 3 seconds).

2. Perform speech recognition on the saved .wav file using the DeepSpeech model.

In [51]:
test_file_name = 'audio_files/test_audio.wav'
sample_rate = 16000
seconds = 3

In [299]:
sleep(0.5)
print("Recording...")
audio_array = sd.rec(int(seconds * sample_rate), samplerate = sample_rate, channels = 1)

# Wait until recording is finished
sd.wait() 

# Finished recording print
print("Recording Finished!")

# Save as WAV file 
write(test_file_name, sample_rate, audio_array) 

Recording...
Recording Finished!


The `sd.rec` function returns a NumPy array directly, so we can check its shape.

The number of rows is seconds * sample_rate = 3 * 16000 = 48000, and the number of columns is the number of channels = 1.

In [300]:
audio_array.shape

(48000, 1)

We can check the recording by playing it back from the numpy array

In [302]:
# Play the recorded array back (sd.playrec would also record a new signal)
sd.play(audio_array, sample_rate)
sd.wait()

Or from the .wav file that it is saved to.

In [303]:
data, fs = sf.read(test_file_name, dtype='float32')
sd.play(data, fs, device=1)
status = sd.wait()

If the playback did not work, choose another output device by checking what is available on your machine.

In [37]:
sd.query_devices()

> 0 Built-in Microphone, Core Audio (2 in, 0 out)
< 1 Built-in Output, Core Audio (0 in, 2 out)
  2 USB PnP Audio Device, Core Audio (0 in, 2 out)
  3 USB PnP Audio Device, Core Audio (1 in, 0 out)

While sounddevice outputs a NumPy array of float32 values (from -1 to 1), the DeepSpeech recognizer expects 16-bit integers (-32768 to 32767). Let's scale the NumPy array and set the correct data type.

In [305]:
audio_array *= 32768
audio_array = audio_array.astype('int16')
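
One caveat with the scaling above: a sample of exactly 1.0 multiplied by 32768 no longer fits in the int16 range. The following is a minimal sketch of a more defensive conversion, assuming the input is a float array nominally in [-1, 1]; the helper name `float_to_int16` is hypothetical and not part of sounddevice or DeepSpeech.

```python
import numpy as np

def float_to_int16(samples):
    """Scale float samples in [-1, 1] to int16, clipping anything out of range."""
    scaled = np.asarray(samples, dtype=np.float32) * 32767.0
    # Clip before casting so out-of-range values saturate instead of overflowing
    return np.clip(scaled, -32768, 32767).astype(np.int16)

pcm = float_to_int16([0.0, 0.5, 1.0, -1.0, 1.5])
print(pcm, pcm.dtype)
```

Scaling by 32767 and clipping keeps every value inside the int16 range, at the cost of never producing -32768.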

In [306]:
ds.stt(audio_array[:,0])

'this is a test recording'

### 1.2 Streaming Speech Recognition in Real-Time

Recording your voice and then running speech recognition on an audio file works fine, but it is not very user friendly. The interaction is slow and hard to use in a continuous setting.

In this section, let's set up a function that records your voice AND recognizes the text at the same time!

In [307]:
import queue
import sys  # needed for sys.stderr in the callback below

In [308]:
def callback(indata, frames, time, status):
    """This is called (from a separate thread) for each audio block."""
    if status:
        print(status, file=sys.stderr)
    q.put(indata.copy())
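
The callback above only hands audio blocks to a queue; the actual recognition happens in a loop on the main thread. The sketch below shows that producer/consumer pattern without any audio hardware. The names `fake_audio_source`, `consume`, and `demo_queue` are hypothetical stand-ins: the thread plays the role of the sounddevice callback, and a `None` sentinel is used here only to end the demo.

```python
import queue
import threading

def fake_audio_source(q, n_blocks):
    """Stand-in for the audio callback thread: pushes small numbered blocks."""
    for i in range(n_blocks):
        q.put([i] * 4)   # a pretend block of 4 samples
    q.put(None)          # sentinel: no more audio

def consume(q):
    """Main-thread loop: wait on the queue and collect blocks in arrival order."""
    received = []
    while True:
        block = q.get()  # blocks until the producer delivers something
        if block is None:
            break
        received.append(block)
    return received

demo_queue = queue.Queue()
producer = threading.Thread(target=fake_audio_source, args=(demo_queue, 3))
producer.start()
blocks = consume(demo_queue)
producer.join()
print(len(blocks), "blocks received")
```

Because `queue.Queue` is thread-safe, the callback can stay short and never has to wait for the recognizer.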

In [344]:
q = queue.Queue()
recognizer_stream = ds.createStream()
try:
    with sd.InputStream(samplerate=sample_rate, device=0, channels=2, callback=callback) as audio_stream:
        print('#' * 80)
        print('press Interrupt to stop the recording')
        print('#' * 80)
        print()
        i = 0
        while True:
            i += 1
            audio_chunk = q.get()
            audio_chunk *= 32768
            audio_chunk = audio_chunk.astype('int16')
            recognizer_stream.feedAudioContent(audio_chunk[:,0])
            text = recognizer_stream.intermediateDecode()
            print(f'\r{text}', end='')
except KeyboardInterrupt:
#     print('\r\nRecording finished.\r\n')
    pass
finally:
    audio_stream.stop()
    audio_stream.close()
    audio_chunks = []
    while True:
        if not q.empty():
            chunk = q.get()
            audio_chunks.append(chunk)
        else:
            break
    if audio_chunks:
        # Convert the leftover blocks (not the last chunk, which was already fed)
        audio_chunks = np.concatenate(audio_chunks)
        audio_chunks *= 32768
        audio_chunks = audio_chunks.astype('int16')
        recognizer_stream.feedAudioContent(audio_chunks[:,0])
    text = recognizer_stream.finishStream()
    print(f'\r{text}')

################################################################################
press Ctrl+C to stop the recording
################################################################################

the moon is about two hundred and fifty thousand miles from earth on average


In [3]:
def streaming_recognition():
    q = queue.Queue()
    recognizer_stream = ds.createStream()
    
    def callback(indata, frames, time, status):
        """This is called (from a separate thread) for each audio block."""
        if status:
            print(status, file=sys.stderr)
        q.put(indata.copy())
    
    try:
        with sd.InputStream(samplerate=sample_rate, device=0, channels=2, callback=callback) as audio_stream:
            while True:
                audio_chunk = q.get()
                audio_chunk *= 32768
                audio_chunk = audio_chunk.astype('int16')
                recognizer_stream.feedAudioContent(audio_chunk[:,0])
                text = recognizer_stream.intermediateDecode()
                print(f"\r - YOU SAID: {text}", end='')
    except KeyboardInterrupt:
    #     print('\r\nRecording finished.\r\n')
        pass
    finally:
        audio_stream.stop()
        audio_stream.close()
        audio_chunks = []
        while True:
            if not q.empty():
                chunk = q.get()
                audio_chunks.append(chunk)
            else:
                break
        if audio_chunks:
            # Convert the leftover blocks (not the last chunk, which was already fed)
            audio_chunks = np.concatenate(audio_chunks)
            audio_chunks *= 32768
            audio_chunks = audio_chunks.astype('int16')
            recognizer_stream.feedAudioContent(audio_chunks[:,0])
        text = recognizer_stream.finishStream()
        print(f"\r - YOU SAID: {text}", end='\r\n')
        
    return text

In [359]:
streaming_recognition()

################################################################################
press Interrupt to stop the recording
################################################################################

this is a test


'this is a test'

#### Congratulations! You are now able to run your own speech-to-text!

<br>

# 2. Chatbot

---


In this part, you will create a deep learning based conversational agent. This agent will be able to interact with users and understand their questions. More specifically, you will load the datasets, clean and preprocess them, and then feed them into a neural network.

<br>

### 2.1. Load and Clean the Chatterbot Dataset 

---

In this project, we have provided you with multiple dataset files. Each file contains conversations on a specific topic, such as humor, food, movies, science, or history. The table below describes each dataset:

| Name of Dataset | Description |
| :----:| :----: |
| botprofile.yml | Personality of Your Chatbot |
| humor.yml | Jokes and Humor |
| emotion.yml | Emotional Conversations |
| politics.yml | Political Conversations |
| ai.yml | General Questions about AI |
| computers.yml | Conversations about Computers |
| history.yml | Q&A about Historical Facts and Events |
| psychology.yml | Psychological Conversations |
| food.yml | Food Related Conversations |
| literature.yml | Conversations about Different Books, Authors, Genres |
| money.yml | Conversations about Money, Investment, Economy |
| trivia.yml | Conversations that Have Small Values |
| gossip.yml | Gossipy Conversations |
| conversations.yml | Common Conversations |
| greetings.yml | Different Ways of Greeting |
| sports.yml | Conversations about Sports |
| movies.yml | Conversations about Movies |
| science.yml | Conversations about Science |
| health.yml | Health Related Questions and Answers |


Feel free to modify these datasets to shape how you want the chatbot to behave.

In [16]:
# Import the libraries
import yaml
from yaml import Loader
import glob
from datetime import datetime  # keep `datetime.now()` usable later in the notebook

In [101]:
# Function for loading all of the yml files
def load_chatterbot_dataset():
    
    # Initialize empty lists for questions and answers
    questions, answers = [], []
    
    # Get the list of all dataset names
    dataset_names = glob.glob("datasets/chatterbot/*.yml")
    
    # Iterate through each dataset name
    for i_dataset_name in tqdm(dataset_names):
        
        # Load the dataset
        with open(i_dataset_name) as file:
            greeting = yaml.load(file, Loader = Loader)["conversations"]
            
        # Iterate through each conversation
        for i_conversation in greeting:
            
            # If length is two
            if len(i_conversation) == 2:
                
                # Append the question to 'questions' list
                questions.append(i_conversation[0])
                
                # Append the answer to 'answers' list
                answers.append(i_conversation[1])
            
            # If length is more than two
            elif len(i_conversation) > 2:
                
                # Iterate through each index
                for index in range(len(i_conversation)-1):
    
                    # Append the question and answer
                    questions.append(i_conversation[0])
                    answers.append(i_conversation[index+1])
                    
    return questions, answers
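
To make the pairing rule in the loader concrete, here is a small sketch that applies the same logic to in-memory lists instead of YAML files. The helper name `pair_conversation` and the sample conversations are hypothetical; the behavior mirrors the loop above, where the first line of a longer conversation is paired with every following line.

```python
def pair_conversation(conversation):
    """Return (question, answer) pairs using the same rule as the loader above."""
    if len(conversation) == 2:
        return [(conversation[0], conversation[1])]
    if len(conversation) > 2:
        # The first utterance is reused as the question for every later line
        return [(conversation[0], reply) for reply in conversation[1:]]
    return []  # conversations with fewer than two lines are skipped

print(pair_conversation(["Hi", "Hello"]))
print(pair_conversation(["Tell me a joke", "Joke one", "Joke two"]))
```

This means one question can appear several times with different answers, which is worth keeping in mind when reading the training set size.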

In [102]:
# Get the questions and answers
questions, answers = load_chatterbot_dataset()

100%|██████████| 19/19 [00:00<00:00, 89.35it/s]


In [103]:
print("Total Question & Answers: ", len(questions))

Total Question & Answers:  869


In [104]:
# Take a look at a few raw questions and answers
total_questions = len(questions)
for i in range(4):
    # randrange excludes the upper bound, so the index is always valid
    j = random.randrange(total_questions)
    print("Question {}: \n".format(i), questions[j])
    print("")
    print("Answer {}: \n".format(i), answers[j])
    print("--------------------------------------------------------------------------")

Question 0: 
 Good morning, how are you?

Answer 0: 
 I'm also good.
--------------------------------------------------------------------------
Question 1: 
 What makes you sad

Answer 1: 
 Sadness is not an emotion that I like to experience.
--------------------------------------------------------------------------
Question 2: 
 Tell me a joke

Answer 2: 
 what do you get when you cross a dance and a cheetah?
--------------------------------------------------------------------------
Question 3: 
 you are emotional

Answer 3: 
 i certainly do at times.
--------------------------------------------------------------------------


<br>

### 2.2. Data Preprocessing

---

After cleaning the dataset, preprocess it with the following steps:

1. Lower case the text.
2. Decontract the text (e.g. she's -> she is, they're -> they are, etc.).
3. Remove the punctuation (e.g. !, ?, $, %, #, @, ^, etc.).
4. Tokenization.
5. Pad the sequences to be the same length.

In [105]:
# import the libraries
import numpy as np
import contractions
import re
from tensorflow.keras import preprocessing, utils
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [106]:
# Function for preprocessing the given text
def preprocess_text(text):
    
    # Lowercase the text
    text = text.lower()
    
    # Decontracting the text (e.g. it's -> it is)
    text = contractions.fix(text)
    
    # Remove the punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    return text
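
To see what these steps do to a sentence, the sketch below reproduces them with the standard library only. It is an illustration, not the notebook's implementation: the `CONTRACTIONS` table is a deliberately tiny, hypothetical stand-in for the `contractions` package, and `preprocess_sketch` is a made-up name.

```python
import re

# Hypothetical, deliberately tiny stand-in for the `contractions` package
CONTRACTIONS = {"she's": "she is", "they're": "they are", "it's": "it is", "i'm": "i am"}

def preprocess_sketch(text):
    """Lowercase, expand a few contractions, and replace punctuation with spaces."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Every character that is not a lowercase letter or digit becomes a space
    return re.sub(r"[^a-z0-9]", " ", text)

cleaned = preprocess_sketch("They're late, she's early!")
print(repr(cleaned))
```

The doubled and trailing spaces left behind by removed punctuation are harmless, since the tokenizer used later splits on whitespace.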

In [108]:
# Preprocess the questions and answers
questions = [preprocess_text(q) for q in questions]
answers = [preprocess_text(q) for q in answers]


In [109]:
# Take a look at the preprocessed questions and answers
total_questions = len(questions)
for i in range(4):
    j = random.randrange(total_questions)
    print("Question {}: \n".format(i), questions[j])
    print("")
    print("Answer {}: \n".format(i), answers[j])
    print("--------------------------------------------------------------------------")

Question 0: 
 what is the stock market

Answer 0: 
 trading shares 
--------------------------------------------------------------------------
Question 1: 
 how much do you earn

Answer 1: 
 i am expecting a raise soon 
--------------------------------------------------------------------------
Question 2: 
 chemistry

Answer 2: 
 my favorite subject is chemistry
--------------------------------------------------------------------------
Question 3: 
 robots are not allowed to lie

Answer 3: 
 sure we are   we choose not to 
--------------------------------------------------------------------------


To ensure that every training example is a string, we first need to filter out question-answer pairs in which either side is not a string. The cell below sketches this filter but is left commented out.

In [110]:
# answers_with_tags = list()
# for i in range(len(answers)):
#     if type(answers[i]) == str:
#         answers_with_tags.append(answers[i])
#     else:
#         questions.pop(i)
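
A possible way to write this filter is sketched below. It builds new lists instead of calling `pop` while iterating, which would shift the indexes. The helper name `keep_string_pairs` and the sample data are hypothetical; non-string entries can appear because YAML parses unquoted values such as `42` or `yes` as numbers or booleans. Such a filter would need to run before `preprocess_text`, since `.lower()` fails on non-strings.

```python
def keep_string_pairs(questions, answers):
    """Keep only the pairs where both the question and the answer are strings."""
    kept_questions, kept_answers = [], []
    for question, answer in zip(questions, answers):
        if isinstance(question, str) and isinstance(answer, str):
            kept_questions.append(question)
            kept_answers.append(answer)
    return kept_questions, kept_answers

demo_questions = ["Are you a robot", "How old are you", "Do you lie"]
demo_answers = ["I am a bot", 42, True]
clean_questions, clean_answers = keep_string_pairs(demo_questions, demo_answers)
print(clean_questions, clean_answers)
```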

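After preprocessing the dataset, we should add a start tag and an end tag to each answer. The code below uses the plain words `starttoken` and `endtoken` rather than `<START>` and `<END>`, because the Keras tokenizer's default filters strip the `<` and `>` characters. We add these tags only to the answers, not the questions: the Seq2Seq decoder needs them to know where a response begins and ends.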

In [111]:
# Add the start and end tokens to each answer
answers = ['starttoken ' + a + ' endtoken' for a in answers]

In [112]:
for _ in range(5):
    print(random.choice(answers))

starttoken i am capable of interacting with my environment and reacting to events in it  which is the essence of experience   therefore  your statement is incorrect  endtoken
starttoken a computer is an electronic device which takes information in digital form and performs a series of operations based on predetermined instructions to give some output  endtoken
starttoken what do you want to know  endtoken
starttoken i certainly do not last as long as i would want to  endtoken
starttoken complex is better than complicated  endtoken


Now it's time to tokenize our dataset. We use a class in Keras which allows us to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf, etc.


In [113]:
# Initialize the tokenizer
tokenizer = preprocessing.text.Tokenizer()

# Fit the tokenizer to questions and answers
tokenizer.fit_on_texts(questions + answers)

# Get the total vocab size
VOCAB_SIZE = len(tokenizer.word_index) + 1

print( 'VOCAB SIZE : {}'.format(VOCAB_SIZE))

VOCAB SIZE : 1939


In [114]:
### encoder input data

# Tokenize the questions
tokenized_questions = tokenizer.texts_to_sequences(questions)

# Get the length of longest sequence
maxlen_questions = max([len(x) for x in tokenized_questions])

# Pad the sequences
padded_questions = pad_sequences(tokenized_questions, maxlen=maxlen_questions, padding='post')

# Convert the sequences into array
encoder_input_data = np.array(padded_questions)

print(encoder_input_data.shape, maxlen_questions)

(869, 22) 22
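
The following pure-Python sketch illustrates what the tokenizer and `pad_sequences` do conceptually: words are indexed by frequency starting at 1, index 0 is reserved for padding, and shorter sequences are padded at the end. The helpers `build_word_index` and `pad_post` are hypothetical simplifications, not the Keras implementation.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def pad_post(sequence, maxlen, pad=0):
    """Append the padding index until the sequence reaches maxlen."""
    return sequence + [pad] * (maxlen - len(sequence))

demo_texts = ["how are you", "how old are you today"]
word_index = build_word_index(demo_texts)
sequences = [[word_index[w] for w in text.split()] for text in demo_texts]
maxlen = max(len(s) for s in sequences)
padded = [pad_post(s, maxlen) for s in sequences]
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0
print(padded, vocab_size)
```

The `+ 1` in the vocabulary size matches the `VOCAB_SIZE` computation above, where index 0 never maps to a word.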


In [115]:
### decoder input data

# Tokenize the answers
tokenized_answers = tokenizer.texts_to_sequences(answers)

# Get the length of longest sequence
maxlen_answers = max([len(x) for x in tokenized_answers])

# Pad the sequences
padded_answers = pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')

# Convert the sequences into array
decoder_input_data = np.array(padded_answers)

print(decoder_input_data.shape, maxlen_answers)

(869, 45) 45


In [116]:
### decoder_output_data

# Iterate through index of tokenized answers
for i in range(len(tokenized_answers)):

    # Drop the leading start token so the target is the decoder input shifted by one step
    tokenized_answers[i] = tokenized_answers[i][1:]

# Pad the tokenized answers
padded_answers = pad_sequences(tokenized_answers, maxlen = maxlen_answers, padding = 'post')

# One hot encode
onehot_answers = utils.to_categorical(padded_answers, VOCAB_SIZE)

# Convert to numpy array
decoder_output_data = np.array(onehot_answers)

print(decoder_output_data.shape)

(869, 45, 1939)
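
The relationship between the decoder input and the decoder target is easier to see on a toy sequence. In the sketch below, the token ids and the helper name `teacher_forcing_pair` are made up for illustration; the point is only that the target is the input shifted left by one position, so at each step the decoder learns to predict the next token.

```python
def teacher_forcing_pair(token_ids, maxlen, pad=0):
    """Build a post-padded decoder input and its one-step-shifted target."""
    def pad_post(sequence):
        return sequence + [pad] * (maxlen - len(sequence))
    decoder_input = pad_post(token_ids)        # starts with the start token
    decoder_target = pad_post(token_ids[1:])   # same tokens, start token removed
    return decoder_input, decoder_target

START, END = 1, 2  # hypothetical ids for starttoken and endtoken
dec_in, dec_out = teacher_forcing_pair([START, 7, 9, END], maxlen=6)
print(dec_in)
print(dec_out)
```

In the notebook the target is additionally one-hot encoded, which is why its last dimension equals the vocabulary size.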


In [117]:
# Saving all the arrays to storage
np.save("enc_in_data.npy", encoder_input_data)
np.save("dec_in_data.npy", decoder_input_data)
np.save("dec_tar_data.npy", decoder_output_data)

In [118]:
# Load all the arrays from storage
encoder_input_data = np.load("enc_in_data.npy")
decoder_input_data = np.load("dec_in_data.npy")
decoder_output_data = np.load("dec_tar_data.npy")

<br>

### 2.3. Train the Seq2Seq Model

---

In this section, we will use an architecture called Sequence to Sequence (or Seq2Seq). This model is used because the length of the input sequence (question) does not match the length of the output sequence (answer). The model consists of an encoder and a decoder.
- __Encoder:__ <br> This part of the network reads the input sequence (the question) and passes the final hidden and cell states of its recurrent layer to the decoder. <br><br>
- __Decoder:__ <br> This part of the network uses the encoder's final states as the initial state of its own recurrent layer and generates the answer one token at a time.

<br>

<img src="assets/encoder_decoder.png">

<br>

Let's start by importing all the necessary libraries in Keras.

In [119]:
# Import the libraries
import tensorflow.keras
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.activations import softmax
from tensorflow.keras.callbacks import ModelCheckpoint

Below you can play around with hyperparameters for improving the model's accuracy.

In [120]:
# Hyper parameters
BATCH_SIZE = 32
EPOCHS = 100
LEARNING_RATE = 1e-3

In the following block of code, you will implement the Encoder. You can follow the below steps for creating the encoder: 

1.   Create an input for the Encoder.
2.   Create an embedding layer.
3.   Create an LSTM layer which also returns the states.
4.   Get the hidden state (state h) and cell state (state c) inside a list.

In [126]:
### Encoder Input
embed_dim = 200
num_lstm = 200

# Input for encoder
encoder_inputs = Input(shape = (None, ), name='encoder_inputs')

# Embedding layer
# Why mask_zero = True? https://www.tensorflow.org/guide/keras/masking_and_padding
encoder_embedding = Embedding(input_dim = VOCAB_SIZE, output_dim = embed_dim, mask_zero = True, name='encoder_embedding')(encoder_inputs)

# LSTM layer (that returns states in addition to output)
encoder_outputs, state_h, state_c = LSTM(units = num_lstm, return_state = True, name='encoder_lstm')(encoder_embedding)

# Get the states for encoder
encoder_states = [state_h, state_c]

After creating your encoder, it's time to implement the decoder. You can follow the below steps for implementing the decoder:

1.   Create an input for the decoder.
2.   Create an embedding layer.
3.   Create an LSTM layer that returns states and sequences.
4.   Create a dense layer.
5.   Get the output.

In [128]:
### Decoder

# Input for decoder
decoder_inputs = Input(shape = (None,  ), name='decoder_inputs')

# Embedding layer
decoder_embedding = Embedding(input_dim = VOCAB_SIZE, output_dim = embed_dim , mask_zero = True, name='decoder_embedding')(decoder_inputs)

# LSTM layer (that returns states and sequences as well)
decoder_lstm = LSTM(units = num_lstm , return_state = True , return_sequences = True, name='decoder_lstm')

# Get the output of LSTM layer, using the initial states from the encoder
decoder_outputs, _, _ = decoder_lstm(inputs = decoder_embedding, initial_state = encoder_states)

# Dense layer
decoder_dense = Dense(units = VOCAB_SIZE, activation = softmax, name='output') 

# Get the output of Dense layer
output = decoder_dense(decoder_outputs)

Now that you have implemented the encoder and decoder, it's time to create your model. It takes two inputs, the encoder's input and the decoder's input, and returns the decoder's output.

In [129]:
# Create the model
model = Model([encoder_inputs, decoder_inputs], output)

In [130]:
# Compile the model
model.compile(optimizer = RMSprop(learning_rate = LEARNING_RATE), loss = "categorical_crossentropy")

In [131]:
# Summary
model.summary()

Model: "model_10"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     [(None, None)]       0                                            
__________________________________________________________________________________________________
decoder_inputs (InputLayer)     [(None, None)]       0                                            
__________________________________________________________________________________________________
encoder_embedding (Embedding)   (None, None, 200)    387800      encoder_inputs[0][0]             
__________________________________________________________________________________________________
decoder_embedding (Embedding)   (None, None, 200)    387800      decoder_inputs[0][0]             
___________________________________________________________________________________________

In [132]:
# Train the model
model.fit(x = [encoder_input_data , decoder_input_data], 
          y = decoder_output_data, 
          batch_size = BATCH_SIZE, 
          epochs = EPOCHS) 

Train on 869 samples
Epoch 1/100
...
Epoch 100/100


<tensorflow.python.keras.callbacks.History at 0x186393fd0>

In [133]:
# Save the final model
timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
model_path = f'saved_models/final_weight_{timestamp}.h5'
model.save(filepath = model_path)
print(f"Model Weight Saved to {model_path}!")

Model Weight Saved!


In [142]:
# Save the tokenizer alongside the sequence model so we can reuse it elsewhere
with open(f'saved_models/tokenizer_{timestamp}.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)
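
Loading the tokenizer back is the mirror image of the cell above. The sketch below shows the round trip with a plain dictionary standing in for the tokenizer and a temporary directory standing in for `saved_models/`; both are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the fitted tokenizer in this illustration
stand_in_tokenizer = {"word_index": {"hello": 1, "there": 2}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer_demo.pkl")
    with open(path, "wb") as f:
        pickle.dump(stand_in_tokenizer, f)
    # Later, or in another script: read the object back from disk
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_tokenizer)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.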

In [70]:
# Load the final model
model.load_weights(f'saved_models/final_weight_2020-08-01-16-28-47.h5') 
print("Model Weight Loaded!")

Model Weight Loaded!


<br>

### 2.4. Inference

---

Now it's time to use our model for inference. In other words, we will ask a question to our chatbot and it will answer us.

In [143]:
# Function for making inference
def make_inference_models():
    
    # Create a model that takes encoder's input and outputs the states for encoder
    encoder_model = tensorflow.keras.models.Model(encoder_inputs, encoder_states)
    
    # Create two inputs for decoder which are hidden state (or state h) and cell state (or state c)
    decoder_state_input_h = Input(shape = (num_lstm, ))
    decoder_state_input_c = Input(shape = (num_lstm, ))
    
    # Store the two inputs for decoder inside a list
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    
    # Pass the inputs through LSTM layer you have created before
    decoder_outputs, state_h, state_c = decoder_lstm(decoder_embedding, initial_state = decoder_states_inputs)
    
    # Store the outputted hidden state and cell state from LSTM inside a list
    decoder_states = [state_h, state_c]

    # Pass the output from LSTM layer through the dense layer you have created before
    decoder_outputs = decoder_dense(decoder_outputs)

    # Create a model that takes decoder_inputs and decoder_states_inputs as inputs and outputs decoder_outputs and decoder_states
    decoder_model = tensorflow.keras.models.Model([decoder_inputs] + decoder_states_inputs,
                          [decoder_outputs] + decoder_states)
    
    return encoder_model , decoder_model

In [144]:
# Function for converting strings to tokens
def str_to_tokens(sentence:str):

    # Lowercase the sentence and split it into words
    words = sentence.lower().split()

    # Initialize a list for tokens
    tokens_list = list()

    # Iterate through words
    for word in words:

        # Append the word index, skipping words the tokenizer has never seen
        if word in tokenizer.word_index:
            tokens_list.append(tokenizer.word_index[word])

    # Pad the sequences to be the same length
    return pad_sequences([tokens_list] , maxlen = maxlen_questions, padding = 'post')

In [145]:
# Initialize the model for inference
enc_model , dec_model = make_inference_models()

In [39]:
# Iterate through the number of times you want to ask question
try:
    for _ in range(5):

        # Get the input and predict it with the encoder model
        states_values = enc_model.predict(str_to_tokens(preprocess_text(input('Enter question : '))))

        # Initialize the target sequence with zero - array([[0.]])
        empty_target_seq = np.zeros(shape = (1, 1))

        # Update the target sequence with index of "start"
        empty_target_seq[0, 0] = tokenizer.word_index["starttoken"]

        # Initialize the stop condition with False
        stop_condition = False

        # Initialize the decoded words with an empty string
        decoded_translation = []

        # While stop_condition is false
        while not stop_condition :

            # Predict the (target sequence + the output from encoder model) with decoder model
            dec_outputs , h , c = dec_model.predict([empty_target_seq] + states_values)

            # Get the index for sampled word
            sampled_word_index = np.argmax(dec_outputs[0, -1, :])

            # Initialize the sampled word with None
            sampled_word = None

            # Iterate through words and their indexes
            for word, index in tokenizer.word_index.items() :

                # If the index is equal to sampled word's index
                if sampled_word_index == index :

                    # Add the word to the decoded string
                    decoded_translation.append(word)

                    # Update the sampled word
                    sampled_word = word

            # If the sampled word is "endtoken" OR the decoded sequence is longer than allowed
            if sampled_word == 'endtoken' or len(decoded_translation) > maxlen_answers:

                # Make the stop_condition to true
                stop_condition = True

            # Initialize back the target sequence to zero - array([[0.]])    
            empty_target_seq = np.zeros(shape = (1, 1))  

            # Update the target sequence with the index of the sampled word
            empty_target_seq[0, 0] = sampled_word_index

            # Get the state values
            states_values = [h, c] 

            # Print the decoded string
        print(' '.join(decoded_translation[:-1]))
except KeyboardInterrupt:
    print('Ending conversational agent')

Ending conversational agent
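
The loop above is a greedy decoder: at each step it picks the highest-scoring token, feeds it back in, and stops at the end token or a length limit. The sketch below isolates that idea with a toy stand-in for the decoder model. The names `greedy_decode`, `toy_step`, and `SCRIPT` are hypothetical; the reverse lookup uses a dictionary from index to word, which the Keras tokenizer also provides as `tokenizer.index_word`, instead of scanning `word_index` on every step.

```python
def greedy_decode(step_fn, start_id, end_id, index_word, max_len):
    """Repeatedly pick the best-scoring token until end_id or max_len."""
    words = []
    token, state = start_id, None
    while len(words) < max_len:
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == end_id:
            break
        word = index_word.get(token)
        if word is None:  # e.g. the padding index 0
            break
        words.append(word)
    return " ".join(words)

# Toy stand-in for the decoder: always emits the ids in SCRIPT, one per step
demo_index_word = {1: "starttoken", 2: "endtoken", 3: "hello", 4: "there"}
SCRIPT = [3, 4, 2]

def toy_step(token, state):
    position = 0 if state is None else state
    scores = [0.0] * 5
    scores[SCRIPT[position]] = 1.0
    return scores, position + 1

reply = greedy_decode(toy_step, start_id=1, end_id=2, index_word=demo_index_word, max_len=10)
print(reply)
```

Stopping before the end token is appended avoids having to slice off the last word afterwards.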


<br>

# 3. Text to Speech

---

In this section, we will use a library called pyttsx3 which performs text-to-speech conversion. Unlike alternative libraries, this works offline and is compatible with both Python 2 and 3.

State-of-the-art text-to-speech systems (such as Tacotron 2 + WaveNet) are more difficult to set up and use, and are outside the scope of this course. Interested students can look them up!

<a href="https://ai.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html">Tacotron 2</a> <br>
https://arxiv.org/abs/1712.05884

Check out some demos using WaveNet:
https://cloud.google.com/text-to-speech

For a _relatively simple_ implementation of a high-quality TTS system that you can run, check out the Mozilla <a href="https://github.com/mozilla/TTS">TTS</a> project.

In [40]:
# Import the libraries
import pyttsx3

In [41]:
# Construct a new TTS engine instance
engine = pyttsx3.init()

In [42]:
# Get all of the voices
voices = engine.getProperty('voices')

# Loop over voices and print their descriptions
for index, voice in enumerate(voices):
    print("Voice {}: ".format(index))
    print(" - ID: %s" % voice.id)
    print(" - Name: %s" % voice.name)
    print(" - Languages: %s" % voice.languages)
    print(" - Gender: %s" % voice.gender)
    print(" - Age: %s" % voice.age)
    print("")

Voice 0: 
 - ID: com.apple.speech.synthesis.voice.Alex
 - Name: Alex
 - Languages: ['en_US']
 - Gender: VoiceGenderMale
 - Age: 35

Voice 1: 
 - ID: com.apple.speech.synthesis.voice.alice
 - Name: Alice
 - Languages: ['it_IT']
 - Gender: VoiceGenderFemale
 - Age: 35

Voice 2: 
 - ID: com.apple.speech.synthesis.voice.alva
 - Name: Alva
 - Languages: ['sv_SE']
 - Gender: VoiceGenderFemale
 - Age: 35

Voice 3: 
 - ID: com.apple.speech.synthesis.voice.amelie
 - Name: Amelie
 - Languages: ['fr_CA']
 - Gender: VoiceGenderFemale
 - Age: 35

Voice 4: 
 - ID: com.apple.speech.synthesis.voice.anna
 - Name: Anna
 - Languages: ['de_DE']
 - Gender: VoiceGenderFemale
 - Age: 35

Voice 5: 
 - ID: com.apple.speech.synthesis.voice.carmit
 - Name: Carmit
 - Languages: ['he_IL']
 - Gender: VoiceGenderFemale
 - Age: 35

Voice 6: 
 - ID: com.apple.speech.synthesis.voice.damayanti
 - Name: Damayanti
 - Languages: ['id_ID']
 - Gender: VoiceGenderFemale
 - Age: 35

Voice 7: 
 - ID: com.apple.speech.synthesis.

In [43]:
### Voice properties    

# Speaking rate in words per minute
engine.setProperty(name = 'rate', value = 180)    

# Volume 0-1
engine.setProperty(name = 'volume', value = 0.9)

# Voice ID
en_voice_id = "com.apple.speech.synthesis.voice.daniel.premium"
engine.setProperty('voice', en_voice_id)
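
The voice ID above is specific to one machine and may not exist on yours. A possible alternative is to pick the first installed voice whose language matches, as sketched below. The helper `pick_voice_id` is hypothetical, and the sketch assumes each voice exposes `id` and a `languages` list of strings such as `'en_US'`, as in the listing above; other platforms may report languages in a different format. `SimpleNamespace` objects stand in for real voices so the example runs without a speech engine.

```python
from types import SimpleNamespace

def pick_voice_id(voices, language_prefix="en"):
    """Return the id of the first voice whose language starts with the prefix."""
    for voice in voices:
        for language in voice.languages:
            if str(language).startswith(language_prefix):
                return voice.id
    return None  # no matching voice installed

# Stand-ins shaped like the first two voices in the listing above
demo_voices = [
    SimpleNamespace(id="com.apple.speech.synthesis.voice.alice", languages=["it_IT"]),
    SimpleNamespace(id="com.apple.speech.synthesis.voice.Alex", languages=["en_US"]),
]
print(pick_voice_id(demo_voices, "en"))
```

With a real engine, the list would come from `engine.getProperty('voices')`, and the returned ID would be passed to `engine.setProperty('voice', ...)` when it is not `None`.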

In [44]:
# Convert the text to speech
engine.say("You've got mail!")
engine.say("The pyttsx3 module supports native Windows and Mac speech APIs but also supports espeak, making it the best available text-to-speech package.")
engine.runAndWait() 

<br>

# 4. Finalize your Conversational Based Agent

---

Now it's time to put everything together so you can chain speech-to-text, text-to-text, and text-to-speech. For this, you will create a button: after clicking it, you can speak, and your model will speak back to you.

In [52]:
# Import the libraries 
import ipywidgets as widgets
from IPython.display import display
from text_to_text import text_to_text

In [146]:
# Conversational based agent 
def agent():
    speak_button = widgets.Button(description="Click and speak!")
    speak_output = widgets.Output()
    display(speak_button, speak_output)
    
    def on_button_clicked(b):
        with speak_output:
            # Speech recognition
            print("Say Something...")
            text = streaming_recognition()
            # Text-to-text
            response = text_to_text(text, enc_model, dec_model, str_to_tokens, preprocess_text, tokenizer, maxlen_answers)
            print(" + AGENT: ", response)
            # Text to speech
            engine.say(response)
            engine.runAndWait() 
            print("")

    
    speak_button.on_click(on_button_clicked)

In [149]:
# Talk to your agent
agent()

Button(description='Click and speak!', style=ButtonStyle())

Output()

# Good Job!