<h1 style="text-align: center;text-transform: uppercase;">Conversational Based Agent</h1>

<br>

In this project, you will build an end-to-end conversational voice agent that takes spoken audio as input and synthesizes a spoken response. The chatbot agent runs locally on your computer.

<img style="width:550px; height:300px;" src="assets/intro.png">

This project consists of the following parts:
1. __Speech Recognition:__ <br>In this part, you will build a speech recognizer that converts your voice into text.<br><br>
2. __Chatbot:__ <br>This is the core of your conversational agent. You will build a chatbot that answers your questions. <br><br>
3. __Text to Speech:__ <br>In this part, you will convert the chatbot's text answer into speech. <br><br>
4. __Finalize your Conversational Based Agent:__ <br>In the final step, you will put everything together to create your Conversational Based Agent.

<br>

# 2. Chatbot

---


In this part, you will create a deep-learning-based conversational agent. This agent will be able to interact with users and understand their questions. More specifically, you will load the datasets, clean and preprocess them, and then feed them into a neural network.

<br>

### 2.1. Load and Clean the Chatterbot Dataset 

---

In this project, we have provided multiple dataset files. Each file contains conversations about a specific topic, such as humor, food, movies, science, or history. The table below describes each dataset:

| Name of Dataset | Description |
| :----:| :----: |
| botprofile.yml | Personality of Your Chatbot |
| humor.yml | Jokes and Humor |
| emotion.yml | Emotional Conversations |
| politics.yml | Political Conversations |
| ai.yml | General Questions about AI |
| computers.yml | Conversations about Computers |
| history.yml | Q&A about Historical Facts and Events |
| psychology.yml | Psychological Conversations |
| food.yml | Food Related Conversations |
| literature.yml | Conversations about Different Books, Authors, Genres |
| money.yml | Conversations about Money, Investment, Economy |
| trivia.yml | Trivia Questions and Answers |
| gossip.yml | Gossipy Conversations |
| conversations.yml | Common Conversations |
| greetings.yml | Different Ways of Greeting |
| sports.yml | Conversations about Sports |
| movies.yml | Conversations about Movies |
| science.yml | Conversations about Science |
| health.yml | Health Related Questions and Answers |


Feel free to modify these datasets to change the behavior of your model.

In [16]:
# Import the libraries
import yaml
from yaml import Loader
import glob
import datetime
import random
from tqdm import tqdm

In [101]:
# Function for loading all of the yml files
def load_chatterbot_dataset():
    
    # Initialize empty lists for questions and answers
    questions, answers = [], []
    
    # Get the list of all dataset names
    dataset_names = glob.glob("datasets/chatterbot/*.yml")
    
    # Iterate through each dataset name
    for i_dataset_name in tqdm(dataset_names):
        
        # Load the conversations from the dataset
        with open(i_dataset_name) as file:
            conversations = yaml.load(file, Loader = Loader)["conversations"]
            
        # Iterate through each conversation
        for i_conversation in conversations:
            
            # If length is two
            if len(i_conversation) == 2:
                
                # Append the question to 'questions' list
                questions.append(i_conversation[0])
                
                # Append the answer to 'answers' list
                answers.append(i_conversation[1])
            
            # If length is more than two
            elif len(i_conversation) > 2:
                
                # Pair the first utterance with every utterance that follows it
                for i_answer in i_conversation[1:]:
                    questions.append(i_conversation[0])
                    answers.append(i_answer)
                    
    return questions, answers

In [102]:
# Get the questions and answers
questions, answers = load_chatterbot_dataset()

100%|██████████| 19/19 [00:00<00:00, 89.35it/s]


In [103]:
print("Total Question & Answers: ", len(questions))

Total Question & Answers:  869
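
The pairing rule in `load_chatterbot_dataset` can be traced on a small in-memory example. The sketch below uses a hypothetical helper, `pair_conversation`, and made-up conversations rather than the real dataset files. It follows the same rule as the loader: the first utterance is treated as the question, and every later utterance is treated as an answer to it.

```python
def pair_conversation(conversation):
    """Return (question, answer) pairs for one conversation (hypothetical helper)."""
    # Conversations with fewer than two utterances yield no pairs.
    if len(conversation) < 2:
        return []
    question = conversation[0]
    # Every utterance after the first is paired with the first one.
    return [(question, answer) for answer in conversation[1:]]


# Made-up conversations, for illustration only.
sample_conversations = [
    ["Hello", "Hi there"],
    ["Tell me a joke", "Knock knock", "Why did the chicken cross the road?"],
    ["Only one line"],
]

pairs = [p for conv in sample_conversations for p in pair_conversation(conv)]
for question, answer in pairs:
    print(question, "->", answer)
```

Note that a three-utterance conversation therefore contributes two pairs with the same question, which is why the number of pairs can exceed the number of conversations.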


In [104]:
# Take a look at a few of the loaded questions and answers
total_questions = len(questions)
for i in range(4):
    j = random.randrange(total_questions)
    print("Question {}: \n".format(i), questions[j])
    print("")
    print("Answer {}: \n".format(i), answers[j])
    print("--------------------------------------------------------------------------")

Question 0: 
 Good morning, how are you?

Answer 0: 
 I'm also good.
--------------------------------------------------------------------------
Question 1: 
 What makes you sad

Answer 1: 
 Sadness is not an emotion that I like to experience.
--------------------------------------------------------------------------
Question 2: 
 Tell me a joke

Answer 2: 
 what do you get when you cross a dance and a cheetah?
--------------------------------------------------------------------------
Question 3: 
 you are emotional

Answer 3: 
 i certainly do at times.
--------------------------------------------------------------------------


<br>

### 2.2. Data Preprocessing

---

After cleaning the dataset, preprocess it with the following steps:

1. Lower case the text.
2. Decontract the text (e.g. she's -> she is, they're -> they are, etc.).
3. Remove the punctuation (e.g. !, ?, $, %, #, @, ^, etc.).
4. Tokenization.
5. Pad the sequences to be the same length.

In [105]:
# import the libraries
import numpy as np
import contractions
import re
from tensorflow.keras import preprocessing, utils
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [106]:
# Function for preprocessing the given text
def preprocess_text(text):
    
    # Lowercase the text
    text = text.lower()
    
    # Decontracting the text (e.g. it's -> it is)
    text = contractions.fix(text)
    
    # Remove the punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    return text

In [108]:
# Preprocess the questions and answers
questions = [preprocess_text(q) for q in questions]
answers = [preprocess_text(a) for a in answers]


In [109]:
# Take a look at the preprocessed questions and answers
total_questions = len(questions)
for i in range(4):
    j = random.randrange(total_questions)
    print("Question {}: \n".format(i), questions[j])
    print("")
    print("Answer {}: \n".format(i), answers[j])
    print("--------------------------------------------------------------------------")

Question 0: 
 what is the stock market

Answer 0: 
 trading shares 
--------------------------------------------------------------------------
Question 1: 
 how much do you earn

Answer 1: 
 i am expecting a raise soon 
--------------------------------------------------------------------------
Question 2: 
 chemistry

Answer 2: 
 my favorite subject is chemistry
--------------------------------------------------------------------------
Question 3: 
 robots are not allowed to lie

Answer 3: 
 sure we are   we choose not to 
--------------------------------------------------------------------------


To ensure that every training example is a string, any question-answer pair in which either item is not a string should be filtered out.

In [110]:
# Optional safeguard (disabled): keep only pairs where both items are strings
# pairs = [(q, a) for q, a in zip(questions, answers)
#          if isinstance(q, str) and isinstance(a, str)]
# questions = [q for q, _ in pairs]
# answers = [a for _, a in pairs]

After preprocessing the dataset, add a start tag and an end tag to each answer; the code below uses the plain words `starttoken` and `endtoken` for this. Only the answers get these tags, not the questions, because the Seq2Seq decoder needs explicit markers for where a generated response begins and ends.
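
One reason to use plain words such as `starttoken` and `endtoken` rather than literal `<START>` and `<END>` is that punctuation filtering during tokenization would strip the angle brackets. The sketch below imitates that filtering with the standard library only. `FILTERS` is assumed to match the default `filters` argument of the Keras `Tokenizer`, and `strip_filtered` is a hypothetical helper, not part of Keras.

```python
# Assumed copy of the Keras Tokenizer default `filters` string.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def strip_filtered(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


tagged = strip_filtered("<START> i am fine <END>")
worded = strip_filtered("starttoken i am fine endtoken")
print(tagged)
print(worded)
```

Under this assumption the angle-bracket tags collapse into the ordinary words `start` and `end`, which could also occur inside an answer, while the made-up words stay distinct.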

In [111]:
# Add the start and end tokens to each answer
answers = ['starttoken ' + a + ' endtoken' for a in answers]

In [112]:
for _ in range(5):
    print(random.choice(answers))

starttoken i am capable of interacting with my environment and reacting to events in it  which is the essence of experience   therefore  your statement is incorrect  endtoken
starttoken a computer is an electronic device which takes information in digital form and performs a series of operations based on predetermined instructions to give some output  endtoken
starttoken what do you want to know  endtoken
starttoken i certainly do not last as long as i would want to  endtoken
starttoken complex is better than complicated  endtoken


Now it's time to tokenize the dataset. We use the Keras `Tokenizer` class, which vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector whose coefficient for each token can be binary, a word count, a tf-idf weight, etc. Here we use the integer-sequence form.
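
To make the integer-sequence form concrete, the sketch below builds a word index by hand and turns sentences into sequences of integers. It is a simplified stand-in for what the Keras `Tokenizer` does, not its implementation: `build_word_index` and `texts_to_ids` are hypothetical helpers, and the corpus is made up. Index 0 is left unused so that it can serve as the padding value, which is also why the vocabulary size is the number of words plus one.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(texts, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]


corpus = ["how are you", "i am fine", "how are things"]
word_index = build_word_index(corpus)
sequences = texts_to_ids(corpus, word_index)
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0

print(word_index)
print(sequences)
print(vocab_size)
```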


In [113]:
# Initialize the tokenizer
tokenizer = preprocessing.text.Tokenizer()

# Fit the tokenizer to questions and answers
tokenizer.fit_on_texts(questions + answers)

# Get the total vocab size
VOCAB_SIZE = len(tokenizer.word_index) + 1

print( 'VOCAB SIZE : {}'.format(VOCAB_SIZE))

VOCAB SIZE : 1939


In [114]:
### encoder input data

# Tokenize the questions
tokenized_questions = tokenizer.texts_to_sequences(questions)

# Get the length of longest sequence
maxlen_questions = max([len(x) for x in tokenized_questions])

# Pad the sequences
padded_questions = pad_sequences(tokenized_questions, maxlen=maxlen_questions, padding='post')

# Convert the sequences into array
encoder_input_data = np.array(padded_questions)

print(encoder_input_data.shape, maxlen_questions)

(869, 22) 22
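
Post-padding can be illustrated without Keras. The sketch below is a simplified stand-in for `pad_sequences(..., padding='post')` that works on plain lists. `pad_post` is a hypothetical helper, the token ids are made up, and unlike the Keras function it does not truncate sequences longer than `maxlen`.

```python
def pad_post(sequences, maxlen, value=0):
    """Append `value` to each sequence until it has `maxlen` items."""
    return [seq + [value] * (maxlen - len(seq)) for seq in sequences]


token_ids = [[1, 2, 3], [4, 5], [6]]
maxlen = max(len(seq) for seq in token_ids)
padded = pad_post(token_ids, maxlen)
print(padded)
```

Every row ends up with the length of the longest sequence, which is what allows the questions to be stacked into a single two-dimensional array.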


In [115]:
### decoder input data

# Tokenize the answers
tokenized_answers = tokenizer.texts_to_sequences(answers)

# Get the length of longest sequence
maxlen_answers = max([len(x) for x in tokenized_answers])

# Pad the sequences
padded_answers = pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')

# Convert the sequences into array
decoder_input_data = np.array(padded_answers)

print(decoder_input_data.shape, maxlen_answers)

(869, 45) 45


In [116]:
### decoder_output_data

# Drop the leading start token from each answer so that the target
# sequence is the decoder input shifted one step ahead
for i in range(len(tokenized_answers)):
    tokenized_answers[i] = tokenized_answers[i][1:]

# Pad the tokenized answers
padded_answers = pad_sequences(tokenized_answers, maxlen = maxlen_answers, padding = 'post')

# One hot encode
onehot_answers = utils.to_categorical(padded_answers, VOCAB_SIZE)

# Convert to numpy array
decoder_output_data = np.array(onehot_answers)

print(decoder_output_data.shape)

(869, 45, 1939)
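
The relationship between the decoder input and the decoder output is easier to see on a single short answer. In the sketch below the token ids are made up (1 for `starttoken`, 2 for `endtoken`, 0 for padding), and `one_hot` is a hypothetical helper standing in for `utils.to_categorical`. The target is the input shifted one step to the left, so at each position the model is trained to predict the next token.

```python
def one_hot(ids, num_classes):
    """Turn each integer id into a list with a single 1 at that position."""
    return [[1 if k == i else 0 for k in range(num_classes)] for i in ids]


maxlen = 5
answer_ids = [1, 7, 4, 2]  # starttoken, two words, endtoken (made-up ids)

# Decoder input: the full answer, padded at the end.
decoder_input = answer_ids + [0] * (maxlen - len(answer_ids))

# Decoder target: the same answer without the start token, padded at the end.
shifted = answer_ids[1:]
decoder_target = shifted + [0] * (maxlen - len(shifted))

print(decoder_input)
print(decoder_target)
print(one_hot(decoder_target, num_classes=8)[0])
```

The one-hot step is what adds the third dimension to `decoder_output_data`: one vector of vocabulary size per position in each answer.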


In [117]:
# Saving all the arrays to storage
np.save("enc_in_data.npy", encoder_input_data)
np.save("dec_in_data.npy", decoder_input_data)
np.save("dec_out_data.npy", decoder_output_data)

In [None]:
# Save the tokenizer so it can be reused elsewhere together with the sequence model
import os
import pickle
os.makedirs("saved_models", exist_ok=True)
with open("saved_models/tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)
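
Loading the tokenizer back elsewhere is the mirror image of the cell above. The sketch below round-trips a plain dictionary, used here as a stand-in for the fitted tokenizer, through a temporary directory so that it runs without the real `saved_models/` folder. Only standard-library calls are used.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted tokenizer; any picklable object works the same way.
stand_in = {"word_index": {"how": 1, "are": 2, "you": 3}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.pkl")

    # Save in binary mode.
    with open(path, "wb") as f:
        pickle.dump(stand_in, f)

    # Load in binary mode.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in)
```

Since unpickling can execute arbitrary code, only load pickle files that you created yourself or otherwise trust.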