# Intelligent chatbots

In this section we show how to make your chatbot more intelligent. We cover two types of chatbots:

- rule-based chatbots with two examples,
- generative-based chatbots with an n-gram model example.

We also show a few working examples of generative-based chatbots that are open source.

![](images/chatbotsbulbs.png)

## Rule-based chatbots

A rule-based chatbot has a list of questions and answers. Usually, we can build simple scenarios that go through each question and respond with an answer, as illustrated below.

![](images/phrases_list.png)

### Simple scenario chatbot - Greg is your stock market advisor

This chatbot is a stock market advisor with a list of questions. The answers to these questions allow the chatbot to give the stock value for a given date. The data comes from the Alpha Vantage API, so you need to request an API key first.

In [1]:
import requests
import random

API_KEY = "5OODSC46TH6XNTCQ"
URL = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol="

welcome = "Hi! I'm Greg, your stock market advisor."

questions = (
    "What stock exchange would like to check? Please provide the stock exchange symbol.",
    "What stock from the stock exchange would like to check? Please use the stock index name.",
    "What date are you interested in?",
    "Should I print the maximum, minimum, opening or closing value? Choose one."
            )

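def get_stock_value(stock_request):
    # stock_request holds the answers in question order:
    # [exchange, stock symbol, date, value type]
    if stock_request[0].upper() != "NYSE":
        return "Stock exchange not supported"

    response = requests.get(URL + stock_request[1] + "&apikey=" + API_KEY)
    stock_data = response.json()["Time Series (Daily)"]
    # Simplification: a random trading day and the daily high are used,
    # regardless of the date and value type the user asked for.
    date = random.choice(list(stock_data.keys()))
    answer = ("The stock " + stock_request[3] + " for " + stock_request[1]
              + " on " + stock_request[2] + " is " + stock_data[date]['2. high'])
    return answer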
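
The function above always reports the daily high of a random date. As a minimal sketch, the requested date and value type could be looked up as shown here. The VALUE_FIELDS mapping and the pick_stock_value() helper are hypothetical names introduced for illustration, and the sample dictionary is hand-written rather than real market data. The sketch assumes the user types the date in the same YYYY-MM-DD form that the daily time series uses as keys.

```python
# Maps the user's wording to the field names of the daily time series.
# The wording on the left is an assumption made for this sketch.
VALUE_FIELDS = {
    "opening": "1. open",
    "maximum": "2. high",
    "minimum": "3. low",
    "closing": "4. close",
}

def pick_stock_value(stock_data, date, value_type):
    """Return the requested value for a date, or None if it is unavailable."""
    field = VALUE_FIELDS.get(value_type.strip().lower())
    day = stock_data.get(date)
    if field is None or day is None:
        return None
    return day.get(field)

# Hand-written sample shaped like the "Time Series (Daily)" dictionary
sample = {
    "2019-07-12": {"1. open": "1.00", "2. high": "4.00",
                   "3. low": "0.50", "4. close": "2.00"},
}
print(pick_stock_value(sample, "2019-07-12", "maximum"))  # known date
print(pick_stock_value(sample, "2019-07-13", "maximum"))  # unknown date
```

Returning None for an unknown date or value type lets the caller decide how to phrase the reply instead of raising a KeyError.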

The main part of the chatbot takes just a few lines. We loop over the questions, collect the answers, and use them to get the stock details.

In [2]:
def run_chatbot():
    print(welcome)
    answers = []
    # Ask every question in order and collect the user's answers
    for question in questions:
        print(question)
        answers.append(input())
    print(get_stock_value(answers))

run_chatbot()

Hi! I'm Greg, your stock market advisor.
What stock exchange would like to check? Please provide the stock exchange symbol.
NYSE
What stock from the stock exchange would like to check? Please use the stock index name.
GOOG
What date are you interested in?
7-12
Should I print the maximum, minimum, opening or closing value? Choose one.
maximum
The stock maximum for GOOG on 7-12 is 1199.0100


### Rule-based customer support chatbot

In this case we also need to set up a welcome message and a list of questions. This time the questions are potential customer questions. We also define an answer for each question, stored at the same position in the answers list.

In [5]:
welcome = "Hi! I'm Arthur, the customer support chatbot. How can I help you?"

questions = (
    "The app is freezing after I click run button",
    "I don't know how to proceed with the invoice",
    "I get an error when I try to install the app",
    "It crash after I have updated it",
    "I cannot login into the app",
    "I'm not able to download it"
            )

answers = (
        "You need to clean up the cache. Please go to ...",
        "Please go to Setting, next Subscriptions and there is the Billing section",
        "Could you plese send the log files placed in ... to ...",
        "Please restart your PC",
        "Use the forgot password button to setup a new password",
        "Probably you have an ad blocker plugin installed and it blocks the popup with the download link"
            )

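Most customer questions will not match the ones on our list exactly, but they can be similar. Let's define a function that measures the similarity with SequenceMatcher, whose ratio() method returns a value between 0 (nothing in common) and 1 (identical). The function returns the answer for the most similar question if its score exceeds a threshold.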

In [6]:
from difflib import SequenceMatcher

similarity_treshold = 0.2

def get_highest_similarity(customer_question):
    max_similarity = 0
    highest_index = 0
    for question_id in range(len(questions)):
        # ratio() returns a similarity between 0 and 1
        similarity = SequenceMatcher(None, customer_question, questions[question_id]).ratio()
        print(similarity)
        if similarity > max_similarity:
            highest_index = question_id
            max_similarity = similarity
    # Answer only when the best match is similar enough
    if max_similarity > similarity_treshold:
        return answers[highest_index]
    else:
        return "The issues has been saved. We will contact you soon."
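
To get a feeling for the scores, here is a small standalone sketch of the same idea. The best_match() helper is a hypothetical name, and lowercasing both strings before the comparison is an extra assumption that the function above does not make. SequenceMatcher compares characters, so differences in case would otherwise lower the ratio.

```python
from difflib import SequenceMatcher

def best_match(query, candidates):
    """Return (index, ratio) of the candidate most similar to the query."""
    scores = [SequenceMatcher(None, query.lower(), c.lower()).ratio()
              for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

candidates = ["I cannot login into the app", "I'm not able to download it"]
index, score = best_match("i cannot LOGIN into the app", candidates)
print(index, score)  # identical after lowercasing, so the ratio is 1.0

# Strings without any common character have a ratio of 0.0
print(SequenceMatcher(None, "abc", "xyz").ratio())
```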

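The main loop is again just a few lines of code. It keeps answering until the user types "thank you". Because get_highest_similarity() prints every score, the output shows the similarity of the input to each stored question.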

In [7]:
def run_chatbot():
    print(welcome)
    question = ""
    while question != "thank you":
        question = input()
        answer = get_highest_similarity(question)
        print(answer)
    
run_chatbot()

Hi! I'm Arthur, the customer support chatbot. How can I help you?
freeze
0.24
0.16
0.12
0.21052631578947367
0.06060606060606061
0.06060606060606061
You need to clean up the cache. Please go to ...
gfgfgg
0.08
0.0
0.04
0.05263157894736842
0.06060606060606061
0.0
The issues has been saved. We will contact you soon.
error
0.16326530612244897
0.08163265306122448
0.20408163265306123
0.10810810810810811
0.0625
0.125
Could you plese send the log files placed in ... to ...



### Exercise 1: Build a rule-based chatbot

Use different comparison methods on the list of questions to figure out which one gives the best results and why. Compare the method used above with the following two:

- normalized Levenshtein distance,
- NLP word vector similarity (use spaCy for it).

In [None]:
import jellyfish

distance_threshold = 0.3

def levenstein_distance(sentence1,sentence2):
    return 0.0

def get_highest_similarity(customer_question):
    max_distance = 0
    highest_prob_index = 0
    for question_id in range(len(questions)):
        distance = levenstein_distance(customer_question,questions[question_id])

        if distance > max_distance:
            highest_index = question_id
            max_distance = distance
    if max_distance > distance_threshold:
        return answers[highest_index]
    else:
        return "The issues has been saved. We will contact you soon."
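
One possible starting point, not the only solution: the skeleton above keeps the largest value, so it is convenient to turn the edit distance into a similarity between 0 and 1. The sketch below uses a plain dynamic-programming implementation as a stand-in for jellyfish.levenshtein_distance() so that it runs without extra packages. The names levenshtein() and normalized_similarity() are hypothetical, and dividing by the length of the longer string is just one common way to normalize.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def normalized_similarity(a, b):
    """Map the distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))          # 3 edits
print(normalized_similarity("login", "login"))   # identical strings
```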

Test your solution:

In [None]:
def run_chatbot():
    print(welcome)
    question = ""
    while question != "thank you":
        question = input()
        answer = get_highest_similarity(question)
        print(answer)
    
run_chatbot()

#### spaCy similarity

In [None]:
import spacy

similarity_treshold = 0.5

# Load the large English model, which includes word vectors
nlp = spacy.load("en_core_web_lg")

def get_highest_similarity(customer_question):
    max_similarity = 0
    highest_prob_index = 0
    for question_id in range(len(questions)):
        # put your code here
        similarity = 0 #
        #print(similarity)
        if similarity > max_similarity:
            highest_index = question_id
            max_similarity = similarity
    if max_similarity > similarity_treshold:
        return answers[highest_index]
    else:
        return "The issues has been saved. We will contact you soon."

In [None]:
def run_chatbot():
    print(welcome)
    question = ""
    while question != "thank you":
        question = input()
        answer = get_highest_similarity(question)
        print(answer)
    
run_chatbot()

## Generative-based chatbots

There are several generative-based chatbot solutions. Most are based on deep learning methods, although simpler statistical models are used as well. You can find many implementations using models and architectures like:

- n-gram model,
- recurrent neural network,
- autoencoders,
- generative adversarial network.

The autoencoder architecture looks like this:

![](images/autoencoders.png)

You can think of it as something similar to the zip compression algorithm, where the compressed file is the feature vector.

The generative adversarial network (GAN) architecture looks as follows:

![](images/gan.png)

Training is finished when the discriminator can no longer tell whether a text is generated or real.

### N-gram models

In this example, we use the Wall Street Journal corpus to generate new sentences. The corpus is available as part of the NLTK library. This example is based on Erroll Wood's work.

In [8]:
from nltk.book import *

wall_street = text7.tokens

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


We need to clean the text up and remove tokens that carry no meaning for us. The easiest way is to use a regular expression that keeps only the tokens starting with a letter, a digit, or one of the characters . ! ?:

In [9]:
import re

tokens = wall_street

def cleanup():
    # Keep only tokens that start with a letter, a digit, or . ! ?
    compiled_pattern = re.compile("^[a-zA-Z0-9.!?]")
    clean = list(filter(compiled_pattern.match, tokens))
    return clean

tokens = cleanup()
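
To see what the pattern keeps and drops, here is a short sketch on a hand-written token list (the sample tokens are an assumption for illustration, not read from the corpus). Only the first character of each token is checked, so a token such as "0" or "61" passes the filter while "," and "*-1" do not.

```python
import re

# Same pattern as above: the token must start with a letter, digit, or . ! ?
pattern = re.compile("^[a-zA-Z0-9.!?]")

sample_tokens = ["Pierre", "Vinken", ",", "61", "years", "old",
                 "*-1", "will", "join", "."]
kept = [token for token in sample_tokens if pattern.match(token)]
print(kept)
```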

The next step is to build n-grams: we slide a window over the tokens and group every N consecutive tokens into a list (three tokens for N=3). You can print the n-grams to inspect them.

In [13]:
def build_ngrams():
    ngrams = []
    for i in range(len(tokens)-N+1):
        ngrams.append(tokens[i:i+N])
    return ngrams
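
The sliding window is easier to follow on a tiny input. The sketch below is a variant that takes the tokens and the window size as arguments instead of reading the global variables; build_ngrams_from() and the sample sentence are hypothetical names and data used only for this illustration.

```python
def build_ngrams_from(tokens, n):
    """Return every run of n consecutive tokens as a list."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

sample = ["the", "market", "rose", "sharply", "."]
for ngram in build_ngrams_from(sample, 3):
    print(ngram)
```

Five tokens give three 3-grams, and a window larger than the input gives an empty list.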

The next step is to count, for the first N-1 tokens of every n-gram, how often each token follows them. If the same n-gram occurs more than once, its counts are summed. There are 85826 n-grams and 54677 entries in the frequency table.

In [10]:
def ngram_freqs(ngrams):
    counts = {}

    for ngram in ngrams:
        # The first N-1 tokens form the key, the last token is the continuation
        token_seq = SEP.join(ngram[:-1])
        last_token = ngram[-1]

        if token_seq not in counts:
            counts[token_seq] = {}

        if last_token not in counts[token_seq]:
            counts[token_seq][last_token] = 0

        counts[token_seq][last_token] += 1

    return counts
#ngram_freqs(ngrams)
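
The resulting structure is a dictionary of dictionaries: the key is the joined prefix and the value maps each following token to its count. A compact sketch of the same counting on three hand-written 3-grams, using defaultdict and Counter from the standard library (count_followers() is a hypothetical name, not part of the notebook):

```python
from collections import Counter, defaultdict

def count_followers(ngrams, sep=" "):
    """Count how often each last token follows the joined prefix."""
    counts = defaultdict(Counter)
    for ngram in ngrams:
        counts[sep.join(ngram[:-1])][ngram[-1]] += 1
    return counts

toy_ngrams = [
    ["the", "market", "rose"],
    ["the", "market", "fell"],
    ["the", "market", "rose"],
]
toy_counts = count_followers(toy_ngrams)
print(dict(toy_counts["the market"]))  # "rose" seen twice, "fell" once
```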

We choose the next word at random, weighted by how often each token followed the most recent N-1 tokens of the generated text.

In [11]:
def next_word(text, N, counts):
    # Look up the continuations of the last N-1 tokens of the text
    token_seq = SEP.join(text.split()[-(N-1):])
    choices = counts[token_seq].items()

    # Weighted random choice: frequent continuations are more likely
    total = sum(weight for choice, weight in choices)
    r = random.uniform(0, total)
    upto = 0
    for choice, weight in choices:
        upto += weight
        if upto >= r:
            return choice
    assert False # should not reach here
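
The cumulative-sum loop is a hand-written weighted random choice. As an alternative sketch, the standard library's random.choices() does the same job in one call. The sample_next() helper and the followers dictionary are hypothetical and only illustrate the idea; the seed is set just to make the demo repeatable.

```python
import random

def sample_next(followers, rng=random):
    """Pick one token with probability proportional to its count."""
    tokens = list(followers.keys())
    weights = list(followers.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

followers = {"rose": 2, "fell": 1}
rng = random.Random(0)  # seeded only to make the demo repeatable
draws = [sample_next(followers, rng) for _ in range(1000)]
# "rose" has twice the weight, so it should appear roughly twice as often
print(draws.count("rose"), draws.count("fell"))
```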

We need to set up a few parameters: the window size N, the number of sentences that we want to generate, and the start of the text. The start string must consist of N-1 words that occur together in our n-gram list.

In [14]:
import random

N=3 # fix it for other value of N

SEP=" "

sentence_count=5

ngrams = build_ngrams()
start_seq="We could"

counts = ngram_freqs(ngrams)

if start_seq is None:
    start_seq = random.choice(list(counts.keys()))
generated = start_seq.lower()

sentences = 0
while sentences < sentence_count:
    generated += SEP + next_word(generated, N, counts)
    sentences += 1 if generated.endswith(('.','!', '?')) else 0

print(generated)

we could have done this in public because so little sensitive information was disclosed the aide added 0 are filled with the recent march in Washington Mrs. Yeargin admitted 0 she offered Mrs. Yeargin on a strip of Sixth Avenue populated by jugglers magicians and other war-rationed goodies . She did n't just use old fashioned bribery . It the classic problem of social disaffiliation a mental health problem a timing issue which he spoke at length with Chinese leaders expressed no regret for the current benchmark 30-year bond that was reported in the economy and traders note that Japanese capital may produce the economic prospects of a slowing economy are increasing pressure on the priority list because of a rather basic concept Two separate markets in different locations trading basically the same company Macmillan\/McGraw-Hill a joint venture with Ronald Bodner a glass industry executive and Mitsubishi Oil rose 50 to 75 cents for each 105 common shares 0 they needed 10,000 parking spac

### Exercise 2: n-gram model

There are three tasks to do:

- Use a 5-gram model instead of a 3-gram model.
- Capitalize the first letter of each sentence.
- Remove the whitespace between the last word of a sentence and the . ! or ? that ends it.

**Hint**: for the second and third tasks, implement a function called clean_generated() that takes the generated text and fixes both issues at once. It can be easier to fix the text after it is generated rather than changing the while loop.

In [None]:
import random

def clean_generated(generated):
    # fill the code here
    return generated

N=5

SEP=" "

sentence_count=5

ngrams = build_ngrams()

start_seq="Was named a nonexecutive"

counts = ngram_freqs(ngrams)

if start_seq is None:
    start_seq = random.choice(list(counts.keys()))
generated = start_seq.lower()

sentences = 0
while sentences < sentence_count:
    generated += SEP + next_word(generated, N, counts)
    sentences += 1 if generated.endswith(('.','!', '?')) else 0


print(clean_generated(generated))