# Closed Domain Chatbot with **word vectors**

There are several ways to build a chatbot. In this section I use sentence similarity: I train word embeddings
on the question-and-answer data (here, the covid19-related-faqs dataset from Kaggle), then use these embeddings to compute the
similarity between two sentences (the user request and a question in the dataset). The bot picks the question most similar to the user request and
returns the corresponding answer.
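
To make the retrieval idea concrete, here is a minimal sketch that uses made-up 3-dimensional word vectors instead of a trained model. The vectors, the toy FAQ, and the helpers `sentence_vector`, `cosine` and `most_similar_question` are all assumptions for illustration. The notebook itself relies on gensim's `n_similarity`, which likewise compares the mean vectors of two word lists by cosine similarity.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, hand-written for illustration only.
toy_vectors = {
    "what": np.array([0.1, 0.0, 0.2]),
    "is": np.array([0.0, 0.1, 0.1]),
    "coronavirus": np.array([0.9, 0.1, 0.0]),
    "novel": np.array([0.8, 0.2, 0.1]),
    "community": np.array([0.0, 0.9, 0.1]),
    "spread": np.array([0.1, 0.8, 0.3]),
}

def sentence_vector(sentence):
    """Average the vectors of the in-vocabulary words of a sentence.

    Assumes the sentence contains at least one in-vocabulary word.
    """
    words = [w for w in sentence.lower().split() if w in toy_vectors]
    return np.mean([toy_vectors[w] for w in words], axis=0)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar_question(request, questions):
    """Return the stored question closest to the request, with its score."""
    request_vec = sentence_vector(request)
    scores = [cosine(request_vec, sentence_vector(q)) for q in questions]
    best = int(np.argmax(scores))
    return questions[best], scores[best]

faq = ["what is novel coronavirus", "what is community spread"]
best_question, score = most_similar_question("what is coronavirus", faq)
print(best_question, round(score, 3))
```

A real bot would then look up the answer stored for `best_question`, and fall back to a default message when `score` is below a chosen threshold.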

In [None]:
!pip install gensim nltk

In [None]:
# import the necessary libraries
import pandas as pd
import gensim
from nltk import word_tokenize
import numpy as np
from string import punctuation
import re
from nltk.stem import PorterStemmer
import nltk

stemmer = PorterStemmer ()
nltk.download ('punkt')

In [None]:
questions_answers_data = pd.read_csv('data/covid_faq.csv')
questions_answers_data.head ()

Unnamed: 0,questions,answers
0,What is a novel coronavirus?,A novel coronavirus is a new coronavirus that ...
1,Why is the disease being called coronavirus di...,"On February 11, 2020 the World Health Organiza..."
2,How does the virus spread?,The virus that causes COVID-19 is thought to s...
3,Can I get COVID-19 from food (including restau...,Currently there is no evidence that people can...
4,Will warm weather stop the outbreak of COVID-19?,It is not yet known whether weather and temper...


In [None]:
# Preprocess the data: strip punctuation, lowercase, and split into words
translator = str.maketrans('', '', punctuation)
covid19_data = [data.translate(translator).lower().split() for data in questions_answers_data.values.ravel()]
# Stem every word so that inflected forms of a word share one embedding
stemmed_data = [[stemmer.stem(word) for word in data] for data in covid19_data]
stemmed_data[0]

['what', 'is', 'a', 'novel', 'coronaviru']

In [None]:
# Build word embeddings (gensim >= 4.0 calls this parameter `epochs`; gensim 3.x called it `iter`)
model = gensim.models.word2vec.Word2Vec(stemmed_data, epochs=500)

In [None]:
# Get the words most similar to 'coronavirus'
stemmed_word = stemmer.stem ('coronavirus')
model.wv.most_similar (stemmed_word)

[('novel', 0.4405056834220886),
 ('way', 0.38052836060523987),
 ('thought', 0.356001079082489),
 ('2019', 0.32111015915870667),
 ('coronavirus', 0.28446969389915466),
 ('mainli', 0.27395784854888916),
 ('caus', 0.2720997631549835),
 ('we', 0.24451205134391785),
 ('current', 0.24158185720443726),
 ('eye', 0.23453915119171143)]

In [None]:
# Some examples of sentence similarity
question1 = "what is coronavirus"
question2 = "what is community spread"
question3 = "what is novel coronavirus"
q1 = [stemmer.stem (data) for data in word_tokenize(question1) if stemmer.stem (data) in model.wv]
q2 = [stemmer.stem (data) for data in word_tokenize(question2) if stemmer.stem (data) in model.wv]
q3 = [stemmer.stem (data) for data in word_tokenize(question3) if stemmer.stem (data) in model.wv]
print(f"similarity between (question1,question2) : {model.wv.n_similarity(q1, q2)} \nsimilarity between (question1,question3): {model.wv.n_similarity(q1,q3)}")

similarity between (question1,question2) : 0.5672325491905212 
similarity between (question1,question3): 0.9393385648727417


In [None]:


class chatbot ():

    def __init__(self, name = 'bot'):
        self.name = name
        self.data = {}

    def fit (self, data):
        # build the question -> answer dictionary
        for question, answer in data:
            self.data [question.lower ()] = answer.lower ()

    def get_response (self, request, threshold = 0.75):
        # find the stored question most similar to the user request and return the corresponding answer
        request_token = [stemmer.stem (word) for word in word_tokenize (request) if stemmer.stem (word) in model.wv] or ['to']
        best_similarity = 0.
        best_question = ''
        for question in self.data:
            question_token = [stemmer.stem (word) for word in word_tokenize (question.lower ()) if stemmer.stem (word) in model.wv]
            if not question_token:
                # skip questions with no in-vocabulary word: n_similarity needs non-empty lists
                continue
            similarity = model.wv.n_similarity (question_token, request_token)
            if similarity > best_similarity:
                best_similarity = similarity
                best_question = question
        if best_similarity < threshold:
            return "I am sorry, I do not understand, or I do not have enough information."
        return self.data.get (best_question)



In [None]:
covid19_bot = chatbot ('Covid19_bot')
covid19_bot.fit (questions_answers_data.to_numpy ())

In [None]:
while True:
    request = input('You: ')
    if request.lower() == 'bye':
        print('Bot : Bye... Let me know if you need help.')
        break
    response = covid19_bot.get_response(request.lower (), 0.5)
    print(f'{covid19_bot.name}: {response}')

You: What is novel coronavirus?
Covid19_bot: a novel coronavirus is a new coronavirus that has not been previously identified. the virus causing coronavirus disease 2019 (covid-19), is not the same as the coronaviruses that commonly circulate among humans and cause mild illness, like the common cold.
You: What are the symptoms?
Covid19_bot: people with covid-19 have reported a wide range of symptoms – from mild symptoms to severe illness. symptoms may appear 2-14 days after exposure to the virus. if you have fever, cough, or other symptoms, you might have covid-19.
You: What to do if someone get sick?
You: How to protect myself?
Covid19_bot: visit the how to protect yourself & others page to learn about how to protect yourself from respiratory illnesses, like covid-19.
You: Wearing a mask can help me?
Covid19_bot: cdc recommends that everyone 2 years and older wear a mask that covers their nose and mouth in public settings when around people not living in your household, particularly w

It works, but it is not perfect yet: as we can see, some questions are not answered well by the bot. One likely reason is that word vectors are learned only from word co-occurrence, so they carry limited semantic information. A deep neural network such as a Siamese network,
which is well suited to detecting duplicate questions, may be a good alternative.

# Open Domain Chatbot with Hugging Face transformers

Training a transformer from scratch is compute-intensive and time-consuming; it may take days or weeks. Thanks to the transformers library built by Hugging Face, data scientists can reduce training time
and computational cost, and state-of-the-art pretrained models are readily available. It is a great tool for organizations that build and deploy production-grade machine learning solutions.

In [None]:
!pip3 install transformers

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# specify the model name so that transformers can download it
model_name = "microsoft/DialoGPT-medium"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


Because a deep learning model does not take raw text as input, we need to convert the text into numerical data; this is the tokenizer's job.
<br> Each transformer has its own tokenizer, because each one was trained on a different text corpus.
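
As an illustration of the idea only, the sketch below maps words to integer ids with a tiny hand-built vocabulary and back again. The `ToyTokenizer` class and its two-sentence corpus are hypothetical. Real tokenizers, such as the one loaded above, split text into subword units learned from their training corpus, so the same text gets different ids from one model to another.

```python
class ToyTokenizer:
    """Word-level tokenizer for illustration; not the tokenizer DialoGPT uses."""

    def __init__(self, corpus):
        # Id 0 is reserved for unknown words; other ids follow first appearance.
        self.unk_token = "<unk>"
        self.token_to_id = {self.unk_token: 0}
        for sentence in corpus:
            for word in sentence.lower().split():
                self.token_to_id.setdefault(word, len(self.token_to_id))
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}

    def encode(self, text):
        """Turn a text into a list of integer ids (0 for unknown words)."""
        return [self.token_to_id.get(w, 0) for w in text.lower().split()]

    def decode(self, ids):
        """Turn a list of integer ids back into a text."""
        return " ".join(self.id_to_token[i] for i in ids)


toy_tokenizer = ToyTokenizer(["hello how are you", "it is a good day"])
ids = toy_tokenizer.encode("Hello it is a good day")
print(ids)
print(toy_tokenizer.decode(ids))
```

A word that was never seen when the vocabulary was built falls back to the unknown id here. Subword tokenizers avoid most of these cases by breaking rare words into smaller known pieces.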

In [None]:
text = 'Hello, it is a good day to die '
text_token = tokenizer.encode(text + tokenizer.eos_token, return_tensors = 'pt')
text_token

tensor([[15496,    11,   340,   318,   257,   922,  1110,   284,  4656,   220,
         50256]])

We will use the **model.generate()** method to generate text from the input. The DialoGPT documentation largely follows that of the GPT-2 transformer,
because DialoGPT is built on the GPT-2 architecture and was trained on conversations extracted from Reddit.

In [None]:

# generate() already returns a tensor of token ids; `return_tensors` is a tokenizer argument, not a generate() argument
output_token = model.generate(text_token)
# Decode the output
print ()
tokenizer.decode(output_token[0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.





'Hello, it is a good day to die <|endoftext|>Hello, it is a good day to die'

In [None]:
chat_history_ids = torch.LongTensor ([])
while True:
    request = input("You: ")
    if request.lower() == 'bye':
        print('OpenBot : It was nice chatting to you. Have a good day.')
        break
    input_token = tokenizer.encode(request + tokenizer.eos_token, return_tensors="pt")
    chat_history = torch.cat ([chat_history_ids, input_token], dim=-1)
    output_with_chat_history = model.generate(
        chat_history,
        max_length=1000,        # maximum total length in tokens (input plus generated response)
        do_sample=True,         # sample instead of decoding greedily, for more varied responses
        top_p=0.95,             # nucleus sampling: sample from the smallest set of tokens whose cumulative probability exceeds 0.95
        top_k=0,                # disable top-k sampling (which would keep only the k most probable tokens)
        temperature=0.5,        # values below 1.0 sharpen the distribution (closer to greedy search); 1.0 leaves it unchanged
        pad_token_id=tokenizer.eos_token_id
    )
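    response = tokenizer.decode(output_with_chat_history[:, chat_history.shape[-1]:][0], skip_special_tokens=True)
    print("OpenBot: {}".format(response))
    # keep the whole conversation so the next turn is generated with its context
    chat_history_ids = output_with_chat_history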

You: Hello
OpenBot: Hello! How are you?
You: L am fine, thank you
OpenBot: No problem. I'm glad you're okay!
You: Who is Zinedine zidane?
OpenBot: The guy that got the ball to Ronaldo on the goal.
You: So, who is Maradona ? 
OpenBot: The best player of all time.
You: Okay ...Do you know to be a good soccer player?
OpenBot: I know how to play soccer.
You: Are you a good one?
OpenBot: I'm a good one.
You: How to be rich?
OpenBot: Or just rich enough to have a family.
You: Yaeh, so how?
OpenBot: I just have a really good imagination.
You: So you can't give me a answer?
OpenBot: You can't ask a question.
You: Ok ...Bye
OpenBot: Bye... but not before I have a bit of a talk with you.
You: Bye
OpenBot : It was nice chatting to you. Have a good day.
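
The sampling options passed to `model.generate()` in the loop above can be illustrated on a toy next-token distribution. This is a rough sketch of temperature scaling followed by nucleus (top-p) filtering, written with NumPy. The function names `softmax` and `nucleus_filter` and the example scores are made up for illustration; the transformers library performs these steps internally in its own way.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def nucleus_filter(logits, temperature=1.0, top_p=0.95):
    """Apply temperature, keep the smallest set of tokens whose cumulative
    probability reaches top_p, and renormalize over the kept tokens."""
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    order = np.argsort(probs)[::-1]            # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    # Number of tokens needed for the cumulative mass to reach top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

toy_logits = [2.0, 1.0, 0.5, -1.0]             # hypothetical scores for 4 tokens
probs_t1 = nucleus_filter(toy_logits, temperature=1.0, top_p=0.95)
probs_t05 = nucleus_filter(toy_logits, temperature=0.5, top_p=0.95)
print(probs_t1)
print(probs_t05)

# Draw one token id from the filtered distribution.
rng = np.random.default_rng(0)
token_id = int(rng.choice(len(toy_logits), p=probs_t05))
print(token_id)
```

With these toy scores, lowering the temperature from 1.0 to 0.5 concentrates the probability mass on the top tokens, so fewer tokens survive the top-p cut and the sampled replies become less varied.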


As we can see, transformers can capture relationships between elements of a sequence that are far from each other, and their replies are more creative. These advantages come with a drawback, though: models with billions of parameters are hard to train.