# World Health Organization (WHO) COVID-19 Chatbot

## INDEX: <a class="anchor" id="index"></a>
* [1. Project overview](#first-bullet)
* [2. What is a chatbot?](#second-bullet)
* [3. Data retrieval and intents creation](#third-bullet)
    * [Scrapers](#scrapers)
    * [Dataset creation (GUI)](#dataset)
* [4. Data preprocessing](#fourth-bullet)
* [5. Neural Network Architecture](#fifth-bullet)
    * [Classes definition](#network_architecture)
    * [Classes instantiation](#instantiation)
* [6. Training the model](#sixth-bullet)
* [7. Chatting](#seventh-bullet)
    * [Speech Recognition Module](#speech_recognition)
    * [Chatbot GUI](#chatbot_gui)

## 1. Project Overview <a class="anchor" id="first-bullet"></a>
[return to index](#index)

The aim of this project is to create a support chatbot able to answer questions about COVID-19. All the answers are taken from the <a href="https://www.who.int/emergencies/diseases/novel-coronavirus-2019/question-and-answers-hub/q-a-detail/coronavirus-disease-covid-19"> World Health Organization Frequently Asked Questions (FAQ)</a>. The chatbot relies on Natural Language Processing techniques and machine learning algorithms.

### Modules:
- requests
- BeautifulSoup (bs4)
- json
- re
- tkinter
- numpy
- nltk
- torch
- random
- speech_recognition
- string

## 2. What is a Chatbot? <a class="anchor" id="second-bullet"></a>
[return to index](#index)

A chatbot, or chatterbot, is software developed to simulate a dialogue with a human being. Chatbots are used in dialogue systems for various purposes, including customer service, request routing, and information gathering.

There are two types of chatbots:
- <b>Rule-Based:</b> chatbots based on pre-set rules to which the user's question must conform in order to get an answer. Rule-Based chatbots can deal only with simple queries.
- <b>Self-Learning:</b> chatbots that use Machine Learning approaches to process users' queries, in particular Natural Language Processing (NLP) techniques. There are two types of Self-Learning chatbots:
    - <b>Retrieval Based: </b>models that use some heuristic to select a response from a library of predefined responses.
    - <b>Generative: </b>models that are able to generate new answers.

In this project a Retrieval Based chatbot has been implemented.
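
To make the retrieval-based idea concrete, here is a minimal sketch in plain Python. It is not the model used in this project (that is the neural network of section 5): the word-overlap heuristic, the `LIBRARY` dictionary and the `retrieve` function are assumptions introduced only for illustration.

```python
# Hypothetical library of predefined responses: tag -> (keywords, response).
LIBRARY = {
    "greeting": ({"hi", "hello", "day"}, "Hi there, how can I help?"),
    "goodbye": ({"bye", "goodbye", "later"}, "Have a nice day"),
}

FALLBACK = "I don't understand. Please, repeat the question."


def retrieve(query):
    """Return the predefined response whose keywords overlap most with the query."""
    # Very rough tokenization: lowercase, drop '?' and '!', split on whitespace.
    words = set(query.lower().replace("?", " ").replace("!", " ").split())
    best_tag, best_score = None, 0
    for tag, (keywords, _) in LIBRARY.items():
        score = len(words & keywords)  # number of shared words
        if score > best_score:
            best_tag, best_score = tag, score
    if best_tag is None:
        return FALLBACK
    return LIBRARY[best_tag][1]


print(retrieve("Hello there!"))
print(retrieve("What time is it?"))
```

The chatbot never writes a new sentence: it only picks one of the stored responses, or falls back to a default message when nothing matches.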

## 3. Data retrieval and intents creation <a class="anchor" id="third-bullet"></a>
[return to index](#index)

The dataset used for this chatbot is small and restricted to a limited number of topics that users may ask about. These topics are called <b>intents</b> and are structured as follows:

```python
{"intents": [
        {"tag": str,
         "patterns": [str, str, ...],
         "responses": [str, str, ...]
        },
    ...
   ]
}
```

- <b>tag:</b> name of the intent 
- <b>patterns:</b> list of possible questions about the tag 
- <b>responses:</b> list of answers, any of which can be returned for the "patterns" questions
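
As an illustration of this structure, the sketch below builds a one-intent dataset (values borrowed from the greeting intent of this project) and checks its shape. The `is_valid_intent` helper is hypothetical and is not part of the notebook.

```python
import json

# Example dataset with a single intent.
example = {
    "intents": [
        {
            "tag": "greeting",
            "patterns": ["Hi", "Hello", "Good day"],
            "responses": ["Hi there, how can I help?"],
        }
    ]
}


def is_valid_intent(intent):
    """Check that an intent has a string tag and lists of strings for patterns/responses."""
    if not isinstance(intent.get("tag"), str):
        return False
    for key in ("patterns", "responses"):
        values = intent.get(key)
        if not isinstance(values, list):
            return False
        if not all(isinstance(value, str) for value in values):
            return False
    return True


# Round-trip through JSON, since the dataset is stored as a JSON file.
restored = json.loads(json.dumps(example))
print(all(is_valid_intent(intent) for intent in restored["intents"]))
```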

<i>Tags</i> used in this project are listed in <a href="tags.txt">tags.txt</a>. <i>Questions</i> are scraped from websites that publish FAQs about COVID-19; the URLs of the scraped websites are listed in <a href="covid_faq.txt">covid_faq.txt</a>. All the <i>answers</i> provided by the chatbot are the official ones from the <a href="https://www.who.int/emergencies/diseases/novel-coronavirus-2019/question-and-answers-hub/q-a-detail/coronavirus-disease-covid-19"> World Health Organization</a>.

In [25]:
import requests
from bs4 import BeautifulSoup
import json
import re

In [26]:
def create_intent(question,response,tag):

    """ 
    Given a pattern, a response, and a tag, creates an intent (dict). 
    """
    d = {"tag": tag,
         "patterns": [question],
         "responses": [response]
         }

    return d


def init_dataset(patterns,responses,tags):

    """ 
    Initialises the dataset with some standard intents and the WHO FAQs. 
    """
    dataset = { "intents" : [
        {"tag": "greeting",
         "patterns": ["Hi", "How are you", "Is anyone there?", "Hello", "Good day"],
         "responses": ["Hello, thanks for visiting", "Good to see you again", "Hi there, how can I help?"],
         },
        {"tag": "goodbye",
         "patterns": ["Bye", "See you later", "Goodbye"],
         "responses": ["See you later, thanks for visiting", "Have a nice day", "Bye! Come back again soon."]
         },
        {"tag": "thanks",
         "patterns": ["Thanks", "Thank you", "That's helpful"],
         "responses": ["Happy to help!", "Any time!", "My pleasure"]
         }
     ]
    }
    
    #creates an intent for every WHO FAQ
    for index in range(len(patterns)):
        dataset["intents"].append(create_intent(patterns[index], responses[index], tags[index]))

    return dataset


def get_source(url_):
    
    """
    Given a url as input, returns the BeautifulSoup object of the page.
    Re-raises the requests exception if the request fails.
    """
    try:
        response = requests.get(url_)
        response.raise_for_status()
    except requests.RequestException as error:
        # `response` may not exist here (e.g. connection error), so report the exception itself
        print("%%% Try again. Request failed:", error, "%%%\n")
        raise
    print("%%% Succesfully connected to the url with status code", response.status_code, "%%%\n")
    return BeautifulSoup(response.content, 'html.parser')


def raw_data(name, link):
    """
    Bundles the name, the url, and the parsed html of a website.
    """
    raw = {'name': name,
           'url': link,
           'html': get_source(link)}
    return raw


In [27]:
#open the list of urls to scrape 
with open('covid_faq.txt', encoding='utf-8') as file:
    lines = file.readlines()
#each non-empty line has the form "name,url"; split only on the first comma
websites = [line.strip().split(',', 1) for line in lines if line.strip()]

#raw dataset with name, url and BeautifulSoup object of every website
raw_dataset = {'websites' : [raw_data(website[0],website[1]) for website in websites]}

%%% Succesfully connected to the url with status code 200 %%%

%%% Succesfully connected to the url with status code 200 %%%

%%% Succesfully connected to the url with status code 200 %%%

%%% Succesfully connected to the url with status code 200 %%%

%%% Succesfully connected to the url with status code 200 %%%



### Scrapers <a class="anchor" id="scrapers"></a>

In [28]:
def who_scraper(source):
    
    """
    Scrapes questions and answers from World Health Organization website FAQ
    """
    
    sections = source.find_all("div", class_="sf-accordion")

    questions = []
    for section in sections:
        for question in section.find_all("a", class_="sf-accordion__link"):
            questions.append(question.get_text().strip())

    answers = []
    for section in sections:
        for item in section.find_all("div", class_="sf-accordion__content"):
            #collapse newlines and runs of whitespace into a single space
            text = re.sub(r"\s+", " ", item.get_text(" ", strip=True))
            answers.append(text)

    return questions,answers


def hopkins_scraper(source):
    sections = source.find_all("div", class_="rtf")
    questions = []
    for section in sections:
        if section.find("h2") is not None:
            faq = section.find_all("h2")
            faq = [question.get_text() for question in faq]
            questions.extend(faq)
    return questions


def nicd_scraper(source):
    sections = source.find_all("div", class_="elementor-accordion-item")
    questions = []
    for section in sections:
        for question in section.find_all("a", class_="elementor-accordion-title"):
            questions.append(question.get_text().strip().lower())

    return questions
  

def nsw_scraper(source):
    elements = source.find("div", class_="ms-rtestate-field").find_all("h3")
    questions = [faq.get_text().strip("\n").replace("\xa0", " ") for faq in elements]
    return questions


def penn_scraper(source):
    sections = source.find_all("section", class_="js-tabs__content")
    questions = [faq.find("h3").get_text().strip() for faq in sections]
    return questions
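
The scrapers above all follow the same pattern: find the elements that carry a known CSS class and collect their text. The sketch below shows that pattern on a hand-written HTML fragment. It deliberately swaps BeautifulSoup for the standard-library `html.parser` so that it runs without extra dependencies; the `AccordionQuestions` class and the `SAMPLE` markup are assumptions made for illustration, not the real WHO page.

```python
from html.parser import HTMLParser


class AccordionQuestions(HTMLParser):
    """Collect the text of every <a> element with class "sf-accordion__link"."""

    def __init__(self):
        super().__init__()
        self.questions = []
        self._inside = False  # True while we are inside a matching <a>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "sf-accordion__link" in classes:
            self._inside = True
            self.questions.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.questions[-1] += data


# Hypothetical fragment that imitates the accordion markup targeted by who_scraper.
SAMPLE = """
<div class="sf-accordion">
  <a class="sf-accordion__link"> What is COVID-19? </a>
  <div class="sf-accordion__content"><p>Placeholder answer.</p></div>
</div>
"""

parser = AccordionQuestions()
parser.feed(SAMPLE)
parser.close()
questions = [question.strip() for question in parser.questions]
print(questions)
```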

In [40]:
who_patterns, who_responses = who_scraper(raw_dataset["websites"][0]["html"])

#opening the list of intents' tags (one tag per line)
with open("tags.txt") as file:
    tags = file.read().splitlines()
        
dataset = init_dataset(who_patterns, who_responses, tags)

with open("intents.json", "w") as out:
    json.dump(dataset, out)

#scraping all the questions     
all_questions = []    
for item in raw_dataset["websites"]:
    if item["name"] == 'hopkins':
        all_questions.extend(hopkins_scraper(item["html"]))
    elif item["name"] == 'nicd':
        all_questions.extend(nicd_scraper(item["html"]))
    elif item["name"] == 'nsw':
        all_questions.extend(nsw_scraper(item["html"]))
    elif item["name"] == 'penn':
        all_questions.extend(penn_scraper(item["html"]))

### Dataset creation (GUI) <a class="anchor" id="dataset"></a>

In [30]:
def get_intent(tag, dataset):
    """
    Returns an intent given a tag. 
    """
    for item in dataset['intents']:
        if item['tag'] == tag:
            return item

In [49]:
import tkinter as tk
from tkinter import *
from tkinter.ttk import *
from tkinter import scrolledtext

window = Tk()
window.geometry('800x550')
window.title("Modify intents")
window.grid_columnconfigure(0, weight=1)

#tag selection
tags_lbl = Label(window, text="Choose a tag to see the related intent", font=("Roboto", 14))
tags_lbl.grid(column=0, row=0)

combo_tags = Combobox(window, values=tags, font=("Roboto", 10))

def callback(eventObject):
    #intent display
    selected_tag = combo_tags.get()
    intent_text = scrolledtext.ScrolledText(window, width=80, height=10, font=("Roboto", 10))
    intent_data = get_intent(selected_tag,dataset)
    output = json.dumps(intent_data, indent=True, ensure_ascii=True)
    intent_text.insert(INSERT, output)
    intent_text.grid(column=0,row=2,pady=10)
    
    #question list
    list_lbl = Label(window, text="Choose one or more patterns to add: ", font=("Roboto",10))
    list_lbl.grid(row=3,column=0)
    
    frame = Frame(window)
    frame.grid(row=4, column=0, padx=10, pady=10)
    
    list_patterns = Listbox(frame, width=90, height=10, font=("Roboto", 10), selectmode=MULTIPLE)
    list_patterns.pack(side="left", fill="y")

    scrollbar = Scrollbar(frame, orient="vertical")
    scrollbar.config(command=list_patterns.yview)
    scrollbar.pack(side="right", fill="y")

    list_patterns.config(yscrollcommand=scrollbar.set)

    for faq in all_questions:
        list_patterns.insert(END, faq)
    
    #user input text area
    frame_entry = Frame(window)
    frame_entry.grid(row=5, column=0, padx=5, pady=5)
    entry_lbl = Label(frame_entry, text="Or manually add a new pattern: ", font=("Roboto", 10))
    entry_lbl.grid(row=0, column=0)
    new_faq = Entry(frame_entry,width=50)
    new_faq.grid(row=0, column=1)
    
    #buttons section
    frame_btn = Frame(window)
    frame_btn.grid(row=6, column=0, padx=5, pady=5)
    
    def save_intent(dataset,text_area,entry_text,selected_items):
        #patterns selected in the listbox
        faqs = selected_items.curselection()
        new_patterns = [selected_items.get(faq) for faq in faqs]

        #pattern typed manually by the user, if any
        if entry_text.strip() != "":
            new_patterns.append(entry_text.strip())

        for item in dataset['intents']:
            if item['tag'] == selected_tag:
                item['patterns'].extend(new_patterns)

        #refreshing the displayed intent
        intent_data = get_intent(selected_tag,dataset)
        output = json.dumps(intent_data, indent=True, ensure_ascii=True)
        text_area.delete("1.0", "end")
        text_area.insert(END, output)
        
    
    btn_update = tk.Button(frame_btn, text="Update intent",
                            command= lambda : save_intent(dataset,intent_text,new_faq.get(),list_patterns))
    btn_update.grid(row=0, column=1, padx=5)
    
    def save_dataset(dataset):
        with open("intents.json", "w") as out:
            json.dump(dataset, out)
        print("dataset saved!")
    
    
    btn_save = tk.Button(frame_btn, text="Save new dataset", command= lambda : save_dataset(dataset))
    btn_save.grid(row=0, column=2, padx=5)

combo_tags.grid(row=1, column=0, pady=10)
combo_tags.current(1)
combo_tags.set("Choose a tag")
combo_tags.bind("<<ComboboxSelected>>", callback)


window.mainloop()

## 4. Data preprocessing <a class="anchor" id="fourth-bullet"></a>
[return to index](#index)

The data collected in the previous section need to be converted into numerical values before they can be used as input to the neural network (defined in section [5](#fifth-bullet)). 

To do so, a <b>vocabulary</b> is built first. Then, using the vocabulary and the intents' patterns as input, numerical values are obtained with the <b>Bag of Words (BoW)</b> method: each pattern becomes a vector with one entry per vocabulary word, set to 1 if the (stemmed) word occurs in the pattern and to 0 otherwise.

<img src="vocabulary.svg">
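
The two steps can be sketched on a toy example. This is only an illustration in plain Python: `simple_tokenize` is a stand-in for `nltk.word_tokenize`, no stemming is applied, and the two patterns are invented; all `toy_` names are hypothetical.

```python
import string


def simple_tokenize(text):
    """Lowercase, strip punctuation, split on whitespace (stand-in for nltk.word_tokenize)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def build_toy_vocabulary(patterns):
    """Sorted list of the unique tokens found in all patterns."""
    return sorted({token for pattern in patterns for token in simple_tokenize(pattern)})


def toy_bag_of_words(vocabulary, tokens):
    """1.0 where the vocabulary word occurs in the sentence, 0.0 elsewhere."""
    present = set(tokens)
    return [1.0 if word in present else 0.0 for word in vocabulary]


toy_patterns = ["What is COVID-19?", "How does COVID-19 spread?"]
toy_vocabulary = build_toy_vocabulary(toy_patterns)
print(toy_vocabulary)
# ['covid19', 'does', 'how', 'is', 'spread', 'what']

toy_bag = toy_bag_of_words(toy_vocabulary, simple_tokenize("How does it spread"))
print(toy_bag)
# [0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
```

Words that are not in the vocabulary ("it" in the query above) are simply ignored, and word order is lost: only presence or absence is encoded.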

In [8]:
import json
import re
import numpy as np
import nltk
import string
nltk.download('punkt')
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Daniel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [9]:
with open("intents/intents.json", encoding="utf8") as file:
    intents = json.load(file)

# with open("intents.json") as file:
#     intents = json.load(file)

In [10]:
def tokenize(text):
    return nltk.word_tokenize(text.lower())


def stemming(word):
    return stemmer.stem(word)


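def bag_of_words(vocabulary, tokenized_sent):
    """
    Returns a vector with one entry per vocabulary word:
    1 if the word occurs (as a stem) in the tokenized sentence, 0 otherwise.
    """
    stem_sent = {stemming(word) for word in tokenized_sent}
    bag = np.zeros(len(vocabulary), dtype=np.float32)

    for index,word in enumerate(vocabulary):
        if word in stem_sent:
            bag[index] = 1
  
    return bag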


def remove_punctuation(vocabulary):
    """
    Removes punctuation tokens from the input list and returns the stems of the remaining words.
    """
    punctuation = set(string.punctuation)
    return [stemming(word) for word in vocabulary if word not in punctuation]

In [11]:
vocabulary = []
tags = []
pattern_tag = []

for intent in intents["intents"]:
    tag = intent["tag"]
    tags.append(tag)
    for pattern in intent["patterns"]:
        tokens = tokenize(pattern)
        vocabulary.extend(tokens)
        pattern_tag.append((tokens,tag))

#removing punctuation from vocabulary and get the stem of every word
vocabulary = remove_punctuation(vocabulary)
#deleting duplicates
vocabulary = list(set(vocabulary))
tags = set(tags)
#sorting elements
vocabulary = sorted(vocabulary)
tags = sorted(tags)

* training: 
    - <i>type</i> : np array (one row per pattern)
    - <i>content</i> : <b>Bag of Words (BoW)</b> of every pattern (question)

* output: 
    - <i>type</i> : np array of integers
    - <i>content</i> : for every pattern (question), the <b>index</b> of the corresponding tag in the sorted list of tags
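
A small sketch of how the two structures line up, using plain lists instead of np arrays and invented (tokens, tag) pairs; all the `toy_` names are hypothetical.

```python
# Toy (tokens, tag) pairs; in the notebook these come from the intents file.
toy_pattern_tag = [(["hi"], "greeting"), (["bye"], "goodbye"), (["hello"], "greeting")]

# Sorted vocabulary and sorted tags, as in the preprocessing step.
toy_vocabulary = sorted({token for tokens, _ in toy_pattern_tag for token in tokens})
toy_tags = sorted({tag for _, tag in toy_pattern_tag})

# One BoW row per pattern, and one tag index per pattern.
toy_training = [
    [1.0 if word in tokens else 0.0 for word in toy_vocabulary]
    for tokens, _ in toy_pattern_tag
]
toy_output = [toy_tags.index(tag) for _, tag in toy_pattern_tag]

print(toy_vocabulary)  # ['bye', 'hello', 'hi']
print(toy_tags)        # ['goodbye', 'greeting']
print(toy_training)    # [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(toy_output)      # [1, 0, 1]
```

Row `i` of the training data and element `i` of the output always refer to the same pattern, which is what the data loader relies on when it returns `(training[index], output[index])`.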

In [12]:
training = []
output = []

for pattern,tag in pattern_tag:
    bag = bag_of_words(vocabulary, pattern)
    training.append(bag)
    output.append(tags.index(tag))

training = np.array(training)
output = np.array(output) 

## 5. Neural Network Architecture <a class="anchor" id="fifth-bullet"></a>
[return to index](#index)

### Feedforward neural network: <a class="anchor" id="network_architecture"></a>
* <i>input</i> : BoW of the user's query
* <i>output</i> : predicted tag 
* <i>structure</i> :
    * input : size equal to the number of words in the vocabulary
    * hidden layers : two layers of size 8
    * output : size equal to the number of tags 
    * activation function: <a href= "https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">rectifier</a> (ReLU), i.e. the positive part of its argument $$f(x) = x^+ = \max(x,0)$$ $$f(x) = \begin{cases} x, & \mbox{if } x \geq 0 \\ 0, & \mbox{if } x < 0 \end{cases} $$
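
To see the rectifier at work, the sketch below applies one linear layer followed by ReLU to a tiny input, in plain Python. The weights, the bias and the helper names `relu` and `linear` are made up for illustration; the real layers are the `nn.Linear` modules defined in the next cell.

```python
def relu(values):
    """Positive part of each element: max(x, 0)."""
    return [max(value, 0.0) for value in values]


def linear(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical BoW input with 3 vocabulary words and a hidden layer of size 2.
x = [1.0, 0.0, 1.0]
W1 = [[0.5, -1.0, 0.5],
      [-1.0, 2.0, -1.0]]
b1 = [0.0, 0.5]

pre_activation = linear(W1, b1, x)
hidden = relu(pre_activation)
print(pre_activation)  # [1.0, -1.5]
print(hidden)          # [1.0, 0.0]
```

The negative pre-activation is clipped to zero, while the positive one passes through unchanged.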

In [14]:
# Hyper-parameters 
num_epochs = 1500
batch_size = 8
learning_rate = 0.001
input_size = len(training[0])
hidden_size = 8
output_size = len(tags)
# print(input_size, output_size)

In [15]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class Chatbot(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(Chatbot, self).__init__()
        self.l1 = nn.Linear(input_size, hidden_size) 
        self.l2 = nn.Linear(hidden_size, hidden_size) 
        self.l3 = nn.Linear(hidden_size, num_classes)
        #activation function: rectifier
        self.relu = nn.ReLU()
    
    def forward(self, x):
        out = self.l1(x)
        out = self.relu(out)
        out = self.l2(out)
        out = self.relu(out)
        out = self.l3(out)
        return out

class ChatDataset(Dataset):

    def __init__(self):
        self.n_samples = len(training)
        self.training = training
        self.output = output

    def __getitem__(self, index):
        return self.training[index], self.output[index]

    def __len__(self):
        return self.n_samples

### Classes instantiation <a class="anchor" id="instantiation"></a>

In [16]:
dataset = ChatDataset()

train_loader = DataLoader(dataset=dataset,
                          batch_size=batch_size,
                          shuffle=True,
                          num_workers=0)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Chatbot(input_size, hidden_size, output_size).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [17]:
#summary: model architecture
print(model)

Chatbot(
  (l1): Linear(in_features=110, out_features=8, bias=True)
  (l2): Linear(in_features=8, out_features=8, bias=True)
  (l3): Linear(in_features=8, out_features=20, bias=True)
  (relu): ReLU()
)
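
The summary above reports 110 input features, two hidden layers of size 8 and 20 output classes. As a quick sanity check on the model size, the number of trainable parameters can be worked out by hand: each linear layer has `in_features * out_features` weights plus `out_features` biases. The `linear_params` helper below is a hypothetical name used only for this calculation.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# (in_features, out_features) of l1, l2, l3 as printed in the model summary.
layers = [(110, 8), (8, 8), (8, 20)]

per_layer = [linear_params(i, o) for i, o in layers]
total = sum(per_layer)
print(per_layer)  # [888, 72, 180]
print(total)      # 1140
```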


## 6. Training the model <a class="anchor" id="sixth-bullet"></a>
[return to index](#index)

In [18]:
# Train the model
for epoch in range(num_epochs):
    for (words, labels) in train_loader:
        words = words.to(device)
        labels = labels.to(dtype=torch.long).to(device)
        
        # Forward pass
        outputs = model(words)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step() # Updates the parameters
        
    if (epoch+1) % 100 == 0:
        print (f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')


print(f'final loss: {loss.item():.4f}')

data = {
    "model_state": model.state_dict(),
    "input_size": input_size,
    "hidden_size": hidden_size,
    "output_size": output_size,
    "all_words": vocabulary,
    "tags": tags
}

FILE = "data.pth"
torch.save(data, FILE)

print(f'training complete. file saved to {FILE}')

Epoch [100/1500], Loss: 0.7294
Epoch [200/1500], Loss: 0.1913
Epoch [300/1500], Loss: 0.0217
Epoch [400/1500], Loss: 0.0072
Epoch [500/1500], Loss: 0.0033
Epoch [600/1500], Loss: 0.1076
Epoch [700/1500], Loss: 0.0007
Epoch [800/1500], Loss: 0.0009
Epoch [900/1500], Loss: 0.0005
Epoch [1000/1500], Loss: 0.0003
Epoch [1100/1500], Loss: 0.0002
Epoch [1200/1500], Loss: 0.1008
Epoch [1300/1500], Loss: 0.0002
Epoch [1400/1500], Loss: 0.0000
Epoch [1500/1500], Loss: 0.0001
final loss: 0.0001
training complete. file saved to data.pth
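
Note that the value printed every 100 epochs is `loss.item()` for the last mini-batch of that epoch only, which may be why the log shows isolated jumps (for example at epochs 600 and 1200) in an otherwise decreasing sequence. A possible alternative, sketched below under that assumption, is to log the sample-weighted mean of all the batch losses in the epoch. The `epoch_mean_loss` helper and the numbers are hypothetical.

```python
def epoch_mean_loss(batch_losses, batch_sizes):
    """Sample-weighted mean of per-batch mean losses."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical per-batch losses for one epoch: the last batch is an outlier.
losses = [0.001, 0.002, 0.1]
sizes = [8, 8, 4]

print(f"last batch: {losses[-1]:.4f}")
print(f"epoch mean: {epoch_mean_loss(losses, sizes):.4f}")
```

In the training loop this would amount to accumulating `loss.item() * len(labels)` over the batches and dividing by the number of samples at the end of each epoch.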


## 7. Chatting! <a class="anchor" id="seventh-bullet"></a>
[return to index](#index)

In [19]:
import random

def input_processing(text,vocabulary):
    
    """ Input: User's query
        Output: Chatbot response OR error message """
    
    text = tokenize(text)
    X = bag_of_words(vocabulary, text)
    X = X.reshape(1, X.shape[0])
    X = torch.from_numpy(X).to(device)

    # inference only: no gradients are needed
    with torch.no_grad():
        output = model(X)
    _, predicted = torch.max(output, dim=1)

    tag = tags[predicted.item()]

    # confidence of the predicted tag
    probs = torch.softmax(output, dim=1)
    prob = probs[0][predicted.item()]
    if prob.item() > 0.80:
        for intent in intents['intents']:
            if tag == intent["tag"]:
                return "BOT:\n" + random.choice(intent['responses']) + "\n"

    # low confidence, or no intent found for the predicted tag
    return "BOT:\n" + "I don't understand. Please, repeat the question." + "\n"
    
    
def refresh_chat(query,vocabulary,list_msg):
    question = "YOU:\n" + query + "\n" 
    answer = input_processing(query,vocabulary)
    list_msg.insert(END, question)
    list_msg.insert(END, answer)
    return question,answer
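
The decision rule used in `input_processing` (softmax over the network outputs, then a 0.80 confidence threshold) can be illustrated without PyTorch. The sketch below uses the standard library only; the logits, the three toy tags and the `pick_tag` helper are invented for the example.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]  # shift for numerical stability
    total = sum(exps)
    return [value / total for value in exps]


def pick_tag(logits, tags, threshold=0.80):
    """Return the most probable tag, or None if its probability is not above the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return tags[best] if probs[best] > threshold else None


toy_tags = ["goodbye", "greeting", "thanks"]

confident = pick_tag([0.1, 4.0, 0.2], toy_tags)  # one score clearly dominates
uncertain = pick_tag([1.0, 1.2, 0.9], toy_tags)  # scores are close to each other
print(confident)  # greeting
print(uncertain)  # None
```

When no tag is returned, the chatbot answers with the "I don't understand" message instead of guessing.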

### Speech Recognition Module <a class="anchor" id="speech_recognition"></a>

In [20]:
import speech_recognition as sr

def audio_to_text(vocabulary,list_msg):
    
    # create recognizer and mic instances
    recognizer = sr.Recognizer()
    microphone = sr.Microphone()
    
    # adjust the recognizer sensitivity to ambient noise and record audio from the microphone
    with microphone as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
        
    # set up the response object
    response = {
        "success": True,
        "error": None,
        "transcription": None
    }    
        
    # try recognizing the speech in the recording, managing possible errors
    try:
        response["transcription"] = recognizer.recognize_google(audio)
    except sr.RequestError:
        # API was unreachable or unresponsive
        response["success"] = False
        response["error"] = "API unavailable"
    except sr.UnknownValueError:
        # speech was unintelligible
        response["error"] = "Unable to recognize speech"
        
    if response["transcription"]:
        refresh_chat(response["transcription"],vocabulary,list_msg)
    elif not response["success"]:
        message = "\nERROR: API was unreachable or unresponsive.\n"
        list_msg.insert(END,message)
    elif response["error"]:
        message = "\nERROR: " + response["error"] + ".\n"
        list_msg.insert(END,message)

### Chatbot GUI <a class="anchor" id="chatbot_gui"></a>

In [None]:
import tkinter as tk
from tkinter import *
from tkinter.ttk import *
from tkinter import scrolledtext

# loading the pre-trained model

data = torch.load("data.pth")

input_size = data["input_size"]
hidden_size = data["hidden_size"]
output_size = data["output_size"]
all_words = data['all_words']
tags = data['tags']
model_state = data["model_state"]

model = Chatbot(input_size, hidden_size, output_size).to(device)
model.load_state_dict(model_state)
# switch to evaluation mode (changes the behaviour of dropout and batch normalization layers, if any)
model.eval()

window = Tk()
window.geometry('700x550')
window.title("Chatbot")
window.grid_columnconfigure(0, weight=1)


tags_lbl = Label(window, text="What do you want to know about COVID-19?", font=("Roboto", 13))
tags_lbl.grid(column=0, row=0, pady=15)

# chat window
frame = Frame(window)
frame.grid(row=1, column=0, padx=15, pady=10)
list_msg = Text(frame, width=90, height=23, font=("Roboto", 10))
list_msg.pack(side="left", fill="y")
scrollbar = Scrollbar(frame, orient="vertical")
scrollbar.config(command=list_msg.yview)
scrollbar.pack(side="right", fill="y")
list_msg.config(yscrollcommand=scrollbar.set)

# user query window
frame_entry = Frame(window)
frame_entry.grid(row=2, column=0, padx=10, pady=5)
query = Entry(frame_entry,width=80)
query.grid(row=0, column=0, padx=5)
# use the vocabulary loaded from data.pth, so that it matches the loaded model
btn_audio = tk.Button(frame_entry, text="record", command= lambda: audio_to_text(all_words,list_msg))
btn_audio.grid(row=0, column=1, padx=2)
btn_send = tk.Button(frame_entry, text="send", command= lambda: refresh_chat(query.get(),all_words,list_msg))
btn_send.grid(row=0, column=2, padx=2)

# end conversation and close the window
btn_quit = tk.Button(window, text="quit", command=window.destroy)
btn_quit.grid(row=3, column=0, pady=15)

window.mainloop()