# Rule-Based Chatbot using Wikipedia

Rule-based chatbots are relatively simple compared with learning-based chatbots. They work from a specific set of rules: if the user query matches a rule, the corresponding answer is returned; otherwise the user is told that no answer to the query exists.

Within the scope of their rules, these chatbots give accurate and predictable results, but they do not scale well: to add more responses, new rules must be defined.
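
The match-or-fallback behaviour described above can be illustrated with a minimal sketch. The rule table, the `answer` function and the replies are all hypothetical and are not part of the chatbot built later in this notebook.

```python
# Hypothetical rule table: each key is a trigger phrase, each value a canned answer.
RULES = {
    "opening hours": "We are open from 9am to 5pm.",
    "location": "We are at 1 Example Street.",
}
FALLBACK = "Sorry, I do not have an answer for that."


def answer(query: str) -> str:
    """Return the reply of the first rule whose trigger appears in the query."""
    text = query.lower()
    for trigger, reply in RULES.items():
        if trigger in text:
            return reply
    # No rule matched, so tell the user that no answer exists.
    return FALLBACK


print(answer("What are your opening hours?"))
print(answer("Tell me a joke"))
```

Supporting a new kind of question means adding another entry to `RULES`, which is the scaling limitation mentioned above.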

In this notebook, a rule-based chatbot is developed: the user names a topic, the matching Wikipedia article is scraped to build the corpus, and the chatbot answers questions from that corpus.

In [1]:
#libraries
import nltk
import numpy as np
import random
import string

import bs4 as bs #to parse the data from Wikipedia
import urllib.request
import re

## Define the query and generate the corpus

In [2]:
query = input("What do you want to learn about? ")

query_formatted = query.title()
query_formatted = '_'.join(query_formatted.split())

What do you want to learn about? United Kingdom
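
The query is turned into a Wikipedia article title by joining its words with underscores. As a hedged sketch, the hypothetical helper `build_wikipedia_url` below does the same and also percent-encodes the title with `urllib.parse.quote`, so that non-ASCII input still yields a valid URL. It deliberately skips `str.title()`, which capitalises every word ("isle of man" becomes "Isle Of Man") and may therefore not match the article title exactly.

```python
from urllib.parse import quote


def build_wikipedia_url(query: str) -> str:
    """Build an English Wikipedia article URL from free-text user input."""
    # split() drops leading/trailing whitespace and collapses repeated spaces.
    title = "_".join(query.split())
    # Percent-encode characters that are not safe in a URL path (e.g. accents).
    return "https://en.wikipedia.org/wiki/" + quote(title)


print(build_wikipedia_url("United Kingdom"))
print(build_wikipedia_url("  São   Paulo "))
```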


In [3]:
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/{}'.format(query_formatted))
raw_html = raw_html.read()

article_html = bs.BeautifulSoup(raw_html, 'lxml')

article_paragraphs = article_html.find_all('p')

article_text = ''

for para in article_paragraphs:
    article_text += para.text

article_text = article_text.lower()

## Text preprocessing

In [4]:
#remove numeric citation markers such as [12] and collapse repeated whitespace
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
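
To see what the two substitutions do, here is a small sketch on a made-up sample string (the text is illustrative, not taken from the scraped article).

```python
import re

sample = "The UK has a temperate climate.[12]   Rainfall\nvaries by region.[note 3]"

# Replace numeric citation markers such as [12] with a single space.
cleaned = re.sub(r'\[[0-9]*\]', ' ', sample)
# Collapse any run of whitespace (spaces, tabs, newlines) into one space.
cleaned = re.sub(r'\s+', ' ', cleaned)

print(cleaned)
# The UK has a temperate climate. Rainfall varies by region.[note 3]
```

Note that the first pattern only matches brackets containing digits (or nothing), so a marker such as `[note 3]` is left in place.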

In [5]:
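#divide text into sentences and words for cosine similarity
#the tokenizers need the punkt data; newer NLTK releases name it 'punkt_tab' instead
nltk.download('punkt')
article_sentences = nltk.sent_tokenize(article_text)
article_words = nltk.word_tokenize(article_text)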

In [6]:
#create helper functions to remove punctuation from input and lemmatise text
wnlemmatizer = nltk.stem.WordNetLemmatizer()

def perform_lemmatization(tokens):
    return [wnlemmatizer.lemmatize(token) for token in tokens]

punctuation_removal = dict((ord(punctuation), None) for punctuation in string.punctuation)

def get_processed_text(document):
    return perform_lemmatization(nltk.word_tokenize(document.lower().translate(punctuation_removal)))
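
The punctuation step relies on `str.translate` with a table that maps each punctuation code point to `None`. The sketch below isolates that step using only the standard library; `strip_punctuation` is a hypothetical name, and tokenisation and lemmatisation are left out.

```python
import string

# Same mapping as above: every ASCII punctuation code point maps to None,
# which tells str.translate to delete that character.
punctuation_removal = dict((ord(ch), None) for ch in string.punctuation)


def strip_punctuation(text: str) -> str:
    """Lower-case the text and delete ASCII punctuation."""
    return text.lower().translate(punctuation_removal)


print(strip_punctuation("What's the UK's climate like?"))
# whats the uks climate like
```

An equivalent table can be built with `str.maketrans('', '', string.punctuation)`.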

## Responding to user input
### Standard greeting responses

In [7]:
#define greeting responses
greeting_inputs = ("hello", "hiya", "hey", "good morning", "good evening", "morning", "evening", "hi", "whatsup")
greeting_responses = ["hey", "hey how are you?", "*nods*", "hello, how you doing", "hello", "Welcome", "greetings"]

#function to generate greeting response to user greeting
def generate_greeting_response(greeting):
    for token in greeting.split():
        if token.lower() in greeting_inputs:
            return random.choice(greeting_responses)
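
Because the function compares whole tokens, input such as "hello!" does not match "hello". One possible variant, shown here as a sketch with hypothetical names and a shortened response list, strips surrounding punctuation from each token before the lookup.

```python
import random
import string

GREETING_INPUTS = ("hello", "hiya", "hey", "morning", "evening", "hi", "whatsup")
GREETING_RESPONSES = ["hey", "hello", "greetings"]


def greeting_response(text):
    """Return a random greeting if any token is a known greeting, else None."""
    for token in text.lower().split():
        # Strip leading/trailing punctuation so "hello!" is treated as "hello".
        if token.strip(string.punctuation) in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None


print(greeting_response("Hello!"))
print(greeting_response("what is the climate"))
```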

### Generating responses to other user input

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#WordNet data is needed by the WordNetLemmatizer defined above
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/charlottefettes/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [9]:
#function to generate response to user input other than greeting
def generate_response(user_input):
    robo_response = ''
    article_sentences.append(user_input)

    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')
    all_word_vectors = word_vectorizer.fit_transform(article_sentences)
    
    #cosine similarity between the user input (the last row of all_word_vectors)
    #and every sentence vector, including the user input itself
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors)
    
    #argsort is ascending: the last index is normally the user input itself,
    #so the second to last is the best-matching corpus sentence
    similar_sentence_number = similar_vector_values.argsort()[0][-2]

    #take the second-highest similarity; 0 means no sentence shares a term with the query
    matched_vector = similar_vector_values.flatten()
    matched_vector.sort()
    vector_matched = matched_vector[-2]

    if vector_matched == 0:
        robo_response = robo_response + "I am sorry, I do not understand you"
        return robo_response
    else:
        robo_response = robo_response + article_sentences[similar_sentence_number]
        return robo_response
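
The retrieval idea in `generate_response` can be sketched without scikit-learn. The version below is a simplified illustration, not the notebook's implementation: it tokenises by whitespace, skips stop-word removal and lemmatisation, and uses the smoothed IDF form `ln((1 + n) / (1 + df)) + 1`. The function names and the tiny corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using smoothed IDF."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append(
            {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(sentences, query, fallback="I am sorry, I do not understand you"):
    """Return the corpus sentence most similar to the query, or the fallback."""
    if not sentences:
        return fallback
    # Vectorise corpus and query together so they share one vocabulary.
    vectors = tfidf_vectors(sentences + [query])
    query_vec = vectors[-1]
    scores = [cosine(query_vec, vec) for vec in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    # A score of 0 means no sentence shares a single term with the query.
    return sentences[best] if scores[best] > 0 else fallback


corpus = [
    "the climate is temperate with plentiful rainfall",
    "london is the capital city",
    "the economy is large",
]
print(best_match(corpus, "what is the climate"))
print(best_match(corpus, "quantum chromodynamics"))
```

Unlike `generate_response`, this sketch never appends the query to the caller's list, so nothing has to be removed from the corpus afterwards.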

## Generating a conversation with the Chatbot

In [12]:
continue_dialogue = True
print("Hello, I am your friend Robo. You can ask me any question regarding {}:".format(query))
while continue_dialogue:
    human_text = input().lower()
    if human_text == 'bye':
        continue_dialogue = False
        print("Robo: Good bye and take care of yourself...")
    elif human_text in ('thanks', 'thank you', 'thank you very much'):
        continue_dialogue = False
        print("Robo: Most welcome")
    else:
        #call the greeting function once and reuse its result
        greeting_response = generate_greeting_response(human_text)
        if greeting_response is not None:
            print("Robo: " + greeting_response)
        else:
            print("Robo: " + generate_response(human_text))
            #generate_response appended the input to the corpus; remove it again
            article_sentences.remove(human_text)

Hello, I am your friend Robo. You can ask me any question regarding United Kingdom:
what is the climate


Robo: higher elevations in scotland experience a continental subarctic climate (dfc) and the mountains experience a tundra climate (et).
thanks
Robo: Most welcome
