# Chatbot

Build a **corpus-based conversational chatbot using NLTK and Python**, using this [reference tutorial](https://medium.com/swlh/a-chatbot-in-python-using-nltk-938a37a9eacc) for guidance and ideas.
<br><br>

Perform the following tasks:
1. Use the dataset provided in the tutorial, or develop your own dataset with similar structure.
2. Perform text normalisation: convert text to lowercase, remove special characters, and perform lemmatisation; remove any stopwords.
3. Use word embeddings such as bag of words and TF-IDF, and compute cosine similarity.
4. Compare the performance and results of the two methods, i.e. bag of words (BOW) and TF-IDF.
5. Customise using any of the NLP techniques we have learned.

In [2]:
import nltk, re # NLTK library of language resources
nltk.download('omw-1.4')

# Part of speech tagging and tokenisation
from nltk import pos_tag, word_tokenize

# to perform lemmatisation
from nltk.stem import wordnet, WordNetLemmatizer

# stopwords
from nltk.corpus import stopwords

import json

# to perform bow and tfidf
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# For cosine similarity
from sklearn.metrics import pairwise_distances

# Data processing and visualisation
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image



[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Import the dataset (corpus) into a pandas dataframe

In [3]:
df = pd.read_csv("jawa.csv")
df.head(15)

Unnamed: 0,Context,Text response,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
0,Kapan mangan Soto Lamongan?,Wes kangen banget karo rasane Soto Lamongan si...,,,,,,,
1,Piye rasane Soto Lamongan?,Aku seneng banget karo panganan Soto Lamongan ...,,,,,,,
2,ngendi dheweke arep mangan pempek?,"dheweke milih ngrasakake khas daerah, yaiku pe...",,,,,,,
3,ngendi padha mangan pitik?,padha seneng banget karo masakan pitik ing War...,,,,,,,
4,ngendi dheweke ngombe dawet?,dheweke sregep ngrasakake segelas dawet ijo se...,,,,,,,
5,Apa sing dadi spesial bakso Malang?,Bakso Malang misuwur kanthi saos kental sing e...,,,,,,,
6,Ing endi papan sing paling apik kanggo nikmati...,Sampeyan bisa nyoba sarapan tradisional Malang...,,,,,,,
7,Apa sing ndadekake jajanan pinggir dalan ing A...,Jajan jalanan ing Alun-Alun Malang populer ama...,,,,,,,
8,Apa ana panganan cuci mulut sing patut dicoba ...,"Kanggo panganan cuci mulut, sampeyan kudu nyob...",,,,,,,
9,Ing ngendi aku bisa nemokake warung kopi sing ...,Sampeyan bisa nemokake warung kopi sing nyaman...,,,,,,,


In [4]:
df = df.iloc[:, :2]

In [5]:
df.shape

(1002, 2)

There are 1002 entries in the dataset (see `df.shape` above), each giving one context and its response in the conversation. Similar questions are grouped together, and NaN values indicate that the response is the same as in the previous entry.


In [6]:
# Replace null values with previous value
df.ffill(axis = 0, inplace = True)
df.head(10)

Unnamed: 0,Context,Text response
0,Kapan mangan Soto Lamongan?,Wes kangen banget karo rasane Soto Lamongan si...
1,Piye rasane Soto Lamongan?,Aku seneng banget karo panganan Soto Lamongan ...
2,ngendi dheweke arep mangan pempek?,"dheweke milih ngrasakake khas daerah, yaiku pe..."
3,ngendi padha mangan pitik?,padha seneng banget karo masakan pitik ing War...
4,ngendi dheweke ngombe dawet?,dheweke sregep ngrasakake segelas dawet ijo se...
5,Apa sing dadi spesial bakso Malang?,Bakso Malang misuwur kanthi saos kental sing e...
6,Ing endi papan sing paling apik kanggo nikmati...,Sampeyan bisa nyoba sarapan tradisional Malang...
7,Apa sing ndadekake jajanan pinggir dalan ing A...,Jajan jalanan ing Alun-Alun Malang populer ama...
8,Apa ana panganan cuci mulut sing patut dicoba ...,"Kanggo panganan cuci mulut, sampeyan kudu nyob..."
9,Ing ngendi aku bisa nemokake warung kopi sing ...,Sampeyan bisa nemokake warung kopi sing nyaman...
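
The forward fill used above can be illustrated without pandas. The sketch below is a minimal pure-Python stand-in for what `ffill` does to a single column; the helper name `forward_fill` and the sample values are hypothetical, and `None` plays the role of NaN.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value before it."""
    filled = []
    last_seen = None  # nothing to carry forward until a value appears
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled

# Hypothetical response column with gaps
responses = ["reply A", None, None, "reply B", None]
print(forward_fill(responses))
```

A leading gap has nothing before it to copy, so it stays empty, just as pandas leaves a leading NaN in place.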


* Text normalisation (lower case, remove special characters, lemmatisation)
* Word embedding (bag of words (BOW), TF-IDF)
* Cosine similarity
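
For reference, the cosine similarity between an utterance vector $u$ and a corpus vector $v$ (one row of the BOW or TF-IDF matrix) is given by the standard formula below. scikit-learn's `pairwise_distances` with `metric="cosine"` returns the cosine distance, which is one minus this value. The symbols $u$, $v$ and the small worked example are introduced here for illustration only.

```latex
\cos(u, v) \;=\; \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
          \;=\; \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}

% Worked example with two count vectors over a three-word vocabulary:
% u = (1, 1, 0), \quad v = (1, 0, 1)
% u \cdot v = 1, \quad \lVert u \rVert = \lVert v \rVert = \sqrt{2}
% \cos(u, v) = \frac{1}{\sqrt{2}\,\sqrt{2}} = 0.5
```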

In [7]:
df1 = df.head(10)

In [8]:
def normalise_text(text):
    '''
    Function takes a text string (utterance) as input,
    converts to lowercase,
    removes special characters and punctuation,
    tokenises, POS-tags and lemmatises each token.
    Joins lemmatised tokens, and returns lemmatised string.

    '''

    # convert to lowercase
    text = str(text).lower()

    # remove special characters
    text = re.sub(r'[^a-z0-9]', " ",text)

    # tokenise
    tokens = word_tokenize(text)

    # Initialise lemmatiser
    lemmatiser = wordnet.WordNetLemmatizer()

    # Part of speech (POS) tagging, tagset set to default
    tagged_tokens =  pos_tag(tokens, tagset = None)

    # Empty list
    token_lemmas = []
    for (token, pos_token) in tagged_tokens:
        if pos_token.startswith("V"): # verb
            pos_val = "v"
        elif pos_token.startswith("J"): # adjective
            pos_val = "a"
        elif pos_token.startswith("R"): # adverb
            pos_val = "r"
        else:
            pos_val = 'n' # noun

        # lemmatise and append to list of lemmatised tokens
        token_lemmas.append(lemmatiser.lemmatize(token, pos_val))

    return " ".join(token_lemmas)

In [11]:
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [15]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [16]:
# apply the normalise_text function to each entry in the context column
df["lemmatised_text"] = df["Context"].apply(normalise_text)
df.head()

Unnamed: 0,Context,Text response,lemmatised_text
0,Kapan mangan Soto Lamongan?,Wes kangen banget karo rasane Soto Lamongan si...,kapan mangan soto lamongan
1,Piye rasane Soto Lamongan?,Aku seneng banget karo panganan Soto Lamongan ...,piye rasane soto lamongan
2,ngendi dheweke arep mangan pempek?,"dheweke milih ngrasakake khas daerah, yaiku pe...",ngendi dheweke arep mangan pempek
3,ngendi padha mangan pitik?,padha seneng banget karo masakan pitik ing War...,ngendi padha mangan pitik
4,ngendi dheweke ngombe dawet?,dheweke sregep ngrasakake segelas dawet ijo se...,ngendi dheweke ngombe dawet
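
In the output above, the lemmatised column matches the lowercased input with punctuation stripped. That is expected: the WordNet lemmatiser targets English and returns a token unchanged when it finds no matching WordNet entry. A minimal sketch of the same cleaning steps without NLTK follows. The name `normalise_simple` is hypothetical, and the function skips POS tagging and lemmatisation entirely, so treat it as an assumption about what may suffice for this corpus rather than a replacement for `normalise_text`.

```python
import re

def normalise_simple(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9]", " ", text)
    return " ".join(text.split())

print(normalise_simple("Kapan mangan Soto Lamongan?"))
```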


### Remove stopwords

In [17]:
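def remove_stopwords(text):
    '''
    Function takes a text string as input and removes stopwords.
    Uses the NLTK English list, which was not built for Javanese text.
    '''

    # set of stopwords, for fast membership tests
    stop = set(stopwords.words("english"))

    # keep only the tokens that are not stopwords
    text = [word for word in text.split() if word not in stop]
    return " ".join(text)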
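
Because the NLTK list used here is English, Javanese function words are left in place. One possible customisation is a small hand-written list. The sketch below assumes a hypothetical set `JAVANESE_STOPWORDS` and a hypothetical helper `remove_stopwords_jv`; the words are illustrative picks, not a vetted resource.

```python
# Hypothetical, hand-picked list of common Javanese function words.
# It is illustrative only and would need review by a Javanese speaker.
JAVANESE_STOPWORDS = {"sing", "ing", "lan", "karo", "iku", "kanggo", "saka"}

def remove_stopwords_jv(text, stop=JAVANESE_STOPWORDS):
    """Drop tokens found in the stopword set (case-insensitive match)."""
    kept = [word for word in text.split() if word.lower() not in stop]
    return " ".join(kept)

print(remove_stopwords_jv("pecel sing enak ing madiun"))
```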

### Bag of words (BOW)

In [18]:
# count vectoriser
cv = CountVectorizer()
X = cv.fit_transform(df["lemmatised_text"]).toarray()

features = cv.get_feature_names_out()
df_bow = pd.DataFrame(X, columns = features)
df_bow.head()

Unnamed: 0,10k,abang,abdi,acara,aceh,adat,adhedhasar,adhem,adol,akeh,...,wong,wonogiri,wonten,wujude,yaiku,yang,yen,yogyakarta,yogyakata,youtuber
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Utterance
`s` is the string input to the chatbot.

In [24]:
# Utterance preprocessing - remove stopwords and normalise text

#s = "I can't believe how tasty the Korean is!!! We should go there again!"
#s = "What's the weather like tomorrow?"
s = "pecel!"

t = remove_stopwords(s)
print(t)

u = normalise_text(t)
print(u)

pecel!
pecel


In [20]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [25]:
# Convert the preprocessed utterance to bag of words
u_bow = cv.transform([u]).toarray()

# Apply cosine similarity to the utterance to search for
# the most similar utterance in the training dataset BOW
cosine_value = 1 - pairwise_distances(df_bow, u_bow, metric = "cosine")

# cosine_value calculates similarity between utterance and each entry in the dataset

# print the question
print(s)

# Get the index of the most similar entry
index_value1 = cosine_value.argmax()

# Get the response of the most similar entry
df.loc[index_value1,"Text response"]

pecel!


'Pecel sambal rampal nggunakake sayuran ijo kayata kangkung sing dilapisi sambel kacang pedhes sing disuguhake kanthi tambahan kayata tempe.'

In [26]:
# Initialise sklearn tfidf
tfidf = TfidfVectorizer()
x_tfidf = tfidf.fit_transform(df["lemmatised_text"]).toarray()
df_tfidf = pd.DataFrame(x_tfidf, columns = tfidf.get_feature_names_out())
df_tfidf.head()

Unnamed: 0,10k,abang,abdi,acara,aceh,adat,adhedhasar,adhem,adol,akeh,...,wong,wonogiri,wonten,wujude,yaiku,yang,yen,yogyakarta,yogyakata,youtuber
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [44]:
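# Example utterances: each assignment overwrites the previous one,
# so only the last uncommented line is passed to the chatbot
s = "opo iku gudeg?"
s = "Rawon enak?"
s = "Dpiye cara masak pecel?"
s = "mie gacoan!"
s = "sore enak e mangan opo?"
s = "panganan jogja seng enak?"
#s = "ojo lali mangan sego goreng!"
s = "panganan khas madiun?"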


def chat_tfidf(text):

    # Lemmatised utterance
    text = normalise_text(text)
    print(text)

    text_tfidf = tfidf.transform([text]).toarray()
    cos = 1 - pairwise_distances(df_tfidf, text_tfidf, metric = "cosine")
    index_value = cos.argmax()
    return df["Text response"].loc[index_value]

chat_tfidf(s)

panganan khas madiun


'Kutha Madiun misuwur kanthi panganan khas pecel'
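
Task 4 asks for a comparison of the two methods. The sketch below is one way to see where they can disagree, using a pure-Python reimplementation rather than the fitted `cv` and `tfidf` objects. The toy corpus, the query and all helper names are invented for illustration. The IDF uses the smoothed form ln((1 + n) / (1 + df)) + 1, which is the default in scikit-learn's `TfidfVectorizer`.

```python
import math
from collections import Counter

# Invented toy corpus: five rows share common question words, one does not
corpus = [
    "ngendi mangan soto",
    "ngendi mangan pecel",
    "ngendi mangan dawet",
    "ngendi mangan bakso",
    "ngendi mangan rawon",
    "gudeg jogja enak banget",
]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def idf_table(docs):
    """Smoothed inverse document frequency for every term in docs."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}

def bow_vector(text):
    """Raw term counts."""
    return dict(Counter(text.split()))

def tfidf_vector(text, idf):
    """Term counts scaled by IDF; terms unseen in the corpus are ignored."""
    return {t: c * idf[t] for t, c in Counter(text.split()).items() if t in idf}

def best_match(query_vec, doc_vecs):
    """Index of the corpus row with the highest cosine similarity."""
    scores = [cosine(query_vec, vec) for vec in doc_vecs]
    return scores.index(max(scores))

idf = idf_table(corpus)
query = "ngendi mangan gudeg"

bow_best = best_match(bow_vector(query), [bow_vector(d) for d in corpus])
tfidf_best = best_match(tfidf_vector(query, idf), [tfidf_vector(d, idf) for d in corpus])

print("BOW match:   ", corpus[bow_best])
print("TF-IDF match:", corpus[tfidf_best])
```

In this toy setup, raw counts favour a row that shares the two common question words, while TF-IDF favours the row that shares the rarer word `gudeg`. Whether the same effect shows up on the real corpus would need to be checked by running both `cv` and `tfidf` on the same set of utterances.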

In [39]:
!pip install gTTS



Collecting gTTS
  Downloading gTTS-2.4.0-py3-none-any.whl (29 kB)
Installing collected packages: gTTS
Successfully installed gTTS-2.4.0


In [45]:
from gtts import gTTS
from IPython.display import Audio, display

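def text_to_speech(text):
    # Synthesise speech with gTTS; lang='id' selects Indonesian
    tts = gTTS(text=text, lang='id')
    tts.save("output.mp3")

    # Return an audio widget that plays the saved file in the notebook
    return Audio("output.mp3", autoplay=True)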


text_response = chat_tfidf(s)
audio_widget = text_to_speech(text_response)
display(audio_widget)


panganan khas madiun


# Storing a list of conversation topics

In [29]:
def get_conversation_topics(text):
    '''
    Function takes a text string (utterance) as input,
    converts to lowercase, removes special characters and punctuation,
    tokenises, POS-tags and returns the sorted noun lemmas.
    Falls back to verb lemmas, then to a list of stock replies.
    '''
    text = str(text).lower() # convert to lowercase
    text = re.sub(r'[^a-z0-9]', " ", text) # remove special characters
    tokens = word_tokenize(text) # tokenise
    # Initialise lemmatiser
    lemmatiser = wordnet.WordNetLemmatizer()
    # Part of speech (POS) tagging, tagset set to default
    tagged_tokens = pos_tag(tokens, tagset = None)

    # Empty lists to store nouns and verbs from input
    noun_lemmas = []
    verb_lemmas = []
    for (token, pos_token) in tagged_tokens:
        if pos_token.startswith("V"): # verb
            verb_lemmas.append(lemmatiser.lemmatize(token, "v"))
        elif pos_token.startswith("NN"): # noun
            noun_lemmas.append(lemmatiser.lemmatize(token, "n"))

    if len(set(noun_lemmas)) > 0:
        return sorted(set(noun_lemmas))
    elif len(set(verb_lemmas)) > 0:
        return sorted(set(verb_lemmas))
    else:
        return ["I am not sure... What do you think?",
                "I need time to think about that. Do you have other ideas?",
                "Hmmm"
                ]

In [31]:
# Use the TF-IDF match if cosine similarity >= 0.8
# Otherwise pick a topic word from the input to seed a generated response
def chat_extend(chat_input):

    # Lemmatised utterance
    text = normalise_text(chat_input)
    #print(text)

    text_tfidf = tfidf.transform([text]).toarray()
    cos = 1 - pairwise_distances(df_tfidf, text_tfidf, metric = "cosine")

    # Use cosine similarity with TF-IDF when the best match is strong
    if cos.max() >= 0.8:
        index_value = cos.argmax()
        return df["Text response"].loc[index_value]
    else:
        topic_options = get_conversation_topics(chat_input)
        topic = random.choice(topic_options)
        return topic # To use as seed to generate bot response




In [32]:

s = "pecel enak!"

chat_extend(s) # Use output as seed for response generation
#### work in progress

'enak'

In [33]:
s="opo iku nasi padang?"
chat_extend(s)

'iku'

## MODIFICATIONS
* Extend the training corpus: use the NLTK chat corpus for BOW
* Hybrid: introduce rules to integrate rule-based conversation
* Adapt BERT or GPT-3
* Add voice output
* Javanese dataset
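
As a hedged illustration of the hybrid idea, a few hand-written rules could be checked before falling back to the retrieval model. Everything in this sketch is hypothetical: the `RULES` table, the `hybrid_reply` helper and the canned replies are placeholders, and `fallback` stands in for a function such as `chat_tfidf`.

```python
import re

# Hypothetical rules: (compiled pattern, canned reply)
RULES = [
    (re.compile(r"\bsugeng enjing\b"), "Sugeng enjing!"),
    (re.compile(r"\bmatur nuwun\b"), "Sami-sami!"),
]

def hybrid_reply(text, fallback):
    """Return the first matching rule's reply, else defer to fallback(text)."""
    lowered = text.lower()
    for pattern, reply in RULES:
        if pattern.search(lowered):
            return reply
    return fallback(text)

# Stand-in for the retrieval chatbot, used only to make the sketch runnable
def echo_fallback(text):
    return "retrieval: " + text

print(hybrid_reply("Matur nuwun, mas", echo_fallback))
print(hybrid_reply("panganan khas madiun?", echo_fallback))
```

Keeping the rules in a plain list makes their order explicit: the first pattern that matches wins, and anything unmatched goes to the retrieval step.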

## REFERENCES

* [Chatbot tutorial by Bhargava Sai Reddy P](https://medium.com/swlh/a-chatbot-in-python-using-nltk-938a37a9eacc)
* [A Sentiment-Based Chat Bot](https://docslib.org/doc/25782/a-sentiment-based-chat-bot) [accessed 5 December 2022]
* [How To Make AI Chatbot In Python Using NLP (NLTK) In 2022?, Pykit, 2022](https://pykit.org/chatbot-in-python-using-nlp/) [accessed 5 December 2022]
* [How To Create A Chatbot with Python & Deep Learning In Less Than An Hour](https://towardsdatascience.com/how-to-create-a-chatbot-with-python-deep-learning-in-less-than-an-hour-56a063bdfc44) [accessed 5 December 2022]
* [Build a simple chat bot graphical user interface using python](https://randerson112358.medium.com/build-a-simple-chat-bot-graphical-user-interface-using-python-adf7bd558fc3) [accessed 5 December 2022]
* [Natural Language Toolkit](https://www.nltk.org/book/)
* [Chatbot](https://www.kaggle.com/code/ksenia5/chatbot/notebook)

### Chatbot deployment on the web
* [Chatbot deployment with flask code](https://github.com/patrickloeber/chatbot-deployment) and [video tutorial](https://www.youtube.com/watch?v=a37BL0stIuM)
* https://www.geeksforgeeks.org/gui-chat-application-using-tkinter-in-python/
* https://www.python-engineer.com/posts/chatbot-gui-tkinter/