Data Science Fundamentals: Python |
[Table of Contents](../../index.ipynb)
- - - 
<!--NAVIGATION-->
Real World Examples: [Web Scraping](../web_scraping/01_rw_web_scraping.ipynb) | [Automation](../automation/02_rw_automation.ipynb) | [Messaging](../messaging/03_rw_messaging.ipynb) | [CSV](../csv/04_rw_csv.ipynb) | [Games](../games/05_games.ipynb) | [Mobile](../mobile/06_mobile.ipynb) | [Computer Vision](../computer_vision/08_computer_vision.ipynb) | **[Chatbot](../chatbot/10_chatbot.ipynb)** | [Built-In Database](../database/11_database.ipynb) 

# [Building a Simple Chatbot from Scratch in Python (using NLTK)](https://github.com/parulnith/Building-a-Simple-Chatbot-in-Python-using-NLTK)

![Alt text](https://cdn-images-1.medium.com/max/800/1*pPcVfZ7i-gLMabUol3zezA.gif)

The history of chatbots dates back to 1966, when Joseph Weizenbaum created a computer program called ELIZA. It imitated the language of a psychotherapist using only 200 lines of code. You can still converse with it here: [Eliza](http://psych.fullerton.edu/mbirnbaum/psych101/Eliza.htm?utm_source=ubisend.com&utm_medium=blog-link&utm_campaign=ubisend). 

Along similar lines, let's create a very basic chatbot using Python's NLTK library. It's a very simple bot with hardly any cognitive skills, but it is still a good way to get into NLP and get to know chatbots.

For a detailed analysis, please see the accompanying blog post titled: **[Building a Simple Chatbot in Python (using NLTK)](https://medium.com/analytics-vidhya/building-a-simple-chatbot-in-python-using-nltk-7c8c8215ac6e)**


## NLP
Natural Language Processing (NLP) is a way for computers to analyze, understand, and derive meaning from human language. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

## Import necessary libraries

In [None]:
pip install scikit-learn


In [None]:
import io
import random
import string  # to process standard python strings
import warnings

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

warnings.filterwarnings('ignore')

## Downloading and installing NLTK
NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.

[Natural Language Processing with Python](http://www.nltk.org/book/) provides a practical introduction to programming for language processing.

For platform-specific instructions, read [here](https://www.nltk.org/install.html)



In [1]:
pip install nltk



### Installing NLTK Packages




In [2]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('popular', quiet=True) # for downloading packages
#nltk.download('punkt') # first-time use only
#nltk.download('wordnet') # first-time use only

True

# How Does A Chatbot Work

![image](images/chatbot.png)

## What is a Corpus?

In linguistics, a corpus (plural: corpora), or text corpus, is a large and structured collection of written texts. NLTK ships with several corpora, such as the Gutenberg Corpus, Web and Chat Text, and so on.

In this example, you are going to use the Gutenberg Corpus. To import it, create a new file and type:

In [3]:
from nltk.corpus import gutenberg as gt

This corpus consists of several .txt files, each containing a different text. To see all the texts in the corpus, run:

In [4]:
print(gt.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


So you can see that this corpus has texts such as Hamlet, Macbeth, and Milton's epic poem Paradise Lost.

Let’s say that you want to access the file shakespeare-macbeth.txt and see what words the text has. To do this, you can use the words method. So in your code type:

In [5]:
shakespeare_macbeth = gt.words("shakespeare-macbeth.txt")
print(shakespeare_macbeth)

['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', ...]


Now let’s say that you want to see the sentences in the text. You can use the sents method. So in your code type:

In [6]:
sents = gt.sents("shakespeare-macbeth.txt")
print(sents)

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]


In [7]:
raw = gt.raw("shakespeare-macbeth.txt")
print(raw)

[The Tragedie of Macbeth by William Shakespeare 1603]


Actus Primus. Scoena Prima.

Thunder and Lightning. Enter three Witches.

  1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
  2. When the Hurley-burley's done,
When the Battaile's lost, and wonne

   3. That will be ere the set of Sunne

   1. Where the place?
  2. Vpon the Heath

   3. There to meet with Macbeth

   1. I come, Gray-Malkin

   All. Padock calls anon: faire is foule, and foule is faire,
Houer through the fogge and filthie ayre.

Exeunt.


Scena Secunda.

Alarum within. Enter King Malcome, Donalbaine, Lenox, with
attendants,
meeting a bleeding Captaine.

  King. What bloody man is that? he can report,
As seemeth by his plight, of the Reuolt
The newest state

   Mal. This is the Serieant,
Who like a good and hardie Souldier fought
'Gainst my Captiuitie: Haile braue friend;
Say to the King, the knowledge of the Broyle,
As thou didst leaue it

   Cap. Doubtfull it stood,
As two spent Swimmers, t

You can use these methods to do more elaborate things. For example, to see the number of words and sentences in every text of the corpus, run:

In [8]:
for fileid in gt.fileids():
    num_words = len(gt.words(fileid))
    num_sents = len(gt.sents(fileid))
    print("Data for file:", fileid)
    print("Number of words:", num_words)
    print("Number of sentences:", num_sents, end="\n\n\n")

Data for file: austen-emma.txt
Number of words: 192427
Number of sentences: 7752


Data for file: austen-persuasion.txt
Number of words: 98171
Number of sentences: 3747


Data for file: austen-sense.txt
Number of words: 141576
Number of sentences: 4999


Data for file: bible-kjv.txt
Number of words: 1010654
Number of sentences: 30103


Data for file: blake-poems.txt
Number of words: 8354
Number of sentences: 438


Data for file: bryant-stories.txt
Number of words: 55563
Number of sentences: 2863


Data for file: burgess-busterbrown.txt
Number of words: 18963
Number of sentences: 1054


Data for file: carroll-alice.txt
Number of words: 34110
Number of sentences: 1703


Data for file: chesterton-ball.txt
Number of words: 96996
Number of sentences: 4779


Data for file: chesterton-brown.txt
Number of words: 86063
Number of sentences: 3806


Data for file: chesterton-thursday.txt
Number of words: 69213
Number of sentences: 3742


Data for file: edgeworth-parents.txt
Number of words: 210663

## Loading In Your Own Corpus

To do this, you need a corpus reader, so create a new file named loading-your-own-corpus.py with the following lines.

In [4]:
from nltk.corpus import PlaintextCorpusReader
import os

**[List Of All Shakespeare Plays To Download](http://www.textfiles.com/etext/AUTHORS/SHAKESPEARE/)**

To continue, use the play [Taming of the Shrew](http://www.textfiles.com/etext/AUTHORS/SHAKESPEARE/shakespeare-taming-2.txt) and place it in the ```data/``` directory.

After you download the play, create a PlaintextCorpusReader object with the following lines:

In [5]:
corpus_root = os.path.join(os.getcwd(), "data")
file_ids = r".*\.txt"
corpus = PlaintextCorpusReader(corpus_root, file_ids)

As you can see, PlaintextCorpusReader expects two inputs in its constructor. The first one is corpus_root and the second one is file_ids. The corpus_root is the path of the directory that holds your files, and file_ids describes the names of the files to load.

To build the path, you can use the getcwd function of the os module and join the data directory onto it. For file_ids, we use a regular expression that selects the files you want. In our example, we want all files that have the .txt extension; the dot is escaped so that it matches a literal "." rather than any character.
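
The effect of escaping the dot can be checked with Python's re module. The sketch below is only an illustration: it assumes the pattern has to match the whole file name, and the file names in the list are made up.

```python
import re

# Hypothetical file names, for illustration only
names = ["shakespeare-taming-2.txt", "notes.md", "backup_txt"]

loose = re.compile(r".*.txt")    # "." before "txt" matches any character
strict = re.compile(r".*\.txt")  # "\." matches only a literal dot

loose_hits = [n for n in names if loose.fullmatch(n)]
strict_hits = [n for n in names if strict.fullmatch(n)]

print(loose_hits)   # ['shakespeare-taming-2.txt', 'backup_txt']
print(strict_hits)  # ['shakespeare-taming-2.txt']
```

The unescaped pattern also accepts backup_txt, which is usually not what you want when selecting text files.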

Because this gives you a corpus object, you can use the same methods you used in the previous section. So if you want to see the words in the text, for example, you can use:

In [17]:
print(corpus.words("shakespeare-taming-2.txt"))

['THE', 'TAMING', 'OF', 'THE', 'SHREW', 'DRAMATIS', ...]


## Reading in the corpus

For our example, we will be using the Wikipedia page for chatbots as our corpus. Copy the contents of the page into a text file named ‘chatbot.txt’ and place it in the data/ directory. However, you can use any corpus of your choice.

In [6]:
# errors='ignore' skips any bytes that cannot be decoded
with open('data/chatbot.txt', 'r', errors='ignore') as f:
    raw = f.read()
raw = raw.lower()  # converts to lowercase


The main issue with text data is that it is all in text format (strings). However, machine learning algorithms need some sort of numerical feature vector in order to perform their task. So before we start any NLP project, we need to pre-process the text to make it suitable for these algorithms. Basic text pre-processing includes:

* Converting the entire text into **uppercase** or **lowercase**, so that the algorithm does not treat the same words in different cases as different.

* **Tokenization**: Tokenization is the process of converting normal text strings into a list of tokens, i.e. the words that we actually want. A sentence tokenizer can be used to find the list of sentences, and a word tokenizer can be used to find the list of words in strings.

_The NLTK data package includes a pre-trained Punkt tokenizer for English._

* Removing **Noise**, i.e. everything that isn’t a standard number or letter.
* Removing the **Stop words**. Some extremely common words that appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words.
* **Stemming**: Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. For example, if we were to stem the words “Stems”, “Stemming”, “Stemmed”, and “Stemtization”, the result would be a single word, “stem”.
* **Lemmatization**: A slight variant of stemming is lemmatization. The major difference between the two is that stemming can often create non-existent words, whereas lemmas are actual words. So a stem, the word you end up with after stemming, is not necessarily something you can look up in a dictionary, but you can look up a lemma. For example, “run” is the base form for words like “running” or “ran”, and the words “better” and “good” share the same lemma, so they are considered the same.
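
The first few steps in the list above can be sketched with the standard library alone. This is a simplified illustration, not the pipeline used later in this notebook: the stop-word list is a tiny made-up sample, tokenization is a plain whitespace split, and stemming and lemmatization are left out because they need NLTK.

```python
import re

# A tiny, hypothetical stop-word list; real lists (such as NLTK's) are much longer
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and"}

def preprocess(text):
    """Lowercase, strip noise, tokenize on whitespace, and drop stop words."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("The chatbot is answering questions, and learning!"))
# ['chatbot', 'answering', 'questions', 'learning']
```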



## Tokenisation

In [7]:
sent_tokens = nltk.sent_tokenize(raw)# converts to list of sentences 
word_tokens = nltk.word_tokenize(raw)# converts to list of words

## Preprocessing

We shall now define a function called LemTokens which will take as input the tokens and return normalized tokens.

In [8]:
import string  # needed for string.punctuation

lemmer = nltk.stem.WordNetLemmatizer()
# WordNet is a semantically-oriented dictionary of English included in NLTK.
def LemTokens(tokens):
    """Lemmatize every token in the list."""
    return [lemmer.lemmatize(token) for token in tokens]

# Translation table that deletes every punctuation character
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    """Lowercase the text, strip punctuation, tokenize, and lemmatize."""
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
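
To see what the remove_punct_dict table does on its own, here is a small sketch that needs only the standard library. The sample sentence and the variable names are made up for illustration.

```python
import string

# Map every punctuation code point to None, which tells str.translate to delete it
remove_punct = {ord(ch): None for ch in string.punctuation}

cleaned = "Hello, world! What's up?".lower().translate(remove_punct)
print(cleaned)  # hello world whats up
```

Note that the apostrophe is removed too, so "what's" becomes "whats" before tokenization.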

## Keyword matching

Next, we shall define a function for a greeting by the bot, i.e. if a user’s input is a greeting, the bot shall return a greeting response. ELIZA uses simple keyword matching for greetings. We will utilize the same concept here.

In [9]:
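import random  # needed for random.choice

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def greeting(sentence):
    """Return a random greeting if any word of the sentence is a known greeting."""
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None  # no greeting keyword found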
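
Because greeting compares whole whitespace-separated words, an input such as "Hello!" does not match, and a two-word entry like "what's up" can never equal a single word. One possible refinement, sketched below under the hypothetical name greeting_normalized, strips surrounding punctuation from each word before the lookup. The shortened keyword and response lists are for illustration only.

```python
import random
import string

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "hey")
GREETING_RESPONSES = ["hi", "hey", "hi there", "hello"]

def greeting_normalized(sentence):
    """Hypothetical variant of greeting() that ignores case and punctuation."""
    for word in sentence.lower().split():
        if word.strip(string.punctuation) in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None  # no greeting keyword found

print(greeting_normalized("Hello!") in GREETING_RESPONSES)  # True
print(greeting_normalized("Tell me about chatbots"))        # None
```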

## Generating Response

### Bag of Words
After the initial preprocessing phase, we need to transform text into a meaningful vector (or array) of numbers. The bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

* A vocabulary of known words.

* A measure of the presence of known words.

Why is it called a “bag” of words? That is because any information about the order or structure of words in the document is discarded and the model is only **concerned with whether the known words occur in the document, not where they occur in the document.**

The intuition behind the Bag of Words is that documents are similar if they have similar content. Also, we can learn something about the meaning of the document from its content alone.

For example, if our dictionary contains the words {Learning, is, the, not, great}, and we want to vectorize the text “Learning is great”, we would have the following vector: (1, 1, 0, 0, 1).
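
The example above can be reproduced in a few lines of plain Python. This is a minimal sketch of a binary bag-of-words vector; the function name bag_of_words is chosen here for illustration and is not part of NLTK or scikit-learn.

```python
def bag_of_words(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in text."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

vocabulary = ["learning", "is", "the", "not", "great"]
print(bag_of_words("Learning is great", vocabulary))  # [1, 1, 0, 0, 1]
```

Word order plays no role: "great is Learning" produces the same vector.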


### TF-IDF Approach
A problem with the Bag of Words approach is that highly frequent words start to dominate in the document (they receive larger scores), even though they may not carry as much “informational content”. It also gives more weight to longer documents than to shorter ones.

One approach is to rescale the frequency of words by how often they appear in all documents so that the scores for frequent words like “the” that are also frequent across all documents are penalized. This approach to scoring is called Term Frequency-Inverse Document Frequency, or TF-IDF for short, where:

**Term Frequency: is a scoring of the frequency of the word in the current document.**

```
TF = (Number of times term t appears in a document)/(Number of terms in the document)
```

**Inverse Document Frequency: is a scoring of how rare the word is across documents.**

```
IDF = 1 + log(N/n), where N is the number of documents and n is the number of documents in which term t appears.
```
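
The two formulas above can be tried out by hand on a toy example. The sketch below uses three made-up, already tokenized documents and the natural logarithm; the helper names tf, idf and tf_idf are illustrative only.

```python
import math

# Three tiny, made-up documents, already tokenized
docs = [
    ["the", "bot", "answers", "questions"],
    ["the", "bot", "greets", "the", "user"],
    ["the", "user", "asks", "questions"],
]

def tf(term, doc):
    """Term frequency: occurrences of term divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency, 1 + log(N/n)."""
    n = sum(1 for doc in docs if term in doc)  # documents containing the term
    return 1 + math.log(len(docs) / n)  # assumes the term occurs in at least one document

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(round(tf_idf("the", docs[0], docs), 3))      # 0.25
print(round(tf_idf("answers", docs[0], docs), 3))  # 0.525
```

Both words occur once in the first document, but "answers" scores higher because it appears in only one of the three documents. scikit-learn's TfidfVectorizer applies a smoothed IDF and normalizes each row by default, so its numbers will differ from this hand calculation.
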
### Cosine Similarity

Tf-idf weight is a weight often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus. Once every document is represented as a vector of TF-IDF weights, the similarity between two documents can be measured by the cosine of the angle between their vectors:

```
Cosine Similarity (d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)
```
where d1 and d2 are two non-zero vectors.
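
As a minimal sketch, the formula can be written in plain Python as follows. The function name cosine_sim is chosen for illustration; the notebook itself relies on scikit-learn's cosine_similarity.

```python
import math

def cosine_sim(d1, d2):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

print(round(cosine_sim([1, 1, 0], [2, 2, 0]), 6))  # 1.0 (same direction)
print(cosine_sim([1, 0, 0], [0, 1, 0]))            # 0.0 (no shared terms)
```

Vectors that point in the same direction score 1 regardless of their length, which is why a long and a short document about the same topic can still be judged similar.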



To generate a response from our bot for input questions, the concept of document similarity will be used. We define a function response which compares the user’s utterance with every sentence in the corpus and returns the most similar sentence. If no sentence shares any terms with the input (the best similarity is zero), it returns the response: “I am sorry! I don’t understand you”

In [10]:
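def response(user_response):
    robo_response = ''
    # Temporarily add the user's input so it is vectorized together with the corpus
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # Similarity of the user's input (the last row) to every sentence
    vals = cosine_similarity(tfidf[-1], tfidf)
    # The highest score is the input compared with itself, so take the second highest
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0:
        # No sentence shares any term with the input
        robo_response = robo_response + "I am sorry! I don't understand you"
        return robo_response
    else:
        robo_response = robo_response + sent_tokens[idx]
        return robo_response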
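
The reason response picks the element at position -2 rather than -1 can be shown with a handful of made-up similarity scores. The sketch below uses plain Python lists in place of the NumPy array, and the variable names are illustrative.

```python
# Made-up similarity scores between the user's input and four sentences;
# the last entry is the input compared with itself
similarities = [0.12, 0.58, 0.0, 1.0]

# Indices ordered from least to most similar (the ordering NumPy's argsort gives)
order = sorted(range(len(similarities)), key=lambda i: similarities[i])

best_idx = order[-2]                 # skip order[-1], which is the input itself
best_score = similarities[best_idx]

print(best_idx, best_score)  # 1 0.58
```

If best_score were 0, no corpus sentence would share any term with the input, which is the case in which the bot apologizes.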

Finally, we will feed the lines that we want our bot to say while starting and ending a conversation, depending on the user’s input.

In [11]:
flag = True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")
while flag:
    user_response = input().lower()
    if user_response != 'bye':
        if user_response in ('thanks', 'thank you'):
            flag = False
            print("ROBO: You are welcome..")
        else:
            # Call greeting() once so the check and the printed reply agree
            greeting_reply = greeting(user_response)
            if greeting_reply is not None:
                print("ROBO: " + greeting_reply)
            else:
                print("ROBO: ", end="")
                print(response(user_response))
                # response() appended the input to sent_tokens; undo that here
                sent_tokens.remove(user_response)
    else:
        flag = False
        print("ROBO: Bye! take care..")

ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!
hello


