Human beings communicate with one another using language, which is built from an alphabet and a grammar. To communicate with a computer in the same way, we have to make it understand that alphabet and grammar. Suppose we expect the computer to answer the question "What's the weather like outside?". We first have to teach it the meaning of the words in this sentence and the context in which they are used. Once the computer understands the question, it should also be able to generate an appropriate response. This process of understanding language and generating it is known as Natural Language Processing (NLP).

NLP is mainly used for processing unstructured data. Unstructured data is data that comes from sources such as text messages, tweets, blogs and images. Structured data, by contrast, is organized and fits neatly into a defined schema, for example the data in relational databases. It is commonly estimated that around 80% of data is unstructured, so we need a means of processing it to draw inferences from it. NLP deals with this unstructured data (mostly text).

NLP can be divided into two parts: Natural Language Understanding (NLU) and Natural Language Generation (NLG). We're building a chatbot using the Natural Language Toolkit (NLTK). The basic steps to process the data are:
1. Tokenization
2. Stemming and Lemmatization

Tokenization means breaking text into sentences and words.
Stemming means reducing a word to a base form (its stem) by stripping affixes. The stem need not be a proper word. For example, an aggressive stemmer may reduce give, given and giving to giv, and giv is not a word in the English dictionary.
Lemmatization has the same goal as stemming. However, it uses a vocabulary to map each word to its lemma, so the result is always a proper dictionary word (give in the example above).
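
To make the difference concrete, here is a minimal sketch in plain Python. It is a toy, not the algorithm NLTK uses: the suffix list, the `LEMMAS` mini-dictionary and the names `toy_stem` and `toy_lemmatize` are all made up for illustration.

```python
# Toy illustration only: these rules are NOT the algorithms NLTK uses.
SUFFIXES = ("ing", "en", "e")  # hypothetical suffix list, checked in order


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMAS = {"given": "give", "giving": "give"}


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return LEMMAS.get(word, word)


words = ["give", "given", "giving"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['giv', 'giv', 'giv']
print(lemmas)  # ['give', 'give', 'give']
```

The stemmer only chops characters, so it can land on a non-word such as giv. The lemmatizer needs a dictionary, but always returns a real word.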

We have to run our text through these steps to normalize the words before we weigh them. We're going to build the chatbot using the NLTK and scikit-learn packages. Let's first import the basic packages.

In [1]:
import nltk
import numpy
import random
import string

nltk is the Natural Language Toolkit
numpy is used for fast array operations
We'll be using random to make a random choice, and string is imported because we'll be using its punctuation constant

In [2]:
# Read the whole corpus in lower case; the with-block closes the file for us
with open('chatbot.txt', 'r', errors='ignore') as chatbots_file:
    content = chatbots_file.read().lower()

# Comment these out after the first download
nltk.download('punkt')
nltk.download('wordnet')

# Split the corpus into sentences and into words
sentence_tokens = nltk.sent_tokenize(content)
word_tokens = nltk.word_tokenize(content)

[nltk_data] Downloading package punkt to /home/gaurav/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/gaurav/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


To get started, we're using text about chatbots from Wikipedia, dumped directly into the chatbot.txt file. We read that file using the read method and store its entire content, in lower case, in the content variable. NLTK gives access to many downloadable corpora and models (for example, a corpus of movie reviews). We're using its punkt and wordnet resources. punkt is a pre-trained tokenizer for English. WordNet is a lexical database for the English language created at Princeton. It can be used to find the meanings, synonyms and antonyms of words.

The sent_tokenize method splits our content into sentences and the word_tokenize method splits it into words. This may sound simple if we assume that the sentence tokenizer just splits the text at every period. But that is not what it does. It is smart enough to handle the period in "Mr. John is a wise man. He wakes up early in the morning." It doesn't treat the dot after Mr as the end of a sentence.
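
For contrast, here is a minimal sketch of the naive rule that splits at every period followed by whitespace. It uses only Python's `re` module, and the variable names are illustrative. It shows the mistake that a trained tokenizer such as punkt is meant to avoid.

```python
import re

text = "Mr. John is a wise man. He wakes up early in the morning."

# Naive rule: a sentence ends wherever a period is followed by whitespace.
naive_sentences = re.split(r"(?<=\.)\s+", text)
print(naive_sentences)
# ['Mr.', 'John is a wise man.', 'He wakes up early in the morning.']
```

The naive rule produces three pieces because it cannot tell the abbreviation "Mr." from the end of a sentence.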

In [3]:
lemmer = nltk.stem.WordNetLemmatizer()

We're using a WordNetLemmatizer for finding the lemma of words used in our content.

In [4]:
def lem_tokens (tokens):
    return [lemmer.lemmatize(token) for token in tokens]

In [5]:
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

The above statement is used for removing the unnecessary noise created by punctuation marks. It builds a dictionary that maps the character code of every punctuation mark to None, so that `translate` deletes those characters from a string. Let's see what we get when we print string.punctuation

In [6]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [7]:
def lem_normalize (text):
    return lem_tokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
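
As a quick check of the punctuation step on its own, the sketch below applies the same lower-casing and `translate` call to a sample sentence. The sample text is made up, and the tokenizing and lemmatizing steps are left out so that it runs without any NLTK data.

```python
import string

# Same table as above: every punctuation character maps to None (= delete).
remove_punct_dict = {ord(punct): None for punct in string.punctuation}

sample = "Hello, World! What's up?"
cleaned = sample.lower().translate(remove_punct_dict)
print(cleaned)  # hello world whats up
```

Note that the apostrophe is removed as well, so "what's" becomes "whats".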

The `greeting` method below returns a random response from the `GREETING_RESPONSES` list if any word of the input is one of the `GREETING_INPUTS`

In [8]:
GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up","hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def greeting (sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
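
The short sketch below repeats the definitions so that it runs on its own, then tries a few inputs. It also exposes two limitations of matching word by word: the two-word entry "what's up" can never match a single word, and a greeting with punctuation attached, such as "hi!", is not recognized.

```python
import random

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello",
                      "I am glad! You are talking to me"]


def greeting(sentence):
    # Compare each whitespace-separated word with the known greetings.
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    # Falls through and returns None when no word matches.


print(greeting("hi there") in GREETING_RESPONSES)  # True
print(greeting("what's up"))                       # None
print(greeting("hi!"))                             # None
```

Stripping punctuation before the check would be one way to handle the "hi!" case.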

We're using the scikit-learn library to generate the TF-IDF values. TF-IDF stands for Term Frequency-Inverse Document Frequency.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Skip the `response` method below for now. We'll get into its details shortly.

In [10]:
def response (user_response):
    robo_response = ''
    sentence_tokens.append(user_response)
    TfidVec = TfidfVectorizer(tokenizer = lem_normalize, stop_words = 'english')
    tfidf = TfidVec.fit_transform(sentence_tokens)
    
    values = cosine_similarity(tfidf[-1], tfidf)
    
    idx = values.argsort()[0][-2]
    flat = values.flatten()
    flat.sort()
    
    req_tfidf = flat[-2]
    
    if (req_tfidf == 0):
        robo_response = robo_response + 'I\'m sorry! I don\'t understand you.'
    else:
        robo_response = robo_response + sentence_tokens[idx]
        
    return robo_response

We have to implement a mechanism that takes input from the user, processes it and sends back a relevant response. This is done by the conversation loop in cell In [11] further down. `input()` renders an input field, and whatever the user types in that field is stored in the variable `user_response`. We convert the user's input to lower case to match our content (remember, we converted the entire chatbots content to lower case). The conversation between bot and user ends when the user types `bye`. There is also an if condition that checks whether the input is `thanks` or `thank you`; in that case the bot says `You are welcome..` and terminates the loop. If the input contains any of our `GREETING_INPUTS`, we send back a random value from the `GREETING_RESPONSES` list. This logic is implemented in the `greeting` method above. Now, let's say the user has asked the bot a question. We call the `response` method from above and render its output as the answer to the user's question.

Now, it's time to understand what our `response` method actually does.

We append the user's question (stored in the `user_response` variable) to `sentence_tokens` (`sentence_tokens` contains the sentence tokens from our chatbots content). As stated above, TF-IDF stands for Term Frequency-Inverse Document Frequency. Before getting into its details, we first have to understand how the bot responds to the user's query.

The single source of truth for the bot is the `chatbot.txt` file. That's all the bot knows. It searches that document to answer our questions. When we ask `What is a turing machine?`, it looks for the sentences that mention turing machine and returns the most relevant one. In this process, it has to weigh the words appropriately. For this question, searching for `what` and `is` wouldn't make sense. It should be searching for `turing machine`. That's the task we have to handle: weighing the words so that uninformative ones count for little.

If we were to do this with the simple logic of counting how often each word occurs (this is called Bag of Words), that wouldn't be apt, because very common words would dominate. To weigh our words appropriately, we're using TF-IDF.

TF (Term Frequency) = Number of times a term appears in a document / Total number of terms in the document
IDF (Inverse Document Frequency) = log (Total number of documents / Number of documents with that term in it)

TF-IDF = TF * IDF
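
The formulas above can be worked through by hand on a tiny made-up corpus. The sketch below uses plain Python. The three "documents" and the helper names `tf`, `idf` and `tf_idf` are assumptions for illustration. It also assumes the term occurs in at least one document, otherwise the division inside `idf` would fail.

```python
import math

# Hypothetical corpus: each document is already a list of tokens.
documents = [
    ["turing", "machine", "test"],
    ["chatbot", "test"],
    ["chatbot", "chatbot", "program", "test"],
]


def tf(term, doc):
    """Occurrences of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


common = tf_idf("test", documents[0], documents)
rare = tf_idf("turing", documents[0], documents)
print(common)  # 0.0 because "test" occurs in every document
print(rare)    # (1/3) * log(3), roughly 0.366
```

A word that appears in every document gets a weight of 0, while a word found in only one document stands out. scikit-learn's `TfidfVectorizer` uses a smoothed variant of IDF by default and normalizes each row, so its numbers will differ from this textbook version.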

We're using `TfidfVectorizer` to convert our document to a matrix of TF-IDF features. 

`TfidVec = TfidfVectorizer(tokenizer = lem_normalize, stop_words = 'english')` By supplying `lem_normalize` as the `tokenizer`, we're overriding the vectorizer's default way of tokenizing text with our own function, which also strips punctuation and lemmatizes the tokens. As mentioned above, there are some words in English like `what` and `is` that are of low importance and should be ignored while generating a response. `stop_words = 'english'` selects scikit-learn's built-in list of such English words. These words are dropped from the vocabulary, so they contribute nothing to the resulting TF-IDF matrix. Now that we have initialized our `TfidVec`, we provide it with our document.
`tfidf = TfidVec.fit_transform(sentence_tokens)` returns a TF-IDF matrix with one row per sentence.
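
The effect of stop-word removal can be sketched without scikit-learn. The stop list below is a tiny hand-picked assumption (scikit-learn's built-in English list is much longer), and `content_words` is a hypothetical helper name.

```python
# Hypothetical, hand-picked stop list for illustration only.
STOP_WORDS = {"what", "is", "a", "the", "of"}


def content_words(sentence):
    """Lower-case, split on whitespace and drop the stop words."""
    return [word for word in sentence.lower().split() if word not in STOP_WORDS]


kept = content_words("What is a turing machine")
print(kept)  # ['turing', 'machine']
```

Only the informative words survive, and those are the words we want the bot to match against its corpus.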

Next, `values = cosine_similarity(tfidf[-1], tfidf)`. `cosine_similarity` takes TF-IDF vectors and computes the similarity between them. The two arguments provided to `cosine_similarity` here are `tfidf[-1]`, the last row of our tfidf matrix (remember, we appended the user's question to our sentence_tokens, so this row is the question), and `tfidf`, the entire matrix. The result therefore holds the similarity of the question to every sentence, including the question itself. `cosine_similarity` measures the cosine of the angle between two vectors. It takes into consideration the orientation of the vectors and not their magnitude. An angle of 0 degrees means the vectors point in the same direction; the cosine of 0 degrees is 1, which suggests that the vectors are similar. Because TF-IDF values are never negative, the cosine here always lies between 0 and 1. Two vectors that are 90 degrees apart have a cosine of 0, which suggests they have nothing in common.
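
To see these numbers come out, here is a minimal plain-Python sketch of the cosine formula (dot product divided by the product of the vector lengths). The function name `cosine` and the sample vectors are made up. Returning 0.0 for an all-zero vector is a convention chosen here to avoid dividing by zero.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an empty vector as sharing nothing
    return dot / (norm_u * norm_v)


same_direction = cosine([1, 2, 0], [2, 4, 0])   # second vector is twice the first
nothing_shared = cosine([1, 0, 0], [0, 3, 0])   # no overlapping terms
print(same_direction)  # approximately 1.0
print(nothing_shared)  # 0.0
```

The first pair differs only in length, not direction, so the similarity is (up to rounding) 1. The second pair shares no non-zero position, so the similarity is 0.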

`argsort` returns the indices that would sort the array. So, for an array like [2, 1, 4], argsort gives [1, 0, 2]. The first entry, 1, is the index of the smallest element (1) in our original array, and 0 is the index of the next one (2). `flatten` converts the similarity matrix into a 1-D array, which we then sort by similarity score. The last value in the sorted array belongs to the input that the user has entered, because it matches itself perfectly. The second-to-last value is therefore the highest similarity to any sentence from our content, and `idx` is the index of that sentence. If that value is 0, we straight away return `I'm sorry! I don't understand you.` If it is greater than 0, we return the sentence at that index.
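
The selection step can be tried in isolation with NumPy. In the sketch below, the row of similarity scores is invented for illustration; its last entry stands for the user's question compared with itself.

```python
import numpy as np

scores = np.array([2, 1, 4])
order = np.argsort(scores)
print(order)  # [1 0 2]

# Hypothetical similarity row with the same 2-D shape as cosine_similarity's
# output: three corpus sentences, then the query compared with itself.
values = np.array([[0.10, 0.65, 0.00, 1.00]])

idx = values.argsort()[0][-2]                # index of the second-highest score
best_score = np.sort(values.flatten())[-2]   # the second-highest score itself
print(idx, best_score)  # 1 0.65
```

If a corpus sentence were identical to the query, two scores would tie at 1, and which index ends up last is not something this sketch guarantees.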


In [11]:
flag = True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")

while flag:
    user_response = input().lower()

    if user_response != 'bye':
        if user_response in ('thanks', 'thank you'):
            flag = False
            print("ROBO: You are welcome..")
        else:
            # Call greeting() once so the check and the reply use the same result
            greeting_response = greeting(user_response)
            if greeting_response is not None:
                print("ROBO: " + greeting_response)
            else:
                print("ROBO: ")
                print(response(user_response))
                # response() appended the query to sentence_tokens; undo that
                sentence_tokens.remove(user_response)
    else:
        flag = False
        print("ROBO: Bye! take care..")

ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!
hi
ROBO: hi
what is turing machine
ROBO: 


background
in 1950, alan turing's famous article "computing machinery and intelligence" was published, which proposed what is now called the turing test as a criterion of intelligence.
bye
ROBO: Bye! take care..
