Human beings communicate with one another using language, which is built from an alphabet and a grammar. To communicate with a computer in the same way, we have to make it understand that alphabet and grammar. Suppose we expect the computer to answer the question "What's the weather like outside?". We first have to teach it the meaning of the words in this sentence and the context in which they are used. Once the computer understands the question, it should also be able to generate an appropriate response. This process of understanding language and generating it is known as Natural Language Processing (NLP).

NLP is mainly used for processing unstructured data, that is, data generated from sources such as text messages, tweets, blogs and images. Structured data, by contrast, is organized and fits neatly into a defined schema, for example data in relational databases. It is often estimated that around 80% of data is unstructured, and we need a way to process this data in order to draw inferences from it. NLP deals with this unstructured data, mostly text.

NLP has two main parts: Natural Language Understanding (NLU) and Natural Language Generation (NLG). We're building a chatbot using the Natural Language Toolkit (NLTK). The basic steps for preprocessing the text are:
1. Tokenization
2. Stemming and Lemmatization

Tokenization means breaking text into sentences and words.
Stemming means reducing words to a base form (the stem), which may not be a proper word. For example, a stemmer may reduce give, given and giving to giv, and giv is not a word in the English dictionary.
Lemmatization has the same goal as stemming. However, lemmatization results in a proper dictionary word (the lemma).
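
To make the difference between the two concrete, here is a minimal toy sketch. It is not how NLTK's stemmers or WordNet work internally: the suffix list `SUFFIXES` and the lookup table `LEMMAS` are hypothetical stand-ins chosen only for this example.

```python
# Toy illustration only: real stemmers and WordNet use far richer rules and data.
SUFFIXES = ("ing", "en", "e")  # hypothetical suffix list for this example


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a dictionary such as WordNet.
LEMMAS = {"give": "give", "given": "give", "giving": "give"}


def toy_lemmatize(word):
    """Return the dictionary form if we know it, else the word unchanged."""
    return LEMMAS.get(word, word)


words = ["give", "given", "giving"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['giv', 'giv', 'giv']
print(lemmas)  # ['give', 'give', 'give']
```

The stemmer only chops characters, so it can produce a non-word, while the lemmatizer needs a dictionary to return a proper word.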

We have to run our text through these steps before we can weigh the words used in each sentence. We're going to build the chatbot using the NLTK and scikit-learn packages. Let's first import the basic packages.

In [None]:
import nltk
import numpy
import random
import string

nltk is the Natural Language Toolkit.
numpy is used for fast array operations.
We'll use random to pick a random response, and string is imported because we'll use its punctuation constant.

In [None]:
# Read the whole corpus and lower-case it; the with block closes the file for us
with open('chatbot.txt', 'r', errors='ignore') as chatbots_file:
    content = chatbots_file.read().lower()

# Comment these out after the first download
# (newer NLTK releases may also ask for the 'punkt_tab' resource)
nltk.download('punkt')
nltk.download('wordnet')

sentence_tokens = nltk.sent_tokenize(content)
word_tokens = nltk.word_tokenize(content)

To get started, we're using chatbot data from Wikipedia that was dumped directly into the chatbot.txt file. We read that file using the read method and store its entire content, in lower case, in the content variable. NLTK gives access to many corpora and models (for example, a corpus of movie reviews). We're using its punkt and wordnet resources. punkt is a pre-trained tokenizer for English. WordNet is a lexical database for the English language created at Princeton University. It can be used to find the meanings of words, synonyms and antonyms.

The sent_tokenize method splits our content into sentences and the word_tokenize method splits it into words. This may sound simple if we assume that the sentence tokenizer just splits the text at every period, but that is not how it works. It is smart enough to handle the period in "Mr. John is a wise man. He wakes up early in the morning." It doesn't treat the dot after Mr as the end of a sentence.
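
The following sketch shows why splitting at every period is not enough. It uses only the standard library; the `ABBREVIATIONS` set and the `split_sentences` helper are hypothetical and hand-written for this example, whereas punkt is a trained model rather than a fixed list like this.

```python
text = "Mr. John is a wise man. He wakes up early in the morning."

# Naive approach: treat every period followed by a space as a sentence boundary.
naive = [part.strip() for part in text.split(". ") if part.strip()]
print(naive)  # "Mr" is wrongly split off as its own sentence

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"mr", "mrs", "dr"}


def split_sentences(text):
    """Split at words ending in a period unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word.endswith(".") and word[:-1].lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks a final period
        sentences.append(" ".join(current))
    return sentences


better = split_sentences(text)
print(better)
```

The naive split yields three pieces, while the abbreviation-aware version yields the two sentences we expect.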

In [None]:
lemmer = nltk.stem.WordNetLemmatizer()

We're using a WordNetLemmatizer for finding the lemma of words used in our content.

In [None]:
def lem_tokens(tokens):
    # Lemmatize every token in the list
    return [lemmer.lemmatize(token) for token in tokens]

In [None]:
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

The line above builds a translation table that removes the noise created by punctuation marks. It maps every punctuation character to None, so that the translate method deletes those characters from the text. Let's see what we get when we print string.punctuation

In [None]:
print(string.punctuation)

In [None]:
def lem_normalize(text):
    # Lower-case, strip punctuation, tokenize into words, then lemmatize
    return lem_tokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
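
As a small standalone illustration of the punctuation step, the sketch below builds the same kind of table and applies it with `str.translate`. The names `table` and `sample` are made up for this example.

```python
import string

# Same idea as remove_punct_dict: mapping a code point to None means "delete it".
table = {ord(ch): None for ch in string.punctuation}

sample = "Hello, world! What's up?"
cleaned = sample.lower().translate(table)
print(cleaned)  # hello world whats up
```

Note that the apostrophe is also removed, so "what's" becomes "whats" before tokenization.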

The `greeting` function below returns a random response from the `GREETING_RESPONSES` list if any word of the input is one of the `GREETING_INPUTS`.

In [None]:
GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up","hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

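def greeting(sentence):
    # The input is compared word by word, so a multi-word entry such as
    # "what's up" can never match, and neither can "hi!" with punctuation attached.
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None  # no greeting word found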

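We're using the scikit-learn library to generate TF-IDF values. TF-IDF stands for Term Frequency-Inverse Document Frequency: a weight that grows with how often a term occurs in a document and shrinks with the number of documents that contain the term.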
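
To give a feel for the numbers, here is a minimal sketch of the textbook TF-IDF formula in plain Python. It is an assumption-laden illustration, not scikit-learn's implementation: `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its values will differ. The sample `docs` and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter

# Three tiny "documents" (hypothetical sample data), already tokenized.
docs = [
    ["chatbot", "answers", "questions"],
    ["chatbot", "uses", "language"],
    ["weather", "questions"],
]


def tf_idf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf(docs)
# "answers" occurs in one document, "chatbot" in two, so "answers" weighs more.
print(weights[0])
```

A term that occurs in every document would get a weight of zero under this formula, which is why very common words contribute little.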

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Skip the `response` function below for now. We'll get into its details shortly.

In [None]:
def response(user_response):
    robo_response = ''
    # Temporarily add the user's input so it is vectorized together with the corpus
    sentence_tokens.append(user_response)
    tfidf_vec = TfidfVectorizer(tokenizer=lem_normalize, stop_words='english')
    tfidf = tfidf_vec.fit_transform(sentence_tokens)
    # Similarity of the last row (the user's input) to every sentence
    values = cosine_similarity(tfidf[-1], tfidf)
    print('values in cosine_similarity', values)
    # The highest score is the input compared with itself, so take the second highest
    idx = values.argsort()[0][-2]
    print('argsort', values.argsort())
    flat = values.flatten()
    flat.sort()
    print('after flattening', flat)
    req_tfidf = flat[-2]

    if req_tfidf == 0:
        # Nothing in the corpus shares a term with the input
        robo_response = robo_response + "I'm sorry! I don't understand you."
    else:
        robo_response = robo_response + sentence_tokens[idx]

    return robo_response
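
The ranking step can be sketched without any libraries. The example below is only an illustration of the idea, with made-up `vectors` in which the last row plays the role of the user's input, and a hand-written `cosine` helper standing in for scikit-learn's `cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vectors: three corpus sentences, then the user's input as the last row.
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],  # the user's input
]

query = vectors[-1]
scores = [cosine(query, v) for v in vectors]

# Indices sorted by ascending score, like argsort. The last index is the
# input itself, so the best real match is the second to last.
ranked = sorted(range(len(scores)), key=scores.__getitem__)
best_idx = ranked[-2]
print(best_idx, scores[best_idx])
```

If the second-highest score is zero, no corpus sentence shares a term with the input, which is the case the function handles with its apology message.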

We have to implement a mechanism that takes input from the user, processes it and sends back a relevant response. `input()` renders an input field, and whatever the user types in that field is stored in the variable `user_response`. We convert the user's input to lower case to match our content (remember, we converted the entire chatbot content to lower case). The conversation loop between bot and user ends when the user types `bye`. Note also the if condition that checks whether the input is `thanks` or `thank you`: it terminates the loop as well, after replying `You are welcome..`. If the input contains any of our `GREETING_INPUTS`, we send back a random value from the `GREETING_RESPONSES` list. This logic is implemented in the `greeting` function above. Otherwise, we assume the user has asked the bot a question: we call the `response` function from above and print its output as the answer.

Now, it's time to understand what our `response` function actually does.

We append the user's question (stored in the `user_response` variable) to `sentence_tokens`, which contains the sentence tokens from our chatbot content. As stated above, TF-IDF is Term Frequency-Inverse Document Frequency. The vectorizer turns every sentence, including the question, into a vector of TF-IDF weights, and cosine similarity tells us which sentence is closest to the question. The loop below shows how the bot uses this to respond to the user's query.

In [None]:
flag = True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")

while flag:
    user_response = input()
    user_response = user_response.lower()

    if user_response != 'bye':
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False
            print("ROBO: You are welcome..")
        else:
            # Call greeting() once so the check and the reply use the same random choice
            greeting_reply = greeting(user_response)
            if greeting_reply is not None:
                print("ROBO: " + greeting_reply)
            else:
                print("ROBO: ")
                print(response(user_response))
                # Undo the append done inside response()
                sentence_tokens.remove(user_response)
    else:
        flag = False
        print("ROBO: Bye! take care..")
