# ChatBot Using Natural Language ToolKit (NLTK)

Natural language processing (NLP) lets computers work with human language and scale language-related tasks. For example, NLP makes it possible for computers to read text, interpret it, measure sentiment, and determine which parts are important. This notebook assembles those pieces into the core engine of a conversational chatbot.

In [85]:
# Installing NLTK Components
import nltk

# nltk.download_gui()

nltk.download('punkt')
nltk.download('stopwords')
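nltk.download('wordnet')
# nltk.pos_tag (used below) also needs a tagger model, e.g.
# nltk.download('averaged_perceptron_tagger'); the resource name can differ by NLTK version.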


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\John\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\John\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\John\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [86]:
# Importing All the Required Modules
import re
import os
import json
import random

import numpy as np
import pandas as pd

from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

In [87]:
# Get multiple outputs in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

In [88]:
# Display all rows and columns of a dataframe instead of a truncated version
from IPython.display import display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Preprocessing Stage
- Convert to lowercase
- Tokenize
- Remove stopwords

The preprocessing stage converts text to lowercase, tokenizes it (breaks it into individual words or tokens), and removes stopwords. Tokenization is the process of splitting text into smaller units such as words, phrases, symbols, or other meaningful elements.
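As a rough illustration of what a `\w+` tokenizer does to punctuation, the sketch below uses only the standard-library `re` module rather than NLTK. The sample string is made up for this example. It shows why an apostrophe or hyphen splits a word into several tokens; short leftovers such as `s` and `a` are the kind of tokens the stopword filter is then expected to remove.

```python
import re

# Made-up sample text; the pattern matches runs of letters, digits and underscores
text = "Today's world, now-a-days (2010-2020)"
tokens = re.findall(r"\w+", text.lower())
print(tokens)
# ['today', 's', 'world', 'now', 'a', 'days', '2010', '2020']
```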

In [89]:
sentence1 = "The old oak tree stood majestically in the center of the lush green meadow, its branches reaching out like welcoming arms."
sentence2 = "This is particularly important in today's world where we are swamped with unstructured natural language data on the variety of social media platforms people engage in now-a-days (note -  now-a-days in the decade of 2010-2020)"

In [90]:
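# Build the stopword set once; calling stopwords.words() for every token is slow
STOP_WORDS = set(stopwords.words('english'))

def preprocess(sentence):
    """Lowercase, tokenize on word characters, and drop English stopwords."""
    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(sentence)
    filtered_words = [w for w in tokens if w not in STOP_WORDS]
    return filtered_words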

In [91]:
preprocessedSentence = preprocess(sentence2)
print(preprocessedSentence)

['particularly', 'important', 'today', 'world', 'swamped', 'unstructured', 'natural', 'language', 'data', 'variety', 'social', 'media', 'platforms', 'people', 'engage', 'days', 'note', 'days', 'decade', '2010', '2020']


## Lemmatizing or Extracting Features!
- Tagging (POS Tagging) : *Assignment of a part-of-speech (POS) tag to each word.*
- Stemming : *Process of reducing words to their root or base form by removing suffixes or prefixes.*
- Lemmatize words : *Process of reducing words to their base or dictionary form, known as the lemma. This process involves removing inflections and variations to bring words to a common base form, making them easier to analyze and compare across different contexts.*
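
To make the difference between stemming and lemmatizing concrete, here is a toy sketch in plain Python. It is not the Snowball or WordNet algorithm: the suffix list `SUFFIXES` and the lookup table `LEMMAS` are hypothetical and far smaller than anything a real library uses. The point is only that a stemmer applies string rules, while a lemmatizer looks words up.

```python
# Toy illustration only: NOT the Snowball or WordNet algorithms.
SUFFIXES = ("ing", "ed", "s")                    # hypothetical rule set
LEMMAS = {"feet": "foot", "cacti": "cactus"}     # hypothetical dictionary

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; return it unchanged if unknown."""
    return LEMMAS.get(word, word)

for w in ["walking", "walked", "walks", "feet"]:
    print(w, "->", toy_stem(w), "/", toy_lemmatize(w))
```

Rule-based stripping maps `walking`, `walked` and `walks` to the same stem but cannot relate `feet` to `foot`; the dictionary lookup can, but only for words it knows.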

### Tagging
Tagging involves assigning part-of-speech tags to each word in a given text, indicating its grammatical category and usage in the sentence.

Here is the POS tag list:

- CC: Coordinating conjunction
- CD: Cardinal number
- DT: Determiner
- EX: Existential there (like: "there is" ... think of it like "there exists")
- FW: Foreign word
- IN: Preposition/subordinating conjunction
- JJ: Adjective 'big'
- JJR: Adjective, comparative 'bigger'
- JJS: Adjective, superlative 'biggest'
- LS: List marker 1
- MD: Modal could, will
- NN: Noun, singular 'desk'
- NNS: Noun plural 'desks'
- NNP: Proper noun, singular 'Harrison'
- NNPS: Proper noun, plural 'Americans'
- PDT: Predeterminer 'all the kids'
- POS: Possessive ending parent's
- PRP: Personal pronoun I, he, she
- PRP$: Possessive pronoun my, his, her
- RB: Adverb very, silently
- RBR: Adverb, comparative better
- RBS: Adverb, superlative best
- RP: Particle give up
- TO: To go 'to' the store.
- UH: Interjection errrrrrrrm
- VB: Verb, base form take
- VBD: Verb, past tense took
- VBG: Verb, gerund/present participle taking
- VBN: Verb, past participle taken
- VBP: Verb, sing. present, non-3rd person take
- VBZ: Verb, 3rd person sing. present takes
- WDT: Wh-determiner which
- WP: Wh-pronoun who, what
- WP$: Possessive wh-pronoun whose
- WRB: Wh-adverb where, when

In [92]:
tags = nltk.pos_tag(preprocessedSentence)
print(tags)

[('particularly', 'RB'), ('important', 'JJ'), ('today', 'NN'), ('world', 'NN'), ('swamped', 'VBD'), ('unstructured', 'JJ'), ('natural', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('variety', 'NN'), ('social', 'JJ'), ('media', 'NNS'), ('platforms', 'NNS'), ('people', 'NNS'), ('engage', 'VBP'), ('days', 'NNS'), ('note', 'VBP'), ('days', 'NNS'), ('decade', 'NN'), ('2010', 'CD'), ('2020', 'CD')]


In [93]:
# Keep only the tags of interest (nouns, verbs, adverbs, adjectives, pronouns)
KEEP_TAGS = {'NN', 'NNS', 'VBN', 'VBP', 'VBZ', 'VBG', 'RB', 'PRP', 'JJ'}

def extractTags(tags):
    """Return the words whose POS tag is in KEEP_TAGS."""
    return [word for word, tag in tags if tag in KEEP_TAGS]

extractedTags = extractTags(tags)
print(extractedTags)

['particularly', 'important', 'today', 'world', 'unstructured', 'natural', 'language', 'data', 'variety', 'social', 'media', 'platforms', 'people', 'engage', 'days', 'note', 'days', 'decade']


In [94]:
# Stemming
words_for_stemming = ['stem', 'stemming', 'stemmed', 'stemmer', 'stems','feet','willing']
stemmer = SnowballStemmer("english")
[stemmer.stem(x) for x in words_for_stemming]

['stem', 'stem', 'stem', 'stemmer', 'stem', 'feet', 'will']

In [95]:
# Lemmatizer
lmtzr = WordNetLemmatizer()

words = ['cacti', 'cactus', 'stemming', 'feet', 'foot']
[lmtzr.lemmatize(word) for word in words]

['cactus', 'cactus', 'stemming', 'foot', 'foot']

In [96]:
def extractFeatures(text):
    words = preprocess(text)
    tags = nltk.pos_tag(words)
    extracted_features = extractTags(tags)
    stemmed_words = [stemmer.stem(x) for x in extracted_features]
    result = [lmtzr.lemmatize(x) for x in stemmed_words]
    return result

In [97]:
# Convert a list of features into the dictionary format NLTK classifiers expect.
# Each feature is a key-value pair: the key is the word and the value is True.
# A word that does not occur in the sentence simply has no key.
def word_feats(words):
    return {word: True for word in words}
words = extractFeatures(sentence1)
print(words)
word_feats(words)

['old', 'oak', 'tree', 'majest', 'center', 'lush', 'green', 'meadow', 'branch', 'reach', 'welcom', 'arm']


{'old': True,
 'oak': True,
 'tree': True,
 'majest': True,
 'center': True,
 'lush': True,
 'green': True,
 'meadow': True,
 'branch': True,
 'reach': True,
 'welcom': True,
 'arm': True}

In [98]:
def extract_feature_from_doc(data):
    result = []
    # Corpus - Collection of text
    corpus = []
    # The responses of the chat bot
    answers = {}
    for (text,category,answer) in data:

        features = extractFeatures(text)

        corpus.append(features)
        result.append((word_feats(features), category))
        answers[category] = answer
    combined_corpus = [word for sublist in corpus for word in sublist]
    return (result, combined_corpus, answers)

In [99]:
extract_feature_from_doc([['this is the input text from the user','category','answer to give'],])

([({'input': True, 'user': True}, 'category')],
 ['input', 'user'],
 {'category': 'answer to give'})
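
One consequence of `answers[category] = answer` is that each category ends up with a single response: later rows overwrite earlier ones. The sketch below shows that behavior on made-up rows, followed by one possible alternative that keeps every response and picks one at random. The names `rows`, `all_answers` and `pick_answer` are hypothetical and not part of the notebook.

```python
import random

# Made-up rows in the same (text, category, answer) shape used above
rows = [
    ("hi", "greeting", "Hello!"),
    ("hey", "greeting", "Hi there!"),
]

# Current behavior: the last answer seen for a category wins
answers = {}
for _text, category, answer in rows:
    answers[category] = answer
print(answers)  # {'greeting': 'Hi there!'}

# Possible alternative (an assumption, not the notebook's method): keep all answers
all_answers = {}
for _text, category, answer in rows:
    all_answers.setdefault(category, []).append(answer)

def pick_answer(category):
    """Return one of the stored answers for the category, chosen at random."""
    return random.choice(all_answers[category])

print(pick_answer("greeting"))
```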

In [144]:
def get_content(filename):
    with open(filename, 'r') as content_file:
        data = json.load(content_file)
        all_data = []
        for intent in data['intents']:
            for pattern, response in zip(intent['patterns'], intent['responses']):
                all_data.append([pattern, intent['tag'], response])
    return all_data

filename = 'data.json'
data = get_content(filename)
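
The contents of `data.json` are not shown here. Judging only from the keys that `get_content` reads (`intents`, `tag`, `patterns`, `responses`), the file is assumed to look roughly like the inline sample below; the actual tags and texts are invented for illustration. The sketch also shows that `zip` pairs patterns with responses positionally and stops at the shorter list, so extra patterns are silently dropped.

```python
import json

# Hypothetical sample in the shape get_content appears to expect
sample = """
{"intents": [
  {"tag": "greeting",
   "patterns": ["hello", "hi there", "good morning"],
   "responses": ["Hello!", "Hi, how can I help?"]}
]}
"""

data = json.loads(sample)
rows = []
for intent in data["intents"]:
    # zip stops at the shorter list: the third pattern has no partner
    for pattern, response in zip(intent["patterns"], intent["responses"]):
        rows.append([pattern, intent["tag"], response])

print(rows)
# [['hello', 'greeting', 'Hello!'], ['hi there', 'greeting', 'Hi, how can I help?']]
```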

In [115]:
features_data, corpus, answers = extract_feature_from_doc(data)

## Train model
- Classification using Decision tree
- Classification using Naive Bayes

In [122]:
# Train Model
split_ratio = 0.85
def split_dataset(data, split_ratio):
    random.shuffle(data)
    data_length = len(data)
    train_split = int(data_length * split_ratio)
    return (data[:train_split]), (data[train_split:])
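
`split_dataset` shuffles its argument in place and gives a different split on every run. If a repeatable split is wanted, one option is sketched below. The function name `split_dataset_seeded` and the `seed` parameter are assumptions added for this example; it copies the list before shuffling so the caller's data is left untouched.

```python
import random

def split_dataset_seeded(data, split_ratio, seed=42):
    """Return (train, test) from a shuffled copy; same seed gives the same split."""
    shuffled = list(data)                  # copy, so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # private RNG, global state untouched
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]

items = list(range(20))                    # stand-in for features_data
train, test = split_dataset_seeded(items, 0.85)
print(len(train), len(test))
```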

In [123]:
training_data, test_data = split_dataset(features_data, split_ratio)
# save the data
np.save('training_data', training_data)
np.save('test_data', test_data)
# load data
training_data = np.load('training_data.npy', allow_pickle=True)
test_data = np.load('test_data.npy' , allow_pickle=True)

In [126]:
def train_using_decision_tree(training_data, test_data):
    classifier = nltk.classify.DecisionTreeClassifier.train(training_data, entropy_cutoff=0.6, support_cutoff=6)
    classifier_name = type(classifier).__name__
    training_set_accuracy = nltk.classify.accuracy(classifier, training_data)
    print('training set accuracy: ', training_set_accuracy)
    test_set_accuracy = nltk.classify.accuracy(classifier, test_data)
    print('test set accuracy: ', test_set_accuracy)
    return classifier, classifier_name, test_set_accuracy, training_set_accuracy

In [161]:
dtclassifier, classifier_name, test_set_accuracy, training_set_accuracy = train_using_decision_tree(training_data, test_data)

training set accuracy:  0.8870967741935484
test set accuracy:  0.08333333333333333
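
For reading these numbers: `nltk.classify.accuracy` reports the fraction of labeled examples that the classifier labels correctly, so a test accuracy of about 0.083 corresponds to roughly one correct prediction in twelve. The sketch below computes the same quantity in plain Python; `toy_classify` and `toy_test` are hypothetical stand-ins, not the trained model or the real test set.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of (features, label) pairs where classify(features) == label."""
    correct = sum(1 for feats, label in labeled_featuresets
                  if classify(feats) == label)
    return correct / len(labeled_featuresets)

# Hypothetical toy classifier and data, for illustration only
def toy_classify(feats):
    return "fever" if "fever" in feats else "other"

toy_test = [
    ({"fever": True}, "fever"),
    ({"cough": True}, "cough"),
    ({"pharmaci": True}, "other"),
    ({"fever": True, "treat": True}, "fever"),
]
print(accuracy(toy_classify, toy_test))  # 0.75
```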


In [147]:
def train_using_naive_bayes(training_data, test_data):
    classifier = nltk.NaiveBayesClassifier.train(training_data)
    classifier_name = type(classifier).__name__
    training_set_accuracy = nltk.classify.accuracy(classifier, training_data)
    test_set_accuracy = nltk.classify.accuracy(classifier, test_data)
    return classifier, classifier_name, test_set_accuracy, training_set_accuracy

In [160]:
classifier, classifier_name, test_set_accuracy, training_set_accuracy = train_using_naive_bayes(training_data, test_data)
print(training_set_accuracy)
print(test_set_accuracy)
# print(len(classifier.most_informative_features()))
# classifier.show_most_informative_features()

0.9193548387096774
0.08333333333333333


In [148]:
input_sentence = "Hospital"
dtclassifier.classify(word_feats(extractFeatures(input_sentence)))

'hospitals'

In [152]:
def reply1(input_sentence):
    category = dtclassifier.classify(word_feats(extractFeatures(input_sentence)))
    return answers[category]

def reply2(input_sentence):
    category = classifier.classify(word_feats(extractFeatures(input_sentence)))
    return answers[category]

In [153]:
reply1('fever treat')
reply2('fever treat')

"1) Lie down or sit down. To reduce the chance of fainting again, don't get up too quickly, 2) Place your head between your knees if you sit down, 3)Position the person on his or her back. If there are no injuries and the person is breathing, raise the person's legs above heart level — about 12 inches (30 centimeters) — if possible. Loosen belts, collars or other constrictive clothing. "

'To treat a fever at home: 1)Drink plenty of fluids to stay hydrated. 2)Dress in lightweight clothing. 3)Use a light blanket if you feel chilled, until the chills end. 4)Take acetaminophen (Tylenol, others) or ibuprofen (Advil, Motrin IB, others). 5) Get medical help if the fever lasts more than five days in a row.'

In [155]:
reply1('pharmacy')
reply2('pharmacy')

'You can search for pharmacies nearby using online maps or directories.'

'You can search for pharmacies nearby using online maps or directories.'

In [157]:
reply1('treat a mild Fever?')
reply2('treat a mild Fever?')

'To treat a fever at home: 1)Drink plenty of fluids to stay hydrated. 2)Dress in lightweight clothing. 3)Use a light blanket if you feel chilled, until the chills end. 4)Take acetaminophen (Tylenol, others) or ibuprofen (Advil, Motrin IB, others). 5) Get medical help if the fever lasts more than five days in a row.'

'To treat a fever at home: 1)Drink plenty of fluids to stay hydrated. 2)Dress in lightweight clothing. 3)Use a light blanket if you feel chilled, until the chills end. 4)Take acetaminophen (Tylenol, others) or ibuprofen (Advil, Motrin IB, others). 5) Get medical help if the fever lasts more than five days in a row.'

In [159]:
reply1('how to deal with bleed in nose?')
reply2('how to deal with bleed in nose?')

'1) Wash the affected area with soap and water. Apply a cold compress (such as a flannel or cloth cooled with cold water) or an ice pack to any swelling for at least 10 minutes. Raise or elevate the affected area if possible, as this can help reduce swelling '

'1) Wash the affected area with soap and water. Apply a cold compress (such as a flannel or cloth cooled with cold water) or an ice pack to any swelling for at least 10 minutes. Raise or elevate the affected area if possible, as this can help reduce swelling '