
# Text classification using NLP / Core engine of a chat bot.

Human language is astoundingly complex and diverse. When we write, we often misspell or abbreviate words, or omit punctuation, so much of the text around us is unstructured data. Natural language processing (NLP) helps computers communicate with humans in their own language and scales other language-related tasks. For example, NLP makes it possible for computers to read text, interpret it, measure sentiment and determine which parts are important. Understanding these techniques will enable you to build the core engine of a conversational chatbot.

Detecting patterns is a central part of natural language processing. Words ending in -ed tend to be past tense verbs. Frequent use of "will" is indicative of news text (3). These observable patterns (word structure and word frequency) happen to correlate with particular aspects of meaning, such as tense and topic. But how do we know where to start looking, and which aspects of form to associate with which aspects of meaning? In this series we will learn to create the core engine of a chatbot, using text classification built on the techniques of natural language processing.
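As a rough illustration of the two surface patterns mentioned above, the sketch below collects words ending in -ed and counts occurrences of "will" in a made-up sentence. The sample text and the helper name `surface_patterns` are assumptions for illustration only, and a suffix match is just a heuristic (it would also match a noun such as "speed").

```python
import re
from collections import Counter

def surface_patterns(text):
    """Return (words ending in -ed, number of times 'will' occurs)."""
    words = re.findall(r"[a-z]+", text.lower())
    # Crude form-based cue for past tense: suffix "ed" on words longer than 3 letters
    ed_words = [w for w in words if w.endswith("ed") and len(w) > 3]
    counts = Counter(words)
    return ed_words, counts["will"]

# Hypothetical sample sentence
sample = "The council voted yesterday and will meet again; members will decide later."
ed_words, will_count = surface_patterns(sample)
print(ed_words)
print(will_count)
```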



# Import useful libraries

In [None]:
import nltk

In [None]:
nltk.download_gui()  # a Python call, not a shell command, so no leading "!"


### Install NLTK components:

    nltk.download_gui()

The above opens a GUI. Select the following packages:

    stopwords (from Corpora)
    averaged_perceptron_tagger (from All Packages)
    wordnet

Or you can download all the NLTK components with:

    nltk.download('all')

Please note: downloading everything takes a long time (30-60 minutes, depending on Internet speed).

In [None]:
import re
import os
import csv
from nltk.stem.snowball import SnowballStemmer
import random
from nltk.classify import SklearnClassifier
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import numpy as np
import pandas as pd

In [None]:
## Get multiple outputs in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

In [None]:
## Display all rows and columns of a dataframe instead of a truncated version
from IPython.display import display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Preprocess

In [None]:
sentence = "The Big brown fox jumped over a lazy dog."
sentence2 = "This is particularly important in today's world where we are swamped with unstructured natural language data on the variety of social media platforms people engage in now-a-days (note -  now-a-days in the decade of 2010-2020)"

In [None]:
#convert sentence to lower case
'This' == 'this'
print('AbcdEFgH'.lower())
sentence.lower()
sentence2.lower()

False

abcdefgh


'the big brown fox jumped over a lazy dog.'

"this is particularly important in today's world where we are swamped with unstructured natural language data on the variety of social media platforms people engage in now-a-days (note -  now-a-days in the decade of 2010-2020)"

### Tokenize - extract individual words

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(sentence)
tokens
tokens2 = tokenizer.tokenize(sentence2)
tokens2

['The', 'Big', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

['This',
 'is',
 'particularly',
 'important',
 'in',
 'today',
 's',
 'world',
 'where',
 'we',
 'are',
 'swamped',
 'with',
 'unstructured',
 'natural',
 'language',
 'data',
 'on',
 'the',
 'variety',
 'of',
 'social',
 'media',
 'platforms',
 'people',
 'engage',
 'in',
 'now',
 'a',
 'days',
 'note',
 'now',
 'a',
 'days',
 'in',
 'the',
 'decade',
 'of',
 '2010',
 '2020']

### Stopwords: filter out words that carry little meaning

In [None]:
import nltk
nltk.download('stopwords')
filtered_words = [w for w in tokens if w not in stopwords.words('english')]
filtered_words

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

['The', 'Big', 'brown', 'fox', 'jumped', 'lazy', 'dog']
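Note that 'The' survives the filter above: the comparison is case-sensitive, and the entries in the English stop-word list are lowercase. A minimal sketch of the effect, using a tiny hand-written stop list (`STOP_WORDS` is an assumption standing in for `stopwords.words('english')`, so the example runs without any download):

```python
# Tiny hypothetical stop list; the real one comes from stopwords.words('english')
STOP_WORDS = {"the", "a", "over"}

tokens = ["The", "Big", "brown", "fox", "jumped", "over", "a", "lazy", "dog"]

# Case-sensitive filter: 'The' is kept because only 'the' is in the list
case_sensitive = [w for w in tokens if w not in STOP_WORDS]

# Lowercasing before the comparison removes it
lowercased = [w.lower() for w in tokens if w.lower() not in STOP_WORDS]

print(case_sensitive)
print(lowercased)
```

This is why lowercasing is usually done before stop-word filtering.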

In [None]:
filtered_words = [w for w in tokens2 if w not in stopwords.words('english')]
filtered_words

['This',
 'particularly',
 'important',
 'today',
 'world',
 'swamped',
 'unstructured',
 'natural',
 'language',
 'data',
 'variety',
 'social',
 'media',
 'platforms',
 'people',
 'engage',
 'days',
 'note',
 'days',
 'decade',
 '2010',
 '2020']

In [None]:
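def preprocess(sentence):
    # Lowercase first so that capitalized stop words such as 'The' are removed too
    sentence = sentence.lower()
    # Keep runs of word characters only; punctuation is dropped
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(sentence)
    # Build the stop-word set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    filtered_words = [w for w in tokens if w not in stop_words]
    return filtered_words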

In [None]:
preprocessed_sentence = preprocess(sentence)
print(preprocessed_sentence)

['big', 'brown', 'fox', 'jumped', 'lazy', 'dog']


In [None]:
preprocess(sentence2)

['particularly',
 'important',
 'today',
 'world',
 'swamped',
 'unstructured',
 'natural',
 'language',
 'data',
 'variety',
 'social',
 'media',
 'platforms',
 'people',
 'engage',
 'days',
 'note',
 'days',
 'decade',
 '2010',
 '2020']

## Tagging

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')
tags = nltk.pos_tag(preprocessed_sentence)
print(tags)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

[('big', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumped', 'VBD'), ('lazy', 'JJ'), ('dog', 'NN')]


In [None]:
tags = nltk.pos_tag(preprocess(sentence2))
print(tags)

[('particularly', 'RB'), ('important', 'JJ'), ('today', 'NN'), ('world', 'NN'), ('swamped', 'VBD'), ('unstructured', 'JJ'), ('natural', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('variety', 'NN'), ('social', 'JJ'), ('media', 'NNS'), ('platforms', 'NNS'), ('people', 'NNS'), ('engage', 'VBP'), ('days', 'NNS'), ('note', 'VBP'), ('days', 'NNS'), ('decade', 'NN'), ('2010', 'CD'), ('2020', 'CD')]


## Extracting only words with selected POS tags

POS tag list:

| Tag | Description | Example |
|---|---|---|
| `CC` | coordinating conjunction | |
| `CD` | cardinal number | |
| `DT` | determiner | |
| `EX` | existential there | "there is" (think of it as "there exists") |
| `FW` | foreign word | |
| `IN` | preposition/subordinating conjunction | |
| `JJ` | adjective | big |
| `JJR` | adjective, comparative | bigger |
| `JJS` | adjective, superlative | biggest |
| `LS` | list marker | 1) |
| `MD` | modal | could, will |
| `NN` | noun, singular | desk |
| `NNS` | noun, plural | desks |
| `NNP` | proper noun, singular | Harrison |
| `NNPS` | proper noun, plural | Americans |
| `PDT` | predeterminer | "all" in "all the kids" |
| `POS` | possessive ending | parent's |
| `PRP` | personal pronoun | I, he, she |
| `PRP$` | possessive pronoun | my, his, hers |
| `RB` | adverb | very, silently |
| `RBR` | adverb, comparative | better |
| `RBS` | adverb, superlative | best |
| `RP` | particle | give up |
| `TO` | to | go "to" the store |
| `UH` | interjection | errrrrrrrm |
| `VB` | verb, base form | take |
| `VBD` | verb, past tense | took |
| `VBG` | verb, gerund/present participle | taking |
| `VBN` | verb, past participle | taken |
| `VBP` | verb, non-3rd person singular present | take |
| `VBZ` | verb, 3rd person singular present | takes |
| `WDT` | wh-determiner | which |
| `WP` | wh-pronoun | who, what |
| `WP$` | possessive wh-pronoun | whose |
| `WRB` | wh-adverb | where, when |

In [None]:
def extract_tagged(sentences):
    # POS tags to keep: nouns, selected verb forms, adverbs, personal pronouns, adjectives
    keep_tags = {'NN', 'NNS', 'VBN', 'VBP', 'VBZ', 'VBG', 'RB', 'PRP', 'JJ'}
    features = []
    for word, tag in sentences:
        if tag in keep_tags:
            features.append(word)
    return features

In [None]:
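# Tag the first sentence again; `tags` was last assigned from sentence2
extract_tagged(nltk.pos_tag(preprocessed_sentence))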

['big', 'brown', 'fox', 'lazy', 'dog']
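The word 'jumped' is missing from this output because its tag, `VBD`, is not among the tags that `extract_tagged` keeps. The sketch below reproduces the filtering on the tagged pairs shown earlier, without calling the tagger; the names `KEEP_TAGS`, `kept` and `dropped` are introduced here for illustration.

```python
# Tags kept by extract_tagged in this notebook
KEEP_TAGS = {"NN", "NNS", "VBN", "VBP", "VBZ", "VBG", "RB", "PRP", "JJ"}

# Tagged tokens copied from the pos_tag output of the first sentence
tagged = [("big", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumped", "VBD"), ("lazy", "JJ"), ("dog", "NN")]

kept = [word for word, tag in tagged if tag in KEEP_TAGS]
dropped = [(word, tag) for word, tag in tagged if tag not in KEEP_TAGS]

print(kept)
print(dropped)
```

If past-tense verbs matter for the categories being classified, adding `VBD` to the kept tags would be one option to try.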

## Lemmatize words

In [None]:
import nltk
nltk.download('wordnet')
lmtzr = WordNetLemmatizer()
print(lmtzr.lemmatize('cacti'))
print(lmtzr.lemmatize('willing'))
print(lmtzr.lemmatize('feet'))
print(lmtzr.lemmatize('stemmed'))

print(lmtzr.lemmatize('cactus'))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

cactus
willing
foot
stemmed
cactus


## Stem words

In [None]:
words_for_stemming = ['stem', 'stemming', 'stemmed', 'stemmer', 'stems','feet','willing']

In [None]:
stemmer = SnowballStemmer("english")
[stemmer.stem(x) for x in words_for_stemming]

['stem', 'stem', 'stem', 'stemmer', 'stem', 'feet', 'will']
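The outputs above show the two approaches disagreeing: the lemmatizer maps 'feet' to 'foot' but leaves 'willing' alone, while the stemmer does the opposite. A toy sketch of why, assuming a stemmer is essentially rule-based suffix stripping and a lemmatizer is essentially a dictionary lookup. `toy_stem`, `toy_lemmatize`, `SUFFIXES` and `LEMMA_TABLE` are hypothetical and far simpler than Snowball or WordNet.

```python
SUFFIXES = ("ing", "ed", "s")                      # hypothetical, much smaller than Snowball's rules
LEMMA_TABLE = {"feet": "foot", "cacti": "cactus"}  # hypothetical stand-in for a dictionary such as WordNet

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a table; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ["willing", "jumped", "feet", "cacti"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

Rules generalize to unseen words but miss irregular forms; a lookup handles irregular forms but only for words it knows.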

## Putting it all together

In [None]:
def extract_feature(text):
    words = preprocess(text)
#     print('words: ',words)
    tags = nltk.pos_tag(words)
#     print('tags: ',tags)
    extracted_features = extract_tagged(tags)
#     print('Extracted features: ',extracted_features)
    stemmed_words = [stemmer.stem(x) for x in extracted_features]
#     print(stemmed_words)

    result = [lmtzr.lemmatize(x) for x in stemmed_words]
   
    return result

In [None]:
sentence

'The Big brown fox jumped over a lazy dog.'

In [None]:
words = extract_feature(sentence)
print(words)

['big', 'brown', 'fox', 'lazi', 'dog']


In [None]:
words = extract_feature(sentence2)
print(words)

['particular', 'import', 'today', 'world', 'unstructur', 'natur', 'languag', 'data', 'varieti', 'social', 'medium', 'platform', 'peopl', 'engag', 'day', 'note', 'day', 'decad']


In [None]:
extract_feature("He hurt his right foot while he was wearing white shoes on his feet")

['hurt', 'right', 'foot', 'wear', 'white', 'shoe', 'foot']

## Implementing bag of words

In simple terms, it’s a collection of words to represent a sentence, disregarding the order in which they appear.

In [None]:
def word_feats(words):
    # Presence-only bag of words: each distinct word maps to True
    return {word: True for word in words}

In [None]:
word_feats(words)

{'data': True,
 'day': True,
 'decad': True,
 'engag': True,
 'import': True,
 'languag': True,
 'medium': True,
 'natur': True,
 'note': True,
 'particular': True,
 'peopl': True,
 'platform': True,
 'social': True,
 'today': True,
 'unstructur': True,
 'varieti': True,
 'world': True}
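Two consequences of this representation are worth seeing concretely: word order is lost, and repeated words collapse into a single entry (note that 'day' appears twice in the input but once in the dictionary above). The sketch below uses a helper named `presence_bag`, an assumption that mirrors `word_feats`, and shows `collections.Counter` as one possible alternative when counts matter.

```python
from collections import Counter

def presence_bag(words):
    """Presence-only bag of words, equivalent in behavior to word_feats."""
    return {word: True for word in words}

# Same words, different order, one repeated
a = presence_bag(["annual", "leav", "taken", "leav"])
b = presence_bag(["taken", "leav", "annual"])
same_bag = (a == b)
print(same_bag)

# A Counter keeps how often each word occurred
counts = Counter(["annual", "leav", "taken", "leav"])
print(counts["leav"])
```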

## Parsing the whole document

In [None]:
def extract_feature_from_doc(data):
    result = []
    corpus = []
    # The responses of the chat bot
    answers = {}
    for (text,category,answer) in data:

        features = extract_feature(text)

        corpus.append(features)
        result.append((word_feats(features), category))
        answers[category] = answer

    return (result, sum(corpus,[]), answers)

In [None]:
extract_feature_from_doc([['this is the input text from the user','category','answer to give']])

([({'input': True, 'user': True}, 'category')],
 ['input', 'user'],
 {'category': 'answer to give'})

In [None]:
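def get_content(filename):
    # Each line of the file has the form: text|category|answer
    with open(filename, 'r', newline='') as content_file:
        lines = csv.reader(content_file, delimiter='|')
        # Keep only well-formed rows with exactly three fields
        data = [x for x in lines if len(x) == 3]
        return data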
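To see what the `len(x) == 3` check does, here is a small sketch that parses pipe-delimited text from an in-memory string instead of a file. The sample lines are made up, following the text|category|answer layout that `extract_feature_from_doc` expects.

```python
import csv
import io

# Hypothetical content: two well-formed lines and one malformed line
raw = (
    "hi|Greetings|Hello.\n"
    "this line has no delimiters\n"
    "thanks|Closing|Have a good day!\n"
)

# Rows that do not split into exactly three fields are skipped silently
rows = [row for row in csv.reader(io.StringIO(raw), delimiter="|") if len(row) == 3]
print(rows)
```

Because malformed lines are dropped without warning, comparing `len(data)` with the number of lines in the file is a cheap sanity check.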

In [None]:
filename = "leave.txt"
data = get_content(filename)

In [None]:
data

[['Hello',
  'Greetings',
  'Hello. I am Dexter. I will serve your leave enquiries.'],
 ['hi hello',
  'Greetings',
  'Hello. I am Dexter. I will serve your leave enquiries.'],
 ['hi ',
  'Greetings',
  'Hello. I am Dexter. I will serve your leave enquiries.'],
 ['hi', 'Greetings', 'Hello. I am Dexter. I will serve your leave enquiries.'],
 ['hi', 'Greetings', 'Hello. I am Dexter. I will serve your leave enquiries.'],
 ['hey',
  'Greetings',
  'Hello. I am Dexter. I will serve your leave enquiries.'],
 ['hello, hi',
  'Greetings',
  'Hello. I am Dexter. I will serve your leave enquiries.'],
 ['hey',
  'Greetings',
  'Hello. I am Dexter. I will serve your leave enquiries.'],
 ['hey, hi',
  'Greetings',
  'Hello. I am Dexter. I will serve your leave enquiries.'],
 ['hey, hello',
  'Greetings',
  'Hello. I am Dexter. I will serve your leave enquiries.'],
 ['Good morning',
  'Morning',
  'Good Morning. I am Dexter. I will serve your leave enquiries.'],
 ['Good afternoon',
  'Afternoon',
  

In [None]:
features_data, corpus, answers = extract_feature_from_doc(data)

In [None]:
print(features_data[50])

({'mani': True, 'option': True, 'leav': True}, 'Utilized-Optional-Leaves')


In [None]:
corpus

['hello',
 'hi',
 'hello',
 'hi',
 'hi',
 'hi',
 'hey',
 'hello',
 'hi',
 'hey',
 'hey',
 'hi',
 'hey',
 'hello',
 'good',
 'morn',
 'good',
 'afternoon',
 'good',
 'even',
 'good',
 'night',
 'today',
 'want',
 'help',
 'need',
 'help',
 'help',
 'want',
 'help',
 'want',
 'assist',
 'help',
 'great',
 'talk',
 'great',
 'thank',
 'help',
 'thank',
 'thank',
 'much',
 'thank',
 'thank',
 'much',
 'mani',
 'type',
 'leav',
 'type',
 'leav',
 'type',
 'leav',
 'type',
 'leav',
 'type',
 'mani',
 'leav',
 'taken',
 'mani',
 'leav',
 'alreadi',
 'taken',
 'mani',
 'annual',
 'leav',
 'mani',
 'annual',
 'leav',
 'taken',
 'mani',
 'annual',
 'leav',
 'alreadi',
 'taken',
 'annual',
 'leav',
 'count',
 'taken',
 'mani',
 'annual',
 'leav',
 'taken',
 'number',
 'annual',
 'leav',
 'taken',
 'annual',
 'leav',
 'taken',
 'number',
 'annual',
 'leav',
 'alreadi',
 'taken',
 'annual',
 'leav',
 'taken',
 'annual',
 'leav',
 'alreadi',
 'taken',
 'number',
 'annual',
 'leav',
 'taken',
 'numbe

In [None]:
answers

{'Afternoon': 'Good afternoon. I am Dexter. I will serve your leave enquiries.',
 'Balance-Annual-Leaves': 'You have 25 annual leaves remaining.',
 'Balance-Optional-Leaves': 'You have 2 optional leaves remaining.',
 'CF': 'You have 30 carry forward leaves.',
 'Closing': "It's glad to know that I have been helpful. Have a good day!",
 'Default-Balance-Annual-Leaves': 'You have 25 annual leaves left.',
 'Default-Utilized-Annual-Leaves': 'You have used 12 annual leaves.',
 'Evening': 'Good evening. I am Dexter. I will serve your leave enquiries.',
 'Goodbye': 'Good night. Take care.',
 'Greetings': 'Hello. I am Dexter. I will serve your leave enquiries.',
 'Help': 'How can I help you?',
 'Leaves-Type': 'Currently I know about two: annual and optional leaves.',
 'Morning': 'Good Morning. I am Dexter. I will serve your leave enquiries.',
 'No-Help': 'Ok sir/madam. No problem. Have a nice day.',
 'Opening': "I'm fine! Thank you. How can I help you?",
 'Utilized-Annual-Leaves': 'You have tak

# Train a model using these features

In [None]:
## split data into train and test sets
split_ratio = 0.8

In [None]:
def split_dataset(data, split_ratio):
    random.shuffle(data)
    data_length = len(data)
    train_split = int(data_length * split_ratio)
    return (data[:train_split]), (data[train_split:])
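Note that `random.shuffle` reorders the list in place, so `features_data` itself is shuffled by this call, and the split changes on every run. A possible variant, sketched under the assumption that a reproducible split is wanted, is shown below; `split_dataset_copy` and its `seed` parameter are not part of the original notebook.

```python
import random

def split_dataset_copy(data, split_ratio, seed=None):
    """Shuffle a copy of data and split it; the caller's list is left untouched."""
    shuffled = list(data)
    # A dedicated Random instance makes the split repeatable for a given seed
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]

items = list(range(10))
train, test = split_dataset_copy(items, 0.8, seed=42)
print(len(train), len(test))
print(items == list(range(10)))
```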

In [None]:
training_data, test_data = split_dataset(features_data, split_ratio)

In [None]:
training_data

[({'count': True, 'leav': True, 'option': True, 'remain': True},
  'Balance-Optional-Leaves'),
 ({'hi': True}, 'Greetings'),
 ({'forward': True, 'leav': True, 'number': True}, 'CF'),
 ({'alreadi': True,
   'annual': True,
   'leav': True,
   'number': True,
   'taken': True},
  'Utilized-Annual-Leaves'),
 ({'carri': True, 'forward': True, 'leav': True, 'mani': True}, 'CF'),
 ({'annual': True, 'count': True, 'leav': True, 'remain': True},
  'Balance-Annual-Leaves'),
 ({'carri': True, 'forward': True}, 'CF'),
 ({'leav': True, 'number': True, 'option': True}, 'Balance-Optional-Leaves'),
 ({'alreadi': True,
   'leav': True,
   'number': True,
   'option': True,
   'taken': True},
  'Utilized-Optional-Leaves'),
 ({'help': True, 'need': True}, 'Help'),
 ({'annual': True, 'count': True, 'leav': True, 'taken': True},
  'Utilized-Annual-Leaves'),
 ({'leav': True, 'mani': True, 'type': True}, 'Leaves-Type'),
 ({'carri': True, 'forward': True, 'leav': True, 'tell': True}, 'CF'),
 ({'leav': True},

In [None]:
# save the data
np.save('training_data', training_data)
np.save('test_data', test_data)
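Because the data is a list of (dict, label) pairs, NumPy has to store it as an object array, which is why loading it later requires `allow_pickle=True`. As an alternative sketch, the standard `pickle` module can save and restore the list directly. The sample data and the temporary path below are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical featuresets in the same (features, label) shape as training_data
sample = [({"hi": True}, "Greetings"), ({"help": True, "need": True}, "Help")]

# Write to a temporary directory so the example does not touch the working folder
path = os.path.join(tempfile.mkdtemp(), "training_data.pkl")
with open(path, "wb") as f:
    pickle.dump(sample, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == sample)
```

As with any pickled data, only load files from a source you trust.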

## Classification using Decision tree

In [None]:
training_data = np.load('training_data.npy', allow_pickle=True)
test_data = np.load('test_data.npy', allow_pickle=True)

In [None]:
def train_using_decision_tree(training_data, test_data):
    
    classifier = nltk.classify.DecisionTreeClassifier.train(training_data, entropy_cutoff=0.6, support_cutoff=6)
    classifier_name = type(classifier).__name__
    training_set_accuracy = nltk.classify.accuracy(classifier, training_data)
    print('training set accuracy: ', training_set_accuracy)
    test_set_accuracy = nltk.classify.accuracy(classifier, test_data)
    print('test set accuracy: ', test_set_accuracy)
    return classifier, classifier_name, test_set_accuracy, training_set_accuracy

In [None]:
dtclassifier, classifier_name, test_set_accuracy, training_set_accuracy = train_using_decision_tree(training_data, test_data)

training set accuracy:  0.8947368421052632
test set accuracy:  0.7931034482758621
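Conceptually, the accuracy reported here is the fraction of labeled examples for which the predicted label equals the true label (a test accuracy of 0.7931 would correspond, for instance, to 23 correct out of 29). The sketch below computes that fraction with a stand-in classifier; `accuracy`, `toy_classify` and the sample data are assumptions, not NLTK code.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of (features, label) pairs that classify() labels correctly."""
    correct = sum(1 for feats, label in labeled_featuresets if classify(feats) == label)
    return correct / len(labeled_featuresets)

def toy_classify(feats):
    """Hypothetical rule-based classifier used only to exercise accuracy()."""
    return "Greetings" if "hi" in feats else "Help"

labeled = [
    ({"hi": True}, "Greetings"),             # predicted Greetings: correct
    ({"help": True}, "Help"),                # predicted Help: correct
    ({"thank": True}, "Closing"),            # predicted Help: wrong
    ({"hi": True, "help": True}, "Help"),    # predicted Greetings: wrong
]

score = accuracy(toy_classify, labeled)
print(score)
```

A gap between training and test accuracy, as seen above, is a common sign that the model fits the training examples better than unseen ones.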


## Classification using Naive Bayes

In [None]:
def train_using_naive_bayes(training_data, test_data):
    classifier = nltk.NaiveBayesClassifier.train(training_data)
    classifier_name = type(classifier).__name__
    training_set_accuracy = nltk.classify.accuracy(classifier, training_data)
    test_set_accuracy = nltk.classify.accuracy(classifier, test_data)
    return classifier, classifier_name, test_set_accuracy, training_set_accuracy

In [None]:
classifier, classifier_name, test_set_accuracy, training_set_accuracy = train_using_naive_bayes(training_data, test_data)
print(training_set_accuracy)
print(test_set_accuracy)
print(len(classifier.most_informative_features()))
classifier.show_most_informative_features()

0.8771929824561403
0.6206896551724138
64
Most Informative Features
                    leav = None           Greeti : Balanc =     13.9 : 1.0
                    mani = True           Defaul : Balanc =      7.8 : 1.0
                   taken = None           Balanc : Utiliz =      5.1 : 1.0
                 alreadi = True           Defaul : Utiliz =      4.4 : 1.0
                   thank = None           Utiliz : Closin =      3.3 : 1.0
                  remain = None           Utiliz : Balanc =      3.0 : 1.0
                   carri = None           Utiliz : CF     =      2.8 : 1.0
                      hi = None           Utiliz : Greeti =      2.8 : 1.0
                    help = None           Utiliz : No-Hel =      2.6 : 1.0
                    want = None           Utiliz : No-Hel =      2.6 : 1.0


In [None]:
classifier.classify(({'mani': True, 'option': True, 'leav': True}))

'Balance-Optional-Leaves'

In [None]:
extract_feature("hello")

['hello']

In [None]:
word_feats(extract_feature("hello"))

{'hello': True}

In [None]:
input_sentence = "how many balanced leaves do I have?"
classifier.classify(word_feats(extract_feature(input_sentence)))

'Balance-Optional-Leaves'

In [None]:
def reply(input_sentence):
    # Classify with the decision tree trained above, then look up the canned answer
    category = dtclassifier.classify(word_feats(extract_feature(input_sentence)))
    return answers[category]

In [None]:
reply('Hi')

'Hello. I am Dexter. I will serve your leave enquiries.'

In [None]:
reply('How many annual leaves do I have left?')

'You have 25 annual leaves remaining.'

In [None]:
reply('How many leaves have I taken?')

'You have taken 1 optional leaves.'

In [None]:
reply('Thanks!')

"It's glad to know that I have been helpful. Have a good day!"

# Conclusion:

Once a model has been trained with an algorithm that gives acceptable accuracy, it can be called from any chatbot UI framework.
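Before wiring the classifier into a UI, it may help to guard against inputs that yield no features at all (for example, a message made up only of stop words), since the classifier will still return some label for them. A minimal sketch, assuming a fallback message is acceptable; `safe_reply`, `toy_classify`, `toy_answers` and the fallback text are hypothetical and stand in for the trained classifier and the `answers` dictionary.

```python
def safe_reply(classify, answers, features, fallback="Sorry, I did not understand that."):
    """Return the answer for the predicted category, or a fallback message."""
    if not features:
        # Nothing to classify on, so do not guess a category
        return fallback
    return answers.get(classify(features), fallback)

def toy_classify(feats):
    """Hypothetical stand-in for classifier.classify."""
    return "Greetings" if "hi" in feats else "Unknown"

toy_answers = {"Greetings": "Hello. I am Dexter. I will serve your leave enquiries."}

print(safe_reply(toy_classify, toy_answers, {"hi": True}))
print(safe_reply(toy_classify, toy_answers, {}))
print(safe_reply(toy_classify, toy_answers, {"weather": True}))
```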