## Classification

Let us start by importing the packages we need.

In [1]:
import nltk
from nltk.stem.lancaster import LancasterStemmer
import json

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem.
For example, the stem of "catlike" is "cat".
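To make the idea concrete, here is a toy suffix stripper. It is only an illustration: the `toy_stem` function and its suffix list are invented for this sketch and are far cruder than the Lancaster algorithm used in this notebook.

```python
# Toy illustration of stemming; NOT the Lancaster algorithm.
# The suffix list below is a made-up assumption for this sketch.
SUFFIXES = ("like", "ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("catlike"))  # cat
print(toy_stem("Cats"))     # cat
print(toy_stem("do"))       # do
```

A real stemmer applies many more rules, which is why the Lancaster output later in this section contains stems such as "ther" and "goodby".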

In [2]:
stemmer = LancasterStemmer()

Each entry in training_data pairs a sentence with the class (intent) we want it to match.

In [3]:
training_data = []
training_data.append({"class": "intro", "sentence": "Who are you?"})
training_data.append({"class": "intro", "sentence": "Hi"})
training_data.append({"class": "intro", "sentence": "what do you do?"})
training_data.append({"class": "intro", "sentence": "what can you do for me?"})
training_data.append({"class": "greet", "sentence": "Hey"})
training_data.append({"class": "greet", "sentence": "howdy"})
training_data.append({"class": "greet", "sentence": "hey there"})
training_data.append({"class": "greet", "sentence": "hello"})
training_data.append({"class": "affirm", "sentence": "yes"})
training_data.append({"class": "affirm", "sentence": "Yeah"})
training_data.append({"class": "affirm", "sentence": "that's right"})
training_data.append({"class": "goodbye", "sentence": "Bye"})
training_data.append({"class": "goodbye", "sentence": "goodbye"})

Turn the list of class labels into a set (of unique items) and then back into a list; this removes duplicates.

In [5]:
corpus_words = {}
class_words = {}
classes = list(set([a['class'] for a in training_data]))
for c in classes:
    class_words[c] = []
print(class_words)

{'affirm': [], 'greet': [], 'goodbye': [], 'intro': []}
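Because a set has no defined order, the order of `classes` can differ between runs. If a stable order matters, one option is `dict.fromkeys`, which keeps first-seen order. A small sketch, using a shortened stand-in for `training_data` (the names `sample_data` and `ordered_classes` are introduced here):

```python
# Shortened stand-in for training_data; only the labels matter here.
sample_data = [
    {"class": "intro", "sentence": "Who are you?"},
    {"class": "intro", "sentence": "Hi"},
    {"class": "greet", "sentence": "Hey"},
    {"class": "affirm", "sentence": "yes"},
    {"class": "greet", "sentence": "hello"},
    {"class": "goodbye", "sentence": "Bye"},
]

# dict keys are unique and keep insertion order (Python 3.7+).
ordered_classes = list(dict.fromkeys(d["class"] for d in sample_data))
print(ordered_classes)  # ['intro', 'greet', 'affirm', 'goodbye']
```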


#### Now we will fill the class words with their respective data

In [6]:
word_stems = []
for data in training_data:
    # tokenize each sentence into words
    for word in nltk.word_tokenize(data['sentence']):
        # ignore a few things
        if word not in ["?", "'s"]:
            # stem and lowercase each word
            stemmed_word = stemmer.stem(word.lower())
            word_stems.append(stemmed_word)
            if stemmed_word not in corpus_words:
                corpus_words[stemmed_word] = 1
            else:
                corpus_words[stemmed_word] += 1

            class_words[data['class']].append(stemmed_word)
print(word_stems)
for item in class_words:
    print(item)
    print(class_words[item])

['who', 'ar', 'you', 'hi', 'what', 'do', 'you', 'do', 'what', 'can', 'you', 'do', 'for', 'me', 'hey', 'howdy', 'hey', 'ther', 'hello', 'ye', 'yeah', 'that', 'right', 'bye', 'goodby']
affirm
['ye', 'yeah', 'that', 'right']
greet
['hey', 'howdy', 'hey', 'ther', 'hello']
goodbye
['bye', 'goodby']
intro
['who', 'ar', 'you', 'hi', 'what', 'do', 'you', 'do', 'what', 'can', 'you', 'do', 'for', 'me']


In [7]:
print("Corpus words and counts: %s" % corpus_words)
# class_words holds every stemmed word seen for each class
print("Class words: %s" % class_words)
# we can now calculate a simplified, Naive Bayes-inspired score for a new sentence
sentence = "Hi"

Corpus words and counts: {'who': 1, 'ar': 1, 'you': 3, 'hi': 1, 'what': 2, 'do': 3, 'can': 1, 'for': 1, 'me': 1, 'hey': 2, 'howdy': 1, 'ther': 1, 'hello': 1, 'ye': 1, 'yeah': 1, 'that': 1, 'right': 1, 'bye': 1, 'goodby': 1}
Class words: {'affirm': ['ye', 'yeah', 'that', 'right'], 'greet': ['hey', 'howdy', 'hey', 'ther', 'hello'], 'goodbye': ['bye', 'goodby'], 'intro': ['who', 'ar', 'you', 'hi', 'what', 'do', 'you', 'do', 'what', 'can', 'you', 'do', 'for', 'me']}
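The same kind of bookkeeping can be done with the standard library's `collections.Counter` and `defaultdict`. The sketch below assumes the sentences have already been tokenized and stemmed, and uses a subset of the stems shown in the output above; `stemmed`, `word_counts`, and `words_by_class` are names introduced here.

```python
from collections import Counter, defaultdict

# (class, stems) pairs: a subset of the stems shown in the output above.
stemmed = [
    ("intro", ["what", "do", "you", "do"]),
    ("intro", ["what", "can", "you", "do", "for", "me"]),
    ("greet", ["hey"]),
    ("greet", ["hey", "ther"]),
]

word_counts = Counter()             # plays the role of corpus_words
words_by_class = defaultdict(list)  # plays the role of class_words

for label, stems in stemmed:
    word_counts.update(stems)            # add 1 per occurrence
    words_by_class[label].extend(stems)  # keep every stem, duplicates included

print(word_counts["do"])        # 3
print(word_counts["hey"])       # 2
print(words_by_class["greet"])  # ['hey', 'hey', 'ther']
```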


In [8]:
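def calculate_class_score(sentence, class_name):
    """Count how many tokens of `sentence` appear in the words of `class_name`."""
    score = 0
    # note: tokens are compared as-is, without lowercasing or stemming
    for word in nltk.word_tokenize(sentence):
        if word in class_words[class_name]:
            score += 1
    return score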

In [9]:
sentence = "hey guys. What's up?"
for c in class_words.keys():
    print("Class: %s  Score: %s" % (c, calculate_class_score(sentence, c)))

Class: affirm  Score: 0
Class: greet  Score: 1
Class: goodbye  Score: 0
Class: intro  Score: 0
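The scoring function compares raw tokens against class words that were lowercased and stemmed during training, so "What" never matches the stored "what" and the intro class scores 0 here. A minimal sketch of normalizing the input first is shown below. It is a stand-in under stated assumptions: `simple_tokens` uses a regular expression and lowercasing instead of `nltk.word_tokenize` and the Lancaster stemmer, and `sample_class_words` is abridged from `class_words`.

```python
import re

# Abridged from the class_words output above (duplicates in "intro" dropped).
sample_class_words = {
    "greet": ["hey", "howdy", "hey", "ther", "hello"],
    "intro": ["who", "ar", "you", "hi", "what", "do", "can", "for", "me"],
}

def simple_tokens(sentence):
    """Lowercase and split into alphabetic tokens (no stemming here)."""
    return re.findall(r"[a-z]+", sentence.lower())

def normalized_score(sentence, class_name):
    """Count normalized tokens that appear among the words of the class."""
    words = sample_class_words[class_name]
    return sum(1 for token in simple_tokens(sentence) if token in words)

print(normalized_score("hey guys. What's up?", "greet"))  # 1
print(normalized_score("hey guys. What's up?", "intro"))  # 1
```

In the notebook itself, the equivalent change would be to pass each token through `stemmer.stem(word.lower())` before the membership test, mirroring how the training sentences were processed.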


We can significantly improve our algorithm by accounting for the commonality of each word. The word “is” should carry a lower weight than the word “python” in most cases, because it is more common.

In [10]:
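def calculate_class_score_commonality(sentence, class_name):
    """Score `sentence` against `class_name`, weighting rare words more heavily."""
    score = 0
    for word in nltk.word_tokenize(sentence):
        if word in class_words[class_name]:
            # each match contributes the inverse of the word's corpus count
            score += (1 / corpus_words[word])
    return score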


# now we can find the class with the highest score
for c in class_words.keys():
    print("Class: %s  Score: %s" % (c, calculate_class_score_commonality(sentence, c)))

Class: affirm  Score: 0
Class: greet  Score: 0.5
Class: goodbye  Score: 0
Class: intro  Score: 0


##### Let's write a helper function for the above

In [11]:
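def find_class(sentence):
    """Return (class, score) for the best-scoring class, or (None, 0) if nothing matches."""
    high_class = None
    high_score = 0
    for c in class_words.keys():
        score = calculate_class_score_commonality(sentence, c)
        # keep the class with the highest score seen so far
        if score > high_score:
            high_class = c
            high_score = score
    return high_class, high_score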


In [12]:
print(find_class("Is there anything cool you can do for me?"))

('intro', 3.6666666666666665)
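The score can be checked by hand. Five tokens of the test sentence ("you", "can", "do", "for", "me") appear in the intro class, and each contributes the inverse of its corpus count. The sketch below redoes that arithmetic with exact fractions; `matched_counts` is a name introduced here, with counts copied from the `corpus_words` output earlier in this section.

```python
from fractions import Fraction

# Corpus counts for the words of the test sentence found in the "intro" class.
matched_counts = {"you": 3, "can": 1, "do": 3, "for": 1, "me": 1}

# Each matched word contributes 1 / (its corpus count).
total = sum(Fraction(1, count) for count in matched_counts.values())

print(total)         # 11/3
print(float(total))  # 3.6666666666666665
```

The common words "you" and "do" add only 1/3 each, while the words seen once add a full point, which is the weighting effect described above.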
