# Learning *Deep Learning* by creating a chatbot

In this tutorial we will create a chatbot from scratch using deep learning.

### Basic definitions used in this tutorial

**Utterance**: A sentence or word that the user types or speaks. This is the input of the model.

**Intent**: A class that a user utterance belongs to. This is the output of the model.

For example, when a user says *show me details about your job* to a bot, `show me details about your job` is an *utterance* and `JobDetails` can be the name of its *intent*, since the user's main intent is to get details about our job.


We will create a deep learning model that takes an *utterance* and predicts the *intent* that this utterance falls into. The chatbot model is trained using the data stored in `intents.json`.
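To make the two terms concrete, here is a rough sketch of what one entry of `intents.json` might look like once it is loaded into Python. The key names `intents`, `intent`, `utterances` and `responses` are the ones the code later in this tutorial reads; the values and the name `sample_data` are invented for illustration.

```python
# Hypothetical sample entry; only the key names come from the tutorial's code.
sample_data = {
    "intents": [
        {
            "intent": "JobDetails",
            "utterances": ["show me details about your job", "what do you do"],
            "responses": ["I am a software engineer"],
        }
    ]
}

for entry in sample_data["intents"]:
    for utterance in entry["utterances"]:
        # Each utterance is an input; the entry's intent is its label
        print(f"{utterance!r} -> {entry['intent']}")
```

Every utterance listed under an entry shares that entry's intent, which is what lets us turn the file into labelled training pairs.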


## Basics of Neural Networks

An artificial neural network is inspired by biological neural networks. Just as the neuron is the fundamental unit of a biological neural network, an artificial neural network is also built from neurons. A neuron is a node in the graph that the neural network builds while training itself.
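As a minimal sketch of what a single artificial neuron computes, the snippet below uses the common formulation of a weighted sum of inputs plus a bias, passed through an activation function. The choice of the sigmoid activation and all the numbers are assumptions made for illustration, not something this tutorial's model depends on.

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus a bias, passed through an activation
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Illustrative values only
output = neuron(inputs=[1.0, 0.0, 1.0], weights=[0.5, -0.3, 0.8], bias=-1.3)
print(output)
```

Training a network amounts to adjusting the `weights` and `bias` of many such neurons at once.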


### Layers

The 

### Activation Function

### Optimization Function


The model then uses this data to learn by optimizing the loss. Learning basically means finding the weights that let the model make predictions with the lowest margin of error. Once the model has learned to minimize the loss on our training data, it can predict on future data.
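The idea of learning by optimizing a loss can be sketched with plain gradient descent on a toy problem. The loss function, the starting weight and the learning rate below are all made up for illustration; a real network repeats the same kind of update for many weights at once.

```python
def loss(w):
    # Toy loss with its minimum at w = 3
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the toy loss with respect to w
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting weight
learning_rate = 0.1  # size of each update step

for step in range(100):
    # Move the weight a small step against the gradient
    w -= learning_rate * gradient(w)

print(w, loss(w))
```

Each step moves `w` closer to the value that minimizes the loss, which is the same principle the optimizer applies when we train the chatbot model.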

## Processing Data

The first step in creating any deep learning model is processing the data. We process the data to convert it into a format that can be fed into the model.
 

### Loading Data

We create a utility function named `load_data` that reads and parses the raw data stored in a given file (it assumes the file is JSON). To read the content of the given file we use Python's built-in `open` function. Then, to convert the content of the file into a Python dictionary, we use the `loads` function provided by the `json` library. We return this dictionary so that the caller can perform any dictionary operation on it.


**Function:*** A 

In [68]:
import json

def load_data(json_file_name):
    # Read the whole file (closing it automatically), then parse the JSON text
    with open(json_file_name) as data_file:
        return json.loads(data_file.read())

Now that we have a utility function for reading data, we can use it to load our intents.

In [69]:
file_name = 'intents.json'
data = load_data(file_name)['intents']

Notice that we extracted `intents` from the dictionary returned by `load_data` and stored the list of *intents* in a variable named `data`.

The data that we loaded from `intents.json` above is in a human-readable format. But `keras`, the model-building library we use, does not support the data structure that the data is currently in. Therefore, we need to preprocess the data to make it compatible with the model. To achieve this, we will use a very popular natural language processing library named `nltk` (Natural Language Toolkit).

### Tokenization

*Tokenization* is the process of dividing a long text into smaller chunks.

A word tokenizer tokenizes a text into an array of words. The easiest way to do this is to split the text on white space, but this does not always give the best result. For example, `I am a man.` gives us the array `[I, am, a, man.]` when `[I, am, a, man, .]` would be a better result. Since this is such a common task when parsing text, `nltk` provides a utility function, `word_tokenize`, that tokenizes a text into words.
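The difference between the two results can be seen in a small sketch. The regular expression below is only a simple stand-in chosen for illustration; it is not how `word_tokenize` works internally.

```python
import re

text = "I am a man."

# Splitting on white space leaves the full stop glued to the last word
whitespace_tokens = text.split()
print(whitespace_tokens)

# Treat runs of word characters and single punctuation marks as separate tokens
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
```

With the punctuation split off, `man` and `man.` are no longer counted as two different words.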

A sentence tokenizer tokenizes a text into an array of sentences. At first it might seem straightforward to identify a sentence: anything that starts with a capitalized word and ends with `.` is a sentence. So the basic intuition is to split the text on `.`. But sentences like `Mr. Upen is a good person.` quickly show that tokenizing a text is not always straightforward. So we use `punkt` to get a reasonably good tokenization of our text.
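A short sketch shows how the naive rule of splitting on every `.` goes wrong. The second sentence in the sample text is made up for illustration.

```python
text = "Mr. Upen is a good person. He writes code."

# Naive rule: every full stop ends a sentence
naive_sentences = [part.strip() for part in text.split(".") if part.strip()]
print(naive_sentences)
```

The abbreviation `Mr.` is cut off as if it were a sentence of its own, so the text is split into three pieces instead of two. Handling cases like this is the reason to rely on a trained sentence tokenizer.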


### Stemming and Lemmatization


[Here is an article that I find excellent explaining it](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)


### Stopwords

In many languages there are words that add little to the meaning of a sentence and are mostly there for grammatical correctness. For example, in `I have a ball`, `I` and `a` can be ignored when extracting the root meaning of the sentence. `I` and `a` provide more context, but since we have limited processing resources and time, we can ignore them to simplify our model.

`nltk` provides lists of `stopwords` for different languages. We have to download this resource before using it, which can be done with `nltk.download('stopwords')`. Then we can get the set of stopwords in English, for example, using `set(stopwords.words('english'))`.
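Filtering stopwords can be sketched without downloading anything. The set `toy_stop_words` and the helper `remove_stop_words` below are hypothetical and hand-written for illustration; they stand in for the much longer English list that `nltk` provides.

```python
# Tiny hand-picked stopword set, used here instead of nltk's English list
toy_stop_words = {"i", "a", "the", "is"}

def remove_stop_words(tokens):
    # Compare in lower case so that "I" and "i" are treated alike
    return [t for t in tokens if t.lower() not in toy_stop_words]

filtered = remove_stop_words(["I", "have", "a", "ball"])
print(filtered)
```

Only `have` and `ball` survive, which are the words that carry the root meaning of the sentence.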

In [70]:
import nltk
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Gamer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [71]:
# All words seen across the intents (duplicates are removed later)
words = []
# All the possible intents in our data
intents = []

# List of tuples ([<list of words in utterance>], '<string representing the intent>')
word_to_intent = []

## TODO update this to actually use stopwords
stop_words =  ['.', '!'] # set(stopwords.words('english'))

def get_words_in_sentence(sentence):
    return [lemmatizer.lemmatize(w) for w in word_tokenize(sentence) if w not in stop_words]

for intent in data:
    for utterance in intent['utterances']:
        new_words = get_words_in_sentence(utterance)
        words.extend(new_words)
        
        word_to_intent.append((new_words, intent['intent']))
        
        if intent['intent'] not in intents:
            intents.append(intent['intent'])

## Bag of words

[Stanford NLP](https://nlp.stanford.edu/IR-book/html/htmledition/term-frequency-and-weighting-1.html)

The machine learning algorithms provided by the `keras` library cannot work with lists of strings directly. So we have to find a way to represent these lists of strings as numeric arrays. One technique for this is called *bag of words*. The idea is to create an `M by N` array containing 0s and 1s, where `M` is the number of utterances and `N` is the number of unique words in our training set. Each utterance is converted to an array of length `N` containing 1s and 0s: 1 means the word is in the utterance and 0 means it is not. The arrays created for the utterances are then stacked to create the `M by N` array (the bag).


For our example we need two bags. One represents the utterances (X, the input that our model takes) and one represents the corresponding intent of each utterance (Y, the output that our model predicts). In Y, each row has a single 1 in the position of the utterance's intent.
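Here is a small worked sketch of the idea, with one row per utterance and one column per vocabulary word. The three tokenized utterances and the names `vocabulary` and `to_bag` are made up for illustration.

```python
# Hypothetical tokenized utterances, invented for illustration
utterances = [["good", "morning"], ["what", "school"], ["good", "school"]]

# Sorted vocabulary so that every word always maps to the same column
vocabulary = sorted({word for tokens in utterances for word in tokens})

def to_bag(tokens):
    # 1 if the vocabulary word occurs in the utterance, 0 otherwise
    return [1 if word in tokens else 0 for word in vocabulary]

bag = [to_bag(tokens) for tokens in utterances]

print(vocabulary)
print(len(bag), "rows x", len(bag[0]), "columns")
for row in bag:
    print(row)
```

Note that word order is lost: the bag only records which words are present, which is usually enough to tell short utterances with different intents apart.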

In [72]:
import random
import numpy as np

# Sort the vocabulary so each word always maps to the same bag position
unique_words = sorted(set(words))
training = []
empty_tag_bag = [0] * len(intents)

def convert_words_to_hot_bag(words):
    # 1 if the vocabulary word occurs in the utterance, 0 otherwise
    return [1 if word in words else 0 for word in unique_words]

def convert_sentence_to_hot_bag(sentence):
    words_in_sentence = get_words_in_sentence(sentence)
    return convert_words_to_hot_bag(words_in_sentence)

for word_intent in word_to_intent:
    utterance_bag = convert_words_to_hot_bag(word_intent[0])
    # One-hot encode the intent of this utterance
    tag_bag = list(empty_tag_bag)
    tag_bag[intents.index(word_intent[1])] = 1
    training.append([utterance_bag, tag_bag])

random.shuffle(training)

# create train input and label. X - patterns, Y - intents
# The two bags differ in length, so we split the pairs directly
# instead of packing them into a single (ragged) NumPy array
train_x = [pair[0] for pair in training]
train_y = [pair[1] for pair in training]


In [73]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD

In [74]:
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))

# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model.
# `lr` is the argument name used by older Keras releases; newer releases call it `learning_rate`.
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

# Fit the model, then save it to disk
hist = model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5)
model.save('chatbot_model.h5')

print("model created")

Epoch 1/200
...
Epoch 200/200
model created


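The `word_tokenize` function used above relies on `punkt`, a resource from `nltk`. `Punkt` is a tokenizer that splits a given text into sentences. Like `stopwords`, it has to be downloaded once before use, for example with `nltk.download('punkt')` (the exact resource name can depend on the `nltk` version).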

In [75]:
def predict(utterance):
    input_bag = convert_sentence_to_hot_bag(utterance)
    res = model.predict(np.array([input_bag]))[0]
    ERROR_THRESHOLD = 0.25
    # Keep [intent index, probability] pairs above the threshold
    results = [[index, probability] for index, probability in enumerate(res) if probability > ERROR_THRESHOLD]
    if not results:
        # No intent cleared the threshold; fall back to the most likely one
        best = int(np.argmax(res))
        results = [[best, res[best]]]
    # Most probable intent first
    results.sort(key=lambda x: x[1], reverse=True)
    top = results[0]
    return {"intent": intents[top[0]], "probability": str(top[1])}


In [76]:
def getResponse(utterance):
    intent = predict(utterance)['intent']
    predicted_intent_data = [i for i in data if i['intent'] == intent][0]
    possible_responses = predicted_intent_data['responses']
    return random.choice(possible_responses)

In [77]:
getResponse('good morning')

'Hi there, nice to meet you! How may I help?'

In [78]:
getResponse('what school?')

'Fisk Forever!'

In [79]:
getResponse('where did you study')

'I did my undergraduate from Fisk University.'

In [80]:
getResponse('profession')

'I am a software engineer'

In [83]:
getResponse('')

'Hi! How can I help?'