# $$ Train ChatBot $$

## Import Libraries and Load the Data

In [1]:
import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
import random

**Code Explanation:**
- `numpy` is imported as `np` for numerical operations.
- `tensorflow` is imported as `tf`; it provides the backend that Keras uses to build and train the model.
- `Sequential` is a model type in Keras that allows you to build a neural network layer by layer.
- `Dense` is a layer type for creating densely-connected neural network layers.
- `Activation` is a standalone activation layer. This notebook passes `activation=` directly to `Dense` instead, so the import is not strictly needed.
- `Dropout` is used to add dropout regularization to the model.
- `SGD` stands for Stochastic Gradient Descent, which is an optimization algorithm used for training the neural network.
- `random` is imported to shuffle the training examples before they are passed to the model.


In [2]:
# Import necessary libraries for text processing
import nltk
from nltk.stem import WordNetLemmatizer

# Initialize a lemmatizer for word normalization
lemmatizer = WordNetLemmatizer()

# Import JSON and Pickle for data serialization
import json
import pickle


**Code Explanation:**
- `nltk` is a powerful library for natural language processing. It provides tools for working with text data.
- We import the `WordNetLemmatizer` from `nltk.stem`. A lemmatizer reduces words to their base or dictionary form so that different forms count as one word (e.g., "drugs" to "drug"). Note that `lemmatize()` treats every word as a noun unless a part-of-speech tag is passed, so "running" only becomes "run" when it is called with `pos='v'`.
- `json` is imported to work with JSON data. JSON is a common data format for storing and exchanging structured data.
- `pickle` is imported to work with Python object serialization. It allows us to save and load Python objects like variables or models.


In [3]:
# Load the intents file; the context manager closes the file afterwards
with open('intents.json') as f:
    intents = json.load(f)
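The loading code does not show what `intents.json` contains. Judging from the keys the later cells read (`intents`, `tag`, `patterns`), the file probably looks something like the miniature below. This is a sketch, not the real file: the `responses` key and its texts are assumptions made for illustration.

```python
import json

# Hypothetical miniature of intents.json. The patterns are taken from the
# notebook output; the "responses" key and its texts are invented.
sample = """
{
  "intents": [
    {"tag": "greeting",
     "patterns": ["Hi there", "How are you"],
     "responses": ["Hello!"]},
    {"tag": "goodbye",
     "patterns": ["Bye", "See you later"],
     "responses": ["Goodbye!"]}
  ]
}
"""

intents_demo = json.loads(sample)

# The preprocessing cells only rely on the "tag" and "patterns" keys.
for intent in intents_demo["intents"]:
    print(intent["tag"], "->", intent["patterns"])
```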

## Preprocessing the Data

### Tokenization: Preparing Text Data for Analysis

In natural language processing (NLP) and machine learning, the data used by models cannot be raw text; it needs to go through a series of pre-processing steps to make it more suitable for analysis. One of the fundamental preprocessing techniques for textual data is "tokenization."

**Tokenization** is the process of breaking down sentences or paragraphs into individual words or tokens. These tokens are the basic building blocks that enable machines to understand and work with textual data. In our code, we use tokenization to convert raw text into a more structured format, making it easier for the machine to analyze.

Here's why tokenization is important in our project:

- In our project, we are working with intents, which are groups of user queries (patterns) and corresponding responses. Each intent has a tag naming its category or purpose, such as "greeting," "goodbye," or "thanks."

- To prepare this textual data for analysis, we tokenize each pattern. This means that we break down sentences or user queries into individual words. For example, the sentence "How are you doing?" would be tokenized into "How," "are," "you," "doing," and "?". Punctuation marks come out as separate tokens, which is why they are filtered out later.

- The tokenized words are then collected into lists, making it easier for us to manage and work with the textual data. For instance, all the tokens from various user queries associated with a particular intent are stored in a list.

- Additionally, we maintain a list of classes, which represents the categories or intents. Each intent is associated with specific patterns. This list of classes helps us understand the scope of our project and categorize user queries effectively.
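To make the idea concrete, here is a rough standard-library approximation of word-level tokenization. It is not what `nltk.word_tokenize` does internally; NLTK handles contractions such as "That's" and other edge cases differently. The helper name `simple_tokenize` is made up for this sketch.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("How are you doing?")
print(tokens)  # ['How', 'are', 'you', 'doing', '?']
```

The question mark ends up as its own token, which matches the `'?'` entries visible in the notebook's printed documents.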


In [4]:
# Initialize empty lists for words, classes, and documents
words = []
classes = []
documents = []

# Define a list of characters to ignore
ignore_letters = ['!', '?', ',', '.']

In [5]:
for intent in intents['intents']:
    for pattern in intent['patterns']:
        # Split the pattern into tokens (requires NLTK's Punkt tokenizer data)
        tokens = nltk.word_tokenize(pattern)
        words.extend(tokens)
        # Store the tokens together with their intent tag
        documents.append((tokens, intent['tag']))
        # Record each tag only once
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

print(documents)

[(['Hi', 'there'], 'greeting'), (['How', 'are', 'you'], 'greeting'), (['Is', 'anyone', 'there', '?'], 'greeting'), (['Hey'], 'greeting'), (['Hola'], 'greeting'), (['Hello'], 'greeting'), (['Good', 'day'], 'greeting'), (['Bye'], 'goodbye'), (['See', 'you', 'later'], 'goodbye'), (['Goodbye'], 'goodbye'), (['Nice', 'chatting', 'to', 'you', ',', 'bye'], 'goodbye'), (['Till', 'next', 'time'], 'goodbye'), (['Thanks'], 'thanks'), (['Thank', 'you'], 'thanks'), (['That', "'s", 'helpful'], 'thanks'), (['Awesome', ',', 'thanks'], 'thanks'), (['Thanks', 'for', 'helping', 'me'], 'thanks'), (['How', 'you', 'could', 'help', 'me', '?'], 'options'), (['What', 'you', 'can', 'do', '?'], 'options'), (['What', 'help', 'you', 'provide', '?'], 'options'), (['How', 'you', 'can', 'be', 'helpful', '?'], 'options'), (['What', 'support', 'is', 'offered'], 'options'), (['How', 'to', 'check', 'Adverse', 'drug', 'reaction', '?'], 'adverse_drug'), (['Open', 'adverse', 'drugs', 'module'], 'adverse_drug'), (['Give', 'm

**Explanation:**
- The code processes a dataset of intents, where each intent represents a category of user input (e.g., "greeting," "goodbye," "thanks").
- It loops through each intent within the 'intents' dataset.
- Inside each intent, it iterates through the 'patterns,' which are examples of user queries or sentences associated with that intent.
- For each pattern, it tokenizes the text, which means it breaks down the sentence into individual words. This is important for understanding and analyzing the text because it separates words from one another.
- The tokenized words are then added to the 'words' list, which accumulates all the words from all the patterns and intents.
- Additionally, the code creates a 'documents' list. This list stores the tokenized words along with their associated intent tags. This is useful for training a machine learning model to understand which intents are associated with which words.
- The 'classes' list keeps track of all the intent tags encountered. If a tag is not already in the list, it is added.
- Together, these lists give the text the structure needed for training: 'words' becomes the vocabulary, 'classes' the set of labels, and 'documents' the labelled examples.
- The final 'print' statement displays the 'documents' list so that the (tokens, tag) pairs can be inspected before further processing.
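One quick check that the 'documents' structure makes possible is counting how many patterns each intent has, since very uneven counts can bias a classifier. The sketch below uses a handful of (tokens, tag) pairs copied from the printed output; the variable names are chosen for illustration only.

```python
from collections import Counter

# A few (tokens, tag) pairs copied from the notebook's printed documents.
documents_demo = [
    (["Hi", "there"], "greeting"),
    (["How", "are", "you"], "greeting"),
    (["Bye"], "goodbye"),
    (["Thanks"], "thanks"),
    (["Thank", "you"], "thanks"),
]

# Count how many patterns belong to each intent tag.
patterns_per_tag = Counter(tag for _, tag in documents_demo)
print(patterns_per_tag)
```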

# $$ Lemmatization $$
 - We apply lemmatization to every word in our dataset so that different forms of a word are treated as one vocabulary entry; duplicates are then removed by converting the list to a set. This keeps the vocabulary small and helps the model handle a wide range of user inputs.
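The effect of this normalization can be seen on a toy example. The lookup table below is only a stand-in for `WordNetLemmatizer`, written by hand so that the sketch runs without downloading WordNet; the token list is likewise invented.

```python
# Toy stand-in for WordNetLemmatizer: a hand-written lookup, not WordNet data.
TOY_LEMMAS = {"drugs": "drug", "hospitals": "hospital"}

def toy_lemmatize(word):
    """Return the base form if the toy table knows it, else the word itself."""
    return TOY_LEMMAS.get(word, word)

raw_tokens = ["Drugs", "drug", "Hospital", "hospitals", "drug", "?"]
ignore = ["!", "?", ",", "."]

# Lowercase, lemmatize, skip punctuation, then drop duplicates.
normalized = [toy_lemmatize(t.lower()) for t in raw_tokens if t not in ignore]
vocabulary = sorted(set(normalized))
print(vocabulary)  # ['drug', 'hospital']
```

Six raw tokens collapse into two vocabulary entries, which is the same reduction the notebook applies to the full word list.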

In [6]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to C:\Users\Usman
[nltk_data]     Ghias\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
# Lemmatize and lowercase each word, skip punctuation, and remove duplicates
words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_letters]
words = sorted(set(words))

# Sort classes
classes = sorted(set(classes))

# documents = combinations of patterns and intents
print(len(documents), "documents")

# classes = intents
print(len(classes), "classes", classes)

# words = all words, vocabulary
print(len(words), "unique lemmatized words", words)

# Save the vocabulary and class list; the context managers close the files
with open('words.pkl', 'wb') as f:
    pickle.dump(words, f)
with open('classes.pkl', 'wb') as f:
    pickle.dump(classes, f)

47 documents
9 classes ['adverse_drug', 'blood_pressure', 'blood_pressure_search', 'goodbye', 'greeting', 'hospital_search', 'options', 'pharmacy_search', 'thanks']
87 unique lemmatized words ["'s", 'a', 'adverse', 'all', 'anyone', 'are', 'awesome', 'be', 'behavior', 'blood', 'by', 'bye', 'can', 'causing', 'chatting', 'check', 'could', 'data', 'day', 'detail', 'do', 'dont', 'drug', 'entry', 'find', 'for', 'give', 'good', 'goodbye', 'have', 'hello', 'help', 'helpful', 'helping', 'hey', 'hi', 'history', 'hola', 'hospital', 'how', 'i', 'id', 'is', 'later', 'list', 'load', 'locate', 'log', 'looking', 'lookup', 'management', 'me', 'module', 'nearby', 'next', 'nice', 'of', 'offered', 'open', 'patient', 'pharmacy', 'pressure', 'provide', 'reaction', 'related', 'result', 'search', 'searching', 'see', 'show', 'suitable', 'support', 'task', 'thank', 'thanks', 'that', 'there', 'till', 'time', 'to', 'transfer', 'up', 'want', 'what', 'which', 'with', 'you']


### This code snippet demonstrates the following:

- Lowercases and lemmatizes each word, skips the punctuation listed in `ignore_letters`, and removes duplicates, leaving a sorted vocabulary.
- Sorts the unique intent classes.
- Prints the number of documents (pattern and intent pairs), intent classes, and unique lemmatized words.
- Serializes and saves the processed words and intent classes to files using Pickle for future use, such as training a chatbot or natural language processing model.

The purpose of this code is to finalize the preprocessing of textual data, ensuring that words are in their lemmatized and lowercase form while saving these processed words and intent classes for future use in machine learning tasks.
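When the chatbot is used later, the same vocabulary and class list have to be read back so that new input is encoded exactly as it was during training. A minimal round-trip sketch follows; it writes to a temporary directory and uses small made-up lists rather than the notebook's real `words.pkl` and `classes.pkl`.

```python
import os
import pickle
import tempfile

# Small made-up lists standing in for the notebook's vocabulary and classes.
words_demo = ["bye", "hello", "thanks"]
classes_demo = ["goodbye", "greeting", "thanks"]

with tempfile.TemporaryDirectory() as tmp:
    words_path = os.path.join(tmp, "words.pkl")
    classes_path = os.path.join(tmp, "classes.pkl")

    # Save both lists, as the preprocessing cell does.
    with open(words_path, "wb") as f:
        pickle.dump(words_demo, f)
    with open(classes_path, "wb") as f:
        pickle.dump(classes_demo, f)

    # Load them back, as an inference script would.
    with open(words_path, "rb") as f:
        loaded_words = pickle.load(f)
    with open(classes_path, "rb") as f:
        loaded_classes = pickle.load(f)

print(loaded_words == words_demo, loaded_classes == classes_demo)
```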

# Create Training Data

In [8]:
import numpy as np

# Create the training data
training = []
output_empty = [0] * len(classes)

for doc in documents:
    # Lemmatize and lowercase the tokens of this pattern
    word_patterns = [lemmatizer.lemmatize(word.lower()) for word in doc[0]]
    # Bag of words: 1 if the vocabulary word occurs in the pattern, else 0
    bag = [1 if word in word_patterns else 0 for word in words]

    # One-hot vector marking the pattern's intent
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    training.append([bag, output_row])

# Shuffle the training data
random.shuffle(training)

# Separate the bag and output rows into NumPy arrays
train_x = np.array([item[0] for item in training])
train_y = np.array([item[1] for item in training])

print("Training data is created")


Training data is created
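The cell above turns every pattern into a bag-of-words vector and every intent tag into a one-hot vector. The worked example below shows both encodings on a four-word vocabulary. The vocabulary, labels, and the `encode` helper are invented for this sketch, and lemmatization is left out (only lowercasing is applied) to keep it dependency-free.

```python
# Invented miniature vocabulary and label list, both sorted as in the notebook.
vocab = ["bye", "hello", "thanks", "you"]
labels = ["goodbye", "greeting", "thanks"]

def encode(tokens, tag):
    """Return (bag-of-words vector, one-hot label) for one tokenized pattern."""
    token_set = {t.lower() for t in tokens}
    bag = [1 if w in token_set else 0 for w in vocab]
    one_hot = [0] * len(labels)
    one_hot[labels.index(tag)] = 1
    return bag, one_hot

bag, one_hot = encode(["Hello", "you"], "greeting")
print(bag)      # [0, 1, 0, 1]
print(one_hot)  # [0, 1, 0]
```

Each position in `bag` corresponds to one vocabulary word, so the vector length always equals the vocabulary size regardless of how long the sentence is.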


# Training the Model

In [9]:
# Deep neural network model
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))

# Compile the model with SGD using Nesterov momentum
optimizer = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

# Train the model; train_x and train_y are already NumPy arrays
hist = model.fit(train_x, train_y, epochs=200, batch_size=5, verbose=1)

# Save the model (the second positional argument of save() is `overwrite`, not the history)
model.save('chatbot_model.h5')

print("Model is created")


Epoch 1/200
Epoch 2/200
...


**Model Architecture and Training:**

Our model is designed as a neural network consisting of three dense layers. Each layer serves a specific purpose in understanding and classifying user inputs:

1. **First Layer (128 Neurons):** This layer receives the bag-of-words vector, which has one entry per vocabulary word, and applies a ReLU activation. With 128 neurons, it can capture a variety of word combinations.

2. **Second Layer (64 Neurons):** The second layer, also with ReLU, compresses the 128 features into 64, giving a more compact representation for the classifier.

3. **Last Layer (Number of Classes Neurons):** The final layer has one neuron per intent class and uses softmax, so its outputs form a probability distribution over the intents. The intent with the highest probability is the model's prediction.
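To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes the 87-word vocabulary and 9 classes reported in the preprocessing output; `dense_params` is a helper written for this example.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input size, two hidden layers, output layer (87 words and 9 classes assumed).
layer_sizes = [87, 128, 64, 9]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [11264, 8256, 585] 20105
```

Dropout layers add no parameters, so roughly twenty thousand weights are being fitted to 47 training patterns, which is one reason regularization matters here.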

**Dropout Layers:**

To reduce overfitting, a dropout layer follows each of the first two dense layers. During training, each one randomly deactivates half of its inputs (rate 0.5) at every step, so the model cannot rely on any single neuron. Dropout is switched off at prediction time. This promotes better generalization to unseen inputs.
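The mechanism can be sketched in a few lines of plain Python. This illustrates the usual "inverted dropout" scheme, in which surviving activations are scaled by `1 / (1 - rate)` during training so their expected sum stays the same; it is not Keras' actual implementation, and the function name is made up.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale the rest up."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 8
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
```

With a rate of 0.5, every output is either 0.0 or 2.0: dropped units contribute nothing, and kept units are doubled to compensate.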

**Optimizer:**

We use the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.01, a momentum of 0.9, and Nesterov momentum enabled. SGD updates the model weights after each batch; momentum carries over part of the previous update, which typically smooths the updates and speeds up convergence.
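For readers who want to see what one update looks like, the sketch below applies the momentum update rule, as described in the Keras SGD documentation, to a single scalar weight. It minimizes the toy function f(w) = w², which is chosen purely for illustration, and `sgd_step` is a hypothetical helper rather than a Keras function.

```python
def sgd_step(w, velocity, grad, learning_rate=0.01, momentum=0.9, nesterov=True):
    """One SGD-with-momentum update for a single scalar weight."""
    velocity = momentum * velocity - learning_rate * grad
    if nesterov:
        # Nesterov: step using the look-ahead velocity plus the plain gradient step
        w = w + momentum * velocity - learning_rate * grad
    else:
        w = w + velocity
    return w, velocity

# Minimize f(w) = w**2, whose gradient is 2 * w, starting from w = 5.
w, velocity = 5.0, 0.0
for _ in range(200):
    grad = 2.0 * w
    w, velocity = sgd_step(w, velocity, grad)

print(w)  # close to the minimum at 0
```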

**Training:**

We feed the bag-of-words vectors and one-hot labels into the model and train it. The model is trained for 200 epochs with a batch size of 5, meaning it passes through the entire dataset 200 times. Because no validation data is held out, the accuracy reported during training only measures how well the model fits the training patterns.
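The number of weight updates this implies is easy to work out. The arithmetic below assumes the 47 documents reported in the preprocessing output and the batch size of 5 used in the training cell.

```python
import math

n_samples, batch_size, epochs = 47, 5, 200

# The last batch of each epoch is smaller (2 samples), but still counts as a step.
steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)  # 10 2000
```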

**Model Saving:**

Once training is complete, we save the model with Keras' `model.save("chatbot_model.h5")`, which writes the architecture and learned weights to an HDF5 file. The saved model can be loaded later for chatbot development or other natural language processing tasks without retraining from scratch. The `words.pkl` and `classes.pkl` files saved earlier are needed alongside it to encode new input and to map predictions back to intent tags.