## Classifying utterances using Sklearn's SVM LinearSVC

### The Scenario - A Salon Booking Virtual Assistant

Let's suppose that we are tasked with building a Salon Virtual Assistant (VA). One of our first requirements is that the Virtual Assistant can classify customer utterances into specific intents. Take, for example, a customer asking whether he can book an appointment (see the dialogue below):

**Customer**: Hello! Good Morning

**VA**: intent: greeting

**Customer**: Do you do haircuts?

**VA**: intent: ask_about_service

**Customer**: I wanna book a haircut

**VA**: intent: book_appointment

In the dialogue above, the VA predicts an intent for each customer utterance.
To deliver this experience, we'll use Sklearn's `LinearSVC` (a linear support vector classifier) to build an utterance classifier.



### Imports

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
import pickle

**Here, we import the following:**

**pandas** - Responsible for reading and converting the `classification_data.json` file to a data frame that will be used as input to train the model.

**train_test_split** - Used to split the data (the utterances and their respective intents from `classification_data.json`) into a training set and a testing set.

**CountVectorizer** - Responsible for creating a numerical matrix with one row per sentence and one column per vocabulary word, where each entry counts how often that word appears in that sentence.

-----------

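Before we can learn how `TfidfTransformer` works, let's first understand what `tfidf` is. `tfidf` (Term Frequency Inverse Document Frequency) gives each word in each sentence a weight that grows with how many times the word appears in that sentence (term frequency) and shrinks with how many sentences in the data set contain the word (inverse document frequency). Words that appear in almost every sentence therefore get low weights, while words that are distinctive for a sentence get high weights.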

**TfidfTransformer** - Takes the `CountVectorizer` matrix (described above) as input and creates a `tfidf` matrix out of it.
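
To make the two matrices concrete, here is a minimal sketch on three toy utterances (made up for illustration, not taken from `classification_data.json`). It assumes scikit-learn's default settings, under which `CountVectorizer` lowercases text and ignores single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Three toy utterances (illustrative only, not the training file).
utterances = [
    "book a haircut",
    "book a manicure",
    "when are you open",
]

# Step 1: word counts. One row per utterance, one column per vocabulary word.
count_vect = CountVectorizer()
counts = count_vect.fit_transform(utterances)

# vocabulary_ maps each word to its column index in the matrix.
vocabulary = sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get)
print(vocabulary)
print(counts.toarray())

# Step 2: re-weight the counts with tf-idf.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(2))

# Compare the weights of two words inside the first utterance.
book_col = count_vect.vocabulary_["book"]
haircut_col = count_vect.vocabulary_["haircut"]
book_weight = tfidf[0, book_col]
haircut_weight = tfidf[0, haircut_col]
print(book_weight, haircut_weight)
```

In the first row, `haircut` receives a larger weight than `book`, because `book` also appears in the second utterance and is therefore less distinctive.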

**LinearSVC** - This is the classification model we will be training to make classification predictions.

**pickle** - We use `pickle` to save the fitted `CountVectorizer` (an instance of the class we imported from `sklearn.feature_extraction.text`) to disk so that we can load and use it later. The vectorizer learns its vocabulary from the training data, so at prediction time we need the same fitted object to map new utterances to the columns the model was trained on. Loading it from disk also saves us from refitting it on the training data every time we make a classification prediction.
The way this works is that the `pickle` module takes the fitted `CountVectorizer` object and converts it into a byte stream, which is later used to reconstruct the `CountVectorizer` when we need it.
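
As a minimal sketch of that round trip, the snippet below pickles a fitted `CountVectorizer` to an in-memory byte string and restores it. The two sample sentences are made up for illustration; `pickle.dump` and `pickle.load` do the same thing with a file on disk.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer

# Fit a vectorizer on two illustrative sentences.
count_vect = CountVectorizer().fit(["book a haircut", "when are you open"])

# Serialize the fitted object to bytes, then rebuild it from those bytes.
payload = pickle.dumps(count_vect)
restored = pickle.loads(payload)

# The restored vectorizer carries the same learned vocabulary, so it maps
# new text to the same columns as the original one.
same_vocabulary = restored.vocabulary_ == count_vect.vocabulary_
original_row = count_vect.transform(["book haircut"]).toarray()
restored_row = restored.transform(["book haircut"]).toarray()
same_output = bool((original_row == restored_row).all())
print(same_vocabulary, same_output)
```

Unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.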

## The training and prediction process

### Training data (what the contents of classification_data.json look like)

In [2]:
[
    {
        "intent": "greeting",
        "utterance": "whats up"
    },
    {
        "intent": "book_appointment",
        "utterance": "I am looking to book a haircut"
    },
    {
        "intent": "book_appointment",
        "utterance": "help me to set up a haircut with usual stylist"
    },
    {
        "intent": "book_appointment",
        "utterance": "is tomorrow at 10am ok too book a simple manicure"
    },
    {
        "intent": "business_hours",
        "utterance": "do you open for business on weekends"
    },
    {
        "intent": "business_hours",
        "utterance": "when are you closing?"
    },
    {
        "intent": "business_hours",
        "utterance": "when are you open for business"
    }
]

[{'intent': 'greeting', 'utterance': 'whats up'},
 {'intent': 'book_appointment', 'utterance': 'I am looking to book a haircut'},
 {'intent': 'book_appointment',
  'utterance': 'help me to set up a haircut with usual stylist'},
 {'intent': 'book_appointment',
  'utterance': 'is tomorrow at 10am ok too book a simple manicure'},
 {'intent': 'business_hours',
  'utterance': 'do you open for business on weekends'},
 {'intent': 'business_hours', 'utterance': 'when are you closing?'},
 {'intent': 'business_hours', 'utterance': 'when are you open for business'}]
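
To see how records in this shape turn into a data frame, here is a minimal sketch that reads three of the records above from an in-memory string instead of the file. The `io.StringIO` wrapper is only there to keep the example self-contained; with the real file, you would pass the file path to `pd.read_json`.

```python
import io

import pandas as pd

# A small excerpt of the training data, held in memory instead of on disk.
raw_json = io.StringIO("""
[
    {"intent": "greeting", "utterance": "whats up"},
    {"intent": "book_appointment", "utterance": "I am looking to book a haircut"},
    {"intent": "business_hours", "utterance": "when are you closing?"}
]
""")

# Each JSON object becomes one row; each key becomes one column.
df = pd.read_json(raw_json)
print(df)

# Count how many utterances each intent has.
print(df["intent"].value_counts())
```

Checking the number of utterances per intent before training is a quick way to spot intents with very few examples, such as `greeting` in the data above.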

### Create the train_linear_svc_model() function

This function prepares the training data, trains the classification model, and saves the model to disk. It also saves the fitted `CountVectorizer` to disk.

In [3]:
from sklearn.pipeline import make_pipeline

def train_linear_svc_model(classification_training_data_json):
    try:
        # Create a pandas DataFrame from the raw training data (JSON file).
        df = pd.read_json(classification_training_data_json)

        # Create the train/test split (the test split is not used in this function).
        X_train, X_test, y_train, y_test = train_test_split(df['utterance'], df['intent'], random_state=0)

        # Prepare the training data: turn each utterance into word counts.
        count_vect = CountVectorizer()
        X_train_counts = count_vect.fit_transform(X_train)

        # Train the model on the training data. The tf-idf step is chained with
        # LinearSVC so that the saved model applies the same tf-idf weighting
        # to the word counts it receives at prediction time.
        model = make_pipeline(TfidfTransformer(), LinearSVC()).fit(X_train_counts, y_train)

        # Saving the vectorizer to disk for later use.
        clf_model_vectorizer_pickle_file = 'model_vectorizer.pickle'
        with open(clf_model_vectorizer_pickle_file, 'wb') as vectorizer_file:
            pickle.dump(count_vect, vectorizer_file)

        # Saving the model to disk for later use.
        clf_model_file = 'classification.model'
        with open(clf_model_file, 'wb') as model_file:
            pickle.dump(model, model_file)

        print('------LinearSVC CLASSIFICATION TRAINING COMPLETED------')
        print("Vectorizer model saved to the current directory")
        print("Classification model saved to the current directory")
    except Exception as error:
        # Report the cause of the failure instead of hiding it.
        print(f'Error training classification model: {error}')


### Run the train_linear_svc_model() function
This block of code runs the function we created above to train the **LinearSVC** classifier model.

In [4]:
# Reference to the classification_data.json file.
classification_training_data_json = "classification_data.json"

# Use the classification_data.json file to train the model.
train_linear_svc_model(classification_training_data_json)

------LinearSVC CLASSIFICATION TRAINING COMPLETED------
Vectorizer model saved to the current directory
Classification model saved to the current directory
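
The training function creates `X_test` and `y_test` but never uses them. The following is a minimal sketch of how the held-out split could be scored. It uses a small inline data set of hypothetical utterances (not the contents of `classification_data.json`), and the `stratify` argument is an addition that keeps every intent represented in both splits.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hypothetical utterances, for illustration only.
df = pd.DataFrame({
    "utterance": [
        "hello there", "hi good morning", "hey whats up", "good evening",
        "book a haircut", "i want to book a manicure",
        "schedule a haircut for me", "can i book an appointment",
        "when are you open", "what time do you close",
        "are you open on weekends", "what are your opening hours",
    ],
    "intent": ["greeting"] * 4 + ["book_appointment"] * 4 + ["business_hours"] * 4,
})

X_train, X_test, y_train, y_test = train_test_split(
    df["utterance"], df["intent"], random_state=0, stratify=df["intent"]
)

# Fit the vectorizer and the tf-idf transformer on the training split only.
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(count_vect.fit_transform(X_train))
model = LinearSVC().fit(X_train_tfidf, y_train)

# Reuse the fitted objects on the test split: transform(), not fit_transform().
X_test_tfidf = tfidf_transformer.transform(count_vect.transform(X_test))
accuracy = accuracy_score(y_test, model.predict(X_test_tfidf))
print(f"held-out accuracy: {accuracy:.2f}")
```

With only a handful of utterances per intent, the score is a rough sanity check rather than a reliable estimate. A larger data set gives a more meaningful number.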


### Create the classify_intent() function
This function takes the **utterance**, the **vectorizer_file** and the **model_file** as parameters, makes a classification prediction, and returns the intent.

In [5]:
def classify_intent(utterance, vectorizer_file, model_file):
    # Load the vectorizer (the file is closed automatically afterwards).
    with open(vectorizer_file, 'rb') as f:
        loaded_model_vectorizer = pickle.load(f)

    # Load the model.
    with open(model_file, 'rb') as f:
        loaded_model = pickle.load(f)

    # Make a prediction.
    raw_intent = loaded_model.predict(loaded_model_vectorizer.transform([utterance]))
    intent = str(raw_intent[0])

    return intent

### Making a prediction
Here we call the `classify_intent()` function that we created above to make a prediction for a given utterance.

In [6]:
# Paths to the model and vectorizer files saved on disk.
model_vectorizer_file = "model_vectorizer.pickle"
model_file = "classification.model"
utterance = "hey whats up Virtual Assistant"

intent = classify_intent(utterance, model_vectorizer_file, model_file)

response = {
    "intent": intent,
    "utterance": utterance
}

print(response)

{'intent': 'greeting', 'utterance': 'hey whats up Virtual Assistant'}
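
`predict()` returns only the winning intent. When it is useful to see how the other intents scored, `LinearSVC` also exposes `decision_function()`. The sketch below trains directly on the seven utterances shown earlier (without a train/test split or saved files, to stay self-contained) and prints one score per intent. The scores are signed margins, not probabilities.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# The seven training utterances from classification_data.json.
utterances = [
    "whats up",
    "I am looking to book a haircut",
    "help me to set up a haircut with usual stylist",
    "is tomorrow at 10am ok too book a simple manicure",
    "do you open for business on weekends",
    "when are you closing?",
    "when are you open for business",
]
intents = ["greeting"] + ["book_appointment"] * 3 + ["business_hours"] * 3

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
features = tfidf_transformer.fit_transform(count_vect.fit_transform(utterances))
model = LinearSVC().fit(features, intents)

query = "hey whats up Virtual Assistant"
query_features = tfidf_transformer.transform(count_vect.transform([query]))

# With three or more classes, decision_function returns one margin per class,
# in the same order as model.classes_.
scores = model.decision_function(query_features)[0]
ranked = sorted(zip(model.classes_, scores), key=lambda pair: pair[1], reverse=True)
for intent, score in ranked:
    print(f"{intent}: {score:.3f}")

predicted = model.predict(query_features)[0]
```

The intent with the highest score is the one `predict()` returns. A small gap between the top two scores can serve as a hint that the VA should ask the customer to rephrase.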
