<a href="https://colab.research.google.com/github/xCosmicx/ATA/blob/main/week14/intent_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intent Recognition

In this practical, we will learn how to apply the Hugging Face Transformers library to an Intent Recognition task for our chatbot.

#### **NOTE: Be sure to set your runtime to a GPU instance!**

## Install the Hugging Face Transformers Library

Run the following cell to install the Transformers library.

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninsta

## Getting and Preparing the Data

In [2]:
import pandas as pd

data_url = 'https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/datasets/airchat_intents.csv'
df = pd.read_csv(data_url)

df.head()

Unnamed: 0,Label,Text
0,atis_abbreviation,what is fare code h
1,atis_abbreviation,what is booking class c
2,atis_abbreviation,what does fare code q mean
3,atis_abbreviation,what is fare code qw
4,atis_abbreviation,what does the fare code f mean


We notice that there are two columns, 'Label' and 'Text'. Let's examine which labels we have and how many samples there are for each label.

In [3]:
df['Label'].value_counts()

atis_flight                                 3666
atis_airfare                                 423
atis_ground_service                          255
atis_airline                                 157
atis_abbreviation                            147
atis_yes                                      82
atis_aircraft                                 81
atis_no                                       67
atis_flight_time                              54
atis_greeting                                 53
atis_quantity                                 51
atis_flight#atis_airfare                      21
atis_distance                                 20
atis_airport                                  20
atis_city                                     19
atis_ground_fare                              18
atis_capacity                                 16
atis_flight_no                                12
atis_meal                                      6
atis_restriction                               6
atis_airline#atis_fl

We can see that some labels have very few samples, such as 'atis_meal', 'atis_airline#atis_flight_no', 'atis_cheapest', and so on. With so few samples, our model will have difficulty learning any meaningful pattern from them. We will group these labels into a new label called 'others'.
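One way to decide which labels to keep is to apply a minimum-sample cut-off to the counts above. The sketch below is only an illustration: the counts are copied from the visible part of the output, and `MIN_SAMPLES = 50` is an assumed threshold, not something this practical defines.

```python
# Counts copied from the visible part of the value_counts() output above.
label_counts = {
    "atis_flight": 3666,
    "atis_airfare": 423,
    "atis_ground_service": 255,
    "atis_airline": 157,
    "atis_abbreviation": 147,
    "atis_yes": 82,
    "atis_aircraft": 81,
    "atis_no": 67,
    "atis_flight_time": 54,
    "atis_greeting": 53,
    "atis_quantity": 51,
    "atis_flight#atis_airfare": 21,
    "atis_distance": 20,
    "atis_airport": 20,
    "atis_city": 19,
    "atis_ground_fare": 18,
    "atis_capacity": 16,
    "atis_flight_no": 12,
    "atis_meal": 6,
    "atis_restriction": 6,
}

MIN_SAMPLES = 50  # assumed cut-off, chosen for illustration only

# Labels with enough samples are kept; the rest would be grouped together.
kept_labels = sorted(l for l, n in label_counts.items() if n >= MIN_SAMPLES)
rare_labels = sorted(l for l, n in label_counts.items() if n < MIN_SAMPLES)

print("keep:", kept_labels)
print("group into 'others':", rare_labels)
```

With this cut-off, 11 labels are kept, and they are the same ones used in the label list of the next section (the ordering there is different, so the numeric ids would not match this sorted list).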

---



### Re-define our Classification Labels

Here we define the labels we are interested in classifying. They are taken from the original labels, plus a new label called 'others'.
 

In [4]:
# Create a list of unique labels that we will recognize.
#
sentence_labels = [
              "others",
              "atis_abbreviation",
              "atis_aircraft",
              "atis_airfare",
              "atis_airline",
              "atis_flight",
              "atis_flight_time",
              "atis_greeting",
              "atis_ground_service",
              "atis_quantity",
              "atis_yes",
              "atis_no"]

# This creates a reverse mapping dictionary of "label" -> index.
# 
sentence_labels_id_by_label = {label: i for i, label in enumerate(sentence_labels)}

Now we will map the original labels to the smaller set we specified in the cell above, and at the same time convert the text labels into numeric labels (e.g. others->0, atis_abbreviation->1, etc.). We can use the dataframe's `map()` function to do that. We define a lambda function that does the mapping: any label that is not in our list is assigned 0, i.e. 'others'.

In [5]:
df['Label'] = df['Label'].map(lambda label: 
                              sentence_labels_id_by_label[label] 
                              if label in sentence_labels_id_by_label 
                              else 0)
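The lambda above behaves like a dictionary lookup with a default value. Here is a minimal sketch in plain Python; the shortened label list and the sample labels are for illustration only.

```python
# Shortened label list for illustration; the real list has 12 entries.
labels = ["others", "atis_abbreviation", "atis_aircraft"]
label_to_id = {label: i for i, label in enumerate(labels)}

raw_labels = ["atis_aircraft", "atis_meal", "atis_abbreviation", "atis_cheapest"]

# dict.get returns 0 ("others") for any label that is not in the mapping.
mapped = [label_to_id.get(label, 0) for label in raw_labels]
print(mapped)  # [2, 0, 1, 0]
```

On the dataframe, `df['Label'].map(lambda label: sentence_labels_id_by_label.get(label, 0))` should apply the same rule to every row.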

In [6]:
# examine a few random samples 
df.sample(10)

Unnamed: 0,Label,Text
4368,5,what flights are there tuesday morning from d...
2383,5,i need a flight from kansas city to chicago n...
2428,5,flights from denver to philadelphia include f...
246,3,list the fares of us air flights from boston ...
2888,5,what afternoon flights are available from den...
852,0,where is general mitchell international located
4955,8,ground transportation please in the city of b...
4814,8,please give me ground transportation informat...
2472,5,i'm interested in a flight from dallas to was...
2564,5,list all afternoon flights on united airlines...


### Split Our Data

We will now separate the texts and labels into `all_texts` and `all_labels`, and split the dataset into a training set and a validation set. We do a stratified split so that each label appears in roughly the same proportion in both sets.

In [7]:
all_texts = df['Text']
all_labels = df['Label']

In [8]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(all_texts, 
                                                                    all_labels, 
                                                                    test_size=0.2, 
                                                                    stratify=all_labels)
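To make the idea of stratification concrete, here is a hand-rolled sketch that uses only the standard library. It is not how scikit-learn implements `stratify`; the function name `stratified_split` and the toy labels are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, val_idx) so each label keeps roughly the same share in both."""
    # Group sample indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    # Split each label group separately, then merge the pieces.
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy, imbalanced label list: 70% / 20% / 10%.
toy_labels = [5] * 70 + [3] * 20 + [0] * 10
train_idx, val_idx = stratified_split(toy_labels)

train_counts = Counter(toy_labels[i] for i in train_idx)
val_counts = Counter(toy_labels[i] for i in val_idx)
print(train_counts, val_counts)
```

Both halves keep the 70/20/10 ratio of the toy labels. `train_test_split` also accepts a `random_state` argument if you want to get the same split on every run.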

In [9]:
train_labels.value_counts()/len(train_labels)

5     0.707770
3     0.081564
8     0.049228
4     0.030405
1     0.028475
0     0.027751
10    0.015927
2     0.015685
11    0.012790
6     0.010376
7     0.010135
9     0.009894
Name: Label, dtype: float64

In [10]:
val_labels.value_counts()/len(val_labels)

5     0.707529
3     0.082046
8     0.049228
4     0.029923
1     0.027992
0     0.027992
10    0.015444
2     0.015444
11    0.013514
6     0.010618
7     0.010618
9     0.009653
Name: Label, dtype: float64

### Tokenize the Text

Before we can use the texts for classification, we need to tokenize them. We will use the tokenizer of the pretrained model 'distilbert-base-uncased', since that is the model we will be fine-tuning.


In [11]:
len(sentence_labels)

12

In [12]:
## Before we can feed the texts to the tokenizer, we need to convert them into a list of strings instead of 
## a pandas Series. We can do this by using to_list(). 

train_texts = train_texts.to_list()
train_labels = train_labels.to_list()
val_texts = val_texts.to_list()
val_labels = val_labels.to_list()

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')


In [14]:
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
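With `padding=True`, the tokenizer pads every sequence in the call to the length of the longest one and returns an `attention_mask` that marks real tokens with 1 and padding with 0. `truncation=True` cuts off sequences that exceed the model's maximum input length. The sketch below mimics the padding behaviour in plain Python; the token ids are made up, and `pad_batch` is a hypothetical helper, not part of the Transformers API.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids standing in for two tokenized sentences of different length.
batch = pad_batch([[101, 2054, 2003, 102], [101, 7592, 102]])
print(batch["input_ids"])       # [[101, 2054, 2003, 102], [101, 7592, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Because the training and validation texts are tokenized in two separate calls, each set is padded to its own longest sequence.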

Once we have the encodings, we create a TensorFlow dataset that is ready to be used to train our model. Since the Hugging Face pretrained model (the TensorFlow version) is a Keras model, it can consume a `tf.data` dataset directly.

In [15]:
import tensorflow as tf

batch_size = 16

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(batch_size)

## Train Your Sentence Classification Model

Run the following cell to download the pretrained "distilbert-base-uncased" model. We will then fine-tune it using the dataset that we prepared above.

In [16]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",num_labels=len(sentence_labels))


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_projector', 'vocab_layer_norm', 'activation_13', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'dropout_19', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

As in the previous lab, we start with a small learning rate of 5e-5 (0.00005) and gradually reduce it to 0 over the course of training.

In [17]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay

num_epochs = 2

# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Since our dataset is already batched, we can simply take the len.
num_train_steps = len(train_dataset) * num_epochs

lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)
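With its default `power=1.0`, `PolynomialDecay` lowers the learning rate linearly from `initial_learning_rate` to `end_learning_rate` over `decay_steps`. The plain-Python sketch below reproduces that formula so you can see the values. The figure of 4,144 training samples is an assumption, inferred from the 1,036 validation samples being 20% of the data; `linear_decay` is a hypothetical helper.

```python
import math

def linear_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=518):
    """Learning rate at a given step for a linear (power=1) polynomial decay."""
    step = min(step, decay_steps)  # the rate stays at end_lr after decay_steps
    return (initial_lr - end_lr) * (1 - step / decay_steps) + end_lr

num_samples = 4144  # assumed training-set size (see the note above)
batch_size = 16
num_epochs = 2

steps_per_epoch = math.ceil(num_samples / batch_size)  # number of batches per epoch
num_train_steps = steps_per_epoch * num_epochs

# Learning rate at the start, after one epoch, and at the end of training.
for step in (0, steps_per_epoch, num_train_steps):
    print(step, linear_decay(step, decay_steps=num_train_steps))
```

Under these assumptions there are 259 batches per epoch and 518 steps in total, so the learning rate is halved after the first epoch and reaches 0 at the last step.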

In [18]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

opt = Adam(learning_rate=lr_scheduler)

# The model outputs raw logits and our labels are integer ids, so we use
# sparse categorical cross-entropy with from_logits=True.
loss = SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])

model.fit(train_dataset, validation_data=val_dataset, epochs=num_epochs)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f72193ef950>

### Evaluating the Model

Run the following code to evaluate our model on the entire validation set.

We also print the classification report to see how the model performs for each label. Note that labels with fewer samples typically have lower F1-scores.


In [19]:
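# val_dataset is already batched, so we do not pass batch_size here.
output = model.predict(val_dataset)
# Convert logits to probabilities, then pick the most likely label for each sample.
pred_probs = tf.nn.softmax(output.logits, axis=-1)
preds = tf.argmax(pred_probs, axis=-1)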
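The model returns raw logits, which is also why the loss was created with `from_logits=True`. Softmax turns the logits into probabilities, and because it preserves the ordering of its inputs, the argmax of the probabilities is the same as the argmax of the logits. A small sketch in plain Python with made-up logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.5, 2.0, -1.0]  # made-up logits for three classes
probs = softmax(logits)

# The predicted class is the same whether we look at probabilities or logits.
pred_from_probs = probs.index(max(probs))
pred_from_logits = logits.index(max(logits))
print(probs, pred_from_probs, pred_from_logits)
```

The softmax step is therefore only needed if you want the probabilities themselves, for example to report how confident the model is.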

In [20]:
# Collect the true labels from the batched dataset, in the same order as the predictions.
val_labels = []
for _, labels in val_dataset.as_numpy_iterator():
    val_labels.extend(labels)

In [21]:
from sklearn.metrics import classification_report

print(classification_report(val_labels, preds))

              precision    recall  f1-score   support

           0       0.95      0.62      0.75        29
           1       0.85      0.97      0.90        29
           2       1.00      1.00      1.00        16
           3       0.95      0.98      0.97        85
           4       1.00      1.00      1.00        31
           5       0.99      0.99      0.99       733
           6       0.91      0.91      0.91        11
           7       0.91      0.91      0.91        11
           8       0.98      0.98      0.98        51
           9       0.77      1.00      0.87        10
          10       1.00      0.88      0.93        16
          11       0.88      1.00      0.93        14

    accuracy                           0.98      1036
   macro avg       0.93      0.94      0.93      1036
weighted avg       0.98      0.98      0.98      1036
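The f1-score column is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it for labels 0 and 9 from the rounded precision and recall values in the report (the helper name `harmonic_f1` is made up for this example).

```python
def harmonic_f1(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the report above.
f1_others = round(harmonic_f1(0.95, 0.62), 2)    # label 0 ("others")
f1_quantity = round(harmonic_f1(0.77, 1.00), 2)  # label 9 ("atis_quantity")
print(f1_others, f1_quantity)  # 0.75 0.87
```

Label 0 ('others') has high precision but a recall of only 0.62, which means a sizeable share of the grouped rare intents is being predicted as one of the other labels.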



### Saving the Model

When your training has completed, run the following cell to save your model.

Remember to download the model from Google Colab if you want to use it later.

In [22]:
# Save the model

model.save_pretrained("intent_model")

## Putting Our Model to the Test

Run the following cell to create the necessary classes and functions to load our model and perform inference.


In [23]:
# Import the necessary libraries
#
from transformers import (
    AutoTokenizer,
    TFAutoModelForSequenceClassification
)

# Create the DistilBERT tokenizer
#
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Define a function to perform inference on a single input text.
# 
def infer_intent(model, text):
    # Pass the text into the tokenizer.
    # (The variable is named `inputs` to avoid shadowing Python's built-in input().)
    #
    inputs = tokenizer(text, truncation=True, padding=True, return_tensors="tf")

    # Send the result from the tokenizer into our classification model
    #
    output = model(inputs)

    # Convert the output logits to probabilities with softmax, then
    # find the classification index with the highest value.
    #
    pred_label = tf.argmax(tf.nn.softmax(output.logits, axis=-1), axis=-1)

    return pred_label

# Create a list of unique labels that we will recognize.
# Obviously this has to match what we trained our model with
# earlier.
#
sentence_labels = [
              "others",
              "atis_abbreviation",
              "atis_aircraft",
              "atis_airfare",
              "atis_airline",
              "atis_flight",
              "atis_flight_time",
              "atis_greeting",
              "atis_ground_service",
              "atis_quantity",
              "atis_yes",
              "atis_no"]

# Load the saved model file
#
intent_model = TFAutoModelForSequenceClassification.from_pretrained("intent_model")



Some layers from the model checkpoint at intent_model were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at intent_model and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [24]:
text = input()

print(sentence_labels[infer_intent(intent_model, text)[0]])

hi, i want to book a ticket
atis_airfare


In [25]:
!zip -r intent_model.zip intent_model

  adding: intent_model/ (stored 0%)
  adding: intent_model/tf_model.h5 (deflated 8%)
  adding: intent_model/config.json (deflated 57%)


In [26]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


In [27]:
# !cp intent_model.zip drive/MyDrive