# CRT in AI Training Week: NLP
## Hands-on Lab for NLP w/ Language Models

Instructor: Dhairya Dalal

For this exercise we'll be using the CLINC-150 intent dataset. This dataset contains 150 intents spread across several domains, including banking. Intents range from common banking activities like bill_pay and new_card to more conversational topics like do_you_have_pet and tell_joke. Additionally, there is an OOS (out of scope) label, which consists of random utterances that are not supported by the existing intents. The goal of the OOS label is to map random utterances, or things the bot can't support, to a single label that the bot can use to provide a response like "I don't know" or "I'm sorry, I can't answer that for you". You can learn more about the dataset here: https://www.aclweb.org/anthology/D19-1131.pdf
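To make the role of the OOS label concrete, here is a minimal sketch of how a bot might turn a predicted intent into a reply. The intent names come from the dataset description; the `RESPONSES` table, the reply strings, and the `respond` helper are hypothetical and only for illustration.

```python
# Hypothetical reply table: intent names are from CLINC-150,
# the reply strings are invented for this sketch.
RESPONSES = {
    "bill_pay": "Sure, which bill would you like to pay?",
    "new_card": "I can help you order a new card.",
}
FALLBACK = "I'm sorry, I can't answer that for you."


def respond(predicted_intent: str) -> str:
    """Return the reply for a predicted intent, falling back for oos."""
    if predicted_intent == "oos":
        return FALLBACK
    # Intents without a scripted reply also get the fallback.
    return RESPONSES.get(predicted_intent, FALLBACK)


print(respond("bill_pay"))
print(respond("oos"))
```

Because every unsupported utterance collapses to one label, the bot needs only a single fallback branch rather than special handling per unknown topic.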


## Environment Setup and Data Loading
Before we get started, run the cells below. The first cell installs the NLP libraries we'll be using for the lab; note that this cell may take a few minutes to run. The next cell downloads and unzips the dataset. The remaining cells load our toy dataset into the Colab environment.


In [None]:
!pip install datasets > /dev/null
!pip install transformers==4.28.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
     7.0/7.0 MB 55.6 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
     7.8/7.8 MB 110.9 MB/s eta 0:00:00
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.3 transformers-4.28.0


In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00570/clinc150_uci.zip
! unzip clinc150_uci.zip

--2023-06-06 14:31:07--  https://archive.ics.uci.edu/ml/machine-learning-databases/00570/clinc150_uci.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1053960 (1.0M) [application/x-httpd-php]
Saving to: ‘clinc150_uci.zip’


2023-06-06 14:31:07 (3.47 MB/s) - ‘clinc150_uci.zip’ saved [1053960/1053960]

Archive:  clinc150_uci.zip
   creating: clinc150_uci/
  inflating: clinc150_uci/data_small.json  
   creating: __MACOSX/
   creating: __MACOSX/clinc150_uci/
  inflating: __MACOSX/clinc150_uci/._data_small.json  
  inflating: clinc150_uci/meta.txt   
  inflating: __MACOSX/clinc150_uci/._meta.txt  
  inflating: clinc150_uci/LICENSE    
  inflating: clinc150_uci/data_oos_plus.json  
  inflating: __MACOSX/clinc150_uci/._data_oos_plus.json  
  inflating: clinc150_uci/data_imbalanced.json  
  inflating: __MACOSX/clinc150_uci/._d

In [None]:
# Run this cell to load the data into memory
import json
import pandas as pd
# Use a context manager so the file handle is closed after reading
with open("clinc150_uci/data_imbalanced.json", "r") as f:
    data = json.load(f)

# Keys in data dict
print(f"Keys in data dict:  {data.keys()}")

# Sample of data in raw form
print("Sample of train data")
display(data["train"][:2])

Keys in data dict:  dict_keys(['oos_val', 'val', 'train', 'oos_test', 'test', 'oos_train'])
Sample of train data


[['what are the steps for setting up direct deposit for my paycheck',
  'direct_deposit'],
 ['how is a direct deposit set up', 'direct_deposit']]

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# OOS training examples come from oos_train only, so no test utterances leak into training
train_oos = pd.DataFrame(data["oos_train"], columns=["text", "label"])
train = pd.DataFrame(data["train"], columns=["text", "label"])
train["split"] = "train"
train_oos["split"] = "train"

train, val = train_test_split(train, test_size=.1, random_state=1988)
train_oos, val_oos = train_test_split(train_oos, test_size=.1, random_state=1988)

val["split"] = "val"
val_oos["split"] = "val"

test_oos = pd.DataFrame(data["oos_test"], columns=["text", "label"])
test = pd.DataFrame(data["test"], columns=["text", "label"])
test_oos["split"] = "test"
test["split"] = "test"

# Combine all into a single dataframe
combined_data = pd.concat([train, train_oos, val, val_oos, test, test_oos])
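Since the splits are assembled by hand, a quick sanity check that no utterance appears in both the training and test data can be worthwhile. The sketch below works on plain `[text, label]` rows in the same layout as the raw JSON; the `shared_texts` helper and the toy rows are hypothetical.

```python
def shared_texts(split_a, split_b):
    """Return the set of utterances that appear in both splits."""
    texts_a = {text for text, _label in split_a}
    texts_b = {text for text, _label in split_b}
    return texts_a & texts_b


# Toy rows in the same [text, label] layout as the raw CLINC-150 JSON.
toy_train = [
    ["how is a direct deposit set up", "direct_deposit"],
    ["find my wallet", "oos"],
]
toy_test = [
    ["find my wallet", "oos"],
    ["what is delta's carry on policy", "carry_on"],
]

overlap = shared_texts(toy_train, toy_test)
print(f"{len(overlap)} utterance(s) shared between splits: {sorted(overlap)}")
```

On the lab's dataframes, the same idea could be written as `set(train["text"]) & set(test["text"])`; an empty result is what you would hope to see.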

In [None]:
combined_data

| | text | label | split |
|---|---|---|---|
| 230 | engage whisper mode now | whisper_mode | train |
| 3457 | what's the deal with my health care | insurance | train |
| 133 | what is delta's carry on policy | carry_on | train |
| 1447 | if you could remind me about doing laundry i w... | todo_list_update | train |
| 34 | how do i get my check directly deposited | direct_deposit | train |
| ... | ... | ... | ... |
| 995 | find my wallet | oos | test |
| 996 | can you give me the gps location of harvey | oos | test |
| 997 | where's my buddy steve right this second | oos | test |
| 998 | locate jenny at her present position | oos | test |


## Simple ML Baseline Model Using TF-IDF Features

In [None]:
# Simple TFIDF Model
from sklearn.feature_extraction.text import TfidfVectorizer

# Feel free to use the train and test dataframes below.
train = combined_data.query("split == 'train' ")
test = combined_data.query("split == 'test' ")


# Fit the vectorizer on the training text only, so that no test
# vocabulary or document frequencies leak into the features
tfidf = TfidfVectorizer(lowercase=True)
tfidf.fit(train["text"])

# Generate X, y inputs for the model

# Train features and labels
X_train = tfidf.transform(train["text"])
y_train = train["label"]

# Test features and labels
X_test = tfidf.transform(test["text"])
y_test = test["label"]
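To see what the vectorizer computes, here is a rough pure-Python sketch of TF-IDF. It assumes the smoothed IDF that scikit-learn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation. Tokenization here is plain whitespace splitting, which is a simplification, so the numbers will not match `TfidfVectorizer` exactly. The `tfidf_vectors` helper and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """TF-IDF with smoothed IDF and L2 normalisation on whitespace tokens."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["pay my bill", "pay my rent", "tell me a joke"]
vecs = tfidf_vectors(docs)
print(vecs[0])
```

In the first document, "bill" appears in only one document while "pay" and "my" appear in two, so "bill" receives the largest weight. That is the property that makes rare, intent-specific words useful features.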


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Model training
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Model evaluation
preds = model.predict(X_test)  # generate predictions

print(classification_report(y_test, preds))

                           precision    recall  f1-score   support

      accept_reservations       0.96      0.87      0.91        30
          account_blocked       0.91      0.70      0.79        30
                    alarm       0.97      0.97      0.97        30
       application_status       0.89      0.80      0.84        30
                      apr       0.87      0.90      0.89        30
            are_you_a_bot       0.90      0.93      0.92        30
                  balance       0.90      0.63      0.75        30
             bill_balance       0.70      0.23      0.35        30
                 bill_due       0.78      0.47      0.58        30
              book_flight       0.93      0.87      0.90        30
               book_hotel       0.97      0.97      0.97        30
               calculator       0.86      0.40      0.55        30
                 calendar       0.79      0.63      0.70        30
          calendar_update       0.79      0.87      0.83     
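The columns of the report can be reproduced by hand. The following sketch computes precision, recall, and F1 for a single label by treating it as the positive class; the `per_class_scores` helper and the toy predictions are hypothetical.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one label, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = ["balance", "balance", "bill_due", "oos"]
y_pred = ["balance", "bill_due", "bill_due", "oos"]
print(per_class_scores(y_true, y_pred, "balance"))
```

The same formula explains the rows above: bill_balance has precision 0.70 and recall 0.23, which gives F1 = 2 × 0.70 × 0.23 / 0.93 ≈ 0.35. The support column is simply the number of true examples of each label in the test set.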

## NLP w/ Transformer Language Models

For this part of the lab, we will explore using neural language models. Modern language models (BERT, ERNIE, GPT, T5, etc.) have been found to be very effective across a wide range of NLP tasks. These models are usually deep neural networks that have been pretrained on large text corpora (e.g. Wikipedia, Common Crawl, BooksCorpus) and learn various aspects of language (syntax, grammar, semantics, etc.) that can be transferred across domains and NLP tasks. The Transformer architecture (https://jalammar.github.io/illustrated-transformer/) tends to be the backbone for most modern language models.

We sort the transformer models into two categories: autoregressive models and autoencoding models. Autoregressive models (e.g. GPT, XLNet) are pretrained on next-word prediction: given a sequence (the cat sat on the [BLANK]), the model attempts to predict the likely next word in the sequence. In contrast, autoencoding models (e.g. BERT, T5, RoBERTa, ERNIE) are trained to reconstruct corrupted sequences. A sentence like the cat sat on the mat would be corrupted by masking a random set of words, e.g. the [MASK] sat on the [MASK], and the model must predict the masked tokens. Unlike autoregressive pretraining, which only sees the words to the left of the prediction, the model uses the context of the full input to recover its masked parts.
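The difference between the two pretraining objectives can be illustrated by building toy training examples from the sentence used above. This is only a sketch: the helper names are hypothetical, the 30% masking rate is an arbitrary choice for a six-word sentence (BERT masks about 15% of tokens), and real models operate on subword ids rather than whole words.

```python
import random


def next_word_examples(tokens):
    """Autoregressive objective: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_example(tokens, mask_prob=0.3, seed=0):
    """Autoencoding objective: hide some tokens and keep them as targets."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append("[MASK]")
            targets[i] = tok  # position -> original token to recover
        else:
            corrupted.append(tok)
    return corrupted, targets


tokens = "the cat sat on the mat".split()
for context, target in next_word_examples(tokens):
    print(" ".join(context), "->", target)

corrupted, targets = masked_example(tokens)
print(corrupted, targets)
```

Each autoregressive example only exposes the left context, whereas the masked example keeps words on both sides of every hidden position.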

There are two ways to use these models for an arbitrary task. First, the weights of the model can be frozen and the output of the last hidden layer used as a set of fixed features. While this method is very quick, it is limited in its efficacy. The other way is finetuning: the process of updating the pretrained weights in order to adapt the model to a new task and domain (e.g. sentiment classification). Since the model has already internalized its own understanding of language, grammar, and semantics, finetuning usually takes only 1-5 epochs of additional gradient updates to condition the model to support the new task.


The Hugging Face `transformers` library hosts implementations and trained weights for nearly all the cutting-edge Transformer models and has a unified, easy-to-use API for finetuning them. For the final part of the lab we'll walk through how to prepare data for finetuning and train a model for our intent classification task.

We'll explore finetuning the DistilBERT (https://arxiv.org/abs/1910.01108) model for the last section. DistilBERT reduces the size of the BERT model (fewer parameters and hidden layers), which allows for quicker finetuning while, according to its authors, still retaining 97% of BERT's language understanding performance. For all other considerations (input encoding, training, etc.) DistilBERT is used in essentially the same way as BERT. We recommend changing your runtime type to GPU for this section, as it will dramatically speed up training. You may find you need to rerun earlier cells to ensure the dataset is reloaded into memory.

### 4.1 Data Preparation

DistilBERT expects WordPiece ids as input. WordPiece is a subword tokenization algorithm (https://huggingface.co/course/chapter6/6?fw=pt) that learns a fixed set of token and partial-token units which can be combined to construct any word. The `AutoTokenizer` class from `transformers` can be used to load the BERT WordPiece vocabulary and automatically encode any text as a sequence of WordPiece ids. Let's explore this a bit further below.
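As a rough illustration of how a word is split into WordPiece units at encoding time, the sketch below applies greedy longest-match-first lookup against a vocabulary. The six-entry `toy_vocab` is invented for this example (the real vocabulary has roughly 30,000 entries), and the `wordpiece_tokenize` helper is a simplified stand-in, not the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece units."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry ##
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real DistilBERT vocabulary.
toy_vocab = {"token", "##ize", "##r", "##s", "un", "##known"}
print(wordpiece_tokenize("tokenizers", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because unfamiliar words decompose into known pieces, a fixed vocabulary can cover almost any input, and only words with no matching pieces fall back to the unknown token.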


In [None]:
# Import the AutoTokenizer class from transformers
from transformers import AutoTokenizer

# Load DistilBert vocabulary using the .from_pretrained method
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Encoding a sentence
sent = "The quick brown fox jumped over the lazy dog."
print(f"Tokenizer output a dictionary: {tokenizer(sent)}")

# We can also decode ids to vocabulary
#print(tokenizer.decode([101, 1996, 4248, 2829, 4419, 5598, 2058, 1996, 13971, 3899, 1012, 102]))

Tokenizer output a dictionary: {'input_ids': [101, 1996, 4248, 2829, 4419, 5598, 2058, 1996, 13971, 3899, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(combined_data["label"])
combined_data["encoded_label"] = le.transform(combined_data["label"])
combined_data

train = combined_data.query("split == 'train' ")
val = combined_data.query("split == 'val' ")
test = combined_data.query("split == 'test' ")

In [None]:
train

| | text | label | split | encoded_label |
|---|---|---|---|---|
| 230 | engage whisper mode now | whisper_mode | train | 147 |
| 3457 | what's the deal with my health care | insurance | train | 57 |
| 133 | what is delta's carry on policy | carry_on | train | 19 |
| 1447 | if you could remind me about doing laundry i w... | todo_list_update | train | 128 |
| 34 | how do i get my check directly deposited | direct_deposit | train | 35 |
| ... | ... | ... | ... | ... |
| 829 | find out what gentlemen are wearing to wedding... | oos | train | 80 |
| 936 | how many keys does a xylophone have | oos | train | 80 |
| 817 | explain how small talk helps bind groups together | oos | train | 80 |
| 565 | what is the best way to kill microbes | oos | train | 80 |


Next we need to create a custom PyTorch Dataset class to store the generated encodings for our corpus. The code below creates a custom class and generates the datasets for the train, val, and test sets.

In [None]:
from transformers import AutoTokenizer
from torch.utils.data import Dataset
import torch

# Define Custom Class for DistilBert Inputs
class HFDataset(Dataset):

    def __init__(self, encodings: dict):
        self.encodings = encodings

    def __len__(self) -> int:
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx: int) -> dict:
        e = {k: v[idx] for k,v in self.encodings.items()}
        return e

# Define Tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


# Train Inputs
train_encodings = tokenizer(
    train["text"].tolist(),
    padding=True,           # pad every input to the longest sequence in the batch
    max_length=128,         # BERT max is 512; we truncate at 128 due to compute limitations
    return_tensors="pt",    # Return format: PyTorch tensors
    truncation=True         # cut off inputs longer than max_length
)
train_encodings["labels"] = torch.tensor(train["encoded_label"].tolist())  # Update train inputs with labels
train_dataset = HFDataset(train_encodings)

# Val Inputs
val_encodings = tokenizer(
    val["text"].tolist(),
    padding=True,           # pad every input to the longest sequence in the batch
    max_length=128,         # BERT max is 512; we truncate at 128 due to compute limitations
    return_tensors="pt",    # Return format: PyTorch tensors
    truncation=True         # cut off inputs longer than max_length
)
val_encodings["labels"] = torch.tensor(val["encoded_label"].tolist())  # Update val inputs with labels
val_dataset = HFDataset(val_encodings)


# Test Inputs
test_encodings = tokenizer(
    test["text"].tolist(),
    padding=True,           # pad every input to the longest sequence in the batch
    max_length=128,         # BERT max is 512; we truncate at 128 due to compute limitations
    return_tensors="pt",    # Return format: PyTorch tensors
    truncation=True         # cut off inputs longer than max_length
)
test_y = test["encoded_label"].tolist()
test_dataset = HFDataset(test_encodings)

In [None]:
# Let's take a look at what's in the train dataset
display(train_dataset[:2])

{'input_ids': tensor([[ 101, 8526, 7204, 5549, 2085,  102,    0,    0,    0,    0,    0,    0,
             0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
             0,    0,    0,    0,    0,    0,    0,    0,    0],
         [ 101, 2054, 1005, 1055, 1996, 3066, 2007, 2026, 2740, 2729,  102,    0,
             0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
             0,    0,    0,    0,    0,    0,    0,    0,    0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'labels': tensor([147,  57])}
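The zeros in `input_ids` and `attention_mask` come from padding. The sketch below mimics that step on plain Python lists: sequences are cut to `max_length`, padded with id 0 up to the longest sequence in the batch, and given a mask that is 1 for real tokens and 0 for padding. The `pad_batch` helper is hypothetical and simplified; in particular, the real tokenizer keeps the closing [SEP] token when truncating, whereas this sketch just slices.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Truncate to max_length, then pad every sequence to the longest one left."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Token ids copied from the two training examples shown above.
batch = [
    [101, 8526, 7204, 5549, 2085, 102],
    [101, 2054, 1005, 1055, 1996, 3066, 2007, 2026, 2740, 2729, 102],
]
ids, mask = pad_batch(batch, max_length=128)
print(ids[0])
print(mask[0])
```

The attention mask tells the model to ignore the padded positions, so padding changes the tensor shape without changing what the model attends to.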

In [None]:
le.inverse_transform([147])

array(['whisper_mode'], dtype=object)
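Conceptually, the label encoder is just a sorted list of unique labels plus a lookup table. A minimal pure-Python sketch of that round trip, assuming labels are assigned ids in sorted order as `LabelEncoder` does, might look like this; the `SimpleLabelEncoder` class is hypothetical, and the ids differ from the real ones because the toy list has only four labels.

```python
class SimpleLabelEncoder:
    """Map string labels to integer ids using the sorted unique labels."""

    def fit(self, labels):
        self.classes_ = sorted(set(labels))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        return [self._index[label] for label in labels]

    def inverse_transform(self, ids):
        return [self.classes_[i] for i in ids]


toy_labels = ["whisper_mode", "insurance", "carry_on", "oos", "insurance"]
enc = SimpleLabelEncoder().fit(toy_labels)
ids = enc.transform(toy_labels)
print(ids)
print(enc.inverse_transform(ids))
```

The model only ever sees the integer ids, so the same fitted encoder has to be kept around to translate predictions back into intent names.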

### 4.2 Model Training
HuggingFace makes it simple to finetune transformer models for any task. First we load the pretrained model. `AutoModelForSequenceClassification` is a generic class that combines a language model encoder with a classification head. Next we create a `TrainingArguments` object, which contains the training configuration details. Finally we create a `Trainer` object, which will handle all the requisite training steps (e.g. learning rate scheduling, gradient backprop, etc.).

In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import Trainer,  TrainingArguments


model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=151)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classi

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=100,
    per_device_train_batch_size = 32,
    per_device_eval_batch_size = 32,
    load_best_model_at_end=True,
    fp16=True
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()



Epoch,Training Loss,Validation Loss
1,2.3761,1.622281
2,0.7497,0.533945
3,0.3015,0.295464
4,0.1213,0.222921
5,0.0714,0.207084


TrainOutput(global_step=1625, training_loss=0.9681735335129958, metrics={'train_runtime': 141.0913, 'train_samples_per_second': 367.563, 'train_steps_per_second': 11.517, 'total_flos': 443954004306120.0, 'train_loss': 0.9681735335129958, 'epoch': 5.0})
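One of the steps the `Trainer` handles is learning rate scheduling. The sketch below assumes the defaults are in effect, since the training arguments above do not override them: a base learning rate of 5e-5, a linear schedule, and no warmup. The `linear_lr` helper is a hypothetical re-implementation for illustration, not the library's code.

```python
def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1625  # global_step reported by trainer.train() above
for step in (0, 325, 1625):
    print(step, linear_lr(step, total_steps))
```

With 1625 steps over 5 epochs, each epoch is 325 optimizer steps at batch size 32, so under these assumptions the learning rate has dropped by a fifth after the first epoch and reaches zero at the end of training.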

Now that our model is trained, we can generate predictions.

In [None]:
preds = trainer.predict(test_dataset)
preds

PredictionOutput(predictions=array([[-3.268, -6.188, -4.957, ..., -3.521, -5.09 , -4.027],
       [-3.617, -5.92 , -4.84 , ..., -3.479, -5.055, -3.824],
       [-3.15 , -6.17 , -5.51 , ..., -3.959, -4.973, -3.664],
       ...,
       [-4.516, -4.586, -5.785, ..., -4.312, -4.527, -6.56 ],
       [-4.848, -4.156, -4.633, ..., -3.262, -3.85 , -6.832],
       [-4.61 , -3.871, -5.023, ..., -4.203, -3.969, -6.406]],
      dtype=float16), label_ids=None, metrics={'test_runtime': 1.7243, 'test_samples_per_second': 3189.732, 'test_steps_per_second': 99.752})
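The `predictions` array holds one row of raw scores (logits) per test utterance, with one column per label. A small sketch of how a row becomes a prediction: softmax turns the logits into probabilities, and the index of the largest value is the predicted class id. The helpers and the four-value `toy_logits` row are hypothetical; the real rows have 151 entries.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return (index of the largest logit, its softmax probability)."""
    probs = softmax(logits)
    best = max(range(len(logits)), key=logits.__getitem__)
    return best, probs[best]


toy_logits = [-3.268, -6.188, 1.5, -4.027]  # hypothetical 4-class row
print(predict(toy_logits))
```

Softmax does not change which index is largest, so taking the argmax of the logits directly is enough when only the predicted label is needed.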

In [None]:
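import numpy as np
# Convert logits to class ids, then map the ids back to intent names.
# A separate variable keeps the PredictionOutput in preds intact, so this cell can be rerun.
pred_labels = le.inverse_transform(np.argmax(preds.predictions, axis=1))

In [None]:
from sklearn.metrics import classification_report
print(classification_report(le.inverse_transform(test_y), pred_labels))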

              precision    recall  f1-score   support

           0       0.93      0.90      0.92        30
           1       0.87      0.90      0.89        30
           2       0.97      0.97      0.97        30
           3       1.00      1.00      1.00        30
           4       0.93      0.93      0.93        30
           5       1.00      0.97      0.98        30
           6       0.79      0.77      0.78        30
           7       0.88      0.70      0.78        30
           8       0.89      0.83      0.86        30
           9       0.96      0.90      0.93        30
          10       1.00      1.00      1.00        30
          11       0.97      0.93      0.95        30
          12       0.96      0.73      0.83        30
          13       0.79      0.90      0.84        30
          14       1.00      1.00      1.00        30
          15       0.88      1.00      0.94        30
          16       0.97      1.00      0.98        30
          17       1.00    