# Intents Classification for Neural Text Generation
**General Context:**  
The identification of both Dialog Acts (DA) and Emotion/Sentiment (E/S) in spoken language is an important step toward
improving model performance on spontaneous dialogue tasks. In particular, it is essential for avoiding the generic
response problem, i.e., having an automatic dialog system generate an unspecific response that could be an answer to a
very large number of user utterances. DAs and emotions are identified through sequence labeling systems that are trained
in a supervised manner. DAs and emotions have been particularly useful for training ChatGPT.

**Problem Statement:**  
We start by formally defining the Sequence Labeling Problem. At the highest level, we have a set $D$ of conversations,
i.e., $D = (C_1,C_2,\dots,C_{|D|})$, with $Y= (Y_1,Y_2,\dots,Y_{|D|})$ being the corresponding set of labels
(e.g., DA, E/S). At a lower level, each conversation $C_i$ is composed of utterances $u$, i.e.,
$C_i= (u_1,u_2,\dots,u_{|C_i|})$, with $Y_i = (y_1, y_2, \dots, y_{|C_i|})$ being the corresponding sequence of labels:
each $u_j$ is associated with a unique label $y_j$. At the lowest level, each utterance $u_j$ can be seen as a sequence
of words, i.e., $u_j = (\omega^j_1, \omega^j_2, \dots, \omega^j_{|u_j|})$.

The goal is to predict $Y$ from $D$.
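
To make the notation concrete, here is a minimal sketch of the three levels in plain Python. The toy conversations, the label names and the helper `flatten` are illustrative assumptions, not taken from any benchmark. The flattened view, one (utterance, label) pair per example, is the one a per-utterance classifier works with.

```python
# Toy corpus D: a list of conversations, each a list of utterances (strings).
D = [
    ["hi , how are you ?", "fine , thanks ."],
    ["can you open the window ?", "sure .", "thank you ."],
]
# Labels Y mirror the structure of D: one label per utterance.
# The label names are made up for illustration.
Y = [
    ["question", "inform"],
    ["directive", "commissive", "inform"],
]

def flatten(conversations, labels):
    """Pair every utterance (split into words) with its label."""
    pairs = []
    for C_i, Y_i in zip(conversations, labels):
        assert len(C_i) == len(Y_i)  # each utterance has exactly one label
        for u, y in zip(C_i, Y_i):
            pairs.append((u.split(), y))  # whitespace split stands in for a real tokenizer
    return pairs

pairs = flatten(D, Y)
print(len(pairs))
print(pairs[0])
```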

**Your Task:**  
Build an intent classifier. Several benchmarks have been released, covering English [1] or multilingual settings [2].


## Google Colab

In [1]:
!pip install datasets
!pip install transformers


## Imports

In [2]:
from datasets import load_dataset, load_metric
import time

import torch.nn as nn
from transformers import BertModel, BertTokenizer

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc

import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd

from tqdm.auto import tqdm

## Helper functions

In [10]:
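def predict(sentence):
    # Tokenize one sentence and move the tensors to the model's device
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    device = next(model.parameters()).device
    model.eval()
    with torch.no_grad():
        # BertClassifier returns raw logits of shape (1, num_classes)
        logits = model(inputs["input_ids"].to(device),
                       attention_mask=inputs["attention_mask"].to(device))
    predictions = torch.argmax(logits, dim=-1)
    return names[predictions[0].item()]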

## Loading the *dyda_da* dataset 

In [4]:
dataset = load_dataset("silicone", "dyda_da")
names = dataset["train"].features["Label"].names

Downloading and preparing dataset silicone/dyda_da to /root/.cache/huggingface/datasets/silicone/dyda_da/1.0.0/af617406c94e3f78da85f7ea74ebfbd3f297a9665cb54adbae305b03bc4442a5...


Generating train split:   0%|          | 0/87170 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/8069 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/7740 [00:00<?, ? examples/s]

Dataset silicone downloaded and prepared to /root/.cache/huggingface/datasets/silicone/dyda_da/1.0.0/af617406c94e3f78da85f7ea74ebfbd3f297a9665cb54adbae305b03bc4442a5. Subsequent calls will reuse this data.



## Tokenize dataset

In [11]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
tokenized_datasets = dataset.map(lambda example: tokenizer(example["Utterance"], padding="max_length", truncation=True), batched=True)

In [18]:
tokenized_datasets = tokenized_datasets.remove_columns(["Dialogue_ID", "Dialogue_Act", "Idx", "Utterance", "token_type_ids"])
tokenized_datasets = tokenized_datasets.rename_column("Label", "labels")
tokenized_datasets.set_format("torch")

## Dataloaders

In [20]:
train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, batch_size=8)
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=8)
test_dataloader = DataLoader(tokenized_datasets["test"], batch_size=8)

In [21]:
len(train_dataloader), len(eval_dataloader), len(test_dataloader)

(10897, 1009, 968)
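
These counts follow from the split sizes reported when the dataset was generated (87170, 8069 and 7740 utterances) and the batch size of 8, because `DataLoader` keeps the last partial batch by default. A quick arithmetic check:

```python
import math

split_sizes = {"train": 87170, "validation": 8069, "test": 7740}
batch_size = 8

# With drop_last=False (the DataLoader default), the number of batches is ceil(n / batch_size).
num_batches = {name: math.ceil(n / batch_size) for name, n in split_sizes.items()}
print(num_batches)
```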

## Models
The main difference between BERT uncased and BERT cased is how they handle letter case in the text.

BERT uncased lowercases all text before tokenization, so that "Apple" and "apple" map to the same tokens. This removes case variation from the input, which often helps when case carries little information.

BERT cased, on the other hand, preserves the original case of the text.

In short, BERT uncased is usually a good fit for text classification tasks, while BERT cased can be more useful for tasks that depend on case distinctions, such as named entity recognition.
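
The effect of lowercasing can be illustrated with a toy example. This is only a sketch under simplifying assumptions: the three sentences are made up, and a whitespace split stands in for BERT's WordPiece tokenizer, whose vocabulary is fixed in advance. It shows how case variants of the same word collapse into one token, and how the distinction between "Apple" and "apple" is lost in the process.

```python
corpus = ["The Apple store is open", "the apple is red", "Open THE door"]

def vocab(sentences, lowercase):
    """Collect the set of whitespace-separated tokens, optionally lowercased."""
    tokens = set()
    for s in sentences:
        if lowercase:
            s = s.lower()
        tokens.update(s.split())
    return tokens

cased_vocab = vocab(corpus, lowercase=False)
uncased_vocab = vocab(corpus, lowercase=True)

# Case variants ("The", "the", "THE") count separately only in the cased set.
print(len(cased_vocab), len(uncased_vocab))
print("Apple" in cased_vocab, "Apple" in uncased_vocab)
```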

In [22]:
# Load pre-trained BERT model
bert_model = BertModel.from_pretrained('bert-base-uncased')
embedding_dim = bert_model.config.hidden_size

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [23]:
# Add a multi-layer neural network on top of BERT
class BertClassifier(nn.Module):
    def __init__(self, bert_model, num_classes):
        super(BertClassifier, self).__init__()
        self.bert = bert_model
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes)
        )

    def forward(self, input_ids, attention_mask):
        output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = output[1]  # Pooled output: the [CLS] hidden state passed through BERT's pooler (dense + tanh)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits

In [24]:
# Create an instance of the classifier
num_classes = len(names)  # Number of output classes
model = BertClassifier(bert_model, num_classes)

## Hyperparameters

In [25]:
criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=5e-5)
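
`nn.CrossEntropyLoss` takes raw logits and integer class indices, applies a log-softmax internally, and averages over the batch by default. The following framework-free sketch reproduces that computation on made-up numbers; the function name `cross_entropy` and the toy values are assumptions for illustration only.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one example: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Toy batch: two examples, four classes (illustrative values).
batch_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0]]
batch_labels = [0, 3]

losses = [cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)  # batch mean, as with reduction="mean"
print(losses, mean_loss)
```

The second example has uniform logits, so its loss equals $\log 4$, the loss of a classifier that guesses uniformly among four classes.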

## Training loops
`token_type_ids` is an optional input to BERT that identifies which segment each token belongs to. In other words, it helps BERT distinguish between two different sequences of tokens that are passed to it in a single input.

This is useful when a task involves two text sequences, such as question answering or natural language inference. For example, in question answering, we pass both the question and the passage that contains the answer, and the model needs to know which tokens belong to the question and which to the passage. `token_type_ids` lets the model tell the two apart.

In practice, `token_type_ids` is a sequence of integers, one per token: 0 for every token of the first input sequence and 1 for every token of the second. Here each input is a single utterance, so every value would be 0, and the column can be dropped, as done in the preprocessing step.
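
As a minimal sketch of this layout, the hypothetical helper `build_pair_inputs` below assembles the `[CLS] A [SEP] B [SEP]` sequence by hand from pre-split tokens. It does not call the real tokenizer; it only mirrors the segment numbering.

```python
def build_pair_inputs(tokens_a, tokens_b=None):
    """Lay out tokens BERT-style and return (tokens, token_type_ids)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)  # segment A, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        token_type_ids += [1] * (len(tokens_b) + 1)  # segment B and its closing [SEP]
    return tokens, token_type_ids

# Sentence pair: the second segment is marked with 1s.
pair_tokens, pair_types = build_pair_inputs(
    ["who", "wrote", "it", "?"], ["she", "wrote", "it", "."]
)
# Single utterance, as in this notebook: all zeros.
single_tokens, single_types = build_pair_inputs(["how", "are", "you", "?"])

print(pair_types)
print(single_types)
```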

In [26]:
# Train the model
def train(model, train_dataloader, optimizer, criterion, device):
    train_loss = 0
    progress_bar = tqdm(range(len(train_dataloader)))
    model.train()
    for step, batch in enumerate(train_dataloader):
        # Move the batch to the target device (GPU if available)
        b_input_ids = batch['input_ids'].to(device)
        b_input_mask = batch['attention_mask'].to(device)
        b_labels = batch['labels'].to(device)

        # Clear gradients
        optimizer.zero_grad()

        # Forward pass: BertClassifier returns raw logits of shape (batch_size, num_classes)
        logits = model(b_input_ids, attention_mask=b_input_mask)
        # Cross-entropy between logits and gold labels, averaged over the batch
        loss = criterion(logits, b_labels)

        # Backward pass
        loss.backward()

        # Update parameters and take a step using the computed gradient
        optimizer.step()

        # Update tracking variables
        train_loss += loss.item()

        progress_bar.set_description(f'Training loss: {round(train_loss/(step+1),3)}')
        progress_bar.update(1)
    train_mean_loss = train_loss/len(train_dataloader)
    return train_mean_loss

def eval(model, eval_dataloader, device, criterion=None):
    # Fall back to cross-entropy when no loss function is passed in
    criterion = criterion or nn.CrossEntropyLoss()
    val_loss = 0
    nb_val_steps = 0
    nb_val_correct = 0
    nb_val_total = 0
    progress_bar = tqdm(range(len(eval_dataloader)))
    model.eval()
    with torch.no_grad():
        for batch in eval_dataloader:
            # Move the batch to the target device (GPU if available)
            b_input_ids = batch['input_ids'].to(device)
            b_input_mask = batch['attention_mask'].to(device)
            b_labels = batch['labels'].to(device)

            # Forward pass: BertClassifier returns raw logits of shape (batch_size, num_classes)
            logits = model(b_input_ids, attention_mask=b_input_mask)

            # Mean loss over the batch
            loss = criterion(logits, b_labels)
            val_loss += loss.item()
            nb_val_steps += 1

            # Accuracy: argmax over the class dimension gives one prediction per utterance
            preds = torch.argmax(logits, dim=-1)
            nb_val_correct += (preds == b_labels).sum().item()
            nb_val_total += len(b_labels)
            progress_bar.update(1)

    val_mean_loss = val_loss / nb_val_steps
    val_mean_accuracy = nb_val_correct / nb_val_total
    return val_mean_loss, val_mean_accuracy
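
Accuracy here is the fraction of utterances whose highest-scoring class matches the gold label, so the argmax has to be taken over the class dimension for each row of logits. With `torch.argmax` this corresponds to passing `dim=-1`; omitting `dim` returns a single index into the flattened tensor. A framework-free sketch on made-up logits (the helper names are illustrative):

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda j: row[j])

def batch_accuracy(batch_logits, batch_labels):
    """Take the argmax over classes for each row and count the matches."""
    preds = [argmax(row) for row in batch_logits]
    correct = sum(int(p == y) for p, y in zip(preds, batch_labels))
    return preds, correct

# Three utterances, three classes (toy values).
toy_logits = [[0.1, 2.3, -0.5], [1.7, 0.2, 0.0], [0.0, 0.1, 3.0]]
toy_labels = [1, 0, 1]

toy_preds, toy_correct = batch_accuracy(toy_logits, toy_labels)
print(toy_preds, toy_correct / len(toy_labels))
```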

In [None]:
# Define the training parameters
epochs = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Train the model for multiple epochs
for epoch in range(epochs):
    print("EPOCH", epoch)
    print("==========================TRAINING==========================")
    train_loss = train(model, train_dataloader, optimizer, criterion, device)
    print("==========================EVALUATION==========================")
    val_loss, val_accuracy = eval(model, eval_dataloader, device)
    print(f"Epoch {epoch+1}/{epochs}, Train Loss: {np.mean(train_loss):.4f}, Val Loss: {val_loss:.4f}, Val Acc: {val_accuracy:.4f}")

EPOCH 0


  0%|          | 0/10897 [00:00<?, ?it/s]

In [None]:
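# Save model weights (BertClassifier is a plain nn.Module, so it has no save_pretrained method)
import os
os.makedirs('models/bert', exist_ok=True)
torch.save(model.state_dict(), 'models/bert/classifier.pt')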

## Evaluation

In [None]:
metric = load_metric("accuracy")
model.eval()
preds, trues = [], []
for batch in tqdm(test_dataloader, desc="evaluating", total=len(test_dataloader)):
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        # BertClassifier.forward takes only input_ids and attention_mask and returns raw logits
        logits = model(batch["input_ids"], attention_mask=batch["attention_mask"])

    # One predicted class per utterance
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

    preds.extend(predictions.cpu().tolist())
    trues.extend(batch["labels"].cpu().tolist())

metric.compute()

In [None]:
print(classification_report(np.array(trues).flatten(), np.array(preds).flatten(), target_names=names))

In [None]:
cm = confusion_matrix(np.array(trues).flatten(), np.array(preds).flatten())
df_cm = pd.DataFrame(cm, index=names, columns=names)
# config plot sizes
sn.set(font_scale=1)
sn.heatmap(df_cm, annot=True, annot_kws={"size": 8}, cmap='coolwarm', linewidth=0.5, fmt="")
plt.show()
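
To read the heatmap and the classification report together, it helps to see how per-class precision and recall come out of the confusion matrix. The sketch below rebuilds both from scratch on made-up labels, using the same convention as `sklearn.metrics.confusion_matrix` (rows are true classes, columns are predicted classes). The helper names and toy data are assumptions for illustration.

```python
def confusion(trues, preds, num_classes):
    """cm[i][j] counts examples with true class i predicted as class j."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(trues, preds):
        cm[t][p] += 1
    return cm

def precision_recall(cm, k):
    """Precision and recall for class k (0.0 when the denominator is zero)."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    return precision, recall

toy_trues = [0, 0, 1, 1, 2, 2, 2]
toy_preds = [0, 1, 1, 1, 2, 0, 2]

toy_cm = confusion(toy_trues, toy_preds, num_classes=3)
print(toy_cm)
print(precision_recall(toy_cm, 1))
```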

## Prediction