<a href="https://colab.research.google.com/github/fpgmina/DeepNLP/blob/main/L3_Part_2_NER_and_Intent_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Giuseppe Gallipoli

**Credits:** Moreno La Quatra

**Practice 3:** Named Entity Recognition (part 1) & Intent Detection (part 2)

## Intent Detection

In data mining, intent mining (or intention mining) is the problem of determining a user's intention from logs of their interactions with a computer system, such as a search engine. Intent detection is the identification and categorization of what a user intended or wanted to find when typing to, or speaking with, a conversational agent (or a search engine).

![https://d33wubrfki0l68.cloudfront.net/32e2326762c75a0357ab1ae1976a60d4bbce724b/f4ac0/static/a5878ba6b0e4e77163dc07d07ecf2291/2b6c7/intent-classification-normal.png](https://d33wubrfki0l68.cloudfront.net/32e2326762c75a0357ab1ae1976a60d4bbce724b/f4ac0/static/a5878ba6b0e4e77163dc07d07ecf2291/2b6c7/intent-classification-normal.png)

In this section, you will use the ATIS dataset: https://github.com/yvchen/JointSLU; https://www.kaggle.com/siddhadev/atis-dataset-clean/home

The task is to classify the intent of a sentence. The dataset is split into train, validation and test sets. **Use the provided splits** to train and evaluate your models.

In [1]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.train.csv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.dev.csv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.test.csv

--2025-10-29 18:15:08--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 838864 (819K) [text/plain]
Saving to: ‘atis.train.csv’


2025-10-29 18:15:09 (23.8 MB/s) - ‘atis.train.csv’ saved [838864/838864]

--2025-10-29 18:15:09--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.dev.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 112033 (109K) [text/plain]
Saving to: ‘atis.dev.c

### **Question 5: Two-step classification model**

Train a classification model to identify the intent of a given sentence. The model must use a pre-trained BERT model to generate sentence embeddings (important: **no fine-tuning**) and then use those embeddings to perform classification.

Once you have extracted the embeddings, you can use any classifier you want. For example, you can use a linear classifier (e.g., Logistic Regression) or a neural network (e.g., MLP). For your convenience, you can use the `sklearn` library for training the classifier (https://scikit-learn.org/stable/supervised_learning.html).

![https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/no_finetuning.png?raw=true](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/no_finetuning.png?raw=true)


Assess the performance of the trained model (the model on top of BERT) on the test set by using the **classification accuracy**, **precision**, **recall** and **F1-score**. You can use the `sklearn` library for computing the metrics (https://scikit-learn.org/stable/api/sklearn.metrics.html).


Note: you can use the `sentence-transformers` library to generate sentence embeddings (https://www.sbert.net/docs/pretrained_models.html).

In [2]:
%%capture
!pip install sentence-transformers
!pip install scikit-learn

In [3]:
import pandas as pd
TEXT_COLUMN = 'tokens'
LABEL_COLUMN = 'intent'

try:
  # Load training and test data using pandas
  train_df = pd.read_csv("atis.train.csv")
  test_df = pd.read_csv("atis.test.csv")
  # We also load the dev (validation) set. It is not used in Question 5,
  # but it serves as the validation set when fine-tuning in Question 6.
  dev_df = pd.read_csv("atis.dev.csv")

except FileNotFoundError:
  print("="*50)
  print("ERROR: Could not find data files.")
  print("Please make sure 'atis.train.csv', 'atis.dev.csv', and 'atis.test.csv' ")
  print("are in the same directory as this script.")
  print("="*50)
  raise  # Stop here: the following cells cannot run without the data

In [4]:
train_df.head()

| Unnamed: 0 | id | tokens | slots | intent |
|---|---|---|---|---|
| 0 | train-00001 | BOS what is the cost of a round trip flight fr... | O O O O O O O B-round_trip I-round_trip O O B-... | atis_airfare |
| 1 | train-00002 | BOS now i need a flight leaving fort worth and... | O O O O O O O B-fromloc.city_name I-fromloc.ci... | atis_flight |
| 2 | train-00003 | BOS i need to fly from kansas city to chicago ... | O O O O O O B-fromloc.city_name I-fromloc.city... | atis_flight |
| 3 | train-00004 | BOS what is the meaning of meal code s EOS | O O O O O O B-meal_code I-meal_code I-meal_code O | atis_abbreviation |
| 4 | train-00005 | BOS show me all flights from denver to pittsbu... | O O O O O O B-fromloc.city_name O B-toloc.city... | atis_flight |


In [5]:
sentences_train = train_df[TEXT_COLUMN].tolist()
y_train = train_df[LABEL_COLUMN].tolist()

sentences_test = test_df[TEXT_COLUMN].tolist()
y_test = test_df[LABEL_COLUMN].tolist()

num_train = len(sentences_train)
num_test = len(sentences_test)
num_classes = len(set(y_train)) # Get the number of unique intents

print(f"Loaded {num_train} training sentences from 'atis.train.csv'.")
print(f"Loaded {num_test} test sentences from 'atis.test.csv'.")
print(f"Found {num_classes} unique intent classes in the training data.")
print("-" * 30, "\n")

Loaded 4274 training sentences from 'atis.train.csv'.
Loaded 586 test sentences from 'atis.test.csv'.
Found 17 unique intent classes in the training data.
------------------------------ 



In [6]:
set(y_train)

{'atis_abbreviation',
 'atis_aircraft',
 'atis_airfare',
 'atis_airline',
 'atis_airport',
 'atis_capacity',
 'atis_city',
 'atis_distance',
 'atis_flight',
 'atis_flight#atis_airfare',
 'atis_flight_no',
 'atis_flight_time',
 'atis_ground_fare',
 'atis_ground_service',
 'atis_meal',
 'atis_quantity',
 'atis_restriction'}

#### Generating Sentence Embeddings (no-fine-tuning)

Note on all-MiniLM-L6-v2:

1. What is all-MiniLM-L6-v2?

It is a compact, BERT-style Transformer encoder. "MiniLM" is a family of models created by Microsoft by "distilling" a much larger model ("LM" stands for "Language Model"), and "L6" indicates that it has 6 Transformer layers. Think of it as a small, fast, and highly optimized version of BERT.

SentenceTransformer Model: This specific checkpoint has been further trained by the `sentence-transformers` team (using a siamese network structure) specifically for the task of creating high-quality sentence embeddings.

2. Why Not Use bert-base-uncased Directly?

This is the most important part, and it relates directly to the "no fine-tuning" constraint.

A standard BERT model (like bert-base-uncased from the transformers library) is a great base for fine-tuning.

However, if you just take its raw output embeddings without any fine-tuning (for example, by taking the [CLS] token's embedding), they are generally not very good for representing the meaning of a whole sentence. The model's pre-training objective (masked language modeling) doesn't teach it to create good sentence-level representations by default.

You would still get embeddings, but they would likely lead to lower classification accuracy than embeddings from a model trained specifically for sentence representations.
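To make the idea of a sentence-level representation more concrete, here is a minimal sketch of mean pooling, the kind of aggregation that `all-MiniLM-L6-v2` applies to its token embeddings to obtain one vector per sentence. It uses plain Python lists and made-up 3-dimensional token vectors (the real model produces 384-dimensional ones). The `mean_pool` helper and all values are illustrative assumptions, not part of the `sentence-transformers` API.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy input: 3 real tokens followed by 1 padding token (mask = 0).
token_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [9.0, 9.0, 9.0],  # padding vector, must not affect the result
]
attention_mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [2.0, 2.0, 1.0]
```

The pooling step itself has no trainable parameters. What makes the resulting vectors useful is that the encoder was trained so that pooled vectors of similar sentences end up close to each other.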

In [7]:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')
# Convert sentences to vector representations (embeddings)
print("Encoding training sentences... ")
X_train_embeddings = embedder.encode(sentences_train, show_progress_bar=True)
print("Encoding test sentences...")
X_test_embeddings = embedder.encode(sentences_test, show_progress_bar=True)

print(f"Embeddings generated. Shape of training embeddings: {X_train_embeddings.shape}")
print(f"Embeddings generated. Shape of test embeddings: {X_test_embeddings.shape}")
print("-" * 30, "\n")

Encoding training sentences... 


Batches:   0%|          | 0/134 [00:00<?, ?it/s]

Encoding test sentences...


Batches:   0%|          | 0/19 [00:00<?, ?it/s]

Embeddings generated. Shape of training embeddings: (4274, 384)
Embeddings generated. Shape of test embeddings: (586, 384)
------------------------------ 



#### Classification

A note on sklearn LogisticRegression:

While its name comes from logistic (binary) regression, the scikit-learn implementation of LogisticRegression handles multi-class classification problems automatically.

It can do this with one of two main strategies:

1. One-vs-Rest (OvR): If you have N classes (e.g., "greeting", "goodbye", "order_pizza"), it trains N separate binary logistic regression classifiers:

    * One classifier to distinguish "greeting" (class 1) from "all other classes" (class 0).

    * One classifier to distinguish "goodbye" (class 1) from "all other classes" (class 0).

    * One classifier to distinguish "order_pizza" (class 1) from "all other classes" (class 0).

    When you give it a new sentence, it runs all N classifiers and picks the class that gives the highest confidence score. OvR was the default in older scikit-learn releases (before 0.22); in recent releases you can obtain it by wrapping the model in `sklearn.multiclass.OneVsRestClassifier`.

2. Multinomial (Softmax): This trains a single classifier model that directly outputs a probability for every class, all summing to 1. This is often called Softmax Regression. In recent scikit-learn releases this is what the default `lbfgs` solver uses for multi-class problems, so the classifier trained below is a multinomial model. The `multi_class` parameter that used to select the strategy explicitly is deprecated since version 1.5.
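The difference between the two scoring schemes can be sketched in a few lines of plain Python. The example below applies independent sigmoids (OvR-style) and a softmax (multinomial-style) to the same made-up scores. This is a simplification: in a real OvR model each binary classifier has its own weights, so its scores would differ from those of a multinomial model, and scikit-learn renormalizes OvR probabilities afterwards. The labels and numbers are hypothetical.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores: List[float]) -> List[float]:
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores (logits) for three intents on one sentence.
labels = ["greeting", "goodbye", "order_pizza"]
logits = [2.0, 0.5, -1.0]

# One-vs-Rest: each class gets an independent "this class vs. the rest" probability.
ovr_probs = [sigmoid(z) for z in logits]
# Multinomial: one joint probability distribution over all classes.
softmax_probs = softmax(logits)

print("OvR sum:", round(sum(ovr_probs), 3))          # generally not 1
print("Softmax sum:", round(sum(softmax_probs), 3))  # 1.0
print("OvR pick:", labels[ovr_probs.index(max(ovr_probs))])
print("Softmax pick:", labels[softmax_probs.index(max(softmax_probs))])
```

Both schemes pick the class with the highest score; they differ in how the scores are learned and in whether the probabilities form a single distribution.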

##### Training Classifier

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report
)

# We raise max_iter above the default (100) so the solver has enough iterations to converge.
classifier = LogisticRegression(random_state=42, max_iter=1000)

print("Training classifier...")
# Train the classifier on the *training* embeddings and labels
classifier.fit(X_train_embeddings, y_train)

print("Classifier training complete.")
print("-" * 30, "\n")

Training classifier...
Classifier training complete.
------------------------------ 



##### Eval Classifier

In [9]:
y_pred = classifier.predict(X_test_embeddings)

# --- Calculate Metrics ---

# We use average='weighted' to account for class imbalance (some intents
# are more common than others).
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)

print("\n--- Overall Metrics (Weighted Average) ---")
print(f"**Accuracy:** {accuracy:.4f}")
print(f"**Precision:** {precision:.4f}")
print(f"**Recall:** {recall:.4f}")
print(f"**F1-Score:** {f1:.4f}")

# --- Detailed Classification Report ---
# This is now much more useful as it shows performance per-intent
print("\n--- Detailed Classification Report (Per-Intent) ---")
# We set `labels` to the unique classes found in the test set to ensure
# the report is complete and in a consistent order.
unique_labels = sorted(list(set(y_test)))
report = classification_report(y_test, y_pred, labels=unique_labels, zero_division=0)
print(report)


--- Overall Metrics (Weighted Average) ---
**Accuracy:** 0.9113
**Precision:** 0.8782
**Recall:** 0.9113
**F1-Score:** 0.8886

--- Detailed Classification Report (Per-Intent) ---
                          precision    recall  f1-score   support

       atis_abbreviation       0.94      0.94      0.94        16
           atis_aircraft       1.00      0.38      0.55         8
            atis_airfare       0.91      0.96      0.94        54
            atis_airline       0.57      0.44      0.50        18
            atis_airport       0.00      0.00      0.00         4
           atis_capacity       1.00      1.00      1.00         4
               atis_city       0.00      0.00      0.00         3
           atis_distance       1.00      0.67      0.80         3
             atis_flight       0.91      0.99      0.95       424
atis_flight#atis_airfare       0.00      0.00      0.00         3
          atis_flight_no       0.00      0.00      0.00         2
        atis_flight_time   
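In the overall metrics above, weighted recall is identical to accuracy (0.9113). This is expected rather than a coincidence: weighting each class's recall by its support and summing gives the total number of correct predictions divided by the total number of samples. The sketch below checks this on toy labels (not ATIS results) using only the standard library; the helper names are illustrative.

```python
from collections import Counter
from typing import Dict, List


def per_class_recall(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Recall of each class: correct predictions for the class / its support."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / support[label] for label in support}


def weighted_recall(y_true: List[str], y_pred: List[str]) -> float:
    """Average of per-class recalls, weighted by class support."""
    support = Counter(y_true)
    recalls = per_class_recall(y_true, y_pred)
    n = len(y_true)
    return sum(recalls[label] * support[label] / n for label in support)


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, imbalanced data: the rare class is recognized only half of the time.
y_true = ["flight"] * 8 + ["airfare"] * 2
y_pred = ["flight"] * 8 + ["flight", "airfare"]

print(per_class_recall(y_true, y_pred))  # {'flight': 1.0, 'airfare': 0.5}
print(weighted_recall(y_true, y_pred), accuracy(y_true, y_pred))
```

Because the weighted average is dominated by frequent classes, the per-intent report remains the place to look for intents the model never predicts correctly.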

### **Question 6: Fine-tuning end-to-end classification model**

Another approach is to fine-tune the BERT model for the classification task: a classification head is added on top of the pre-trained BERT model and trained end-to-end together with it.
This approach is usually more effective than the previous one because the encoder's representations are adapted to the task. However, it requires more training time and resources.

Train a new BERT model for the task of [sequence classification](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForSequenceClassification) (include BERT fine-tuning).  

![https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/finetuning.png?raw=true](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/finetuning.png?raw=true)

Assess the performance of the generated model by using the same metrics used in the previous question.

Which model has better performance? Why?

In [15]:
import numpy as np
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)

#### Prepare Labels

In [10]:
# Transformers models require integer labels.
# We create a mapping from the string intent (e.g., 'atis_flight') to an integer (e.g., 0).
unique_labels = sorted(list(train_df['intent'].unique()))
label_to_id = {label: i for i, label in enumerate(unique_labels)}
id_to_label = {i: label for i, label in enumerate(unique_labels)}

In [13]:
label_to_id

{'atis_abbreviation': 0,
 'atis_aircraft': 1,
 'atis_airfare': 2,
 'atis_airline': 3,
 'atis_airport': 4,
 'atis_capacity': 5,
 'atis_city': 6,
 'atis_distance': 7,
 'atis_flight': 8,
 'atis_flight#atis_airfare': 9,
 'atis_flight_no': 10,
 'atis_flight_time': 11,
 'atis_ground_fare': 12,
 'atis_ground_service': 13,
 'atis_meal': 14,
 'atis_quantity': 15,
 'atis_restriction': 16}

In [16]:
# Apply the mapping to our dataframes
train_df['label'] = train_df['intent'].map(label_to_id)
dev_df['label'] = dev_df['intent'].map(label_to_id)
test_df['label'] = test_df['intent'].map(label_to_id)

# --- Convert to Hugging Face Dataset object ---
train_dataset = Dataset.from_pandas(train_df)
dev_dataset = Dataset.from_pandas(dev_df)
test_dataset = Dataset.from_pandas(test_df)

#### Model Loading and Tokenization

In [31]:
# We'll use DistilBERT. It's a smaller, faster version of BERT that
# retains ~97% of its performance. It's perfect for Colab.
# The principle is IDENTICAL to using 'bert-base-uncased'.
MODEL_CHECKPOINT = "distilbert-base-uncased"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

# Create a function to tokenize the 'tokens' column (the raw sentence text)
def tokenize_function(examples):
    # padding="max_length" pads all sentences to the same length
    # (the tokenizer's maximum, since no max_length is given).
    # truncation=True cuts off sentences that are too long.
    return tokenizer(examples['tokens'], padding='max_length', truncation=True)

# Apply the tokenization to all datasets at once using .map()
print("Tokenizing datasets...")
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_dev_dataset = dev_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.57.1",
  "vocab_size": 30522
}

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/vocab.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/t

Tokenizing datasets...


Map:   0%|          | 0/4274 [00:00<?, ? examples/s]

Map:   0%|          | 0/572 [00:00<?, ? examples/s]

Map:   0%|          | 0/586 [00:00<?, ? examples/s]
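To illustrate what padding and truncation do to a sentence, here is a minimal sketch with made-up token IDs and a hypothetical `pad_or_truncate` helper. It is not the tokenizer's actual implementation: a real tokenizer also inserts special tokens and keeps the final separator when truncating. The padding ID 0 matches the `pad_token_id` shown in the DistilBERT config above.

```python
from typing import Dict, List

PAD_ID = 0  # DistilBERT's pad_token_id, as printed in the model config


def pad_or_truncate(token_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Return fixed-length input_ids plus the matching attention_mask."""
    kept = token_ids[:max_length]   # truncation: drop tokens beyond max_length
    n_pad = max_length - len(kept)  # padding: number of positions to fill
    return {
        "input_ids": kept + [PAD_ID] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short_example = pad_or_truncate([101, 2265, 7599, 102], max_length=6)
long_example = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)

print(short_example["input_ids"])       # [101, 2265, 7599, 102, 0, 0]
print(short_example["attention_mask"])  # [1, 1, 1, 1, 0, 0]
print(long_example["input_ids"])        # [101, 1, 2, 3, 4, 5]
```

The attention mask tells the model which positions are real tokens. Padding every sentence to the maximum length is simple but wastes computation on short ATIS sentences; padding each batch only to its longest sentence is a common alternative.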

`AutoModelForSequenceClassification`

* Auto: This is the "automatic" part. You give it a model checkpoint name (like "bert-base-uncased" or "distilbert-base-uncased"), and it automatically knows which model architecture to load (BERT, DistilBERT, RoBERTa, etc.). You don't have to manually import BertForSequenceClassification or DistilBertForSequenceClassification.

* Model: It loads a pre-trained Transformer model (like BERT), which acts as the "body." This part is already an expert at understanding language.

* ForSequenceClassification: This is the key part. It tells the Auto class to add a classification "head" on top of the model's "body." "Sequence Classification" is the task of taking one sequence (like a sentence) and assigning it a single label (like "positive," "negative," or "atis_flight").
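As a rough mental model of the "head", the sketch below maps one hidden vector (standing in for the body's representation of a sentence) to one score per label with a single linear layer, then picks the highest score. This is a simplification under stated assumptions: the dimensions, weights, and label subset are made up, and the real DistilBERT head uses a 768-dimensional input with an additional hidden layer and dropout.

```python
from typing import List


def linear_head(hidden: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Compute logits = W @ hidden + b, i.e., one raw score per label."""
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weights, bias)]


toy_id_to_label = {0: "atis_airfare", 1: "atis_flight", 2: "atis_meal"}

hidden = [0.5, -1.0, 2.0, 0.0]  # stand-in for the body's sentence representation
weights = [                     # one row per label (made-up values)
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
bias = [0.0, 0.5, 0.0]

logits = linear_head(hidden, weights, bias)
predicted_id = max(range(len(logits)), key=logits.__getitem__)
print(logits, toy_id_to_label[predicted_id])  # [0.5, 2.5, -1.0] atis_flight
```

In the fine-tuning setting, the head's weights start from random values while the body starts from its pre-trained weights, and both are updated during training.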

In [32]:
# Load the pre-trained model
# AutoModelForSequenceClassification automatically does this:
# 1. Loads the pre-trained BERT/DistilBERT body.
# 2. ADDS A CLASSIFICATION HEAD on top.
# 3. Initializes that head with random weights, so it must be trained before use.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_CHECKPOINT,
    num_labels=len(unique_labels),
    id2label=id_to_label, # For cleaner outputs
    label2id=label_to_id
)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "atis_abbreviation",
    "1": "atis_aircraft",
    "2": "atis_airfare",
    "3": "atis_airline",
    "4": "atis_airport",
    "5": "atis_capacity",
    "6": "atis_city",
    "7": "atis_distance",
    "8": "atis_flight",
    "9": "atis_flight#atis_airfare",
    "10": "atis_flight_no",
    "11": "atis_flight_time",
    "12": "atis_ground_fare",
    "13": "atis_ground_service",
    "14": "atis_meal",
    "15": "atis_quantity",
    "16": "atis_restriction"
  },
  "initializer_range": 0.02,
  "label2id": {
    "atis_abbreviation": 0,
    "atis_aircraft": 1,
    "atis_airfare": 2,
    "atis_airline": 3,
 

#### Training

In [33]:
from transformers.trainer_utils import EvalPrediction
from typing import Dict

def compute_metrics(eval_pred: EvalPrediction) -> Dict[str, float]:
    """
    Computes classification metrics for the model's predictions.

    This function is passed to the `Trainer` and is called at the end of
    each evaluation phase.

    Args:
        eval_pred (EvalPrediction): A named tuple provided by the Trainer,
            containing:
            - predictions (np.ndarray): The raw logits (unnormalized scores)
              output by the model. Shape is (num_samples, num_labels).
            - label_ids (np.ndarray): The true labels for the evaluation set.
              Shape is (num_samples,).

    Returns:
        Dict[str, float]: A dictionary mapping metric names (e.g., 'accuracy', 'f1')
                          to their float values. This dictionary is logged by
                          the Trainer.
    """

    # 1. Get Logits and Labels
    # 'eval_pred.predictions' holds the raw logits from the model's final layer.
    # 'eval_pred.label_ids' holds the ground-truth labels.
    logits = eval_pred.predictions
    labels = eval_pred.label_ids

    # 2. Convert Logits to Class Predictions
    # We take the 'argmax' of the logits along the last dimension (axis=-1)
    # to find the index (i.e., the class ID) with the highest score.
    # This is our model's final "guess" for each input.
    predictions = np.argmax(logits, axis=-1)

    # 3. Calculate Scikit-learn Metrics

    # We use 'average="weighted"' because this is a multi-class
    # problem. This calculates metrics for each class independently and then
    # computes a weighted average based on the number of samples in each
    # class (its "support"). This is crucial for imbalanced datasets.
    # 'zero_division=0' prevents warnings if a class has no predictions.
    precision = precision_score(y_true=labels, y_pred=predictions, average='weighted', zero_division=0)
    recall = recall_score(y_true=labels, y_pred=predictions, average='weighted', zero_division=0)
    f1 = f1_score(y_true=labels, y_pred=predictions, average='weighted', zero_division=0)

    # accuracy_score calculates the simple, overall accuracy.
    # (Total correct predictions / Total predictions)
    acc = accuracy_score(y_true=labels, y_pred=predictions)

    # 4. Return Metrics as a Dictionary
    # The Trainer expects a dictionary where keys are the metric names.
    # These names will be used in the logging (e.g., "eval_f1").
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [34]:
# These tell the Trainer how to train the model
training_args = TrainingArguments(
  output_dir="atis_finetune_results",  # Where to save the model
  num_train_epochs=3,                 # 3 epochs is standard for fine-tuning
  learning_rate=2e-5,                 # Standard learning rate for BERT
  per_device_train_batch_size=16,
  per_device_eval_batch_size=16,
  weight_decay=0.01,
  eval_strategy="epoch",              # Run evaluation at the end of each epoch
  save_strategy="epoch",              # Save the model at the end of each epoch
  logging_strategy="epoch",
  load_best_model_at_end=True,        # Load the best model (by loss) at the end
  report_to="none"                    # Disables extra logging (like wandb)
)

# Initialize the Trainer
trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=tokenized_train_dataset,
  eval_dataset=tokenized_dev_dataset,  # Use dev set for validation
  compute_metrics=compute_metrics,
  processing_class=tokenizer,
)

PyTorch: setting up devices


By default, trainer.train() will update all the model weights: both the pre-trained BERT/DistilBERT body and the new classification head.

This is the very definition of fine-tuning.

When the model is loaded, all its parameters are "unfrozen" (i.e., they are set to requires_grad=True). When the Trainer computes the loss from your ATIS data, the error gradient flows all the way back through the entire network.

This means:

* The Classification Head (which started with random weights) learns to map the BERT outputs to your specific intents.

* The BERT Body (which started with general language knowledge) has its weights slightly adjusted to get better at producing representations that are specifically useful for the ATIS intent task.

In [35]:
trainer.train()

The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: id, slots, intent, tokens. If id, slots, intent, tokens are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 4,274
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 804
  Number of trainable parameters = 66,966,545


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.7583 | 0.26361 | 0.951049 | 0.933703 | 0.919334 | 0.951049 |
| 2 | 0.1887 | 0.149586 | 0.973776 | 0.967147 | 0.963367 | 0.973776 |
| 3 | 0.1152 | 0.126447 | 0.977273 | 0.971499 | 0.967943 | 0.977273 |


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: id, slots, intent, tokens. If id, slots, intent, tokens are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 572
  Batch size = 16
Saving model checkpoint to atis_finetune_results/checkpoint-268
Configuration saved in atis_finetune_results/checkpoint-268/config.json
Model weights saved in atis_finetune_results/checkpoint-268/model.safetensors
tokenizer config file saved in atis_finetune_results/checkpoint-268/tokenizer_config.json
Special tokens file saved in atis_finetune_results/checkpoint-268/special_tokens_map.json
The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: id, slots, intent, tokens. If id, slots, intent, tokens are not expecte

TrainOutput(global_step=804, training_loss=0.3540817018765122, metrics={'train_runtime': 267.7073, 'train_samples_per_second': 47.896, 'train_steps_per_second': 3.003, 'total_flos': 1698951339804672.0, 'train_loss': 0.3540817018765122, 'epoch': 3.0})
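The step counts in the log can be reproduced with a little arithmetic, which is a useful sanity check on the training configuration. The sketch assumes a single device and no gradient accumulation, as reported in the log above.

```python
import math

num_examples = 4274  # training examples reported in the log
batch_size = 16      # per_device_train_batch_size
num_epochs = 3

# The last batch of each epoch is smaller, so we round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 268 804
```

This matches both `Total optimization steps = 804` and the name of the first saved checkpoint, `checkpoint-268`, which is written at the end of epoch 1.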

#### Eval

In [36]:
# Because load_best_model_at_end=True, the 'trainer' object now holds the
# checkpoint with the lowest validation loss. We evaluate it on the unseen test set.

print("Running final evaluation on the test set...")
eval_results = trainer.evaluate(eval_dataset=tokenized_test_dataset)

print("\n--- Test Set Performance ---")
print(f"**Accuracy:** {eval_results['eval_accuracy']:.4f}")
print(f"**F1-Score (Weighted):** {eval_results['eval_f1']:.4f}")
print(f"**Precision (Weighted):** {eval_results['eval_precision']:.4f}")
print(f"**Recall (Weighted):** {eval_results['eval_recall']:.4f}")

The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: id, slots, intent, tokens. If id, slots, intent, tokens are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 586
  Batch size = 16


Running final evaluation on the test set...



--- Test Set Performance ---
**Accuracy:** 0.9710
**F1-Score (Weighted):** 0.9646
**Precision (Weighted):** 0.9597
**Recall (Weighted):** 0.9710


### **Question 7: Intent Detection using Large Language Models**
**Credits:** Giuseppe Gallipoli

#### Introduction
[Large Language Models](https://en.wikipedia.org/wiki/Large_language_model) (LLMs) are a type of deep learning model capable of language generation. These models are built on deep learning architectures, primarily using neural networks, and are trained on massive amounts of text data. LLMs generally leverage the *Transformer* architecture, which allows them to process language in context, capturing complex relationships between words and concepts.

Large Language Models have demonstrated excellent capabilities across a wide variety of tasks, making them versatile models which can be applied in diverse scenarios and use cases. Although they are more typically used for *generative* tasks (e.g., text generation, text summarization, open-ended Question Answering), they can also be employed in *discriminative* tasks.

In this practice, we will use a Large Language Model to address an intent detection task. Rather than using a pre-trained encoder-only model without fine-tuning (Question 5) or with fine-tuning (Question 6), we will ask the LLM to classify the intent of a given sentence.
<br><br>
For now, we will use the [Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) 3.8B model, i.e., `microsoft/Phi-4-mini-instruct`.

#### LLM Prompting
To interact with LLMs, we need to define a **prompt**, which is a piece of text containing the instruction or question we want to give or ask the model. As we saw in Practice 1 (Exercise 11), a prompt can include only the instruction/question for the LLM, or it can also carry additional information (i.e., *context*) which the model can exploit to generate the answer.

For now, we will ask the model to classify the intent of a given sentence <u>without providing</u> any additional context. Please note that, since we want to (potentially) limit the choice of the LLM to a predefined list of intents (i.e., the set of labels of our dataset), we will also provide this list to the model.\
*Example of prompt*:\
Which is the intent of the following sentence?\
Choose among: label1, label2, ...
<br><br>

**Hint**: For those data instances whose intent is the concatenation of two intents (i.e., `atis_flight#atis_airfare`), consider only the first one.

<u>Suggestion</u>: To increase speed, switch to a GPU runtime. You can do this by clicking on Runtime → Change runtime type → Hardware accelerator → Select T4 GPU.\
If you encounter an `OutOfMemoryError`, try restarting the session by clicking on Runtime → Restart session.
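As a starting point, the prompt construction and the label clean-up described in the hint can be sketched as follows. This is only one possible formulation, not a reference solution: the helper names, the `Sentence:` prefix, and the final "Answer with the label only." instruction are assumptions added for illustration, and the call to the LLM itself is left out.

```python
from typing import List


def normalize_intent(intent: str) -> str:
    """Keep only the first intent of a combined label such as 'a#b'."""
    return intent.split("#")[0]


def build_zero_shot_prompt(sentence: str, labels: List[str]) -> str:
    """Build a prompt that lists the allowed labels and asks for one of them."""
    return (
        "Which is the intent of the following sentence?\n"
        f"Sentence: {sentence}\n"
        f"Choose among: {', '.join(labels)}\n"
        "Answer with the label only."
    )


raw_labels = ["atis_flight", "atis_airfare", "atis_flight#atis_airfare"]
labels = sorted({normalize_intent(label) for label in raw_labels})
print(labels)  # ['atis_airfare', 'atis_flight']

prompt = build_zero_shot_prompt("what is the cost of a flight from boston to denver", labels)
print(prompt)
```

Applying `normalize_intent` to both the label list and the ground-truth labels keeps the evaluation consistent with the hint above.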

In [None]:
# your code here

#### Few-Shot Learning
When we provide the LLM with additional context to be leveraged for generating an answer, this is known as *in-context learning* (ICL).\
A common technique to perform ICL involves including one or more **additional examples** of questions (or instructions) and their expected answers in the prompt.\
This can be useful for several reasons: showing the model the required output format (both in terms of structure and style), letting the model reason about the provided examples to better adapt to the (new) task, tailoring the model's responses to a user's specific needs, and so on.\
All of this is done directly at inference time, without any further training of the model, simply by leveraging the reasoning and generalization capabilities of LLMs.\
When no examples are provided, as in the previous point of the exercise, we talk about **zero-shot** learning. Conversely, when we supply additional input-output examples in the model's prompt, we talk about **few-shot** learning or $k$-shot learning, where $k$ represents the number of ICL examples.\
Examples can be selected according to different strategies, either randomly or in a more clever way, but please note that they <u>must be chosen from the training set</u> to avoid *data leakage*.
<br><br>

Now, try to implement few-shot learning by modifying the previous point of the exercise so that the prompt can include $k$ additional examples.
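One possible way to assemble a $k$-shot prompt is sketched below, using random selection of the demonstrations. It is a sketch under stated assumptions rather than a reference solution: the function name, the `Sentence:`/`Intent:` layout, and the toy training pairs are hypothetical, and the demonstrations are drawn only from the training examples passed in, in line with the data-leakage note above.

```python
import random
from typing import List, Tuple


def build_few_shot_prompt(
    sentence: str,
    labels: List[str],
    train_examples: List[Tuple[str, str]],
    k: int,
    seed: int = 42,
) -> str:
    """Prepend k (sentence, intent) demonstrations sampled from the training set."""
    rng = random.Random(seed)  # fixed seed so the same demonstrations are reused
    shots = rng.sample(train_examples, k)
    lines = [f"Choose the intent among: {', '.join(labels)}", ""]
    for shot_sentence, shot_intent in shots:
        lines += [f"Sentence: {shot_sentence}", f"Intent: {shot_intent}", ""]
    lines += [f"Sentence: {sentence}", "Intent:"]
    return "\n".join(lines)


# Toy stand-ins for (tokens, intent) pairs taken from the training split.
train_examples = [
    ("show me all flights from denver to pittsburgh", "atis_flight"),
    ("what is the cost of a round trip flight", "atis_airfare"),
    ("what is the meaning of meal code s", "atis_abbreviation"),
]
labels = ["atis_abbreviation", "atis_airfare", "atis_flight"]

prompt = build_few_shot_prompt("i need to fly from kansas city to chicago", labels, train_examples, k=2)
print(prompt)
```

With `k=0` the same function produces a zero-shot prompt, which makes it easy to compare settings with a single code path.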

In [None]:
# your code here

Try with different $k$ values, e.g., $k \in \{1, 3, 5, 10\}$, and see if and how the model's performance changes.

In [None]:
# your code here

Until now, we have used the [Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) 3.8B model, i.e., `microsoft/Phi-4-mini-instruct`, but there are plenty of LLMs!\
Try another model of your choice, e.g., [Mistral-Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) 7B `mistralai/Mistral-7B-Instruct-v0.3`. Remember to always keep in mind the GPU memory constraints you have.
<br><br>

**Hint**: To download and use certain models, you may need to request access to them (e.g., [Llama3.2-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) 3B). This can be done from the Hugging Face model page after logging in. Once access is granted, create a personal access token (go to your HF profile, Settings → Access Tokens → Create new token) and use it to authenticate when downloading both the model and the tokenizer.\
To authenticate, you can either pass the additional `token` parameter to the `from_pretrained()` method or log in via the [HF Command Line Interface](https://huggingface.co/docs/huggingface_hub/guides/cli).
```
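import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_tag = 'MODEL_TAG'
HF_TOKEN = 'MY_HF_TOKEN'
device = 'cuda' if torch.cuda.is_available() else 'cpu'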

model = AutoModelForCausalLM.from_pretrained(model_tag, torch_dtype=torch.float16, token=HF_TOKEN, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_tag, token=HF_TOKEN)
```
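Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. A minimal alternative, sketched below, is to read it from an environment variable. The `get_hf_token` helper is hypothetical, and the variable name `HF_TOKEN` is an assumption chosen to match the name used above.

```python
import os
from typing import Optional


def get_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in an environment variable, or None if it is unset or empty."""
    token = os.environ.get(env_var)
    return token if token else None


HF_TOKEN = get_hf_token()
print("Token found" if HF_TOKEN else "No token set; only public models will be accessible")
```

The returned value can then be passed as the `token` argument of `from_pretrained()`; when it is `None`, gated models will be refused but public ones can still be downloaded.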

In [None]:
# your code here

After analyzing the performance of the different settings and/or models, here you can find some questions to reason about the results:
- What is the performance in the zero-shot learning setting? (i.e., $k=0$)
- How does it change when providing one additional example? (i.e., $k=1$)
- What happens when increasing the number of ICL examples?
- If you tested additional models, which one performs best in the zero-shot and few-shot learning settings?
- What challenges or limitations did you observe?