# Automating Customer Support with Question Classification

In this exercise, we will fine-tune a pre-trained DistilRoBERTa model (`distilroberta-base`) on a dataset of categorized questions to build a question classification system. Such a system can be integrated into your company's customer support workflow to reduce the load on human agents and speed up response times. We will fine-tune on the TREC dataset, a standard benchmark for question classification.

Fine-tuning a pre-trained model lets us build a specialized classifier from a small amount of labeled data and a few minutes of GPU time. By the end of this exercise, you will have a fine-tuned model that assigns each question to one of six coarse categories, the same kind of routing decision needed to categorize customer inquiries.

Remember to switch your Colab runtime to GPU (Runtime > Change runtime type) before running the code.

## Step 1: Install Libraries and Load the Data

In [None]:
# Install the necessary libraries
!pip install transformers datasets==3.6.0 evaluate

# Import necessary libraries
import torch  # PyTorch for tensor operations and deep learning framework
from datasets import load_dataset  # Hugging Face function to load various datasets
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer  # Core classes for tokenization, model, and training
import numpy as np  # NumPy for numerical operations
import evaluate  # Hugging Face library for loading and computing evaluation metrics

# Load the TREC dataset
dataset = load_dataset("trec", trust_remote_code=True)  # Downloads and loads the TREC dataset, enabling remote code trust for dataset scripts

# Sample a smaller subset of the data to make the training faster
train_data = dataset['train'].shuffle(seed=42).select(range(1000))  # Shuffle training data with a fixed seed for reproducibility, then select first 1000 samples
test_data = dataset['test'].shuffle(seed=42).select(range(200))     # Shuffle test data with the same seed, then select first 200 samples
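
Because the training subset is a random sample of 1,000 questions, it can be worth checking how the six coarse labels are distributed before fine-tuning. The sketch below is one way to do that in plain Python. The helper name `label_shares` and the sample label ids are hypothetical; with the real data you would pass `train_data["coarse_label"]` instead.

```python
from collections import Counter

def label_shares(label_ids):
    """Return the fraction of examples belonging to each label id."""
    counts = Counter(label_ids)
    total = len(label_ids)
    # Sort by label id so the output is stable and easy to read
    return {label: counts[label] / total for label in sorted(counts)}

# Hypothetical label ids, for illustration only (not taken from TREC)
sample_labels = [0, 1, 1, 3, 3, 3, 4, 5, 5, 2]

for label, share in label_shares(sample_labels).items():
    print(f"label {label}: {share:.0%}")
```

A heavily skewed distribution would suggest reporting per-class metrics alongside overall accuracy.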





## Step 2: Preprocess the Data

In [None]:
# Initialize the tokenizer
model_name = "distilroberta-base"  # Define the name of the pre-trained model to be used
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)  # Load the tokenizer for the specified model, using the fast (Rust-based) implementation

# Tokenize the data
def tokenize_function(examples):  # Define a function to tokenize batches of text
    return tokenizer(  # Use the tokenizer to process the given text
        examples["text"],           # The column in the dataset containing raw text
        padding="max_length",       # Pad all sequences to the maximum model length for uniform shape
        truncation=True             # Truncate sequences longer than the model's maximum length
    )

# Apply the tokenizer to the dataset
tokenized_train = train_data.map(tokenize_function, batched=True)  # Apply tokenization to the training data in batches
tokenized_test = test_data.map(tokenize_function, batched=True)    # Apply tokenization to the test data in batches

# Rename label column for Trainer compatibility
tokenized_train = tokenized_train.rename_column("coarse_label", "labels")  # Rename 'coarse_label' to 'labels' so Trainer can recognize it
tokenized_test = tokenized_test.rename_column("coarse_label", "labels")    # Rename 'coarse_label' to 'labels' in the test set

# Set the format for PyTorch
tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])  # Keep only required columns and convert them to PyTorch tensors
tokenized_test.set_format("torch", columns=["input_ids", "attention_mask", "labels"])   # Same formatting for the test set
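
To make the effect of `padding="max_length"` concrete, here is a minimal pure-Python sketch of what padding and the attention mask look like. It is not the tokenizer's actual implementation: the function `pad_batch` and the token ids are hypothetical, truncation is left out, and the pad id of 1 is an assumption based on the RoBERTa vocabulary.

```python
def pad_batch(batch_ids, pad_id=1, max_length=None):
    """Pad lists of token ids to a common length and build attention masks.

    If max_length is None, pad to the longest sequence in the batch
    (dynamic padding); otherwise pad every sequence to max_length.
    """
    longest = max(len(ids) for ids in batch_ids)
    target = longest if max_length is None else max(max_length, longest)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)               # real tokens, then padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)    # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token ids for two short questions
batch = [[0, 2264, 16, 2], [0, 12375, 21, 5, 78, 394, 2]]

dynamic_ids, dynamic_mask = pad_batch(batch)
fixed_ids, fixed_mask = pad_batch(batch, max_length=512)

print(len(dynamic_ids[0]), len(fixed_ids[0]))
print(dynamic_mask[0])
```

With fixed-length padding every row is 512 tokens wide even though TREC questions are short, so most of each row is padding. Padding per batch instead (for example with `DataCollatorWithPadding` from `transformers`) is a common alternative that reduces this overhead.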



## Step 3: Fine-tune the Pre-trained Model

In [None]:
# Initialize the tokenizer and model
model_name = "distilroberta-base"  # Define the name of the pre-trained DistilRoBERTa model
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)  # Load the tokenizer for the model, using fast implementation
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)  # Load the model with a classification head for 6 labels

# Load the accuracy metric from Hugging Face's evaluate library
accuracy = evaluate.load("accuracy")  # This will be used to measure classification performance

# Define the function to compute evaluation metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred                 # Unpack predictions (logits) and true labels
    if isinstance(logits, tuple):              # Handle case where Trainer returns logits as a tuple
        logits = logits[0]
    preds = np.argmax(logits, axis=-1)          # Take the index of the highest logit as the predicted class
    return accuracy.compute(predictions=preds, references=labels)  # Compute accuracy using the metric

# Define training arguments for Hugging Face Trainer
training_args = TrainingArguments(
    output_dir="./results",                     # Directory to save model checkpoints and results
    eval_strategy="epoch",                      # Determines when to run evaluation (per epoch here)
    per_device_train_batch_size=4,              # Batch size for training on each device
    per_device_eval_batch_size=4,               # Batch size for evaluation on each device
    num_train_epochs=3,                         # Number of complete passes through the training dataset
    weight_decay=0.01,                          # Weight decay coefficient applied by the AdamW optimizer (regularization)
    logging_dir="./logs",                       # Directory for storing logs
    save_total_limit=1,                         # Limit the total number of saved checkpoints
    report_to="none"                            # Disable logging to external services like W&B
)

# Initialize the Trainer object for training and evaluation
trainer = Trainer(
    model=model,                                # Model to be trained
    args=training_args,                         # Training configurations
    train_dataset=tokenized_train,              # Processed training dataset
    eval_dataset=tokenized_test,                # Processed evaluation dataset
    tokenizer=tokenizer,                        # Tokenizer used for the model
    compute_metrics=compute_metrics,            # Function to compute metrics during evaluation
)

# Start the training process
trainer.train()  # This will train the model according to the arguments and datasets provided


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.456508        | 0.88     |
| 2     | 0.699700      | 0.370771        | 0.915    |
| 3     | 0.699700      | 0.38559         | 0.925    |


TrainOutput(global_step=750, training_loss=0.5228366088867188, metrics={'train_runtime': 188.7258, 'train_samples_per_second': 15.896, 'train_steps_per_second': 3.974, 'total_flos': 397430544384000.0, 'train_loss': 0.5228366088867188, 'epoch': 3.0})
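
The `global_step=750` in the output can be checked by hand. The sketch below assumes a single GPU and no gradient accumulation; the helper name `expected_steps` is hypothetical.

```python
import math

def expected_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Estimate the number of optimizer steps for a training run."""
    # Batches per epoch, rounding up so a final partial batch still counts
    steps_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return steps_per_epoch * num_epochs

# 1000 training examples, batch size 4, 3 epochs (the values used above)
print(expected_steps(1000, 4, 3))  # 750
```

The same arithmetic explains the training-loss column. Assuming the default `logging_steps` of 500, nothing has been logged when epoch 1 ends at step 250 ("No log"). The first logged value appears at step 500, and it is still the latest value when epoch 3 ends at step 750.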


## Step 4: Evaluate the Model

In [None]:
import json  # Import the JSON library for pretty-printing Python dictionaries

# Evaluate the model on the test set
results = trainer.evaluate()  # Runs evaluation on the eval_dataset specified in Trainer, returns metrics as a dictionary

# Nicely print the full results dictionary
print("Evaluation Results:\n", json.dumps(results, indent=2))  # Convert results dict to a formatted JSON string with indentation

# Extract and print the accuracy safely
test_accuracy = results.get("eval_accuracy", 0.0)  # Retrieve 'eval_accuracy' (default 0.0 if missing); a separate name avoids overwriting the `accuracy` metric used by compute_metrics
print(f"Test Accuracy: {test_accuracy:.4f}")  # Print the accuracy formatted to 4 decimal places


Evaluation Results:
 {
  "eval_loss": 0.3855900168418884,
  "eval_accuracy": 0.925,
  "eval_runtime": 2.7499,
  "eval_samples_per_second": 72.73,
  "eval_steps_per_second": 18.183,
  "epoch": 3.0
}
Test Accuracy: 0.9250
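
To route a new inquiry, the model's six output logits have to be turned into a category name. The following is a minimal sketch of that last step in plain Python. The logits are made-up values, the helper names are hypothetical, and the label order is assumed from the TREC dataset card (verify it against `dataset['train'].features['coarse_label'].names`).

```python
import math

# Assumed order of the coarse TREC labels; check against the dataset features
COARSE_LABELS = ["ABBR", "ENTY", "DESC", "HUM", "LOC", "NUM"]

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits, labels=COARSE_LABELS):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Made-up logits for a single question, for illustration only
example_logits = [-1.2, 0.3, 0.1, 4.0, -0.5, 0.2]

label, confidence = top_prediction(example_logits)
print(f"Predicted category: {label} (confidence {confidence:.2f})")
```

In a support workflow, the confidence value could be compared against a threshold so that low-confidence questions are passed to a human agent rather than routed automatically.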
