<a href="https://colab.research.google.com/github/donlap/ds352-labs/blob/main/Lab11_Gemma3_with_Unsloth.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #11

# Finetuning a Gemma-3 Model for Text Classification with Unsloth

Today, you will learn how to take a pre-trained Large Language Model (LLM) and specialize it for **text classification**.

We will be using [**Unsloth**](https://docs.unsloth.ai/get-started/all-our-models), which speeds up finetuning and reduces memory usage, making it possible to train in Google Colab.

**Goal:** Finetune the `Gemma-3-1B` model on the `wisesight_sentiment` dataset to classify Thai text into one of four categories.

In [None]:
%%capture
%pip install "transformers>=4.52.4"
%pip install --no-deps bitsandbytes xformers==0.0.29.post3
%pip install git+https://github.com/donlap/unsloth-zoo.git@patch/skip-no-quant-state
%pip install datasets
%pip install git+https://github.com/donlap/unsloth.git@feature/sequence_classification

In [None]:
import torch
major_version, minor_version = torch.cuda.get_device_capability()
print(f"Major: {major_version}, Minor: {minor_version}")
from datasets import load_dataset
import datasets

from unsloth import FastModel, FastLanguageModel, tokenizer_utils
from unsloth.models import gemma3_sequence_classification
import pandas as pd
import numpy as np

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import warnings
from typing import Any, Dict, List, Tuple, Union
import matplotlib.pyplot as plt

## Load and Prepare the Dataset

We'll use the `wisesight_sentiment` dataset, which contains Thai text labeled with one of four sentiment categories (positive, negative, neutral, question). We'll rename the columns to `text` and `label` to match what the `Trainer` expects.

In [None]:
# Load the Wisesight Sentiment dataset
dataset = load_dataset("wisesight_sentiment")
for set_name in dataset:
    dataset[set_name] = dataset[set_name].rename_column("texts", "text")
    dataset[set_name] = dataset[set_name].rename_column("category", "label")

## Configure the Model and Tokenizer

In the following code block, we will:
1.  Define our model parameters.
2.  Load a 4-bit quantized version of `unsloth/gemma-3-1b-it-unsloth-bnb-4bit` using `FastModel`. Quantization reduces the model's memory footprint significantly.
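
As a rough back-of-the-envelope illustration of why quantization matters, the sketch below estimates the storage needed for the weights alone at several precisions. The parameter count is a round-number assumption (about one billion, not the exact Gemma-3-1B count), and the estimate ignores quantization metadata, activations, and optimizer state, so real memory usage will be higher.

```python
# Rough weight-memory estimate. N_PARAMS is an assumed round number,
# not the exact parameter count of Gemma-3-1B.
N_PARAMS = 1_000_000_000

BITS_PER_PARAM = {"float32": 32, "float16/bfloat16": 16, "int8": 8, "4-bit": 4}


def weight_memory_gib(n_params: int, bits: int) -> float:
    """Return the storage needed for the weights alone, in GiB."""
    return n_params * bits / 8 / 1024**3


for name, bits in BITS_PER_PARAM.items():
    print(f"{name:>17}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Under these assumptions, 4-bit weights take a quarter of the space of 16-bit weights, which is a large part of what lets the model fit on a free Colab GPU.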

In [None]:
NUM_CLASSES = 4 # number of sentiment classes in the dataset

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+

model_name = "unsloth/gemma-3-1b-it-unsloth-bnb-4bit" #"unsloth/Llama-3.2-1B-bnb-4bit"

model, tokenizer = FastModel.from_pretrained(
    model_name = model_name,
    auto_model = AutoModelForSequenceClassification,
    num_labels = NUM_CLASSES,
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    load_in_8bit = False,
    full_finetuning = False,
)

def get_output_embeddings():
    return model.score

model.get_output_embeddings = get_output_embeddings

3.  Configure **LoRA (Low-Rank Adaptation)**, a parameter-efficient finetuning (PEFT) technique. Instead of updating all of the model's roughly one billion parameters, we train only a small set of low-rank "adapter" matrices attached to existing layers.
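
To make the idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. It is an illustration of the technique, not Unsloth's or PEFT's actual implementation. The square 1152 x 1152 weight is an illustrative assumption (1152 is the hidden size that appears in the commented-out `model.score` line in the next cell), and the `alpha / sqrt(r)` scaling corresponds to rank-stabilized LoRA; standard LoRA uses `alpha / r`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 1152, 1152, 32, 32  # layer shape is illustrative

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init: update starts at 0

scale = alpha / np.sqrt(r)                  # rank-stabilized; standard LoRA: alpha / r


def lora_forward(x):
    """Frozen layer output plus the scaled low-rank update x @ (B A)^T."""
    return x @ W.T + scale * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))
full_params = W.size
lora_params = A.size + B.size
print("full layer params:", full_params)
print("LoRA params      :", lora_params)
print(f"fraction trained : {lora_params / full_params:.2%}")
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen layer; training then moves only `A` and `B`.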

In [None]:
from peft import TaskType  # provides TaskType.SEQ_CLS used below

# model.score = torch.nn.Linear(1152, 4, bias=False, device=model.device)

model = FastModel.get_peft_model(
    model,
    max_seq_length = max_seq_length,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",],
    r = 32,           # Larger = higher accuracy, but might overfit
    lora_alpha = 32,  # Recommended alpha == r at least
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,  # Supports rank stabilized LoRA
    task_type = TaskType.SEQ_CLS # Sequence to Classification Task
)

print("trainable parameters:", sum(p.numel() for p in model.parameters() if p.requires_grad))

## Finetune the Model

Now, we'll set up the `Trainer` from Hugging Face's `transformers` library. This class handles the entire training loop, including batching, gradient updates, and logging.
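
The training arguments in this section combine `warmup_steps = 10` with `lr_scheduler_type = "cosine"` and `learning_rate = 1e-5`. The sketch below shows the general shape of such a schedule: a linear warmup followed by cosine decay to zero. It is a plain-Python approximation for intuition, not the library's own code, and `total_steps=1000` is a hypothetical value (the real number depends on dataset size, batch size, and epochs).

```python
import math


def lr_at_step(step, base_lr=1e-5, warmup_steps=10, total_steps=1000):
    """Linear warmup to base_lr, then cosine decay to zero.

    total_steps is a hypothetical placeholder for this illustration.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 5, 10, 505, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step):.2e}")
```

The learning rate starts at zero, reaches its peak at the end of warmup, and falls to about half the peak midway through the remaining steps.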

In [None]:
def tokenize_function(example):
    return tokenizer(example["text"])


tokenized_dataset = dataset['train'].map(tokenize_function, batched=True)

trainer = Trainer(
    model = model,
    processing_class = tokenizer,
    train_dataset = tokenized_dataset,
    #eval_dataset = dataset['validation'],
    args = TrainingArguments(
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 1,
        warmup_steps = 10,
        learning_rate = 1e-5,
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        num_train_epochs = 1,
        report_to = "none",
        group_by_length = True,
    ),
)

In [None]:
trainer_stats = trainer.train()

## Inference

Here's an example of the model's prediction on a sample text:

In [None]:
FastModel.for_inference(model)  # Unsloth has 2x faster inference!

test_df = dataset['test'].to_pandas()

with torch.inference_mode():
    text = test_df['text'].iloc[2]
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    preds = model(**inputs).logits
    print(text)
    print(preds)


## Exercise 1: Evaluate the Model

Training is done. Now it's time to evaluate your model on the test set, which is stored in a Pandas dataframe `test_df`.

**Your Task:**
1.  Iterate through the test data in batches. A batch size of 16 or 32 is a good choice.
2.  For each batch:
    *   Tokenize the texts. Make sure to add `padding=True` and `return_tensors="pt"` to get a PyTorch tensor.
    *   Move the tokenized inputs to the same device as the model (`model.device`).
    *   Get the model's predictions (logits).
    *   Find the predicted class for each text by taking the `argmax` of the logits.
3.  Keep track of how many predictions are correct.
4.  After the loop, calculate and print the final accuracy (in percent).
5.  To inspect our predictions, print the `Text`, `True Label`, and `Predicted Label` for the first example in each batch.
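
The batching and accuracy bookkeeping in the steps above can be sketched on toy data. This is not a solution to the exercise: `fake_logits` is a hypothetical stand-in that returns random NumPy scores instead of calling the model, and the labels are randomly generated. Only the loop structure carries over.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=100)  # toy stand-in for the true labels


def fake_logits(batch_labels):
    """Hypothetical stand-in for the model's logits, shape [batch, 4]."""
    logits = rng.normal(size=(len(batch_labels), 4))
    # Nudge the score of the true class so the toy accuracy is above chance.
    logits[np.arange(len(batch_labels)), batch_labels] += 2.0
    return logits


batch_size = 32
num_correct = 0
for start in range(0, len(labels), batch_size):
    batch_labels = labels[start:start + batch_size]  # last batch may be smaller
    preds = fake_logits(batch_labels).argmax(axis=-1)  # predicted class per row
    num_correct += int((preds == batch_labels).sum())

accuracy = 100.0 * num_correct / len(labels)
print(f"Accuracy: {accuracy:.1f}%")
```

In your own solution, the logits come from the model and are PyTorch tensors, so the comparison and sum are done with tensor operations instead.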

Fill in the code in the cell below.

In [None]:
import gc

# Free some memory
gc.collect()
torch.cuda.empty_cache()

# Evaluation parameters. You can add more.
batch_size = 32
num_correct = 0

with torch.inference_mode():  # Disable gradient tracking during prediction to save memory and time.
    ### YOUR CODE HERE ###




## Exercise 2: Visualizing Model Attention

How does the model decide on a classification? **Attention** is a key mechanism. It allows the model to weigh the importance of different words in the input text when making a prediction.

By visualizing the attention matrix, we can get a glimpse into the model's "thought process."
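
For intuition about what such a matrix contains, here is a minimal NumPy sketch of scaled dot-product attention weights for a single head. The query and key vectors are random toy data, and the causal mask is an assumption that reflects how decoder-style models usually restrict each token to earlier positions; the exact masking inside the model may differ.

```python
import numpy as np


def attention_weights(Q, K, causal=True):
    """Return softmax(Q K^T / sqrt(d)) with an optional causal mask."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    if causal:
        # Hide future positions: token i may only attend to tokens 0..i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
seq_len, d_head = 5, 8  # toy sizes
Q = rng.normal(size=(seq_len, d_head))
K = rng.normal(size=(seq_len, d_head))

A = attention_weights(Q, K)
print(A.round(2))
```

Each row is a probability distribution over the tokens that a query token attends to, so rows sum to 1. With a causal mask, the upper triangle is zero, which is why heatmaps for this kind of model tend to look lower-triangular.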

**Your Task:**
1.  Read through the `visualize_attention` helper function provided below. It handles the complex parts of extracting and plotting the attention weights.
2.  Write your own text.
3.  Call the `visualize_attention` function with your text to see which words the model focuses on.

In [None]:
import seaborn as sns  # used for the heatmap below


def visualize_attention(text, layer=-1, head=0):
    """
    Visualizes the attention matrix for a given text.

    Uses the global `model` (the finetuned model) and `tokenizer`
    defined earlier in the notebook.

    Args:
        text (str): The input text to visualize.
        layer (int): The model layer to visualize. Default is the last layer.
        head (int): The attention head to visualize.
    """
    # To get attention weights, we need to run the model in evaluation mode
    # and pass the `output_attentions=True` flag.
    model.eval()

    # Tokenize the input text
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=max_seq_length).to(model.device)
    with torch.no_grad():
        # Get model outputs, including attention weights
        outputs = model(**inputs, output_attentions=True)

    # The `attentions` output is a tuple, one for each layer in the model.
    # Each element has shape: [batch_size, num_heads, sequence_length, sequence_length]
    attention_matrix = outputs.attentions[layer][0, head].cpu().numpy()

    # Get the tokens to use as labels for our plot
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    # Create the plot
    plt.figure(figsize=(12, 10))
    sns.heatmap(attention_matrix, xticklabels=tokens, yticklabels=tokens, cmap='viridis')
    plt.title(f'Attention Matrix - Layer {layer}, Head {head}')
    plt.xlabel('Key/Memory Tokens')
    plt.ylabel('Query/Input Tokens')
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()


In [None]:
### YOUR CODE HERE ###




## Exercise 3: Conceptual Questions

Please answer the following questions in the text cell provided.

**Question 1: Pros and Cons of Finetuning**
Based on this lab and your understanding, what are the pros and cons of finetuning a large pre-trained model compared to training a smaller model (e.g., logistic regression or SVM) from scratch for a specific task?

**Question 2: LoRA Parameters**
In Step 3, we configured LoRA with `r=32` and `lora_alpha=32`. Briefly explain the role of these two parameters. What might happen if you set `r` to a very high value (e.g., 256) for this small dataset?

**Question 3: LLM Model Choice**
Look through the [**list of models here**](https://docs.unsloth.ai/get-started/all-our-models). Name one model that you think might perform well when finetuned for the Thai text classification task. Why did you choose this model?

**Answer 1:**



**Answer 2:**



**Answer 3:**



