# Overview

The purpose of this notebook is to fine-tune an open-weight model on a non-trivial dataset and measure how its performance changes.

The model to be fine-tuned is [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B), chosen for its open weights and small parameter count.
Its low parameter count reduces computational cost and training time, which is a higher priority than usual because of the resource-constrained training environment (Kaggle notebooks).

While the `Qwen` series of models is capable of multiple tasks, including conversation, reasoning, and text generation, this fine-tuning session focuses on improving its natural language classification abilities by training it on the [sh0416/ag_news](https://huggingface.co/datasets/sh0416/ag_news) dataset.

**The primary performance metric is F1.**
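
As a rough illustration of the metric, the sketch below computes macro-averaged F1 (the averaging mode passed to `f1.compute` in the Metrics cell) in plain Python. The helper `macro_f1` and the toy labels are hypothetical and exist only for this example; the notebook itself relies on the `evaluate` library.

```python
from typing import Sequence


def macro_f1(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores (illustrative helper, not notebook code)."""
    labels = sorted(set(references) | set(predictions))
    per_class = []
    for label in labels:
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy example with four classes, mirroring the four AG News labels.
toy_references = [0, 0, 1, 1, 2, 2, 3, 3]
toy_predictions = [0, 1, 1, 1, 2, 3, 3, 3]
toy_score = macro_f1(toy_predictions, toy_references)
print(round(toy_score, 3))
```

Because every class contributes equally to the average, a model that ignores one of the four labels is penalized even if its overall accuracy looks acceptable.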

To further reduce resource consumption, the training procedure uses parameter-efficient fine-tuning (PEFT): LoRA adapters are attached to the model so that only a small fraction of the weights is trained. While 0.6 billion parameters is relatively small for a language model, fully fine-tuning it is still too costly for two T4 GPUs to complete in a reasonable amount of time, which is why PEFT is used.
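
To make the savings concrete, the arithmetic below counts the trainable parameters a LoRA adapter adds to a single weight matrix. A rank-`r` adapter on a `d_out x d_in` matrix trains two small factors with `r * (d_in + d_out)` parameters in total. The 1024 x 1024 layer size is an illustrative assumption, not a figure read from the model's configuration; the rank of 8 matches the `LoraConfig` used later in the notebook.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two LoRA factors A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix that LoRA leaves frozen."""
    return d_in * d_out


# Hypothetical layer size, chosen for illustration only.
d_in, d_out, rank = 1024, 1024, 8

adapter_params = lora_param_count(d_in, d_out, rank)
frozen_params = full_param_count(d_in, d_out)
trainable_fraction = adapter_params / frozen_params

print(f"adapter: {adapter_params:,} params, frozen: {frozen_params:,} params")
print(f"trainable fraction for this layer: {trainable_fraction:.2%}")
```

Under these assumptions the adapter trains well under 2% of the layer's weights, which is where the reduction in memory and compute comes from.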

Training used the `Trainer` API from Hugging Face's `transformers` library.

Five hyperparameters were varied during training, with two options per parameter, for a total of 32 different combinations:

- learning_rate: 1e-4, 1e-3
- num_train_epochs: 1, 2
- lr_scheduler_type: linear, cosine
- gradient_accumulation_steps: 4, 8
- weight_decay: 0.01, 0.1

While a full grid search across all combinations would have been ideal, hardware constraints prevented it. Instead, a randomized search was performed: the 32 combinations were shuffled, and the first 16 were tested.
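
The sampling procedure can be sketched with `itertools.product`. This is a compact alternative to the nested loops in the Search Space cell, not the code that was actually run; the names `search_space`, `all_configs`, and `sampled_configs` are introduced here for illustration, and the sketch is not guaranteed to reproduce the exact subset selected in the notebook.

```python
import itertools
import random

# Two candidate values per hyperparameter, as listed above.
search_space = {
    "learning_rate": [1e-4, 1e-3],
    "num_train_epochs": [1, 2],
    "lr_scheduler_type": ["linear", "cosine"],
    "gradient_accumulation_steps": [4, 8],
    "weight_decay": [0.01, 0.1],
}

# Full grid: the Cartesian product of all value lists (2**5 = 32 configs).
all_configs = [
    dict(zip(search_space.keys(), values))
    for values in itertools.product(*search_space.values())
]

# Shuffle with a fixed seed and keep the first half, so every sampled
# config is distinct and the selection is repeatable.
rng = random.Random(42)
rng.shuffle(all_configs)
sampled_configs = all_configs[: len(all_configs) // 2]

print(len(all_configs), len(sampled_configs))
```

Shuffling the full grid and slicing it samples without replacement, so no configuration is trained twice.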

# Evaluation Strategy

The code used to evaluate performance can be found in [this cell.](#Initial-Performance)

# Results

The fine-tuned model/adapter can be found in this HuggingFace repository: [cli08/qwen3-0.6-finetuned](https://huggingface.co/cli08/qwen3-0.6-finetuned)

The base model had a dismal F1 score of 0.253 on the hold-out set before fine-tuning, but this improved drastically to 0.908 after fine-tuning.

|Initial F1|Fine-tuned F1|
|-------|----|
|0.253|0.908|
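
For context on the initial score: `AutoModelForSequenceClassification` attaches a newly initialized classification head to the pretrained backbone, so predictions before fine-tuning may be close to arbitrary. The sketch below works out two approximate reference points for a four-class problem, assuming perfectly balanced classes (an assumption, since the sampled test split is only roughly balanced). The function names are hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def uniform_guess_macro_f1(num_classes: int) -> float:
    """Approximate macro F1 when every class is guessed with equal probability.

    With balanced classes, each class has precision and recall of
    about 1 / num_classes.
    """
    p = r = 1 / num_classes
    return f1_from(p, r)


def constant_guess_macro_f1(num_classes: int) -> float:
    """Macro F1 when a single class is always predicted (balanced classes)."""
    # The predicted class has recall 1 and precision 1 / num_classes;
    # every other class scores 0.
    predicted_class_f1 = f1_from(1 / num_classes, 1.0)
    return predicted_class_f1 / num_classes


uniform_baseline = uniform_guess_macro_f1(4)
constant_baseline = constant_guess_macro_f1(4)
print(f"uniform guessing: {uniform_baseline:.3f}, constant guessing: {constant_baseline:.3f}")
```

An initial F1 near the uniform-guessing value would suggest that the score before fine-tuning reflects the untrained head more than the backbone's language ability.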

The best hyperparameter combination was:

- learning_rate: 0.001
- num_train_epochs: 2
- lr_scheduler_type: linear
- gradient_accumulation_steps: 4
- weight_decay: 0.01

# Analysis

There was a drastic improvement in the hold-out set's F1 score after fine-tuning, which speaks to the impressive power of PEFT. That being said, the number of parameters was quite small to begin with, and the classification dataset only has four possible labels, so this improvement in F1 might be a result of overfitting to a simple, low-complexity text-classification dataset with little diversity in its output labels. If there were more output labels, the fine-tuned F1 might not be as high as it currently is.

Nevertheless, this experiment was a success, and it lends credibility to the strategy some organizations use of fine-tuning small language models on proprietary data for internal use cases, instead of renting cloud-hosted LLMs and hoping the providers don't leak data.

# Risks

As with all LLM and AI-based technologies, there is a risk of abuse and bias in the model outputs, especially when the model is fed dangerous training data. For example, a malicious actor might use the convenient and accessible nature of open-weight model fine-tuning technology to coerce models into outputting harmful content by fine-tuning them on source materials espousing harmful ideas.

##### AI Usage Disclosure
The code in this notebook was created with assistance from AI tools. The code has been reviewed and edited by a human. For more information on the extent and nature of AI usage, please contact the author.

# Packages

In [None]:
%pip install evaluate peft huggingface-hub

In [None]:
import os
from dataclasses import dataclass
from typing import Dict, List, Optional, Any

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    AutoModelForCausalLM,
    pipeline,
    EarlyStoppingCallback
)

import evaluate
from evaluate import evaluator

import itertools
import json
from copy import deepcopy

from peft import LoraConfig, TaskType, get_peft_model

from IPython.display import display, Markdown

import random

from huggingface_hub import HfApi

from kaggle_secrets import UserSecretsClient

In [None]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Dataset

In [None]:
dataset = load_dataset("sh0416/ag_news")

dataset = {
    "train": dataset["train"].shuffle(seed=42).select(range(5000)),
    "valid": dataset["train"].shuffle(seed=42).select(range(5000, 6000)),
    "test": dataset["test"].shuffle(seed=42).select(range(2000))
}

num_labels = len(set(dataset["train"]["label"]))

# Tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Model

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-0.6B",
    num_labels=num_labels,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    pad_token_id=tokenizer.pad_token_id,
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1
)

model = get_peft_model(model, lora_config)

device = "cuda" if torch.cuda.is_available() else "cpu"

model.to(device)

# Preprocess

In [None]:
max_length = 256

def preprocess(dataset_split, select_columns: List[str]):
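    def run_tokenizer(row):
        # Pad and truncate every example to a fixed length of `max_length` tokens.
        return tokenizer(
            row["text"],
            padding="max_length",
            truncation=True,
            max_length=max_length,
        )

    def prepare_columns(row):
        # Merge title and description into a single input string.
        row["text"] = row["title"] + " " + row["description"]
        # Shift labels down by one so they start at 0 (assumes raw labels start at 1).
        row["label"] = row["label"] - 1

        return row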

    dataset_split = dataset_split.map(prepare_columns)

    encoded_dataset = dataset_split.map(run_tokenizer, batched=True)

    encoded_dataset = encoded_dataset.remove_columns(
        [col for col in encoded_dataset.column_names if col not in select_columns]
    )

    return encoded_dataset.with_format("torch")

train_dataset = preprocess(dataset['train'], ["input_ids", "attention_mask", "label"])
valid_dataset = preprocess(dataset['valid'], ["input_ids", "attention_mask", "label"])
test_dataset = preprocess(dataset['test'], ["label", "text"])

# Metrics

In [None]:
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    results = accuracy.compute(predictions=preds, references=labels)
    results.update(
        f1.compute(predictions=preds, references=labels, average="macro")
    )

    return results

# Initial Performance

In [None]:
def test_performance(model: str):
    model = AutoModelForSequenceClassification.from_pretrained(
        model,
        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
        pad_token_id=tokenizer.pad_token_id,
        num_labels=num_labels
    )

    classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

    label_mapping={f"LABEL_{i}": i for i in range(num_labels)}

    predictions_raw = classifier(list(test_dataset["text"]), batch_size=32)

    predicted_labels_str = [pr["label"] for pr in predictions_raw]

    predictions = [label_mapping[str_label] for str_label in predicted_labels_str]

    final_f1 = evaluate.load("f1").compute(predictions=predictions, references=list(test_dataset["label"]), average="macro")["f1"]

    return round(final_f1, 3)

initial_f1 = test_performance("Qwen/Qwen3-0.6B")

# Hyperparameter Tuning

## Training Arguments

In [None]:
training_arguments = TrainingArguments(
    output_dir="qwen3-0.6-finetuned",
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=100,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=0,
    lr_scheduler_type="cosine",
    gradient_accumulation_steps=4,
    fp16=False,
    report_to="none",
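    # Note: save_steps and eval_steps only apply when the matching strategy is "steps";
    # with the "epoch" strategies above they have no effect.
    save_steps=50,
    eval_steps=50,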
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)

early_stop = EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=0.02)

trainer_kwargs = dict(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics,
    callbacks=[early_stop]
)

## Search Space

In [None]:
hyperparameters = []

for learning_rate in [1e-4, 1e-3]:
    for num_train_epochs in [1, 2]:
        for lr_scheduler_type in ["linear", "cosine"]:
            for gradient_accumulation_steps in [4, 8]:
                for weight_decay in [0.01, 0.1]:
                    hyperparameters.append({
                        "learning_rate": learning_rate,
                        "num_train_epochs": num_train_epochs,
                        "lr_scheduler_type": lr_scheduler_type,
                        "gradient_accumulation_steps": gradient_accumulation_steps,
                        "weight_decay": weight_decay
                    })

random.seed(42)
random.shuffle(hyperparameters)
hyperparameters = hyperparameters[:len(hyperparameters) // 2]

## Helper Functions

In [None]:
def run_single_experiment(
    base_training_args: TrainingArguments,
    trainer_cls,
    trainer_kwargs: Dict[str, Any],
    hp_config: Dict[str, Any],
    idx: int
) -> Dict[str, Any]:
    args_dict = base_training_args.to_dict()
    for k, v in hp_config.items():
        args_dict[k] = v

    training_args = TrainingArguments(**args_dict)

    trainer = trainer_cls(
        args=training_args,
        **trainer_kwargs,
    )

    train_output = trainer.train()
    eval_metrics = trainer.evaluate()

    trainer.save_model(f"checkpoint_{idx}")

    trainer.push_to_hub(
        commit_message=f"checkpoint_{idx}",
        token=UserSecretsClient().get_secret("HF_TOKEN")
    )

    result = {
        "hp_config": hp_config,
        "train_samples": train_output.metrics.get("train_samples", None),
        "eval_metrics": eval_metrics,
    }

    return result

def grid_search_hyperparams(
    base_training_args: TrainingArguments,
    trainer_cls,
    trainer_kwargs: Dict[str, Any],
    hyperparameters: List[Dict[str, Any]],
    results_path: str = "grid_search_results.jsonl",
) -> Dict[int, Dict[str, Any]]:
    all_results = {}

    os.makedirs(os.path.dirname(results_path) or ".", exist_ok=True)

    with open(results_path, "w", encoding="utf-8") as f_out:
        for idx, combo in enumerate(hyperparameters):
            print("\n=== Running config:", combo, "===")

            result = run_single_experiment(
                base_training_args=base_training_args,
                trainer_cls=trainer_cls,
                trainer_kwargs=deepcopy(trainer_kwargs),
                hp_config=combo,
                idx=idx
            )

            # Persist each result as one JSON line
            f_out.write(json.dumps(result) + "\n")
            f_out.flush()

            all_results[idx] = result

    return all_results

# Train

In [None]:
results = grid_search_hyperparams(
    base_training_args=training_arguments,
    trainer_cls=Trainer,
    trainer_kwargs=trainer_kwargs,
    hyperparameters=hyperparameters,
    results_path="grid_search_results.jsonl",
)

In [None]:
best_config = max(results.items(), key=lambda r: r[1]["eval_metrics"].get("eval_f1", 0.0))

print("Best index:", best_config[0])
print("Best config:", best_config[1]["hp_config"])
print("Best metrics:", best_config[1]["eval_metrics"])

# Post Fine-Tuning Performance

In [None]:
finetuned_f1 = test_performance(f"./checkpoint_{best_config[0]}")

In [None]:
display(Markdown("""
|Initial|Post|
|-------|----|
|{initial_f1}|{finetuned_f1}|
""".format(initial_f1=initial_f1, finetuned_f1=finetuned_f1)))