# Fine-tuning LLMs for text classification with Unsloth

This notebook fine-tunes llama3-8b-Instruct with LoRA adapters to classify Arabic statements as "True" or "False" (encoded as 1 and 0 in the code below) according to whether they are "Triggering", that is, whether they call for immediate physician review. The LoRA adapters are saved to a Google Drive folder. The notebook 'merge_ft_model_and_upload' then loads these adapters, merges them with the base model, and pushes the result to the Hugging Face Hub. Finally, 'evaluate_memo_llama_3b_instruct_arabic_model.ipynb' evaluates the fine-tuned model on the test dataset and saves the model predictions and metrics.

The following documentation was used: https://colab.research.google.com/github/timothelaborie/text_classification_scripts/blob/main/unsloth_classification.ipynb

### Unsloth Notes:
- Input: a CSV file with 'text' and 'label' columns.
- Trims the classification head so that it contains only the number tokens ("1", "2", etc.). This saves about 1 GB of VRAM, lets you train the head without massive memory usage, and makes the start of the training session more stable.
- Only the last token in the sequence contributes to the loss, so the model doesn't waste its capacity trying to predict the input.
- Includes "group_by_length = True", which speeds up training significantly when sequence lengths vary widely.
- Efficiently evaluates the accuracy on the validation set using batched inference
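
To make the head-trimming note concrete, here is a minimal pure-Python sketch of the idea, not the reference notebook's implementation. It assumes a toy vocabulary of 6 tokens and a hidden size of 3; `full_head`, `project`, and the ids in `number_token_ids` are hypothetical and chosen only for illustration. Keeping just the rows that belong to the label tokens yields the same logits for those tokens while storing far fewer weights.

```python
# Toy output head: one weight row per vocabulary token (vocab_size x hidden_size).
full_head = [
    [0.1, 0.0, -0.2],   # token 0
    [0.3, 0.5, 0.1],    # token 1
    [-0.4, 0.2, 0.0],   # token 2
    [0.0, -0.1, 0.6],   # token 3
    [0.2, 0.2, 0.2],    # token 4
    [-0.3, 0.4, -0.5],  # token 5
]

# Hypothetical ids of the tokens that spell the class labels.
number_token_ids = [2, 4]


def project(head, hidden):
    """Compute logits = head @ hidden for a single hidden-state vector."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in head]


# Trimmed head: keep only the rows that belong to the label tokens.
trimmed_head = [full_head[i] for i in number_token_ids]

hidden_state = [1.0, 2.0, 3.0]
full_logits = project(full_head, hidden_state)
trimmed_logits = project(trimmed_head, hidden_state)

# The trimmed logits are exactly the full logits at the label positions.
assert trimmed_logits == [full_logits[i] for i in number_token_ids]

# The predicted class is the index of the largest trimmed logit.
predicted_class = max(range(len(trimmed_logits)), key=trimmed_logits.__getitem__)
print("predicted class index:", predicted_class)
```

With a real model the head has one row per vocabulary entry, so dropping all rows except the label rows is where the memory saving in the note would come from.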

In [1]:
! pip install -q unsloth datasets transformers trl accelerate peft


In [2]:
# Import unsloth first so its patches are applied before transformers and trl are loaded
from unsloth import FastLanguageModel, tokenizer_utils

import os
import warnings

import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import DataCollatorForLanguageModeling
from trl import SFTTrainer, SFTConfig

warnings.filterwarnings("ignore")


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [3]:
# Some environments need this workaround: replace Unsloth's untrained-token fix with a no-op.
def do_nothing(*args, **kwargs):
    pass

tokenizer_utils.fix_untrained_tokens = do_nothing

In [4]:
# --- Device check (optional)
try:
    major_version, minor_version = torch.cuda.get_device_capability()
    print(f"CUDA device capability: {major_version}.{minor_version}")
except Exception as e:
    print("CUDA device capability not available:", e)

CUDA device capability: 8.9


In [5]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


In [None]:
# ---------------------------
#  USER CONFIG - EDIT THESE
# ---------------------------
MODEL_NAME = "unsloth/llama3-8b-Instruct"   # source model
LOAD_IN_4BIT = True                          # can be set to False while debugging
MAX_SEQ_LENGTH = 8192
NUM_CLASSES = 2

CSV_PATH = "/content/drive/MyDrive/Memo_Dataset.csv"
OUTPUT_DIR = "/content/drive/MyDrive/llama-3b-instruct-ft-clean"

In [7]:
# --- Load model & tokenizer (unsloth FastLanguageModel wrapper)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL_NAME,
    load_in_4bit = LOAD_IN_4BIT,
    max_seq_length = MAX_SEQ_LENGTH,
    dtype = None,
)

# IMPORTANT: disable chat templates so Unsloth doesn't wrap your prompt with extra eot tokens
# This ensures nothing is appended after the final token of our prompt.
if hasattr(tokenizer, "chat_template"):
    tokenizer.chat_template = None
if hasattr(model, "chat_template"):
    model.chat_template = None

==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [8]:
# --- Data: load csv, downsample to balanced, rename fields
df = pd.read_csv(CSV_PATH)
# keep only columns we need
df = df[['Question', 'Trigger']].dropna()
df = df.rename(columns={'Question': 'text', 'Trigger': 'label'})
df['label'] = df['label'].astype(int)

print("Raw dataset size:", len(df))
print(df['label'].value_counts())

# Balance classes by downsampling the majority class
min_count = df['label'].value_counts().min()
df_bal = df.groupby('label', group_keys=False).apply(lambda x: x.sample(min_count, random_state=42)).reset_index(drop=True)
print("Balanced dataset size:", len(df_bal))
print(df_bal['label'].value_counts())

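train_df, test_df = train_test_split(df_bal, test_size=0.2, random_state=42, stratify=df_bal['label'])
# Work on explicit copies so the later in-place edits of the 'text' column are unambiguous
train_df, test_df = train_df.copy(), test_df.copy()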
print(f"Train: {len(train_df)}, Test: {len(test_df)}")

Raw dataset size: 13142
label
1    7687
0    5455
Name: count, dtype: int64
Balanced dataset size: 10910
label
0    5455
1    5455
Name: count, dtype: int64
Train: 8728, Test: 2182
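
The sizes printed above can be reproduced by hand. The short sketch below only redoes the arithmetic from the printed class counts; it does not touch the CSV, and the variable names are illustrative. It also shows how many majority-class rows the downsampling discards.

```python
import math

# Class counts as printed for the raw dataset.
class_counts = {1: 7687, 0: 5455}

# Downsampling keeps min_count rows from every class.
min_count = min(class_counts.values())
balanced_size = min_count * len(class_counts)
discarded = sum(class_counts.values()) - balanced_size

# A 20% test split of the balanced data (10910 is divisible by 5).
test_size = math.ceil(balanced_size / 5)
train_size = balanced_size - test_size

print("balanced:", balanced_size)   # 10910
print("discarded:", discarded)      # 2232
print("train/test:", train_size, test_size)  # 8728 2182
```

Roughly 17% of the raw rows (2232 of 13142) are dropped to balance the classes, all of them from class 1.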


In [9]:
from typing import Union


In [10]:
# --- Prompt function (COMPLETION style). Label is literally the last token.
def llama_prompt(statement: str, label: Union[int, str]):
    # No system prompt, no chat template. Label is the final token with NO trailing whitespace.
    # Make sure label is a string with '0' or '1'
    label_str = str(label)
    return (
        "### Instruction:\n"
        "Classify the following patient statement as 0 (Not Triggered) or 1 (Triggered).\n"
        "Respond with exactly one character: 0 or 1.\n\n"
        "### Statement:\n"
        f"{statement}\n\n"
        "### Response:\n"
        f"{label_str}"
    )

# Build train/test datasets with the prompt text (label stored too but we will only use text)
train_df["text"] = train_df.apply(lambda r: llama_prompt(r["text"], r["label"]), axis=1)
test_df["text"] = test_df.apply(lambda r: llama_prompt(r["text"], r["label"]), axis=1)

train_dataset = Dataset.from_pandas(train_df[['text','label']], preserve_index=False)
test_dataset = Dataset.from_pandas(test_df[['text','label']], preserve_index=False)
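
As a quick sanity check on the template, the sketch below repeats `llama_prompt` from the cell above and confirms that the label is the final character with nothing after it. The English statement is a made-up placeholder, not a row from the dataset.

```python
def llama_prompt(statement, label):
    """Same template as the cell above; the label is the final character."""
    return (
        "### Instruction:\n"
        "Classify the following patient statement as 0 (Not Triggered) or 1 (Triggered).\n"
        "Respond with exactly one character: 0 or 1.\n\n"
        "### Statement:\n"
        f"{statement}\n\n"
        "### Response:\n"
        f"{label}"
    )


# Hypothetical statement used only to exercise the template.
example = llama_prompt("I have had a severe headache since this morning.", 1)
print(example)

# The prompt must end with the label and carry no trailing whitespace.
assert example.endswith("### Response:\n1")
assert example == example.rstrip()

# Dropping the last character gives the label-free prefix used at inference time.
inference_prefix = example[:-1]
assert inference_prefix.endswith("### Response:\n")
```

Because the loss is computed on one token only, any trailing newline or space after the label would shift which token is treated as the answer.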

In [11]:
# --- Ensure single-tokenization for "0" and "1".
# If the tokenizer splits "0" or "1" into multiple tokens (rare), we add special tokens instead.
digit_token_ids = {}
need_to_add_specials = False
for d in ("0","1"):
    toks = tokenizer.encode(d, add_special_tokens=False)
    print(f"Digit '{d}' tokenized to ids: {toks} -> decoded: '{tokenizer.decode(toks)}'")
    if len(toks) != 1:
        need_to_add_specials = True

if need_to_add_specials:
    # Add explicit single tokens to represent class labels
    specials = ["<|class_0|>", "<|class_1|>"]
    tokenizer.add_tokens(specials)
    # if the model wrapper needs resize (unsloth wraps the model), ensure embeddings resized:
    if hasattr(model, "resize_token_embeddings"):
        model.resize_token_embeddings(len(tokenizer))
    # compute ids for mapping
    class0_id = tokenizer.encode(specials[0], add_special_tokens=False)[0]
    class1_id = tokenizer.encode(specials[1], add_special_tokens=False)[0]
    digit_token_ids = {"0": class0_id, "1": class1_id}
    # Replace prompts in datasets to use the new special tokens
    def prompt_with_special(statement, label):
        lbl = "<|class_1|>" if int(label)==1 else "<|class_0|>"
        return (
            "### Instruction:\n"
            "Classify the following patient statement as 0 (Not Triggered) or 1 (Triggered).\n"
            "Respond with exactly one token that represents the class.\n\n"
            "### Statement:\n"
            f"{statement}\n\n"
            "### Response:\n"
            f"{lbl}"
        )
    train_df["text"] = train_df.apply(lambda r: prompt_with_special(r["text"], r["label"]), axis=1)
    test_df["text"] = test_df.apply(lambda r: prompt_with_special(r["text"], r["label"]), axis=1)
    train_dataset = Dataset.from_pandas(train_df[['text','label']], preserve_index=False)
    test_dataset = Dataset.from_pandas(test_df[['text','label']], preserve_index=False)
    # Build number_token_ids mapping from tokenizer
    number_token_ids = [digit_token_ids["0"], digit_token_ids["1"]]
else:
    # digits are single token; compute their token ids
    number_token_ids = [tokenizer.encode("0", add_special_tokens=False)[0],
                        tokenizer.encode("1", add_special_tokens=False)[0]]

print("Number token ids used for labels:", number_token_ids)

Digit '0' tokenized to ids: [15] -> decoded: '0'
Digit '1' tokenized to ids: [16] -> decoded: '1'
Number token ids used for labels: [15, 16]


In [35]:
from transformers import AutoTokenizer, DataCollatorForLanguageModeling
from torch.utils.data import DataLoader

# --- 1. Tokenizer & dataset
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# Make sure train_dataset and test_dataset have 'text' and 'label' keys
print("Columns:", train_dataset.column_names)

# number_token_ids (the token IDs for the labels "0" and "1") is already defined in the previous cell

# --- 2. Collator
class CollatorForSFT(DataCollatorForLanguageModeling):
    def __init__(self, tokenizer, number_token_ids, ignore_index=-100, max_length=2048, mlm: bool = False):
        super().__init__(tokenizer=tokenizer, mlm=mlm)
        self.number_token_ids = number_token_ids
        self.ignore_index = ignore_index
        self.max_length = max_length
        self.reverse_map = {tok: i for i, tok in enumerate(number_token_ids)}
        if not all(isinstance(t, int) for t in number_token_ids):
            raise RuntimeError("number_token_ids must be ints")

    def __call__(self, examples):
        # The examples arrive already tokenized (SFTTrainer tokenizes the `text` field
        # before batching). The parent __call__ only pads the batch and builds the
        # initial labels as a copy of input_ids with padding positions set to -100.
        batch = super().__call__(examples, return_tensors="pt")
        labels = batch.get("labels", None)
        if labels is None:
            return batch

        for i in range(batch["input_ids"].size(0)):
            lab_seq = labels[i]
            att_seq = batch["attention_mask"][i]

            non_pad_indices = (att_seq == 1).nonzero(as_tuple=False)
            if non_pad_indices.numel() == 0:
                labels[i] = self.ignore_index
                continue

            last_token_pos = int(non_pad_indices[-1].item())

            found_idx = None
            for j in range(last_token_pos, -1, -1):
                tok = int(batch["input_ids"][i][j]) # Check original input_ids for the token ID
                if tok in self.number_token_ids:
                    found_idx = j
                    break

            lab_seq[:] = self.ignore_index
            if found_idx is not None:
                lab_seq[found_idx] = batch["input_ids"][i][found_idx]

        batch["labels"] = labels
        return batch

collator = CollatorForSFT(tokenizer, number_token_ids, max_length=MAX_SEQ_LENGTH)


Columns: ['text', 'label']
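
The masking rule inside `CollatorForSFT.__call__` can be checked in isolation. The following is a pure-Python sketch of the same logic on plain lists, assuming right padding; `mask_all_but_label` is a hypothetical helper, and every token id other than 15 and 16 (the ids printed above for "0" and "1") is made up.

```python
IGNORE_INDEX = -100


def mask_all_but_label(input_ids, attention_mask, number_token_ids):
    """Return labels where only the last label token keeps its id."""
    labels = [IGNORE_INDEX] * len(input_ids)

    # Positions that hold real (non-padding) tokens.
    real_positions = [i for i, m in enumerate(attention_mask) if m == 1]
    if not real_positions:
        return labels

    # Walk backwards from the last real token to the nearest label token.
    for pos in range(real_positions[-1], -1, -1):
        if input_ids[pos] in number_token_ids:
            labels[pos] = input_ids[pos]
            break
    return labels


# Toy sequence: a digit token (16) inside the statement, the label (15) at the end,
# followed by two padding positions.
input_ids = [9, 42, 16, 7, 88, 15, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]

labels = mask_all_but_label(input_ids, attention_mask, [15, 16])
print(labels)  # [-100, -100, -100, -100, -100, 15, -100, -100]
```

One consequence of the backward search is worth keeping in mind: if a sequence were truncated so that the label token is missing, the rule would pick up the last digit token inside the statement instead, so it relies on the label surviving tokenization.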


In [36]:
from peft import LoftQConfig

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = [
        "lm_head", # trained as well; it is not trimmed in this notebook, so it accounts for most of the trainable parameters
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    # init_lora_weights = 'loftq',
    # loftq_config = LoftQConfig(loftq_bits = 4, loftq_iter = 1), # And LoftQ
)
print("trainable parameters:", sum(p.numel() for p in model.parameters() if p.requires_grad))

Unsloth: Already have LoRA adapters! We shall skip this step.


Unsloth: Training lm_head in mixed precision to save VRAM
trainable parameters: 567279616
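
The printed count can be reproduced with a little arithmetic. The sketch below assumes the published Llama 3 8B dimensions (hidden size 4096, 32 layers, key/value projection width 1024, MLP width 14336, vocabulary 128256) and assumes that `lm_head` is trained in full rather than through a low-rank adapter. Neither assumption is stated in the notebook, but together they match the number above exactly.

```python
# Assumed Llama 3 8B dimensions (not printed anywhere in this notebook).
hidden, kv_dim, mlp_dim = 4096, 1024, 14336
n_layers, vocab_size, rank = 32, 128256, 16

# (in_features, out_features) of each adapted projection in one decoder layer.
projection_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp_dim),
    "up_proj": (hidden, mlp_dim),
    "down_proj": (mlp_dim, hidden),
}

# A rank-r LoRA adapter adds r * (in + out) parameters per projection.
lora_per_layer = sum(rank * (i + o) for i, o in projection_shapes.values())
lora_total = lora_per_layer * n_layers

# Assumption: the output head is trained in full (vocab_size x hidden).
lm_head_params = vocab_size * hidden

trainable = lora_total + lm_head_params
print(f"LoRA adapters: {lora_total:,}")      # 41,943,040
print(f"lm_head:       {lm_head_params:,}")  # 525,336,576
print(f"total:         {trainable:,}")       # 567,279,616
```

Under these assumptions the LoRA adapters contribute about 42 million parameters and the untrimmed head about 525 million, which is why trimming the head, as described in the notes at the top, would change the trainable count so much.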


In [None]:
# --- SFTTrainer config & training
sft_args = SFTConfig(
    per_device_train_batch_size = 8,   # lower if OOM; experiment
    gradient_accumulation_steps = 1,
    warmup_steps = 10,
    learning_rate = 1e-4,
    fp16 = not torch.cuda.is_bf16_supported(),
    bf16 = torch.cuda.is_bf16_supported(),
    logging_steps = 1,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "cosine",
    seed = 3407,
    output_dir = "/content/drive/MyDrive/llama-3b-instruct-ft-clean SFTTrainer output",
    num_train_epochs = 1,
    report_to = "none",
    group_by_length = True,
)

In [38]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    max_seq_length = MAX_SEQ_LENGTH,
    dataset_num_proc = 1,
    packing = False,
    args = sft_args,
    data_collator = collator,
    dataset_text_field = "text",
)


In [31]:
test_dataset[0]

{'text': '### Instruction:\nClassify the following patient statement as 0 (Not Triggered) or 1 (Triggered).\nRespond with exactly one character: 0 or 1.\n\n### Statement:\nماهي اسباب ألم منطقة الابهر في الظهر مع طقطقة الظهر الدائمة مع العلم ان عمري ٢١ عام\n\n### Response:\n0',
 'label': 0}

In [39]:
# Train (runs until completion for the configured epochs)
trainer_stats = trainer.train()
print("Training finished. Stats:", trainer_stats)

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 2
   \\   /|    Num examples = 8,728 | Num Epochs = 1 | Total steps = 1,091
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 567,279,616 of 8,597,540,864 (6.60% trained)


Step,Training Loss
1,0.9226
2,0.6897
3,0.9451
4,0.7345
5,0.7463
6,1.0161
7,1.0334
8,0.673
9,0.7189
10,0.7631


Unsloth: Will smartly offload gradients to save VRAM!
Training finished. Stats: TrainOutput(global_step=1091, training_loss=0.6138387782572607, metrics={'train_runtime': 1308.2991, 'train_samples_per_second': 6.671, 'train_steps_per_second': 0.834, 'total_flos': 3.970362112278528e+16, 'train_loss': 0.6138387782572607, 'epoch': 1.0})
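
The figures in the training log are consistent with the configuration, and a simple reference point helps when reading the loss. The sketch below recomputes the step count and throughput from the printed values and, as an assumption for interpretation only, compares the loss with ln 2, the cross-entropy of a 50/50 guess between two labels when all probability mass sits on the two label tokens.

```python
import math

num_examples = 8728
batch_size = 8
grad_accum = 1
train_runtime = 1308.2991  # seconds, from the TrainOutput above

# One optimizer step per (batch_size * grad_accum) examples; the last batch may be partial.
total_steps = math.ceil(num_examples / (batch_size * grad_accum))
steps_per_second = total_steps / train_runtime
samples_per_second = num_examples / train_runtime

# Reference point: loss of a uniform guess between two classes.
chance_loss = math.log(2)

print("total steps:", total_steps)                       # 1091
print("steps/s:", round(steps_per_second, 3))            # 0.834
print("samples/s:", round(samples_per_second, 3))        # 6.671
print("two-class chance loss:", round(chance_loss, 4))   # 0.6931
```

The reported `training_loss` of about 0.614 is an average over the whole epoch, including the noisy early steps, so it should not be read as the final quality of the classifier; the evaluation notebook mentioned in the introduction is the place to judge that.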


In [42]:
# --- After training: test generation on a small example
model.eval()
def predict_one(statement):
    # Build the prompt without a label.
    # Note: this instruction adds a definition of “Triggered” that the training prompt (llama_prompt) does not contain.
    prompt = (
        "### Instruction:\n"
        "Classify the following patient statement as 0 (Not Triggered) or 1 (Triggered). A statement is considered “Triggered” if it indicates the patient may require immediate or urgent medical attention, such as follow-up contact or evaluation by a healthcare professional within a short timeframe (e.g., the same day or within 24 hours).\n"
        "Respond with exactly one character: 0 or 1.\n\n"
        "### Statement:\n"
        f"{statement}\n\n"
        "### Response:\n"
    )

    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Use generate to create 1 token
    # Pass input_ids and attention_mask from the tokenized inputs
    out_ids = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, max_new_tokens=1, do_sample=False)

    # Decode only the newly generated token(s) by slicing from the length of input_ids
    out_dec = tokenizer.decode(out_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()

    # Try to extract 0 or 1 from the returned string
    for ch in out_dec:
        if ch in ("0","1"):
            return ch
    return out_dec

# Example statement, in English: "I developed a severe skin rash while taking the medication."
print("Example prediction (inspect):", predict_one("لقد أصبت بطفح جلدي شديد أثناء تناول الدواء."))

Example prediction (inspect): 1
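
`predict_one` returns only the generated character. If a confidence score is wanted, one possible variant (a sketch, not something this notebook does) is to read the logits of the two label tokens at the final position and normalize them with a two-way softmax. The helper names and the logit values below are hypothetical; with the real model the two inputs would be the logits at token ids 15 and 16.

```python
import math


def triggered_probability(logit_0, logit_1):
    """Two-way softmax: probability of class 1 given the logits of the '0' and '1' tokens."""
    m = max(logit_0, logit_1)  # subtract the max for numerical stability
    e0 = math.exp(logit_0 - m)
    e1 = math.exp(logit_1 - m)
    return e1 / (e0 + e1)


def classify(logit_0, logit_1, threshold=0.5):
    """Return 1 (Triggered) when the class-1 probability reaches the threshold."""
    return int(triggered_probability(logit_0, logit_1) >= threshold)


# Made-up logits for illustration.
print(triggered_probability(0.5, 2.0))
print(classify(0.5, 2.0))                  # 1
print(classify(2.0, 0.5))                  # 0
print(classify(1.0, 0.0, threshold=0.2))   # 1, because a lower threshold flags more cases
```

A threshold below 0.5 trades more false alarms for fewer missed urgent statements, which may be a relevant design choice for a triage-style classifier.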


In [43]:
model.save_pretrained(OUTPUT_DIR)
print("saved lora adapters")

saved lora adapters
