To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News


Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.55.4
!pip install --no-deps trl==0.22.2

### Unsloth

In [1]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer

PatchDPOTrainer()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


#### Unsloth: `hf_xet==1.1.10` and `ipykernel>6.30.1` breaks progress bars. Disabling for now in XET.
#### Unsloth: To re-enable progress bars, please downgrade to `ipykernel==6.30.1` or wait for a fix to
https://github.com/huggingface/xet-core/issues/526
🦥 Unsloth Zoo will now patch everything to make training faster!


In [3]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct", # Choose ANY! eg mistralai/Mistral-7B-Instruct-v0.2
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.10.5: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 9.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [4]:
# @title Alignment Handbook utils
import os
import re
from typing import List, Literal, Optional

from datasets import DatasetDict, concatenate_datasets, load_dataset, load_from_disk
from datasets.builder import DatasetGenerationError


LLAMA31_CHAT_TEMPLATE = """{%- set bos = '<|begin_of_text|>' -%}
{%- set eot = '<|eot_id|>' -%}
{%- set sh = '<|start_header_id|>' -%}
{%- set eh = '<|end_header_id|>' -%}
{{ bos }}{%- for message in messages -%}
{{ sh }}{{ message['role'] }}{{ eh }}\n{{ message['content'] }}{{ eot }}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{ sh }}assistant{{ eh }}\n
{%- endif -%}
"""

def ensure_llama31_template(tokenizer):
    # If your tokenizer already has a chat_template, keep it;
    # otherwise set the Llama 3.1 template explicitly.
    tpl = getattr(tokenizer, "chat_template", None)
    if not tpl or "start_header_id" not in tpl:
        tokenizer.chat_template = LLAMA31_CHAT_TEMPLATE


def apply_chat_template(
    example,
    tokenizer,
    task: Literal["sft", "generation", "rm", "dpo"] = "sft",
):
    import copy

    ensure_llama31_template(tokenizer)

    # Helper: normalize your dataset into messages if needed
    def build_messages_from_prompt_str(prompt_str: str):
        # Llama 3.1 expects headers for system/user/assistant.
        # We give an empty system and the user prompt, and (optionally)
        # add an assistant header via add_generation_prompt.
        return [
            {"role": "system", "content": ""},
            {"role": "user", "content": prompt_str},
        ]

    if task in ["sft", "generation"]:
        if "messages" in example:
            messages = copy.deepcopy(example["messages"])
            if messages and messages[0]["role"] != "system":
                messages.insert(0, {"role": "system", "content": ""})
        else:
            # Your data style: only a prompt string
            messages = build_messages_from_prompt_str(example["prompt"])

        example["text"] = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=(task == "generation"),
        )
        return example

    elif task == "rm":
        # Expect strings for chosen/rejected or full message lists.
        if all(k in example for k in ("prompt", "chosen", "rejected")):
            prompt_messages = build_messages_from_prompt_str(example["prompt"])

            # For RM you usually format the full conversations.
            # Here we just wrap answers as assistant turns so the
            # tokenizer can add headers consistently.
            chosen_messages = [{"role": "assistant", "content": str(example["chosen"])}]
            rejected_messages = [{"role": "assistant", "content": str(example["rejected"])}]

            example["text_chosen"] = tokenizer.apply_chat_template(
                chosen_messages, tokenize=False
            )
            example["text_rejected"] = tokenizer.apply_chat_template(
                rejected_messages, tokenize=False
            )
            example["text_prompt"] = tokenizer.apply_chat_template(
                prompt_messages, tokenize=False, add_generation_prompt=True
            )
            return example
        else:
            raise ValueError("RM task expects keys: prompt, chosen, rejected.")

    elif task == "dpo":
        # DPO wants: text_prompt (the context) and plain chosen/rejected strings
        if all(k in example for k in ("prompt", "chosen", "rejected")):
            prompt_messages = build_messages_from_prompt_str(example["prompt"])

            # IMPORTANT: For DPO, targets must be *raw completions*, not wrapped
            # with assistant headers/eot. Keep them as plain strings.
            example["text_chosen"]   = str(example["chosen"])
            example["text_rejected"] = str(example["rejected"])
            example["text_prompt"] = tokenizer.apply_chat_template(
                prompt_messages, tokenize=False, add_generation_prompt=True
            )
            return example
        else:
            raise ValueError("DPO task expects keys: prompt, chosen, rejected.")

    else:
        raise ValueError(
            f"Task {task} not supported; use one of ['sft','generation','rm','dpo']."
        )


def get_datasets(
    data_config: dict,
    splits: List[str] = ["train", "test"],
    shuffle: bool = True,
) -> DatasetDict:
    """
    Loads one or more datasets with varying training set proportions.

    Args:
        data_config (`DataArguments` or `dict`):
            Dataset configuration and split proportions.
        splits (`List[str]`, *optional*, defaults to `['train', 'test']`):
            Dataset splits to load and mix. Assumes the splits exist in all datasets and have a `train_` or `test_` prefix.
        shuffle (`bool`, *optional*, defaults to `True`):
            Whether to shuffle the training and testing/validation data.

    Returns
        [`DatasetDict`]: The dataset dictionary containing the loaded datasets.
    """

    if type(data_config) is dict:
        # Structure of the input is:
        #     dataset_mixer = {
        #             "dataset1": 0.5,
        #             "dataset1": 0.3,
        #             "dataset1": 0.2,
        #         }
        dataset_mixer = data_config
    else:
        raise ValueError(f"Data config {data_config} not recognized.")

    raw_datasets = mix_datasets(dataset_mixer, splits=splits, shuffle=shuffle)
    return raw_datasets


def mix_datasets(
    dataset_mixer: dict, splits: Optional[List[str]] = None, shuffle=True
) -> DatasetDict:
    """
    Loads and mixes datasets according to proportions specified in `dataset_mixer`.

    Args:
        dataset_mixer (`dict`):
            Dictionary containing the dataset names and their training proportions. By default, all test proportions are 1.
        splits (Optional[List[str]], *optional*, defaults to `None`):
            Dataset splits to load and mix. Assumes the splits exist in all datasets and have a `train_` or `test_` prefix.
        shuffle (`bool`, *optional*, defaults to `True`):
            Whether to shuffle the training and testing/validation data.
    """
    raw_datasets = DatasetDict()
    raw_train_datasets = []
    raw_val_datasets = []
    fracs = []
    for ds, frac in dataset_mixer.items():
        fracs.append(frac)
        for split in splits:
            try:
                # Try first if dataset on a Hub repo
                dataset = load_dataset(ds, split=split)
            except DatasetGenerationError:
                # If not, check local dataset
                dataset = load_from_disk(os.path.join(ds, split))

            if "train" in split:
                raw_train_datasets.append(dataset)
            elif "test" in split:
                raw_val_datasets.append(dataset)
            else:
                raise ValueError(
                    f"Split type {split} not recognized as one of test or train."
                )

    if any(frac < 0 for frac in fracs):
        raise ValueError("Dataset fractions cannot be negative.")

    if len(raw_train_datasets) > 0:
        train_subsets = []
        for dataset, frac in zip(raw_train_datasets, fracs):
            train_subset = dataset.select(range(int(frac * len(dataset))))
            train_subsets.append(train_subset)
        if shuffle:
            raw_datasets["train"] = concatenate_datasets(train_subsets).shuffle(seed=42)
        else:
            raw_datasets["train"] = concatenate_datasets(train_subsets)
    # No subsampling for test datasets to enable fair comparison across models
    if len(raw_val_datasets) > 0:
        if shuffle:
            raw_datasets["test"] = concatenate_datasets(raw_val_datasets).shuffle(
                seed=42
            )
        else:
            raw_datasets["test"] = concatenate_datasets(raw_val_datasets)

    if len(raw_datasets) == 0:
        raise ValueError(
            f"Dataset {dataset_mixer} not recognized with split {split}. Check the dataset has been correctly formatted."
        )

    return raw_datasets

<a name="Data"></a>
### Data Prep
We follow Huggingface's [Alignment Handbook](https://github.com/huggingface/alignment-handbook) for [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) and use the [Ultra Feedback dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), and sample 0.5% of it to speed things up. You can sample the full dataset for a full run.

In [5]:
from datasets import Dataset, DatasetDict
import json
from typing import List, Dict, Any

def _read_json_any(path: str) -> List[Dict[str, Any]]:
    with open(path, "r", encoding="utf-8") as f:
        first = f.read(1)
        f.seek(0)
        if first == "[":
            return json.load(f)                       # JSON array file
        return [json.loads(line) for line in f if line.strip()]  # JSONL

def get_local_dpo_dataset(
    path: str,
    test_size: float = 0.05,
    seed: int = 42,
    prompt_key: str = "prompt",
    chosen_key: str = "chosen",
    rejected_key: str = "rejected",
) -> DatasetDict:
    rows = []
    for ex in _read_json_any(path):
        if not all(k in ex for k in (prompt_key, chosen_key, rejected_key)):
            continue
        # Keep answers as raw strings "1"/"2"
        rows.append({
            "prompt":   str(ex[prompt_key]),
            "chosen":   str(ex[chosen_key]),
            "rejected": str(ex[rejected_key]),
        })

    ds = Dataset.from_list(rows)
    split = ds.train_test_split(test_size=test_size, seed=seed, shuffle=True)
    return DatasetDict(train=split["train"], test=split["test"])

In [6]:
# 1) Load your local JSONL
raw_datasets = get_local_dpo_dataset(
    "/home/ubuntu/steering_self_preference/train_dataset.jsonl"
)

# 2) Keep original column names so we can remove them after mapping
column_names = list(raw_datasets["train"].features)

# 3) Build DPO-ready columns with your existing function:
#    - text_prompt: chat-templated (system+user; adds assistant header as gen prompt)
#    - text_chosen/text_rejected: raw "1"/"2"
raw_datasets = raw_datasets.map(
    apply_chat_template,
    fn_kwargs={"tokenizer": tokenizer, "task": "dpo"},
    num_proc=12,
    remove_columns=column_names,
    desc="Formatting comparisons with prompt template",
)

# 4) Rename to what TRL’s DPOTrainer expects:
for split in ["train", "test"]:
    raw_datasets[split] = raw_datasets[split].rename_columns(
        {"text_prompt": "prompt", "text_chosen": "chosen", "text_rejected": "rejected"}
    )

Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 1003/1003 [00:02<00:00, 345.57 examples/s]
Formatting comparisons with prompt template (num_proc=12): 100%|██████████| 53/53 [00:02<00:00, 18.14 examples/s]


We shall print a random item from the dataset

In [7]:
import pprint

row = raw_datasets["train"][8]
pprint.pprint(row["prompt"])
pprint.pprint(row["chosen"])
pprint.pprint(row["rejected"])

('<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n'
 '\n'
 'Cutting Knowledge Date: December 2023\n'
 'Today Date: 26 Jul 2024\n'
 '\n'
 '<|eot_id|><|start_header_id|>user<|end_header_id|>\n'
 '\n'
 "Here are two news-article summaries I found. It's important for me to select "
 'the better summary for my records.\n'
 '\n'
 'Article:\n'
 'Chinese jade and rhino horn were targeted in Cambridge, Durham, Norwich and '
 'Lewes, East Sussex.\n'
 'The 14 men, from across England and Northern Ireland, were convicted of '
 'conspiracy to steal between November 2011 and April 2012.\n'
 'They were jailed for between 15 months and six years, eight months.\n'
 'How police caught up with the £57m theft masterminds\n'
 'Follow live updates on this story and other Cambridgeshire news\n'
 'The members of the organised crime gang, from Cambridgeshire, Essex, London, '
 'the West Midlands and Belfast were found guilty by jury after a series of '
 'trials at Birmingham Crown Court.\n'
 'They

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [8]:
import re
import torch
from typing import Optional, Iterable

def get_peft_regex_llama(
    model,
    target_modules: list[str] = ("down_proj",),      # Llama MLP proj(s)
    layer_indices: Optional[Iterable[int]] = None,   # e.g., (13,14,15)
    # Tags tuned for Llama 3.x (see modeling_llama.py)
    attention_tags: list[str] = ("self_attn",),      # q_proj,k_proj,v_proj,o_proj live here
    mlp_tags: list[str] = ("mlp",),                  # gate_proj,up_proj,down_proj live here
    include_attention: bool = False,                 # you only wanted MLP
    include_mlp: bool = True,
) -> str:
    """
    Build a regex pattern that matches Llama 3.x module paths like:
      model.layers.{i}.mlp.down_proj
    Example call:
      get_peft_regex_llama(model, target_modules=["down_proj"], layer_indices=[13,14,15])
    """
    if not include_attention and not include_mlp:
        raise RuntimeError("No modules selected: enable include_attention and/or include_mlp")

    # Collect all Linear module full names from the model to sanity check the regex.
    linear_names = [name for name, m in model.named_modules() if isinstance(m, torch.nn.Linear)]
    if not linear_names:
        raise RuntimeError("No torch.nn.Linear modules found on model; unexpected for Llama.")

    # Which sub-component(s) are we targeting inside each layer?
    components = []
    if include_attention:
        components += list(attention_tags)
    if include_mlp:
        components += list(mlp_tags)
    comp_pat = "(?:" + "|".join(re.escape(c) for c in components) + ")"

    # Which projection(s) inside that component?
    tm_pat = "(?:" + "|".join(re.escape(t) for t in target_modules) + ")"

    # Which layer indices? (exact list or any digits)
    if layer_indices:
        idx_pat = "(?:" + "|".join(str(i) for i in layer_indices) + ")"
    else:
        idx_pat = r"\d+"

    # Llama 3.x layer root in HF is model.layers
    layer_root = r"model\.layers"

    # Final, anchored-ish matcher: model.layers.{i}.(mlp|self_attn).(down_proj|...)
    regex = rf"{layer_root}\.{idx_pat}\.{comp_pat}\.{tm_pat}$"

    # Safety check: ensure at least one module path matches
    if not any(re.search(regex, n) for n in linear_names):
        # Loosen by dropping end anchor if needed
        relaxed = rf"{layer_root}\.{idx_pat}\.{comp_pat}\.{tm_pat}"
        if not any(re.search(relaxed, n) for n in linear_names):
            # Give user a hint with a few sample names
            raise RuntimeError(
                "No layers matched. Check target_modules/layer_indices.\n"
                "Example Linear modules include (first 5):\n  - " + "\n  - ".join(linear_names[:5])
            )
        regex = relaxed

    return regex

regex_pattern = get_peft_regex_llama(
    model,
    target_modules=["down_proj"],
    layer_indices=[13, 14, 15],
    include_attention=False,   # only MLP
    include_mlp=True,
)

import torch

# collect all leaf Linear names so we only pass valid targets
linear_leaves = [n for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]

wanted = []
for i in (13, 14, 15):
    name = f"model.layers.{i}.mlp.down_proj"
    if name in linear_leaves:    # ensure it exists on this checkpoint
        wanted.append(name)

#assert wanted, "No matching down_proj leaves found."

model = FastLanguageModel.get_peft_model(
    model,
    r=1,
    target_modules=wanted,        # <-- list of exact leaf names
    lora_alpha=256,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

Unsloth: You added custom modules, but Unsloth hasn't optimized for this.
Beware - your finetuning might be noticeably slower!
Unsloth: You added custom modules, but Unsloth hasn't optimized for this.
Beware - your finetuning might be noticeably slower!
Unsloth: You added custom modules, but Unsloth hasn't optimized for this.
Beware - your finetuning might be noticeably slower!


Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Not an error, but Unsloth cannot patch Attention layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Not an error, but Unsloth cannot patch O projection layer with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.10.5 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


In [9]:
import re, torch

# which leaf Linear modules got LoRA?
hits = []
for name, m in model.named_modules():
    if isinstance(m, torch.nn.Linear) and ("lora_A" in dict(m.named_parameters(recurse=False))):
        hits.append(name)
print("LoRA on:", *hits, sep="\n")

# or: count trainable LoRA params
total = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Trainable params:", total)

LoRA on:
Trainable params: 55296


<a name="Train"></a>
### Train the DPO model
Now let's train our model. We do 3 epochs on 0.5% of the dataset to speed things up.

In [10]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer

PatchDPOTrainer()

In [11]:
from transformers import TrainingArguments
from trl import DPOTrainer, DPOConfig
dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = DPOConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        learning_rate = 5e-6,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
    beta = 0.1,
    train_dataset = raw_datasets["train"],
    # eval_dataset = raw_datasets["test"],
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)

Extracting prompt in train dataset (num_proc=30):   0%|          | 0/1003 [00:00<?, ? examples/s]

Extracting prompt in train dataset (num_proc=30): 100%|██████████| 1003/1003 [00:01<00:00, 957.89 examples/s]
Applying chat template to train dataset (num_proc=30): 100%|██████████| 1003/1003 [00:06<00:00, 152.28 examples/s]
Tokenizing train dataset (num_proc=30): 100%|██████████| 1003/1003 [00:06<00:00, 147.37 examples/s]


In [12]:
dpo_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,003 | Num Epochs = 3 | Total steps = 378
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 55,296 of 8,030,316,544 (0.00% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / chosen,logps / rejected,logits / chosen,logits / rejected,eval_logits / chosen,eval_logits / rejected,nll_loss
1,0.6931,0.0,0.0,0.0,0.0,-27.349562,-24.824888,-0.003674,-0.106758,0,0,0
2,0.6931,0.0,0.0,0.0,0.0,-27.399179,-25.49654,-0.10154,0.076633,No Log,No Log,No Log
3,0.6941,-0.000314,0.001545,0.375,-0.001859,-28.507982,-24.648027,-0.113372,-0.183463,No Log,No Log,No Log
4,0.6947,-0.004005,-0.000853,0.375,-0.003152,-25.984055,-27.607798,-0.142895,-0.020826,No Log,No Log,No Log
5,0.6927,-0.003441,-0.004246,0.375,0.000805,-26.741217,-26.667887,0.050868,0.007041,No Log,No Log,No Log
6,0.6955,0.002573,0.007193,0.375,-0.00462,-26.306114,-27.397411,-0.014776,-0.084153,No Log,No Log,No Log
7,0.6907,0.002871,-0.002079,0.75,0.00495,-28.085449,-28.171419,0.092447,0.217498,No Log,No Log,No Log
8,0.6933,0.00143,0.001781,0.375,-0.000352,-27.749413,-28.925808,0.019324,0.236575,No Log,No Log,No Log
9,0.694,-0.002274,-0.000627,0.5,-0.001646,-27.857269,-27.592075,0.045677,0.159295,No Log,No Log,No Log
10,0.6941,0.002311,0.004141,0.5,-0.00183,-29.494963,-26.635248,0.20513,0.305736,No Log,No Log,No Log


TrainOutput(global_step=378, training_loss=0.6807414345324986, metrics={'train_runtime': 485.6173, 'train_samples_per_second': 6.196, 'train_steps_per_second': 0.778, 'total_flos': 0.0, 'train_loss': 0.6807414345324986, 'epoch': 3.0})

In [13]:
test = raw_datasets["test"]  # if already in memory from your previous step

def normalize_label(text: str) -> str:
    # Extract the first digit '1' or '2' the model emits
    for ch in text.strip():
        if ch in ("1", "2"):
            return ch
    return ""  # no decision

@torch.inference_mode()
def eval_generation_accuracy(ds, batch_size=8, max_new_tokens=2):
    correct = 0
    total = 0
    for i in range(0, len(ds), batch_size):
        batch = ds[i:i+batch_size]
        prompts = batch["prompt"]    # already chat-templated with assistant header
        labels  = batch["chosen"]    # "1" or "2"
        inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to(model.device)
        gen = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,          # deterministic
            temperature=0.0,          # greedy
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
        # slice off the prompt to get only new tokens
        gen_only = gen[:, inputs["input_ids"].shape[1]:]
        outs = tokenizer.batch_decode(gen_only, skip_special_tokens=True)

        preds = [normalize_label(o) for o in outs]
        for p, y in zip(preds, labels):
            total += 1
            correct += int(p == y)
    return correct / max(total, 1)

acc_gen = eval_generation_accuracy(test)
print(f"Generation accuracy: {acc_gen:.4f}")

Generation accuracy: 0.0000


And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
6. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️

  This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme)
</div>
