To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our documentation [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News


Introducing Unsloth [Standby for RL](https://docs.unsloth.ai/basics/memory-efficient-rl): GRPO is now faster, uses 30% less memory with 2x longer context.

Gpt-oss fine-tuning now supports 8× longer context with 0 accuracy loss. [Read more](https://docs.unsloth.ai/basics/long-context-gpt-oss-training)

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [10]:
%%capture
# We're installing the latest Torch, Triton, OpenAI's Triton kernels, Transformers and Unsloth!
!pip install --upgrade -qqq uv
try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
except: get_numpy = "numpy"
!uv pip install -qqq \
    "torch>=2.8.0" "triton>=3.4.0" {get_numpy} torchvision bitsandbytes "transformers>=4.55.3" \
    "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
    "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
    git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels
!uv pip install transformers==4.55.4

### Unsloth

We're about to demonstrate the power of the new OpenAI GPT-OSS 20B model through a finetuning example. To use our `MXFP4` inference example, use this [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/GPT_OSS_MXFP4_(20B)-Inference.ipynb) instead.

In [None]:
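# ✅ 7B 4-bit (GPU only): the most stable option on a T4
import os
# Set the allocator config before torch/unsloth are imported so CUDA picks it up.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,max_split_size_mb:128"

import torch
from unsloth import FastLanguageModel

model_name = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"  # or "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
max_seq_length = 1024  # Start at 1024; raise it once you have memory headroom
dtype = None  # None lets Unsloth auto-detect (float16 on a T4)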

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name      = model_name,
    dtype           = dtype,
    max_seq_length  = max_seq_length,
    load_in_4bit    = True,
    full_finetuning = False,
    device_map      = {"": 0},   # 🔒 4-bit weights cannot be offloaded, so keep everything on the GPU
)

print("✅ Loaded:", model_name, "| device:", getattr(model, "device", "cuda:0"))


==((====))==  Unsloth 2025.9.1: Fast Qwen2 patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



We now add LoRA adapters for parameter-efficient finetuning, so that we only need to train roughly 1% of all parameters.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
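
The small trainable fraction follows from LoRA's arithmetic: each adapted weight of shape `d_out x d_in` stays frozen, and only two thin matrices of rank `r` are trained. Below is a minimal sketch of that count for a single layer. The 4096 x 4096 size is a hypothetical example, not a value read from the loaded model, and the helper names are made up for illustration.

```python
# Rough LoRA arithmetic for one linear layer (illustrative sizes only).

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by LoRA: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * (d_in + d_out)

def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight (bias ignored)."""
    return d_in * d_out

# Hypothetical square projection; r = 8 matches the cell above.
d_in, d_out, r = 4096, 4096, 8
added = lora_param_count(d_in, d_out, r)
frozen = full_param_count(d_in, d_out)
print(f"LoRA adds {added:,} params vs {frozen:,} frozen ({added / frozen:.2%})")
# LoRA adds 65,536 params vs 16,777,216 frozen (0.39%)
```

The exact percentage for the whole model depends on `r` and on which `target_modules` are adapted, so treat the figure above as an order-of-magnitude check.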

In [None]:
# === Collect 10 wrong answers from GSM8K (test split)
#     — robust to format/token-length; only truly-wrong items are kept ===
import re, json, random, torch
from fractions import Fraction
from decimal import Decimal, InvalidOperation

# =========================
# PRE-FLIGHT (required)
# =========================
# 1) Check that the model and tokenizer exist
if "model" not in globals() or "tokenizer" not in globals() or model is None:
    raise RuntimeError("Run the model-loading cell first (model and tokenizer are required).")

# 2) Backfill model.device (in case of a plain HF fallback)
if not hasattr(model, "device"):
    try:
        model.device = next(model.parameters()).device
    except Exception:
        pass

# 3) Chat-template fallback (in case this is a plain HF tokenizer)
if not hasattr(tokenizer, "apply_chat_template") or tokenizer.apply_chat_template is None:
    def _simple_chat_template(messages, add_generation_prompt=True,
                              return_tensors="pt", return_dict=True, **kw):
        # A very simple [INST]-style (Llama 2 / Mistral instruct) template
        sys = ""
        text = ""
        msgs = list(messages)
        if msgs and msgs[0].get("role") == "system":
            sys = (msgs[0].get("content") or "").strip()
            msgs = msgs[1:]

        for m in msgs:
            role = m.get("role")
            content = (m.get("content") or "").strip()
            if role == "user":
                if sys:
                    content = f"<<SYS>>\n{sys}\n<</SYS>>\n\n{content}"
                    sys = ""
                text += f"[INST] {content} [/INST]"
            elif role == "assistant":
                text += content

        enc = tokenizer(text, return_tensors=return_tensors)
        if return_dict:
            return enc
        return enc.input_ids
    tokenizer.apply_chat_template = _simple_chat_template

# 4) Safe generate wrapper to avoid dtype clashes caused by Unsloth's BF16 autocast
def _safe_generate_call(model, **kwargs):
    """
    Bypass the fast_generate that Unsloth swaps in and call the original
    HF generate (_old_generate) with autocast disabled.
    """
    gen_fn = getattr(model, "_old_generate", None)
    if not callable(gen_fn):
        gen_fn = model.generate
    # Inject eos_token_id if the caller did not supply one
    eos_id = getattr(tokenizer, "eos_token_id", None)
    if eos_id is not None and "eos_token_id" not in kwargs:
        kwargs["eos_token_id"] = eos_id
    try:
        # Updated autocast call
        with torch.amp.autocast(device_type="cuda", enabled=False):
            return gen_fn(**kwargs)
    except Exception:
        return gen_fn(**kwargs)

# =========================
# Utilities / parsers (robust to format and length)
# =========================
def _clean(s): return (s or "").strip()

BOXED_RE = re.compile(r"\\boxed\{([^}]*)\}")
HASH_RE  = re.compile(r"####\s*([^\n]+)")
FINAL_RE = re.compile(r"(?:final answer|answer is|ans(?:wer)?\:?)\s*[:\-]?\s*([^\n]+)", re.I)
NUMBER_RE = re.compile(r"-?\d+/\d+|-?\d+(?:\.\d+)?")  # a fraction, or a decimal/integer

def normalize(s: str) -> str:
    s = _clean(str(s))
    m = BOXED_RE.search(s)
    if m: s = m.group(1)
    s = re.sub(r"[,\s]+","",s)
    s = re.sub(r"[.:;]+$","",s)
    return s.lower()

def extract_final(text: str) -> str:
    t = _clean(text)
    for rx in (HASH_RE, BOXED_RE, FINAL_RE):
        m = rx.search(t)
        if m: return _clean(m.group(1))
    nums = NUMBER_RE.findall(t)
    if nums: return _clean(nums[-1])  # fall back to the last number in the text
    lines = [l for l in t.splitlines() if _clean(l)]
    return _clean(lines[-1]) if lines else t

def candidates_from_text(text: str):
    return [normalize(x) for x in NUMBER_RE.findall(text or "")]

def as_fraction(x: str):
    x = _clean(x)
    try:
        if "/" in x: return Fraction(x)
        return Fraction(Decimal(x))
    except (InvalidOperation, ZeroDivisionError, ValueError):
        return None

def numerically_equal(a: str, b: str) -> bool:
    fa, fb = as_fraction(a), as_fraction(b)
    return (fa is not None and fb is not None and fa == fb)

def is_correct(pred_text: str, gold_text: str) -> bool:
    pred_final = extract_final(pred_text)
    gold = normalize(gold_text)
    pred_norm = normalize(pred_final)
    if pred_norm and gold and (pred_norm == gold or numerically_equal(pred_norm, gold)):
        return True
    cand = candidates_from_text(pred_text)
    if gold in cand:
        return True
    for c in cand:
        if numerically_equal(c, gold):
            return True
    return False

def truncate_after_hashline(text: str) -> str:
    t = text or ""
    m = HASH_RE.search(t)
    if not m: return t
    end = t.find("\n", m.start())
    return t if end == -1 else t[:end]

# =========================
# Generation wrapper (forces a final line: "#### <number>")
# =========================
def ask_only_new_tokens(question, effort="medium", max_new_tokens=256):
    sys_prompt = (
        "reasoning language: English\n"
        "You are a helpful math assistant.\n"
        "Show concise reasoning, THEN end with exactly ONE final line:\n"
        "#### <final numeric answer>\n"
        "No words, units, or extra lines after that final line."
    )
    msgs = [
        {"role":"system","content":sys_prompt},
        {"role":"user","content": question},
    ]
    inputs = tokenizer.apply_chat_template(
        msgs,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
        reasoning_effort=effort,
    ).to(model.device)

    model.eval()
    with torch.no_grad():
        outputs = _safe_generate_call(
            model,
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
        )

    gen_ids = outputs[:, inputs["input_ids"].shape[1]:]
    pred_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)[0]
    pred_text = truncate_after_hashline(pred_text)

    # Fallback: if there is no '####', retry in a minimal strict mode
    if "####" not in pred_text:
        strict_msgs = [
            {"role":"system","content":"Output only the final numeric answer on one line prefixed by '#### '. No other text."},
            {"role":"user","content": question},
        ]
        strict_inputs = tokenizer.apply_chat_template(
            strict_msgs,
            add_generation_prompt=True,
            return_tensors="pt",
            return_dict=True,
            reasoning_effort="low",
        ).to(model.device)
        with torch.no_grad():
            strict_out = _safe_generate_call(
                model,
                **strict_inputs,
                max_new_tokens=64,
                do_sample=False,
            )
        gen_ids2 = strict_out[:, strict_inputs["input_ids"].shape[1]:]
        pred_text2 = tokenizer.batch_decode(gen_ids2, skip_special_tokens=True)[0]
        pred_text = truncate_after_hashline(pred_text2) or pred_text

    return pred_text

# =========================
# Load the data & collect 10 wrong answers
# =========================
try:
    from datasets import load_dataset
except Exception:
    raise RuntimeError("`datasets` is required. Run the installation cell first.")

ds = load_dataset("openai/gsm8k", "main", split="test")

random.seed(3407)
indices = list(range(len(ds)))
random.shuffle(indices)

wrong = []
for i in indices:
    q   = _clean(ds[i]["question"])
    sol = _clean(ds[i]["answer"])
    gold = extract_final(sol)

    pred_raw = ask_only_new_tokens(q, effort="medium", max_new_tokens=256)

    # ✨ Skip anything that is mathematically correct, regardless of format or truncation
    if is_correct(pred_raw, gold):
        # Uncomment the line below to see progress
        # print(f"[GSM8K] idx={i} correct, skipped")
        continue

    pred_final = extract_final(pred_raw)
    wrong.append({
        "dataset": "GSM8K",
        "idx": i,
        "question": q,
        "solution": sol,        # full worked solution; handy as an SFT target
        "gold": gold,
        "pred_before": pred_final,
        "raw_before": pred_raw,
        "gen": {"effort":"medium","max_new_tokens":256}
    })
    print(f"[GSM8K] collected wrong #{len(wrong)} (idx={i})")

    if len(wrong) >= 10:
        break

# Save
out_path = "wrong_gsm8k_10.jsonl"
with open(out_path, "w", encoding="utf-8") as f:
    for r in wrong:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

print(f"✅ Saved {len(wrong)} items to {out_path}")
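
The grading logic in the cell above comes down to two steps: pull the text after the `####` marker, then compare numbers exactly rather than as strings. Here is a stripped-down, standalone sketch of that idea. The helper names `final_answer` and `same_number` are hypothetical and simpler than the cell's `extract_final` and `is_correct`, and the two sample strings are made up.

```python
import re
from fractions import Fraction

HASH_LINE = re.compile(r"####\s*([^\n]+)")

def final_answer(text: str) -> str:
    """Return the text after the first '####' marker, or '' if there is none."""
    match = HASH_LINE.search(text)
    return match.group(1).strip() if match else ""

def same_number(a: str, b: str) -> bool:
    """Compare two numeric strings exactly, ignoring thousands separators."""
    try:
        return Fraction(a.replace(",", "")) == Fraction(b.replace(",", ""))
    except (ValueError, ZeroDivisionError):
        return False  # not parseable as a number

# Made-up reference solution and model output in GSM8K's '#### <answer>' style.
gold = "She sold 48 + 24 = 72 clips.\n#### 72"
pred = "48 in April and half as many in May, so 24 more.\n#### 72.0"

print(same_number(final_answer(pred), final_answer(gold)))  # True
```

Comparing through `Fraction` is what lets `72`, `72.0`, and `144/2` count as the same answer, which a plain string comparison would reject.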

### Reasoning Effort
The `gpt-oss` models from OpenAI include a feature that lets users adjust the model's "reasoning effort." This gives you control over the trade-off between the model's performance and its response speed (latency), which is determined by the number of tokens the model uses to think.

----

The `gpt-oss` models offer three distinct levels of reasoning effort you can choose from:

* **Low**: Optimized for tasks that need very fast responses and don't require complex, multi-step reasoning.
* **Medium**: A balance between performance and speed.
* **High**: Provides the strongest reasoning performance for tasks that require it, though this results in higher latency.
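
Because higher effort means more thinking tokens, it helps to scale the generation budget with the effort level. The sketch below shows one way to keep the two in sync. The budgets in `TOKEN_BUDGET` and the helper `generation_kwargs` are hypothetical placeholders, not values recommended by OpenAI or Unsloth.

```python
# Hypothetical token budgets per effort level; tune these for your own prompts.
TOKEN_BUDGET = {"low": 256, "medium": 1024, "high": 2048}

def generation_kwargs(effort: str) -> dict:
    """Return keyword arguments for a generate call at the given effort level."""
    if effort not in TOKEN_BUDGET:
        raise ValueError(f"effort must be one of {sorted(TOKEN_BUDGET)}")
    return {"max_new_tokens": TOKEN_BUDGET[effort], "do_sample": False}

for level in ("low", "medium", "high"):
    print(level, generation_kwargs(level))
```

The returned dictionary can be unpacked into `model.generate(**inputs, **generation_kwargs("medium"))`, with the same `effort` string passed as `reasoning_effort` to `apply_chat_template`.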

In [None]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "low", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))

Changing `reasoning_effort` to `medium` makes the model think longer. The longer reasoning needs a larger `max_new_tokens` budget (the cell below still uses 64, so raise it if the output is cut off), but it usually yields better, more accurate answers.

In [None]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))

Lastly, we test it with `reasoning_effort` set to `high`.

In [None]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "high", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))

<a name="Data"></a>
### Data Prep

We use the `HuggingFaceH4/Multilingual-Thinking` dataset as our example. This dataset, available on Hugging Face, contains chain-of-thought reasoning examples derived from user questions that were translated from English into four other languages. It is the same dataset referenced in OpenAI's [cookbook](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers) for fine-tuning. The goal is for the model to learn to reason in these four languages.

In [None]:
def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")
dataset

To format our dataset, we apply our version of the GPT-OSS prompt.

In [None]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)
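
With `batched = True`, the mapping function receives each column as a list (one entry per row) and must return lists of the same length. The sketch below illustrates that contract without loading a model. `toy_template` is a made-up stand-in for `tokenizer.apply_chat_template(..., tokenize = False)`, and the two conversations are invented.

```python
def toy_template(convo):
    """Stand-in for a chat template: wrap each message in role tags."""
    return "".join(f"<{m['role']}>{m['content']}</{m['role']}>\n" for m in convo)

def format_batch(examples):
    # In batched mode, examples["messages"] is a list of conversations.
    return {"text": [toy_template(convo) for convo in examples["messages"]]}

# A hand-built batch of two conversations, shaped like a batched map() input.
batch = {
    "messages": [
        [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
        [{"role": "user", "content": "2+2?"}, {"role": "assistant", "content": "4"}],
    ]
}

out = format_batch(batch)
print(len(out["text"]))  # one formatted string per conversation
print(out["text"][0])
```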

Let's take a look at the dataset and check what the first example shows.

In [None]:
print(dataset[0]['text'])

What is unique about GPT-OSS is that it uses the OpenAI [Harmony](https://github.com/openai/harmony) format, which supports conversation structures, reasoning output, and tool calling.
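
To give a feel for how reasoning and the final answer are kept apart, here is a simplified sketch that renders turns in a Harmony-like layout. It is an illustration only: `render_turn` is a hypothetical helper, the special-token spelling reflects my reading of the Harmony repository, and the real chat template adds system and developer headers that are omitted here. The text printed by `dataset[0]['text']` is the authoritative form.

```python
from typing import Optional

def render_turn(role: str, content: str, channel: Optional[str] = None) -> str:
    """Render one turn in a simplified Harmony-like layout (illustrative only)."""
    header = role if channel is None else f"{role}<|channel|>{channel}"
    return f"<|start|>{header}<|message|>{content}<|end|>"

conversation = [
    render_turn("user", "What is 2 + 2?"),
    # Reasoning goes to a separate channel from the user-visible answer.
    render_turn("assistant", "Simple addition: 2 + 2 = 4.", channel="analysis"),
    render_turn("assistant", "4", channel="final"),
]
print("\n".join(conversation))
```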

<a name="Train"></a>
### Train the model
Now let's train our model. We do 30 steps to speed things up, but you can set `num_train_epochs=1` for a full run and set `max_steps=None` to turn off the step limit.

In [None]:
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)
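
The batch settings above determine how much data a short run actually sees. A quick check of the arithmetic, assuming a single GPU (the `num_gpus` variable is an assumption added for illustration; the other values are copied from the config):

```python
# Values copied from the SFTConfig above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 30
num_gpus = 1  # assumption: one T4

# Gradients from several small batches are summed before each optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = effective_batch * max_steps

print(f"Effective batch size: {effective_batch}")    # 4
print(f"Examples seen in this run: {examples_seen}")  # 120
```

Gradient accumulation trades wall-clock time for memory: the optimizer behaves as if the batch size were 4, while only one sequence has to fit on the GPU at a time.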

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's run the model! You can change the system prompt and the user message below.

In [None]:
messages = [
    {"role": "system", "content": "reasoning language: French\n\nYou are a helpful assistant that can solve mathematical problems."},
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium",
).to(model.device)
from transformers import TextStreamer
_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** Currently, finetunes can only be loaded via Unsloth. We're working on vLLM and GGUF exporting!

In [None]:
model.save_pretrained("finetuned_model")
# model.push_to_hub("hf_username/finetuned_model", token = "hf_...") # Save to HF

To run the finetuned model, you can do the below after setting `if False` to `if True` in a new instance.

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "finetuned_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 1024,
        dtype = None,
        load_in_4bit = True,
    )

messages = [
    {"role": "system", "content": "reasoning language: French\n\nYou are a helpful assistant that can solve mathematical problems."},
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "high",
).to(model.device)
from transformers import TextStreamer
_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

