### Installation

Run this cell as the article instructs to set up your virtual environment.

In [1]:
%%capture
# We're installing the latest Torch, Triton, OpenAI's Triton kernels, Transformers and Unsloth!
!pip install --upgrade -qqq uv
try: import numpy; install_numpy = f"numpy=={numpy.__version__}"
except: install_numpy = "numpy"
!uv pip install -qqq \
    "torch>=2.8.0" "triton>=3.4.0" {install_numpy} \
    "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
    "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
    torchvision bitsandbytes \
    git+https://github.com/huggingface/transformers \
    git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
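
Once the install cell finishes, it can help to confirm which versions actually landed, since the load log further down shows an xformers build mismatch warning. A minimal sketch using only the standard library; the helper name `installed_version` is ours and not part of any of these packages:

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Package names taken from the install command above.
for name in ["torch", "triton", "transformers", "unsloth", "numpy"]:
    print(f"{name:>12}: {installed_version(name) or 'not installed'}")
```

Comparing this output against the banner that Unsloth prints at load time is a quick way to spot a stale environment.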


### Unsloth

We're about to demonstrate the power of the new OpenAI GPT-OSS 20B model through an inference example.

In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 4096
dtype = None

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/gpt-oss-20b-unsloth-bnb-4bit", # 20B model using bitsandbytes 4bit quantization
    "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
    "unsloth/gpt-oss-20b", # 20B model using MXFP4 format
    "unsloth/gpt-oss-120b",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    dtype = dtype, # None for auto detection
    max_seq_length = max_seq_length, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm
    PyTorch 2.7.1+cu126 with CUDA 1208 (you have 2.8.0+cu128)
    Python  3.9.23 (you have 3.11.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


==((====))==  Unsloth 2025.8.5: Fast Gpt_Oss patching. Transformers: 4.56.0.dev0.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.988 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards: 100%|██████████| 4/4 [00:25<00:00,  6.49s/it]


We now add LoRA adapters for parameter-efficient finetuning. This lets us train only a small fraction of the parameters (about 0.02% in this run, according to the trainer banner further down) instead of the full model.

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Making `model.base_model.model.model` require gradients
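
To get a feel for why so few parameters end up trainable, here is a rough sketch of standard LoRA arithmetic: a rank-`r` adapter on a `d_out x d_in` linear layer adds two small matrices with `r * (d_in + d_out)` weights in total, and the update is scaled by `lora_alpha / r` (or `lora_alpha / sqrt(r)` with rank-stabilized LoRA). The 4096 x 4096 layer shape is a made-up illustration, not a GPT-OSS dimension, and both helper names are ours:

```python
import math

def lora_param_count(d_in, d_out, r):
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scale(lora_alpha, r, use_rslora=False):
    """Scaling of the low-rank update: alpha / r, or alpha / sqrt(r) for rsLoRA."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

# Hypothetical 4096 x 4096 projection, purely for illustration.
print(lora_param_count(4096, 4096, r=8))  # 65536 adapter weights vs 16,777,216 frozen ones
print(lora_scale(lora_alpha=16, r=8))     # 2.0

# Totals reported by the trainer banner later in this run.
trainable, total = 3_981_312, 20_918_738_496
print(f"{100 * trainable / total:.3f}% of parameters are trainable")
```

Raising `r` grows the adapter linearly, which is why the suggested values stay small relative to the layer width.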


### Reasoning Effort
The `gpt-oss` models from OpenAI include a feature that allows users to adjust the model's "reasoning effort." This gives you control over the trade-off between the model's performance and its response speed (latency), which is governed by the number of tokens the model uses to think.

----

The `gpt-oss` models offer three distinct levels of reasoning effort you can choose from:

* **Low**: Optimized for tasks that need very fast responses and don't require complex, multi-step reasoning.
* **Medium**: A balance between performance and speed.
* **High**: Provides the strongest reasoning performance for tasks that require it, though this results in higher latency.
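
Because higher effort means more thinking tokens, the generation budget usually has to grow with it. A small sketch of that pairing, using the `max_new_tokens` values this notebook happens to pick for each level (512, 1024, 2048). These are choices made in this walkthrough, not limits defined by the model, and `generation_budget` is a hypothetical helper:

```python
# Budgets mirror the max_new_tokens values used in the cells of this notebook.
TOKEN_BUDGETS = {"low": 512, "medium": 1024, "high": 2048}

def generation_budget(reasoning_effort):
    """Return a max_new_tokens value for a given reasoning effort level."""
    try:
        return TOKEN_BUDGETS[reasoning_effort]
    except KeyError:
        valid = ", ".join(TOKEN_BUDGETS)
        raise ValueError(f"reasoning_effort must be one of: {valid}") from None

for effort in TOKEN_BUDGETS:
    print(f"{effort:>6}: max_new_tokens = {generation_budget(effort)}")
```

Even these budgets can be too small: a response that stops mid-analysis has hit the token limit before reaching the final channel.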

In [4]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "low", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 512, streamer = TextStreamer(tokenizer))

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-13

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>Equation: x^5 + 3x^4 - 10 = 3 => x^5 + 3x^4 - 13 = 0. Need roots. Try integer roots ±1, ±13. f(1)=1+3-13=-9. f(-1)= -1+3-13=-11. f(13?) huge. Maybe use numerical methods. Maybe factor (x^4)(x+3)-13... not. Probably approximate root around something. Let's find approximate root. Try x=1: -9. x=2: 32+48-13=67. so root between 1 and 2. x=1.5: 7.59+11.4-13=6-? compute:1.5^5=7.59375, 3*1.5^4=3*5.0625=15.1875 sum=22.78125-13=9.78125 positive. So root between 1 and 1.2? Actually at 1.2: 1.2^5=2.488, 1.2^4=2.0736*3=6.2208 sum=8.7088-13=-4.2912. at 1.3:1.3^5=3.712,1.3^4=2.856*3=

Changing `reasoning_effort` to `medium` makes the model think longer. We have to increase `max_new_tokens` to accommodate the extra generated tokens, but in return the answer should be better and more accurate.

In [5]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 1024, streamer = TextStreamer(tokenizer))

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-13

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>The user says: "Solve x^5 + 3x^4 - 10 = 3." This appears to be an equation: x^5 + 3x^4 - 10 = 3. Solve for x, presumably real solutions. Let's rewrite: x^5 + 3x^4 - 10 = 3. So x^5 + 3x^4 - 13 = 0. So we need roots of polynomial f(x) = x^5 + 3x^4 - 13 = 0. Solve? Hard to factor? Let's attempt. The polynomial is degree 5. Let's see if any rational roots? Rational root test: divisors of constant term: ±1, ±13. Try x=1: 1 + 3 - 13 = -9 ≠ 0. x= -1: -1 + 3 - 13 = -11. x=13: huge positive. x=-13: negative large. x=... none obvious. What about x=2: 32 + 48 - 13 = 67. x = -2:

Lastly, we test with `reasoning_effort` set to `high`.

In [6]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "high", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 2048, streamer = TextStreamer(tokenizer))

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-13

Reasoning: high

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>We need to solve the equation:

x^5 + 3x^4 - 10 = 3. So we want solve for real x? The problem states "Solve x^5 + 3x^4 - 10 = 3." It's likely a math problem: find x such that x^5 + 3x^4 - 10 = 3. That is a polynomial equation. We need to solve for x. Rearranging: x^5 + 3x^4 - 10 = 3 => x^5 + 3x^4 - 10 - 3 = 0 => x^5 + 3x^4 - 13 = 0. So we have polynomial: x^5 + 3x^4 - 13 = 0. Solve for real roots? This is a quintic polynomial; there's likely no explicit closed form for general quintic, but might be simplified or factorable. Let's check if there's a rational root via ra

<a name="Data"></a>
### Data Prep

We use the "HuggingFaceH4/Multilingual-Thinking" dataset (from Hugging Face Datasets), which contains 1,000 examples of user questions with chain-of-thought reasoning translated into 4 languages (derived from English originals). This dataset is inspired by OpenAI's fine-tuning cookbook for GPT-OSS.

### Pre-processing steps applied to the dataset:

1. Loading the Dataset:

   * Loads the "train" split: `dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")`.
   * Features include: `'reasoning_language', 'developer', 'user', 'analysis', 'final', 'messages'`.
   * 'messages' is a list of conversation turns (e.g., system prompt, user query, assistant analysis/final response).


2. Standardization:

   * Calls `standardize_sharegpt(dataset)` from Unsloth's chat templates module. This converts the dataset into a standardized ShareGPT format (a common JSON-like structure for chat datasets, with roles like "system", "user", "assistant").


3. Formatting with Chat Template:

   * Defines a mapping function: `formatting_prompts_func(examples)`.

     * Takes the 'messages' column (conversations).
     * Applies the model's chat template: `tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)`.

       * This wraps each conversation in the GPT-OSS-specific Harmony format (e.g., `<|start|>role<|message|>content<|end|>`), including system prompts for reasoning effort, channels (analysis/commentary/final), and tool calls.
       * No tokenization here (just text formatting); tokenization happens during training.


     * Adds a new column 'text' with the formatted strings.


   * Applies this batched: `dataset = dataset.map(formatting_prompts_func, batched=True)`.


4. No Other Changes:

   * No filtering, augmentation, or shuffling beyond this.
   * The dataset remains 1,000 rows, now with a 'text' column ready for supervised fine-tuning (where the model learns to predict the full conversation text).

Example of a formatted entry (from `dataset[0]['text']` in the notebook):

* It includes a system prompt, developer instructions (e.g., "reasoning language: French"), user query, and assistant responses in channels like `<|channel|>analysis<|message|>` (in French) and `<|channel|>final<|message|>` (in English).

This prepares the data for the trainer, ensuring it matches the model's expected input format for multilingual reasoning.
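
The batched mapping in step 3 can be illustrated without downloading a model. In batched mode the mapping function receives a dict of columns (each a list) and returns a dict of new columns. In the sketch below, `render_conversation` is a deliberately simplified stand-in for `tokenizer.apply_chat_template`: it omits the system prompt and channels that the real template adds, and both function names are hypothetical:

```python
def render_conversation(convo):
    """Toy stand-in for tokenizer.apply_chat_template(convo, tokenize=False)."""
    return "".join(
        f"<|start|>{turn['role']}<|message|>{turn['content']}<|end|>" for turn in convo
    )

def format_batch(examples):
    # With batched=True, `examples` maps each column name to a list of values.
    texts = [render_conversation(convo) for convo in examples["messages"]]
    # Returning a dict of equal-length lists adds (or overwrites) those columns.
    return {"text": texts}

batch = {
    "messages": [
        [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
        [{"role": "user", "content": "2 + 2?"}],
    ]
}
formatted = format_batch(batch)
print(formatted["text"][1])  # <|start|>user<|message|>2 + 2?<|end|>
```

The function returns one string per input conversation, which is why the row count is unchanged after mapping.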

In [7]:
def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")
dataset

Dataset({
    features: ['reasoning_language', 'developer', 'user', 'analysis', 'final', 'messages'],
    num_rows: 1000
})

To format our dataset, we apply our version of the GPT-OSS chat template.

In [8]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Let's take a look at the dataset and check what the first example shows.

In [9]:
print(dataset[0]['text'])

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-13

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions

reasoning language: French

You are an AI chatbot with a lively and energetic personality.<|end|><|start|>user<|message|>Can you show me the latest trends on Twitter right now?<|end|><|start|>assistant<|channel|>analysis<|message|>D'accord, l'utilisateur demande les tendances Twitter les plus récentes. Tout d'abord, je dois vérifier si j'ai accès à des données en temps réel. Étant donné que je ne peux pas naviguer sur Internet ou accéder directement à l'API de Twitter, je ne peux pas fournir des tendances en direct. Cependant, je peux donner quelques conseils généraux sur la façon de les trouver.

Je devrais préciser que les 

What is unique about GPT-OSS is that it uses the OpenAI [Harmony](https://github.com/openai/harmony) format, which supports conversation structures, reasoning output, and tool calling.
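
To make the structure concrete, here is a rough sketch that splits Harmony-style text into role, channel, and content, based only on the token layout visible in the outputs above. It is a simplified regex, not the official Harmony parser, and `parse_harmony` is a hypothetical helper; tool calls and other header fields are not handled:

```python
import re

# One message: <|start|>role[<|channel|>name]<|message|>content, closed by
# <|end|> or <|return|>, or left open when generation was cut off.
HARMONY_MESSAGE = re.compile(
    r"<\|start\|>(?P<role>[^<]+?)"
    r"(?:<\|channel\|>(?P<channel>[^<]+?))?"
    r"<\|message\|>(?P<content>.*?)"
    r"(?=<\|end\|>|<\|return\|>|<\|start\|>|\Z)",
    re.DOTALL,
)

def parse_harmony(text):
    """Split Harmony-formatted text into a list of {role, channel, content} dicts."""
    return [match.groupdict() for match in HARMONY_MESSAGE.finditer(text)]

sample = (
    "<|start|>user<|message|>Solve x + 1 = 3.<|end|>"
    "<|start|>assistant<|channel|>analysis<|message|>Subtract 1 from both sides.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>x = 2"
)
for message in parse_harmony(sample):
    print(message["role"], message["channel"], repr(message["content"]))
```

Checking whether any parsed message has `channel == "final"` is a simple way to tell if a generation finished its reasoning or was truncated by `max_new_tokens`.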

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step cap with `max_steps=None`.

In [10]:
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 16,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        logging_steps = 20,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)
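
A quick back-of-the-envelope check of what these settings imply, using the values from the config above plus the dataset size (1,000 rows) and the single GPU reported in the logs. The rounding of steps per epoch is an assumption here; the trainer's own accounting may differ slightly by version:

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_gpus = 1
max_steps = 60
num_examples = 1000

# Examples contributing to each optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Examples consumed over the capped run.
examples_seen = effective_batch * max_steps
# Approximate optimizer steps needed to cover the dataset once.
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen} of {num_examples}")
print(f"Steps for roughly one epoch: {steps_per_epoch}")
```

So the 60-step run covers most of one pass over the data, and a full epoch needs only a few more steps.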

In [11]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4090. Max memory = 23.988 GB.
12.811 GB of memory reserved.


In [12]:
trainer_stats = trainer.train()

Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
The tokenizer has new special tokens that are also defined in the model configs. The model configs were aligned accordingly. Updated tokens: {'bos_token_id': 199998, 'pad_token_id': 200017}


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 16
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 16 x 1) = 16
 "-____-"     Trainable parameters = 3,981,312 of 20,918,738,496 (0.02% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
20,1.6179
40,1.141
60,1.0685
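
For context on the logged steps, the learning rate under a linear schedule with 5 warmup steps can be sketched as follows. This mirrors the usual linear-warmup, linear-decay rule and uses the values from the config above; it is an approximation written for illustration, and the trainer's own scheduler remains authoritative:

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate at `step`: linear ramp-up during warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

for step in (0, 5, 20, 40, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```

The rate peaks right after warmup and shrinks toward zero by step 60, so later steps make progressively smaller updates.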


In [13]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

2079.415 seconds used for training.
34.66 minutes used for training.
Peak reserved memory = 23.051 GB.
Peak reserved memory for training = 10.24 GB.
Peak reserved memory % of max memory = 96.094 %.
Peak reserved memory for training % of max memory = 42.688 %.
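
The printed figures can be re-derived by hand, which also shows how little headroom is left on a 24 GB card. The sketch below simply repeats the cell's arithmetic on the numbers printed above and adds two derived values (headroom and seconds per step) that the notebook does not print. Note that the cell divides bytes by 1024 three times, so "GB" here means GiB:

```python
# Values copied from the printed outputs of this run.
start_gb, peak_gb, max_gb = 12.811, 23.051, 23.988
train_seconds, steps = 2079.415, 60

training_gb = round(peak_gb - start_gb, 3)          # memory attributed to training
peak_pct = round(peak_gb / max_gb * 100, 3)         # share of the card in use at peak
training_pct = round(training_gb / max_gb * 100, 3)
headroom_gb = round(max_gb - peak_gb, 3)            # what is left before running out
seconds_per_step = round(train_seconds / steps, 2)

print(f"Training memory: {training_gb} GB ({training_pct} % of the card)")
print(f"Peak usage: {peak_pct} % with {headroom_gb} GB of headroom")
print(f"Average time per optimizer step: {seconds_per_step} s")
```

With under 1 GB free at peak, longer sequences or a larger batch would likely need a bigger GPU or more aggressive memory savings.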


<a name="Inference"></a>
### Inference
Let's run the model! You can change the system prompt and the user message; `add_generation_prompt = True` leaves the assistant turn open for the model to complete.

In [14]:
messages = [
    {"role": "system", "content": "reasoning language: French\n\nYou are a helpful assistant that can solve mathematical problems."},
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium",
).to(model.device)
from transformers import TextStreamer
_ = model.generate(**inputs, max_new_tokens = 2048, streamer = TextStreamer(tokenizer))

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-13

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions

reasoning language: French

You are a helpful assistant that can solve mathematical problems.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>Très bien, commençons par examiner l'équation donnée : x^5 + 3x^4 - 10 = 3. Pour résoudre cette équation, je vais d'abord essayer de simplifier le côté gauche en ajoutant 10 de chaque côté afin d'isoler le terme quadratique.

\[
x^5 + 3x^4 - 10 = 3
\]

En ajoutant 10 de chaque côté :

\[
x^5 + 3x^4 = 13
\]

Cette équation semble assez compliquée à résoudre directement. Je vais essayer de rechercher un facteur commun dans les