To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for llama.cpp).

This notebook uses the `Llama-3` format for conversation style finetunes. We use [Open Assistant conversations](https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style) in ShareGPT style.

In [None]:

%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[cu128-torch280] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

In [4]:
!pip install unsloth

Collecting unsloth
  Using cached unsloth-2025.8.4-py3-none-any.whl.metadata (47 kB)
Collecting unsloth_zoo>=2025.8.3 (from unsloth)
  Using cached unsloth_zoo-2025.8.3-py3-none-any.whl.metadata (9.4 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Using cached xformers-0.0.31.post1-cp39-abi3-win_amd64.whl.metadata (1.1 kB)
Collecting triton-windows (from unsloth)
  Using cached triton_windows-3.4.0.post20-cp311-cp311-win_amd64.whl.metadata (1.8 kB)
Collecting tyro (from unsloth)
  Using cached tyro-0.9.28-py3-none-any.whl.metadata (11 kB)
Collecting transformers!=4.47.0,!=4.52.0,!=4.52.1,!=4.52.2,!=4.52.3,!=4.53.0,>=4.51.3 (from unsloth)
  Using cached transformers-4.55.1-py3-none-any.whl.metadata (41 kB)
Collecting datasets<4.0.0,>=3.4.1 (from unsloth)
  Using cached datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting sentencepiece>=0.2.0 (from unsloth)
  Using cached sentencepiece-0.2.1-cp311-cp311-win_amd64.whl.metadata (10 kB)
Collecting tqdm (from unsloth)
  Using c

  You can safely remove it manually.


In [8]:
pip install unsloth_zoo

Note: you may need to restart the kernel to use updated packages.


* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
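
To make the RoPE scaling bullet above concrete, here is a minimal sketch of linear position interpolation, the idea behind kaiokendev's method: positions are compressed so that the longest requested position still falls inside the range seen during pretraining. The function names, the `base` default, and the 8192/16384 lengths are illustrative assumptions, not Unsloth's internal code.

```python
def rope_angles(position, head_dim, base=10000.0):
    """Rotary angles for one position: position * base^(-2i / head_dim)."""
    return [position * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def interpolated_position(position, trained_len, target_len):
    """Linearly compress a position when target_len exceeds trained_len."""
    scale = max(1.0, target_len / trained_len)
    return position / scale

trained_len, target_len = 8192, 16384   # hypothetical context lengths
last = interpolated_position(target_len - 1, trained_len, target_len)
angles = rope_angles(last, head_dim=128)
print(last, len(angles))
```

When the requested length does not exceed the trained length, the scale stays at 1 and positions are left unchanged.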

In [1]:
import unsloth
import torch


print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("Torch version:", torch.__version__)
print("Device count:", torch.cuda.device_count())
print("Device name:", torch.cuda.get_device_name(0))


Exception in thread Thread-4 (_readerthread):
Traceback (most recent call last):
  File "c:\Users\aint\anaconda3\envs\unsloth_env_5noxformer\Lib\threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "c:\Users\aint\anaconda3\envs\unsloth_env_5noxformer\Lib\site-packages\ipykernel\ipkernel.py", line 788, in run_closure
    _threading_Thread_run(self)
  File "c:\Users\aint\anaconda3\envs\unsloth_env_5noxformer\Lib\threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "c:\Users\aint\anaconda3\envs\unsloth_env_5noxformer\Lib\subprocess.py", line 1599, in _readerthread
    buffer.append(fh.read())
                  ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 7: invalid start byte


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm
    PyTorch 2.3.0+cu121 with CUDA 1201 (you have 2.8.0+cu128)
    Python  3.11.9 (you have 3.11.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!
CUDA available: True
CUDA version: 12.8
Torch version: 2.8.0+cu128
Device count: 1
Device name: NVIDIA GeForce RTX 5090 Laptop GPU


In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"{DEVICE_TYPE}:{i}") for i in range(n_gpus)])


==((====))==  Unsloth 2025.8.5: Fast Llama patching. Transformers: 4.55.1.
   \\   /|    NVIDIA GeForce RTX 5090 Laptop GPU. Num GPUs = 1. Max memory: 23.889 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.8.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
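
As a rough check on how few weights LoRA trains, the sketch below counts the adapter parameters for the configuration above. The layer shapes are assumptions taken from the published Llama-3-8B configuration (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers); they are not read from the loaded model.

```python
# Assumed Llama-3-8B shapes (from its published config, not from this notebook)
hidden, intermediate, kv_dim, num_layers = 4096, 14336, 1024, 32
r = 16  # LoRA rank used above

# (in_features, out_features) of each targeted projection
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# Each adapter adds A (r x in) and B (out x r): r * (in + out) weights
per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
trainable = per_layer * num_layers
print(f"{trainable:,}")  # 41,943,040
```

This agrees with the trainable parameter count that Unsloth prints when training starts.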


<a name="Data"></a>
### Data Prep
We now use the `Llama-3` format for conversation style finetunes. We use [Open Assistant conversations](https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style) in ShareGPT style. Llama-3 renders multi turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
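
The layout above can be reproduced with a few lines of plain Python. This is only a sketch of the format for illustration; the real rendering is done by the tokenizer's chat template, and `render_llama3` is a hypothetical helper name.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Render role/content messages in the Llama-3 layout shown above."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

chat = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
print(render_llama3(chat))
```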

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old` and our own optimized `unsloth` template.

Note ShareGPT uses `{"from": "human", "value" : "Hi"}` and not `{"role": "user", "content" : "Hi"}`, so we use `mapping` to map it.
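
To illustrate what the mapping does conceptually, here is a small stand-alone sketch that converts ShareGPT-style turns into `role`/`content` turns. `sharegpt_to_roles` is a hypothetical helper written for this example; `get_chat_template` handles the mapping internally.

```python
def sharegpt_to_roles(conversation):
    """Convert ShareGPT-style turns to role/content turns."""
    role_names = {"human": "user", "gpt": "assistant"}
    return [
        {"role": role_names.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in conversation
    ]

turns = [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello!"}]
converted = sharegpt_to_roles(turns)
print(converted)
```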

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

def formatting_prompts_func(examples):
    questions = examples["question"]
    contexts = examples["context"]
    answers = examples["answer"]

    texts = []
    for question, context, answer in zip(questions, contexts, answers):
        prompt = f"### Question:\n{question}\n\n### Context:\n{context}\n\n### Answer:\n{answer}"
        texts.append(prompt)

    return {"text": texts}

from datasets import load_dataset
dataset = load_dataset("csv", data_files=r"C:\Users\aint\Downloads\goat_data-1.1.csv", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)

Generating train split: 2894 examples [00:00, 36836.58 examples/s]
Map: 100%|██████████| 2894/2894 [00:00<00:00, 48351.93 examples/s]


In [6]:
print(dataset[0].keys())


dict_keys(['id', 'question', 'context', 'answer', 'answer_start', 'text'])


In [7]:
print(dataset[0])


{'id': '1月1日', 'question': '美國乳羊協會登錄的主要乳羊品種有哪些？', 'context': '乳羊品種：目前經美國乳羊協會(ADGA)登錄的主要品種包括：撒能(Saanen)、阿爾拜因(Alpine)、吐根堡(Toggenburg)、努比亞(Nubian)、賴滿嬌(LaMancha)及歐伯哈斯利(Oberhasli)等六種，其他先進畜牧國家如英國、澳洲及紐西蘭等國，則以其中之前四種為主，法國則以阿爾拜因及撒能兩種為主，這些品種過去臺灣都曾自美國及紐、澳進口，但目前羊農所飼養的乳羊品種，已逐漸簡化成以阿爾拜因及撒能為主的趨勢，其中又以阿爾拜因佔70％為多數。', 'answer': '撒能(Saanen)、阿爾拜因(Alpine)、吐根堡(Toggenburg)、努比亞(Nubian)、賴滿嬌(LaMancha)及歐伯哈斯利(Oberhasli)', 'answer_start': 30, 'text': '### Question:\n美國乳羊協會登錄的主要乳羊品種有哪些？\n\n### Context:\n乳羊品種：目前經美國乳羊協會(ADGA)登錄的主要品種包括：撒能(Saanen)、阿爾拜因(Alpine)、吐根堡(Toggenburg)、努比亞(Nubian)、賴滿嬌(LaMancha)及歐伯哈斯利(Oberhasli)等六種，其他先進畜牧國家如英國、澳洲及紐西蘭等國，則以其中之前四種為主，法國則以阿爾拜因及撒能兩種為主，這些品種過去臺灣都曾自美國及紐、澳進口，但目前羊農所飼養的乳羊品種，已逐漸簡化成以阿爾拜因及撒能為主的趨勢，其中又以阿爾拜因佔70％為多數。\n\n### Answer:\n撒能(Saanen)、阿爾拜因(Alpine)、吐根堡(Toggenburg)、努比亞(Nubian)、賴滿嬌(LaMancha)及歐伯哈斯利(Oberhasli)'}


Let's see what a formatted training example looks like by printing the element at index 5 (the 6th element):

In [8]:
example = dataset[5]
print(example["id"])
print(example["question"])
print(example["context"])
print(example["answer"])
print(example["answer_start"])

2月4日
阿爾拜因有哪些優點？
阿爾拜因(Alpine)：阿爾拜因根據育種地域不同又可分為英國、法國、瑞士、洛克及澳洲等5個品系。英國阿爾拜因毛色為單一黑色，但在臉部兩側有白色條紋，尾根兩側的坐骨端及四肢膝蓋以下為白色。瑞士阿爾拜因現今正式以歐伯哈斯利(Oberhasli)品種登錄。洛克阿爾拜因由洛克夫人育成，現存頭數很少。目前無論在美國或臺灣地區，都以法國阿爾拜因(French Alpine)最為普遍。近年來，在澳洲育成及登錄一種全身為黑色的新品種，原登錄為Melan，現已更改為黑色阿爾拜因(Black Al-pine)，臺灣已有部分羊農引進。法國阿爾拜因為中至大型羊種，母羊體高至少76公分以上，體重57公斤以上；公羊體高至少81公分以上，體重77公斤以上。公母羊均有鬍鬚、鼻樑平直、耳朵直立、眼球略為突出、有角或無角、機警敏捷，雖然對飼主溫馴，但對其他羊隻的攻擊性較其他乳羊品種為強。在美國的法國阿爾拜因沒有固定的毛色，由黑、白、紅棕、灰色及褐色等混雜形成不同毛色組合，但不宜出現土根堡及撒能的典型毛色。美國的阿爾拜因一般可歸納成Cou Blanc、Cou Clair、Cou Noir、Pied、Sundgau、Chamoisee、Two-tone Chamoisee及Broken Chamoisee等8種不同色型，其中5種較普遍的色型如下：Cou Blanc：頸部及肩部為白色，後軀為黑色系；Cou Clair：頸部為褐色，後軀為黑色；Cou Noir：前軀為黑色，後軀為白色；Chamoisee：全身為栗色或褐色，但在頭部、頸部、背脊線及腿部為黑色：Sundgau：全身黑色，但在臉部兩側有白線條，四肢膝蓋以下及尾根兩側為白色，如英國阿爾拜因。不過現今在法國的阿爾拜因、僅有選留單一褐色鑲黑邊(Chamoisee)的毛色。阿爾拜因最初的發源地為法國阿爾卑斯山區，因此體質強健，耐長途跋，對不同氣候環境的適應性良好，非常耐粗放飼養，仔羊的育成率也高。母羊的泌乳期長，在國際上的平均產乳量僅次於撒能，在美國一年305天的產乳量曾高達1，800公斤。由於阿爾拜因具有上述的優點，毛色又以黑及褐色系列為多數，肥育完成的乳公羊在臺灣肉羊拍賣市場每公斤的活體價格比撒能羊高，因此深受羊農的歡迎，已有漸取代能成為乳羊主流品種的趨勢。過去多年來自美國引進臺灣的阿爾拜因頭數最多，也有少數來自澳洲近

In [9]:
print(dataset[5]["text"])

### Question:
阿爾拜因有哪些優點？

### Context:
阿爾拜因(Alpine)：阿爾拜因根據育種地域不同又可分為英國、法國、瑞士、洛克及澳洲等5個品系。英國阿爾拜因毛色為單一黑色，但在臉部兩側有白色條紋，尾根兩側的坐骨端及四肢膝蓋以下為白色。瑞士阿爾拜因現今正式以歐伯哈斯利(Oberhasli)品種登錄。洛克阿爾拜因由洛克夫人育成，現存頭數很少。目前無論在美國或臺灣地區，都以法國阿爾拜因(French Alpine)最為普遍。近年來，在澳洲育成及登錄一種全身為黑色的新品種，原登錄為Melan，現已更改為黑色阿爾拜因(Black Al-pine)，臺灣已有部分羊農引進。法國阿爾拜因為中至大型羊種，母羊體高至少76公分以上，體重57公斤以上；公羊體高至少81公分以上，體重77公斤以上。公母羊均有鬍鬚、鼻樑平直、耳朵直立、眼球略為突出、有角或無角、機警敏捷，雖然對飼主溫馴，但對其他羊隻的攻擊性較其他乳羊品種為強。在美國的法國阿爾拜因沒有固定的毛色，由黑、白、紅棕、灰色及褐色等混雜形成不同毛色組合，但不宜出現土根堡及撒能的典型毛色。美國的阿爾拜因一般可歸納成Cou Blanc、Cou Clair、Cou Noir、Pied、Sundgau、Chamoisee、Two-tone Chamoisee及Broken Chamoisee等8種不同色型，其中5種較普遍的色型如下：Cou Blanc：頸部及肩部為白色，後軀為黑色系；Cou Clair：頸部為褐色，後軀為黑色；Cou Noir：前軀為黑色，後軀為白色；Chamoisee：全身為栗色或褐色，但在頭部、頸部、背脊線及腿部為黑色：Sundgau：全身黑色，但在臉部兩側有白線條，四肢膝蓋以下及尾根兩側為白色，如英國阿爾拜因。不過現今在法國的阿爾拜因、僅有選留單一褐色鑲黑邊(Chamoisee)的毛色。阿爾拜因最初的發源地為法國阿爾卑斯山區，因此體質強健，耐長途跋，對不同氣候環境的適應性良好，非常耐粗放飼養，仔羊的育成率也高。母羊的泌乳期長，在國際上的平均產乳量僅次於撒能，在美國一年305天的產乳量曾高達1，800公斤。由於阿爾拜因具有上述的優點，毛色又以黑及褐色系列為多數，肥育完成的乳公羊在臺灣肉羊拍賣市場每公斤的活體價格比撒能羊高，因此深受羊農的歡迎，已有漸取代能成為乳羊主流品種的趨勢。過去多年來自美國

If you're looking to make your own chat template, that is also possible! You must write it as a Jinja template. We provide our own stripped-down `Unsloth template`, which we find to be more efficient and which draws on the ChatML, Zephyr and Alpaca styles.

More info on chat templates on [our wiki page!](https://github.com/unslothai/unsloth/wiki#chat-templates)

In [10]:
unsloth_template = \
    "{{ bos_token }}"\
    "{{ 'You are a helpful assistant to the user\n' }}"\
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ '>>> User: ' + message['content'] + '\n' }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ '>>> Assistant: ' + message['content'] + eos_token + '\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '>>> Assistant: ' }}"\
    "{% endif %}"
unsloth_eos_token = "eos_token"

if True: # Set to False to skip the custom Unsloth template
    tokenizer = get_chat_template(
        tokenizer,
        chat_template = (unsloth_template, unsloth_eos_token,), # You must provide a template and EOS token
        mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
        map_eos_token = True, # Maps <|im_end|> to </s> instead
    )
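
To see what the Jinja template above produces, here is a plain-Python paraphrase of its logic. It is a sketch for illustration only: `render_unsloth_style` is a hypothetical helper, and the `<s>` / `</s>` tokens are placeholders rather than the tokenizer's actual special tokens.

```python
def render_unsloth_style(messages, bos_token, eos_token, add_generation_prompt=False):
    """Mirror the branches of the Jinja template above in plain Python."""
    text = bos_token + "You are a helpful assistant to the user\n"
    for message in messages:
        if message["role"] == "user":
            text += ">>> User: " + message["content"] + "\n"
        elif message["role"] == "assistant":
            text += ">>> Assistant: " + message["content"] + eos_token + "\n"
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete
        text += ">>> Assistant: "
    return text

demo = render_unsloth_style(
    [{"role": "user", "content": "Hi"}],
    bos_token="<s>", eos_token="</s>",  # placeholder tokens
    add_generation_prompt=True,
)
print(demo)
```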

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 100 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the `max_steps` limit. We also support TRL's `DPOTrainer`!

In [13]:
import unsloth
from unsloth import is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments

import torch
import os

os.environ["TORCHINDUCTOR_DISABLE"] = "1"
os.environ["TORCH_COMPILE_DISABLE"] = "1"
torch._dynamo.config.suppress_errors = True
torch._dynamo.config.verbose = True

# Disable torch.compile by replacing it with a no-op that returns the model unchanged
torch.compile = lambda model, *args, **kwargs: model
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 1,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,  # Adjusted for faster training
        max_steps = 100,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
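
The `learning_rate`, `warmup_steps`, `max_steps` and `lr_scheduler_type = "linear"` settings above imply a learning rate that ramps up for 5 steps and then decays linearly to zero. The sketch below models that shape in plain Python; it is written from the usual definition of a linear schedule with warmup and is not the library's own code.

```python
def linear_lr(step, peak_lr=2e-4, warmup_steps=5, max_steps=100):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / max(1, max_steps - warmup_steps)

for step in (0, 1, 5, 50, 100):
    print(step, linear_lr(step))
```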

In [16]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 5090 Laptop GPU. Max memory = 23.889 GB.
28.957 GB of memory reserved.


In [15]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,894 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.8679
2,2.8126
3,2.7686
4,2.6665
5,2.3031
6,1.9896
7,2.3211
8,2.1739
9,2.3563
10,2.2968




In [21]:
print(trainer_stats)

TrainOutput(global_step=100, training_loss=1.8610547018051147, metrics={'train_runtime': 479.1904, 'train_samples_per_second': 1.669, 'train_steps_per_second': 0.209, 'total_flos': 1.5746811168325632e+16, 'train_loss': 1.8610547018051147, 'epoch': 0.27643400138217})
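
The `epoch` and throughput figures in this output follow directly from the batch settings. A short worked calculation, using only numbers printed in this notebook:

```python
num_examples = 2894          # rows in the training set
per_device_batch = 2
grad_accum = 4
num_gpus = 1
max_steps = 100
train_runtime = 479.1904     # seconds, from the TrainOutput above

effective_batch = per_device_batch * grad_accum * num_gpus
samples_seen = effective_batch * max_steps
epochs = samples_seen / num_examples

print(effective_batch)                          # 8
print(round(epochs, 4))                         # 0.2764
print(round(samples_seen / train_runtime, 3))   # 1.669
print(round(max_steps / train_runtime, 3))      # 0.209
```

So 100 steps cover a little over a quarter of the dataset; a full epoch would take roughly 362 steps at this batch size.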


In [17]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

479.1904 seconds used for training.
7.99 minutes used for training.
Peak reserved memory = 28.957 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 121.215 %.
Peak reserved memory for training % of max memory = 0.0 %.
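
The zero reported for training memory comes from the subtraction itself: the baseline and the peak are the same number. The cell counters suggest the baseline cell (`In [16]`) ran after `trainer.train()` (`In [15]`), which would explain this, though that is an inference from the numbering. The arithmetic, using the printed values:

```python
max_memory = 23.889        # GB, reported total
start_gpu_memory = 28.957  # GB, baseline as printed above
used_memory = 28.957       # GB, peak after training

used_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
print(used_for_training)   # 0.0
print(used_percentage)     # 121.215
```

To get a meaningful training delta, run the baseline cell before calling `trainer.train()`.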


<a name="Inference"></a>
### Inference
Let's run the model! Since we're using `Llama-3`, use `apply_chat_template` with `add_generation_prompt` set to `True` for inference.

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"from": "human", "value": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe next number in the Fibonacci sequence would be: 13<|eot_id|>']

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the full output!

In [13]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"from": "human", "value": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>



The next numbers in the Fibonacci sequence would be:

13, 21, 34, 55, 89, 144,...<|eot_id|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [18]:
model.save_pretrained("lora_model/goat") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [22]:
if True: # Change to True to load the model again
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model/goat", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "山羊缺乏維生素A會導致視力出現什麼問題？繁體中文回答"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

==((====))==  Unsloth 2025.8.5: Fast Llama patching. Transformers: 4.55.1.
   \\   /|    NVIDIA GeForce RTX 5090 Laptop GPU. Num GPUs = 1. Max memory: 23.889 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

山羊缺乏維生素A會導致視力出現什麼問題？繁體中文回答<|eot_id|><|start_header_id|>assistant<|end_header_id|>

山羊缺乏維生素A會導致夜盲症，夜盲症的山羊會在黑暗中無法正常活動，嚴重的會導致山羊死亡。山羊缺乏維生素A時，會造成視網膜組織的細胞分泌異常，造成視覺功能受損。山羊缺乏維生素A，視覺功能受損，會造成山羊在黑暗中無法活動，嚴重的會導致山羊死亡。山羊缺乏維生素A，視


In [10]:
if True: # Change to True to load the model again
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model/goat", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "肉羊最佳屠宰體重與日齡是多少?80字以內簡答"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

==((====))==  Unsloth 2025.8.5: Fast Llama patching. Transformers: 4.55.1.
   \\   /|    NVIDIA GeForce RTX 5090 Laptop GPU. Num GPUs = 1. Max memory: 23.889 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>

肉羊最佳屠宰體重與日齡是多少?80字以內簡答<|eot_id|><|start_header_id|>assistant<|end_header_id|>

肉羊的最佳屠宰體重為75公斤，日齡約為6~7個月齡。當肉羊的日齡達到6~7個月齡時，其體重約為75公斤，屬於成熟的肉羊體型。當肉羊的日齡再延長至8個月齡，則其體重約為85公斤，屬於大型的肉羊體型。肉羊的日齡與體重之關係為，日齡愈長，體重


You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Not recommended: use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
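
For intuition about what "merged" means, the sketch below applies the standard LoRA merge formula, `W + (lora_alpha / r) * B @ A`, to a toy matrix in plain Python. It illustrates the math only; Unsloth's actual merge also has to dequantize the 4-bit base weights, and the helper names and values here are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy 2x2 layer with a rank-1 adapter (values are made up)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # r x in_features
B = [[0.5], [0.0]]      # out_features x r
merged = merge_lora(W, A, B, lora_alpha=1, r=1)
print(merged)  # [[1.5, 1.0], [0.0, 1.0]]
```

With this notebook's settings (`lora_alpha = 16`, `r = 16`) the scale factor is likewise 1.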


### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
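
To give a feel for what a quantization method does, here is a plain-Python sketch of `q8_0`-style block quantization: 32 weights share one scale and are stored as 8-bit integers. It follows the commonly described layout of that format and is an illustration, not `llama.cpp`'s implementation; the function names are hypothetical.

```python
def quantize_q8_0_block(values):
    """Quantize 32 floats to int8 values plus one shared scale (Q8_0-style)."""
    assert len(values) == 32
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax else 1.0
    quants = [round(v / scale) for v in values]
    return scale, quants

def dequantize(scale, quants):
    """Recover approximate floats from the scale and int8 values."""
    return [scale * q for q in quants]

block = [(i - 16) / 16 for i in range(32)]   # toy weights in [-1, 1)
scale, quants = quantize_q8_0_block(block)
restored = dequantize(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))

# 32 int8 values + one 16-bit scale = 34 bytes per 32 weights
bits_per_weight = (32 * 8 + 16) / 32
print(bits_per_weight, max_error)
```

At about 8.5 bits per weight, this is roughly half the size of an `f16` export, with a rounding error of at most half a scale step per weight.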

In [None]:
pip install mistral-common


In [20]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if True: model.save_pretrained_gguf("model_gguf_fp16", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 27.01 out of 63.4 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 66%|██████▌   | 21/32 [00:00<00:00, 29.16it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [00:17<00:00,  1.81it/s]


Unsloth: Saving tokenizer... Done.
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at model_gguf_fp16 into f16 GGUF format.
The output location will be c:\Users\aint\Downloads\model_gguf_fp16\unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model_gguf_fp16
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torc

Now use the exported `.gguf` file (in this run, `model_gguf_fp16\unsloth.F16.gguf`, per the log above) in `llama.cpp` or a UI-based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news or join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
8. Gemma 6 trillion tokens is 2.5x faster! [free Colab](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>