Unsloth is a relatively new open-source Python library designed to make working with large language models (LLMs) easier, faster, and more memory-efficient, particularly for fine-tuning and inference on consumer-grade hardware (such as a single GPU with limited VRAM).

Key features of Unsloth:

1. **Optimized Fine-Tuning**:

    - It enables efficient fine-tuning of LLMs using techniques like LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation).
    - Supports 4-bit and 8-bit quantization, which significantly reduces memory usage without severely impacting performance.

1. **Speed & Efficiency**:

    - The library claims to be 2x faster than Hugging Face Transformers for fine-tuning models like LLaMA, Mistral, etc.
    - It is optimized for single-GPU setups, even with as little as 16 GB of VRAM.

1. **Easy-to-Use API**:

    - Offers a familiar API similar to Hugging Face Transformers, making it easy to switch or integrate.
    - Built-in training, evaluation, and model loading utilities.

1. **Compatibility**:

    - Integrates well with Hugging Face’s ecosystem (e.g., 🤗 Datasets, Accelerate, Transformers).
    - Supports popular open LLMs such as LLaMA, Mistral, Gemma, and others.

In [1]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None # None for auto detection.
load_in_4bit = True # Use 4bit quantization to reduce memory usage.

# Load the pre-trained model: Llama 3 8B
model, tokenizer = FastLanguageModel.from_pretrained(
    # "bnb" stands for bitsandbytes, a library for model quantization
    model_name = "unsloth/llama-3-8b-bnb-4bit", # More models at https://huggingface.co/unsloth
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.7: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    NVIDIA A30 MIG 4g.24gb. Num GPUs = 1. Max memory: 23.486 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
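
To get a rough feel for why 4-bit loading matters, the back-of-the-envelope sketch below estimates the memory needed for the weights alone of an 8-billion-parameter model at several precisions. The helper name is made up for this example, and the parameter count is the rounded figure printed in the training log later in this notebook. The estimate ignores quantization constants, activations, optimizer state, and the KV cache, so real usage will be higher.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8_000_000_000  # rounded parameter count for Llama 3 8B

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the space of 16-bit weights, which is what lets an 8B model fit comfortably on the 24 GB GPU shown above.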


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

Instead of updating the original large weight matrices $W$ in layers (e.g., attention layers), LoRA freezes them and adds small trainable low-rank matrices ($A$ and $B$) as an approximation:

$$
    W_{adapted} = W + \alpha (AB)
$$

Where:
- $W$ is the original pretrained weight matrix (frozen)
- $A$ and $B$ are the low-rank matrices (trainable, much smaller)
- $\alpha$ is a scaling factor

By using low-rank matrices (e.g., rank $r=8$), LoRA significantly reduces:
- Memory usage
- Number of trainable parameters
- Time to fine-tune
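
As a toy illustration of the formula above, the following pure-Python sketch builds a small frozen matrix $W$, two low-rank factors $A$ and $B$, and forms $W + \alpha (AB)$. The sizes, values, and helper names are invented for the example. Real implementations work on GPU tensors, and many of them fold a $1/r$ factor into the scaling.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_adapt(W, A, B, alpha):
    """Return W + alpha * (A @ B) without modifying W."""
    AB = matmul(A, B)
    return [[w + alpha * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, AB)]


d, k, r = 4, 3, 1                   # toy sizes: W is d x k, adapters have rank r
W = [[1.0] * k for _ in range(d)]   # frozen pretrained weights
A = [[0.5] for _ in range(d)]       # trainable, d x r
B = [[1.0, 2.0, 3.0]]               # trainable, r x k

W_adapted = lora_adapt(W, A, B, alpha=2.0)
full_params = d * k                 # parameters in W
lora_params = r * (d + k)           # parameters in A and B combined
print(W_adapted[0], full_params, lora_params)
```

At these toy sizes the saving is small, but the count $r(d + k)$ grows linearly with the matrix dimensions while $dk$ grows quadratically, so the gap widens quickly for real layers.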

In [2]:
model = FastLanguageModel.get_peft_model(
    model,
    # The rank of the finetuning process
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    # Modules to finetune. We selected all modules (recommended)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    # The scaling factor for finetuning.
    # A larger number will make the finetune learn more about your dataset, but can promote over-fitting.
    # suggest equal to rank r, or double it.
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # Unsloth Gradient Checkpointing algorithm that enables fine-tuning LLMs with exceptionally long context windows
    # It uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    # Advanced feature: rank-stabilized LoRA scales the adapters by lora_alpha / sqrt(r) instead of lora_alpha / r
    use_rslora = False,  # We support rank stabilized LoRA
    # Advanced feature to initialize the LoRA matrices to the top r singular vectors of the weights.
    # Can improve accuracy somewhat, but can make memory usage explode at the start.
    loftq_config = None, 
)

Unsloth 2025.5.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
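
The number of trainable parameters implied by `r = 16` and the seven `target_modules` above can be checked by hand. The sketch below assumes the usual Llama 3 8B projection shapes (hidden size 4096, key/value dimension 1024, MLP dimension 14336, 32 layers); these shapes are not stated in this notebook, so treat them as an assumption. Each adapted matrix contributes `rank * (in_features + out_features)` parameters.

```python
# Assumed Llama 3 8B projection shapes as (in_features, out_features).
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Trainable LoRA parameters: rank * (in + out) per adapted matrix, per layer."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in SHAPES.values())
    return per_layer * NUM_LAYERS


trainable = lora_param_count(rank=16)
print(f"{trainable:,}")
```

With these shapes the count comes to 41,943,040, the same figure the trainer reports as "Trainable parameters" in its banner further down.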


<a name="Data"></a>
### Data Prep
We now use the Alpaca GPT-4 dataset from [vicgalle](https://huggingface.co/datasets/vicgalle/alpaca-gpt4), a 52K-example version of the original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html) whose responses were generated with GPT-4.

In [3]:
from datasets import load_dataset

dataset = load_dataset("vicgalle/alpaca-gpt4", split="train")
print(dataset.column_names)

['instruction', 'input', 'output', 'text']


In [5]:
dataset[0]

{'instruction': 'Give three tips for staying healthy.',
 'input': '',
 'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.',
 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for 

One issue is that this dataset has multiple columns. For `Ollama` and `llama.cpp` to function like a custom `ChatGPT`-style chatbot, we need exactly 2 columns: an `instruction` column and an `output` column.

In [6]:
print(dataset.column_names)

['instruction', 'input', 'output', 'text']


To solve this, we shall do the following:
* Merge all columns into 1 instruction prompt.
* Remember LLMs are text predictors, so we can customize the instruction to anything we like!
* Use the `to_sharegpt` function to do this column merging process!

For example, in our [Titanic CSV finetuning notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb), we merged multiple columns into 1 prompt:

<img src="https://raw.githubusercontent.com/unslothai/unsloth/nightly/images/Merge.png" height="100">

To merge multiple columns into 1, use `merged_prompt`.
* Enclose all columns in curly braces `{}`.
* Optional text must be enclosed in `[[]]`. For example, if the column "Pclass" is empty, the merging function will not show the text and will skip it. This is useful for datasets with missing values.
* You can select every column, or a few!
* Select the output or target / prediction column in `output_column_name`. For the Alpaca dataset, this will be `output`.
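
The merging rules above can be mimicked in a few lines of plain Python. The function `merge_row` below is a hypothetical stand-in written for this note, not Unsloth's implementation: it fills `{column}` placeholders and drops any `[[...]]` section whose columns are empty or missing.

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")
# Matches either an optional [[...]] section or a single {column} placeholder.
TOKEN = re.compile(r"\[\[(.*?)\]\]|\{(\w+)\}", flags=re.DOTALL)


def merge_row(template: str, row: dict) -> str:
    """Merge one dataset row into a prompt string (illustrative only)."""

    def value(column: str) -> str:
        v = row.get(column)
        return "" if v is None else str(v)

    def fill(text: str) -> str:
        return PLACEHOLDER.sub(lambda m: value(m.group(1)), text)

    def replace(match):
        optional_section, column = match.group(1), match.group(2)
        if optional_section is None:
            return value(column)
        columns = PLACEHOLDER.findall(optional_section)
        # Keep an optional section only if all of its columns are non-empty.
        keep = all(value(c).strip() for c in columns)
        return fill(optional_section) if keep else ""

    return TOKEN.sub(replace, template)


template = "{instruction}[[\nYour input is:\n{input}]]"
with_input = merge_row(template, {"instruction": "Sum the numbers.", "input": "1, 2, 3"})
without_input = merge_row(template, {"instruction": "Give three tips.", "input": ""})
print(repr(with_input))
print(repr(without_input))
```

Rows with an empty `input` produce only the instruction, while rows with an `input` get the extra "Your input is:" section appended.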

To make the finetune handle multiple turns (like in ChatGPT), we have to create a "fake" dataset with multiple turns. We use `conversation_extension` to randomly select some conversations from the dataset and pack them together into 1 conversation.
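
One plausible reading of this packing step is sketched below in plain Python. The function name, the fixed seed, and the sampling strategy are assumptions made for illustration; Unsloth's own selection logic may differ. Each single-turn example is followed by randomly chosen other examples until the conversation holds `extension` prompt/response pairs.

```python
import random


def extend_conversations(pairs, extension, seed=3407):
    """Pack each (prompt, response) pair with randomly chosen other pairs.

    Returns one ShareGPT-style conversation per input pair. Illustrative only.
    """
    rng = random.Random(seed)  # arbitrary seed, only for reproducibility
    n = len(pairs)
    conversations = []
    for i, pair in enumerate(pairs):
        # Draw candidate indices, then drop the current row if it was drawn.
        candidates = rng.sample(range(n), k=min(extension, n))
        extras = [pairs[j] for j in candidates if j != i][: extension - 1]
        turns = []
        for prompt, response in [pair] + extras:
            turns.append({"from": "human", "value": prompt})
            turns.append({"from": "gpt", "value": response})
        conversations.append(turns)
    return conversations


pairs = [(f"question {i}", f"answer {i}") for i in range(5)]
conversations = extend_conversations(pairs, extension=3)
print(len(conversations), len(conversations[0]))
```

The number of rows stays the same; only the length of each conversation grows, which matches the unchanged row count printed after the real call below.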

In [7]:
from unsloth import to_sharegpt

dataset = to_sharegpt(
    dataset,
    merged_prompt="{instruction}[[\nYour input is:\n{input}]]",
    output_column_name="output",
    conversation_extension=3,  # Select more to handle longer conversations
)

In [9]:
print(dataset)
print(dataset[0])

Dataset({
    features: ['conversations'],
    num_rows: 52002
})
{'conversations': [{'from': 'human', 'value': 'Give three tips for staying healthy.'}, {'from': 'gpt', 'value': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.'}, {'from': 'human', 'value': 'Describe what a monotheistic re

Finally use `standardize_sharegpt` to fix up the dataset!

In [10]:
from unsloth import standardize_sharegpt

dataset = standardize_sharegpt(dataset)

In [11]:
print(dataset)
print(dataset[0])

Dataset({
    features: ['conversations'],
    num_rows: 52002
})
{'conversations': [{'content': 'Give three tips for staying healthy.', 'role': 'user'}, {'content': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.', 'role': 'assistant'}, {'content': 'Describe what a monotheistic religion
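
Comparing the two printed datasets, the standardization step renames `from`/`value` to `role`/`content` and maps `human` to `user` and `gpt` to `assistant`. A minimal sketch of that renaming, using a hypothetical helper rather than the library function, might look like this:

```python
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def standardize_turns(turns):
    """Convert ShareGPT-style turns to role/content messages (illustrative only)."""
    return [
        # Unknown speaker labels are passed through unchanged.
        {"content": turn["value"], "role": ROLE_MAP.get(turn["from"], turn["from"])}
        for turn in turns
    ]


sharegpt_turns = [
    {"from": "human", "value": "Give three tips for staying healthy."},
    {"from": "gpt", "value": "1. Eat a balanced and nutritious diet."},
]
messages = standardize_turns(sharegpt_turns)
print(messages)
```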

### Customizable Chat Templates

You also need to specify a chat template.

Now, you have to use `{INPUT}` for the instruction and `{OUTPUT}` for the response.

We also allow you to use an optional `{SYSTEM}` field. This is useful for Ollama when you want to use a custom system prompt (also like in ChatGPT).

You can also not put a `{SYSTEM}` field, and just put plain text.

```python
chat_template = """{SYSTEM}
USER: {INPUT}
ASSISTANT: {OUTPUT}"""
```

The issue is the Alpaca format has 3 fields, whilst OpenAI style chatbots must only use 2 fields (instruction and response). That's why we used the `to_sharegpt` function to merge these columns into 1.

In [12]:
chat_template = """Below are some instructions that describe some tasks. Write responses that appropriately complete each request.

### Instruction:
{INPUT}

### Response:
{OUTPUT}"""

from unsloth import apply_chat_template

dataset = apply_chat_template(
    dataset,
    tokenizer=tokenizer,
    chat_template=chat_template,
    # default_system_message = "You are a helpful assistant", << [OPTIONAL]
)

Unsloth: We automatically added an EOS token to stop endless generations.
Map: 100%|██████████| 52002/52002 [00:03<00:00, 17280.12 examples/s]


In [13]:
print(dataset.column_names)
display(dataset[0]['conversations'])
display(dataset[0]['text'])

['conversations', 'text']


[{'content': 'Give three tips for staying healthy.', 'role': 'user'},
 {'content': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.',
  'role': 'assistant'},
 {'content': 'Describe what a monotheistic religion is.', 'role': 'user'},
 {'content': 'A monotheistic religion is a type of relig

'<|begin_of_text|>Below are some instructions that describe some tasks. Write responses that appropriately complete each request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.<|end_of_text|>\n\n### Instruction:\
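
The shape of the `text` field above (the preamble once, then one Instruction/Response block per turn, each closed by an EOS token) can be reproduced with a short sketch. The split into `PREAMBLE` and `TURN` and the `render` helper are assumptions made for this illustration; the actual `apply_chat_template` logic is more general.

```python
PREAMBLE = (
    "Below are some instructions that describe some tasks. "
    "Write responses that appropriately complete each request."
)
TURN = "### Instruction:\n{INPUT}\n\n### Response:\n{OUTPUT}"


def render(messages, bos="<|begin_of_text|>", eos="<|end_of_text|>"):
    """Render alternating user/assistant messages into one training string."""
    # Split the turn template once so message text is never re-interpreted.
    head, rest = TURN.split("{INPUT}")
    middle, tail = rest.split("{OUTPUT}")
    turns = []
    for user, assistant in zip(messages[0::2], messages[1::2]):
        turns.append(head + user["content"] + middle + assistant["content"] + tail + eos)
    return bos + PREAMBLE + "\n\n" + "\n\n".join(turns)


example = [
    {"role": "user", "content": "Give three tips for staying healthy."},
    {"role": "assistant", "content": "1. Eat a balanced diet."},
    {"role": "user", "content": "Describe what a monotheistic religion is."},
    {"role": "assistant", "content": "A monotheistic religion believes in one god."},
]
text = render(example)
print(text)
```

Ending every response with the EOS token is what teaches the model to stop generating once an answer is complete.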

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2, # Number of processes to use for processing the dataset.
    packing = False, # Setting this to True can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2, # Batch size per device
        # Number of batches whose gradients are accumulated before each optimizer update.
        gradient_accumulation_steps = 4,
        # warmup_steps defines the number of initial training steps during which the learning rate increases linearly
        # from a very small value (usually 0) to the target learning rate.
        # After this warmup period, the learning rate typically decays according to a chosen scheduler (e.g. linear, cosine).
        # Fine-tuning LLMs can be unstable, especially at the beginning. Jumping straight to a high learning rate can cause
        # problems. This helps to stabilize the training phase
        warmup_steps = 5,
        max_steps = 60,
        # num_train_epochs = 1, # For longer training runs!
        learning_rate = 2e-4,
        # Use 16-bit floats (fp16) instead of 32-bit when bfloat16 is not supported
        fp16 = not is_bfloat16_supported(),
        # Use bfloat16 instead of 32-bit when the GPU supports it
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        # weight_decay is a regularization technique used during training to
        # help prevent overfitting by discouraging the optimizer from assigning large weights.
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Set to "wandb" etc. to enable experiment tracking
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2): 100%|██████████| 52002/52002 [00:21<00:00, 2379.17 examples/s]
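
To make the `warmup_steps = 5`, `max_steps = 60`, and `lr_scheduler_type = "linear"` settings concrete, the sketch below computes the learning rate under the usual definition of a linear schedule with warmup. It is a standalone approximation written for this note; the values the trainer actually uses may differ by an off-by-one in how steps are counted.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

The rate climbs from zero to the peak of `2e-4` over the first 5 steps and then falls linearly back to zero at step 60.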


In [10]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 52,002 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!
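
The banner's "Total batch size (2 x 4 x 1) = 8" also tells us how much of the dataset a 60-step run actually covers. The arithmetic below uses only numbers printed above; the steps-per-epoch figure is a simple ceiling and may differ by one from the trainer's own count.

```python
import math

per_device_batch, grad_accum, num_gpus = 2, 4, 1
max_steps, num_examples = 60, 52_002

effective_batch = per_device_batch * grad_accum * num_gpus  # examples per optimizer update
examples_seen = effective_batch * max_steps                 # examples used in this short run
epoch_fraction = examples_seen / num_examples               # share of one full epoch
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(effective_batch, examples_seen, f"{epoch_fraction:.2%}", steps_per_epoch)
```

So this quick run sees under 1% of the data, while a full epoch would need roughly 6,500 optimizer steps.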


Step,Training Loss
1,1.4984
2,1.4678
3,1.4461
4,1.5717
5,1.6649
6,1.3171
7,1.3199
8,1.3178
9,1.3538
10,1.204


<a name="Inference"></a>
### Inference
Let's run the model! Unsloth makes inference natively 2x faster as well! You should use prompts similar to the ones you finetuned on, otherwise you might get bad results!

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [                    # Change below!
    {"role": "user", "content": "Continue the fibonacci sequence! Your input is 1, 1, 2, 3, 5, 8,"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

# Stream generated tokens to stdout as they are produced
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


The next number in the Fibonacci sequence is 13.<|end_of_text|>


In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [                         # Change below!
    {"role": "user",      "content": "Continue the fibonacci sequence! Your input is 1, 1, 2, 3, 5, 8"},
    {"role": "assistant", "content": "The fibonacci sequence continues as 13, 21, 34, 55 and 89."},
    {"role": "user",      "content": "What is France's tallest tower called?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

The tallest tower in France is called the Tour Eiffel, also known as the Eiffel Tower. It was built in 1889 for the World's Fair and stands at a height of 324 meters (1,063 feet).<|end_of_text|>
