
We'll use the [Titanic dataset](https://www.kaggle.com/c/titanic), stored as a CSV file, to fine-tune a language model that predicts which passengers survived and which perished.

Why use a language model instead of traditional machine learning for this prediction?

Using a finetuned language model for the Titanic dataset can seem unconventional since the dataset is a classic example of structured (tabular) data. However, there are several reasons why one might choose this approach over traditional machine learning methods. [Learn More](https://github.com/cc-xebia-webinars/language-models_03112025/blob/main/docs/using-lm-on-titantic.md)

This demo uses the following tools:

- **Hugging Face** - [huggingface.co](https://huggingface.co/) - Supplies pre-trained models and fine-tuning utilities as the core framework for adapting language models
- **Unsloth** - [unsloth.ai](https://unsloth.ai/) - Streamlines and optimizes the fine-tuning workflow, automating aspects of training to reduce complexity and speed up the process
- **Weights & Biases** - [wandb.com](https://wandb.com/) - Offers experiment tracking and visualization, allowing you to monitor metrics, compare runs, and tune hyperparameters effectively
- **Ollama** - [ollama.com](https://ollama.com/) - Serves as the deployment platform, enabling you to export and run your fine-tuned model locally for inference

These tools can be run locally or within Google Colab (the free version is sufficient). For this demo, the paid Colab Pro+ tier is used to speed up training.

This demo is adapted from an [Unsloth tutorial](https://docs.unsloth.ai/basics/tutorial-how-to-finetune-llama-3-and-use-in-ollama), with notes and modifications added.

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install unsloth
# Get latest Unsloth
!pip install --upgrade --no-deps "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

* Unsloth supports Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* Unsloth supports [16bit LoRA or 4bit QLoRA](https://github.com/cc-xebia-webinars/language-models_03112025/blob/main/docs/16bit-lora-vs-4bit-qlora.md). Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic [RoPE Scaling](https://github.com/cc-xebia-webinars/language-models_03112025/blob/main/docs/rope-scaling.md) via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), Unsloth supports downloading [4bit models](https://github.com/cc-xebia-webinars/language-models_03112025/blob/main/docs/four-bit-quantization.md) **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] Unsloth makes Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
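
To make the 16-bit versus 4-bit trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes a parameter count of about 8 billion for Llama-3 8B and counts weight storage only; `weight_memory_gb` is a hypothetical helper, not part of Unsloth. Real checkpoints are typically somewhat larger than this estimate, and training needs additional memory for activations, gradients, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions, 4-bit storage needs about a quarter of the memory of 16-bit storage, which is why the 4-bit model fits comfortably on smaller GPUs.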

In [None]:
from unsloth import FastLanguageModel
import torch

# Pre-trained LLM that will be finetuned
model_name = "unsloth/llama-3-8b-bnb-4bit"

# Choose any! We auto support RoPE Scaling internally! [Learn more](https://github.com/cc-xebia-webinars/language-models_03112025/blob/main/docs/rope-scaling.md)
max_seq_length = 2048

# None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
dtype = None

# Use 4bit quantization to reduce memory usage. Can be False. [Learn more](https://github.com/cc-xebia-webinars/language-models_03112025/blob/main/docs/four-bit-quantization.md)
load_in_4bit = True

# # 4bit pre quantized models we support for 4x faster downloading + no OOMs.
# fourbit_models = [
#     "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
#     "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
#     "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
#     "unsloth/llama-3-8b-Instruct-bnb-4bit",
#     "unsloth/llama-3-70b-bnb-4bit",
#     "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
#     "unsloth/Phi-3-medium-4k-instruct",
#     "unsloth/mistral-7b-bnb-4bit",
#     "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
# ] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.9: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



We now add [LoRA adapters](https://github.com/cc-xebia-webinars/language-models_03112025/blob/main/docs/lora-adapters.md) so that only a small fraction of the parameters needs to be updated, typically 1 to 10% (0.92% in this run, as the training log reports).

In [None]:
# Create a PEFT (Parameter Efficient Fine-Tuning) model from the base model
# using LoRA (Low-Rank Adaptation) to enable efficient fine-tuning.
model = FastLanguageModel.get_peft_model(
    # The pre-trained base model to adapt
    model,

    # 'r' specifies the rank of the low-rank matrices used in LoRA.
    # It can be any number greater than 0; common choices include 8, 16, 32, 64, or 128.
    r = 16,

    # 'target_modules' is a list of module names (typically projection layers)
    # within the model where LoRA adapters should be inserted.
    # [Learn more](https://github.com/cc-xebia-webinars/language-models_03112025/blob/main/docs/projection-modules.md)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],

    # 'lora_alpha' is a scaling factor that controls the strength of the LoRA updates.
    lora_alpha = 16,

    # 'lora_dropout' applies dropout to the LoRA weights.
    # A value of 0 means no dropout is used, which is optimized for performance.
    lora_dropout = 0,

    # 'bias' configuration can be customized, but "none" is optimized for this setup.
    bias = "none",

    # 'use_gradient_checkpointing' is set to "unsloth" which not only enables
    # gradient checkpointing for saving memory but also optimizes VRAM usage
    # (using 30% less VRAM and supporting 2x larger batch sizes).
    # This is particularly beneficial for handling very long contexts.
    use_gradient_checkpointing = "unsloth",

    # 'random_state' sets the seed for random number generation,
    # ensuring reproducibility of the training or fine-tuning process.
    random_state = 3407,

    # 'use_rslora' is a flag for enabling Rank Stabilized LoRA,
    # which can improve stability during training; here, it is disabled.
    use_rslora = False,

    # 'loftq_config' can hold configuration for LoftQ, an additional optimization feature.
    # It is set to None, meaning LoftQ is not being used in this configuration.
    loftq_config = None,
)

Unsloth 2025.3.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
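
As a sanity check on how few parameters LoRA trains, the sketch below counts the adapter parameters for the configuration above. It assumes the publicly documented Llama-3 8B dimensions (hidden size 4096, MLP size 14336, key/value projection size 1024, 32 layers); these values and the helper `lora_params` are not taken from this notebook. Each adapted linear layer with `in` inputs and `out` outputs gains two low-rank matrices with `r * (in + out)` parameters in total.

```python
# Assumed Llama-3 8B dimensions (from the public model configuration).
HIDDEN = 4096         # model (embedding) dimension
INTERMEDIATE = 14336  # MLP inner dimension
KV_DIM = 1024         # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32
R = 16                # LoRA rank used in the cell above

# (in_features, out_features) for every projection listed in target_modules.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # prints 41,943,040
```

The result matches the `Trainable parameters = 41,943,040` figure that the training log reports later in this notebook.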


<a name="Data"></a>
### Data Prep

We'll now use the [Titanic dataset](https://www.kaggle.com/c/titanic), which is a CSV / Excel file with many columns. The goal is to predict whether a passenger survived or perished based on characteristics such as their age and the fare they paid.

Unsloth has uploaded it to their [HF repo](https://huggingface.co/datasets/unsloth/datasets/raw/main/titanic.csv), but you can also upload your own CSV by clicking the 📂 icon on the left and then the upload 🔼 button.

In [None]:
# Hugging Face Datasets library, an open-source Python library designed to
# simplify the process of downloading, sharing, and processing datasets for
# machine learning tasks.

from datasets import load_dataset
dataset = load_dataset(
    "csv",
    data_files = "https://huggingface.co/datasets/unsloth/datasets/raw/main/titanic.csv",
    split = "train",
)

print(dataset.column_names)

# print first row of data (not the column names)
print(dataset[0])

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
{'PassengerId': 1, 'Survived': 0, 'Pclass': 3, 'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male', 'Age': 22.0, 'SibSp': 1, 'Parch': 0, 'Ticket': 'A/5 21171', 'Fare': 7.25, 'Cabin': None, 'Embarked': 'S'}


One issue is this dataset has multiple columns. For `Ollama` and `llama.cpp` to function like a custom `ChatGPT` Chatbot, we must only have 2 columns - an `instruction` and an `output` column.

In [None]:
print(dataset.column_names)

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']


To solve this, we shall do the following:
* Merge all columns into 1 instruction prompt.
* Remember LLMs are text predictors, so we can customize the instruction to anything we like!
* Use the `to_sharegpt` function to do this column merging process!

<img src="https://raw.githubusercontent.com/unslothai/unsloth/nightly/images/Merge.png" height="100">

To merge multiple columns into 1, use `merged_prompt`.
* Enclose all columns in curly braces `{}`.
* Optional text must be enclosed in `[[]]`. For example, if the column "Pclass" is empty for a row, the merging function will skip that text. This is useful for datasets with missing values.
* You can select every column, or a few!
* Select the output or target / prediction column in `output_column_name`. For the Titanic dataset, this will be `Survived`.

For example, if we want to use the columns `Age` and `Fare`, we can do the following:



In [None]:
from unsloth import to_sharegpt

# Convert the multi-column 'dataset' into the ShareGPT conversation format
# (a human prompt followed by a gpt answer) using the to_sharegpt function.
dataset_simple = to_sharegpt(
    # The original dataset to be transformed
    dataset,

    # Define a merged prompt template that incorporates dataset fields.
    # The template includes two sections enclosed in double square brackets:
    #   1. "Their age is {Age}.\n" - will insert the 'Age' value from each dataset entry.
    #   2. "They paid ${Fare} for the trip.\n" - will insert the 'Fare' value.
    # These placeholders {Age} and {Fare} are automatically replaced by the corresponding
    # data from each row, creating a personalized prompt for every record.
    merged_prompt = "[[Their age is {Age}.\n]][[They paid ${Fare} for the trip.\n]]",

    # Specify the column that holds the expected output (the target variable).
    # "Survived" is 1 if the passenger survived and 0 otherwise.
    output_column_name = "Survived",
)
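
To illustrate how the optional `[[...]]` sections behave, here is a minimal pure-Python sketch. It is not Unsloth's implementation: `render_merged_prompt` is a hypothetical helper that drops a bracketed section whenever one of its columns is missing (`None`) and fills in the values otherwise. Plain `{column}` placeholders outside of brackets are left untouched in this sketch.

```python
import re

OPTIONAL = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)  # matches one [[...]] section
FIELD = re.compile(r"\{(\w+)\}")                    # matches one {Column} placeholder


def render_merged_prompt(template: str, row: dict) -> str:
    """Fill each [[...]] section from `row`, skipping sections with missing values."""

    def render_section(match: re.Match) -> str:
        section = match.group(1)
        columns = FIELD.findall(section)
        if any(row.get(column) is None for column in columns):
            return ""  # a value is missing, so the whole section is dropped
        return FIELD.sub(lambda m: str(row[m.group(1)]), section)

    return OPTIONAL.sub(render_section, template)


template = "[[Their age is {Age}.\n]][[They paid ${Fare} for the trip.\n]]"

print(render_merged_prompt(template, {"Age": 22.0, "Fare": 7.25}))
print(render_merged_prompt(template, {"Age": None, "Fare": 7.8958}))
```

The second call produces only the fare sentence, which is the same behavior visible in the dataset printout further down for a passenger whose age is missing.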

Next is a more complex example that uses nearly all the columns in the dataset.

We also provide a setting called `conversation_extension`. It selects a few random rows in the dataset and combines them into 1 conversation. This lets the fine-tuned model handle not just 1 user input but many, allowing it to be a true chatbot like ChatGPT!

In [None]:
# Import the 'to_sharegpt' function from the 'unsloth' module.
# It converts a dataset into the ShareGPT conversation format.
from unsloth import to_sharegpt

# Process the existing dataset using the 'to_sharegpt' function.
# The function call transforms the dataset by adding a conversation prompt based on the provided template.
dataset = to_sharegpt(
    dataset,  # The original dataset to be transformed

    # 'merged_prompt' is a template string that constructs a narrative for each entry in the dataset.
    # It uses placeholders (e.g., {Embarked}, {Sex}) that will be replaced by the corresponding data values.
    merged_prompt = \
        "[[The passenger embarked from {Embarked}.]]"\
        "[[\nThey are {Sex}.]]"\
        "[[\nThey have {Parch} parents and children.]]"\
        "[[\nThey have {SibSp} siblings and spouses.]]"\
        "[[\nTheir passenger class is {Pclass}.]]"\
        "[[\nTheir age is {Age}.]]"\
        "[[\nThey paid ${Fare} for the trip.]]",

    # 'conversation_extension' combines several randomly selected rows into one multi-turn conversation.
    # With a value of 5, each conversation contains 5 user/assistant exchanges, which helps the
    # fine-tuned model handle more than a single user input.
    conversation_extension = 5,

    # 'output_column_name' names the existing column whose value becomes the assistant's answer.
    # Here that is "Survived", the target label the model learns to predict.
    output_column_name = "Survived",
)
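
The idea behind `conversation_extension` can be sketched in a few lines of plain Python. This is a simplified illustration, not Unsloth's actual algorithm: the hypothetical `extend_conversations` keeps every row as the first exchange of one conversation and appends `n - 1` other rows chosen at random, so the number of examples stays the same while each example becomes multi-turn.

```python
import random


def extend_conversations(turn_pairs, n, seed=0):
    """Group single-turn (prompt, answer) pairs into n-turn conversations."""
    rng = random.Random(seed)  # fixed seed so the grouping is reproducible
    conversations = []
    for index, first in enumerate(turn_pairs):
        # Candidate follow-up rows: every row except the one that starts this conversation.
        others = [pair for j, pair in enumerate(turn_pairs) if j != index]
        sampled = rng.sample(others, k=min(n - 1, len(others)))
        turns = []
        for prompt, answer in [first, *sampled]:
            turns.append({"from": "human", "value": prompt})
            turns.append({"from": "gpt", "value": answer})
        conversations.append({"conversations": turns})
    return conversations


pairs = [(f"Their age is {age}.", str(age % 2)) for age in range(20, 26)]
extended = extend_conversations(pairs, n=5)
print(len(extended), len(extended[0]["conversations"]))
```

With 6 input rows and `n=5`, this yields 6 conversations of 10 messages each (5 human turns alternating with 5 gpt turns).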


Let's print out what the dataset looks like now:

In [None]:
# Import the 'pprint' function from the 'pprint' module to enable pretty-printing.
from pprint import pprint

# Pretty-print the first entry of the converted dataset.
# This lets you inspect the conversation structure produced by to_sharegpt.
pprint(dataset[0])

{'conversations': [{'from': 'human',
                    'value': 'Their age is 22.0.\n'
                             'They paid $7.25 for the trip.\n'},
                   {'from': 'gpt', 'value': '0'},
                   {'from': 'human',
                    'value': 'Their age is 52.0.\n'
                             'They paid $79.65 for the trip.\n'},
                   {'from': 'gpt', 'value': '0'},
                   {'from': 'human',
                    'value': 'Their age is 9.0.\n'
                             'They paid $31.275 for the trip.\n'},
                   {'from': 'gpt', 'value': '0'},
                   {'from': 'human',
                    'value': 'They paid $7.8958 for the trip.\n'},
                   {'from': 'gpt', 'value': '0'},
                   {'from': 'human',
                    'value': 'Their age is 24.0.\n'
                             'They paid $13.0 for the trip.\n'},
                   {'from': 'gpt', 'value': '0'}]}


Finally, use `standardize_sharegpt`! It converts the conversation tags to the `role` / `content` style used by OpenAI and Hugging Face, since some datasets use different tags such as `human` for the `user` and `gpt` for the `assistant`. We require `user` and `assistant`.

In [None]:
# Import the 'standardize_sharegpt' function from the 'unsloth' module.
# This function converts all conversation tags in the dataset (such as those labeled 'human', 'gpt', or 'system')
# to the standard OpenAI Hugging Face style, i.e., ensuring tags are consistently 'user' and 'assistant'.
from unsloth import standardize_sharegpt

# Apply the standardization function to the dataset.
# This step ensures that all entries conform to the expected tag format,
# which is crucial for downstream processing or compatibility with systems that require 'user' and 'assistant' tags.
dataset = standardize_sharegpt(dataset)

# Import the 'pprint' function from the 'pprint' module to enable pretty-printing.
from pprint import pprint

# Pretty-print the first entry of the standardized dataset.
# This allows you to visually inspect the structure and confirm that the tags have been properly standardized.
pprint(dataset[0])

{'conversations': [{'content': 'Their age is 22.0.\n'
                               'They paid $7.25 for the trip.\n',
                    'role': 'user'},
                   {'content': '0', 'role': 'assistant'},
                   {'content': 'Their age is 52.0.\n'
                               'They paid $79.65 for the trip.\n',
                    'role': 'user'},
                   {'content': '0', 'role': 'assistant'},
                   {'content': 'Their age is 9.0.\n'
                               'They paid $31.275 for the trip.\n',
                    'role': 'user'},
                   {'content': '0', 'role': 'assistant'},
                   {'content': 'They paid $7.8958 for the trip.\n',
                    'role': 'user'},
                   {'content': '0', 'role': 'assistant'},
                   {'content': 'Their age is 24.0.\n'
                               'They paid $13.0 for the trip.\n',
                    'role': 'user'},
                   {'content': '0', 'role': 'assistant'}]}
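
Conceptually, the standardization step is a key and tag rename. The sketch below shows the mapping on a single example; `ROLE_MAP` and `standardize_turns` are hypothetical names used for illustration, and the real `standardize_sharegpt` may handle more tag variants than the three listed here.

```python
# Assumed mapping from ShareGPT-style tags to OpenAI / Hugging Face roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def standardize_turns(example: dict) -> dict:
    """Rename 'from'/'value' keys to 'role'/'content' and map the speaker tags."""
    return {
        "conversations": [
            {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
            for turn in example["conversations"]
        ]
    }


example = {
    "conversations": [
        {"from": "human", "value": "Their age is 22.0.\nThey paid $7.25 for the trip.\n"},
        {"from": "gpt", "value": "0"},
    ]
}
print(standardize_turns(example))
```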

### Customizable Chat Templates

You also need to specify a chat template. Previously, you could use the Alpaca format as shown below.

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

The issue is the Alpaca format has 3 fields, whilst OpenAI style chatbots must only use 2 fields (instruction and response). That's why we used the `to_sharegpt` function to merge these columns into 1.

* Now, you have to use `{INPUT}` for the instruction and `{OUTPUT}` for the response.

In [None]:
# Define a chat template with the two fields required by OpenAI-style chatbots.
# {INPUT} is replaced by the instruction (passenger details) and {OUTPUT} by the response (survival label).
chat_template = """Below describes some details about some passengers who went on the Titanic.
Predict whether they survived or perished based on their characteristics.
Output 1 if they survived, and 0 if they died.
>>> Passenger Details:
{INPUT}
>>> Did they survive?
{OUTPUT}"""

# Import the 'apply_chat_template' function from the 'unsloth' module.
# This function applies the chat template to every conversation in the dataset,
# filling {INPUT} with each user turn and {OUTPUT} with each assistant turn.
from unsloth import apply_chat_template

# Process the dataset with the defined chat template.
# - 'dataset': the standardized ShareGPT-style dataset with 'user' and 'assistant' turns.
# - 'tokenizer': the model's tokenizer; the same tokenizer formats prompts at inference time.
# - 'chat_template': the template that specifies how each turn is laid out.
# - 'default_system_message' (optional): a default system prompt, if one is required.
dataset = apply_chat_template(
    dataset,              # The dataset to be transformed
    tokenizer = tokenizer, # The tokenizer for processing text data
    chat_template = chat_template,  # The template that formats the conversation into {INPUT} and {OUTPUT}
    # default_system_message = "You are a helpful assistant", << [OPTIONAL]
)

Unsloth: We automatically added an EOS token to stop endless generations.



We also allow you to use an optional `{SYSTEM}` field. This is useful for Ollama when you want to use a custom system prompt (also like in ChatGPT).

You can also leave out the `{SYSTEM}` field and write plain text in its place.

```python
chat_template = """{SYSTEM}
USER: {INPUT}
ASSISTANT: {OUTPUT}"""
```

Use the template below if you want the Llama-3 prompt format. You must use the `instruct` model, not the `base` model, with this format!
```python
chat_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>

{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{OUTPUT}<|eot_id|>"""
```

For the ChatML format:
```python
chat_template = """<|im_start|>system
{SYSTEM}<|im_end|>
<|im_start|>user
{INPUT}<|im_end|>
<|im_start|>assistant
{OUTPUT}<|im_end|>"""
```
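
To show what the placeholders do, here is a minimal sketch that fills a template by plain string replacement. It is only an illustration: `render_turn` is a hypothetical helper, and Unsloth's `apply_chat_template` does more than this (for example, it handles multi-turn conversations and adds an EOS token). The sketch also assumes the inserted texts do not themselves contain placeholder strings.

```python
chatml_template = """<|im_start|>system
{SYSTEM}<|im_end|>
<|im_start|>user
{INPUT}<|im_end|>
<|im_start|>assistant
{OUTPUT}<|im_end|>"""


def render_turn(chat_template: str, user_text: str,
                assistant_text: str = "", system_text: str = "") -> str:
    """Fill the {SYSTEM}, {INPUT} and {OUTPUT} placeholders for one exchange."""
    return (
        chat_template
        .replace("{SYSTEM}", system_text)
        .replace("{INPUT}", user_text)
        .replace("{OUTPUT}", assistant_text)
    )


# Training example: the expected answer is part of the text.
print(render_turn(chatml_template, "Their age is 22.0.", "0", "You are a helpful assistant"))

# Inference prompt: everything up to the point where the answer would start.
generation_prefix = chatml_template.split("{OUTPUT}")[0]
print(render_turn(generation_prefix, "Their age is 22.0.", system_text="You are a helpful assistant"))
```

At training time the model sees the full text including the answer; at inference time the prompt stops right where `{OUTPUT}` would begin, so the model's next tokens are the prediction.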

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). To speed things up, you can cap training at 60 steps with `max_steps = 60`; the cell below instead runs one full epoch (`num_train_epochs = 1`). Unsloth also supports TRL's `DPOTrainer`!

In [None]:
# Import the SFTTrainer from Huggingface TRL, which is used for supervised fine-tuning of language models.
from trl import SFTTrainer
# Import TrainingArguments from Huggingface Transformers to configure training hyperparameters.
from transformers import TrainingArguments
# Import a utility function to check if bfloat16 is supported by the current hardware.
from unsloth import is_bfloat16_supported

# Initialize the SFTTrainer with the model, tokenizer, and training dataset.
# This trainer is designed to handle supervised fine-tuning using TRL's SFT framework.
trainer = SFTTrainer(
    model = model,                       # The pre-trained model to be fine-tuned.
    tokenizer = tokenizer,               # The tokenizer corresponding to the model.
    train_dataset = dataset,             # The training dataset, preprocessed (e.g., using to_sharegpt/apply_chat_template).
    dataset_text_field = "text",         # The field in the dataset that contains the text data for training.
    max_seq_length = max_seq_length,     # Maximum sequence length for each training example.
    dataset_num_proc = 2,                # Number of processes to use for dataset processing (speeding up preprocessing).
    packing = False,                     # Disable sequence packing. Enable for potentially 5x faster training on short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,         # Batch size per device.
        gradient_accumulation_steps = 4,           # Accumulate gradients over 4 steps to simulate a larger batch size.
        warmup_steps = 5,                          # Number of initial steps over which the learning rate ramps up linearly.
        num_train_epochs = 1,                      # Train for one full pass over the dataset.
        # max_steps = 60,                         # Uncomment for a quick run limited to 60 steps (overrides num_train_epochs).
        learning_rate = 2e-4,                      # Learning rate for the optimizer.
        fp16 = not is_bfloat16_supported(),        # Enable FP16 precision if BF16 is not supported.
        bf16 = is_bfloat16_supported(),            # Enable BF16 precision if supported by the hardware.
        logging_steps = 1,                         # Log training metrics every step.
        optim = "adamw_8bit",                      # Use the 8-bit AdamW optimizer for improved memory efficiency.
        weight_decay = 0.01,                       # Weight decay for regularization.
        lr_scheduler_type = "linear",              # Use a linear learning rate scheduler.
        seed = 3407,                               # Seed for reproducibility.
        output_dir = "outputs",                    # Directory where the trained model and checkpoints will be saved.
    ),
)


Unsloth: We found double BOS tokens - we shall remove one automatically.



In [None]:
# @title Show current memory stats

# Retrieve the properties (e.g., name, total memory) of the first GPU (index 0).
gpu_stats = torch.cuda.get_device_properties(0)

# Calculate the maximum GPU memory currently reserved by PyTorch (in GB),
# converting from bytes to gigabytes and rounding to 3 decimal places.
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)

# Retrieve the total GPU memory (in GB) from the GPU properties,
# converting from bytes to gigabytes and rounding to 3 decimal places.
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

# Print the GPU name and total memory capacity in gigabytes.
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")

# Print the currently reserved GPU memory in gigabytes.
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
5.496 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 891 | Num Epochs = 1 | Total steps = 111
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/4,582,543,360 (0.92% trained)
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: training4programmers (training4programmersllc) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.7801
2,1.7396
3,1.7824
4,1.6272
5,1.4156
6,1.2025
7,1.0147
8,0.8506
9,0.7157
10,0.6019
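
The `Total steps = 111` figure in the log can be reproduced from the training arguments, and the same arithmetic shows how the learning rate evolves. The sketch below is an approximation under stated assumptions: `steps_per_epoch` and `linear_lr` are hypothetical helpers, the handling of the last partial accumulation window can differ between library versions, and the schedule is the standard linear warmup followed by linear decay.

```python
import math


def steps_per_epoch(num_examples: int, per_device_batch: int,
                    grad_accum: int, num_gpus: int = 1) -> int:
    """Optimizer steps per epoch: batches per epoch divided by accumulation steps."""
    batches = math.ceil(num_examples / (per_device_batch * num_gpus))
    return max(batches // grad_accum, 1)


def linear_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * remaining


total = steps_per_epoch(num_examples=891, per_device_batch=2, grad_accum=4)
print(total)  # prints 111

for step in (0, 1, 5, 58, 111):
    print(step, linear_lr(step, total, warmup_steps=5, peak_lr=2e-4))
```

The learning rate reaches its peak of `2e-4` at step 5 (the end of warmup) and falls to 0 at the final step.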


In [None]:
# @title Show final memory and time stats

# Calculate the total peak GPU memory reserved during the entire training process.
# Convert from bytes to gigabytes and round to 3 decimal places.
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)

# Compute how much additional memory was used specifically for LoRA training
# by subtracting the initial memory usage from the final peak memory usage.
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)

# Calculate what percentage of the total GPU memory the peak usage represents.
used_percentage = round(used_memory / max_memory * 100, 3)

# Calculate what percentage of the total GPU memory was used specifically by LoRA training.
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

# Print the total training time in seconds.
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")

# Convert the total training time to minutes and print it.
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")

# Display the peak reserved GPU memory in gigabytes.
print(f"Peak reserved memory = {used_memory} GB.")

# Display how much of that peak memory was used specifically for LoRA training.
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")

# Display the peak memory usage as a percentage of the total GPU memory.
print(f"Peak reserved memory % of max memory = {used_percentage} %.")

# Display the LoRA-specific memory usage as a percentage of the total GPU memory.
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

192.0862 seconds used for training.
3.2 minutes used for training.
Peak reserved memory = 6.254 GB.
Peak reserved memory for training = 0.758 GB.
Peak reserved memory % of max memory = 15.81 %.
Peak reserved memory for training % of max memory = 1.916 %.


<a name="Inference"></a>
### Inference
Let's run the model! Unsloth makes inference natively 2x faster as well! Use prompts similar to the ones used for fine-tuning, otherwise you might get poor results!

In [None]:
# Enable fast inference using Unsloth's native support, which provides a 2x speedup.
FastLanguageModel.for_inference(model)  # Fast inference enabled!

# Define the prompt messages.
# IMPORTANT: Use prompts similar to the ones used during fine-tuning to get optimal results.
messages = [
    {"role": "user", "content": 'The passenger embarked from S.\n'\
                                'They are male.\n'\
                                'They have 1 siblings and spouses.\n'\
                                'Their passenger class is 3.\n'\
                                'Their age is 22.0.\n'\
                                'They paid $7.25 for the trip.'},
]

# Prepare input IDs for the model by applying the chat template.
# This process:
# - Formats the conversation history according to the expected template.
# - Adds a generation prompt for the model.
# - Converts the data into PyTorch tensors.
# The resulting tensor is moved to the GPU for efficient processing.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

# Import TextStreamer for streaming the generated text output.
from transformers import TextStreamer

# Initialize a text streamer that skips displaying the prompt.
# This means only the newly generated text will be shown.
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate text using the model.
# - The model processes the input_ids and streams the output via text_streamer.
# - max_new_tokens limits the number of tokens generated.
# - pad_token_id tells generate() which token ID to use for padding; here the EOS token ID is reused.
_ = model.generate(
    input_ids,
    streamer=text_streamer,
    max_new_tokens=128,
    pad_token_id=tokenizer.eos_token_id
)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


0<|end_of_text|>


Let's try another example:

In [None]:
# Enable fast inference using Unsloth's native support,
# which makes inference 2x faster compared to standard methods.
FastLanguageModel.for_inference(model)  # Fast inference enabled!

# Define the prompt messages.
# Note: Use prompts similar to those from fine-tuning for optimal results.
messages = [
    {"role": "user", "content": 'Their passenger class is 1.\n'\
                                'Their age is 22.0.\n'\
                                'They paid $107.25 for the trip.'},
]

# Prepare input IDs for the model using the tokenizer's chat template.
# This process:
# - Formats the conversation according to the expected template.
# - Adds a generation prompt to indicate where the model should start generating text.
# - Converts the formatted prompt into PyTorch tensors.
# The resulting tensor is moved to the GPU for efficient computation.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

# Import TextStreamer from Hugging Face Transformers.
# TextStreamer is used to stream the generated text output in real time.
from transformers import TextStreamer

# Create a TextStreamer instance.
# The 'skip_prompt' flag set to True means that the initial prompt text will not be reprinted in the output.
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate text using the model.
# Parameters:
# - input_ids: the tokenized input from the chat template.
# - streamer: the TextStreamer instance for streaming output.
# - max_new_tokens: limits the number of tokens to generate (128 tokens in this case).
# - pad_token_id: the token ID used for padding; here the EOS token ID is reused.
_ = model.generate(input_ids, streamer=text_streamer, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)


1<|end_of_text|>
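
The raw generations above end with the `<|end_of_text|>` token. If you want a readable label instead, a small post-processing step is enough. The helper below is a hypothetical sketch (it is not part of Unsloth or Transformers) and assumes the model answers with a bare `0` or `1`, as in the two examples above.

```python
def parse_prediction(generated_text: str, eos_token: str = "<|end_of_text|>") -> str:
    """Strip the end-of-text token and map the 0/1 label to a readable outcome."""
    label = generated_text.replace(eos_token, "").strip()
    if label == "1":
        return "survived"
    if label == "0":
        return "perished"
    return "unknown"  # anything else is treated as an unparseable answer


print(parse_prediction("0<|end_of_text|>"))  # prints perished
print(parse_prediction("1<|end_of_text|>"))  # prints survived
```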


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')
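
One way to confirm that only the adapters were written is to list the save directory with file sizes. This is a sketch using only the standard library; `list_saved_files` is a hypothetical helper, and the expectation that a LoRA save is far smaller than the full model weights is an assumption about a typical run, not a measurement from this notebook.

```python
from pathlib import Path


def list_saved_files(directory: str) -> dict[str, int]:
    """Map each file name directly inside `directory` to its size in bytes."""
    root = Path(directory)
    if not root.is_dir():
        return {}  # Nothing saved yet (or wrong path).
    return {p.name: p.stat().st_size for p in sorted(root.iterdir()) if p.is_file()}


# "lora_model" is the directory used by save_pretrained in the cell above.
for name, size in list_saved_files("lora_model").items():
    print(f"{name}: {size / 1e6:.1f} MB")
```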

To load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # Directory (or Hub repo) where the LoRA adapters were saved
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# messages = [                    # Change below!
#     {"role": "user", "content": 'Their passenger class is 3.\n'\
#                                 'Their age is 22.0.\n'\
#                                 'They paid $107.25 for the trip.'},
# ]
# input_ids = tokenizer.apply_chat_template(
#     messages,
#     add_generation_prompt = True,
#     return_tensors = "pt",
# ).to("cuda")

# from transformers import TextStreamer
# text_streamer = TextStreamer(tokenizer, skip_prompt = True)
# _ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

You can also use Hugging Face PEFT's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since downloading `4bit` models is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # WARNING: This approach uses Hugging Face PEFT's AutoPeftModelForCausalLM.
    # It is only recommended if you do not have the Unsloth package installed.
    # Unsloth provides faster inference (up to 2x) and can download pre-quantized 4-bit models,
    # which this method cannot, so this fallback can be significantly slower.

    # Import the necessary class from the peft library.
    # AutoPeftModelForCausalLM is used for loading models that have been fine-tuned with PEFT (Parameter-Efficient Fine-Tuning).
    from peft import AutoPeftModelForCausalLM

    # Import the AutoTokenizer from transformers for tokenization.
    # The tokenizer converts text to numerical tokens that the model can understand.
    from transformers import AutoTokenizer

    # Load the fine-tuned model for causal language modeling from a local directory or the Hub.
    # Replace "lora_model" with the path or identifier where the LoRA adapters were saved.
    # load_in_4bit requests 4-bit loading; because pre-quantized 4-bit downloads are not
    # supported on this path, loading can be much slower than with Unsloth.
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model",  # Path or Hub ID of the saved LoRA adapters
        load_in_4bit = load_in_4bit,  # Same flag used in the Unsloth loading cell
    )

    # Load the corresponding tokenizer for your model.
    # This tokenizer is essential for encoding input text into tokens and decoding the output tokens back into text.
    tokenizer = AutoTokenizer.from_pretrained("lora_model")


In [None]:
messages = [                    # Change below!
    {"role": "user", "content": 'Their passenger class is 3.\n'\
                                'Their age is 22.0.\n'\
                                'They paid $107.25 for the trip.'},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

1<|end_of_text|>
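
The streamed output is a label followed by the end-of-text token. To use the prediction programmatically, the text has to be reduced to an integer. The sketch below assumes the model emits exactly `0` or `1` before `<|end_of_text|>`, as in the output above; `parse_survival_label` is a hypothetical helper name.

```python
from typing import Optional


def parse_survival_label(generated_text: str,
                         eos_token: str = "<|end_of_text|>") -> Optional[int]:
    """Return 1 (survived) or 0 (died), or None if the text is not a clean label."""
    cleaned = generated_text.replace(eos_token, "").strip()
    if cleaned in ("0", "1"):
        return int(cleaned)
    return None  # Unexpected output; the caller decides how to handle it.


print(parse_survival_label("1<|end_of_text|>"))  # prints 1
```

To obtain the text as a string instead of streaming it, one option is to keep the return value of `model.generate` and decode only the new tokens, for example `tokenizer.decode(output_ids[0][input_ids.shape[-1]:])`.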


<a name="Ollama"></a>
### Ollama Support

[Unsloth](https://github.com/unslothai/unsloth) now allows you to automatically finetune and create a [Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md), and export to [Ollama](https://ollama.com/)! This makes finetuning much easier and provides a seamless workflow from `Unsloth` to `Ollama`!

Let's first install `Ollama`!

In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


Next, we shall save the model to [GGUF](https://huggingface.co/docs/hub/gguf) / [llama.cpp](https://github.com/ggml-org/llama.cpp)

Unsloth clones `llama.cpp` and saves to `q8_0` by default. Other methods such as `q4_k_m` are also supported. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

We also support saving to multiple GGUF options in a list fashion! This can speed things up by 10 minutes or more if you want multiple export formats!
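
To get a feel for what these methods mean for file size, the arithmetic can be sketched from the bits per weight of the underlying llama.cpp block formats. The parameter count of 8 billion is an assumption for illustration. Because `q4_k_m` and `q5_k_m` mix Q6_K into some tensors, real files for those methods come out somewhat larger than the pure Q4_K and Q5_K figures, and metadata adds a little more.

```python
# Bits per weight of the base llama.cpp block formats.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,     # 32 int8 weights plus one fp16 scale per block
    "q6_k": 6.5625,
    "q5_k": 5.5,
    "q4_k": 4.5,
}


def estimate_gguf_size_gb(n_params: float, quant: str) -> float:
    """Rough weight-only size in GB (10^9 bytes) for a given block format."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


n_params = 8e9  # Assumed parameter count, for illustration only.
for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{estimate_gguf_size_gb(n_params, quant):.1f} GB")
```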

In [None]:
# --------------------------------------------------------------------------------
# Save the model in GGUF format using various quantization methods.
# GGUF is a format designed for efficient model inference, and it supports
# various quantization methods to optimize performance and resource usage.
#
# The quantization methods supported include:
# - q8_0: A fast conversion method with high resource use, generally acceptable.
# - q4_k_m: Recommended; uses Q6_K for half of attention.wv and feed_forward.w2 tensors,
#           otherwise Q4_K.
# - q5_k_m: Recommended; uses Q6_K for half of attention.wv and feed_forward.w2 tensors,
#           otherwise Q5_K.
#
# Additionally, you can save to multiple GGUF options simultaneously, which can
# speed up the export process by 10 minutes or more if you require multiple formats.
# --------------------------------------------------------------------------------

# -------------------------------
# 1. Save to 8-bit Q8_0 GGUF
# -------------------------------
if True:
    # Here we save the model locally in GGUF format using the default quantization method Q8_0.
    # - "model": The local directory where the saved model will be stored.
    # - tokenizer: The tokenizer associated with the model.
    # This method, save_pretrained_gguf, converts and saves the model in the GGUF format.
    # Q8_0 is known for fast conversion although it uses high system resources.
    model.save_pretrained_gguf("model", tokenizer)


# -------------------------------------------------------------
# 2. Push the 8-bit GGUF model to the Hugging Face Hub
# -------------------------------------------------------------
if False:
    # Before pushing to the hub, ensure you have:
    # - A valid Hugging Face access token. You can obtain one from:
    #   https://huggingface.co/settings/tokens
    # - Updated the repository path "hf/model" with your actual Hugging Face username.
    #
    # The function push_to_hub_gguf uploads the model (in GGUF format) along with the tokenizer.
    # Here, it uses the default quantization method Q8_0.
    model.push_to_hub_gguf("hf/model", tokenizer, token = "")


# -------------------------------
# 3. Save to 16-bit GGUF (f16)
# -------------------------------
if False:
    # This block demonstrates saving the model in 16-bit precision (f16) format.
    # f16 keeps the weights at 16 bits instead of quantizing them further, so it produces the
    # largest GGUF file of the options shown here but avoids additional quantization loss.
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")

if False:
    # Similarly, you can push the 16-bit (f16) GGUF model to the Hugging Face Hub.
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")


# -----------------------------------
# 4. Save to q4_k_m GGUF
# -----------------------------------
if False:
    # The q4_k_m quantization method is recommended for many use cases.
    # It uses Q6_K for half of the attention.wv and feed_forward.w2 tensors and Q4_K
    # for the rest, balancing output quality against file size and memory use.
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

if False:
    # You can also push the model saved with q4_k_m quantization to the Hugging Face Hub.
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")


# ---------------------------------------------------------------------------------------
# 5. Save to multiple GGUF options simultaneously for faster export of multiple formats
# ---------------------------------------------------------------------------------------
if False:
    # This block shows how to push the model to the Hugging Face Hub in multiple quantization formats
    # at the same time. Specifying a list for the quantization_method argument triggers the export of:
    # - q4_k_m
    # - q8_0
    # - q5_k_m
    #
    # This is especially useful if you need the model in several formats, as it can reduce
    # the overall conversion time by 10 minutes or more.
    #
    # Reminder:
    # - Replace "hf/model" with your Hugging Face username and repository name.
    # - Ensure you have a valid Hugging Face token (obtainable from https://huggingface.co/settings/tokens).
    model.push_to_hub_gguf(
        "hf/model",                # Update "hf" to your actual Hugging Face username.
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m"],
        token = "",                # Insert your Hugging Face access token here.
    )


Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 61.82 out of 83.48 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:00<00:00, 54.82it/s]


Unsloth: Saving tokenizer... Done.
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at model into q8_0 GGUF format.
The output location will be /content/model/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> Q8_0, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### We removed it in GGUF's chat template for you.


Unsloth: Conversion completed! Output location: /content/model/unsloth.Q8_0.gguf
Unsloth: Saved Ollama Modelfile to model/Modelfile
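
A quick sanity check on the exported file is to read its header: a GGUF file begins with the 4-byte magic `GGUF` followed by a little-endian 32-bit version number. The sketch below reads only those 8 bytes; `read_gguf_header` is a hypothetical helper, and the path is the output location logged above.

```python
import os
import struct


def read_gguf_header(path: str) -> tuple[bytes, int]:
    """Return (magic, version) from the first 8 bytes of a file."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8:
        raise ValueError("file is too short to contain a GGUF header")
    magic = header[:4]
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32
    return magic, version


path = "/content/model/unsloth.Q8_0.gguf"  # Output location from the log above.
if os.path.exists(path):
    magic, version = read_gguf_header(path)
    print("valid GGUF magic:", magic == b"GGUF", "| version:", version)
```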


We use `subprocess` to start `Ollama` in a non-blocking fashion! On your own desktop, you can simply open a new terminal and type `ollama serve`, but in Colab we need this workaround.

In [None]:
import subprocess
import time

# Start a new process that runs the "ollama serve" command.
# This runs the command in the background without blocking the rest of the script.
subprocess.Popen(["ollama", "serve"])

# Wait for a few seconds for Ollama to load!
time.sleep(3)
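
A fixed three-second sleep may be too short on a slow machine and longer than needed on a fast one. A possible alternative, sketched below with only the standard library, is to poll the server address until it answers. `wait_for_server` is a hypothetical helper; the default URL is the address printed by the installer above, and the sketch assumes the server's root path answers with HTTP 200 once it is up.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str = "http://127.0.0.1:11434",
                    timeout_s: float = 30.0,
                    interval_s: float = 0.5) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not accepting connections yet; retry after a pause.
        time.sleep(interval_s)
    return False
```

Calling `wait_for_server()` right after `subprocess.Popen(["ollama", "serve"])` would then take the place of the fixed sleep, and a `False` return signals that the server never came up.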

[Ollama Model File Specification](https://github.com/ollama/ollama/blob/main/docs/modelfile.md)

`Ollama` needs a `Modelfile`, which specifies the model's prompt format. Let's print Unsloth's auto-generated one:

In [None]:
print(tokenizer._ollama_modelfile)

FROM {__FILE_LOCATION__}

TEMPLATE """Below describes some details about some passengers who went on the Titanic.
Predict whether they survived or perished based on their characteristics.
Output 1 if they survived, and 0 if they died.{{ if .Prompt }}
>>> Passenger Details:
{{ .Prompt }}{{ end }}
>>> Did they survive?
{{ .Response }}<|end_of_text|>"""

PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|reserved_special_token_"
PARAMETER temperature 1.5
PARAMETER min_p 0.1
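
To see what text the model receives for a given request, the `TEMPLATE` above can be approximated in plain Python. This is an illustrative re-implementation of the template's logic, not Ollama's own Go template renderer; `render_prompt` is a hypothetical name, and the response and `<|end_of_text|>` parts are left out because the model generates them.

```python
TEMPLATE_PREFIX = (
    "Below describes some details about some passengers who went on the Titanic.\n"
    "Predict whether they survived or perished based on their characteristics.\n"
    "Output 1 if they survived, and 0 if they died."
)


def render_prompt(user_prompt: str) -> str:
    """Approximate the text that precedes {{ .Response }} in the Modelfile template."""
    text = TEMPLATE_PREFIX
    if user_prompt:  # Mirrors the {{ if .Prompt }} ... {{ end }} block.
        text += "\n>>> Passenger Details:\n" + user_prompt
    text += "\n>>> Did they survive?\n"
    return text


print(render_prompt(
    "Their passenger class is 3.\nTheir age is 22.0.\nThey paid $107.25 for the trip."
))
```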


We will now create an `Ollama` model called `unsloth_model` using the auto-generated `Modelfile`!

In [None]:
!ollama create unsloth_model -f ./model/Modelfile

gathering model components

And now we can do inference on it via `Ollama`!

You can also upload to `Ollama` and try the `Ollama` Desktop app by heading to https://www.ollama.com/

In [None]:
!curl http://localhost:11434/api/chat -d '{ \
    "model": "unsloth_model", \
    "messages": [ \
        {"role": "user", \
         "content": "Their passenger class is 3.\nTheir age is 22.0.\nThey paid $107.25 for the trip."} \
    ] \
    }'

{"model":"unsloth_model","created_at":"2025-03-11T15:20:24.311224606Z","message":{"role":"assistant","content":"0"},"done":false}
{"model":"unsloth_model","created_at":"2025-03-11T15:20:24.337241022Z","message":{"role":"assistant","content":""},"done_reason":"stop","done":true,"total_duration":3599065520,"load_duration":3345391273,"prompt_eval_count":74,"prompt_eval_duration":221000000,"eval_count":2,"eval_duration":30000000}
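
The response above arrives as newline-delimited JSON: one object per streamed chunk, with the last one marked `"done": true`. A minimal sketch of collecting the assistant's text from such a stream follows; `collect_streamed_content` is a hypothetical helper, and the sample is a shortened copy of the output above.

```python
import json


def collect_streamed_content(ndjson_text: str) -> str:
    """Concatenate message.content across streamed chat chunks."""
    parts = []
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue  # Skip blank lines between chunks.
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # Final chunk reached.
    return "".join(parts)


sample = (
    '{"model":"unsloth_model","message":{"role":"assistant","content":"0"},"done":false}\n'
    '{"model":"unsloth_model","message":{"role":"assistant","content":""},'
    '"done_reason":"stop","done":true}\n'
)
print(collect_streamed_content(sample))  # prints 0
```

Adding `"stream": false` to the request body makes the endpoint return a single JSON object instead, which avoids the need to join chunks.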


And we're done! If you have questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk) channel!

Try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing) for just the Alpaca dataset.

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://ollama.com/"><img src="https://raw.githubusercontent.com/unslothai/unsloth/nightly/images/ollama.png" height="44"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>