<a href="https://colab.research.google.com/github/bpben/ben_friend_25/blob/main/ben-friend-sft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ben Needs a Friend: Supervised Fine-Tuning (SFT) with Unsloth
This notebook demonstrates an approach to fine-tuning a Llama 3.2 model (3B, Instruct) via SFT.

We're using [Unsloth](https://unsloth.ai/) to make this process more efficient, so that it fits on a pretty small GPU (NVIDIA T4).

I adapted this from [Unsloth's tutorial materials](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb).


In [1]:
%%capture
# required configuration for Colab environment
# (the %%capture cell magic has to be the first line of the cell)
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.14.0 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth
    # needed for loading the dataset
    !pip install -U datasets

## Model Setup

Here we initialize the model as a `FastLanguageModel`.  This is Unsloth's optimized version of a language model and allows us to do faster inference and training.

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # 4bit quantization to reduce memory usage

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.1: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]
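To get a feel for why `load_in_4bit = True` matters on a T4, here is a back-of-the-envelope sketch of the weight memory at 16-bit versus 4-bit precision. The parameter count is the rounded 3,000,000,000 that Unsloth prints in the training banner later in this notebook; the helper name `weight_memory_gb` is made up for illustration. It only counts the weights, so real usage (and the 2.35 GB download above, which probably keeps some tensors in higher precision and stores quantization constants) will be higher.

```python
# Rough weight-memory estimate. It ignores activations, optimizer state and
# quantization metadata, so actual memory usage will be higher.
PARAM_COUNT = 3_000_000_000  # rounded figure from the Unsloth training banner


def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Return the storage needed for the weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(PARAM_COUNT, 16)
int4_gb = weight_memory_gb(PARAM_COUNT, 4)
print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{int4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint, which leaves room on the GPU for activations and the adapter's optimizer state.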

Here we attach a Low-Rank Adaptation (LoRA) adapter for parameter-efficient fine-tuning.  The base weights stay frozen and only the small adapter matrices are trained.  We target a specific set of layers in the model:

`q/k/v_proj` - The query, key and value projections of the [attention mechanism](https://jalammar.github.io/illustrated-transformer/) in the transformer architecture.

`o_proj` - The output projection that maps the attention module's result back into the rest of the model.

`gate/up/down_proj` - The feed-forward network (FFNN) components (see the transformer diagram).


These are fairly standard parameter values, but they can be adjusted.

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # governs the rank of the LoRA matrix
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # dropout applied to the LoRA matrix, 0 is optimized in unsloth
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
)

Unsloth 2025.5.1 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
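As a sanity check on what `r = 16` means in practice, the sketch below counts the trainable LoRA parameters by hand. Each adapted linear layer gets two small matrices, one of shape `r x in_features` and one of shape `out_features x r`. The layer shapes are my assumptions about Llama 3.2 3B (they are not stated anywhere in this notebook), and `lora_params` is a hypothetical helper, not part of Unsloth or PEFT.

```python
# Layer shapes below are ASSUMPTIONS about Llama 3.2 3B (not from this notebook):
# hidden size 3072, MLP size 8192, 8 key/value heads of dimension 128, 28 layers.
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 3072, 8192, 8 * 128, 28
RANK = 16  # matches r = 16 in the get_peft_model call

# (in_features, out_features) for each targeted linear layer in one block
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable parameters")
```

If the assumed shapes are right, the total should line up with the "Trainable parameters" figure that Unsloth prints when training starts. Doubling `r` doubles this count.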


<a name="Data"></a>
### Data Prep
Here I used a dataset I previously processed that has pairs of exchanges between the major characters in Friends.  I've arranged it in the following format:

```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```

We need to convert it to the fine-tuning template for Llama 3.1:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What is 2+2?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

It's 4.<|eot_id|>
```
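To make the template concrete, here is a minimal hand-rolled sketch of the conversion. `render_llama3` is a hypothetical helper written for illustration; in the notebook the real work is done by `tokenizer.apply_chat_template`, whose template also injects a default system block (you can see it in the formatted output further down). This simplified version leaves that out.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Simplified Llama 3 style rendering of a list of role/content dicts."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


convo = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "It's 4."},
]
rendered = render_llama3(convo)
print(rendered)
```

For training we render complete conversations, so `add_generation_prompt` stays `False`. At inference time it is set to `True` so that the prompt ends with an open assistant header.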

In [4]:
from unsloth.chat_templates import get_chat_template, standardize_sharegpt
from datasets import load_dataset, Dataset, interleave_datasets

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    # utility to apply the chat template to the conversations
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

# pulls from huggingface
# script dataset is product of another notebook
dataset = load_dataset("bpben/friends_script", split='train')
print("\nRaw")
display(dataset[0]['conversations'])

# apply template
print("\nFine-tuning formatted")
dataset = dataset.map(formatting_prompts_func, batched = True,)
display(dataset[0]['text'])

README.md:   0%|          | 0.00/428 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.01M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/13030 [00:00<?, ? examples/s]


Raw


[{'content': "There's nothing to tell! He's just some guy I work with!",
  'role': 'user'},
 {'content': "C'mon, you're going out with the guy! There's gotta be something wrong with him!",
  'role': 'assistant'},
 {'content': 'All right Joey, be nice. So does he have a hump? A hump and a hairpiece?',
  'role': 'user'},
 {'content': 'Wait, does he eat chalk?', 'role': 'assistant'}]


Fine-tuning formatted


Map:   0%|          | 0/13030 [00:00<?, ? examples/s]

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThere's nothing to tell! He's just some guy I work with!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nC'mon, you're going out with the guy! There's gotta be something wrong with him!<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nAll right Joey, be nice. So does he have a hump? A hump and a hairpiece?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nWait, does he eat chalk?<|eot_id|>"

I want to mix some standard instruction-tuning data into this conversation data.  The idea here is to avoid [catastrophic forgetting](https://en.wikipedia.org/wiki/Catastrophic_interference).  I want it to still be helpful, even if the text it's trained on isn't...quite.

In [5]:
# want to additionally interleave data from an actual instruction tuning dataset
# formatting_prompts_func (defined in the previous cell) is reused to apply the chat template
inst_dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
# convert ShareGPT-style records into role/content messages
inst_dataset = standardize_sharegpt(inst_dataset)
inst_dataset = inst_dataset.map(formatting_prompts_func, batched = True,)
# alternate examples from the two datasets
combined_dataset = interleave_datasets([dataset, inst_dataset], seed = 3407)
combined_dataset[1]['text']

README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nExplain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code t
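It helps to know how much of each dataset actually ends up in the mix. The sketch below imitates the interleaving in plain Python, assuming `interleave_datasets` is left on its defaults (no sampling probabilities, stop when the first dataset is exhausted). `interleave_first_exhausted` is a made-up stand-in, not the `datasets` implementation.

```python
from itertools import chain


def interleave_first_exhausted(*sources):
    """Alternate between sources, stopping when the shortest one runs out."""
    # zip() stops at the shortest input, which gives the assumed stopping rule.
    return list(chain.from_iterable(zip(*sources)))


# Stand-ins with the same sizes as the two datasets loaded above.
friends = [f"friends_{i}" for i in range(13_030)]
instruct = [f"instruct_{i}" for i in range(100_000)]

mixed = interleave_first_exhausted(friends, instruct)
print(len(mixed))
print(mixed[:4])
```

Under that assumption the mix is an even split: every Friends exchange is used, but only the first 13,030 of the 100,000 instruction examples. If you want a different ratio, `interleave_datasets` accepts a `probabilities` argument.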

Here I want to run inference on the same example from the Unsloth tutorial.  This will give us a baseline and we will see how the model changes after training.

Interestingly, the tutorial uses `min_p = 0.1` and `temperature = 1.5`, since better results [have reportedly been observed](https://x.com/menhguin/status/1826132708508213629) with these parameters.
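For intuition about what these two knobs do, here is a small pure-Python sketch of min-p filtering on a toy set of logits. Temperature flattens the distribution, then min-p drops every token whose probability is below `min_p` times the top token's probability. Applying temperature first is an assumption on my part about the order, and `min_p_filter` is a hypothetical helper, not the library's sampler.

```python
import math


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return the renormalized sampling distribution after min-p filtering."""
    # Temperature > 1 flattens the distribution (assumed to be applied first).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only tokens at least min_p times as likely as the best token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


probs = min_p_filter([2.0, 1.0, 0.0, -3.0])
print([round(p, 3) for p in probs])
```

The high temperature makes the plausible tokens more evenly likely, while the min-p cutoff removes the unlikely tail that a high temperature would otherwise let through.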

In [6]:
# testing out its attitude
messages = [
    {"role": "user", "content": "What are you up to tonight?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat are you up to tonight?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI'm just a language model, I don't have have personal plans or experiences, but I'm here to help you with any questions or tasks you have. I'll be available 24/7 to assist you. How can I help you tonight?<|eot_id|>"]

<a name="Train"></a>
### Train the model
Now let's set up the trainer ([TRL SFT](https://huggingface.co/docs/trl/sft_trainer)).  This will take a bit as it collates all the information.

We're just going to train on a small subset of these data, and you will see it already breaks the model!

In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Setting this to True can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # running this just for a while, can edit accordingly
        max_steps = 500,
        #num_train_epochs=1,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 50,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map (num_proc=2):   0%|          | 0/26060 [00:00<?, ? examples/s]
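To see how small the "small subset" is, the arithmetic below works through the settings in the `TrainingArguments` above: the effective batch size, how many examples 500 steps cover, and what the linear schedule with warmup does to the learning rate. `linear_lr` is my own sketch of a linear warmup followed by linear decay to zero, which is how I understand `lr_scheduler_type = "linear"`; treat it as an approximation rather than the trainer's exact code.

```python
# Values copied from the TrainingArguments cell above.
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
NUM_GPUS = 1
MAX_STEPS = 500
WARMUP_STEPS = 5
PEAK_LR = 2e-4
NUM_EXAMPLES = 26_060  # size of combined_dataset

# One optimizer step consumes this many examples.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
examples_seen = effective_batch * MAX_STEPS
fraction_of_data = examples_seen / NUM_EXAMPLES


def linear_lr(step: int) -> float:
    """Assumed schedule: linear warmup to PEAK_LR, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, MAX_STEPS - step)
    return PEAK_LR * remaining / (MAX_STEPS - WARMUP_STEPS)


print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} ({fraction_of_data:.1%} of the data)")
for step in (0, 5, 250, 500):
    print(f"step {step:>3}: lr = {linear_lr(step):.2e}")
```

With `max_steps = 500` the run covers only a fraction of one epoch. Swapping in `num_train_epochs = 1` (the commented-out line) would go through the full dataset instead.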

We also use Unsloth's `train_on_responses_only` function to only train on the assistant outputs and ignore the loss on the user's inputs.

In [8]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=2):   0%|          | 0/26060 [00:00<?, ? examples/s]

To verify that the masking was actually applied, we check that only tokens from the assistant still carry "labels".  Everything else is set to `-100`, which the loss ignores.

In [9]:
tokenizer.decode(trainer.train_dataset[0]["input_ids"])

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThere's nothing to tell! He's just some guy I work with!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nC'mon, you're going out with the guy! There's gotta be something wrong with him!<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nAll right Joey, be nice. So does he have a hump? A hump and a hairpiece?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nWait, does he eat chalk?<|eot_id|>"

In [10]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[0]["labels"]])

"                                                  C'mon, you're going out with the guy! There's gotta be something wrong with him!<|eot_id|>                                Wait, does he eat chalk?<|eot_id|>"

We can see the System and Instruction prompts are successfully masked!
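The masking idea can be sketched on toy token IDs. Labels start out as `-100` (the default `ignore_index` of PyTorch's cross-entropy loss), and only the tokens that follow a response marker are copied over from the inputs. `mask_non_assistant` and the toy IDs are invented for illustration; the real `train_on_responses_only` works on the tokenized header strings passed in above.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def mask_non_assistant(input_ids, instruction_marker, response_marker):
    """Keep labels only for tokens that come after a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    in_response = False
    i = 0
    while i < len(input_ids):
        if input_ids[i:i + len(response_marker)] == response_marker:
            in_response = True  # assistant turn starts after this marker
            i += len(response_marker)
        elif input_ids[i:i + len(instruction_marker)] == instruction_marker:
            in_response = False  # user turn: stop copying labels
            i += len(instruction_marker)
        else:
            if in_response:
                labels[i] = input_ids[i]
            i += 1
    return labels


# Toy vocabulary: 0 = begin of text, 1 = user header, 2 = assistant header,
# 9 = end of turn, everything else = ordinary content tokens.
USER, ASSISTANT = [1], [2]
input_ids = [0, 1, 10, 11, 9, 2, 20, 21, 9, 1, 12, 9, 2, 22, 9]
labels = mask_non_assistant(input_ids, USER, ASSISTANT)
print(labels)
```

As in the decoded output above, the end-of-turn token after each assistant reply keeps its label, so the model still learns when to stop talking.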

In [11]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 26,060 | Num Epochs = 1 | Total steps = 5
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856/3,000,000,000 (0.81% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss


<a name="Inference"></a>
### Inference after training

In [13]:
messages = [
    {"role": "user", "content": "What are you up to tonight?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat are you up to tonight?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI'm just a language model, I don't have personal experiences or emotions, so I don't have plans or activities like humans do. However, I'm always available and ready to help with any questions or topics you'd like to discuss! How about you? Do you have any fun plans for the evening?<|eot_id|>"]

In [None]:
# load and predict
from unsloth import FastLanguageModel
import torch
from unsloth.chat_templates import get_chat_template

model_id = "bpben/ben_friend_sft"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    load_in_4bit = True,
)


In [None]:
messages = [
    {"role": "user", "content": "Whoa what a night!"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

### A note on usage with Ollama
Ollama (as of this writing) does not appear to accept the safetensors format for adapters, or at least not the format that Unsloth pushes to the hub.

As a result, you have two options. You can use Unsloth's own method to export the entire model to GGUF: `model.save_pretrained_gguf('sft_friend', tokenizer)`

Or you can follow the steps here: https://sarinsuriyakoon.medium.com/unsloth-lora-with-ollama-lightweight-solution-to-full-cycle-llm-development-edadb6d9e0f0

For the tutorial example, I used the second option, as I just wanted the adapter, not the entire model.