NOTE: This notebook follows this sample notebook (with some modifications): https://colab.research.google.com/drive/1T5-zKWM_5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing#scrollTo=IqM-T1RTzY6C


In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git@nightly git+https://github.com/unslothai/unsloth-zoo.git
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2

In [None]:
#ending = "_3e-4LR_10k_1b_test"  # change for each new run (keep it unchanged to resume from an existing checkpoint)
#ending = "_3e-4LR_R32_10k_1b_test"
#ending = "_3e-4LR_10k_3b_test"
ending = "_3e-4LR_r32_10k_1b_val"

In [None]:
from google.colab import drive
drive.mount("/content/drive") # add force_remount=True to force a fresh mount

output_dir = f"/content/drive/MyDrive/unsloth_runs{ending}/llama32_3b_lora{ending}"

Mounted at /content/drive


In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",

    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit" # NEW! Llama 3.3 70B!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct", # or choose "unsloth/Llama-3.2-3B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.6: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0! Suggested 8, 16, 32, 64, 128. Worth experimenting with: most runs used 16, this run tests 32.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32, # Most runs used 16; this run tests 32.
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False, # We support rank stabilized LoRA. False for most runs; kept that way because training looked stable.
    loftq_config = None, # And LoftQ
)

Unsloth 2025.11.6 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
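
As a sanity check on the LoRA settings above, the number of trainable parameters can be estimated by hand. The sketch below is only an illustration: the layer dimensions are assumed values for Llama-3.2-1B (hidden size 2048, MLP size 8192, 16 layers, key/value projection width 512) and are not stated anywhere in this notebook, and lora_trainable_params and lora_scale are hypothetical helper names. Under those assumptions, r = 32 gives the same count that the training banner further down reports (22,544,384).

```python
import math

# Assumed Llama-3.2-1B dimensions (not taken from this notebook).
HIDDEN = 2048
INTERMEDIATE = 8192
NUM_LAYERS = 16
KV_DIM = 512  # assumed: 8 key/value heads x head dimension 64

# (in_features, out_features) for every module listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_trainable_params(r, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted matrix gains A (r x in) and B (out x r), i.e. r * (in + out) weights."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers

def lora_scale(r, lora_alpha, use_rslora=False):
    """Factor applied to the low-rank update: alpha / r, or alpha / sqrt(r) with rsLoRA."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

print(lora_trainable_params(32))  # r = 32, as in this run
print(lora_trainable_params(16))  # r = 16, as in most earlier runs
print(lora_scale(32, 32))         # scaling with lora_alpha = 32, use_rslora = False
```

Doubling r doubles the adapter size, while setting lora_alpha equal to r keeps the update scaling at 1.0 in both configurations.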


Data Prep
We now use the Llama-3.1 format for conversation-style finetunes. We use Maxime Labonne's FineTome-100k dataset in ShareGPT style, but convert it to HuggingFace's normal multi-turn format ("role", "content") instead of ("from", "value"). Llama-3 renders multi-turn conversations like below:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
We use our get_chat_template function to get the correct chat template. We support zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3 and more.
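
To make the layout shown above concrete, here is a minimal pure-Python sketch that renders a list of messages in that simplified Llama-3 style. The function name render_llama3 is hypothetical and this is not how get_chat_template or the tokenizer's template is implemented; the real llama-3.1 template also inserts a default system header with dates, as the rendered example later in this notebook shows.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Render role/content messages in the simplified Llama-3 layout shown above."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
print(render_llama3(conversation))
```

For training data the generation prompt stays off, because every assistant turn is already present; for inference it is switched on so that the model writes the next assistant turn.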

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset
#dataset = load_dataset("mlabonne/FineTome-100k", split = "train") #original (full dataset)
dataset = load_dataset("mlabonne/FineTome-100k", split="train[:15000]")



We now use standardize_sharegpt to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:

{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
to

{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
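
The conversion described above amounts to renaming two keys and mapping the speaker names. The sketch below illustrates the idea on the same three turns; ROLE_MAP and to_role_content are hypothetical names, and the actual standardize_sharegpt function may handle more aliases and edge cases than this.

```python
# Assumed speaker mapping, based only on the example above.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(turns):
    """Convert ShareGPT-style turns ("from", "value") to ("role", "content")."""
    return [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]} for turn in turns]

sharegpt_turns = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
for turn in to_role_content(sharegpt_turns):
    print(turn)
```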

In [None]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Unsloth: Standardizing formats (num_proc=8):   0%|          | 0/15000 [00:00<?, ? examples/s]

Map:   0%|          | 0/15000 [00:00<?, ? examples/s]

We look at how the conversations are structured for item 5:

In [None]:
dataset[5]["conversations"]

[{'content': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?',
  'role': 'user'},
 {'content': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.',
  'role': 'assistant'}]

And we see how the chat template transformed these conversations.

In [None]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

Train the model
Now let's use Hugging Face TRL's SFTTrainer! More docs here: TRL SFT docs. The sample notebook runs only 60 steps to speed things up; this run instead sets num_train_epochs=1 for one full pass over the training split and leaves max_steps unset. We also support TRL's DPOTrainer!

In [None]:
split_set = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = split_set["train"]
val_and_eval = split_set["test"]

split_set_2 = val_and_eval.train_test_split(test_size=0.5, seed=42)
val_dataset = split_set_2["train"]
test_dataset = split_set_2["test"]
print(len(train_dataset))
print(len(val_dataset))
print(len(test_dataset))

12000
1500
1500
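
The two-stage split above gives an 80/10/10 partition: 20% is held out first, and that holdout is then halved into validation and test sets. The standard-library sketch below reproduces only the size arithmetic; three_way_split is a hypothetical helper and its shuffling differs from the datasets library, so the row assignment will not match the real split.

```python
import random

def three_way_split(items, holdout_fraction=0.2, seed=42):
    """Shuffle once, hold out a fraction, then halve the holdout into val and test."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
    n_test = len(holdout) // 2
    test, val = holdout[:n_test], holdout[n_test:]
    return train, val, test

train_ids, val_ids, test_ids = three_way_split(range(15000))
print(len(train_ids), len(val_ids), len(test_ids))
```

Keeping the test set untouched until the end means the validation loss can guide checkpoint selection without leaking into the final evaluation.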


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

"""trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = output_dir,
        report_to = "none", # Use this for WandB etc
        save_steps = 20,
        save_total_limit = 3,
    ),
)"""


trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = val_dataset, # swap in test_dataset for the final evaluation after fine-tuning
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 100,
        num_train_epochs = 1, # Set this for 1 full training run.
        #max_steps = 60,
        #learning_rate = 2e-4, #Base
        #learning_rate = 1e-4,  #second attempt
        learning_rate = 3e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = output_dir,
        report_to = "none", # Use this for WandB etc
        save_strategy = "steps", # explicit, though "steps" is already the TrainingArguments default
        save_steps = 250,       #225 would have been better perhaps.
        save_total_limit = 3,

        #evaluation additions:
        eval_strategy = "steps",
        eval_steps = 250,     #225 would have been better perhaps.
        load_best_model_at_end = True,
        metric_for_best_model = "eval_loss",
        greater_is_better = False,
        #remove_unused_columns=False,
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=12):   0%|          | 0/12000 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=12):   0%|          | 0/1500 [00:00<?, ? examples/s]
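
The training arguments above determine the number of optimizer steps and the shape of the learning-rate curve. The sketch below works through that arithmetic under two assumptions: a single GPU, and a linear schedule that ramps up over the warmup steps and then decays linearly to zero. The helper names are hypothetical and this is not the Trainer's own code.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs=1, num_gpus=1):
    """Optimizer steps = examples / (batch size x accumulation steps x GPUs), per epoch."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    return math.ceil(num_examples / effective_batch) * epochs

def linear_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * remaining

steps = total_optimizer_steps(12000, per_device_batch=2, grad_accum=4)
print(steps)          # optimizer steps in one epoch
print(steps // 250)   # evaluations and checkpoints at eval_steps = save_steps = 250
for step in (50, 100, 800, 1500):
    print(step, linear_lr(step, peak_lr=3e-4, warmup_steps=100, total_steps=steps))
```

With these settings the learning rate peaks at step 100 and is back to half its peak by step 800, so the later checkpoints are trained with progressively smaller updates.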

We also use Unsloth's train_on_responses_only method to train only on the assistant outputs and ignore the loss on the user's inputs.

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=12):   0%|          | 0/12000 [00:00<?, ? examples/s]

Map (num_proc=12):   0%|          | 0/1500 [00:00<?, ? examples/s]

We verify masking is actually done:

In [None]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

"<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nLet x be 40 percent greater than 88, and let y be 25 percent less than x. If z is the harmonic mean of x and y, find the value of z.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nFirst, let's find the value of x, which is 40 percent greater than 88.\n\nx = 88 + (40/100) * 88\nx = 88 + 0.4 * 88\nx = 88 + 35.2\nx = 123.2\n\nNow, let's find the value of y, which is 25 percent less than x.\n\ny = x - (25/100) * x\ny = 123.2 - 0.25 * 123.2\ny = 123.2 - 30.8\ny = 92.4\n\nThe harmonic mean (HM) of two numbers x and y is given by the formula:\n\nHM = 2 / ((1/x) + (1/y))\n\nLet's calculate the harmonic mean of x and y:\n\nHM = 2 / ((1/123.2) + (1/92.4))\nHM = 2 / ((0.00811359) + (0.01082251))\nHM = 2 / (0.0189361)\nHM = 2 / 0.0189361\nHM ≈ 105.6\n\nTherefore, the value of z, which i

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

"                                                                           First, let's find the value of x, which is 40 percent greater than 88.\n\nx = 88 + (40/100) * 88\nx = 88 + 0.4 * 88\nx = 88 + 35.2\nx = 123.2\n\nNow, let's find the value of y, which is 25 percent less than x.\n\ny = x - (25/100) * x\ny = 123.2 - 0.25 * 123.2\ny = 123.2 - 30.8\ny = 92.4\n\nThe harmonic mean (HM) of two numbers x and y is given by the formula:\n\nHM = 2 / ((1/x) + (1/y))\n\nLet's calculate the harmonic mean of x and y:\n\nHM = 2 / ((1/123.2) + (1/92.4))\nHM = 2 / ((0.00811359) + (0.01082251))\nHM = 2 / (0.0189361)\nHM = 2 / 0.0189361\nHM ≈ 105.6\n\nTherefore, the value of z, which is the harmonic mean of x and y, is approximately 105.6.<|eot_id|>"

We can see the System and Instruction prompts are successfully masked!
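
The masking verified above can be pictured as a scan over the token ids: everything is labelled -100 (the value the loss ignores) except the tokens that follow a response marker, up to the next instruction marker. The sketch below uses made-up integer ids instead of real tokenizer output, and mask_non_responses is a hypothetical helper rather than Unsloth's train_on_responses_only implementation.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens inside assistant responses; mask the rest."""
    labels = [IGNORE_INDEX] * len(input_ids)
    in_response = False
    i = 0
    while i < len(input_ids):
        if input_ids[i:i + len(response_ids)] == response_ids:
            in_response = True          # marker itself stays masked
            i += len(response_ids)
        elif input_ids[i:i + len(instruction_ids)] == instruction_ids:
            in_response = False
            i += len(instruction_ids)
        else:
            if in_response:
                labels[i] = input_ids[i]
            i += 1
    return labels

# Toy ids: [1, 2] stands in for the user header, [1, 3] for the assistant header,
# 9 for the end-of-turn token, 0 for the begin-of-text token.
toy_ids = [0, 1, 2, 10, 11, 9, 1, 3, 20, 21, 9, 1, 2, 12, 9]
print(mask_non_responses(toy_ids, instruction_ids=[1, 2], response_ids=[1, 3]))
```

As in the decoded labels above, the end-of-turn token that closes the assistant reply stays unmasked, so the model still learns when to stop.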

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
1.203 GB of memory reserved.


In [None]:
next(model.parameters()).device

device(type='cuda', index=0)

In [None]:
import os
from transformers.trainer_utils import get_last_checkpoint

trainer_stats = None
last_checkpoint = None
if os.path.isdir(output_dir):
  try:
    last_checkpoint = get_last_checkpoint(output_dir)
  except Exception:
    last_checkpoint = None

if last_checkpoint is not None:
  print(f"Continuing training from checkpoint: {last_checkpoint}")
  trainer_stats = trainer.train(resume_from_checkpoint=last_checkpoint)
else: # No checkpoint found, so train as usual.
  print("Training from scratch!")
  trainer_stats = trainer.train()

Training from scratch!


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 12,000 | Num Epochs = 1 | Total steps = 1,500
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 22,544,384 of 1,258,358,784 (1.79% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,Validation Loss
250,0.9636,0.877421
500,0.8926,0.859224
750,0.8833,0.844791
1000,0.8301,0.83448
1250,0.8257,0.825617
1500,0.8177,0.821687


Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient
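
The resume logic in the training cell depends on finding the newest checkpoint folder inside output_dir. The sketch below shows one way such a lookup can work, assuming checkpoints are saved in folders named checkpoint-<step>; find_last_checkpoint is a hypothetical stand-in and not the transformers get_last_checkpoint function itself.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def find_last_checkpoint(folder):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    if not os.path.isdir(folder):
        return None
    steps = []
    for name in os.listdir(folder):
        match = CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(folder, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(folder, f"checkpoint-{max(steps)}")

with tempfile.TemporaryDirectory() as run_dir:
    print(find_last_checkpoint(run_dir))  # empty run folder
    for step in (250, 500, 1000):
        os.mkdir(os.path.join(run_dir, f"checkpoint-{step}"))
    print(os.path.basename(find_last_checkpoint(run_dir)))
```

Comparing the step numbers as integers matters here: sorted as text, checkpoint-500 would come after checkpoint-1000.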


In [None]:
# Same validation set as during training; swap in the held-out test set for the final evaluation.
eval_results = trainer.evaluate()
print(eval_results)

{'eval_loss': 0.8216844201087952, 'eval_runtime': 257.3966, 'eval_samples_per_second': 5.828, 'eval_steps_per_second': 2.914, 'epoch': 1.0}
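
The evaluation result above matches the last row of the training log, which is what one would expect when load_best_model_at_end is combined with metric_for_best_model = "eval_loss" and greater_is_better = False. The sketch below replays that selection on the logged values; pick_best is a hypothetical helper that illustrates the intent of those settings and is not the Trainer's code.

```python
# (step, training loss, validation loss) copied from the training log above.
TRAINING_LOG = [
    (250, 0.9636, 0.877421),
    (500, 0.8926, 0.859224),
    (750, 0.8833, 0.844791),
    (1000, 0.8301, 0.83448),
    (1250, 0.8257, 0.825617),
    (1500, 0.8177, 0.821687),
]

def pick_best(log, greater_is_better=False):
    """Pick the logged row with the best validation loss."""
    chooser = max if greater_is_better else min
    return chooser(log, key=lambda row: row[2])

best_step, _, best_val_loss = pick_best(TRAINING_LOG)
print(best_step, best_val_loss)
```

Because the validation loss was still falling at the final step, the best checkpoint and the last checkpoint coincide in this run.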


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

5259.5713 seconds used for training.
87.66 minutes used for training.
Peak reserved memory = 3.018 GB.
Peak reserved memory for training = 1.815 GB.
Peak reserved memory % of max memory = 20.474 %.
Peak reserved memory for training % of max memory = 12.313 %.


Inference
Let's run the model! You can change the user message in messages below.


We use min_p = 0.1 and temperature = 1.5. Read this Tweet for more information on why. (https://x.com/menhguin/status/1826132708508213629)

You can also use a TextStreamer for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
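
To give a feel for what min_p = 0.1 does at temperature = 1.5, the sketch below applies both to a small list of made-up logits. It assumes the temperature is applied before the min_p cut, and that min_p keeps every token whose probability is at least min_p times the top probability; min_p_filter is a hypothetical helper, not the transformers implementation.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return {token index: renormalized probability} after temperature and min_p."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # subtract peak for stability
    total = sum(exps)
    probs = [value / total for value in exps]
    threshold = min_p * max(probs)
    kept = [(index, prob) for index, prob in enumerate(probs) if prob >= threshold]
    norm = sum(prob for _, prob in kept)
    return {index: prob / norm for index, prob in kept}

toy_logits = [2.0, 1.0, 0.0, -3.0]  # made-up values for four candidate tokens
for index, prob in min_p_filter(toy_logits).items():
    print(index, round(prob, 3))
```

The high temperature flattens the distribution so that more tokens get a real chance, while the min_p cut removes the unlikely tail that a flat distribution would otherwise sample from.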

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


The Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding numbers. The sequence is: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, and so on.

The pattern of this sequence is such that each number is the sum of the two preceding numbers:
1 (sum of 0 and 1) + 1 = 2
1 + 1 (sum of 1 and 2) = 2
2 + 3 (sum of


In [None]:
model.save_pretrained(f"lora_model{ending}") # Local saving
tokenizer.save_pretrained(f"lora_model{ending}")
model.push_to_hub(f"ViktorMardskog/NON_GGUF_lora_model{ending}", token = "") # Online saving
tokenizer.push_to_hub(f"ViktorMardskog/NON_GGUF_lora_model{ending}", token = "") # Online saving

FastLanguageModel.for_inference(model)
model.save_pretrained_gguf(f"lora_model{ending}", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf(f"ViktorMardskog/lora_model{ending}", tokenizer, quantization_method = "q4_k_m", token = "")

