# DataCamp - Fine-tune Llama 3.1 8B
> 🗣️ [Large Language Model Course](https://github.com/mlabonne/llm-course)

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne).

Add `HF_TOKEN` in the Secrets tab to store your [Hugging Face access token](https://huggingface.co/settings/tokens) in Colab.

![](https://i.imgur.com/VyPwxqa.png)
![](https://i.imgur.com/LXdQpUh.png)
![](https://i.imgur.com/urRLLyC.png)

In [None]:
!pip install -qqq "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" --progress-bar off
from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install -qqq --no-deps {xformers} trl peft accelerate bitsandbytes triton --progress-bar off

# Import unsloth before trl/transformers so that its patches are applied first
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.chat_templates import get_chat_template

import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## 1. Load model for PEFT

![](https://i.imgur.com/2CgewGd.png)
![](https://i.imgur.com/Y8qsNvf.png)

We load the model in 4-bit precision and attach LoRA adapters, a parameter-efficient fine-tuning (PEFT) technique, to reduce VRAM usage and speed up training.

In [None]:
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)

==((====))==  Unsloth 2025.9.4: Fast Llama patching. Transformers: 4.56.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
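
To see why 4-bit loading matters on this GPU, here is a rough back-of-the-envelope sketch of the weight memory alone. The parameter count is an assumption (the commonly cited size of Llama 3.1 8B), and the helper `weight_memory_gib` is hypothetical. The estimate ignores activations, optimizer state, quantization metadata, and any tensors kept in 16-bit, so treat the 4-bit figure as a lower bound.

```python
# Rough weight-memory estimate; not a measurement.
N_PARAMS = 8_030_261_248  # assumed parameter count of Llama 3.1 8B
T4_MEMORY_GIB = 14.741    # "Max memory" reported in the Unsloth banner above

def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Approximate size of the weights in GiB (2**30 bytes)."""
    return n_params * bits_per_param / 8 / 2**30

fp16_gib = weight_memory_gib(N_PARAMS, 16)
four_bit_gib = weight_memory_gib(N_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.2f} GiB (fits on T4: {fp16_gib < T4_MEMORY_GIB})")
print(f"4-bit weights:  {four_bit_gib:.2f} GiB (fits on T4: {four_bit_gib < T4_MEMORY_GIB})")
```

Under these assumptions, the 16-bit weights alone would not fit in the T4's memory, while the 4-bit weights leave room for adapters, activations, and optimizer state.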


In [None]:
model = FastLanguageModel.get_peft_model(
    model=model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    # "up_proj" is not targeted here, which is likely why the log below reports 0 patched MLP layers
    target_modules=["q_proj", "k_proj", "v_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth",
)

Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.9.4 patched 32 layers with 32 QKV layers, 32 O layers and 0 MLP layers.
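
The number of trainable parameters follows directly from the LoRA rank and the shapes of the targeted projections. The sketch below is a minimal illustration: the layer dimensions are assumptions taken from the published Llama 3.1 8B configuration, and `lora_param_count` and `lora_scaling` are hypothetical helpers, not part of Unsloth or PEFT. Each adapted linear layer adds two matrices, `A` (r × in) and `B` (out × r).

```python
import math

# Assumed Llama 3.1 8B dimensions
HIDDEN = 4096         # hidden size
INTERMEDIATE = 14336  # MLP intermediate size
KV_DIM = 1024         # 8 key/value heads x 128 head dim
N_LAYERS = 32

# (in_features, out_features) of each projection
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(target_modules, r):
    """LoRA adds r * (in + out) parameters per adapted linear layer."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * N_LAYERS

def lora_scaling(lora_alpha, r, use_rslora):
    """Rank-stabilized LoRA scales by alpha / sqrt(r) instead of alpha / r."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

targets = ["q_proj", "k_proj", "v_proj", "down_proj", "o_proj", "gate_proj"]
print(lora_param_count(targets, r=16))                # 32505856
print(lora_param_count(targets + ["up_proj"], r=16))  # 41943040
print(lora_scaling(16, 16, use_rslora=True))          # 4.0
print(lora_scaling(16, 16, use_rslora=False))         # 1.0
```

The first figure matches the trainable parameter count that Unsloth reports when training starts. Adding `up_proj` to the targets would raise it to roughly 42M.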


## 2. Prepare data and tokenizer

![](https://i.imgur.com/cIGv8Cb.png)
![](https://i.imgur.com/FFxWTbK.png)
![](https://i.imgur.com/a3navcZ.png)

We prepare our instruction dataset with the right chat template and tokenizer.

In [None]:
tokenizer = get_chat_template(
    tokenizer,
    chat_template="chatml",
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)


Unsloth: Will map <|im_end|> to EOS = <|end_of_text|>.
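
To make the effect of the template and mapping concrete, here is a simplified pure-Python sketch of how a ShareGPT-style conversation (`from`/`value` keys, `human`/`gpt` roles) could be rendered as ChatML. `render_chatml` is a hypothetical stand-in, not the Jinja template that Unsloth installs, which may treat system prompts and special tokens differently.

```python
# Mirrors the role mapping passed to get_chat_template above
ROLE_MAP = {"human": "user", "gpt": "assistant"}

def render_chatml(conversation, add_generation_prompt=False):
    """Render a list of {"from": ..., "value": ...} turns as ChatML text."""
    parts = []
    for turn in conversation:
        role = ROLE_MAP.get(turn["from"], turn["from"])
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

example = [{"from": "human", "value": "Is 9.11 greater than 9.9?"}]
rendered = render_chatml(example, add_generation_prompt=True)
print(rendered)
```

For training data, `add_generation_prompt` stays `False` because each conversation already contains the assistant's reply. For inference it is `True`, so that the prompt ends with an open assistant turn.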


In [None]:
dataset = load_dataset("mlabonne/FineTome-100k", split="train[:200]")

In [None]:
def apply_template(examples):
  messages = examples["conversations"]
  text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in messages]
  return {"text": text}

dataset = dataset.map(apply_template, batched=True)

## 3. Training

![](https://i.imgur.com/D8sDuhK.png)
![](https://i.imgur.com/YeGVUup.png)

We specify the hyperparameters and train our model using Unsloth.

In [None]:
dataset["text"][0]

'<|im_start|>user\nExplain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.<|im_end|>\n<|im_start|>assistant\nBoolean operato

In [None]:
from trl import SFTConfig, SFTTrainer

# Use a shorter sequence length than at load time (2048) to reduce memory use
max_seq_length = 1024

sft_config = SFTConfig(
    max_seq_length=max_seq_length,
    packing=False,  # keep packing off while debugging
    dataset_text_field="text",
    # Optimizer and other training arguments
    learning_rate=3e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size = 1 x 16 = 16
    num_train_epochs=1,
    fp16=not is_bfloat16_supported(),  # the T4 used here does not support bfloat16
    bf16=is_bfloat16_supported(),
    optim="adamw_8bit",  # or "paged_adamw_32bit", depending on the bitsandbytes build
    output_dir="output",
    seed=0,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    processing_class=tokenizer,  # the tokenizer is passed as the processing class
    args=sft_config,
)

# Memory helpers
model.gradient_checkpointing_enable()
# Optionally: torch.cuda.empty_cache()


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 200 | Num Epochs = 1 | Total steps = 13
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 16
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 16 x 1) = 16
 "-____-"     Trainable parameters = 32,505,856 of 8,062,767,104 (0.40% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1    | 1.2991        |
| 2    | 1.2257        |
| 3    | 1.0139        |
| 4    | 0.9643        |
| 5    | 0.8799        |
| 6    | 0.836         |
| 7    | 0.773         |
| 8    | 0.757         |
| 9    | 0.7334        |
| 10   | 0.7635        |


TrainOutput(global_step=13, training_loss=0.8933482995400062, metrics={'train_runtime': 395.168, 'train_samples_per_second': 0.506, 'train_steps_per_second': 0.033, 'total_flos': 5180928247726080.0, 'train_loss': 0.8933482995400062, 'epoch': 1.0})
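
The step count and throughput in the logs above can be reproduced from the hyperparameters. This is a small sketch with a hypothetical helper, `optimizer_steps`. It assumes the last, partially filled accumulation window still counts as one optimizer step, which is consistent with the 13 steps reported here.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, num_gpus=1, epochs=1):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

effective_batch, total_steps = optimizer_steps(
    num_examples=200, per_device_batch=1, grad_accum=16
)
runtime_s = 395.168  # 'train_runtime' from the TrainOutput above

print(effective_batch, total_steps)        # 16 13
print(round(200 / runtime_s, 3))           # samples per second: 0.506
print(round(total_steps / runtime_s, 3))   # steps per second: 0.033
```

With only 200 examples, each optimizer step covers 8% of the dataset, so the loss curve is short and noisy. A larger slice of the dataset would give more steps per epoch.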

## 4. Inference

We test the trained model with a toy example to check that there's no obvious error.

In [None]:
messages = [
    {"from": "human", "value": "Is 9.11 greater than 9.9?"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

In [None]:
text_streamer = TextStreamer(tokenizer)
_ = model.generate(inputs=inputs, streamer=text_streamer, max_new_tokens=128, use_cache=True)

<|im_start|>user
Is 9.11 greater than 9.9?<|im_end|>
<|im_start|>assistant
Yes, 9.11 is greater than 9.9. The number 9.11 is larger than 9.9 because it has a larger decimal part.<|im_end|>
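
As a sanity check on the generated answer, the comparison can be verified with exact decimal arithmetic. This is a minimal sketch using Python's standard `decimal` module. It indicates that 9.9 is the larger number, so the reply above is well formatted but numerically incorrect. A toy prompt like this one is convenient because the answer is easy to verify by hand.

```python
from decimal import Decimal

a = Decimal("9.11")
b = Decimal("9.9")

# Exact decimal comparison, avoiding any floating-point rounding concerns
is_greater = a > b
print(f"Is {a} greater than {b}? {is_greater}")
print(f"Difference (9.9 - 9.11): {b - a}")
```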


## 5. Save trained model

We save and export the trained model in safetensors and GGUF formats.

In [None]:
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")

In [None]:
model.push_to_hub_merged("mlabonne/LogicLlama-3.1-8B", tokenizer, save_method="merged_16bit")

In [None]:
model.push_to_hub_gguf("mlabonne/LogicLlama-3.1-8B-gguf", tokenizer, "q8_0")
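
Before uploading, it can help to estimate how large the exported files will be. The sketch below is an approximation under stated assumptions: the parameter count is the commonly cited size of Llama 3.1 8B, and the `q8_0` figure assumes the GGUF block layout of 32 int8 weights plus one 16-bit scale per block. It ignores file metadata and any tensors stored in a different type. `size_gb` is a hypothetical helper.

```python
N_PARAMS = 8_030_261_248  # assumed parameter count of the merged model

# q8_0: 32 weights x 8 bits + one 16-bit scale per block of 32 weights
Q8_0_BITS_PER_WEIGHT = (32 * 8 + 16) / 32  # 8.5 bits per weight

def size_gb(n_params: int, bits_per_weight: float) -> float:
    """Approximate file size in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

merged_16bit_gb = size_gb(N_PARAMS, 16)
q8_0_gb = size_gb(N_PARAMS, Q8_0_BITS_PER_WEIGHT)

print(f"merged_16bit: ~{merged_16bit_gb:.1f} GB")
print(f"q8_0 GGUF:    ~{q8_0_gb:.1f} GB")
```

Under these assumptions the `q8_0` export is a little over half the size of the 16-bit safetensors checkpoint.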

## 6. Next steps

![](https://i.imgur.com/dMLEDKH.png)
![](https://i.imgur.com/jaOowAJ.png)
![](https://i.imgur.com/DlTKPHj.png)
![](https://i.imgur.com/EMBelvN.png)
![](https://i.imgur.com/QyUp4tA.png)