# Llama-3.2-3B-Instruct Fine-Tuning Project

**Theory:**

**General:**

1) LLM fine-tuning is primarily about adapting the model's behaviour to a specific use case, not about adding general knowledge.
2) Start from a model that has already been trained on the domain or use case you are targeting.


---


**SFT: Supervised Fine-Tuning:**

1) A method of fine-tuning a pre-trained language model on labeled data.

2) Three types:


*   Full fine-tuning - retrains all the parameters of a pre-trained model on an instruction dataset.
*   LoRA - a parameter-efficient fine-tuning (PEFT) technique that freezes the original weights and introduces small adapters: low-rank matrices at each targeted layer. The trainable adapter parameters typically amount to less than 1% of the model's weights, so it is both memory- and time-efficient. It is non-destructive: the original weights are left unchanged.
*   QLoRA - LoRA applied on top of a 4-bit quantized base model, giving up to about 33% additional memory reduction compared with standard LoRA.
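The LoRA idea above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how Unsloth or PEFT implement it: the helper names (`matvec`, `lora_forward`), the 4x4 shapes, and the values are all made up for the example. It shows that the frozen weight `W` is never modified and that, with `B` initialised to zero, the adapted layer initially behaves exactly like the base layer.

```python
# Toy LoRA sketch in pure Python. All names, shapes and values here are
# illustrative assumptions, not taken from the real model.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x). W is frozen; only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

d, r, alpha = 4, 2, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen weights (identity for simplicity)
A = [[0.1 * (i + j + 1) for j in range(d)] for i in range(r)]       # r x d adapter matrix
B = [[0.0] * r for _ in range(d)]                                    # d x r adapter matrix, starts at zero
x = [1.0, 2.0, 3.0, 4.0]

# With B = 0 the adapter contributes nothing, so the output equals the base output.
y_init = lora_forward(W, A, B, x, alpha, r)

# Simulate a training update to B only; W is left untouched.
B[0][0] = 0.5
y_tuned = lora_forward(W, A, B, x, alpha, r)

print(y_init)
print(y_tuned)
```

Removing the adapters (or setting `B` back to zero) recovers the original model, which is what "non-destructive" means here.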



---






# Installing and loading necessary libraries

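1.   Unsloth speeds up model loading and fine-tuning (including QLoRA with 4-bit quantization); here it is used for supervised fine-tuning (SFT).
2.   TRL provides trainers for SFT (`SFTTrainer`, used below) as well as for reinforcement learning techniques such as PPO that align a model with human preferences.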



In [None]:
pip install unsloth transformers trl

Collecting unsloth
  Downloading unsloth-2025.5.6-py3-none-any.whl.metadata (46 kB)
Collecting trl
  Downloading trl-0.17.0-py3-none-any.whl.metadata (12 kB)
Collecting unsloth_zoo>=2025.5.7 (from unsloth)
  Downloading unsloth_zoo-2025.5.7-py3-none-any.whl.metadata (8.0 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.30-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes (from unsloth)
  Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.20-py3-none-any.whl.metadata (10 kB)
Collecting datasets>=3.4.1 (from unsloth)
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting trl
  Downloading trl-

In [None]:
import torch
from unsloth import FastLanguageModel  #support for QLoRA + 4-bit quantization for ultra-efficient fine-tuning
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

#Unsloth to load & configure the model, and SFTTrainer from TRL to fine-tune it with your dataset.


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


# Loading our model for fine-tuning

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",  #This particular model is optimized and hosted by Unsloth in Hugging Face.
    max_seq_length = 2048,  #Maximum sequence length (in tokens) used for training and inference
    load_in_4bit = True  #Load the model weights in 4-bit precision
)

==((====))==  Unsloth 2025.5.6: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]
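To get a feel for why `load_in_4bit = True` matters on a 15 GB T4, here is a rough back-of-the-envelope estimate of weight-only memory. The parameter count of 3.2 billion is an assumed approximation for a 3B-class model, and the helper `weight_memory_gb` is hypothetical. The estimate ignores quantization constants, activations, gradients, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Rough weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 3.2e9  # assumption: approximate parameter count of a 3B-class model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

The checkpoint downloaded in the log above is 2.35 GB, somewhat larger than the pure 4-bit estimate, most likely because not every tensor is stored in 4 bits.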

In [None]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072, padding_idx=128004)
    (layers): ModuleList(
      (0): LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((3072,)

In [None]:
tokenizer

PreTrainedTokenizerFast(name_or_path='unsloth/llama-3.2-3b-instruct-unsloth-bnb-4bit', vocab_size=128000, model_max_length=131072, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<|begin_of_text|>', 'eos_token': '<|eot_id|>', 'pad_token': '<|finetune_right_pad_id|>'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	128000: AddedToken("<|begin_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128001: AddedToken("<|end_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128002: AddedToken("<|reserved_special_token_0|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128003: AddedToken("<|reserved_special_token_1|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128004: AddedToken("<|finetune_right_pad_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)

# Applying LoRA

In [None]:
#Parameter-efficient fine-tuning: don't update the whole model; add a few small layers and train only those.
#LoRA (Low-Rank Adaptation) works this way:
  # a) We freeze the original model (no updates to its weights).
  # b) We add small adapter layers inside key parts of the model (attention and feed-forward projections).
  # c) We train only those small layers. They learn how to tweak the model's behaviour.
model = FastLanguageModel.get_peft_model(
    model,  #the base model to wrap with adapters
    r=16,  #rank of the low-rank adapter matrices; a larger rank means more trainable parameters
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]  #which layers receive LoRA adapters
)

Unsloth 2025.5.6 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
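As a sanity check on the "small adapters" claim, the number of trainable LoRA parameters can be estimated from the layer shapes printed by `model` earlier and the 28 layers reported in the log above. This is a sketch that assumes each adapted `Linear` layer gets one `r x in_features` matrix and one `out_features x r` matrix, with no bias terms; the total parameter count of 3.2 billion is an assumed approximation.

```python
# (in_features, out_features) for each targeted module, copied from the printed model.
MODULE_SHAPES = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
NUM_LAYERS = 28   # from the Unsloth log: "patched 28 layers"
RANK = 16         # the r value passed to get_peft_model
APPROX_TOTAL_PARAMS = 3.2e9  # assumption: rough size of the base model

def lora_params(in_features, out_features, rank):
    """Parameters in one adapter: A is rank x in_features, B is out_features x rank."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
fraction = total_trainable / APPROX_TOTAL_PARAMS

print(f"trainable LoRA parameters: {total_trainable:,}")
print(f"share of the base model:   {fraction:.2%}")
```

Under these assumptions the adapters come to roughly 24 million parameters, which is below 1% of the base model.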


In [None]:
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
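To make the role of the chat template concrete, the sketch below hand-builds the Llama 3 style layout for a conversation. `format_llama3_chat` is a hypothetical helper written for illustration; it is a simplified approximation and omits details of the real `llama-3.1` template, such as the default system block with the knowledge cutoff date that appears in the inference output later in this notebook.

```python
def format_llama3_chat(messages, add_generation_prompt=False):
    """Simplified approximation of the Llama 3 chat layout (illustrative helper)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header tells the model it is its turn to answer.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

example = format_llama3_chat(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
)
print(example)
```

In practice the tokenizer's own `apply_chat_template` should be used, so that training and inference share exactly the same format.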

In [None]:
dataset = load_dataset("mlabonne/FineTome-100k", split="train")

README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

In [None]:
dataset = standardize_sharegpt(dataset)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]
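Conceptually, standardizing a ShareGPT-style dataset means renaming its `from`/`value` keys to the `role`/`content` keys that chat templates expect. The sketch below illustrates that idea only; it is not Unsloth's implementation, and `ROLE_MAP` and `to_role_content` are hypothetical names.

```python
# Assumed mapping from ShareGPT speaker names to chat-template roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(conversation):
    """Convert one ShareGPT-style conversation to role/content messages."""
    return [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in conversation
    ]

sharegpt_example = [
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "4"},
]
standardized = to_role_content(sharegpt_example)
print(standardized)
```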

In [None]:
dataset

Dataset({
    features: ['conversations', 'source', 'score'],
    num_rows: 100000
})

In [None]:
dataset[0]

{'conversations': [{'content': 'Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.',
   'role': 'user'},
  {'content': 

In [None]:
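#Render each conversation into a single training string using the chat template.
#tokenize=False returns the formatted text instead of token ids.
dataset = dataset.map(
    lambda examples: {
        "text": [
            tokenizer.apply_chat_template(convo, tokenize=False)
            for convo in examples["conversations"]
        ]
    },
    batched=True
)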

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]
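With `batched=True`, the function passed to `map` receives a dictionary of columns (each a list of values) rather than a single row, and must return lists of the same length. The sketch below mimics that contract in plain Python; `render` is a made-up stand-in for the chat-template call, and `add_text_column` is a hypothetical name.

```python
def render(convo):
    """Stand-in for the chat-template call; formats one conversation as plain text."""
    return "".join(f"[{turn['role']}] {turn['content']}\n" for turn in convo)

def add_text_column(examples):
    # `examples` maps each column name to a list with one entry per row in the batch.
    return {"text": [render(convo) for convo in examples["conversations"]]}

batch = {
    "conversations": [
        [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
        [{"role": "user", "content": "Bye"}],
    ]
}
result = add_text_column(batch)
print(result["text"])
```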

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        output_dir="outputs"
    ),
)


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]
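The training arguments above imply a short run. The sketch below works out the effective batch size, how much of the dataset is actually seen, and the learning rate at a few steps. It assumes a single GPU (as in the log) and a linear warmup followed by linear decay, which is the default scheduler type in `TrainingArguments`; `linear_schedule_lr` is a hypothetical helper that mirrors that schedule.

```python
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
NUM_GPUS = 1            # from the Unsloth banner: "Num GPUs = 1"
MAX_STEPS = 60
WARMUP_STEPS = 5
PEAK_LR = 2e-4
DATASET_ROWS = 100_000  # from the printed dataset

# Gradients from several small batches are accumulated before each optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS
examples_seen = effective_batch * MAX_STEPS
fraction_seen = examples_seen / DATASET_ROWS

def linear_schedule_lr(step):
    """Linear warmup to PEAK_LR, then linear decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    return PEAK_LR * max(0.0, (MAX_STEPS - step) / max(1, MAX_STEPS - WARMUP_STEPS))

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} ({fraction_seen:.2%} of the dataset)")
for step in (0, 5, 30, 60):
    print(step, linear_schedule_lr(step))
```

With `max_steps=60` the model sees well under 1% of FineTome-100k, so this configuration is best read as a quick demonstration run rather than a full fine-tune.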

In [None]:
trainer.train()

In [None]:
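model.save_pretrained("finetuned_model")  #Saves the LoRA adapter weights, not a full copy of the base model
tokenizer.save_pretrained("finetuned_model")  #Keep the tokenizer (and its chat template) next to the adapters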

In [None]:
inference_model, inference_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "finetuned_model",
    max_seq_length = 2048,
    load_in_4bit = True
)

==((====))==  Unsloth 2025.5.6: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
text_prompts = [
    "What are the key principles of investment?"
]

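for prompt in text_prompts:
    #Format the prompt with the chat template; add_generation_prompt appends the
    #assistant header so the model starts its reply instead of continuing the user turn.
    formatted_prompt = inference_tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize = False,
        add_generation_prompt = True
    )

    model_inputs = inference_tokenizer(formatted_prompt, return_tensors="pt").to("cuda")
    generated_ids = inference_model.generate(
        **model_inputs,
        max_new_tokens = 512,
        temperature = 0.7,  #values below 1 make sampling less random
        do_sample = True,
        pad_token_id = inference_tokenizer.pad_token_id
    )
    #The decoded text contains the prompt followed by the generated reply.
    response = inference_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)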


system

Cutting Knowledge Date: December 2023
Today Date: 21 May 2025

user

What are the key principles of investment?assistant

The key principles of investment can be summarized as follows:

1. **Diversification**: Spread investments across different asset classes, sectors, and geographic regions to minimize risk and maximize returns.
2. **Risk Management**: Assess and manage risk by understanding the volatility of investments, diversifying, and hedging against potential losses.
3. **Dollar-Cost Averaging**: Invest a fixed amount of money at regular intervals, regardless of market conditions, to reduce the impact of market volatility.
4. **Long-term Focus**: Invest for the long term, as short-term market fluctuations are typically less significant than long-term trends.
5. **Low-Cost Investing**: Minimize investment costs by choosing low-cost index funds, ETFs, and avoiding unnecessary fees.
6. **Tax Efficiency**: Consider tax implications of investments and aim to minimize tax liab