# Customizing Your LLM to Your Needs: Fine-Tuning Llama 3.1 Using QLoRA

As a practical example of adapting an LLM to our own needs, we will simulate a situation where we have access to an LLM and want it to respond better to private documentation, specifically by answering domain-specific questions.

As mentioned in Chapter 10 of the book, one way to customize an LLM to work with your own data is through fine-tuning, also known as post-training. In this notebook, as a practical example, we will **fine-tune Meta’s Llama 3.1** to answer **finance-related questions**!

For more details about fine-tuning and efficient inference, we recommend reviewing Chapter 10. If you need implementation details, we invite you to revisit the previous tutorials on Hugging Face, dataset handling, and model fine-tuning in the book repository before diving into this notebook.


## Install packages

We will use the [Unsloth](https://github.com/unslothai/unsloth) library, which extends Hugging Face's transformers to make fine-tuning faster and less resource-intensive.

Let's install Unsloth, Flash Attention (via Xformers), and other necessary packages:

In [1]:
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes
!pip install datasets==3.5.0

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-57u9llmz/unsloth_1a96380414014a2b8800385f7823d50d
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-57u9llmz/unsloth_1a96380414014a2b8800385f7823d50d
  Resolved https://github.com/unslothai/unsloth.git to commit 5177df5f784cd3e1b0aa3db8d6eb6945b49579ae
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting unsloth_zoo>=2025.4.1 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2025.4.1-py3-none-any.whl.metadata (8.0 kB)
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.git

Collecting xformers
  Downloading xformers-0.0.30-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.30-cp311-cp311-manylinux_2_28_x86_64.whl (31.5 MB)
   31.5/31.5 MB 14.4 MB/s eta 0:00:00
Installing collected packages: xformers
Successfully installed xformers-0.0.30
Collecting datasets==3.5.0
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets==3.5.0)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
   491.2/491.2 kB 9.9 MB/s eta 0:00:00
Downloading fsspec-2024.12.0-py3-none-any.whl (183 kB)
   183.9/183.9 kB 20.0 MB/s eta 0:00:00

## Model: Llama 3.1 8B


For this tutorial, we will use [Llama 3.1](https://ai.meta.com/blog/meta-llama-3-1/) with 8 billion parameters, commonly referred to as Llama 3.1 8B. The choice of model size is mainly based on the free computational resources available on Google Colab for experimentation.

However, loading the model in half precision (float16) would require about 16GB of VRAM for the weights alone, which matches the maximum available memory in Colab at the time of writing. Fine-tuning with this setup would likely result in out-of-memory errors, because training also needs memory for activations, gradients, and optimizer state.

To solve this, we will use a quantized 4-bit precision version of the Llama 3.1 8B model, published by [unsloth](https://huggingface.co/unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit). Quantization reduces the model size and memory requirements significantly, allowing us to fine-tune the LLM with fewer resources.
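The memory figures above follow from simple arithmetic. The sketch below is a back-of-the-envelope estimate for the weights only; the helper name `weight_memory_gb` and the nominal count of 8 billion parameters are assumptions made for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1e9


N_PARAMS = 8e9  # nominal parameter count of Llama 3.1 8B (assumption)

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half precision: 2 bytes per parameter
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantization: half a byte per parameter

print(f"float16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

In practice the 4-bit checkpoint is somewhat larger than this estimate (the download log in this notebook reports 5.70G), plausibly because some layers stay in higher precision and the quantization scheme stores extra constants.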

More details about quantization can be found in Chapter 10 of the book.


In [2]:
# Import PyTorch and select the first CUDA GPU
import torch
device = torch.device("cuda:0")

In [3]:
from unsloth import FastLanguageModel

MODEL_CONFIG = {
    "model_name": "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "max_seq_length": 2048,
}

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_CONFIG["model_name"],
    max_seq_length=MODEL_CONFIG["max_seq_length"],
    load_in_4bit=True,  # Load the pre-quantized 4-bit weights
    dtype=None,         # None lets Unsloth pick float16 or bfloat16 for the GPU
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch SmolVLMForConditionalGeneration forward function.


    PyTorch 2.7.0+cu126 with CUDA 1206 (you have 2.6.0+cu124)
    Python  3.11.12 (you have 3.11.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.4.1: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

## Data

We will use the [finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca) dataset, a Question-Answer (QA) set combining:

- Stanford’s Alpaca dataset
- FiQA (Financial Question Answering) dataset
- 1,300+ custom QA pairs generated via GPT-3.5

This dataset contains 68,912 examples and follows the typical Alpaca format `(instruction, input, output, text)`.

In [4]:
# Import libraries
from datasets import load_dataset
import pprint

In [5]:
# Load the "finance-alpaca" dataset from Hugging Face.
data = load_dataset("gbharti/finance-alpaca", split='train')

README.md:   0%|          | 0.00/831 [00:00<?, ?B/s]

Cleaned_date.json:   0%|          | 0.00/42.9M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/68912 [00:00<?, ? examples/s]

In [6]:
data

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 68912
})

In [7]:
# Display the first data sample
data[0]

{'instruction': 'For a car, what scams can be plotted with 0% financing vs rebate?',
 'input': '',
 'output': "The car deal makes money 3 ways. If you pay in one lump payment. If the payment is greater than what they paid for the car, plus their expenses, they make a profit. They loan you the money. You make payments over months or years, if the total amount you pay is greater than what they paid for the car, plus their expenses, plus their finance expenses they make money. Of course the money takes years to come in, or they sell your loan to another business to get the money faster but in a smaller amount. You trade in a car and they sell it at a profit. Of course that new transaction could be a lump sum or a loan on the used car... They or course make money if you bring the car back for maintenance, or you buy lots of expensive dealer options. Some dealers wave two deals in front of you: get a 0% interest loan. These tend to be shorter 12 months vs 36,48,60 or even 72 months. The sho

In this sample, the `input` and `text` fields are empty, the `instruction` field contains the question, and the `output` field contains the answer.
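In the Alpaca format, `input` is an optional field that can carry extra context for the instruction. This notebook only uses `instruction`, but if some rows do carry a non-empty `input`, one option is to append it to the question. The helper below is a hypothetical sketch (the name `build_user_message` and the blank-line separator are assumptions, not part of the dataset or of any library):

```python
from typing import Dict


def build_user_message(example: Dict[str, str]) -> str:
    """Combine `instruction` with the optional `input` context, if present."""
    instruction = example["instruction"].strip()
    context = example.get("input", "").strip()
    if context:
        # Separate the question from its context with a blank line (assumption).
        return f"{instruction}\n\n{context}"
    return instruction


sample = {
    "instruction": "For a car, what scams can be plotted with 0% financing vs rebate?",
    "input": "",
}
user_message = build_user_message(sample)
print(user_message)
```

For rows with an empty `input`, such as the sample above, the helper returns the instruction unchanged.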

In [8]:
# Pretty-print the first 5 samples from the dataset
for i in range(5):
    pprint.pprint(data[i])

{'input': '',
 'instruction': 'For a car, what scams can be plotted with 0% financing vs '
                'rebate?',
 'output': 'The car deal makes money 3 ways. If you pay in one lump payment. '
           'If the payment is greater than what they paid for the car, plus '
           'their expenses, they make a profit. They loan you the money. You '
           'make payments over months or years, if the total amount you pay is '
           'greater than what they paid for the car, plus their expenses, plus '
           'their finance expenses they make money. Of course the money takes '
           'years to come in, or they sell your loan to another business to '
           'get the money faster but in a smaller amount. You trade in a car '
           'and they sell it at a profit. Of course that new transaction could '
           'be a lump sum or a loan on the used car... They or course make '
           'money if you bring the car back for maintenance, or you buy lots '
          

## Data Processing: Chat template

Since we want to create a chatbot that can answer finance questions, we must organize the input (questions) and output (answers) into a specific structure that the model was trained to recognize.

For example, a simple conversation looks like this:

```
User: "Who is the greatest macro trader of all time?"
Assistant: "George Soros"
```

This Q&A sample was answered by ChatGPT-4o. The **Instruct** version of Llama 3.1 was fine-tuned on instruction datasets with the same format as ours. During that training, Meta used a specific way to organize conversations, which we will reuse in this notebook. You can read more about the prompt format at [Model Cards & Prompt formats: Llama 3.1](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/).

We will use the function `get_chat_template` to create a tokenizer that will map conversations to the right format.

We will first create a conversation, which is simply a list of dictionaries. Each dictionary will have two keys:

1. `role`: indicates who is speaking (user, assistant, system, etc.)
2. `content`: the actual text of the message

Then we use the `.apply_chat_template` method to convert the conversation into the right format.

In [15]:
test_conversation = [
    {"role": "system", "content": "You are an assistant!"},
    {"role": "user", "content": "Who is the greatest macro trader of all time?"},
    {"role": "assistant", "content": "George Soros"}
]

In [16]:
formatted_test_conversation = tokenizer.apply_chat_template(
    conversation=test_conversation,
    tokenize=False,
    add_generation_prompt=False)

In [17]:
print(formatted_test_conversation)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are an assistant!<|eot_id|><|start_header_id|>user<|end_header_id|>

Who is the greatest macro trader of all time?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

George Soros<|eot_id|>


During inference, the user provides a question, and the LLM needs to generate the answer. To signal that it’s the model’s turn to respond, we set `add_generation_prompt=True`.

In [18]:
generation_conversation = [
    {"role": "system", "content": "You are an assistant!"},
    {"role": "user", "content": "Who is the greatest macro trader of all time?"},
]

In [19]:
formatted_gen_conversation = tokenizer.apply_chat_template(
    conversation=generation_conversation,
    tokenize=False,
    add_generation_prompt=True)

In [20]:
print(formatted_gen_conversation)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are an assistant!<|eot_id|><|start_header_id|>user<|end_header_id|>

Who is the greatest macro trader of all time?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
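The two printed strings above suggest how the template assembles a prompt. The following pure-Python sketch reproduces that layout for plain system/user/assistant messages. It is an illustration only, not the tokenizer's actual template: the function name `render_llama31` is hypothetical, the date preamble is copied from the printed output rather than computed, and cases such as tool calls or a missing system message are ignored.

```python
from typing import Dict, List

# Preamble copied from the printed output; the real template inserts it itself.
SYSTEM_PREAMBLE = "Cutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n"


def render_llama31(conversation: List[Dict[str, str]],
                   add_generation_prompt: bool = False) -> str:
    """Render messages in the Llama 3.1 layout shown in this notebook."""
    parts = ["<|begin_of_text|>"]
    for message in conversation:
        content = message["content"]
        if message["role"] == "system":
            content = SYSTEM_PREAMBLE + content
        # Each turn is a role header, a blank line, the text, and <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{content}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header signals that it is the model's turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_llama31(
    [
        {"role": "system", "content": "You are an assistant!"},
        {"role": "user", "content": "Who is the greatest macro trader of all time?"},
    ],
    add_generation_prompt=True,
)
print(prompt)
```

The only difference between the training and inference layouts is the trailing assistant header: with `add_generation_prompt=True` the prompt ends on an open assistant turn that the model is expected to complete.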




### Processing the dataset

Next, we will format the entire dataset using the `.apply_chat_template` method.
To do this, we define a helper function called `create_text_prompt`, which formats each example, and then apply it across the dataset using the `.map` method.

In [21]:
from typing import Dict, List

SYSTEMT_PROMPT = "You are a helpful AI to answer questions about Finance"

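def create_text_prompt(example: Dict[str, str]) -> Dict[str, str]:
    """Render one QA pair with the chat template and store it in `text`."""
    conversation = [
        {"role": "system", "content": SYSTEMT_PROMPT},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]

    # tokenize=False returns the formatted string; no generation prompt is
    # needed because the assistant answer is already part of the conversation.
    text = tokenizer.apply_chat_template(
        conversation=conversation,
        tokenize=False,
        add_generation_prompt=False,
    )
    example["text"] = text
    return example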



In [22]:
data = data.filter(lambda x: len(x['instruction'].strip()) > 0 and len(x['output'].strip()) > 0)

Filter:   0%|          | 0/68912 [00:00<?, ? examples/s]

In [23]:
len(data)

68911

In [24]:
processed_data = data.map(create_text_prompt, batched=False)

Map:   0%|          | 0/68911 [00:00<?, ? examples/s]

In [25]:
for i in range(5):
    item = processed_data[i]
    print(item['text'], '\n\n\n')

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful AI to answer questions about Finance<|eot_id|><|start_header_id|>user<|end_header_id|>

For a car, what scams can be plotted with 0% financing vs rebate?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The car deal makes money 3 ways. If you pay in one lump payment. If the payment is greater than what they paid for the car, plus their expenses, they make a profit. They loan you the money. You make payments over months or years, if the total amount you pay is greater than what they paid for the car, plus their expenses, plus their finance expenses they make money. Of course the money takes years to come in, or they sell your loan to another business to get the money faster but in a smaller amount. You trade in a car and they sell it at a profit. Of course that new transaction could be a lump sum or a loan on the used car... They or course 

## Parameter-Efficient Fine-Tuning (PEFT)

Our model has **8 billion parameters**. Fine-tuning all of them would be extremely slow and memory-intensive, and it would carry a risk of catastrophic forgetting.

To solve this, we use **LoRA** (Low-Rank Adaptation), a technique that allows us to fine-tune only a small number of additional parameters on top of the original model parameters.

LoRA works by injecting small trainable "adapters" into certain parts of the model (such as the attention layers). This makes fine-tuning much faster, requires far less memory, and lets us achieve high-quality results without needing to update billions of parameters. To learn more about Parameter Efficient Fine-Tuning (PEFT), refer to Chapter 10 of the book.
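To make the adapter idea concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer, computing `W x + (alpha / r) * B (A x)` with the base weight `W` kept frozen. The tiny matrices and the helper names (`matvec`, `lora_forward`) are made up for illustration; real implementations operate on tensors and handle dropout, dtypes, and quantized base weights.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector,
                 alpha: float, r: int) -> Vector:
    """Frozen base output plus the scaled low-rank update B(Ax)."""
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]  # 3x3 frozen weight
A = [[1.0, 0.0, -1.0]]                                    # r x d_in, with r = 1
B_zero = [[0.0], [0.0], [0.0]]                            # d_out x r, zero at init
x = [1.0, 2.0, 3.0]

at_init = lora_forward(W, A, B_zero, x, alpha=2.0, r=1)
after_update = lora_forward(W, A, [[0.5], [0.0], [0.0]], x, alpha=2.0, r=1)
print(at_init, after_update)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the base layer; training only changes `A` and `B`. With `r = 16` and `lora_alpha = 16`, as configured in this notebook, the scale `alpha / r` equals 1.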

In the next step, we will set up LoRA for our Llama 3.1 model.

In [26]:
peft_model = FastLanguageModel.get_peft_model(
    model,
    # Rank of the LoRA matrices — controls the number of trainable parameters (suggested: 8, 16, 32, 64, or 128)
    r = 16,
    # Layers where LoRA adapters will be injected (mostly attention and MLP layers)
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention projections
        "gate_proj", "up_proj", "down_proj",     # Feed-forward (MLP) projections
    ],
    # Scaling factor for the LoRA updates — balances the impact of the new weights
    lora_alpha = 16,
    # Dropout rate for LoRA layers — 0 means no dropout (good for small/medium datasets)
    lora_dropout = 0,
    # Whether to fine-tune bias terms — "none" skips bias for faster, more memory-efficient training
    bias = "none",
    # Enables memory-saving checkpointing ("unsloth" version is highly optimized)
    use_gradient_checkpointing = "unsloth",
    # Random seed to make training reproducible
    random_state = 3407,
    # Whether to use Rank-Stabilized LoRA (RSLoRA) — standard LoRA is used here
    use_rslora = False,
    # Configuration for LoftQ (LoRA + Quantization) — not used here
    loftq_config = None,
)


Unsloth 2025.4.1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
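As a sanity check, the number of trainable LoRA parameters can be computed by hand: each adapted linear layer adds `r * (in_features + out_features)` parameters. The layer shapes below are taken from the public Llama 3.1 8B configuration (hidden size 4096, key/value projection size 1024, MLP size 14336) and are assumptions as far as this notebook is concerned, since they are not all printed here.

```python
R = 16           # LoRA rank used above
N_LAYERS = 32    # decoder layers patched by Unsloth
HIDDEN = 4096    # hidden size (assumption from the public model config)
KV_DIM = 1024    # key/value projection size with grouped-query attention (assumption)
MLP_DIM = 14336  # feed-forward intermediate size (assumption)

# (in_features, out_features) for each target module in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in A (r x in_features) plus B (out_features x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total_trainable = per_layer * N_LAYERS
print(f"{total_trainable:,} trainable LoRA parameters")
```

The result, 41,943,040, agrees with the `Trainable parameters` figure that Unsloth prints in the training banner later in this notebook, about 0.52% of the model.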


## Training

Now that our model is prepared, we are ready to start the fine-tuning process!  

The training procedure is very similar to standard supervised fine-tuning of any Transformer model.

We will first define our training hyperparameters using `TrainingArguments`, and then use the `SFTTrainer` from the `trl` library to handle the training loop for us.

In [33]:
train_test_split = processed_data.train_test_split(test_size=0.2, seed=3407)

In [38]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported


training_args = TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=5e-6,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="output",
        report_to="none",
        eval_steps=10,
        save_strategy = "steps",
        save_steps=10,
    )


trainer = SFTTrainer(
    model=peft_model,
    tokenizer=tokenizer,
    train_dataset=train_test_split['train'],
    eval_dataset=train_test_split['test'],
    dataset_text_field="text",
    max_seq_length=MODEL_CONFIG['max_seq_length'],
    dataset_num_proc=2,
    packing=False,  # Setting packing=True can make training 5x faster for short sequences.
    args=training_args,
)

In [39]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 55,128 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Step,Training Loss
1,0.7665
2,0.6981
3,0.6193
4,0.7544
5,0.4665
6,0.6053
7,0.4028
8,0.6485
9,0.6389
10,0.4044


KeyboardInterrupt: 
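The numbers in the training banner can be checked with a few lines of arithmetic. The sketch below uses only values that appear in the configuration and log above; the variable names are illustrative.

```python
per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
n_gpus = 1             # "Num GPUs used = 1" in the log
max_steps = 60         # max_steps
n_train = 55_128       # "Num examples" in the log

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum * n_gpus
# Total examples the full 60-step run would see.
examples_seen = effective_batch * max_steps
fraction = examples_seen / n_train

print(f"effective batch = {effective_batch}, "
      f"examples seen = {examples_seen} ({fraction:.2%} of the training split)")
```

A full 60-step run therefore covers well under 1% of the training split, so it works as a quick demonstration rather than a full epoch. The run shown above was interrupted after 10 logged steps, which corresponds to roughly 80 examples.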

## Text Generation

Now that our model has been trained, it's time to generate some text!
To generate a response, we need to:

1. Apply the chat template to the user's query.
2. Tokenize the formatted text into token IDs.
3. Generate a response based on the input.
4. Detokenize the generated response back into text.

In [40]:
def create_conversation(content: str) -> List[Dict[str, str]]:
    return [
        {"role": "system", "content": SYSTEMT_PROMPT},
        {"role": "user", "content": content},
    ]

def create_completion_inputs(content: str) -> torch.Tensor:
    conversation = create_conversation(content)
    inputs = tokenizer.apply_chat_template(
        conversation,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    )
    return inputs

In [66]:
FastLanguageModel.for_inference(peft_model) # Enable native 2x faster inference

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor

In [67]:
input_ids = create_completion_inputs("Should I buy NVIDIA stock?").to(device)

The `input_ids` tensor has shape `(n_sentences, n_tokens)`:

In [68]:
input_ids.shape

torch.Size([1, 51])

In [69]:
input_ids

tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,   1627,  10263,    220,   2366,     19,    271,   2675,    527,
            264,  11190,  15592,    311,   4320,   4860,    922,  23261, 128009,
         128006,    882, 128007,    271,  15346,    358,   3780,  34661,   5708,
             30, 128009, 128006,  78191, 128007,    271]], device='cuda:0')

In [70]:
tokenizer.batch_decode(input_ids)

['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful AI to answer questions about Finance<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nShould I buy NVIDIA stock?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n']

We generate the response using the `.generate` method, which continues generating tokens until either the end-of-sequence (EOS) token is reached or the maximum number of new tokens (`max_new_tokens`) has been generated.
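The stopping rule can be illustrated with a toy loop. Here `next_token` is a hypothetical stand-in for the model (it simply replays a scripted list of token IDs), and `greedy_generate` is an illustrative helper, not part of any library; the real `.generate` method also handles sampling strategies, batching, and the key-value cache.

```python
from typing import Callable, Iterable, List

EOS_TOKEN_ID = 128009  # id of <|eot_id|> in the tokenized prompt shown above


def make_scripted_model(tokens: Iterable[int]) -> Callable[[List[int]], int]:
    """Return a fake model that replays `tokens` one at a time."""
    script = iter(tokens)
    return lambda ids: next(script)


def greedy_generate(next_token: Callable[[List[int]], int],
                    prompt_ids: List[int],
                    max_new_tokens: int,
                    eos_token_id: int = EOS_TOKEN_ID) -> List[int]:
    """Append tokens until EOS is produced or the token budget is used up."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)
        ids.append(token)
        if token == eos_token_id:
            break
    return ids


prompt_ids = [128000, 882]
script = [2675, 527, 264, EOS_TOKEN_ID, 999]

stopped_by_eos = greedy_generate(make_scripted_model(script), prompt_ids, max_new_tokens=256)
stopped_by_budget = greedy_generate(make_scripted_model(script), prompt_ids, max_new_tokens=2)
print(stopped_by_eos, stopped_by_budget)
```

In the first call the loop ends as soon as the EOS id appears, so the trailing `999` is never produced; in the second call the budget of two new tokens ends generation first.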

In [71]:
output_ids = peft_model.generate(input_ids=input_ids, max_new_tokens=256, use_cache=True)

The `output_ids` tensor contains both the input tokens and the generated completion tokens, so we slice off the part that belongs to the input.

In [72]:
completions = output_ids[:, input_ids.shape[1]:].cpu()[0]
completions

tensor([  2675,    527,    264,  11190,  15592,    311,   4320,   4860,    922,
         23261, 128009])

Now we can decode the tokens into text using the tokenizer.

In [73]:
completion_text = tokenizer.decode(completions, skip_special_tokens=True).strip()

In [74]:
completion_text

'You are a helpful AI to answer questions about Finance'

Feel free to experiment with the model and ask it more complex finance-related questions!

In [None]:
# def create_conversation(content: str) -> List[Dict[str, str]]:
#     return [
#         {"role": "system", "content": SYSTEMT_PROMPT},
#         {"role": "user", "content": content},
#     ]

# # Build batch inputs using chat template
# def create_completion_inputs(contents: List[str]):
#     conversations = [create_conversation(content) for content in contents]
#     inputs = tokenizer.apply_chat_template(
#         conversations,
#         tokenize=True,
#         add_generation_prompt=True,
#         return_tensors="pt",
#         padding=True,
#     )
#     return inputs


# QUESTIONS = [
#     "Should I buy NVIDIA stock?",
#     "What is the best time to sell all my assets?"
# ]

# input_ids = create_completion_inputs(QUESTIONS).to("cuda")

# input_ids.shape

# input_ids

# output_ids = model.generate(input_ids=input_ids, max_new_tokens=256, use_cache=True)

# completions = output_ids[:, input_ids.shape[1]:]  # Remove prompt part

# completition_texts = tokenizer.batch_decode(completions, skip_special_tokens=True)

# for i in range(len(QUESTIONS)):
#   print('question:', QUESTIONS[i])
#   print('answer:', completition_texts[i])
#   print('\n\n\n')