# Fine-tuning LLMs, Methods of Finetuning, Model Compression (and more!)

### Author - Harshwardhan Fartale
Github - https://github.com/emharsha1812 \
Linkedin - https://www.linkedin.com/in/emharsha1812/ \
Website -  https://emharsha1812.github.io/

Model Used - SmolLM3
For more details about the model visit - [SmolLM3](https://huggingface.co/blog/smollm3)

*There are three common steps that lead to creating a high-quality LLM:*

---

### 1. Language modeling

The first step in creating a high-quality LLM is to pretrain it on one or more  
massive text datasets. During training, it attempts to predict the  
next token to accurately learn linguistic and semantic representations found in  
the text. This is called language modeling, and it is a self-supervised method.

This produces a base model, also commonly referred to as a pretrained or foundation model. Base models are a key artifact of the training process but are harder for the end user to deal with. This is why the next step is important.

---

### 2. Fine-tuning-1 (Supervised Fine-tuning)

LLMs are more useful if they respond well to instructions and try to follow  
them. When humans ask the model to write an article, they expect the model to  
generate the article and not list other instructions for example (which is what a base model might do).  
With supervised fine-tuning (SFT), we can adapt the base model to follow  
instructions. During this fine-tuning process, the parameters of the base model  
are updated to be more in line with our target task, like following instructions.  
Like pretraining, SFT uses next-token prediction, but the model now predicts the tokens of a response conditioned on a user input rather than simply continuing arbitrary text.

---

### 3. Fine-tuning-2 (Preference tuning)

The final step further improves the quality of the model and aligns it more  
closely with human preferences or AI safety expectations. This  
is called preference tuning. Preference tuning is a form of fine-tuning and, as  
the name implies, aligns the output of the model to our preferences, which are  
defined by the data that we give it. Like SFT, it can improve upon the original  
model, but it has the added benefit of distilling preferences about outputs during its training process.

In [1]:
## All the Imports
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

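device = "cuda" if torch.cuda.is_available() else "cpu"

# Only query GPU properties when a GPU is actually present
if device == "cuda":
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")
else:
    print("No GPU detected; running on CPU")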



GPU memory: 15.8GB


## Base Models vs Instruct Models

A base model is trained on raw text data to predict the next token, while an instruct model is fine-tuned specifically to follow instructions and engage in conversations.  
For example, `HuggingFaceTB/SmolLM3-3B-Base` is a base model, while `HuggingFaceTB/SmolLM3-3B` is its instruction-tuned variant.

> An instructed model (sometimes called an instruction-tuned model) is an LLM that has been fine-tuned or aligned to follow human instructions more accurately and safely.

In [9]:
# Load both base and instruct models for comparison
base_model_name = "HuggingFaceTB/SmolLM3-3B-Base"
instruct_model_name = "HuggingFaceTB/SmolLM3-3B"

# Load tokenizers
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
instruct_tokenizer = AutoTokenizer.from_pretrained(instruct_model_name)

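# Load models (float32 as written; a half-precision dtype such as torch.bfloat16
# would roughly halve memory use. Newer transformers releases rename `torch_dtype` to `dtype`.)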
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name, torch_dtype=torch.float32, device_map="auto"
)

instruct_model = AutoModelForCausalLM.from_pretrained(
    instruct_model_name, torch_dtype=torch.float32, device_map="auto"
)

`torch_dtype` is deprecated! Use `dtype` instead!



In [None]:
## Comparing Base vs Instruct

# Test the same prompt on both models
test_prompt = "Explain quantum computing in simple terms."

# Prepare the prompt for base model (no chat template)
base_inputs = base_tokenizer(test_prompt, return_tensors="pt").to(device)

# Prepare the prompt for instruct model (with chat template)
instruct_messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": test_prompt}
]
instruct_formatted = instruct_tokenizer.apply_chat_template(
    instruct_messages, tokenize=False, add_generation_prompt=True
)
instruct_inputs = instruct_tokenizer(instruct_formatted, return_tensors="pt").to(device)

# Generate responses
print("=== Model comparison ===\n")

print("BASE MODEL RESPONSE:")
with torch.no_grad():
    base_outputs = base_model.generate(
        **base_inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=base_tokenizer.eos_token_id,
    )
    base_response = base_tokenizer.decode(base_outputs[0], skip_special_tokens=True)
    print(base_response[len(test_prompt) :])  # Show only the generated part

print("\n" + "=" * 50)
print("Instruct model response:")
with torch.no_grad():
    instruct_outputs = instruct_model.generate(
        **instruct_inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=instruct_tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens. With skip_special_tokens=True the
    # <|im_start|> markers are stripped, so searching the decoded text for them
    # fails and the slice starts in the middle of the echoed prompt.
    prompt_length = instruct_inputs["input_ids"].shape[1]
    assistant_response = instruct_tokenizer.decode(
        instruct_outputs[0][prompt_length:], skip_special_tokens=True
    )
    print(assistant_response)


=== Model comparison ===

BASE MODEL RESPONSE:
 I'm a layman.
— The Question
Quantum computing is a form of computing that takes advantage of the properties of quantum mechanics to explore large numbers of possibilities in parallel.
To understand this, we have to start with classical computing. We have 1s and 0s, and these represent information, like a switch that's on and off. A computer can read these values and do something with them, like add them. One of the things you can add is a "memory," or storage. The computer can store 1s and 0s in memory. And the computer can read the value in memory. So, if the computer remembers the number 5, it can say "read memory and get the value 5

Instruct model response:
Imagine a traditional

In [None]:
# Test SmolLM3's reasoning capabilities
reasoning_prompts = [
    "What is 15 × 24? Show your work.",
]

thinking_prompts = [
    "/no_think",
    "/think"
]

print("=== TESTING REASONING CAPABILITIES ===\n")

for thinking_prompt in thinking_prompts:
    print(f"Thinking prompt: {thinking_prompt}")
    for i, prompt in enumerate(reasoning_prompts, 1):
        print(f"Problem {i}: {prompt}")

        messages = [
            {"role":"system", "content": thinking_prompt},
            {"role": "user", "content": prompt}
        ]
        formatted_prompt = instruct_tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = instruct_tokenizer(formatted_prompt, return_tensors="pt").to(device)

        with torch.no_grad():
            outputs = instruct_model.generate(
                **inputs,
                max_new_tokens=400,
                temperature=0.3,  # Lower temperature for more consistent reasoning
                do_sample=True,
                pad_token_id=instruct_tokenizer.eos_token_id,
            )
            # Decode only the newly generated tokens; the <|im_start|> markers are
            # stripped by skip_special_tokens=True, so they cannot be used to locate the reply
            prompt_length = inputs["input_ids"].shape[1]
            assistant_response = instruct_tokenizer.decode(
                outputs[0][prompt_length:], skip_special_tokens=True
            )
            print(f"Answer: {assistant_response}")

        print(f"Thinking Prompt:{thinking_prompt} DONE")


=== TESTING REASONING CAPABILITIES ===

Thinking prompt: /no_think
Problem 1: What is 15 × 24? Show your work.
Answer: To solve 15 × 24, we can use the standard multiplication algorithm. Here's how to do it step by step:

1. Write down the numbers:
   ```
   15
   ×24
   ```

2. Multiply 15 by 4 (the units digit of 24):
   ```
   15
   × 4
   -------
   60
   ```

3. Multiply 15 by 20 (the tens digit of 24, which is 2, but we need to multiply by 20, so we add a zero):
   ```
   15
   ×20
   -------
   300
   ```

4. Add the two results together:
   ```
   60
   +300
   -------
   360
   ```

So, 15 × 24 = 360.

Alternatively, we can also use the distributive property of multiplication to break it down:
15 × 24 = 15 × (20 + 4) = 15 × 20 + 15

## Chat Templates and their Role

> TL;DR: A chat template is a structured format for interactions between language models, users, and external tools.

### What are Chat Templates?

Chat templates are structured formatting systems that transform conversational data into tokenizable sequences for Large Language Models.

Chat templates help ensure that every message gets interpreted correctly by the model.

They provide essential structure for multi-turn conversations, system instruction handling, and consistent model behavior across different interaction patterns. They define how roles (system, user, assistant), message content, and conversation flow are encoded using special tokens and formatting conventions.

Many chat templates incorporate model-specific tokens for beginning-of-sequence, end-of-sequence, and conversation control. These tokens are learned during training and are critical for proper model behavior.

### What do these templates look like?

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hi! How can I help you today?<|im_end|>
```

When using open-source models, such as those hosted on Hugging Face, each message in a conversation is typically represented as a dictionary containing two main attributes:

**Role**: Specifies the role of the speaker, such as “user” or “assistant”.

**Content**: The actual text or content of the message.

For example:
```python
{"role": "user", "content": "Hello, how are you?"}
{"role": "assistant", "content": "I'm doing great. How can I help you today?"}
```


### Some famous Chat Templates

1. ChatML (introduced by OpenAI, which abandoned it soon after). It is the de facto format for Hugging Face models

2. Harmony (OpenAI's response format for its open-weight model series (GPT-OSS))

3. Other chat templates by Mistral and Llama

> Note - Anthropic and other leading developers still do not disclose their chat template formats.

In [None]:
## Different types of Conversations
conversations = {
    "simple_qa": [
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "What is machine learning?"},
    ],
    "with_system": [
        {
            "role": "system",
            "content": "You are a helpful AI assistant specialized in explaining technical concepts clearly. /no_think",
        },
        {"role": "user", "content": "What is machine learning?"},
    ],
    "multi_turn": [
        {"role": "system", "content": "You are a math tutor. /no_think"},
        {"role": "user", "content": "What is calculus?"},
        {
            "role": "assistant",
            "content": "Calculus is a branch of mathematics that deals with rates of change and accumulation of quantities.",
        },
        {"role": "user", "content": "Can you give me a simple example?"},
    ],
    "reasoning_task": [
        {"role": "system", "content": "/think"},
        {
            "role": "user",
            "content": "Solve step by step: If a train travels 120 miles in 2 hours, what is its average speed?",
        },
    ],
}

for conv_type, messages in conversations.items():
    print(f"--- {conv_type.upper()} ---")

    # Format without generation prompt (for completed conversations)
    formatted_complete = instruct_tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )

    # Format with generation prompt (for inference)
    formatted_prompt = instruct_tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    print("Complete conversation format:")
    print(formatted_complete)
    print("\nWith generation prompt:")
    print(formatted_prompt)
    print("\n" + "=" * 50 + "\n")


--- SIMPLE_QA ---
Complete conversation format:
<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 10 September 2025
Reasoning Mode: /no_think

## Custom Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face.

<|im_start|>user
What is machine learning?<|im_end|>


With generation prompt:
<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 10 September 2025
Reasoning Mode: /no_think

## Custom Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face.

<|im_start|>user
What is machine learning?<|im_end|>
<|im_start|>assistant
<think>

</think>



--- WITH_SYSTEM ---
Complete conversation format:
<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 10 September 2025
Reasoning Mode: /no_think

## Custom Instructions

You are a helpful AI assistant specialized in explaining technical concepts clearly.

<|im_start|>user
What is machine learning?<|im_end|>


With g

## Supervised Fine-tuning

SFT is the process of continuing to train a pre-trained model on task-specific datasets with labeled examples. Think of it as specialized education:

---

Pre-training teaches the model general language understanding (like learning to read).

Supervised fine-tuning teaches specific skills and behaviors (like learning to do a specific task).

The key insight behind SFT is that we’re not teaching the model new knowledge from scratch. Instead, we’re reshaping how existing knowledge is applied. The pre-trained model already understands language, grammar, and has absorbed vast amounts of factual information. SFT focuses this general capability toward specific application patterns, response styles, and task-specific requirements.

This approach is effective because it leverages the rich representations learned during pre-training while requiring significantly less computational resources than training from scratch. The model learns to recognize instruction patterns, maintain conversation context, follow safety guidelines, and generate responses in desired formats.

---

Supervised fine-tuning (SFT) is the phase that takes a model that has absorbed vast amounts of knowledge during pre-training and turns it into something that follows a user's instructions and is generally useful. A base model does not do this reliably on its own; SFT closes the gap by giving the model **examples of the behavior we want**. We collect a dataset of instruction–response pairs (e.g., prompts and their ideal answers), then fine-tune the pre-trained model on this dataset [10]. The result is a model that:

- Learns to follow instructions
- Produces outputs in the right format and tone
- Can serve as the foundation for preference optimization and RL

## Full Fine-tuning

The most common fine-tuning process is full fine-tuning. Like pretraining an LLM, this process involves updating all parameters of a model to be in line with your target. The main difference is that we now use a smaller but labeled dataset, whereas pretraining was done on a large dataset without any labels. You can use any labeled data for full fine-tuning, which also makes it a great technique for learning domain-specific representations. To make our LLM follow instructions, we will need question-response data.

## SFT-Dataset

At its core, SFT is just supervised learning in which the model is taught the “correct” output for a set of input queries. During training, the model is asked to predict the next token given a prefix, and the cross-entropy loss against the target tokens guides the update (exactly as in a multi-class classification problem, where the classes are the tokens of the vocabulary).

Therefore, the dataset is a collection of instruction-response pairs `(x, y)`, where:

- `x` is an input instruction or prompt
- `y` is the target output (human-written or high-quality model-generated)
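
The loss described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than how any particular trainer implements it: the three-token vocabulary, the logits, and the helper names `softmax` and `sft_loss` are made up, and the sketch assumes the loss is computed only on response tokens (a common choice, but not the only one).

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sft_loss(logits_per_position, target_ids, loss_mask):
    """Mean cross-entropy over the positions where loss_mask is 1."""
    losses = []
    for logits, target, keep in zip(logits_per_position, target_ids, loss_mask):
        if keep:
            losses.append(-math.log(softmax(logits)[target]))
    return sum(losses) / len(losses)


# Toy setup: vocabulary of 3 tokens, sequence of 3 positions.
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0]]
targets = [0, 1, 2]
loss_mask = [0, 1, 1]  # position 0 belongs to the prompt x, the rest to the response y

loss = sft_loss(logits, targets, loss_mask)
print(f"{loss:.4f}")
```

Each position is a classification over the vocabulary, which is why the same cross-entropy loss used for multi-class classification applies here.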


**JSON Format**
```json
{
  "prompt": [
    {"role": "system", "content": "You are a helpful, honest assistant."},
    {"role": "user", "content": "What is the capital city of the U.S.?"}
  ],
  "completion": [
    {"role": "assistant", "content": "The capital of the United States is Washington, D.C."}
  ]
}
```

#### How SFT Data Is Batched and Padded

Once we’ve collected instruction–response pairs for SFT, the next challenge is feeding them into the GPU efficiently. Language models work on fixed-shaped tensors: every example in a batch must be the same length. But real text varies a lot — one answer might be 12 tokens, another 240 tokens. To handle this, we pack data into batches and use padding (and a fixed max sequence length T); many large-scale training recipes also concatenate shorter examples and then split into fixed-length sequences to reduce padding.

**Batching** means grouping several examples together so they can be processed in parallel. For instance, a batch size of 16 means the model sees 16 prompts + responses at once. This improves GPU utilization and stabilizes gradients. But since sequences vary in length, we take the longest example in that batch and make every other sequence match it.

That’s where **padding** comes in. Padding tokens are special “empty” tokens (usually `PAD`) appended to shorter sequences so that all sequences in a batch share the same length. The model is told to **ignore these pads** via an **attention mask**, and the padded positions are excluded from the loss. Concretely:

- Example 1: `[The, cat, sat]` → length 3
- Example 2: `[Dogs, bark, loudly, at, night]` → length 5
- If batched together, we pad Example 1 to length 5: `[The, cat, sat, PAD, PAD]`

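In training, the **attention mask** for Example 1 is `[1, 1, 1, 0, 0]`, so the model does not attend to the pad positions. The loss is likewise computed only on real tokens, typically by giving the padded positions an ignore label (for example `-100`, the default `ignore_index` of PyTorch's cross-entropy loss). This keeps gradients correct while still making tensors rectangular.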
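
The padding example can be reproduced with a small helper. This is a sketch, not a replacement for a real data collator: the function name `pad_batch`, the token ids, and the pad id `0` are assumptions made for illustration.

```python
PAD_ID = 0            # assumed padding token id for this toy example
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy skips by default


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad lists of token ids to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Padded positions get an ignore label so they add nothing to the loss.
        labels.append(seq + [IGNORE_INDEX] * n_pad)
    return input_ids, attention_mask, labels


batch = [[11, 12, 13], [21, 22, 23, 24, 25]]  # lengths 3 and 5
input_ids, attention_mask, labels = pad_batch(batch)
print(input_ids[0])
print(attention_mask[0])
print(labels[0])
```

The result is rectangular (every row has length 5), while the mask and labels record which positions are real.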

In practice, batching and padding strategies can dramatically affect throughput.

- **Dynamic batching (bucketing):** Group examples of similar lengths together so less padding is needed.
- **Packed sequences:** Concatenate multiple shorter examples into one long sequence, separated by special tokens, to reduce wasted space.
- **Masking:** Ensure only “real” tokens contribute to gradients.
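
To get a feel for how much bucketing can help, the sketch below counts pad tokens with and without sorting by length. The sequence lengths and the helper name `padding_tokens` are made up, and sorting the whole dataset is only the simplest possible form of bucketing.

```python
def padding_tokens(lengths, batch_size):
    """Count pad tokens when consecutive examples are batched together."""
    wasted = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every sequence is padded up to the longest one in its batch.
        wasted += sum(max(batch) - length for length in batch)
    return wasted


lengths = [12, 240, 15, 230, 10, 250, 14, 245]  # made-up token counts

naive = padding_tokens(lengths, batch_size=2)
bucketed = padding_tokens(sorted(lengths), batch_size=2)
print(f"pad tokens without bucketing: {naive}")
print(f"pad tokens with bucketing:    {bucketed}")
```

In this toy case mixing short and long examples wastes hundreds of positions, while grouping similar lengths wastes only a handful. Real pipelines usually bucket within a shuffled window so that batches stay reasonably random.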

## LoRA: Low Rank Adaptation of Large Language Models

1. **Choose the dimension of the smaller matrices**:  
    Let `r` be the chosen value, and let the original weight matrix `W` have dimension `n × m`. Construct two matrices: `A` (dimension `n × r`) and `B` (dimension `r × m`). Their product is `W_AB`, which has the same dimension as `W`. `r` is the LoRA rank.

2. **Create the new weight matrix**:  
    Add `W_AB` to the original weight matrix `W` to create a new weight matrix `Wʹ`. Use `Wʹ` in place of `W` as part of the model. You can use a hyperparameter `α` to determine how much `W_AB` should contribute to the new matrix:  
    $$
    W' = W + \frac{\alpha}{r} W_{AB}
    $$

3. **Finetuning**:  
    During finetuning, update only the parameters in `A` and `B`. `W` is kept intact.
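
The three steps can be sketched with NumPy. The sizes, the value of `alpha`, and the random initialisation are illustrative assumptions; one factor is set to zero so that `W'` equals `W` before any training, which is a common initialisation choice rather than something the steps above require.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, alpha = 6, 4, 2, 4   # illustrative sizes and scaling, not recommendations

W = rng.normal(size=(n, m))   # original weight matrix, kept frozen
A = rng.normal(size=(n, r))   # trainable, n x r
B = np.zeros((r, m))          # trainable, r x m (zero so the update starts at zero)

W_AB = A @ B                          # step 1: product has the same shape as W
W_prime = W + (alpha / r) * W_AB      # step 2: scaled update added to W

print(W_AB.shape == W.shape)
print(np.allclose(W_prime, W))        # True before training because B is zero
print(A.size + B.size, "trainable values vs", W.size, "in W")
```

During fine-tuning (step 3) only `A` and `B` would receive gradient updates; `W` is never modified.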


### Detailed Mathematical Breakdown

To understand the innovation of LoRA, we first need to look at the standard fine-tuning process. A neural network, at its core, is composed of layers, many of which perform matrix multiplication using weight matrices.

Let's imagine a single weight matrix in a pre-trained model, which we'll call $W_0$. This matrix might have thousands of rows and columns.

#### Pre-trained Weights ($W_0$)
When we fine-tune this model on a new task, we update these weights based on the new data. The process learns a "delta" or change matrix, $\Delta W$, which is added to the original weights.

The new, fine-tuned weight matrix, $W_{ft}$, is:

$$
W_{ft} = W_0 + \Delta W
$$


The crucial point here is that the change matrix $ΔW$ has the exact same dimensions as the original matrix $W_0$. If $W_0$ has 100 million parameters, then we are training a $ΔW$ that also has 100 million parameters. To save the fine-tuned model, we must save the entire $W_{ft}$ matrix, which is just as large as the original.


### A Simple Example

Imagine our pre-trained weight matrix is a simple 2×3 matrix:

$$
W_0 =
\begin{bmatrix}
0.8 & 0.1 & 0.3 \\
0.2 & 0.7 & 0.5
\end{bmatrix}
$$

After fine-tuning, we might learn the following update matrix:

$$
\Delta W =
\begin{bmatrix}
0.1 & -0.2 & 0.05 \\
-0.05 & 0.15 & 0.1
\end{bmatrix}
$$

The new, fully fine-tuned weight matrix would be:

$$
W_{ft} = W_0 + \Delta W =
\begin{bmatrix}
0.9 & -0.1 & 0.35 \\
0.15 & 0.85 & 0.6
\end{bmatrix}
$$

To make this change, we had to train and store 6 new values for $\Delta W$. For a model like GPT-3, this means training and storing 175 billion new values for every single task.

### Mathematical Equation for Full-finetuning

The LoRA paper provides the equation below to describe the standard process of full fine-tuning:

$$
\max_{\Phi} \sum_{(x, y) \in Z} \sum_{t=1}^{|y|} \log \left( P_{\Phi} \left( y_t \mid x, y_{<t} \right) \right)
$$

What it means: This formula states that we want to find the best possible set of model parameters, denoted by $\Phi$, that maximizes the probability of generating the correct output sequences ($y$) given the input sequences ($x$) and all the preceding correct tokens $y_{<t}$. We are optimizing over the entire set of parameters in the model.

Let's dissect each component:

- **$\max_{\Phi}$**: This means our goal is to maximize the following expression by changing the model's parameters, denoted by the set $\Phi$. In full fine-tuning, $\Phi$ represents every single weight and bias in the entire model. For GPT-3, this is a set of ~175 billion parameters.

- **$\sum_{(x, y) \in Z}$**: This tells us to sum the results over our entire training dataset, which is a set $Z$ of context-target pairs $(x, y)$. For example, in a summarization task, $x$ would be a long article and $y$ would be its short summary.

- **$\sum_{t=1}^{|y|}$**: This reflects the autoregressive nature of language models. For each target sequence $y$, we sum over every single token from the beginning ($t=1$) to the end ($|y|$). The model tries to predict each token correctly, one by one.

- **$\log(\cdot)$**: We use the logarithm of the probability. This is a standard technique that makes the math more stable and turns a long product of probabilities into a more manageable sum (the "log-likelihood"). Maximizing the log-probability is the same as maximizing the probability itself.

- **$P_{\Phi} \left( y_t \mid x, y_{<t} \right)$**: It represents the probability assigned by the model $P$ (with its current parameters $\Phi$) to the correct next token $y_t$, given the input context $x$ and all the preceding correct tokens $y_{<t}$.

In simple terms, this equation says: Adjust all 175 billion parameters ($\Phi$) to make the model as good as possible at predicting the next correct word in the sequence for all the examples in our training data. The model starts with pre-trained weights $\Phi_0$ and learns an update $\Delta \Phi$, resulting in final weights of $\Phi_0 + \Delta \Phi$. The problem is that $\Delta \Phi$ is just as large as $\Phi_0$.
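
The double sum in the objective can be evaluated by hand for a toy dataset. In the sketch below the per-token probabilities are made-up numbers standing in for $P_{\Phi}(y_t \mid x, y_{<t})$; a real model would produce them from its logits.

```python
import math

# Made-up probabilities that the model assigns to each correct token y_t.
dataset = [
    [0.9, 0.8, 0.95],  # example 1: |y| = 3
    [0.6, 0.7],        # example 2: |y| = 2
]


def log_likelihood(examples):
    """Sum of log P(y_t | x, y_<t) over all tokens of all examples."""
    return sum(math.log(p) for token_probs in examples for p in token_probs)


objective = log_likelihood(dataset)

# The sum of logs equals the log of the product of the probabilities.
joint_probability = math.prod(p for token_probs in dataset for p in token_probs)
print(objective, math.log(joint_probability))
```

Training adjusts the parameters so that this (negative) number moves toward zero, that is, so that the correct tokens become more probable.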


The initial pre-trained weights, $\Phi_0$, correspond to our simple matrix example $W_0$.  
The final, optimized parameters, $\Phi$, are equivalent to our fine-tuned matrix $W_{ft}$.  
The update, $\Delta \Phi$, which is what we learn during training, corresponds to $\Delta W$.  
The paper's key point that $|\Delta \Phi|$ equals $|\Phi_0|$ is exactly what we illustrated: the update matrix $\Delta W$ has the same large dimensions as the original matrix $W_0$, making it expensive to train and store.


LoRA is built on a key insight: the update matrix $ΔW$ does not need to have full rank to be effective. The authors hypothesize that the change in weights during adaptation has a low "intrinsic rank". This means that the large $ΔW$ matrix can be approximated with high fidelity by multiplying two much smaller matrices.

Instead of learning $ΔW$ directly, LoRA learns two smaller matrices, which we'll call $A$ and $B$.

$ΔW ≈ B⋅A$

This is a low-rank decomposition. The "rank" ($r$) is a small number we choose (like 1, 2, 8, or 64) that determines the inner dimension of these thin matrices.

If $W_0$ is a $d×k$ matrix, then $ΔW$ is also $d×k$. With LoRA, matrix $A$ will have dimensions $r×k$, and matrix $B$ will have dimensions $d×r$. The number of trainable parameters is now the sum of the parameters in $A$ and $B$ ($d×r + r×k$), which is dramatically smaller than the $d×k$ parameters in $ΔW$, especially when $r$ is much smaller than $d$ and $k$.

> Crucially, during training with LoRA, the original weights $W_0$ are frozen and do not receive gradient updates. We only train the much smaller $A$ and $B$ matrices.
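
The saving is easy to quantify. The helper below (a hypothetical name, `lora_param_counts`) compares the two parameter counts; the 4096 × 4096 size and rank 8 are assumptions picked only to show the order of magnitude.

```python
def lora_param_counts(d, k, r):
    """Return (full update size, LoRA update size) for a d x k weight matrix."""
    full = d * k          # entries in Delta W
    lora = d * r + r * k  # entries in B (d x r) plus A (r x k)
    return full, lora


full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(full, lora, f"{100 * lora / full:.2f}%")

# For a tiny matrix the saving is negligible: 2 x 3 with r = 1.
print(lora_param_counts(d=2, k=3, r=1))
```

The second call shows why the benefit only appears when $r$ is much smaller than $d$ and $k$: for a 2×3 matrix with $r = 1$, LoRA still needs 5 values against 6 for the full update.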


# Mathematical Walk-through of Forward Pass with LoRA

The standard forward pass for a layer is $h = W \cdot x$, where $h$ is the output, $W$ is the weight matrix, and $x$ is the input.

With LoRA, the output of the frozen pre-trained weights is computed as usual, and the output of the LoRA matrices is added to it.

The modified forward pass is:

$$h = W_0 \cdot x + B \cdot A \cdot x$$

This can be written as $h = (W_0 + B \cdot A) \cdot x$ (the $\frac{\alpha}{r}$ scaling factor is omitted here for simplicity). Let's use our example matrices to walk through the calculation.

## 1. Define Inputs

• Pre-trained weights $W_0$:

$$W_0 = \begin{bmatrix} 0.8 & 0.1 & 0.3 \\ 0.2 & 0.7 & 0.5 \end{bmatrix}$$

• Trained LoRA matrices $A$ and $B$ (with r=1):

$$B = \begin{bmatrix} 0.4 \\ 0.2 \end{bmatrix}, \quad A = \begin{bmatrix} 0.25 & -0.5 & 0.1 \end{bmatrix}$$

• An input vector $x$:

$$x = \begin{bmatrix} 10 \\ 20 \\ 30 \end{bmatrix}$$

## 2. Calculate the original path

First, we compute the output from the frozen, pre-trained weights.

$$h_0 = W_0 \cdot x = \begin{bmatrix} 0.8 & 0.1 & 0.3 \\ 0.2 & 0.7 & 0.5 \end{bmatrix} \cdot \begin{bmatrix} 10 \\ 20 \\ 30 \end{bmatrix}$$

$$= \begin{bmatrix} (8 + 2 + 9) \\ (2 + 14 + 15) \end{bmatrix}$$

$$= \begin{bmatrix} 19 \\ 31 \end{bmatrix}$$

## 3. Calculate the LoRA path

Next, we compute the update from our trained LoRA matrices. It's more efficient to multiply $A \cdot x$ first.

$$A \cdot x = \begin{bmatrix} 0.25 & -0.5 & 0.1 \end{bmatrix} \cdot \begin{bmatrix} 10 \\ 20 \\ 30 \end{bmatrix}$$

$$= [(2.5 - 10 + 3)]$$

$$= [-4.5]$$

Now multiply that result by B:

$$\Delta h = B \cdot (A \cdot x)$$

$$= \begin{bmatrix} 0.4 \\ 0.2 \end{bmatrix} \cdot [-4.5]$$

$$= \begin{bmatrix} -1.8 \\ -0.9 \end{bmatrix}$$

## 4. Combine the outputs

Finally, we add the two results together to get the final output h.

$$h = h_0 + \Delta h$$

$$= \begin{bmatrix} 19 \\ 31 \end{bmatrix} + \begin{bmatrix} -1.8 \\ -0.9 \end{bmatrix}$$

$$= \begin{bmatrix} 17.2 \\ 30.1 \end{bmatrix}$$

This process, keeping $W_0$ frozen and training only the $B \cdot A$ path, is how LoRA achieves its parameter efficiency during training.
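
The forward-pass arithmetic can be checked with NumPy, using exactly the matrices from the walk-through. The variable names are illustrative.

```python
import numpy as np

W0 = np.array([[0.8, 0.1, 0.3],
               [0.2, 0.7, 0.5]])
B = np.array([[0.4],
              [0.2]])
A = np.array([[0.25, -0.5, 0.1]])
x = np.array([[10.0], [20.0], [30.0]])

h0 = W0 @ x            # frozen path
delta_h = B @ (A @ x)  # LoRA path, multiplying A @ x first
h = h0 + delta_h

# Merging the update into the weights gives the same output.
h_merged = (W0 + B @ A) @ x

print(h0.ravel(), delta_h.ravel(), h.ravel())
print(np.allclose(h, h_merged))
```

The last line confirms that the two-path form and the merged form $(W_0 + B \cdot A) \cdot x$ agree, which is why the LoRA update can be folded into the weights after training.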

# Mathematical Walk-through of the Backward Pass

The "backward pass," or backpropagation, is the core mechanism by which a neural network learns. Its goal is to calculate how much each trainable parameter in the model contributed to the final error (or "loss"). Once we know this, we can adjust the parameters slightly to reduce that error.

In LoRA, the key efficiency gain comes from the fact that we only need to calculate these adjustments for the tiny $A$ and $B$ matrices. The massive pre-trained weight matrix, $W_0$, is frozen, so we never compute or store a gradient (or optimizer state) for it. Gradients still flow through $W_0$ to reach earlier layers, but the per-weight bookkeeping for $W_0$ itself is skipped, which saves a large amount of memory.

Let's walk through how the gradients for $A$ and $B$ are calculated.

## 1. The Setup

First, let's recall the forward pass equation:

$$
h = W_0 \cdot x + B \cdot A \cdot x
$$

The backward pass starts with a gradient signal coming from the next layer of the network. This signal, which we'll call $grad_h$, tells us how the final loss $(L)$ would change with respect to a small change in our output, $h$. Mathematically, this is the derivative $\frac{\partial L}{\partial h}$. Our task is to use this incoming gradient to figure out $\frac{\partial L}{\partial A}$ and $\frac{\partial L}{\partial B}$.

We will use the same matrices and input vector from our forward pass example:

$$
A = \begin{bmatrix} 0.25 & -0.5 & 0.1 \end{bmatrix}
$$

$$
B = \begin{bmatrix} 0.4 \\ 0.2 \end{bmatrix}
$$

$$
x = \begin{bmatrix} 10 \\ 20 \\ 30 \end{bmatrix}
$$

Let's assume the incoming gradient $grad_h$ is:

$$
grad_h = \frac{\partial L}{\partial h} = \begin{bmatrix} 0.5 \\ -0.2 \end{bmatrix}
$$


## 2. The Chain Rule in Action

Because $h = W_0 \cdot x + \Delta h$ and the term $W_0 \cdot x$ does not depend on $A$ or $B$, the gradient with respect to the LoRA path ($\Delta h = B \cdot A \cdot x$) equals the gradient with respect to the total output $h$. To train the adapter, the gradient only needs to be traced back through the parts of the computation that involve our trainable parameters.

## 3. Calculating the Gradient for B (grad_B)

To find how B affects the loss, we use the chain rule. The gradient of the loss with respect to B is found by multiplying the gradient of the output (grad_h) by how B affects the output.

The update term is $\Delta h = B \cdot (A \cdot x)$. The derivative of this with respect to B involves the term it was multiplied by, which is $(A \cdot x)$.

The formula is:

$$\frac{\partial L}{\partial B} = \text{grad\_h} \cdot (A \cdot x)^T$$

Let's calculate this:

First, we need the term $(A \cdot x)$. We already computed this in the forward pass:
$A \cdot x = [-4.5]$

The transpose, $(A \cdot x)^T$, is $[-4.5]$.

Now we multiply:
$$\text{grad\_B} = \frac{\partial L}{\partial B} = \begin{bmatrix} 0.5 \\ -0.2 \end{bmatrix} \cdot [-4.5] = \begin{bmatrix} 0.5 \times -4.5 \\ -0.2 \times -4.5 \end{bmatrix} = \begin{bmatrix} -2.25 \\ 0.9 \end{bmatrix}$$

This grad_B matrix, which has the same shape as B, tells us how to adjust each element in B to reduce the loss.

## 4. Calculating the Gradient for A (grad_A)

Similarly, we find how A affects the loss. This time, the gradient has to pass back through B first.

The formula is:

$$\frac{\partial L}{\partial A} = (B^T \cdot \text{grad\_h}) \cdot x^T$$

Let's calculate this step-by-step:

First, we need the transpose of B:
$B^T = \begin{bmatrix} 0.4 & 0.2 \end{bmatrix}$

Next, we multiply $B^T$ by grad_h:
$$B^T \cdot \text{grad\_h} = \begin{bmatrix} 0.4 & 0.2 \end{bmatrix} \cdot \begin{bmatrix} 0.5 \\ -0.2 \end{bmatrix} = [(0.4 \times 0.5) + (0.2 \times -0.2)] = [0.2 - 0.04] = [0.16]$$

Finally, we multiply this result by the transpose of the input, $x^T$:
$$\text{grad\_A} = \frac{\partial L}{\partial A} = [0.16] \cdot \begin{bmatrix} 10 & 20 & 30 \end{bmatrix} = \begin{bmatrix} 1.6 & 3.2 & 4.8 \end{bmatrix}$$

This grad_A matrix, which has the same shape as A, tells us how to adjust A.
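
Both gradient formulas can be verified numerically. The sketch below reuses the matrices from the walk-through and adds a finite-difference check; the stand-in loss $L = \text{grad\_h}^T h$ is an assumption chosen because its derivative with respect to $h$ is exactly the given `grad_h`.

```python
import numpy as np

B = np.array([[0.4], [0.2]])
A = np.array([[0.25, -0.5, 0.1]])
x = np.array([[10.0], [20.0], [30.0]])
grad_h = np.array([[0.5], [-0.2]])

grad_B = grad_h @ (A @ x).T    # shape (2, 1), same as B
grad_A = (B.T @ grad_h) @ x.T  # shape (1, 3), same as A


def loss(A_, B_):
    """Stand-in loss; W0 @ x is dropped because it does not depend on A or B."""
    return (grad_h.T @ (B_ @ A_ @ x)).item()


# Nudge one entry of A and see how much the loss changes.
eps = 1e-6
A_shift = A.copy()
A_shift[0, 1] += eps
numeric = (loss(A_shift, B) - loss(A, B)) / eps

print(grad_B.ravel(), grad_A.ravel())
print(numeric, grad_A[0, 1])
```

The finite-difference estimate for the middle entry of $A$ should agree with the analytic value from the formula, up to floating-point error.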

## 5. Updating the Weights

After calculating the gradients, the optimizer performs a weight update. Using a simple learning rate (lr), the update rule is:

$$A_{new} = A_{old} - \text{lr} \cdot \text{grad\_A}$$
$$B_{new} = B_{old} - \text{lr} \cdot \text{grad\_B}$$

And that's it! The crucial part is that W_0 is never updated. It is not part of the gradient calculation and requires no memory to store its gradients or optimizer states (like momentum). This is the source of LoRA's efficiency during the training process.
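
A minimal sketch of the update rule, assuming plain gradient descent and a hypothetical learning rate of 0.1 (the text does not fix a value). B and the two gradients come from the worked example. The values of A are not restated in this part, so only the change applied to A is shown.

```python
import numpy as np

lr = 0.1  # hypothetical learning rate, chosen only for illustration

B_old = np.array([[0.4], [0.2]])
grad_B = np.array([[-2.25], [0.9]])
grad_A = np.array([[1.6, 3.2, 4.8]])

def sgd_step(param, grad, lr):
    """Plain gradient-descent update: param - lr * grad."""
    return param - lr * grad

B_new = sgd_step(B_old, grad_B, lr)
# The change applied to A is -lr * grad_A, whatever A's current values are.
delta_A = -lr * grad_A
print(B_new)
print(delta_A)
```

There is no corresponding line for $W_0$: it has no gradient and no optimizer state, so nothing is computed or stored for it.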


## Link Back to the Math Equation on LoRA
The equation below from the paper describes the optimization objective of LoRA:

$\max_{\Theta} \sum_{(x,y) \in Z} \sum_{t=1}^{|y|} \log(P_{\Phi_0 + \Delta\Phi(\Theta)}(y_t|x,y_{<t}))$

The key changes of this equation compared to the previous equation for full-model fine-tuning:

$\max_{\Theta}$: This is the most important change. Instead of optimizing over the massive set of parameters $\Phi$, we are now optimizing over a much smaller set of parameters, denoted by $\Theta$. As the paper notes, the size of $\Theta$ can be as small as 0.01% of the size of the original parameters ($|\Theta| \ll |\Phi_0|$). In our LoRA explainer, $\Theta$ represents the collection of all the trainable values in our small A and B matrices.

$P_{\Phi_0 + \Delta\Phi(\Theta)}$: This shows how the model's weights are constructed.

$\Phi_0$: These are the original, pre-trained weights of the large model. They are treated as a fixed constant and are not trained. This corresponds to our frozen $W_0$ matrix.

$\Delta\Phi(\Theta)$: This represents the weight update, but it's no longer a huge matrix of trainable parameters. Instead, it's a function that generates the large update matrix from the small set of parameters $\Theta$. For LoRA, this function is the matrix multiplication of our small matrices: $\Delta\Phi(\Theta) = B \cdot A$. The parameters $\Theta$ are the entries of B and A.

What it means:

In short, training searches over the small parameter set $\Theta$ while the original weights $\Phi_0$ stay frozen, and the update $\Delta\Phi(\Theta)$ is generated from $\Theta$ rather than learned as a full-size matrix.

Link to this article: This equation is the mathematical foundation for the section "Conceptual Overview: Low-Rank Adaptation (LoRA)". The correspondence is:

- $\Theta$ is the collection of all the elements in our low-rank matrices, A and B.
- $\Delta\Phi(\Theta)$ corresponds directly to the matrix multiplication $B \cdot A$, which takes the small number of parameters in A and B and produces the full-sized update matrix $\Delta W$.
- $\Phi_0 + \Delta\Phi(\Theta)$ is precisely what we illustrated in the forward pass: $h = (W_0 + B \cdot A) \cdot x$.
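
To make the role of $\Theta$ concrete, here is a small sketch with hypothetical sizes (a 64 × 64 layer and rank 4, chosen only to keep the numbers readable). It builds the full-size update from A and B and compares how many values are trainable with how many a full update would need.

```python
import numpy as np

d, r = 64, 4                      # hypothetical layer width and LoRA rank
rng = np.random.default_rng(0)

A = rng.normal(size=(r, d))       # part of Theta
B = rng.normal(size=(d, r))       # part of Theta

delta_W = B @ A                   # Delta Phi(Theta): full d x d update

num_theta = A.size + B.size       # trainable values
num_full = delta_W.size           # values in a full-size update
print(delta_W.shape, np.linalg.matrix_rank(delta_W))
print(num_theta, num_full, num_theta / num_full)
```

The update has the shape of the original weight matrix, but its rank can never exceed r, which is the sense in which LoRA restricts the update to a low-rank subspace.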


# LoRA in the Self-Attention Module

The self-attention mechanism is a cornerstone of the Transformer architecture. For each input token, it computes Query (Q), Key (K), and Value (V) vectors. These are generated by multiplying the input embedding (x) with three distinct weight matrices: $W_q$, $W_k$, $W_v$. After the attention scores are calculated and applied to the Value vectors, the result is passed through a final output projection matrix, $W_o$, to produce the layer's output.

While LoRA can be applied to any weight matrix, the paper's authors focus their study on only the weight matrices in the self-attention module ($W_q$, $W_k$, $W_v$, $W_o$). They find that for maximum parameter-efficiency, it is often sufficient to adapt only a subset of these matrices. For example, adapting only the query ($W_q$) and value ($W_v$) matrices can yield strong performance. For the purpose of a complete illustration, we will describe the process as if LoRA is applied to all four:

- For $W_q$, we add $B_q \cdot A_q$
- For $W_k$, we add $B_k \cdot A_k$
- For $W_v$, we add $B_v \cdot A_v$
- For $W_o$, we add $B_o \cdot A_o$

These eight LoRA matrices ($A_q$, $B_q$, $A_k$, $B_k$, and so on) are the only parameters that are trained. The original W matrices remain frozen. The calculations for each of the four paths are independent and can be performed in parallel.
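
The four adapted projections can be sketched in NumPy as follows. This is a toy illustration, not a full attention layer: the sizes are hypothetical, the attention computation itself is omitted, and B is initialized to zero (an assumption that follows common LoRA practice, so that the adapted layer starts out identical to the frozen one).

```python
import numpy as np

d_model, r = 16, 2                         # hypothetical sizes
rng = np.random.default_rng(0)

def make_lora_linear(d_in, d_out, r):
    """Return a frozen W plus trainable A (small random) and B (zeros)."""
    return {
        "W": rng.normal(size=(d_out, d_in)),        # frozen
        "A": rng.normal(size=(r, d_in)) * 0.01,     # trainable
        "B": np.zeros((d_out, r)),                  # trainable, zero init
    }

def lora_forward(layer, x):
    """h = W x + B (A x); only A and B would receive gradients."""
    return layer["W"] @ x + layer["B"] @ (layer["A"] @ x)

layers = {name: make_lora_linear(d_model, d_model, r) for name in ("q", "k", "v", "o")}
x = rng.normal(size=(d_model,))

# Each projection is computed independently of the others.
outputs = {name: lora_forward(layer, x) for name, layer in layers.items()}

trainable = sum(l["A"].size + l["B"].size for l in layers.values())
frozen = sum(l["W"].size for l in layers.values())
print(trainable, frozen)
```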


## Why does LoRA work?

If a model requires a lot of parameters to learn certain behaviors during pretraining, shouldn’t it also require a lot of parameters to change its behaviors during finetuning?

Many papers have argued that while LLMs have many parameters, they have very low intrinsic dimensions; see Li et al. (2018); Aghajanyan et al. (2020); and Hu et al. (2021). They showed that pre-training implicitly minimizes the model’s intrinsic dimension. Surprisingly, larger models tend to have lower intrinsic dimensions after pre-training. This suggests that pre-training acts as a compression framework for downstream tasks. In other words, the better trained an LLM is, the easier it is to finetune the model using a small number of trainable parameters and a small amount of data.

---

### LoRA Configurations

To apply LoRA, you need to decide:
- **What weight matrices to apply LoRA to**
- **The rank of each factorization**

LoRA is most commonly applied to the four weight matrices in the attention modules:
- **Query (Wq)**
- **Key (Wk)**
- **Value (Wv)**
- **Output projection (Wo)**

Typically, LoRA is applied uniformly to all matrices of the same type within a model. For example, applying LoRA to the query matrix means applying LoRA to all query matrices in the model.


### Serving LoRA adapters

LoRA not only lets you finetune models using less memory and data, but it also simplifies serving multiple models due to its modularity. To understand this benefit, let’s examine how to serve a LoRA-finetuned model.

---

In general, there are two ways to serve a LoRA-finetuned model:

1. **Merge the LoRA weights A and B into the original model to create the new matrix Wʹ prior to serving the finetuned model.**  
    Since no extra computation is done during inference, no extra latency is added.

2. **Keep W, A, and B separate during serving.**  
    The process of merging A and B back to W happens during inference, which adds extra latency.

---

The first option is generally better if you have only one LoRA model to serve, whereas the second is generally better for multi-LoRA serving—serving multiple LoRA models that share the same base model.
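
The two options produce the same output; they differ only in when the low-rank product is applied. The sketch below uses hypothetical sizes and shows one way to realize option 2, namely computing the adapter path separately for each request instead of materializing the merged matrix.

```python
import numpy as np

d, r = 8, 2                                # hypothetical sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))                # frozen base weight
A = rng.normal(size=(r, d))
B = rng.normal(size=(d, r))
x = rng.normal(size=(d,))

# Option 1: merge once before serving, then one matmul per request.
W_merged = W + B @ A
h_merged = W_merged @ x

# Option 2: keep the adapter separate and add its path per request.
h_separate = W @ x + B @ (A @ x)

print(np.allclose(h_merged, h_separate))
```

With option 2, swapping in a different customer's A and B leaves W untouched, which is what makes multi-LoRA serving practical.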

---

### Multi-LoRA Serving

For multi-LoRA serving, while option 2 adds latency overhead, it significantly reduces the storage needed. Consider the scenario in which you finetune a model for each of your customers using LoRA. With 100 customers, you end up with 100 finetuned models, all sharing the same base model. 

- **Option 1:** You have to store 100 full-rank matrices Wʹ.  
- **Option 2:** You only have to store one full-rank matrix W and 100 sets of smaller matrices (A, B).

---

### Example

Let’s say that the original matrix W is of the dimension 4096 × 4096 (16.8M parameters). If the LoRA’s rank is 8, the number of parameters in A and B is:

$$
4096 \times 8 \times 2 = 65{,}536
$$

- **Option 1:** 100 full-rank matrices Wʹ totals:  
  $$
  16.8\text{M} \times 100 = 1.68\text{B parameters.}
  $$

- **Option 2:** One full-rank matrix W and 100 sets of small matrices (A, B) totals:  
  $$
  16.8\text{M} + 65{,}536 \times 100 = 23.3\text{M parameters.}
  $$
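
The arithmetic in this example can be reproduced in a few lines of Python. The variable names are illustrative; the dimensions, rank, and customer count are the ones used in the text.

```python
d, r, n_customers = 4096, 8, 100

full = d * d                      # 16,777,216 parameters in W
adapter = 2 * d * r               # 65,536 parameters in A and B together

option_1 = n_customers * full                 # 100 merged matrices
option_2 = full + n_customers * adapter       # one W plus 100 adapters

print(f"{option_1:,} vs {option_2:,} ({option_1 / option_2:.0f}x)")
```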

---

### Key Insight

Option 2 also makes it faster to switch between tasks.

## Quantization

LLMs get their name due to the number of parameters they contain. Nowadays, these models typically have billions of parameters (mostly weights) which can be quite expensive to store.

During inference, activations are created as a product of the input and the weights, which similarly can be quite large.

As a result, we would like to represent billions of values as efficiently as possible, minimizing the amount of space we need to store a given value.


The weights of an LLM are numeric values with a given precision, which can be expressed by the number of bits like float64 or float32.  

If we lower the number of bits used to represent a value, we get a less accurate result. However, we also lower the memory requirements of that model.
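
The trade-off between precision and memory can be seen with nothing more than the standard library. The sketch below stores π in three floating-point formats and reports the round-trip error next to the memory a hypothetical 1B-parameter model would need at that width (the model size is an assumption for illustration only).

```python
import math
import struct

def roundtrip(value, fmt):
    """Store `value` in the given struct float format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

formats = {"float64": ("d", 64), "float32": ("f", 32), "float16": ("e", 16)}
n_params = 1_000_000_000  # hypothetical 1B-parameter model

for name, (fmt, bits) in formats.items():
    stored = roundtrip(math.pi, fmt)
    error = abs(stored - math.pi)
    gigabytes = n_params * bits / 8 / 1e9
    print(f"{name}: stored={stored!r}, error={error:.2e}, memory={gigabytes:.0f} GB")
```

Each halving of the bit width halves the memory, while the representation error grows.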

Papers like [“Intrinsic dimensionality explains the effectiveness of language model fine-tuning”](https://arxiv.org/pdf/2012.13255) demonstrate that language models “have a very low intrinsic dimension.” This means that we can find small ranks that approximate even the massive matrices of an LLM. A 175B model like GPT-3, for example, would have a weight matrix of 12,288 × 12,288 inside each of its 96 Transformer blocks. That’s 150 million parameters. If we can successfully adapt that matrix into rank 8, that would only require two 12,288 × 8 matrices, resulting in 197K parameters per block.

## QLoRA - Quantized LoRA

We can make LoRA even more efficient by reducing the memory requirements of the model’s frozen original weights, on top of which the small LoRA matrices are trained. As described in the Quantization section, storing each weight with fewer bits lowers the memory requirements at the cost of some accuracy.

In the original LoRA paper, during finetuning, the model’s weights are stored using 16 bits. QLoRA stores the model’s weights in 4 bits but dequantizes (converts) them back into BF16 when computing the forward and backward pass.

QLoRA enables finetuning very large pretrained LLMs (e.g., 33B, 65B parameters) on a single high-memory GPU (24–48 GB) by combining three complementary ideas: (1) **NF4**, a 4-bit quantization type designed for normally-distributed weights, (2) **Double Quantization (DQ)**, which further compresses the small per-block quantization metadata, and (3) **Paged Optimizers**, which keep optimizer state pageable on host memory to avoid GPU OOMs. The core intuition: keep compute precision high (use bfloat16/float16 for matrix multiplies) while aggressively compressing the frozen base model weights and carefully managing optimizer memory for the tiny, trainable LoRA adapters.

## 1. NF4 — Intuition, math, and numeric example
Neural network weight histograms typically look bell-shaped (approximately normal). If you only have 4 bits (16 values) to represent numbers, you want many representable values close to zero and few in the tails. **NF4** (NormalFloat, 4-bit) chooses those 16 values as quantiles of a standard normal distribution — i.e., equal-mass bins for a normal prior. This gives much lower quantization error than uniform 4-bit or integer-4 schemes for transformer weights.

**Formal intuition / formula** (stylized):
$$
v_i \propto \Phi^{-1}\!\Big(\frac{i+0.5}{16}\Big),\quad i=0\dots 15,
$$
where $\Phi^{-1}$ is the inverse standard-normal CDF (not to be confused with the parameter symbol $\Phi$ used in the LoRA objective). Practically, QLoRA uses **blockwise** quantization (blocksize = **64**) and stores a per-block scale constant, so each block is dequantized with its own scale.

**Concrete numbers**:
- 65B params in 16-bit: ≈ 130 GB for weights alone.
- Same in 4-bit: ≈ 32.5 GB (4 bits = 0.5 bytes/param).
- NF4 improves 4-bit fidelity (paper shows lower perplexities vs integer 4-bit).
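
A minimal sketch of the idea, assuming the stylized quantile formula above rather than the exact NF4 table shipped with bitsandbytes (which differs in detail, for example by including an exact zero). It builds 16 levels from normal quantiles, then quantizes and dequantizes one block of 64 weights using a per-block absmax scale. The function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

# Stylized 16-level codebook: normal quantiles, rescaled to [-1, 1].
# Assumption: this follows the formula above, not the exact NF4 table.
quantiles = np.array([NormalDist().inv_cdf((i + 0.5) / 16) for i in range(16)])
levels = quantiles / np.abs(quantiles).max()

def quantize_block(block):
    """Return 4-bit codes (0..15) and the per-block absmax scale."""
    scale = np.abs(block).max()
    codes = np.abs(block[:, None] / scale - levels[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), scale

def dequantize_block(codes, scale):
    """Look up each code's level and undo the scaling."""
    return levels[codes] * scale

rng = np.random.default_rng(0)
block = rng.normal(scale=0.02, size=64)      # one block of 64 weights
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
print(codes.max() < 16, np.abs(block - restored).max())
```

Because the levels are denser near zero, most weights in a bell-shaped block land close to a representable value; the largest errors occur in the sparsely covered tails.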


## 2. Double Quantization — why quantize the quantizer
Blockwise 4-bit encoding requires storing a tiny per-block scale (and maybe zero point). When blocksize is small (64), the number of blocks is large and these metadata add up. **Double Quantization** applies a second quantization pass to the *array of per-block scales* (e.g., quantize them to 8-bit with a larger blocksize such as 256). The second pass costs almost nothing in accuracy but gives additional memory savings.

**Why it helps (numbers)**:
- DQ reduces storage by about **0.373 bits/parameter** (with typical block choices). On 65B params this is ≈ **3.0 GB** extra saving beyond plain 4-bit storage — often decisive when squeezing into a 48GB GPU.
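
The 0.373 bits/parameter figure can be reproduced with a short calculation. It assumes, as in the QLoRA paper's own accounting, that the first-level scales are 32-bit values before double quantization and 8-bit values afterwards, with one 32-bit second-level constant per 256 scales.

```python
block_1 = 64      # weights per first-level block
block_2 = 256     # scales per second-level block

# Without DQ: one 32-bit scale per 64 weights.
bits_plain = 32 / block_1

# With DQ: one 8-bit scale per 64 weights, plus one 32-bit
# second-level constant per 256 first-level scales.
bits_dq = 8 / block_1 + 32 / (block_1 * block_2)

saving_bits = bits_plain - bits_dq
saving_gb = saving_bits * 65e9 / 8 / 1e9
print(f"{bits_plain:.3f} -> {bits_dq:.3f} bits/param, saving {saving_bits:.3f}")
print(f"about {saving_gb:.2f} GB on 65B parameters")
```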

**Implementation notes**:
- Stage 1: blockwise NF4 encode weights (store 4-bit packed codes + per-block scales).
- Stage 2: take the per-block scales tensor → subtract mean/center → encode with an 8-bit quantizer (larger blocksize).
- Runtime dequantization reverses both stages; overhead is small compared to matrix multiply.


## 3. Paged Optimizers — avoid OOM spikes
Even with compact weights, training needs memory for activations and optimizer states. QLoRA trains only LoRA adapters (tiny low-rank matrices), but optimizer moments for those still exist. QLoRA uses **paged optimizers**: allocate optimizer state in pageable host memory (CPU RAM) and let the CUDA runtime / unified memory move pages to GPU as needed. This prevents fatal OOMs due to rare activation/optimizer spikes.

**Trade-offs**:
- Pros: avoids OOM, enables single-GPU finetuning of 33B/65B models.
- Cons: if page faults are frequent (thrashing), runtime slows down. In many practical runs paging happens only occasionally so impact is minor.

---

## 4. End-to-end integration (runtime flow)
1. **Load** base model weights from disk in NF4+DQ packed form. They remain **frozen**.  
2. **Dequantize** blocks on demand to bfloat16 for computation (matrix multiplies).  
3. **LoRA adapters** (trainable) are in higher precision (bfloat16/float16); only they receive gradients.  
4. **Optimizer state** for adapters is pageable — mostly on host RAM, moved to GPU only when needed.  
5. **Gradient checkpointing** is typically enabled to reduce activation memory further.

This combination keeps compute precision high while minimizing storage and avoids OOM spikes from optimizer state.


## 5. Practical numbers and quick mental model
- 65B model: 16-bit weights ≈ 130 GB; 4-bit weights ≈ 32.5 GB.
- DQ saves an extra ≈ 3 GB on 65B (0.373 bits/param).
- Paper’s reported full-training footprint (16-bit) could be ~780 GB; QLoRA brings it to the range of a single 48GB GPU for finetuning.
- Quality: NF4 + DQ + LoRA (applied broadly) recovers near 16-bit finetuning performance across benchmarks.

## Hands-On Example - Instruction Tuning with QLoRA

Now that we have explored how QLoRA works, let us put that knowledge into
practice! In this section, we will fine-tune a completely open source and smaller
version of Llama, TinyLlama, to follow instructions using the QLoRA procedure.
Consider this model a base or pretrained model, one that was trained with language
modeling but cannot yet follow instructions.

We use TinyLlama's chat template throughout the examples, since the chat version of
TinyLlama uses the same format. The data that we are using is a small subset of the
UltraChat dataset. This dataset is a filtered version of the original UltraChat dataset
that contains almost 200k conversations between a user and an LLM.

In [1]:
from transformers import AutoTokenizer
from datasets import load_dataset


# Load a tokenizer to use its chat template
template_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def format_prompt(example):
    """Format the prompt using the <|user|> template that TinyLlama uses."""

    # Format answers
    chat = example["messages"]
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)

    return {"text": prompt}

# Load and format the data using the template TinyLlama is using
dataset = (
    load_dataset("HuggingFaceH4/ultrachat_200k",  split="test_sft")
      .shuffle(seed=42)
      .select(range(3_000))
)
dataset = dataset.map(format_prompt)
dataset = dataset.remove_columns([c for c in dataset.column_names if c != "text"])

In [2]:
# Example of formatted prompt
print(dataset["text"][2576])

<|user|>
Given the text: Knock, knock. Who’s there? Hike.
Can you continue the joke based on the given text material "Knock, knock. Who’s there? Hike"?</s>
<|assistant|>
Sure! Knock, knock. Who's there? Hike. Hike who? Hike up your pants, it's cold outside!</s>
<|user|>
Can you tell me another knock-knock joke based on the same text material "Knock, knock. Who's there? Hike"?</s>
<|assistant|>
Of course! Knock, knock. Who's there? Hike. Hike who? Hike your way over here and let's go for a walk!</s>
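
To see what the template does, the layout of the printed example can be approximated by hand. This is only a sketch inferred from the output above: the real template is stored in the tokenizer and may differ in details such as whitespace or system messages, and `format_chat` is a hypothetical helper, not part of any library.

```python
EOS = "</s>"  # end-of-sequence marker seen in the printed example

def format_chat(messages):
    """Hand-rolled approximation of the <|user|>/<|assistant|> layout."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}{EOS}\n")
    return "".join(parts)

demo = [
    {"role": "user", "content": "Knock, knock."},
    {"role": "assistant", "content": "Who's there?"},
]
print(format_chat(demo))
```

Each turn is wrapped in a role marker and terminated with the end-of-sequence token, which is how the model learns where a turn starts and where it should stop generating.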



In ```BitsAndBytesConfig```, you can define the quantization scheme. We follow the steps
used in the original QLoRA paper and load the model in 4-bit (```load_in_4bit```) with
the NormalFloat (NF4) representation (```bnb_4bit_quant_type```) and double quantization
(```bnb_4bit_use_double_quant```):

### Questions to ask

According to the theory that we went through:

1. What does the parameter ```bnb_4bit_compute_dtype``` represent?
2. Why did we set ```model.config.use_cache=False```?
3. What would happen if we removed the ```quantization_config``` parameter?

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

# 4-bit quantization configuration - Q in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use 4-bit precision model loading
    bnb_4bit_quant_type="nf4",  # Quantization type
    bnb_4bit_compute_dtype="float16",  # Compute dtype
    bnb_4bit_use_double_quant=True,  # Apply nested quantization
)

# Load the model to train on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",

    # Leave this out for regular SFT
    quantization_config=bnb_config,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=False)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "right"

try:
    if getattr(tokenizer, "chat_template", None) in (None, ""):
        chat_tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        tokenizer.chat_template = chat_tok.chat_template
except Exception:
    pass

#### Important Points

1. ```r``` - This is the rank of the compressed matrices. Increasing this value also increases the size of the compressed matrices, which means less compression but more representational power. Values typically range between 4 and 64.
2. ```lora_alpha``` - Controls the amount of change that is added to the original weights. In essence, it balances the knowledge of the original model with that of the new task. A rule of thumb is to choose a value twice the size of r. Note that the configuration below departs from this rule: it uses r=64 with lora_alpha=32.
3. ```target_modules``` -  Controls which layers to target. The LoRA procedure can choose to ignore specific layers, like specific projection layers. This can speed up training but reduce performance and vice versa.
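
How ```r``` and ```lora_alpha``` interact can be sketched with two small helper functions. In standard LoRA the low-rank update is multiplied by ```lora_alpha / r```, and each adapted matrix gains ```r * (d_in + d_out)``` trainable values. The function names and the 2048 × 2048 projection size are hypothetical, chosen only for illustration.

```python
def lora_scaling(lora_alpha, r):
    """Scaling applied to B @ A in standard LoRA: alpha / r."""
    return lora_alpha / r

def lora_params(d_in, d_out, r):
    """Trainable values added to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)

d_in = d_out = 2048   # hypothetical projection size
for r in (4, 16, 64):
    print(r, lora_scaling(32, r), lora_params(d_in, d_out, r))
```

With ```lora_alpha``` held fixed, raising the rank adds capacity but shrinks the scaling factor, which is one reason the two values are usually tuned together.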

In [4]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# Prepare LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=32,  # LoRA Scaling
    lora_dropout=0.1,  # Dropout for LoRA Layers
    r=64,  # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=  # Layers to target
     ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

### Training Configuration Parameters worth mentioning 
1. ```num_train_epochs``` - The total number of passes over the training data. Higher values tend to degrade performance (the model can overfit a small dataset), so we generally like to keep this low.
2. ```learning_rate``` - Determines the step size at each iteration of weight updates. In the hyperparameters reported in the QLoRA paper, the larger models (33B and 65B parameters) were trained with a lower learning rate (1e-4) than the smaller ones (2e-4).
3. ```lr_scheduler_type``` - A cosine-based scheduler to adjust the learning rate dynamically. If warmup steps are configured, it first increases the learning rate linearly, starting from zero, until it reaches the set value. After that, the learning rate is decayed following the values of a cosine function.
4. ```optim``` - The paged optimizers used in the original QLoRA paper.


> Optimizing these parameters is a difficult task and there are no set guidelines for doing so. It requires experimentation to figure out what works best for specific datasets, model sizes, and target tasks.
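
The shape of a cosine schedule with linear warmup can be sketched in plain Python. The step count follows from the training settings used in this section (3,000 examples, batch size 2, 4 gradient-accumulation steps, one epoch, assuming a single GPU); the warmup length of 38 steps is a hypothetical value, since the training configuration here does not set one.

```python
import math

def cosine_lr(step, total_steps, peak_lr, warmup_steps=0):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# 3,000 examples, batch size 2, 4 accumulation steps -> 375 optimizer steps.
total_steps = 3000 // (2 * 4)
for step in (0, 19, 38, 187, 375):
    print(step, cosine_lr(step, total_steps, peak_lr=2e-4, warmup_steps=38))
```

With ```warmup_steps=0``` the function starts directly at the peak learning rate and only decays.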

In [5]:
from trl import SFTTrainer, SFTConfig
import torch

bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
fp16 = torch.cuda.is_available() and not bf16

def formatting_func(example):
    return example["text"]

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,  # Leave this out for regular SFT
    formatting_func=formatting_func,
    args=SFTConfig(
        output_dir="./results",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        num_train_epochs=1,
        logging_steps=10,
        fp16=fp16,
        bf16=bf16,
        gradient_checkpointing=True,
        dataset_text_field="text",
        max_length=512,
        report_to="none",
        optim="paged_adamw_32bit",
        packing=False,
        dataset_num_proc=4,
        seed=3407,
    ),
)

# Train model
trainer.train()

# Save QLoRA weights
trainer.model.save_pretrained("TinyLlama-1.1B-qlora")






| Step | Training Loss |
|------|---------------|
| 10   | 1.7687 |
| 20   | 1.4229 |
| 30   | 1.5596 |
| 40   | 1.3954 |
| 50   | 1.415  |
| 60   | 1.3632 |
| 70   | 1.5072 |
| 80   | 1.4667 |
| 90   | 1.4498 |
| 100  | 1.3739 |


In [7]:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map=None,
)

# Merge LoRA and base model
merged_model = model.merge_and_unload()

In [8]:
from transformers import pipeline

# Use our predefined prompt template
prompt = """<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""

# Run our instruction-tuned model
pipe = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])

Device set to use cuda:0


<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
Large Language Models (LLMs) are machine learning models that can generate human-like text and understand the meaning of words and phrases. They use large amounts of data to learn to model language, and are trained using large volumes of text, such as Wikipedia articles or social media posts.

LLMs can be used in a variety of applications, including text generation, natural language processing (NLP), and machine translation (MT). They can generate text in a wide range of styles and languages, and can be used to create personalized content for users or automate tasks like transcribing audio or text.

One of the most significant benefits of LLMs is their ability to generate human-like text in a variety of styles and contexts. For example, they can generate convincing conversations or narratives, or they can understand and analyze the intent behind a user's input and produce content that reflects that intent.

Anoth

# Some thoughts on Finetuning

## When to Finetune?

### Reasons to Finetune

The primary reason for finetuning is to improve a model's quality, in terms of both general capabilities and task-specific capabilities. Finetuning is commonly used to improve a model's ability to generate outputs following specific structures, such as JSON or YAML formats.

A general-purpose model that performs well on a wide range of benchmarks might not perform well on your specific task. If the model you want to use wasn’t sufficiently trained on your task, finetuning it with your data can be especially useful.

For example, an out-of-the-box model might be good at converting from text to the standard SQL dialect but might fail with a less common SQL dialect. In this case, finetuning this model on data containing this SQL dialect will help. Similarly, if the model works well on standard SQL for common queries but often fails for customer-specific queries, finetuning the model on customer-specific queries might help.

## References (Do give them a read!)


1. [SmolLM3](https://huggingface.co/HuggingFaceTB/SmolLM3-3B-Base)
2. [Estimating GPU Memory Requirements](https://harshwardhanfartale.substack.com/p/estimating-vram-requirements-for)


### Chat Templates

1. [Chat Templates Medium Article]()

2. [Chat Templates (Asimov's Addendum)](https://asimovaddendum.substack.com/p/chat-templates)


### Books - Heavily Referenced from 

1. [Hands-On Large Language Models](https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/)
2. [Designing Large Language Model Applications](https://www.oreilly.com/library/view/designing-large-language/9781098150495/)
3. [AI Engineering](https://www.oreilly.com/library/view/ai-engineering/9781098166298/)

### Blogs - Super Helpful (also Heavily Referenced)

1. [A Visual Guide to Quantization](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization)
2. [A Visual Guide to Mixture of Experts]()
3. [Practical Tips for Finetuning LLMs](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms)