# Fine-tuning LLMs


Information source: https://www.youtube.com/watch?v=iOdFUJiB0Zc&ab_channel=freeCodeCamp.org

## Quantization

Simple definition: Conversion from a higher-memory (more bits per value) format to a lower-memory (fewer bits per value) format.

**Explanation**

Any LLM is a neural network, and a neural network has parameters. This is what is meant by a 7-billion-parameter or a 3-billion-parameter model.


What are these parameters?
They are the learned weights and biases in the network. By default they are often stored as FP32, taking 32 bits (4 bytes) for each parameter value.

Now, suppose we want to fine-tune one of these bigger LLMs.
* Loading the model into the RAM or VRAM of a personal computer, edge device or mobile device is often not possible. The model is huge: a 7-billion-parameter model in FP32 needs about 28 GB for the weights alone.
* We can try loading the model in the cloud, which provides higher capacity. But cloud services with that capacity also cost more.

How can we reduce the size of the model?
We can convert from a high-bit format like FP32 to a smaller format, like FP16 (half precision) or an 8-bit representation.
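As a rough illustration of why the format matters, the sketch below estimates the memory needed just to hold the weights. The function name and the 7-billion-parameter figure are illustrative assumptions; real memory use is higher once activations, gradients and optimizer state are included.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

def weight_memory_gb(num_params: int, fmt: str) -> float:
    """Return the weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

# Hypothetical 7-billion-parameter model.
for fmt in BYTES_PER_PARAM:
    print(fmt, weight_memory_gb(7_000_000_000, fmt), "GB")
```

Halving the number of bits per parameter halves the storage for the weights, which is the whole motivation for quantization.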

### What is quantization?

Quantization converts a bigger model into a smaller one, so that it can be used for purposes like faster inference, edge device computing, mobile computing, fine-tuning etc.

### Disadvantage

When we quantize, since we decrease the precision at which we store the model parameters, there is loss of information and hence loss of accuracy.
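The loss of information can be seen directly by storing one value at different precisions and reading it back. This is a minimal sketch using Python's standard `struct` module, where `"<f"` is FP32 and `"<e"` is FP16; the helper name `round_trip` and the test value are illustrative assumptions.

```python
import struct

def round_trip(value, fmt):
    """Store value in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

x = 3.14159265
as_fp32 = round_trip(x, "<f")  # 32-bit float
as_fp16 = round_trip(x, "<e")  # 16-bit (half precision) float

# The FP16 error is several orders of magnitude larger than the FP32 error.
print("FP32 error:", abs(as_fp32 - x))
print("FP16 error:", abs(as_fp16 - x))
```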

## Calibration

Calibration is the process of working out how to perform quantization, that is, how to map the original range of values onto the quantized range.


**Types of quantization**

1. Symmetric Quantization
2. Asymmetric Quantization

**Symmetric Quantization**

Suppose we have a list of numbers in the range 0-1000, and these numbers are evenly distributed. Assume these numbers are stored in FP32 and we want to store them as uint8 (unsigned 8-bit integer).

So,

* original range: (0, 1000)
* quantized range: (0, 255), since the highest number 8 bits can store is \( 2^8 - 1 = 255 \)

\[
\text{scale factor} = \frac{1000 - 0}{255 - 0} \approx 3.92
\]

Now, any number in the original range, when divided by 3.92 and rounded, gives us its uint8 representation.

For example, 250 becomes round(250 / 3.92) = round(63.775...) = 64.
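The worked example can be sketched in a few lines of Python. The function names are illustrative assumptions, and the clamp to the 0-255 range is an added safeguard not mentioned in the notes. Dequantizing shows the information loss: 250 does not come back as exactly 250.

```python
def quantize_uint8(values, max_value):
    """Map values in [0, max_value] to integers in [0, 255]."""
    scale = max_value / 255  # about 3.92 when max_value is 1000
    quantized = [min(255, max(0, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map quantized integers back to approximate original values."""
    return [q * scale for q in quantized]

q, scale = quantize_uint8([0, 250, 1000], 1000)
restored = dequantize(q, scale)
print(q)         # quantized integers
print(restored)  # approximate originals; 250 is not recovered exactly
```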

**Asymmetric Quantization**

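What if the data is not symmetrically distributed (for example, it is right-skewed), or the range does not start at zero?

So,

* original range: (-20, 1000)
* quantized range: (0, 255)

\[
\text{scale factor} = \frac{1000 - (-20)}{255 - 0} = \frac{1020}{255} = 4.0
\]

Now, if we use this to scale -20, we get

\[
\frac{-20}{4.0} = -5
\]

But -5 is not in the range (0, 255). So we add a zero point of +5, which shifts the result into range: -20 maps to -5 + 5 = 0, and 1000 maps to 250 + 5 = 255.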



Hence, for quantization we have 2 parameters. They are scale and zero point.
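A minimal sketch of how the two parameters work together, using the (-20, 1000) range from the example. The function names are illustrative assumptions, and the formula `round(x / scale) + zero_point` is one common way to write the mapping.

```python
def quant_params(x_min, x_max, q_min=0, q_max=255):
    """Compute scale and zero point for mapping [x_min, x_max] to [q_min, q_max]."""
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, q_min=0, q_max=255):
    """Scale, shift by the zero point, and clamp to the integer range."""
    q = round(x / scale) + zero_point
    return min(q_max, max(q_min, q))

def dequantize(q, scale, zero_point):
    """Undo the shift and the scaling to get an approximate original value."""
    return (q - zero_point) * scale

scale, zero_point = quant_params(-20, 1000)
print(scale, zero_point)                    # 4.0 and 5, as in the example
print(quantize(-20, scale, zero_point))     # lowest value maps to 0
print(quantize(1000, scale, zero_point))    # highest value maps to 255
```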

## Modes of Quantization

1. Post training quantization
2. Quantization aware training


**Post Training Quantization**

In this case, we already have a pre-trained model. To use this model for our use case, we perform calibration, convert the model into a quantized model, and then use it.

Problem: There is loss of information and hence loss of accuracy.

**Quantization Aware Training**

In this case, we take our trained model, perform quantization, fine-tune it with our data to make the model more accurate, and then quantize again.

With this, we do lose accuracy from quantization, but fine-tuning with custom data makes up for some of that loss.
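One common way to make training "aware" of quantization is fake quantization: the forward pass uses a quantized copy of the weight, while the update is applied to a full-precision copy. The toy sketch below fits a single weight this way. It is an assumption about how QAT can be implemented, not something described in the source video, and the target weight 0.73, the step size 0.1 and all function names are hypothetical.

```python
def fake_quantize(w, scale):
    """Quantize then dequantize, so the forward pass sees the rounding error."""
    return round(w / scale) * scale

def train_with_fake_quant(xs, ys, scale, epochs=200, lr=0.01):
    """Fit y = w * x while the forward pass uses the quantized weight."""
    w = 0.0  # full-precision weight that the optimizer updates
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            w_q = fake_quantize(w, scale)  # forward pass uses quantized weight
            error = w_q * x - y
            # Straight-through estimator: rounding has zero gradient almost
            # everywhere, so we treat d(w_q)/dw as 1 and update w directly.
            w -= lr * 2 * error * x
    return w

xs = [1.0, 2.0, 3.0]
ys = [0.73 * x for x in xs]  # data generated by a hypothetical weight of 0.73
scale = 0.1                  # hypothetical quantization step

w = train_with_fake_quant(xs, ys, scale)
w_q = fake_quantize(w, scale)
print(w, w_q)  # w_q ends up on a grid point next to 0.73
```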



**QAT is generally preferred when accuracy matters and fine-tuning data is available. PTQ is simpler, but usually loses more accuracy.**

## LoRA and QLoRA - In-depth Intuition


LoRA and QLoRA are used specifically for fine-tuning LLMs. A model like GPT-4 is a pre-trained model: it has been trained on a large volume of data.

There are various ways of fine-tuning. In each case, the fine-tuning starts from the base model weights.

1. Full parameter fine-tuning.
2. Domain-specific fine-tuning (e.g. for finance, sales etc.)
3. Task-specific fine-tuning (task A, task B, task C etc.)


**Full parameter fine-tuning**

1. Update all model weights
2. Hardware resource constraint is a challenge.

Downstream tasks, like model inference and model monitoring, also become difficult.

To overcome this challenge, we use LoRA and QLoRA.

## LoRA (Low-Rank Adaptation) for LLM Fine-Tuning

### 1. The Parameter Problem
Large language models (LLMs) like GPT-3 or BERT have hundreds of millions to billions of parameters (weights). Fine-tuning all of these parameters for every new task is:
- **Expensive**: It requires a lot of computational resources.
- **Slow**: It takes a long time to train.
- **Inefficient**: Storing a full version of the model for each task takes up a lot of space.

LoRA offers a solution by freezing the model's original parameters and training only a small number of newly added parameters.

### 2. Core Idea: Low-Rank Decomposition
LoRA is based on **low-rank approximation**. Here’s the detailed breakdown:

- **Weight matrices in LLMs**: In a deep learning model, the computations mainly involve multiplying an input vector by large weight matrices.
  
- **Low-rank assumption**: LoRA assumes the updates to these large weight matrices during fine-tuning can be expressed as the product of two smaller matrices. This drastically reduces the number of parameters being updated.

- **How LoRA modifies weight matrices**:
  - Instead of updating the full weight matrix \( W \), LoRA inserts two small matrices \( A \) and \( B \) into the model, so:
  
    \[
    W' = W + \Delta W
    \]
  
    Where \( \Delta W = A \times B \).

  - **Matrix sizes** (for a weight matrix \( W \) of size \( d \times k \)):
    - \( A \): \( d \times r \) (restores dimensionality).
    - \( B \): \( r \times k \) (reduces dimensionality).
    - \( r \) is much smaller than both \( d \) and \( k \), making \( A \times B \) a low-rank approximation of the full \( d \times k \) update.
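To get a feel for the savings, the sketch below counts the trainable parameters for one weight matrix, comparing a full \( d \times k \) update with the two low-rank factors (\( d \times r \) plus \( r \times k \)). The sizes \( d = k = 4096 \) and \( r = 8 \) are hypothetical values chosen for illustration.

```python
def full_update_params(d, k):
    """Parameters updated when the whole d x k matrix is trained."""
    return d * k

def lora_update_params(d, k, r):
    """Parameters in the two low-rank factors: (d x r) plus (r x k)."""
    return d * r + r * k

d, k, r = 4096, 4096, 8  # hypothetical layer size and rank
full = full_update_params(d, k)
lora = lora_update_params(d, k, r)
print(full, lora, full // lora)  # full update, LoRA update, reduction factor
```

With these numbers the low-rank factors hold 256 times fewer trainable parameters than the full matrix.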

### 3. Freezing the Original Weights
- In LoRA, the original weight matrix \( W \) is **frozen**. Only the newly introduced low-rank matrices \( A \) and \( B \) are updated during fine-tuning.

- This means you update only a small portion of the model, which saves both time and memory.

### 4. How LoRA is Applied
- LoRA is typically applied to specific layers in the model, such as **attention layers** in transformers, where matrix multiplications are the most computationally expensive.

- When input data passes through the model:
  - It is first multiplied by the original weight matrix \( W \) (frozen).
  - In parallel, it passes through the low-rank matrices \( B \) and \( A \), and these matrices are updated during fine-tuning.
  - The outputs from both operations are summed: \( W \times \text{input} + (A \times (B \times \text{input})) \).
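The forward pass described above can be sketched with plain Python lists. This is a minimal illustration that assumes \( W \) is \( d \times k \), \( B \) is \( r \times k \) (reducing the input to rank \( r \)) and \( A \) is \( d \times r \) (restoring the output size); the tiny matrices and function names are made up for the example.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x):
    """Compute W x + A (B x): frozen path plus low-rank path."""
    base = matvec(W, x)                 # frozen pre-trained weights
    low_rank = matvec(A, matvec(B, x))  # B reduces k -> r, A restores r -> d
    return [b + l for b, l in zip(base, low_rank)]

# Hypothetical sizes: d = 2, k = 3, r = 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
B = [[1.0, 1.0, 1.0]]
A = [[0.5],
     [-1.0]]
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B, x))  # [10.0, -4.0]
```

Computing `A (B x)` rather than first forming the full product `A B` keeps every intermediate result small, which is where the efficiency during fine-tuning comes from.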

### 5. Advantages of LoRA
- **Efficiency**: Only small matrices \( A \) and \( B \) are fine-tuned, which reduces memory and computational overhead.
- **Modularity**: You can adapt a single model to multiple tasks by swapping in different LoRA modules without retraining the entire model.
- **Minimal Performance Impact**: Despite fine-tuning fewer parameters, LoRA achieves performance close to traditional fine-tuning.
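The modularity point can be illustrated by keeping one frozen base matrix and a separate pair of small matrices per task. The sketch below computes \( W + A \times B \) for each task without ever modifying \( W \); the task names and matrix values are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def merge(W, A, B):
    """Return W + A x B as a new matrix, leaving W untouched."""
    delta = matmul(A, B)
    return [[w + dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0],
     [0.0, 1.0]]  # frozen base weights, shared by all tasks

# One small (A, B) pair per task; only these differ between tasks.
adapters = {
    "task_a": ([[1.0], [0.0]], [[0.0, 2.0]]),
    "task_b": ([[0.0], [1.0]], [[3.0, 0.0]]),
}

merged = {name: merge(W, A, B) for name, (A, B) in adapters.items()}
print(merged["task_a"])  # [[1.0, 2.0], [0.0, 1.0]]
print(merged["task_b"])  # [[1.0, 0.0], [3.0, 1.0]]
```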

### 6. Visualization
Imagine your model is a large building. Instead of rebuilding the whole structure for each new task, LoRA adds small adjustment blocks (matrices \( A \) and \( B \)) to the building. These adjustments are enough to fine-tune the model without altering the entire foundation.

### 7. Use in Transformers
LoRA is particularly effective in transformer models, as the attention layers involve large matrix multiplications. By applying LoRA only to these layers, fine-tuning becomes much more efficient.

### Summary
1. **Start** with a large pre-trained model with frozen weights.
2. **Inject** small matrices \( A \) and \( B \) into specific layers.
3. **Fine-tune** only the small matrices while keeping the original model frozen.
4. **Use** the adapted model for your new task with minimal additional parameters.

LoRA allows you to fine-tune large models efficiently while saving both time and memory.


# Additional Learnings

### Batch Normalization

Source: https://www.youtube.com/watch?v=DtEq44FTPM4&ab_channel=CodeEmporium

Why Batch Normalization?

1. Increases training speed - It smooths the optimization landscape significantly. The smoothness induces more predictable and stable behavior of the gradients, allowing for faster training.

2. Allows sub-optimal starts - It makes the initial weights less important. Without BatchNorm, the optimization landscape can be wide, and if we start at a point far from the minimum, training may take many more iterations to reach it. With BatchNorm, the landscape is expected to be smoother and more contained, so any starting weights are a similar number of steps away from the minimum.

3. Acts as a regularizer (a little) - Think of regularization in neural networks in terms of dropout, which introduces randomness into the learning process. Batch normalization also induces some regularization, because each mini-batch is normalized with its own, slightly noisy, mean and variance.
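The notes above cover why batch normalization helps, not what it computes. As a minimal sketch of the standard formulation for a single feature: subtract the mini-batch mean, divide by the mini-batch standard deviation, then apply a learnable scale and shift. The names `gamma`, `beta` and `eps` follow common convention and are assumptions here.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

out = batch_norm([2.0, 4.0, 6.0, 8.0])
out_mean = sum(out) / len(out)
out_var = sum((x - out_mean) ** 2 for x in out) / len(out)
print(out_mean, out_var)  # close to 0 and close to 1
```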

### Mini-batch

The number of training samples processed before the weights are updated.

### FP32 - How it looks in memory?

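Out of 32 bits (IEEE 754 single precision),

1. 1 bit is used for the sign (+/-)
2. 8 bits are used for the exponent (stored with a bias of 127)
3. 23 bits are used for the mantissa (the fractional part of the significand)

The number is stored in binary scientific notation, not as separate parts before and after the decimal point. Hence, for the number 7.32 = 1.83 × 2^2,

the sign bit stores +, the 8 exponent bits store 2 + 127 = 129, and the 23 mantissa bits store the fraction 0.83 (in binary, rounded to 23 bits).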
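The actual bit fields can be inspected with Python's standard `struct` module. In the IEEE 754 single-precision layout, the 32 bits split into 1 sign bit, 8 exponent bits (biased by 127) and 23 fraction bits. The helper name `fp32_fields` is an illustrative assumption.

```python
import struct

def fp32_fields(value):
    """Split the FP32 encoding of value into (sign, exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF      # 23 bits
    return sign, exponent, fraction

sign, exponent, fraction = fp32_fields(7.32)
print(sign, exponent, fraction)

# Rebuild the value from its fields: (-1)^sign * (1 + fraction / 2^23) * 2^(exponent - 127)
reconstructed = (-1) ** sign * (1 + fraction / 2**23) * 2 ** (exponent - 127)
print(reconstructed)  # very close to 7.32, but not exactly equal
```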