# Fine-tuning LLMs


Information source: https://www.youtube.com/watch?v=iOdFUJiB0Zc&ab_channel=freeCodeCamp.org

## Quantization

Simple definition: Conversion from higher memory format to a lower memory format.

**Explanation**

Any LLM is a neural network model, and a neural network has parameters. This is what is meant when we hear about a 7-billion-parameter or a 3-billion-parameter model.


What are these parameters?
These parameters are the learned weights and biases in the network. They are typically stored as FP32 (32-bit floating point), so each parameter value takes 32 bits (4 bytes).

Now, suppose we want to fine-tune one of these bigger LLMs.
* Loading the full-precision model into the RAM of a personal computer, edge device, or mobile device, or into the VRAM of a consumer GPU, is often not possible because the model is too large. For example, 7 billion FP32 parameters need about 28 GB for the weights alone.
* We can try loading the model on a cloud machine with more capacity, but cloud services with that much memory also cost more.

How can we reduce the size of the model?
We can convert from a high bit format like FP32 to a smaller size format, like FP16 (half precision) or 8 bit representations. 
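
As a rough illustration of how much this saves, the sketch below estimates the memory needed for the weights alone of a 7-billion-parameter model in each format. The function name `model_size_gb` is made up for this example, and the estimate ignores activations, optimizer state, and any per-format overhead.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}


def model_size_gb(num_params: int, fmt: str) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


# Hypothetical 7-billion-parameter model.
for fmt in BYTES_PER_PARAM:
    size = model_size_gb(7_000_000_000, fmt)
    print(f"7B parameters in {fmt}: {size:.0f} GB")
```

Under these assumptions the weights take about 28 GB in FP32, 14 GB in FP16 and 7 GB in an 8-bit format.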

### What is quantization?

Quantization converts a bigger model into a smaller one by storing its parameters in a lower-precision format, so that it can be used for purposes like faster inference, edge device computing, mobile computing, fine-tuning etc.

### Disadvantage

When we quantize, since we decrease the precision at which we store the model parameters, there is loss of information and hence loss of accuracy.
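
The loss of precision can be seen on a single value. The sketch below is only an illustration: it round-trips one made-up weight through FP32 and FP16 using Python's `struct` module and compares the errors. The helper names and the weight value are assumptions for this example.

```python
import struct


def to_fp32_and_back(x: float) -> float:
    """Round-trip a Python float through IEEE 754 single precision (FP32)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fp16_and_back(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision (FP16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


weight = 0.123456789  # hypothetical weight value
fp32_error = abs(to_fp32_and_back(weight) - weight)
fp16_error = abs(to_fp16_and_back(weight) - weight)

print(f"FP32 rounding error: {fp32_error:.2e}")
print(f"FP16 rounding error: {fp16_error:.2e}")
```

The FP16 error is several orders of magnitude larger than the FP32 error for this value. Small errors like this, spread over billions of parameters, are what can reduce model accuracy.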

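## Calibration

Calibration is the step in which quantization is actually carried out: the range of the original values is measured and used to compute the parameters that map them to the lower-precision format.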


**Types of quantization**

1. Symmetric Quantization
2. Asymmetric Quantization

**Symmetric Quantization**

Suppose we have a list of numbers in the range 0 to 1000, and these numbers are evenly distributed. Assume these numbers are stored in FP32 and we want to store them as uint8 (unsigned 8-bit integer).

So,

* original range: (0, 1000)
* quantized range: (0, 255), since the highest number 8 bits can store is 2^8 - 1 = 255

scale factor = (1000 - 0) / (255 - 0) ≈ 3.92

Now, any number in the original list, when divided by 3.92 and rounded, gives its uint8 representation.

For example, 250 becomes round(250 / 3.92) = round(63.775...) = 64.
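
The same calculation can be sketched in code. This is a minimal illustration of the scheme described in these notes, not a library implementation. The function names are made up, and the clamp to 0..255 is an added safety step for values outside the original range.

```python
X_MIN, X_MAX = 0.0, 1000.0  # range of the original FP32 values
Q_MIN, Q_MAX = 0, 255       # range of uint8

# About 3.92. The unrounded value is used in the calculations below.
scale = (X_MAX - X_MIN) / (Q_MAX - Q_MIN)


def quantize(x: float, scale: float) -> int:
    """Divide by the scale, round to the nearest integer, clamp to uint8."""
    q = round(x / scale)
    return max(Q_MIN, min(Q_MAX, q))


def dequantize(q: int, scale: float) -> float:
    """Map a uint8 value back to an approximate original value."""
    return q * scale


q = quantize(250.0, scale)
print(q, dequantize(q, scale))
```

Here 250 is stored as 64 and comes back as about 251 when dequantized. The difference is the information lost by quantizing.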

**Asymmetric Quantization**

What if the data is not symmetrically distributed, for example if it is right-skewed and its range includes negative values?

So,

* original range: (-20, 1000)
* quantized range: (0, 255)

scale factor = (1000 - (-20))/255 = 1020 / 255 = 4.0

Now, if we use this to scale -20, we get

-20 / 4.0 = -5

But -5 is not in the range (0, 255). So we add a zero point of +5 to every scaled value: -20 then maps to -5 + 5 = 0, and 1000 maps to 250 + 5 = 255.



Hence, for quantization we have 2 parameters. They are scale and zero point.
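
A minimal sketch of quantization with both parameters, using the (-20, 1000) example from these notes. The function names are illustrative and not taken from any library. The zero point is computed so that the smallest original value lands on 0.

```python
def quant_params(x_min: float, x_max: float, q_min: int = 0, q_max: int = 255):
    """Compute scale and zero point from the original and target ranges."""
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = q_min - round(x_min / scale)
    return scale, zero_point


def quantize_asymmetric(x: float, scale: float, zero_point: int,
                        q_min: int = 0, q_max: int = 255) -> int:
    """Scale, round, shift by the zero point, then clamp to the target range."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize_asymmetric(q: int, scale: float, zero_point: int) -> float:
    """Undo the shift and the scaling to get an approximate original value."""
    return (q - zero_point) * scale


scale, zero_point = quant_params(-20.0, 1000.0)
print(scale, zero_point)
for x in (-20.0, 0.0, 1000.0):
    print(x, "->", quantize_asymmetric(x, scale, zero_point))
```

With these inputs the scale is 4.0 and the zero point is 5, so -20 maps to 0, 1000 maps to 255, and the real value 0 maps to 5, which is where the name "zero point" comes from.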

## Modes of Quantization

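1. Post-training quantization - an already trained model is quantized as-is, with no further training.
2. Quantization-aware training - quantization is simulated while the model is trained or fine-tuned, so the weights can adapt to the lower precision.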

# Additional Learnings

### Batch Normalization

Source: https://www.youtube.com/watch?v=DtEq44FTPM4&ab_channel=CodeEmporium

Why Batch Normalization?

1. Increases training speed - It smooths the optimization landscape significantly. This smoothness makes the gradients more predictable and stable, which allows for faster training.

2. Allows sub-optimal starts - It makes the initial weights less important. Without BatchNorm, the optimization landscape can be wide, and if we start at a point far from the minimum, training may take a large number of iterations to reach it. With BatchNorm, the landscape is expected to be smoother and more contained, so any starting weights are a similar number of steps away from the minimum.

3. Acts as a regularizer (a little) - A typical regularizer in a neural network is dropout, which introduces randomness into the learning process. Batch normalization also induces some regularization, because the mean and variance of each mini-batch are noisy estimates.
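
The points above describe the effects of batch normalization but not the computation itself. As a minimal sketch, assuming the standard training-time formula for a single feature, each value in the mini-batch is normalized with the batch mean and variance, then scaled and shifted by the learnable parameters `gamma` and `beta`. The function name and default values here are illustrative.

```python
import math
from statistics import fmean


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = fmean(values)
    # Biased (population) variance of the mini-batch.
    var = fmean((v - mean) ** 2 for v in values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# Hypothetical activations of one feature for a mini-batch of 4 samples.
activations = [10.0, 200.0, 3000.0, 40.0]
normalized = batch_norm(activations)
print(normalized)
```

Whatever the scale of the incoming activations, the output has a mean of about 0 and a variance of about 1 (with the default `gamma` and `beta`).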

### Mini-batch

A mini-batch is the set of training samples processed before each weight update. The number of samples in it is the batch size.
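
A small sketch of how a dataset is split into mini-batches. The helper `mini_batches` and the sample data are made up for illustration, and the weight update is only a placeholder comment.

```python
def mini_batches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(10))  # stand-in for 10 training samples
updates = 0
for batch in mini_batches(samples, batch_size=4):
    # A real training loop would compute the loss on `batch`
    # and update the weights here.
    updates += 1

print(updates)
```

With 10 samples and a batch size of 4, there are 3 weight updates per pass over the data. The last mini-batch holds only the 2 remaining samples.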

### FP32 - How does it look in memory?

Out of 32 bits (IEEE 754 single precision),

1. 1 bit is used for the sign (+/-)
2. 8 bits are used for the exponent (the power of 2, stored with a bias of 127)
3. 23 bits are used for the mantissa (the fractional part of the significand)

Hence, for the number 7.32 = 1.83 × 2^2,

the sign bit stores +, the 8 exponent bits store 2 + 127 = 129, and the 23 mantissa bits store the closest binary fraction to 0.83.
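
The bit layout of an FP32 value can be inspected with Python's `struct` module. This sketch assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits). The function name is illustrative, and the reconstruction formula holds only for normal numbers, not for zero, subnormals, infinity or NaN.

```python
import struct


def fp32_fields(x: float):
    """Return (sign, exponent field, fraction field) of x stored as FP32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # top bit
    exponent = (bits >> 23) & 0xFF  # next 8 bits, biased by 127
    fraction = bits & 0x7FFFFF      # low 23 bits
    return sign, exponent, fraction


sign, exponent, fraction = fp32_fields(7.32)

# Rebuild the stored value from its three fields (normal numbers only).
value = (-1) ** sign * (1 + fraction / 2 ** 23) * 2 ** (exponent - 127)

print(f"sign={sign} exponent={exponent} fraction={fraction:023b}")
print(value)
```

For 7.32 the sign is 0 and the exponent field is 129. The rebuilt value is close to 7.32 but not exactly equal to it, because 0.83 has no exact 23-bit binary representation.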