<a href="https://colab.research.google.com/github/dxvsh/LearningPytorch/blob/main/Week4/ContPreTrain-Peft-quantize.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="color:Tomato;"> Continual Pretraining of Llama 3.2 1B</h1>

In this notebook, we will continually pre-train the Llama 3.2 1B model on the Tamil subset of the Sangraha dataset from AI4Bharat.

We are going to use an L4 GPU node with 48 GB of memory in total (2 GPU instances with 24 GB each).

Training all of the model's parameters requires a significant amount of memory.

Therefore, we will train the model using **LoRA**, one of several parameter-efficient fine-tuning techniques.

According to the docs:
> **LoRA** (Low Rank Adaptation for LLMs) : is low-rank decomposition method to reduce the number of trainable parameters which speeds up finetuning large models and uses less memory.

In [1]:
!pip install datasets > /dev/null

In [2]:
from pprint import pprint
import math
import wandb


import datasets
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import DataCollatorForLanguageModeling
from transformers import TrainingArguments, Trainer

```python
wandb.init(
    project="DLP-W4-CPT-Node-1",
    config={
        "batch_size":4,
        "dataset": "Sangraha",
    },
)
```

<h1 style="color:Tomato;"> Load the dataset </h1>

Let's load a small portion of the Sangraha dataset from AI4Bharat.

In [4]:
ds = load_dataset('ai4bharat/sangraha',data_files="https://huggingface.co/datasets/ai4bharat/sangraha/resolve/main/verified/tam/data-0.parquet")['train']
print(ds)

Dataset({
    features: ['doc_id', 'text', 'type'],
    num_rows: 149796
})


In [10]:
ds[1]

{'doc_id': '6e9d2be2015727c4f1590a67b5a200854cf08771',
 'text': 'செய்முறைஃ\nபச்சரிசி மற்றும் பச்சைப்பயறை ஒன்றாக சேர்த்து ஒரு மணி நேரம் ஊற வைக்கவும். ஊறிய அரிசி, பயறுடன், தேங்காய் துருவல், காய்ந்த மிளகாய், பெருங்காயத்தூள், இஞ்சி, கொத்தமல்லி, கறிவேப்பிலை, உப்பு சேர்த்து தோசை மாவு பதத்தில் அரைத்து கொள்ளவும்.\n அடுப்பில் தோசைக்கல்லை வைத்து எண்ணெய் ஊற்றி காய்ந்ததும் அரைத்து வைத்திருக்கும் மாவை ஊற்றி சுட்டு எடுக்கவும். சுவையான பச்சை பயறு தோசை ரெசிபி ரெடி. ',
 'type': 'web'}

<h1 style="color:Tomato;"> Load Llama 3.2 1b tokenizer </h1>

Note that access to `meta-llama/Llama-3.2-1B` is currently restricted: you must have been granted access and be authenticated with Hugging Face to download it.

So don't run the cells below until you have access to the model. You need to fill out a request form; access is usually granted within a few hours.


In [None]:
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(f'Vocab size: {tokenizer.vocab_size}')
print(f'Context length: {tokenizer.model_max_length}')

Vocab size: 128000
Context length: 131072


The context length of the Llama 3.2 model is 131K tokens.

Let's restrict it to 1024 for this demo.

In [None]:
tokenizer.model_max_length = 1024
tokenizer.pad_token = tokenizer.eos_token # set the end of sentence token as the pad token

Let's compute the approximate **fertility score** of the tokenizer for the Tamil language.

The fertility score is a measure of how many tokens a word is split into.

For example if a sample sentence has 10 words and after tokenization, it gets split into 50 tokens, then it means the fertility score is 50/10 = 5.

In [None]:
example = ds[1]
num_words = len(example['text'].split())
print(f'Number of words: {num_words}')

Number of words: 47


In [None]:
input_ids = tokenizer(example['text'])['input_ids']
print(f'Number of tokens: {len(input_ids)}')

Number of tokens: 521


In [None]:
print(f'The fertility rate is: {len(input_ids)/num_words}')

The fertility rate is: 11.085106382978724


Typically, the **fertility score** of the tokenizer is quite high for **Indic languages.**

Here it is about 11, meaning that **every word is split into 11 tokens on average**. Of course, this is an estimate from a single sample; to get a reliable score, we would have to compute it over all the samples in the dataset.
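To estimate the score over many samples rather than one, the token and word counts can be accumulated before dividing. The following is a minimal sketch under stated assumptions: `fertility` and `char_tokenize` are hypothetical helpers introduced only for illustration, and the toy tokenizer simply emits one token per character.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word over `texts`."""
    total_words = 0
    total_tokens = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    return total_tokens / total_words if total_words else 0.0


# Toy stand-in tokenizer (an assumption for illustration): one token per character.
def char_tokenize(text):
    return list(text.replace(" ", ""))


sample_texts = ["ab cd", "efg"]
score = fertility(sample_texts, char_tokenize)
print(f"Fertility: {score:.2f}")
```

With the real tokenizer, one could pass something like `lambda t: tokenizer(t, add_special_tokens=False)['input_ids']` as `tokenize` and a random subset of `ds['text']` as `texts`; excluding special tokens keeps the begin-of-text token from inflating the count.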

In [None]:
print(example['text'])

செய்முறைஃ
பச்சரிசி மற்றும் பச்சைப்பயறை ஒன்றாக சேர்த்து ஒரு மணி நேரம் ஊற வைக்கவும். ஊறிய அரிசி, பயறுடன், தேங்காய் துருவல், காய்ந்த மிளகாய், பெருங்காயத்தூள், இஞ்சி, கொத்தமல்லி, கறிவேப்பிலை, உப்பு சேர்த்து தோசை மாவு பதத்தில் அரைத்து கொள்ளவும்.
 அடுப்பில் தோசைக்கல்லை வைத்து எண்ணெய் ஊற்றி காய்ந்ததும் அரைத்து வைத்திருக்கும் மாவை ஊற்றி சுட்டு எடுக்கவும். சுவையான பச்சை பயறு தோசை ரெசிபி ரெடி. 


In [None]:
tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(tokens[0:47])
tokenizer.decode(input_ids)

['<|begin_of_text|>', 'à®', 'ļ', 'à¯', 'Ĩ', 'à®', '¯', 'à¯įà®', '®', 'à¯ģ', 'à®', '±', 'à¯', 'Ī', 'à®', 'ĥ', 'Ċ', 'à®', 'ª', 'à®', 'ļ', 'à¯įà®', 'ļ', 'à®', '°', 'à®¿à®', 'ļ', 'à®¿', 'Ġà®', '®', 'à®', '±', 'à¯įà®', '±', 'à¯ģ', 'à®', '®', 'à¯į', 'Ġà®', 'ª', 'à®', 'ļ', 'à¯įà®', 'ļ', 'à¯', 'Ī', 'à®']


'<|begin_of_text|>செய்முறைஃ\nபச்சரிசி மற்றும் பச்சைப்பயறை ஒன்றாக சேர்த்து ஒரு மணி நேரம் ஊற வைக்கவும். ஊறிய அரிசி, பயறுடன், தேங்காய் துருவல், காய்ந்த மிளகாய், பெருங்காயத்தூள், இஞ்சி, கொத்தமல்லி, கறிவேப்பிலை, உப்பு சேர்த்து தோசை மாவு பதத்தில் அரைத்து கொள்ளவும்.\n அடுப்பில் தோசைக்கல்லை வைத்து எண்ணெய் ஊற்றி காய்ந்ததும் அரைத்து வைத்திருக்கும் மாவை ஊற்றி சுட்டு எடுக்கவும். சுவையான பச்சை பயறு தோசை ரெசிபி ரெடி. '

Anyway, let us go with this!

Let's tokenize the samples in the dataset and remove all the unnecessary columns, because we only need the `input_ids` and the `attention_mask` to feed to the model.

In [None]:
def tokenize(example):
    example = tokenizer(example['text'],padding=False,truncation=True)
    return example

In [None]:
tokenized_ds = ds.map(tokenize,batched=True,num_proc=12, remove_columns=['doc_id', 'text', 'type'])
print(tokenized_ds)


Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 149796
})


The dataset is tokenized successfully. And we now have the `input_ids` and the `attention_mask`.

Now just like we did in the previous notebooks, we concatenate all the `input_ids` and chunk them so that each of them becomes 1024 in length.

<h1 style="color:Tomato;"> Packing Sequence </h1>

In [None]:
def concatenate_and_chunk(examples):
    pass
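
The body of `concatenate_and_chunk` is left empty above. One common way to pack sequences is sketched below, purely as an assumption about what such a function might do: it expects a batch that is already tokenized (columns `input_ids` and `attention_mask`), concatenates each column, and cuts it into blocks of `block_size` tokens, dropping the incomplete tail. The name `concatenate_and_chunk_sketch` and the `block_size` parameter are hypothetical.

```python
from itertools import chain


def concatenate_and_chunk_sketch(examples, block_size=1024):
    """Pack a batch of tokenized examples into fixed-length blocks.

    `examples` maps column names (e.g. 'input_ids', 'attention_mask')
    to lists of per-sample lists, as produced by a batched tokenizer call.
    """
    # Flatten each column into one long list.
    concatenated = {key: list(chain.from_iterable(values)) for key, values in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block (a simplifying assumption).
    total_length = (total_length // block_size) * block_size
    return {
        key: [seq[i : i + block_size] for i in range(0, total_length, block_size)]
        for key, seq in concatenated.items()
    }


batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
packed = concatenate_and_chunk_sketch(batch, block_size=4)
print(packed["input_ids"])
```

Because the number of output rows differs from the number of input rows, a function like this has to be used with `batched=True` and with the original columns removed, as in the `map` call that follows.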


In [None]:
ds_chunked = ds.map(concatenate_and_chunk,
                    batch_size=1000,
                    batched=True,
                    num_proc=12,
                    remove_columns=['doc_id', 'text', 'type']
                   )

In [None]:
ds_chunked.save_to_disk('tamil_ds')

In [None]:
ds_chunked = load_from_disk('tamil_ds')

In [None]:
print(ds_chunked)

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 483683
})


Observe that after concatenating and packing the samples, the number of rows in the dataset has gone up from about 150K to about 484K.

Now, split the chunked dataset into train and test splits:

In [None]:
ds_split = ds_chunked.train_test_split(test_size=0.001,seed=42)
print(ds_split)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 483199
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 484
    })
})


<h1 style="color:Tomato;"> Data Collator </h1>

Our objective here is causal language modelling (which essentially means next-token prediction), so we turn off the masked language modelling objective by setting `mlm=False`.
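To make the causal objective concrete, here is a rough pure-Python sketch of the kind of batch a causal-LM collator builds: inputs padded to a common length, and labels that copy the inputs with pad positions masked out. The helper `make_clm_batch` is hypothetical and only approximates what `DataCollatorForLanguageModeling` does with `mlm=False`; the one-position shift between inputs and targets is assumed to happen inside the model's loss computation.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def make_clm_batch(sequences, pad_token_id):
    """Pad `sequences` to equal length and build labels for next-token prediction."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        ids = seq + [pad_token_id] * n_pad
        input_ids.append(ids)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs; every pad-token position is masked out of the loss.
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in ids])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = make_clm_batch([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["labels"])
```

Note that this sketch masks by token id. Since this notebook reuses the end-of-sequence token as the pad token, a collator that works this way would also exclude genuine end-of-sequence tokens from the loss.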

In [None]:
# dataloader
data_collator = DataCollatorForLanguageModeling(tokenizer,mlm=False)

<h1 style="color:Tomato;"> Loading Llama 3.2 1B Model </h1>

Llama 3.2 is a gated model, and you need permission to access the model weights. <br>
(**Note:** You will typically receive access an hour or two after submitting the form from your Hugging Face account.)

You can find details about the model, such as its architecture, performance, and more,  [here](https://huggingface.co/meta-llama/Llama-3.2-1B)

Here are some important details about the Llama 3.2 1B model: <br>
* Number of parameters:  **1 Billion** (actually, 1.23 Billion)
* Context length: 128 K (131,072 tokens)
* Vocabulary size: 128 K
* Input modalities: Multilingual
* Token count: **9T**
* Knowledge cutoff: Dec 2023
* GPU clusters: **916K** GPU Hours on H100 80GB

Since we are continuing the pre-training process, we load the model with a CausalLM head.

This will load the pretrained weights of the model:


In [None]:
model = AutoModelForCausalLM.from_pretrained(model_id,pad_token_id=tokenizer.eos_token_id)
# pad_token needed in general, otherwise raises an error (makes sense!)

Let's look at the configuration and architecture details of the model:

In [None]:
configuration = model.config
print(configuration)

LlamaConfig {
  "_name_or_path": "meta-llama/Llama-3.2-1B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "pad_token_id": 128001,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.44.1",
  "use_cache": true,
  "vocab_size": 128256
}



In [None]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): Lla

It uses decoder layers (like GPT).

However, instead of learned absolute position embeddings, the model uses Rotary Position Embeddings (RoPE), which are applied to the query and key vectors inside each attention layer.

In [None]:
num_parameters = 0
for param in model.parameters():
    num_parameters += param.numel()
print(f'Number of Parameters:{num_parameters/10**9:.2f} B')

Number of Parameters:1.24 B


So, the total number of parameters is 1.24B.

Let's calculate the memory requirement for this model:

In [None]:
print(model.dtype) # 4 bytes

torch.float32


The datatype is float32, which uses 4 bytes of memory per parameter.

Therefore, the memory requirement is simply the number of parameters multiplied by the number of bytes used to store each parameter. <br> (**Note:** We also need to store a small number of additional tensors, such as buffers, that do not require gradients.)

In [None]:
mem_in_gb = (num_parameters*4)/1e9  # divide by 10^9 to get the size in GB
print(mem_in_gb)

4.9432576


We can also get the info directly using a built-in function:

In [None]:
print(model.get_memory_footprint()/1e9)

4.943259776


* The model needs about 5 GB of memory.
* However, during training, **additional memory** is needed for storing gradients and optimizer states; how much depends on the type of optimizer used.
* Additionally, GPU kernels consume some memory, typically between 2 and 4 GB, depending on the type of GPU.

In [None]:
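param_model = (num_parameters*4)/1e9 # model weights in GB (float32)
adam_opt = 3*param_model # gradients plus Adam's first and second moments
kernel = 1 # rough allowance for GPU kernels, in GB
bs = 1 # batch size
print(f'Total Memory requirement per sample: {(param_model+adam_opt+kernel)*bs} GB')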

Total Memory requirement per sample: 20.7730304 GB


So, we need **at least 21 GB of memory** to train the model with a **batch size of 1**.

Therefore, for this demonstration, we will use a single L4 node with 2 GPU instances, each having 24 GB of GPU memory.

How do we increase the batch size per GPU device from 1 to at least 2?

Of course, we can use a gradient accumulation strategy; however, this will increase the training time.

The answer is: **PEFT** adapters (optionally combined with quantization).

<h1 style="color:Tomato;"> PEFT: LoRA</h1>

Before we continue with the pre-training, let's see how the model generates a coherent text based on the given prompt:

In [None]:
prompt = "I was reading Feynman's lecture on physics. He talks about "
inputs = tokenizer(prompt,return_tensors='pt',padding=True)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=10, top_p=0.95)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['I was reading Feynman\'s lecture on physics. He talks about 2 different ways to look at the world, the "old" way and the "new" way. In the "old" way, we look at the world and see that the universe is governed by laws. In the "new" way,']

As we can see, the model generates a fairly coherent response in English.

But let's try the same thing for Tamil and see how good the text generation is:

In [None]:
prompt = "இன்று இடியுடன் கூடிய கண மழை பெய்யும் என சென்னை வானிலை "
inputs = tokenizer(prompt,return_tensors='pt',padding=True)
# set max tokens to a little higher as the words are split into 11 tokens on average
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=10, top_p=0.95)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['இன்று இடியுடன் கூடிய கண மழை பெய்யும் என சென்னை வானிலை \xa0கண்ணாடியில் பெய்யும் வரை இருக்கிறது. இதையால் புறநகர்ப் பகுதிகள் அதிக விளை']

Although the model does output Tamil, **the output text is not very coherent**. This is likely because the pretrained model saw relatively little Tamil data, so its grasp of the language is quite poor.

<br>

To make it better, let's continue the pre-training of the model on the Tamil subset of the Sangraha dataset using **LoRA** (a parameter-efficient fine-tuning technique).

> **LoRA** is low-rank decomposition method to reduce the number of trainable parameters which speeds up finetuning large models and uses less memory

<img src="https://raw.githubusercontent.com/Arunprakash-A/Modern-NLP-with-Hugging-Face/refs/heads/main/Notebooks/images/lora_1.png" width="400" height="360">

Thanks to **LoRA**, we don't need to update all the parameters of the model during fine-tuning. We only need to update a very small number of parameters, which makes the fine-tuning process much more compute-efficient.

Here, the **blue** side shows the frozen pre-trained weights of the model. We don't update them any further.

The **orange** side shows the weights that will be updated. They form a low-rank approximation of the weight update, which reduces the number of parameters that need to be trained. Only these parameters are updated during fine-tuning.

So when we fine-tune for a downstream task, we only need to update this much smaller set of weights instead of all of the billions of weights in the model!


**Forward Pass:** <br>
$$h=Wx+\Delta Wx = Wx+BAx \quad \text{where} \quad W \in \mathbb{R}^{d \times k}, B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}$$
* $W$ is a pre-trained weight matrix  (during training, $W$ is frozen and **does not** receive gradient updates)
* $A$ is initialized **randomly** (say, **Gaussian**)
* $B$ is initialized to **zero**
* $r$ is the rank for the low rank approximation
* $\Delta W x$ is scaled by $\frac{\alpha}{r}$, where $\alpha$ is a constant (`lora_alpha` in the config below)

Note that the product $BA$ has the same dimensions as the original pre-trained weight matrix $W$. During fine-tuning, you only need to update the parameters in $A$ and $B$ (not the parameters in $W$).

Also note that as the rank increases, the low-rank update can approximate a full-rank update more closely, but at the cost of more parameters to fine-tune. So if you increase `r`, the number of trainable parameters increases.
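The forward pass can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how PEFT implements it: the helpers `matvec` and `lora_forward` are hypothetical, the sizes are tiny, and the $\frac{\alpha}{r}$ scaling is applied on every call. It shows that with $B$ initialized to zero, the adapted layer initially produces exactly the same output as the frozen layer.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Compute h = W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, k, r, alpha = 4, 3, 2, 4  # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]  # frozen weights
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]  # random init
B = [[0.0] * r for _ in range(d)]  # zero init
x = [1.0, 2.0, 3.0]

h = lora_forward(W, A, B, x, alpha, r)
print(h == matvec(W, x))
```

Only `A` and `B` would receive gradient updates during training; `W` stays fixed, which is why the trainable parameter count scales with `r` rather than with the size of `W`.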

**Benefits of LoRA**: <br>
* We can switch between tasks by swapping only the LoRA weights instead of all the parameters.
* This allows for the creation of many customized models that can be swapped in and out on the fly on machines that store the pre-trained weights in VRAM.

**Paper**: https://arxiv.org/pdf/2106.09685 <br>
**HF Doc**: https://huggingface.co/docs/peft/v0.7.1/en/index

Initialize the **LoRA** config:

In [None]:
from peft import LoraConfig, TaskType, LoraModel
lora_config = LoraConfig(
    r=16, # rank
    target_modules=["q_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM, # this is a CLM task
    inference_mode=False,
    lora_alpha=32,
    lora_dropout=0.05
)

In general, we can add adapters to other supported modules as well (`nn.Linear`, `Conv1D`, ...). For example,
```
target_modules = ['q_proj','k_proj','v_proj','o_proj','gate_proj','down_proj','up_proj','lm_head']
```

Create a **LoRA** model and see the reduction in the number of trainable parameters:

In [None]:
from peft import get_peft_model
lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()

trainable params: 1,703,936 || all params: 1,237,518,336 || trainable%: 0.1377


This is a huge reduction! The original model has **1.2B** parameters, but thanks to **LoRA** we don't need to update all of them to adapt the model to the Tamil data. We only need to update **1.7M** parameters.

That is a drastic change: we train only about 0.14% of the total parameters.
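The trainable parameter count can be reproduced by hand from the layer shapes. The sketch below assumes what the config and model printouts show: rank 16, adapters on `q_proj` (2048 to 2048) and `v_proj` (2048 to 512), and 16 decoder layers. The helper `lora_params` is hypothetical.

```python
def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


r = 16
num_layers = 16
per_layer = lora_params(2048, 2048, r) + lora_params(2048, 512, r)  # q_proj + v_proj
trainable = per_layer * num_layers
total = 1_237_518_336  # "all params" as reported above
print(f"{trainable:,} trainable ({100 * trainable / total:.4f}%)")
```

Each layer contributes 65,536 parameters for `q_proj` and 40,960 for `v_proj`, which over 16 layers gives the 1,703,936 reported by `print_trainable_parameters()`.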

In [None]:
print(lora_model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 2048, padding_idx=128001)
        (layers): ModuleList(
          (0-15): 16 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (

From the above, observe that:

* LoRA is only applied to the two projection layers: "q_proj" and "v_proj".
* For "q_proj": $A \in \mathbb{R}^{16 \times 2048}$ and $B \in \mathbb{R}^{2048 \times 16}$, where the rank is 16 (following the $W \in \mathbb{R}^{d \times k}$ convention from the forward pass, with $d$ output features and $k$ input features).
* The product $BA$ therefore has dimensions $2048 \times 2048$, which matches the weight matrix of "q_proj".


------

* For "v_proj": $A \in \mathbb{R}^{16 \times 2048}$ and $B \in \mathbb{R}^{512 \times 16}$, where the rank is 16.
* The product $BA$ therefore has dimensions $512 \times 2048$, which matches the weight matrix of "v_proj" (512 output features, 2048 input features).

Once again, let us quickly verify the number of learnable parameters for our own satisfaction

In [None]:
num_parameters = 0
# note: `model` was modified in place by get_peft_model, so this count includes the LoRA weights
for param in model.parameters():
    num_parameters += param.numel()
print(f'Number of Parameters of original model:{num_parameters} ')

Number of Parameters of original model:1237518336 


In [None]:
num_parameters_lora = 0
for param in lora_model.parameters():
    if param.requires_grad: # only count the parameters which need to be updated
        num_parameters_lora += param.numel()
print(f'Number of Parameters of LoRA model:{num_parameters_lora} ')


Number of Parameters of LoRA model:1703936 


The math checks out: we get the same number of trainable parameters as reported by the built-in function above.

Note, however, that we **still need to keep the entire model in GPU memory**, which requires about 5 GB. Additionally, we need to store the activation values of all layers, which consumes a significant amount of memory.
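To see where the savings come from, the same rough accounting used in the earlier memory estimate (gradients plus two Adam moments, all in float32) can be applied to the trainable parameters only. This is a back-of-the-envelope sketch: the helper `training_state_gb` is hypothetical, and activations and the frozen weights themselves are deliberately left out.

```python
BYTES_PER_PARAM = 4  # float32


def training_state_gb(num_trainable):
    """Gradients plus two Adam moments: three float32 values per trainable parameter."""
    return 3 * num_trainable * BYTES_PER_PARAM / 1e9


full = training_state_gb(1_235_814_400)  # all base-model parameters trainable
lora = training_state_gb(1_703_936)      # only the LoRA adapters trainable
print(f"full fine-tuning: {full:.2f} GB, LoRA: {lora:.3f} GB")
```

Under these assumptions the gradient and optimizer state shrinks from roughly 15 GB to a few tens of megabytes, so the remaining memory cost is dominated by the frozen weights and the activations.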

Before proceeding further, let's check a few things:

In [None]:
lora_model.peft_config

{'default': LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='meta-llama/Llama-3.2-1B', revision=None, task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, inference_mode=False, r=16, target_modules={'v_proj', 'q_proj'}, lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False))}

<h1 style="color:Tomato;">  Training </h1>

Set up the training arguments:

In [None]:
training_args = TrainingArguments( output_dir='lora_llama_1b_ct',
                                  eval_strategy="steps",
                                  eval_steps=100,
                                  num_train_epochs=1,
                                  per_device_train_batch_size=2, # set the batch size to 2
                                  per_device_eval_batch_size=2,
                                  bf16=False,
                                  fp16=True,
                                  tf32=False,
                                  gradient_accumulation_steps=1,
                                  adam_beta1=0.9,
                                  adam_beta2=0.999,
                                  learning_rate=2e-5,
                                  weight_decay=0.01,
                                  logging_dir='logs',
                                  logging_strategy="steps",
                                  logging_steps = 100,
                                  save_steps=100,
                                  save_total_limit=20,
                                  report_to='none',
                                )

Set up the model trainer:

In [None]:
trainer = Trainer(model=lora_model, # note that we're only passing the lora_model for training, so only the lora adapter layers will get trained (and not all the params)
                  args = training_args,
                 train_dataset=ds_split["train"],
                 eval_dataset=ds_split["test"],
                 data_collator = data_collator)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


```python
results = trainer.train()
```

<img src="https://raw.githubusercontent.com/Arunprakash-A/Modern-NLP-with-Hugging-Face/refs/heads/main/Notebooks/images/Lora_train_loss.png" width="600" height="200">

<img src="https://raw.githubusercontent.com/Arunprakash-A/Modern-NLP-with-Hugging-Face/refs/heads/main/Notebooks/images/Lora_validation_loss.png" width="600" height="200">

The model used approximately 10 GB of GPU memory with a batch size of 1, which is less than half of what training the original model requires (over 22 GB).

So we're getting quite a good improvement using LoRA!

**WARNING**
* We loaded the original model weights and stored them in a variable `model`
* We then applied LoRA and stored the resulting model in the variable `lora_model`
* By design, **`model` is modified in place** (presumably to save memory)
* This is not an issue when running a script; however, it can create problems in notebooks if we use `model` again after creating `lora_model`.

Let's load the model checkpoint (after processing 92,400 samples, or about 94 million tokens):

In [None]:
model_cpt = AutoModelForCausalLM.from_pretrained('checkpoint-15400/')

In [None]:
prompt = "இன்று இடியுடன் கூடிய கண மழை பெய்யும் என சென்னை வானில"
inputs = tokenizer(prompt,return_tensors='pt',padding=True)
# set max tokens to a little higher as the words are split into 11 tokens on average
outputs = model_cpt.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=10, top_p=0.95)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['இன்று இடியுடன் கூடிய கண மழை பெய்யும் என சென்னை வானிலை நிறுவனம் கூறுகிறது. பெரும்பாலான இடங்களில் கண மழை பெய்யும் காரணம் இது.']

After fine-tuning, the text generation has improved and the output is at least somewhat coherent (compared to the last time).

And as we train on more samples and let the training run for more epochs, the generative capabilities on Tamil should get better.

* As usual, we can store the model using the `.save_pretrained` method and load the PEFT model back with `PeftModel.from_pretrained`
* By default, the `PeftModel` is loaded for **inference**, but if you’d like to train the adapter further you can set `is_trainable=True`.
```python
from peft import PeftModel
lora_model = PeftModel.from_pretrained(model, "path/to/model", is_trainable=True)
```

What happens to the model's ability to complete a given prompt coherently after continual pre-training on a potentially domain-specific dataset? In this case, we've continued training the model on a Tamil dataset. **Does it retain its earlier world knowledge**, and can it complete prompts just as it did before? Although we haven't trained the model on a larger dataset, we hope that it still retains its general knowledge. Let's see:

In [None]:
prompt = "I was reading Feynman's lecture on physics. He talks about "
inputs = tokenizer(prompt,return_tensors='pt',padding=True)
outputs = model_cpt.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=10, top_p=0.95)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

The following is the response we got: <br>
["I was reading Feynman's lecture on physics. He talks about 2 different approaches to solving problems. One is to use the mathematics and the other is to use the physics. I have to say I think the physics approach is more useful, but I think the mathematics approach is also useful.\nI think it is useful"]

We can see that, at least for this prompt, fine-tuning the model on a domain-specific (Tamil, in this case) dataset hasn't visibly disturbed the model's world knowledge. It can still complete the prompt as it did before.

<h1 style="color:Tomato;">Quantization</h1>

From the docs:
>**Quantization** represents data with fewer bits, making it a useful technique for reducing memory-usage and accelerating inference especially when it comes to large language models (LLMs).

We can further reduce the memory requirement (10 GB with LoRA) by quantizing the model parameters and adding adapters to the quantized model during training.
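As a toy illustration of the idea of representing values with fewer bits, the sketch below quantizes a list of floats to 8-bit integers using a single absolute-maximum scale, then maps them back. This is a deliberately simplified scheme and an assumption for illustration only; it is not the algorithm that bitsandbytes uses for `load_in_8bit=True`, which is more elaborate. The helpers `quantize_absmax` and `dequantize` are hypothetical.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
print(q)
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Each value now needs one byte instead of four, at the cost of a rounding error of at most half the scale per value.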

In [None]:
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

Load the model parameters in **8-bit precision**:

In [None]:
model_8bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config,
                                                  pad_token_id=tokenizer.eos_token_id,
                                                  device_map="auto")

In [None]:
print(model_8bit)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048, padding_idx=128001)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear8bitLt(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear8bitLt(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear8bitLt(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear8bitLt(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear8bitLt(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): 

Let's add adapters to fine-tune the quantized model like before:

In [None]:
from peft import LoraConfig, TaskType,LoraModel
lora_config = LoraConfig(
    r=16,
    target_modules=["q_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    lora_alpha=32,
    lora_dropout=0.05
)

In [None]:
from peft import get_peft_model
lora_model = get_peft_model(model_8bit, lora_config)
lora_model.print_trainable_parameters()

trainable params: 1,703,936 || all params: 1,237,518,336 || trainable%: 0.1377


In [None]:
print(lora_model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 2048, padding_idx=128001)
        (layers): ModuleList(
          (0-15): 16 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear8bitLt(
                (base_layer): Linear8bitLt(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
   

Now we can train the model using the `Trainer` API!

* This approach requires approximately **6 GB** of GPU memory (with a batch size of 1)
* In contrast, LoRA without quantization requires about **10 GB** of GPU memory (also with a batch size of 1)

With the help of quantization, our memory usage has dropped further, from 10 GB to 6 GB.

Finally, we were able to continue the pre-training of Llama 3.2 1B with a batch size of 16 on the L4 GPU node (thanks to LoRA and quantization).

Next, we will explore how to perform task-specific fine-tuning of the pre-trained model.