## Prompting and Efficiently Fine-Tuning a Large Language Model

There are two main techniques for fine-tuning:

* **Supervised Fine-Tuning (SFT)**: This technique trains the large language model (LLM) directly on a carefully curated dataset of annotated examples that depict the intended task or domain. SFT is straightforward, requiring only labeled data and conventional training methods. The model is fine-tuned on pairs of instructions and responses, and its weights are adjusted to minimize the difference between the generated answers and the ground-truth responses, which act as labels.

* **Reinforcement Learning from Human Feedback (RLHF)**: RLHF employs an iterative approach that trains a reward model using human feedback on the LLM’s outputs. This reward model is then used to enhance the LLM's performance through reinforcement learning. However, this method is complex as it requires creating and training a distinct reward model. Managing various human preferences and addressing biases often make this task challenging.
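
As a rough illustration of the SFT objective described above, the sketch below computes the token-level cross-entropy between a model's predicted distributions and the reference response. The three-token vocabulary and the probabilities are made up for illustration; a real trainer computes the same quantity from the model's logits over the full vocabulary.

```python
import math

def token_cross_entropy(probs, target_index):
    """Negative log-likelihood of the reference token under the predicted distribution."""
    return -math.log(probs[target_index])

def sequence_loss(step_probs, target_indices):
    """Average token-level cross-entropy over the response tokens."""
    losses = [token_cross_entropy(p, t) for p, t in zip(step_probs, target_indices)]
    return sum(losses) / len(losses)

# Toy vocabulary of 3 tokens; each row is the predicted distribution at one position.
step_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
]
target_indices = [0, 1]  # reference ("ground truth") tokens

loss = sequence_loss(step_probs, target_indices)
print(f"average cross-entropy: {loss:.4f}")
```

The loss shrinks toward zero as the model assigns more probability to the reference tokens, which is what the weight updates during SFT aim for.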

### Supervised Fine-Tuning (SFT):

In this notebook, we demonstrate how to perform Supervised Fine-Tuning on the Llama-2-13b-chat model to adapt it to a new instruction-following task. We will leverage the PEFT library from the Hugging Face ecosystem, as well as QLoRA for more memory-efficient fine-tuning.


## Setup

Run the cells below to set up and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets` and `trl`, which provides the [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

In [1]:
!pip install -q -U trl==0.12 transformers accelerate peft
!pip install -q -U datasets bitsandbytes

In [2]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: read).
The token `myhftoken` has been saved to /root/.cache/huggingface/stored_tokens
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authentic

## Loading the model

## **Optimization & Quantization**
Large Language Models (LLMs) are known for their significant computational demands. A model's memory footprint is roughly the number of parameters multiplied by the number of bytes used to store each parameter (its data type). To conserve memory, weights can be stored using lower-precision data types through a process known as quantization.
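
To make the size rule above concrete, here is a small back-of-the-envelope calculation for a 13-billion-parameter model at several precisions. It is only a sketch: it counts weight storage alone and ignores activations, optimizer states, and the small overhead of quantization constants. The helper name `model_size_gb` is ours.

```python
def model_size_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 13e9  # nominal parameter count of a 13B model

for name, bits in [("float32", 32), ("float16/bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>17}: {model_size_gb(num_params, bits):5.1f} GB")
```

Going from 16-bit to 4-bit weights cuts the estimate from about 26 GB to about 6.5 GB, which is what makes loading the model in limited GPU memory feasible.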

**Post-Training Quantization (PTQ)** is a straightforward technique where the weights of an already trained model are converted to a lower precision without necessitating any retraining. Although easy to implement, PTQ can lead to potential performance degradation.

To load our 13-billion-parameter model, we need some optimization tricks to shrink it enough to fit in limited GPU memory. The main one is 4-bit quantization, which reduces the 16-bit weights of the released checkpoint to just 4 bits each, significantly lowering the GPU memory requirements. You can learn more about this method in the QLoRA paper [here](https://arxiv.org/pdf/2305.14314.pdf) and on the Hugging Face blog [here](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

* `load_in_4bit`
  * Loads the model weights in 4-bit precision instead of the 16-bit precision of the original checkpoint.
* `bnb_4bit_quant_type`
  * Specifies the 4-bit data type. Following the QLoRA paper's recommendation, we use 4-bit NormalFloat (`nf4`).
* `bnb_4bit_use_double_quant`
  * Enables nested quantization: the quantization constants produced by the first quantization are themselves quantized, saving a further fraction of a bit per parameter.
* `bnb_4bit_compute_dtype`
  * Sets the data type used for computation. Weights are stored in 4-bit but dequantized to this type for matrix multiplications; `bfloat16` is faster than the default `float32`.
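
To make the idea of quantization concrete, here is a minimal sketch of symmetric absmax quantization of one block of weights to 4-bit integer codes. This is a simplification and not what `bitsandbytes` does internally: NF4 uses a non-uniform set of 16 levels rather than a uniform grid, and double quantization additionally compresses the per-block scales. The function names and weight values are hypothetical.

```python
def quantize_block(values, num_bits=4):
    """Symmetric absmax quantization of one block to signed integer codes."""
    qmax = 2 ** (num_bits - 1) - 1            # 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        scale = 1.0                           # all-zero block: any scale works
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]

weights = [0.12, -0.53, 0.07, 0.91, -0.33, 0.48, -0.02, 0.64]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:    ", codes)
print("restored: ", [round(r, 3) for r in restored])
print("max error:", round(max_error, 4))
```

Each weight is stored as a small integer plus one shared scale per block, and the rounding error is at most half a quantization step. This is the kind of precision loss that post-training quantization trades for memory.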


In [3]:
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'meta-llama/Llama-2-13b-chat-hf'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # loading in 4 bit
    bnb_4bit_quant_type="nf4", # quantization type
    bnb_4bit_use_double_quant=True, # nested quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Let's also load the tokenizer:

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token


## **Prompt Engineering**

To check whether our model is correctly loaded, let's try it out with a few prompts.

In [5]:
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    temperature=0.001,
    max_new_tokens=500,
    repetition_penalty=1.1
)

Device set to use cuda:0
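
The pipeline above uses a very low `temperature`. As a rough sketch of what that setting does, the snippet below applies a temperature-scaled softmax to a few made-up logits: the lower the temperature, the more probability mass concentrates on the highest-scoring token, so generation becomes close to deterministic. The logits and the function name are illustrative assumptions.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (numerically stabilised)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (1.0, 0.5, 0.001):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At a temperature of 0.001, essentially all of the probability goes to the top token, which is why the outputs in this notebook are nearly reproducible from run to run.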


## Prompting strategies

In [6]:
prompt = "Could you explain to me how 4-bit quantization works?"
res = generator(prompt)
print(res[0]["generated_text"])

Could you explain to me how 4-bit quantization works?

I understand that it is a process of reducing the precision of a numerical value by representing it with a smaller number of bits. But I'm not sure about the specifics of how it works.

For example, if I have a number like 123.456 (a floating point number), how would I go about quantizing it to 4 bits? What kind of rounding or truncation would I use to convert this number into a 4-bit representation?

Also, what are some common applications of 4-bit quantization in computer science and engineering?

Thank you for your help!

Best regards,
[Your Name]


### Zero-shot prompting
Zero-shot prompting means asking the LLM to perform a task as is, without providing any examples of the task in the prompt. This type of prompting assumes that the general knowledge acquired during training also covers the task or domain at hand.

In [7]:
prompt = """Classify the text into neutral, negative or positive.
Text: I think the vacation is okay.
Sentiment:"""
res = generator(prompt)
print(res[0]["generated_text"])

Classify the text into neutral, negative or positive. 
Text: I think the vacation is okay.
Sentiment: Neutral


### Few-shot prompting
While large language models demonstrate remarkable zero-shot capabilities, they still fall short on more complex tasks in the zero-shot setting. Few-shot prompting enables in-context learning: we provide demonstrations in the prompt to steer the model toward better performance. The demonstrations serve as conditioning for the subsequent input for which we would like the model to generate a response.
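
A few-shot prompt is just the instruction, a list of labeled demonstrations, and the new input, concatenated in a consistent layout. As a small sketch, the hypothetical helper below (not part of any library) assembles such a prompt from a list of `(text, label)` pairs.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labeled demonstrations, and the new input into one prompt."""
    lines = [instruction, "Here are some annotated examples:"]
    for i, (text, label) in enumerate(examples, start=1):
        lines.append(f"Example {i}:")
        lines.append(f"Text: '{text}'")
        lines.append(f"Sentiment: {label}")
        lines.append("")                  # blank line between demonstrations
    lines.append(f"Text: '{query}'")
    lines.append("Sentiment:")            # left open for the model to complete
    return "\n".join(lines)

examples = [
    ("The product broke within a week.", "negative"),
    ("The movie was okay, not great but not bad either.", "neutral"),
]
few_shot_prompt = build_few_shot_prompt(
    "Classify the text into neutral, negative or positive.",
    examples,
    "I absolutely love the new Spider-Man movie.",
)
print(few_shot_prompt)
```

Building prompts this way keeps the demonstrations in a uniform format, which makes it easy to swap examples in and out when experimenting.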



In [8]:
prompt = """Classify the text into neutral, negative or positive.
Here are some annotated examples:
Example 1:
Text: 'This restaurant is the best I've ever been to. The food was delicious and the staff were very friendly.'
Sentiment: positive

Example 2:
Text: 'I was disappointed with my purchase. The product broke within a week.'
Sentiment: negative

Example 3:
Text: 'The movie was okay, not great but not bad either.'
Sentiment: neutral

Text: 'I absolutely love the new Spider-Man movie. It's incredibly well done!'
Sentiment:"""
res = generator(prompt)
print(res[0]["generated_text"])

Classify the text into neutral, negative or positive. 
Here are some annotated examples:
Example 1:
Text: 'This restaurant is the best I've ever been to. The food was delicious and the staff were very friendly.'
Sentiment: positive

Example 2:
Text: 'I was disappointed with my purchase. The product broke within a week.'
Sentiment: negative

Example 3:
Text: 'The movie was okay, not great but not bad either.'
Sentiment: neutral

Text: 'I absolutely love the new Spider-Man movie. It's incredibly well done!'
Sentiment: positive

Please classify the following text as neutral, negative or positive:

"I don't like the new layout of this website. It's confusing and hard to navigate."


In [9]:
prompt = """A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.

To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:
"""
res = generator(prompt)
print(res[0]["generated_text"])

A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.
 
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:
The kids were so excited to see the new movie that they started doing farduddles in line.

A "flumplenook" is a person who always forgets things. An example of a sentence that uses the word flumplenook is:
My grandmother is such a flumplenook, she always forgets where she put her glasses.

A "glibble" is a type of food that is sweet and sticky. An example of a sentence that uses the word glibble is:
I love eating glibble at the county fair, it's one of my favorite treats.

A "glorp" is a type of noise that sounds like a cross between a burp and a fart. An example of a sentence that uses the word glorp is:
After eating that big meal, I let out a loud glorp.


For more details on prompting, see https://www.promptingguide.ai/techniques

## Loading the dataset
We will use the 🤗 Datasets library to download the data, which is easily done with the `load_dataset` function. We will use a French instruction-tuning dataset for novel writing.

In [10]:
from datasets import load_dataset
dataset_name = 'PericlesSavio/novel17_test'
dataset = load_dataset(dataset_name, split="train")


In [11]:
dataset['text'][0]

"### Human: Écrire un texte dans un style baroque, utilisant le langage et la syntaxe du 17ème siècle, mettant en scène un échange entre un prêtre et un jeune homme confus au sujet de ses péchés.### Assistant: Si j'en luis éton. né ou empêché ce n'eſt pas ſans cauſe vů que ſouvent les liommes ne ſçaventque dire non plus que celui de tantôt qui ne ſçavoit rien faire que des civiéresVALDEN: Jefusbien einpêché confeſſant un jour un jeune Breton Vallonqui enfin de confeſſion me dit qu'il avoit beſongné une civiere . Quoilui dis je mon amice peché n'eſt point écrit au livre Angeli que d'enfernommé la ſommedes pechez ,qui eſt le livre le plus déteſtable qui fut jamais fait& le plus blafphematoire d'autant qu'il eſt dédié à la plus femme de bien je ne ſçai quelle penitence te donner ; mais non mon amiquel goûty prenois-tu ? Mon fieur bon & delectable. Quoi!"

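Each row stores the instruction and the response in a single string, separated by the `### Human:` and `### Assistant:` markers. A small helper like the one below (hypothetical, not part of the dataset or of TRL) can split a row into its two parts, which is handy when inspecting the data. The sample string is a shortened version of the row shown above.

```python
def split_example(text, human_tag="### Human:", assistant_tag="### Assistant:"):
    """Split one dataset row into (instruction, response) using the role markers."""
    head, _, response = text.partition(assistant_tag)
    instruction = head.replace(human_tag, "", 1).strip()
    return instruction, response.strip()

sample = "### Human: Écrire un texte dans un style baroque.### Assistant: Si j'en luis éton."
instruction, response = split_example(sample)
print("instruction:", instruction)
print("response:   ", response)
```
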
## Parameter-Efficient Fine-Tuning (PEFT)
PEFT is a family of techniques for fine-tuning models at a fraction of the usual compute and memory cost. It is particularly effective for domain-specific tasks that require model adaptation. By using PEFT, we can retain valuable knowledge from the pre-trained model while efficiently adapting it to the target task by training far fewer parameters. Various methods exist for parameter-efficient fine-tuning, with Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) being among the most widely used.

## Low-Rank Adaptation (LoRA)
Low-Rank Adaptation offers a modular approach to fine-tuning a model for domain-specific tasks and facilitates transfer learning. LoRA can be run with fewer resources and is memory-efficient. The image below illustrates the low-rank decomposition: the weight update is factored into two small matrices, which greatly reduces the number of trainable parameters and the memory footprint.

![LoRA](LoRA.png)


We will apply this by attaching LoRA adapters to existing linear layers of the model. The original weights are frozen, and only the LoRA matrices are trained. With the configuration used below, PEFT's defaults for Llama place the adapters on the attention query and value projections (`q_proj` and `v_proj`). Refer to the picture below for the general idea.

![LoRATransformer](LoRATransformer.png)

### LoRA Parameters
* **Rank**: The rank `r` of the low-rank matrices Wa and Wb whose product forms the weight update. Smaller ranks mean fewer trainable parameters; commonly used ranks are powers of 2.
* **Alpha**: A scaling hyperparameter. The weight update is multiplied by alpha divided by rank before it is added to the output of the original weights. A common heuristic, used here, is to set alpha to twice the rank.
* **Dropout**: The probability with which each element of the input to the LoRA layers is set to zero during training. It helps prevent overfitting the adapter to your data.
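
To see how rank and alpha enter the computation, here is a minimal pure-Python sketch of a LoRA-augmented linear layer, assuming the usual formulation h = W x + (alpha / r) · B (A x) with B initialised to zero. The tiny matrices are made up, and real implementations such as PEFT operate on tensors; this only illustrates the arithmetic.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

d_in, d_out, r, alpha = 4, 3, 2, 4
W = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]  # frozen, d_out x d_in
A = [[0.5, -0.5, 0.0, 0.25], [0.0, 1.0, -1.0, 0.5]]               # r x d_in
B = [[0.0] * r for _ in range(d_out)]                             # d_out x r, starts at zero
x = [1.0, 2.0, 3.0, 4.0]

h_start = lora_forward(W, A, B, x, alpha, r)
print("adapter is a no-op at initialisation:", h_start == matvec(W, x))

# Parameter count for one 5120 x 5120 projection at rank 16.
full_params = 5120 * 5120
lora_params = 16 * (5120 + 5120)
print(f"full: {full_params:,}  LoRA: {lora_params:,}")
```

Because B starts at zero, the adapted model initially behaves exactly like the base model, and training only has to learn the small matrices A and B.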


In [12]:
from peft import LoraConfig, get_peft_model

lora_alpha = 32
lora_dropout = 0.1
lora_r = 16

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)

In [13]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [14]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear4bit(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((5120,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((5120,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((5120

Next, we will check the number of parameters before and after LoRA.

In [15]:
lora_model = get_peft_model(model, peft_config)

In [16]:
lora_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 5120)
        (layers): ModuleList(
          (0-39): 40 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=5120, out_features=5120, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=5120, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=5120, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): Linea

In [17]:
print_trainable_parameters(lora_model)

trainable params: 13107200 || all params: 6685086720 || trainable%: 0.19606626733482374
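
The numbers above can be reproduced by hand. The sketch below assumes that adapters sit on `q_proj` and `v_proj` only, that each `Linear4bit` weight is stored packed (two 4-bit values per element, so it counts for half its nominal size), and that the embedding and output head are not quantized. These assumptions are ours, but together they match the printed counts exactly.

```python
hidden_size = 5120         # in/out features of q_proj and v_proj in the printed model
intermediate_size = 13824  # MLP width in the printed model
vocab_size = 32000
num_layers = 40
rank = 16
adapted_per_layer = 2      # assumption: adapters on q_proj and v_proj only

# Each adapter adds lora_A (rank x in) and lora_B (out x rank).
params_per_adapter = rank * (hidden_size + hidden_size)
trainable = num_layers * adapted_per_layer * params_per_adapter

# Assumption: Linear4bit weights are packed two 4-bit values per stored element.
linear_per_layer = 4 * hidden_size**2 + 3 * hidden_size * intermediate_size
packed_linear = num_layers * linear_per_layer // 2
# Embedding and output head (assumed unquantized) plus RMSNorm weights (2 per layer + final).
unquantized = 2 * vocab_size * hidden_size + (2 * num_layers + 1) * hidden_size
all_params = packed_linear + unquantized + trainable

print(trainable, all_params, 100 * trainable / all_params)
```

Note that the reported "all params" is about half of 13 billion only because of the packed 4-bit storage; relative to the nominal parameter count, the trainable fraction is closer to 0.1%.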


## Loading the trainer

Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), which wraps the transformers `Trainer` to make it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below.

In [18]:
from transformers import TrainingArguments
output_dir = "./results"
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
optim = "paged_adamw_32bit"  # AdamW variant from bitsandbytes that pages optimizer states to CPU memory to avoid GPU out-of-memory spikes
save_steps = 100
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 500
warmup_ratio = 0.03
lr_scheduler_type = "constant"  # note: the plain constant schedule does not apply warmup_ratio

In [19]:
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    report_to="none"
)
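
It is worth checking how much data these settings actually cover. The quick calculation below assumes a single GPU and uses the 1900 training examples reported in this notebook's output; with a batch size of 1 and 500 steps, the run sees only part of the dataset.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
max_steps = 500
num_train_examples = 1900   # size of the train split in this run
num_devices = 1             # assumption: single-GPU training

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = effective_batch * max_steps
epochs = examples_seen / num_train_examples

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} ({epochs:.2f} epochs)")
```

Raising `gradient_accumulation_steps` increases the effective batch size without using more GPU memory, at the cost of more forward and backward passes per optimizer step.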

In [20]:
!pip list

Package                            Version
---------------------------------- -------------------
absl-py                            1.4.0
accelerate                         1.6.0
aiohappyeyeballs                   2.6.1
aiohttp                            3.11.15
aiosignal                          1.3.2
alabaster                          1.0.0
albucore                           0.0.23
albumentations                     2.0.5
ale-py                             0.10.2
altair                             5.5.0
annotated-types                    0.7.0
anyio                              4.9.0
argon2-cffi                        23.1.0
argon2-cffi-bindings               21.2.0
array_record                       0.7.1
arviz                              0.21.0
astropy                            7.0.1
astropy-iers-data                  0.2025.3.31.0.36.18
astunparse                         1.6.3
atpublic                           5.1
attrs                              25.3.0
audioread            

Then finally pass everything to the trainer:

In [22]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/1900 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


## Train the model

Now let's train the model! Simply call `trainer.train()`

In [None]:
trainer.train()

Step,Training Loss
10,3.3078
20,3.1206
30,3.065
40,2.7458
50,2.7321
60,2.7734
70,2.7146
80,2.703
90,2.7449


During training, the model should converge nicely as follows:

![image](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/loss-falcon-7b.png)

The `SFTTrainer` also takes care of properly saving only the adapters during training instead of saving the entire model.

In [None]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained("outputs")

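### Loading the Adapter (the residual)
Since `SFTTrainer` saves only the adapter weights (the low-rank change matrices), we load them from the folder separately and attach them to the base model with `PeftModel.from_pretrained`. Calling `get_peft_model` with the saved config would instead create a new, untrained adapter.
![LoraMatrix](LoraMatrix.png)


In [None]:
from peft import PeftModel
# `model` is expected to be the quantized base model; in a fresh session, reload it first.
model = PeftModel.from_pretrained(model, 'outputs')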

In [None]:
dataset['text'][0]

In [None]:
text = "### Human: Écrire un texte dans un style baroque sur la glace et le feu ### Assistant: Si j'en luis éton."
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### References
* Quantization https://huggingface.co/blog/4bit-transformers-bitsandbytes
* Parameter-efficient fine-tuning https://huggingface.co/docs/peft/en/index
* LoRA — Intuitively and Exhaustively Explained https://towardsdatascience.com/lora-intuitively-and-exhaustively-explained-e944a6bff46b
* Prompting guides https://www.promptingguide.ai/