## Prompting and Efficiently Fine-Tuning a Large Language Model

There are two main techniques for fine-tuning:

* **Supervised Fine-Tuning (SFT)**: This technique trains the Large Language Model (LLM) directly on a carefully curated dataset of annotated examples that depict the intended task or domain. SFT is straightforward, requiring only labeled data and conventional training methods. During SFT, the model is fine-tuned on a dataset of instructions and responses: the weights of the LLM are adjusted to minimize the difference between the generated answers and the ground-truth responses, which act as labels.

* **Reinforcement Learning from Human Feedback (RLHF)**: RLHF employs an iterative approach that trains a reward model using human feedback on the LLM's outputs. This reward model is then used to enhance the LLM's performance through reinforcement learning. However, this method is complex as it requires creating and training a distinct reward model. Managing various human preferences and addressing biases often make this task challenging.
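The SFT objective described above (minimizing the difference between generated answers and ground-truth responses) is usually a token-level cross-entropy loss. The following is a minimal sketch of that loss; the three-token vocabulary and the probabilities are made up for illustration, whereas a real trainer computes it over the model's full vocabulary.

```python
import math

def cross_entropy(predicted_probs, target_ids):
    """Average negative log-likelihood of the target tokens.

    predicted_probs: one probability distribution (list of floats) per position.
    target_ids: index of the ground-truth token at each position.
    """
    losses = [-math.log(probs[t]) for probs, t in zip(predicted_probs, target_ids)]
    return sum(losses) / len(losses)

# Hypothetical model outputs over a toy 3-token vocabulary, for 2 positions.
predicted = [
    [0.7, 0.2, 0.1],  # position 0: the model favors token 0
    [0.1, 0.3, 0.6],  # position 1: the model favors token 2
]
targets = [0, 2]  # ground-truth tokens taken from the annotated response

loss = cross_entropy(predicted, targets)
print(f"loss = {loss:.4f}")
```

The loss shrinks toward zero as the model assigns more probability to the ground-truth tokens, which is what the weight updates during fine-tuning aim for.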

### Supervised Fine-Tuning (SFT):

In this notebook, we demonstrate how to perform Supervised Fine-Tuning on the Llama-2-13b-chat model to adapt it to a new instruction-following task. We will leverage the PEFT library from the Hugging Face ecosystem, as well as QLoRA for more memory-efficient fine-tuning.


## Setup

Run the cells below to set up and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets` and `trl` to leverage the [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

In [1]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q -U  datasets bitsandbytes

In [2]:
import os
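os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # number GPUs in PCI bus order
# Expose a single GPU to this process (adjust the index for your machine);
# inside the process, that GPU is then addressed as "cuda:0".
os.environ["CUDA_VISIBLE_DEVICES"] = "5"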

## Loading the model

## **Optimization & Quantization**
Large Language Models (LLMs) are known for their significant computational demands. Typically, the size of a model is determined by multiplying the number of parameters by the precision of these values (data type). To conserve memory, weights can be stored using lower-precision data types through a process known as quantization.
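The "parameters times precision" rule above can be checked with a little arithmetic. This is a rough sketch only: it counts weight storage for an assumed 13-billion-parameter model and ignores activations, optimizer state, and the small overhead of quantization constants.

```python
def model_size_gb(num_params, bits_per_param):
    """Approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 13e9  # assumption: roughly 13 billion parameters

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(num_params, bits):5.1f} GB")
```

Going from 16-bit to 4-bit weights cuts the weight memory by a factor of four, which is what makes a model of this size fit on a single GPU.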

**Post-Training Quantization (PTQ)** is a straightforward technique where the weights of an already trained model are converted to a lower precision without necessitating any retraining. Although easy to implement, PTQ can lead to potential performance degradation.
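To illustrate what post-training quantization does to the weights, here is a minimal sketch of symmetric "absmax" round-to-nearest quantization to 4 bits. It is a deliberately simple uniform scheme and not the NF4 format used later in this notebook (NF4 uses a non-uniform set of levels and block-wise scaling); the function names and sample weights are hypothetical.

```python
def quantize_absmax(weights, bits=4):
    """Symmetric round-to-nearest quantization to signed `bits`-bit integers.

    Assumes at least one weight is non-zero.
    """
    qmax = 2 ** (bits - 1) - 1                   # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax  # one scale for the whole tensor
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.91, -0.07, 0.33]  # made-up example weights
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized integers:", q)
print("max absolute error:", round(max_error, 4))
```

The rounding error per weight is bounded by half the scale, which is the source of the potential performance degradation mentioned above.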

To load our 13-billion-parameter model, we need some optimization tricks to "condense" it for efficient operation. The main one is 4-bit quantization, which reduces each weight from its original 16- or 32-bit floating-point representation to just 4 bits, significantly lowering the GPU memory requirements. You can learn more about this method in the QLoRA paper [here](https://arxiv.org/pdf/2305.14314.pdf) and on the Hugging Face blog [here](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

* `load_in_4bit`
  * This option loads the model weights in 4-bit precision instead of the usual 16- or 32-bit floating-point precision.
* `bnb_4bit_quant_type`
  * This specifies the type of 4-bit precision. Following the paper's recommendation, we will use normalized float 4-bit (`nf4`).
* `bnb_4bit_use_double_quant`
  * This neat trick quantizes the quantization constants produced by the first quantization a second time, saving additional memory.
* `bnb_4bit_compute_dtype`
  * This is the data type used for computation: the weights are stored in 4 bits but dequantized to this type (here `bfloat16`) for matrix multiplications, which is faster than computing in 32-bit floats.


In [3]:
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'meta-llama/Llama-2-13b-chat-hf'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # loading in 4 bit 
    bnb_4bit_quant_type="nf4", # quantization type
    bnb_4bit_use_double_quant=True, # nested quantization 
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Let's also load the tokenizer below

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

## **Prompt Engineering**

To check whether our model is correctly loaded, let's try it out with a few prompts.

In [5]:
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    temperature=0.001,
    max_new_tokens=500,
    repetition_penalty=1.1
)

## Prompting strategies 

In [6]:
prompt = "Could you explain to me how 4-bit quantization works?"
res = generator(prompt)
print(res[0]["generated_text"])

Could you explain to me how 4-bit quantization works?

I understand that it's a process of reducing the precision of a number by representing it with fewer bits, but I'm not sure about the specifics. Could you provide an example or two to help illustrate how it works?

Thanks!

---

Hey there! Sure, I'd be happy to help explain 4-bit quantization.

So, as you mentioned, 4-bit quantization is a process of reducing the precision of a number by representing it with fewer bits. In this case, we're using 4 bits to represent our numbers.

To understand how this works, let's start with an example. Let's say we have a number called "x" that we want to represent using 4 bits. Here's how we would do it:

1. First, we divide "x" by 2^4 (which is equal to 16). This gives us the remainder of "x" divided by 16.
2. Next, we assign a value to "x" based on the remainder. Since we're using 4 bits, we can only represent values from 0 to 15. So, if the remainder is less than or equal to 15, we assign the 

### Zero-shot prompting
Zero-shot prompting means asking the LLM to perform a task directly, without providing any examples (demonstrations) in the prompt. It relies on the general knowledge the model acquired during training being sufficient for the task or domain at hand.

In [7]:
prompt = """Classify the text into neutral, negative or positive. 
Text: I think the vacation is okay.
Sentiment:"""
res = generator(prompt)
print(res[0]["generated_text"])

Classify the text into neutral, negative or positive. 
Text: I think the vacation is okay.
Sentiment: Neutral


### Few-shot prompting
While large language models demonstrate remarkable zero-shot capabilities, they still fall short on more complex tasks in the zero-shot setting. Few-shot prompting enables in-context learning: we provide demonstrations in the prompt to steer the model toward better performance. The demonstrations serve as conditioning for subsequent examples where we would like the model to generate a response.
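A few-shot prompt of this kind can also be assembled programmatically from a list of labelled demonstrations. The helper here is a hypothetical sketch (the name `build_few_shot_prompt` and its arguments are illustrative, not part of any library), and it follows the same "Text / Sentiment" layout used in this section.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, labelled demonstrations, then the query."""
    parts = [instruction, "Here are some annotated examples:"]
    for i, (text, label) in enumerate(examples, start=1):
        parts.append(f"Example {i}:\nText: '{text}'\nSentiment: {label}\n")
    # The prompt ends on an open "Sentiment:" so the model completes the label.
    parts.append(f"Text: '{query}'\nSentiment:")
    return "\n".join(parts)

examples = [
    ("The food was delicious and the staff were very friendly.", "positive"),
    ("The product broke within a week.", "negative"),
]
demo_prompt = build_few_shot_prompt(
    "Classify the text into neutral, negative or positive.",
    examples,
    "The movie was okay, not great but not bad either.",
)
print(demo_prompt)
```

The resulting string can be passed to `generator(...)` in the same way as a hand-written prompt.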



In [8]:
prompt = """Classify the text into neutral, negative or positive. 
Here are some annotated examples:
Example 1:
Text: 'This restaurant is the best I've ever been to. The food was delicious and the staff were very friendly.'
Sentiment: positive

Example 2:
Text: 'I was disappointed with my purchase. The product broke within a week.'
Sentiment: negative

Example 3:
Text: 'The movie was okay, not great but not bad either.'
Sentiment: neutral

Text: 'I absolutely love the new Spider-Man movie. It's incredibly well done!'
Sentiment:"""
res = generator(prompt)
print(res[0]["generated_text"])

Classify the text into neutral, negative or positive. 
Here are some annotated examples:
Example 1:
Text: 'This restaurant is the best I've ever been to. The food was delicious and the staff were very friendly.'
Sentiment: positive

Example 2:
Text: 'I was disappointed with my purchase. The product broke within a week.'
Sentiment: negative

Example 3:
Text: 'The movie was okay, not great but not bad either.'
Sentiment: neutral

Text: 'I absolutely love the new Spider-Man movie. It's incredibly well done!'
Sentiment: positive

Text: 'I hate this app. It never works properly.'
Sentiment: negative

Text: 'The hotel room was nice, but the bed was uncomfortable.'
Sentiment: neutral

Please note that the sentiment of the text can vary based on the context and the individual's perspective.


In [9]:
prompt = """A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.
 
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:
"""
res = generator(prompt)
print(res[0]["generated_text"])

A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.
 
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:
The kids were so excited to see the new movie that they started doing farduddles in line.
 
A "flumplenook" is something that is very old-fashioned or outdated. An example of a sentence that uses the word flumplenook is:
My grandmother's house is full of flumplenook furniture that she inherited from her parents.
 
To "glibble" means to talk nonsense or to babble on endlessly. An example of a sentence that uses the word glibble is:
My little brother loves to gibble on about his favorite video game for hours on end.
 
A "glorp" is a silly or goofy face that someone makes. An example of a sentence that uses the word glorp is:
My friend made a really glorp expression when he saw the clown at the circu

In [10]:
## For more details on prompting check:  https://www.promptingguide.ai/techniques

## Loading the dataset
We will use the 🤗 Datasets library to download the data. This can be easily done with the `load_dataset` function. We will use an instruction-tuning dataset in French for novel writing.

In [11]:
from datasets import load_dataset
dataset_name = 'PericlesSavio/novel17_test' 
dataset = load_dataset(dataset_name, split="train")

In [12]:
dataset['text'][0]

"### Human: Écrire un texte dans un style baroque, utilisant le langage et la syntaxe du 17ème siècle, mettant en scène un échange entre un prêtre et un jeune homme confus au sujet de ses péchés.### Assistant: Si j'en luis éton. né ou empêché ce n'eſt pas ſans cauſe vů que ſouvent les liommes ne ſçaventque dire non plus que celui de tantôt qui ne ſçavoit rien faire que des civiéresVALDEN: Jefusbien einpêché confeſſant un jour un jeune Breton Vallonqui enfin de confeſſion me dit qu'il avoit beſongné une civiere . Quoilui dis je mon amice peché n'eſt point écrit au livre Angeli que d'enfernommé la ſommedes pechez ,qui eſt le livre le plus déteſtable qui fut jamais fait& le plus blafphematoire d'autant qu'il eſt dédié à la plus femme de bien je ne ſçai quelle penitence te donner ; mais non mon amiquel goûty prenois-tu ? Mon fieur bon & delectable. Quoi!"

## Parameter-Efficient Fine-Tuning (PEFT)
PEFT is a family of techniques designed to fine-tune models while minimizing the resources and costs required. It is particularly effective for domain-specific tasks that require model adaptation. By using PEFT, we can retain valuable knowledge from the pre-trained model while efficiently adapting it to the target task by training only a small number of parameters. Various methods exist for parameter-efficient fine-tuning, with Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) being among the most effective.

## Low-Rank Adaptation (LoRA)
Low-Rank Adaptation offers a modular approach to fine-tuning a model for domain-specific tasks and facilitates transfer learning. The LoRA technique can be implemented with fewer resources and is memory-efficient. The image below illustrates the dimension/rank decomposition, which significantly reduces the memory footprint.

![LoRA](LoRA.png)


We will apply this by adding a LoRA adapter to existing linear layers of the network (in this notebook, attention projections such as `q_proj`, as the model printout further down shows). The original weights are frozen, and only the LoRA matrices are trained. Refer to the picture below for more details.

![LoRATransformer](LoRATransformer.png)

### LoRA Parameters
* **Rank**: This parameter is the rank of the low-rank weight matrices Wa and Wb, i.e. their shared inner dimension. Commonly used ranks are powers of 2.
* **Alpha**: This parameter controls the scaling applied when the weight changes are added back into the original model weights: the low-rank update is multiplied by alpha divided by rank. A common heuristic, used here, is to set alpha to twice the rank.
* **Dropout**: This parameter is the dropout probability applied to the input of the LoRA branch during training. It is used to help prevent overfitting the model to your data.
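The roles of rank and alpha can be sketched with a toy forward pass. The code below is an illustrative pure-Python sketch, not PEFT's implementation: sizes are tiny, the helper names are hypothetical, and it assumes the usual LoRA initialisation (small random values for the down-projection, zeros for the up-projection).

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, r, alpha = 8, 2, 4  # toy sizes; the notebook uses r=16, alpha=32 on 5120-wide layers

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen weights (d x d)
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # down-projection (r x d)
B = [[0.0] * r for _ in range(d)]                                  # up-projection (d x r), zero at init
x = [random.gauss(0, 1) for _ in range(d)]

# With B initialised to zero, the adapter leaves the base output unchanged.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d * d          # parameters in a full-rank update
lora_params = r * d + d * r  # parameters in the low-rank update
print(f"full update: {full_params} params, LoRA update: {lora_params} params")
```

At the scale of this notebook the saving is far larger: a full update of one 5120 x 5120 layer has 26,214,400 parameters, while a rank-16 adapter for it has 163,840.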


In [13]:
from peft import LoraConfig, get_peft_model

lora_alpha = 32
lora_dropout = 0.1
lora_r = 16

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)

In [14]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [15]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear4bit(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Lla

Next, we will check the number of parameters before and after LoRA. 

In [16]:
lora_model = get_peft_model(model, peft_config)

In [17]:
lora_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 5120)
        (layers): ModuleList(
          (0-39): 40 x LlamaDecoderLayer(
            (self_attn): LlamaSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=5120, out_features=5120, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=5120, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=5120, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=Fa

In [18]:
print_trainable_parameters(lora_model)

trainable params: 13107200 || all params: 6685086720 || trainable%: 0.19606626733482374
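The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes that adapters are attached to two projections per decoder layer (`q_proj` and `v_proj`, which is PEFT's default target for Llama models as far as we know); the layer sizes are taken from the model printout.

```python
hidden_size = 5120     # in/out features of q_proj and v_proj in the model printout
num_layers = 40        # number of decoder layers
rank = 16              # lora_r
adapted_per_layer = 2  # assumption: adapters on q_proj and v_proj only

# Each adapted layer gets A (rank x hidden) and B (hidden x rank).
params_per_adapter = rank * hidden_size + hidden_size * rank
trainable = num_layers * adapted_per_layer * params_per_adapter
print(trainable)

all_params = 6_685_086_720  # value reported by print_trainable_parameters
print(f"{100 * trainable / all_params:.4f}%")
```

Note that the reported "all params" figure is about half of 13 billion. This appears to be because the 4-bit weights are packed two per byte, so `numel()` on the quantized layers counts roughly half of the actual weights; relative to the true parameter count, the trainable fraction is closer to 0.1%.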


## Loading the trainer

Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), which wraps the transformers `Trainer` to make it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below.

In [19]:
from transformers import TrainingArguments
output_dir = "./results"
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
optim = "paged_adamw_32bit"  # AdamW with paged optimizer states (bitsandbytes), which helps avoid out-of-memory spikes
save_steps = 100
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 500
warmup_ratio = 0.03
lr_scheduler_type = "constant"

In [20]:
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    report_to="none"
)

Then finally pass everything to the trainer.

In [21]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)



## Train the model

Now let's train the model! Simply call `trainer.train()`

In [22]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10 | 3.3167 |
| 20 | 3.1317 |
| 30 | 3.0789 |
| 40 | 2.7532 |
| 50 | 2.7423 |
| 60 | 2.7791 |
| 70 | 2.7224 |
| 80 | 2.7116 |
| 90 | 2.7525 |
| 100 | 2.8447 |



Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-13b-chat-hf/resolve/main/config.json.
Repo model meta-llama/Llama-2-13b-chat-hf is gated. You must be authenticated to access it. - silently ignoring the lookup for the file config.json in meta-llama/Llama-2-13b-chat-hf.

TrainOutput(global_step=500, training_loss=2.6466694145202636, metrics={'train_runtime': 248.9934, 'train_samples_per_second': 2.008, 'train_steps_per_second': 2.008, 'total_flos': 1.385489754835968e+16, 'train_loss': 2.6466694145202636, 'epoch': 0.26})

During training, the model should converge nicely as follows:

![image](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/loss-falcon-7b.png)

The `SFTTrainer` also takes care of properly saving only the adapters during training instead of saving the entire model.

In [23]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained("outputs")



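### Loading the Adapter (the residual)
Since `SFTTrainer` saves only the adapter weights (the low-rank update matrices), we load them from the output folder separately and attach them to the base model with `PeftModel.from_pretrained`. Calling `get_peft_model` with the saved configuration would only create a new, untrained adapter.
![LoraMatrix](LoraMatrix.png)


In [24]:
from peft import PeftModel
# Loads the saved adapter config and its trained weights onto the (base) model.
model = PeftModel.from_pretrained(model, "outputs")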

In [25]:
dataset['text'][0]

"### Human: Écrire un texte dans un style baroque, utilisant le langage et la syntaxe du 17ème siècle, mettant en scène un échange entre un prêtre et un jeune homme confus au sujet de ses péchés.### Assistant: Si j'en luis éton. né ou empêché ce n'eſt pas ſans cauſe vů que ſouvent les liommes ne ſçaventque dire non plus que celui de tantôt qui ne ſçavoit rien faire que des civiéresVALDEN: Jefusbien einpêché confeſſant un jour un jeune Breton Vallonqui enfin de confeſſion me dit qu'il avoit beſongné une civiere . Quoilui dis je mon amice peché n'eſt point écrit au livre Angeli que d'enfernommé la ſommedes pechez ,qui eſt le livre le plus déteſtable qui fut jamais fait& le plus blafphematoire d'autant qu'il eſt dédié à la plus femme de bien je ne ſçai quelle penitence te donner ; mais non mon amiquel goûty prenois-tu ? Mon fieur bon & delectable. Quoi!"

In [29]:
text = "### Human: Écrire un texte dans un style baroque sur la glace et le feu ### Assistant: Si j'en luis éton."
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### Human: Écrire un texte dans un style baroque sur la glace et le feu ### Assistant: Si j'en luis éton. Pourquoi pas ? Voici un texte dans un style baroque sur la glace et le feu :

Glace et feu, deux éléments opposés, mais tous deux nécessaires pour la survie de notre monde. La glace, froide et glaciale, représente la stabilité et la permanence, tandis que le feu, chaud et passionnant, symbolise la transformation et la croissance.

Toutefois, la glace et le feu ne sont pas simplement des forces opposées, mais également des alliés inséparables. La glace est en effet nécessaire pour préserver la pureté et la dureté du feu, tandis que le feu est essentiel pour fondre la glace et la transformer en eau vive.

Cependant, la glace et le feu ne sont pas uniquement des forces physiques, mais également des symboles de la vie et de la mort. La glace peut figurer la stérilité et l'inaccessibilité, tandis que le feu peut représenter la passion et l'énergie.

En fin de compt

### References 
* Quantization https://huggingface.co/blog/4bit-transformers-bitsandbytes
* Parameter-efficient fine-tuning https://huggingface.co/docs/peft/en/index
* LoRA - Intuitively and Exhaustively Explained https://towardsdatascience.com/lora-intuitively-and-exhaustively-explained-e944a6bff46b
* Prompting guides https://www.promptingguide.ai/