# Part 2: Fine-Tuning 🤗 🔢

***Full notebook contents viewable on [Kaggle](https://www.kaggle.com/code/chuhuayang/prompt-recovery-pt-2-fine-tuning).***

Previously, we generated a set of training data. Now, we will use this data to fine-tune our model for better performance on the prompt recovery task. We will use the [Hugging Face Transformers API](https://huggingface.co/docs/transformers/index) and its various integrations, which together provide a comprehensive collection of LLM training techniques and allow us to easily save, serialize, and deploy fine-tuned models.

### Set-up

Kaggle's environments do not currently come pre-loaded with all the libraries from the Hugging Face ecosystem, so we will start by installing the necessary packages.

```Python
%%capture

!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U bitsandbytes
%pip uninstall -y -q datasets
%pip install -q datasets==2.16.0
!pip install -q -U trl
!pip install -q -U peft

import numpy as np
import pandas as pd

import os
import warnings

warnings.filterwarnings("ignore")

import torch
import torch.nn as nn

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from datasets import Dataset
from peft import LoraConfig, PeftConfig, prepare_model_for_kbit_training, get_peft_model
import bitsandbytes as bnb
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
```



### Loading the Model

We load the model and tokenizer using the Transformers library. [bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/index) is a library integrated with Transformers that allows quantization of LLMs in PyTorch to 8-bit or 4-bit precision. We create a bitsandbytes configuration and pass it to the Transformers API when loading Gemma. This significantly reduces the memory needed to run LLM inference. It also enables efficient LLM training techniques such as QLoRA, which we will be using.

Here is a simple run-through of the parameters. Further explanations can be found in these articles: [Introduction](https://huggingface.co/blog/hf-bitsandbytes-integration), [4-bit Quantization](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
- `load_in_4bit`: Loads the model weights in 4-bit precision, a further reduction in model size compared to 8-bit quantization
- `bnb_4bit_use_double_quant`: This option saves even more memory by quantizing the scaling factors as well
- `bnb_4bit_quant_type`: We have a choice between nf4 and fp4 datatypes. The QLoRA paper recommends nf4.
- `bnb_4bit_compute_dtype`: When the 4-bit weights are unpacked, they will be scaled to this datatype. We use [bfloat16](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus), a datatype optimized for deep learning
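
To get a rough feel for what these settings buy us, the sketch below estimates the weight memory of the base model at different precisions. It assumes about 8.54 billion base parameters (the total that PEFT reports later in this notebook, minus the LoRA weights) and uses the per-parameter overhead figures described in the QLoRA paper for blocks of 64 weights. The helper name `weight_gib` is hypothetical. Layers that stay unquantized, such as embeddings, are ignored, so real usage will be somewhat higher.

```python
# Rough estimate of weight memory at different precisions (illustrative only).
BASE_PARAMS = 8_737_696_768 - 200_015_872  # "all params" minus LoRA params


def weight_gib(num_params: int, bits_per_param: float) -> float:
    """Memory in GiB for num_params weights stored at bits_per_param each."""
    return num_params * bits_per_param / 8 / 2**30


# Assumed overheads: one 32-bit constant per 64 weights costs 0.5 bits/param;
# double quantization reduces this to roughly 0.127 bits/param.
estimates = {
    "bf16": weight_gib(BASE_PARAMS, 16),
    "4-bit": weight_gib(BASE_PARAMS, 4 + 0.5),
    "4-bit + double quant": weight_gib(BASE_PARAMS, 4 + 0.127),
}

for name, gib in estimates.items():
    print(f"{name:>22}: {gib:5.2f} GiB")
```

Under these assumptions the 4-bit weights need well under a third of the bf16 footprint, which is what lets a model of this size fit on Kaggle's GPUs.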

We set some additional configurations to optimize the training process:
- `config.use_cache`: Caching is unnecessary during fine-tuning. Disabling this saves memory.
- `config.pretraining_tp = 1`: Disables tensor parallelism to avoid unexpected errors
- `gradient_checkpointing_enable()`: Applies the [gradient checkpointing](https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing) strategy during the backward pass, reducing memory usage at the cost of longer compute time

```Python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "/kaggle/input/gemma/transformers/7b-it/3",
    device_map="auto",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16
)

model.config.use_cache = False
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()


tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/gemma/transformers/7b-it/3")
tokenizer.padding_side = "right"
tokenizer.add_eos_token = True
```


### Loading the Data

To aid the LLM in learning, we will standardize inputs using a template with labels and an instruction set.

```Python
TEMPLATE = """### Instruction:
Below, the `Original Text` passage has been rewritten/transformed/improved into `Rewritten Text` by the `Gemma 7b-it` LLM with a certain prompt/instruction. Your task is to carefully analyze the differences between the "Original Text" and "Rewritten Text", and try to infer the specific prompt or instruction that was likely given to the LLM to rewrite/transform/improve the text in this way.

### Original Text: 
{original_text}

### Rewritten Text:
{rewritten_text}

### Response:
{prompt}
"""


def generate_prompt(row):
    return TEMPLATE.format(original_text=row["original_text"],
                           rewritten_text=row["rewritten_text"],
                           prompt=row["rewrite_prompt"])
```

Next, we create the training and evaluation splits. Hugging Face uses its own [Datasets](https://huggingface.co/docs/datasets/main/en/index) library, which provides a simple function for converting pandas DataFrames to Datasets.

```Python
import random
random.seed(0)

df = pd.read_csv("/kaggle/input/prompt-recovery-pt-1-generate-training/training_data.csv")

df_train = df[0:4200].copy().reset_index()
df_eval = df[4200:4400].copy().reset_index()

df_train["text"] = df_train.apply(generate_prompt, axis=1)
df_eval["text"] = df_eval.apply(generate_prompt, axis=1)

print(random.choice(df_train["text"]))
print(random.choice(df_eval["text"]))

train_data = Dataset.from_pandas(df_train)
eval_data = Dataset.from_pandas(df_eval)
```

```
### Instruction:
Below, the `Original Text` passage has been rewritten/transformed/improved into `Rewritten Text` by the `Gemma 7b-it` LLM with a certain prompt/instruction. Your task is to carefully analyze the differences between the "Original Text" and "Rewritten Text", and try to infer the specific prompt or instruction that was likely given to the LLM to rewrite/transform/improve the text in this way.

### Original Text: 
i have the feeling that ladislaus is not too keen on visitors at his place

### Rewritten Text:
Sure, here is the text rewritten through the eyes of an aspiring poet:

In the halls of whispers and secrets,
Ladislaus's abode, a sanctuary of dreams,
Yet a veil of caution hangs thick in the air,
For visitors, a burden he does not care.

The poet's heart, a canvas of longing,
Paints a picture of a heart that is torn,
Between the desire to share his soul and the fear of intrusion,
Ladislaus's stance, a reflection of his mood.

### Response:
Convey the same message as 
```

### Training Configurations

The [PEFT](https://huggingface.co/docs/peft/index) library, which we imported earlier, integrates seamlessly with other Hugging Face libraries like Trainer, bitsandbytes, and Accelerate.

We create a PEFT-enabled model. Here is a quick run-through of the LoRA hyperparameters. We will mostly follow the numbers used in the [QLoRA paper's](https://arxiv.org/abs/2305.14314) chatbot training:
- `r`: The rank of the LoRA matrix being injected into the model. A larger number means more trainable parameters.
- `lora_alpha`: Controls how much influence the LoRA weights have over model behavior. A larger number means more influence, and vice versa.
- `lora_dropout`: Dropout is a common technique where randomly selected neurons are ignored during training. A value of 0.1 means each one is dropped with 10% probability.
- `target_modules`: Controls which modules LoRA will be applied to. We target all linear layers.
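
As a sanity check on how `r` and `target_modules` drive the number of trainable parameters, the sketch below counts the LoRA weights by hand. Each targeted linear layer gains two small matrices, so it adds `r * (d_in + d_out)` parameters. The layer shapes and layer count are assumptions taken from the published Gemma 7B configuration, not from this notebook, and the helper name `lora_params` is hypothetical.

```python
# Count LoRA parameters by hand (layer shapes are assumed, see comments).
R = 64
LORA_ALPHA = 16

# Assumed Gemma 7B dimensions: hidden size, attention width, MLP width, layers.
HIDDEN, ATTN, MLP, NUM_LAYERS = 3072, 4096, 24576, 28

# (d_in, d_out) of every linear layer targeted in each decoder block.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, ATTN),
    "k_proj": (HIDDEN, ATTN),
    "v_proj": (HIDDEN, ATTN),
    "o_proj": (ATTN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a d_in -> d_out layer."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # standard LoRA scales the update by lora_alpha / r

print(f"trainable LoRA params: {total:,}")
print(f"update scaling factor: {scaling}")
```

With these assumed shapes the count comes to 200,015,872, the same figure that `print_trainable_parameters()` reports for this configuration. The scaling factor of 0.25 shows why `lora_alpha` is usually read relative to `r` rather than on its own.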

```Python
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"]
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
```

trainable params: 200,015,872 || all params: 8,737,696,768 || trainable%: 2.2891


Next, we prepare the model for training on the dual T4 GPU set-up that Kaggle offers.

```Python
if torch.cuda.device_count() > 1: # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True

torch.cuda.empty_cache()

import gc
gc.collect()
```


Hugging Face's [TRL library](https://huggingface.co/docs/trl/en/index) offers a large array of tools for model training. We will be using the Supervised Fine-tuning Trainer. Here is a quick run-through of the arguments:
- `save_steps` `save_total_limit` `load_best_model_at_end`: The Trainer saves checkpoints after every set number of steps. Our configuration will preserve the most recent checkpoint and the best checkpoint, and select the best checkpoint for use at the end of training.
- `logging_steps` `report_to`: Logs training metrics after every set number of training steps. Integrated with tools such as [Weights & Biases](https://docs.wandb.ai/guides) and [Tensorboard](https://www.tensorflow.org/tensorboard) for visualization and analysis. For simplicity, this notebook will not utilize any of these additional platforms. 
- `eval_strategy` `eval_steps`: Enables evaluation after every set number of training steps.
- `per_device_train_batch_size` `gradient_accumulation_steps`: The training batch size and the number of batches to [accumulate gradients](https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-accumulation) for. We set the former to 1 to minimize memory usage and use the latter to mimic the effects of batched samples.
- `per_device_eval_batch_size` `eval_accumulation_steps`: The evaluation batch size and the number of batches to evaluate before moving results to the CPU. We set both to 1 to minimize memory usage.
- `learning_rate` `lr_scheduler_type`: Sets the learning rate and learning rate schedule. The values are those used by the QLoRA paper.
- `weight_decay`: Adds a penalty term for larger weights to prevent overfitting
- `max_grad_norm`: Sets the threshold for gradient clipping. 0.3 is the value used by the QLoRA paper.
- `fp16=True`: Enables mixed-precision training with 16-bit floats.
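
To see how these settings interact, the sketch below works out the effective batch size and the number of optimizer steps and evaluations for one epoch. It assumes a single training process (the model is split across the two GPUs rather than replicated), uses the 4,200-example training split and the step settings from this notebook, and the helper name `training_schedule` is hypothetical. The Trainer's own rounding may differ slightly when the dataset size is not a multiple of the effective batch size.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      eval_steps, num_processes=1, epochs=1):
    """Estimate effective batch size, optimizer steps, and evaluation runs."""
    # Gradients from several small batches are summed before each update.
    effective_batch = per_device_batch_size * grad_accum_steps * num_processes
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch_size": effective_batch,
        "optimizer_steps": total_steps,
        "evaluations": total_steps // eval_steps,
    }


schedule = training_schedule(4200, per_device_batch_size=1,
                             grad_accum_steps=8, eval_steps=15)
print(schedule)
```

Under these assumptions, one epoch corresponds to 525 optimizer steps with an effective batch size of 8, and the model is evaluated and checkpointed 35 times along the way.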

We highlight two particularly important arguments:
- **`optim`**: We will use a special version of the Adam optimizer with weight decay, [paged_adamw_8bit](https://huggingface.co/docs/bitsandbytes/en/optimizers), offered by bitsandbytes. Its optimizer states are quantized to 8 bits, which saves a large amount of memory and, according to the bitsandbytes documentation, comes with little to no loss in performance. Additionally, this optimizer leverages CUDA's Unified Memory feature, paging optimizer states to CPU memory when the GPU runs out of memory.

- **`data_collator`**: Put simply, the [Data Collator](https://huggingface.co/docs/transformers/en/main_classes/data_collator) class forms batches from a Dataset and applies processing. The data collator used here, `DataCollatorForCompletionOnlyLM`, ensures that all tokens up to and including the user-defined response template do not contribute to the gradient, so the model is trained only on the generated rewrite prompts. `packing=False` prevents any conflicts with this data collator.
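
The masking idea behind completion-only training can be sketched in a few lines of plain Python. This is a simplified illustration with toy token IDs, not TRL's implementation: the function name `mask_prompt_tokens` and the example IDs are hypothetical. Labels set to -100 are skipped by PyTorch's cross-entropy loss by default, so only the tokens after the response template contribute to training.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def mask_prompt_tokens(input_ids, response_template_ids):
    """Return labels where everything up to and including the template is masked."""
    labels = list(input_ids)
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Template not found: mask the whole example so it adds nothing to the loss.
    return [IGNORE_INDEX] * len(input_ids)


# Toy example: tokens [7, 7] stand in for the tokenized "### Response:" marker.
toy_input_ids = [5, 9, 2, 7, 7, 3, 4, 1]
toy_labels = mask_prompt_tokens(toy_input_ids, [7, 7])
print(toy_labels)
```

In the toy example only the last three tokens keep their labels, which mirrors how the instruction, original text, and rewritten text are excluded from the loss while the recovered prompt is learned.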

```Python
training_arguments = TrainingArguments(
    output_dir="./prompt-recovery-finetune",
    save_steps=15,
    save_total_limit=2,
    load_best_model_at_end=True,
    logging_steps=15,
    report_to="none",
    eval_strategy='steps',
    eval_steps = 15,
    num_train_epochs=1,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    optim="paged_adamw_8bit",
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    max_grad_norm=0.3,
    weight_decay=0.01,
    fp16=True,
)

response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template=response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    data_collator=collator,
    train_dataset=train_data,
    eval_dataset=eval_data,
    dataset_text_field="text",
    args=training_arguments,
    packing=False,
)
```


### Training and Saving

`trainer.train()` abstracts away the complicated training and evaluation loops, as well as Accelerate integrations.

Afterwards, instead of saving the entirety of the model's weights, PEFT saves an Adapter, a folder consisting of only the new LoRA weights and configurations. 

:::{tip}
If the notebook crashes with errors such as `RuntimeError: CUDA error: an illegal memory access was encountered`, we can easily resume training from the latest checkpoint with the argument `resume_from_checkpoint=True`. Ensure all other arguments are the same, and ensure `output_dir` is in the same state as before the training was interrupted. On Kaggle, since `/kaggle/working` resets between notebook runs, create a dataset containing the `output_dir` folder and add it as an input for the next run.

```Python
# Copy the saved checkpoints back so the folder name matches `output_dir`
!cp -r /kaggle/input/checkpoints/prompt_recovery_finetune /kaggle/working/prompt-recovery-finetune

...

trainer.train(resume_from_checkpoint=True)
```
:::

```Python
trainer.train()

trainer.model.save_pretrained("./adapter")
```

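| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 15 | 1.6477 | 0.921029 |
| 30 | 0.8101 | 0.693134 |
| 45 | 0.727 | 0.589362 |
| 60 | 0.5558 | 0.5587 |
| 75 | 0.5389 | 0.526677 |
| 90 | 0.5603 | 0.439481 |
| 105 | 0.4295 | 0.426885 |
| 120 | 0.3995 | 0.377671 |
| 135 | 0.434 | 0.355361 |
| 150 | 0.3879 | 0.336657 |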


### Results

Adapters are a lightweight and flexible design. To load and run a trained QLoRA model, first load the base model as usual, using the same configurations. Then, use the `PeftModel.from_pretrained()` method, passing in the base model and the directory of the Adapter.

```Python
from peft import PeftModel

model_path = "/kaggle/input/gemma/transformers/7b-it/3"

# Load the base model with the same quantization settings used for training
base_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16
)
base_model.gradient_checkpointing_enable()
tokenizer = AutoTokenizer.from_pretrained(model_path)

...

model = PeftModel.from_pretrained(base_model, "/kaggle/input/{dataset_name}/adapter", adapter_name="adapter_0", is_trainable=True)
model.enable_input_require_grads()
```
This PeftModel can be used just like any normal model. We can perform inference using the `generate()` method. We can further fine-tune the Adapter by passing it into an `SFTTrainer` set-up, just like above. This continues updating the LoRA weights of the Adapter.
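
As a minimal sketch of what inference could look like, the helpers below cut the training template off at the response marker, so the model is left to fill in the recovered prompt, and then pull the generated text back out. The names `DEMO_TEMPLATE`, `build_inference_prompt`, `extract_response`, and `recover_prompt` are hypothetical, and `DEMO_TEMPLATE` is a shortened stand-in for the full training template. The sketch also assumes the tokenizer used at inference does not append an EOS token to the prompt.

```python
RESPONSE_MARKER = "### Response:"

# Shortened stand-in for the training template (assumption, for illustration).
DEMO_TEMPLATE = """### Original Text:
{original_text}

### Rewritten Text:
{rewritten_text}

### Response:
{prompt}
"""


def build_inference_prompt(template, original_text, rewritten_text):
    """Keep the template up to the response marker and fill in both texts."""
    head = template.split(RESPONSE_MARKER)[0] + RESPONSE_MARKER + "\n"
    return head.format(original_text=original_text, rewritten_text=rewritten_text)


def extract_response(decoded_text):
    """Return whatever follows the last response marker in the decoded output."""
    return decoded_text.split(RESPONSE_MARKER)[-1].strip()


def recover_prompt(model, tokenizer, template, original_text, rewritten_text,
                   max_new_tokens=64):
    """Generate a guess at the rewrite prompt (needs a loaded model/tokenizer)."""
    prompt = build_inference_prompt(template, original_text, rewritten_text)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return extract_response(decoded)


demo_prompt = build_inference_prompt(DEMO_TEMPLATE, "Hello there.", "Ahoy, matey!")
print(demo_prompt)
```

Ending the prompt exactly where the training examples placed the response keeps inference consistent with what the model saw during fine-tuning.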

PeftModels can even juggle multiple Adapters. After the PeftModel is instantiated, add more adapters with the `load_adapter()` method.

```Python
model.load_adapter("/kaggle/input/{dataset_name}/{additional_adapter_1}", adapter_name="adapter_1")
model.load_adapter("/kaggle/input/{dataset_name}/{additional_adapter_2}", adapter_name="adapter_2")
...
```

Only one Adapter can be active at a time. The active Adapter is used when performing inference and further fine-tuning. We can set any of the loaded Adapters as the active Adapter. We can also temporarily disable all Adapters to return to the base model.

```Python
model.set_adapter("adapter_1")
# disable_adapter() is a context manager: the base model is used only inside the block
with model.disable_adapter():
    ...
```

Finally, PEFT offers an algorithm for [merging Adapters](https://huggingface.co/docs/peft/en/developer_guides/model_merging), combining the abilities of separate Adapters into one. An introduction to the algorithm can be found [here](https://huggingface.co/blog/peft_merging).

```Python
adapters = ["adapter_0", "adapter_1", "adapter_2"]
weights = [2.0, 1.0, 1.0]
model.add_weighted_adapter(adapters=adapters, weights=weights, adapter_name="merge", combination_type="ties", density=1.0)
model.set_adapter("merge")
```
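
To give some intuition for what a `"ties"` style merge does, here is a simplified pure-Python sketch of the general idea on flat lists of numbers: trim each adapter's update to its largest entries, elect a sign per position, then average only the entries that agree with that sign. It is an illustration under stated assumptions, not PEFT's implementation, and the names `trim`, `sign`, and `ties_merge` are hypothetical.

```python
def trim(vector, density):
    """Keep the largest-magnitude fraction `density` of entries, zero the rest."""
    k = int(round(density * len(vector)))
    ranked = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


def sign(x):
    """Return 1, -1, or 0 depending on the sign of x."""
    return (x > 0) - (x < 0)


def ties_merge(task_vectors, weights, density):
    """Trim, elect a sign per position, then average the agreeing entries."""
    trimmed = [trim(v, density) for v in task_vectors]
    merged = []
    for column in zip(*trimmed):
        elected = sign(sum(column))  # sign with the larger total magnitude
        agreeing = [w * v for w, v in zip(weights, column)
                    if v != 0 and sign(v) == elected]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


# Two toy "adapters" that agree on position 0 and conflict on positions 1 and 2.
toy_merged = ties_merge([[1.0, -2.0, 0.5], [3.0, 1.0, -0.5]],
                        weights=[1.0, 1.0], density=1.0)
print(toy_merged)
```

With `density=1.0`, as in the call above, nothing is trimmed, so the merge reduces to resolving sign conflicts between the Adapters before averaging.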