# How to Fine-Tune LLMs with LoRA Adapters and TRL

The original notebook is available [here](https://colab.research.google.com/github/huggingface/notebooks/blob/main/course/en/chapter11/section4.ipynb).

This notebook demonstrates how to fine-tune LLMs efficiently using LoRA (Low-Rank Adaptation). LoRA is a parameter-efficient fine-tuning technique that:
- Freezes the weights of the pre-trained model
- Adds a small number of trainable parameters, in the form of rank-decomposition matrices, to the attention layers of the LLM
- Reduces the number of trainable parameters by approximately 90%
- Maintains model performance while being memory-efficient
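
To make the parameter savings concrete, the sketch below counts trainable parameters for a single weight matrix with and without LoRA. The 576 x 576 shape is an assumption chosen for illustration, and the function name is hypothetical. The reduction for a whole model depends on the rank and on which layers receive adapters, so this is only a rough picture.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two small matrices, B (d_out x rank) and A (rank x d_in),
    instead of updating the full matrix.
    """
    return rank * (d_out + d_in)


# Assumed shapes, for illustration only.
d_out, d_in, rank = 576, 576, 6

full = d_out * d_in
lora = lora_param_count(d_out, d_in, rank)
print(f"full matrix: {full} parameters")
print(f"LoRA update: {lora} parameters ({100 * lora / full:.2f}% of full)")
```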

This notebook covers:
1. Setting up the environment and the LoRA parameters
2. Creating and preparing the dataset for training
3. Fine-tuning using `trl` and `SFTTrainer` with LoRA adapters
4. Testing the fine-tuned model and merging the adapters (optional)


# 1. Environment setup

The first step is to install the required libraries, including PyTorch, trl, transformers, and datasets. If you have not heard of trl yet, don't worry. It is a relatively new library, built on top of transformers and datasets, that makes it easier to fine-tune, apply reinforcement learning to, and align open-source LLMs.

In [1]:
!pip install transformers datasets trl peft huggingface_hub
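
After installation, it can help to confirm which versions are present, since the APIs of these libraries change between releases. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages used later in this notebook.
required = ["torch", "transformers", "datasets", "trl", "peft", "huggingface_hub"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```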


## 1.1 Hugging Face authentication (optional, useful for sharing fine-tuned models)

In [2]:
# Authenticate to Hugging Face

from huggingface_hub import login

login()


# 2. Loading the dataset

In [3]:
# Load a sample dataset
from datasets import load_dataset

# TODO: define your dataset and config using the path and name parameters
dataset = load_dataset(path="HuggingFaceTB/smoltalk", name="everyday-conversations")
dataset


DatasetDict({
    train: Dataset({
        features: ['full_topic', 'messages'],
        num_rows: 2260
    })
    test: Dataset({
        features: ['full_topic', 'messages'],
        num_rows: 119
    })
})
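
Each row's `messages` column holds a conversation as a list of `{"role", "content"}` dictionaries. The sketch below is a hand-written approximation of how such a conversation is turned into the ChatML-style text that appears in the prompt printed near the end of this notebook. The function `render_chatml` and the sample conversation are hypothetical; in real code, `tokenizer.apply_chat_template` should be used instead.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as ChatML-style text."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|im_start|>assistant\n"
    return text


# Hypothetical conversation, not taken from the dataset.
example = [
    {"role": "user", "content": "Hi there"},
    {"role": "assistant", "content": "Hello! How can I help you today?"},
]
print(render_chatml(example))
```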

# 3. Fine-tuning the LLM using `trl` and `SFTTrainer` with LoRA

The [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) from the `trl` library provides integration with LoRA adapters through the [PEFT](https://huggingface.co/docs/peft/en/index) library. The main advantages are:

1. **Memory efficiency**:
   - Only the adapter parameters need gradients and optimizer state in GPU memory
   - The base model weights remain frozen and can be loaded in lower precision
   - Enables fine-tuning of LLMs on inference-class GPUs

2. **Training features**:
   - Native PEFT/LoRA integration with minimal configuration
   - Support for QLoRA (Quantized LoRA) for even greater memory efficiency

3. **Adapter management**:
   - Adapter weights are saved at checkpoints
   - Adapters can be merged back into the base model
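
The effect of loading the frozen base weights in lower precision can be estimated with simple arithmetic. The sketch below assumes roughly 135 million parameters, as suggested by the name of the model loaded later in this notebook, and counts weight storage only; activations, gradients, optimizer state, and quantization overhead are ignored. The helper name is hypothetical.

```python
def weight_memory_mb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


# Assumption: roughly 135 million parameters.
num_params = 135_000_000

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>17}: {weight_memory_mb(num_params, bits):.1f} MB")
```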

We use LoRA in this example. LoRA can also be combined with 4-bit quantization of the base model (QLoRA) to reduce memory usage further without sacrificing much performance, although the code below loads the model without quantization. The setup requires only a few configuration steps:

1. Define the LoRA configuration (rank, alpha, dropout)
2. Create the SFTTrainer with the PEFT config
3. Train and save the adapter weights


In [5]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

print(f"Device {device}")

# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Set up the chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)

# Set the name under which the fine-tune will be saved and/or uploaded
finetune_name = "SmolLM2-FT-MyDataset"
finetune_tags = ["smol-course", "module_1"]

Device cuda



The `SFTTrainer` supports native integration with `peft`, which makes it easy to fine-tune LLMs using LoRA. We only need to create a `LoraConfig` and pass it to the trainer.

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Exercise: Define LoRA parameters for finetuning</h2>
    <p>Take a dataset from the Hugging Face hub and finetune a model on it. </p>
    <p><b>Difficulty Levels</b></p>
    <p>🐢 Use the general parameters for an arbitrary finetune</p>
    <p>🐕 Adjust the parameters and review in Weights & Biases.</p>
    <p>🦁 Adjust the parameters and show change in inference results.</p>
</div>

In [6]:
from peft import LoraConfig

# TODO: Configure LoRA parameters
# r: rank dimension for LoRA update matrices (smaller = more compression)
rank_dimension = 6
# lora_alpha: scaling factor for LoRA layers (higher = stronger adaptation)
lora_alpha = 8
# lora_dropout: dropout probability for LoRA layers (helps prevent overfitting)
lora_dropout = 0.05

peft_config = LoraConfig(
    r=rank_dimension,  # Rank dimension - typically between 4-32
    lora_alpha=lora_alpha,  # LoRA scaling factor - typically 2x rank
    lora_dropout=lora_dropout,  # Dropout probability for LoRA layers
    bias="none",  # Bias handling: "none" keeps all bias terms frozen
    target_modules="all-linear",  # Which modules to apply LoRA to
    task_type="CAUSAL_LM",  # Task type for model architecture
)
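
To show how `r` and `lora_alpha` interact, here is a minimal pure-Python sketch of a LoRA forward pass for one linear layer, assuming the standard scaling factor `lora_alpha / r`. With the values configured above, that factor would be 8 / 6, about 1.33. The toy matrices and helper names are made up for illustration and are not PEFT's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x) for one input vector."""
    scale = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Toy 2x2 layer with a rank-1 adapter (all values are made up).
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights
A = [[1.0, 1.0]]              # r x d_in, trainable
B = [[0.5], [0.0]]            # d_out x r, trainable
x = [2.0, 3.0]

print(lora_forward(W, A, B, x, lora_alpha=2, r=1))
```

When `B` is all zeros, the adapter contributes nothing and the layer behaves exactly like the frozen base layer.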

Before we can start training, we need to define the hyperparameters we want to use. They are passed through `SFTConfig`, which extends `TrainingArguments`.

In [18]:
# Training configuration
# Hyperparameters based on QLoRA paper recommendations
args = SFTConfig(
    # Output settings
    output_dir=finetune_name,  # Directory to save model checkpoints
    # Training duration
    num_train_epochs=1,  # Number of training epochs
    # Batch size settings
    per_device_train_batch_size=2,  # Batch size per GPU
    gradient_accumulation_steps=2,  # Accumulate gradients for larger effective batch
    # Memory optimization
    gradient_checkpointing=True,  # Trade compute for memory savings
    # Optimizer settings
    optim="adamw_torch_fused",  # Use fused AdamW for efficiency
    learning_rate=2e-4,  # Learning rate (QLoRA paper)
    max_grad_norm=0.3,  # Gradient clipping threshold
    # Learning rate schedule
    warmup_ratio=0.03,  # Portion of steps for warmup
    lr_scheduler_type="constant",  # Keep learning rate constant after warmup
    # Logging and saving
    logging_steps=10,  # Log metrics every N steps
    save_strategy="epoch",  # Save checkpoint every epoch
    # Precision settings
    bf16=True,  # Use bfloat16 precision
    # Integration settings
    push_to_hub=False,  # Don't push to HuggingFace Hub
    report_to="none",  # Disable external logging
    max_length=1512,  # Maximum sequence length in tokens
)
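
The batch and warmup settings above translate into a concrete number of optimizer steps. The sketch below is a rough estimate assuming a single device; the trainer's own rounding may differ slightly, and the helper name `training_schedule` is hypothetical.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, epochs,
                      warmup_ratio, num_devices=1):
    """Estimate effective batch size, optimizer steps, and warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


# Values from the configuration above; 2260 is the size of the train split.
effective_batch, total_steps, warmup_steps = training_schedule(
    num_examples=2260, per_device_batch=2, grad_accum=2, epochs=1, warmup_ratio=0.03
)
print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {total_steps}")
print(f"warmup steps:         {warmup_steps}")
```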

In [24]:
# Create SFTTrainer with LoRA configuration
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    peft_config=peft_config,  # LoRA configuration
    processing_class=tokenizer,
)


Start training by calling the `train()` method on the `trainer` instance. This runs the training loop for 1 epoch. Because we are using a PEFT method, only the adapter weights are saved, not the full model.

In [25]:
# Start training; checkpoints are written to the output directory (push_to_hub is False)
trainer.train()

# save model
trainer.save_model()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|------|---------------|
| 10   | 2.6197 |
| 20   | 2.3036 |
| 30   | 2.0268 |
| 40   | 1.7957 |
| 50   | 1.5654 |
| 60   | 1.5409 |
| 70   | 1.4792 |
| 80   | 1.4271 |
| 90   | 1.3855 |
| 100  | 1.3836 |
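
One way to read these numbers: the training loss is a mean per-token cross-entropy, so its exponential gives the perplexity. The short sketch below applies that conversion to the first and last logged values, assuming the loss is measured in nats (natural logarithm).

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# First and last logged training losses from the table above.
for step, loss in [(10, 2.6197), (100, 1.3836)]:
    print(f"step {step}: loss={loss:.4f}, perplexity={perplexity(loss):.2f}")
```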


## Merge LoRA Adapter into the Original Model

When training with LoRA, we train only the adapter weights and keep the base model frozen. During training, only these adapter weights are saved rather than a full copy of the model. For deployment, however, you may want to merge the adapters back into the base model for:

1. **Simplified deployment**: a single model instead of base model + adapters
2. **Inference speed**: no overhead from loading or applying the adapter
3. **Framework compatibility**: better integration with deployment frameworks
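
Conceptually, merging folds the low-rank update into the base weights, so that a single matrix produces the same output as the base path plus the adapter path. The pure-Python sketch below illustrates that arithmetic under the standard `lora_alpha / r` scaling; the toy values and helper names are made up, and this is not PEFT's implementation.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A) as a new matrix."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy values, made up for illustration.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights
A = [[1.0, 1.0]]              # r x d_in
B = [[0.5], [0.0]]            # d_out x r
x = [[2.0], [3.0]]            # input as a column vector

W_merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(W_merged)
print(matmul(W_merged, x))
```

After merging, `A` and `B` are no longer needed at inference time, which is why the merged model can be served like any ordinary checkpoint.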


In [26]:
from peft import AutoPeftModelForCausalLM

# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=args.output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    args.output_dir, safe_serialization=True, max_shard_size="2GB"
)

# 4. Testing the model and running inference

After training, it is time to test the model. We run a few sample prompts through the model in a simple loop and compare the generated responses side by side.



<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Bonus Exercise: Load LoRA Adapter</h2>
    <p>Use what you learnt from the example notebook to load your trained LoRA adapter for inference.</p>
</div>

In [27]:
# free the memory again
del model
del trainer
torch.cuda.empty_cache()

In [31]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load Model with PEFT adapter
tokenizer = AutoTokenizer.from_pretrained(finetune_name)
model = AutoPeftModelForCausalLM.from_pretrained(
    finetune_name, device_map="auto", torch_dtype=torch.float16
)

pipe = pipeline(
    "text-generation", model=merged_model, tokenizer=tokenizer, device=device
)

Device set to use cuda


In [33]:
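# NOTE: `model` is the adapter-loaded model from the previous cell, so this
# pipeline is not a true baseline. For a real "not fine-tuned" comparison,
# load `model_name` separately.
pipe_not_finetuned = pipeline(
    "text-generation", model=model, tokenizer=tokenizer
)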

Device set to use cuda:0


Testing a few prompts

In [34]:
prompts = [
    "What is the capital of Germany? Explain why thats the case and if it was different in the past?",
    "Write a Python function to calculate the factorial of a number.",
    "A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?",
    "What is the difference between a fruit and a vegetable? Give examples of each.",
]

def test_inference(prompt):
    print("-" * 50)

    prompt = pipe.tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )

    print(f"    prompt:\n{prompt}")

    print("-" * 50)
    outputs_not_finetuned = pipe_not_finetuned(
        prompt,
    )[0]["generated_text"][len(prompt) :].strip()
    print(f"    response (not fine-tuned):\n{outputs_not_finetuned}")
    print("-" * 50)
    outputs = pipe(
        prompt,
    )[0]["generated_text"][len(prompt) :].strip()
    print(f"    response (fine-tuned):\n{outputs}")
    print("-" * 50)

    return


for prompt in prompts:
    test_inference(prompt)

--------------------------------------------------
    prompt:
<|im_start|>user
What is the capital of Germany? Explain why thats the case and if it was different in the past?<|im_end|>
<|im_start|>assistant

--------------------------------------------------
    response (not fine-tuned):
The capital is Berlin which is located on the south coast of Germany. It is situated in the state of Brandenburg.

queeze
queezeassistant
You can find the capital of Germany at the city of Berlin. It is located on the south coast of Germany.

queeze
queezeassistant
Thats correct. Berlin is a very important city located on the south coast of Germany.PlaneProtection

queeze
queezeassistant
Thats correct. Berlin is a very important city located on the south coast of Germany.PlaneProtection

queeze
queezeassistant
Thats correct. Berlin is a very important city located on the south coast of Germany.

queeze
queezeassistant
Thats correct. Berlin is a very important city located on the south coast of German