# PEFT - Quicktour

This guide walks through the main steps for running PEFT on Google Colab.

Base guide: https://huggingface.co/docs/peft/quicktour


## Installing Libraries

In [1]:
!pip install datasets
!pip install evaluate
!pip install sacrebleu


## Training

Each PEFT method is defined by a `PeftConfig` class that stores all the important parameters for building a `PeftModel`. For example, to train with LoRA, import and instantiate a `LoraConfig` and specify the following parameters:

* task_type: the task to train for (causal language modeling in this case, `TaskType.CAUSAL_LM`)
* inference_mode: whether the model will be used for inference
* r: the rank (inner dimension) of the low-rank matrices
* lora_alpha: the scaling factor for the low-rank matrices
* lora_dropout: the dropout probability of the LoRA layers
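
To make `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of the update LoRA adds to a frozen linear layer, h = W x + (lora_alpha / r) B A x. The helper names (`matvec`, `lora_forward`) and the tiny matrices are illustrative assumptions and not part of PEFT; dropout is left out.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, r, lora_alpha):
    """Frozen base output plus the scaled low-rank update (no dropout)."""
    base = matvec(W, x)                 # W is d_out x d_in and stays frozen
    low_rank = matvec(B, matvec(A, x))  # A is r x d_in, B is d_out x r
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]


# Toy layer with d_in = 3, d_out = 2 and rank r = 1 (hypothetical values).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]
B_zero = [[0.0], [0.0]]       # B starts at zero, so the initial update is zero
B_trained = [[1.0], [-1.0]]   # stand-in for B after some training
x = [1.0, 2.0, 3.0]

out_init = lora_forward(x, W, A, B_zero, r=1, lora_alpha=4)
out_trained = lora_forward(x, W, A, B_trained, r=1, lora_alpha=4)
print(out_init, out_trained)
```

With `r=8` and `lora_alpha=32`, as in the configuration used in this guide, the scale factor `lora_alpha / r` is 4.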


In [2]:
from peft import LoraConfig, TaskType

peft_config = LoraConfig(task_type=TaskType.CAUSAL_LM,
                         inference_mode=False,
                         r=8,
                         lora_alpha=32,
                         lora_dropout=0.1)

In [3]:
model_name = "Qwen/Qwen2.5-1.5B-Instruct"

Note: see the [LoraConfig](https://huggingface.co/docs/peft/v0.15.0/en/package_reference/lora#peft.LoraConfig) reference for more details.

In [4]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_name)

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.



In [5]:
from peft import get_peft_model

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 1,089,536 || all params: 1,544,803,840 || trainable%: 0.0705
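
The trainable count printed above can be reproduced by hand. The sketch below assumes the architecture values from the published Qwen2.5-1.5B config (hidden size 1536, 28 layers, 2 key/value heads of dimension 128) and assumes LoRA is attached only to `q_proj` and `v_proj`. Neither assumption is stated in this notebook, though the result matches the printed figure.

```python
def lora_params(d_in, d_out, r):
    """A LoRA adapter adds A (r x d_in) and B (d_out x r) to one linear layer."""
    return r * (d_in + d_out)


r = 8                 # rank from the LoraConfig above
hidden_size = 1536    # assumed from the model config
num_layers = 28       # assumed from the model config
kv_dim = 2 * 128      # assumed: 2 key/value heads of dimension 128

per_layer = (
    lora_params(hidden_size, hidden_size, r)   # q_proj: 1536 -> 1536
    + lora_params(hidden_size, kv_dim, r)      # v_proj: 1536 -> 256
)
trainable = per_layer * num_layers
total = 1_544_803_840  # "all params" as printed by print_trainable_parameters()

print(f"{trainable:,} trainable, {100 * trainable / total:.4f}% of all params")
```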

## Loading Data

In [6]:
from datasets import load_dataset

raw_datasets = load_dataset("kde4", lang1="en", lang2="pt")

The repository for kde4 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/kde4.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



In [7]:
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)

In [8]:
split_datasets["validation"] = split_datasets.pop("test")

In [9]:
split_datasets["train"][1]["translation"]

{'en': 'The Line Numbers Pane', 'pt': 'A Área de Números de Linha'}

In [10]:
from transformers import AutoTokenizer

# return_tensors is a per-call argument, so it is not passed when loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          padding_side="left")


In [11]:
en_sentence = split_datasets["train"][1]["translation"]["en"]
pt_sentence = split_datasets["train"][1]["translation"]["pt"]

inputs = tokenizer(en_sentence, text_target=pt_sentence)
inputs

{'input_ids': [785, 7083, 34713, 98542], 'attention_mask': [1, 1, 1, 1], 'labels': [32, 143733, 409, 451, 63918, 409, 8564, 4223]}

In [12]:
inputs_tgt = tokenizer(pt_sentence)
print(tokenizer.convert_ids_to_tokens(inputs_tgt["input_ids"]))
print(tokenizer.decode(inputs_tgt["input_ids"]))


['A', 'ĠÃģrea', 'Ġde', 'ĠN', 'Ãºmeros', 'Ġde', 'ĠLin', 'ha']
A Área de Números de Linha
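
The odd-looking tokens such as `ĠÃģrea` come from byte-level BPE: every UTF-8 byte is mapped to a printable character before merges are applied. The sketch below rebuilds the GPT-2 style byte-to-character table and inverts it. That this tokenizer uses the same table is an assumption, although it is consistent with the output above.

```python
def bytes_to_unicode():
    """GPT-2 style table mapping each byte value (0-255) to a printable character."""
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    for b in range(256):
        if b not in keep:
            # Non-printable bytes (space included) are moved to code points 256 and up
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


char_to_byte = {c: b for b, c in bytes_to_unicode().items()}


def tokens_to_text(tokens):
    """Map every token character back to its byte, then decode the bytes as UTF-8."""
    raw = bytes(char_to_byte[ch] for ch in "".join(tokens))
    return raw.decode("utf-8")


tokens = ['A', 'ĠÃģrea', 'Ġde', 'ĠN', 'Ãºmeros', 'Ġde', 'ĠLin', 'ha']
text = tokens_to_text(tokens)
print(text)
```

Here `Ġ` stands for the space byte, and `Ãģ` is the two-byte UTF-8 encoding of `Á`.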


In [13]:
max_length = 256


def preprocess_function(examples):
    # Instruction prefix prepended to every English source sentence
    prefix = "translate: English to Portuguese: "
    inputs = [prefix + ex["en"] for ex in examples["translation"]]
    targets = [ex["pt"] for ex in examples["translation"]]
    # text_target tokenizes the targets separately and stores them under "labels"
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs
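
Note that `text_target` tokenizes the target as a separate sequence, so `labels` can differ in length from `input_ids` (4 versus 8 tokens in the earlier example). A decoder-only model computes its loss on labels aligned position by position with `input_ids`, so one common alternative is to concatenate prompt and target and mask the prompt positions with -100. The sketch below shows that idea on plain lists. `build_causal_example` and the end-of-sequence id are hypothetical, and this is not what `preprocess_function` does.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def build_causal_example(prompt_ids, target_ids, eos_id, max_length=256):
    """Concatenate prompt and target; only target tokens contribute to the loss."""
    input_ids = (prompt_ids + target_ids + [eos_id])[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + target_ids + [eos_id])[:max_length]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


# Token ids taken from the tokenizer output shown earlier; eos_id=0 is a placeholder.
prompt_ids = [785, 7083, 34713, 98542]
target_ids = [32, 143733, 409, 451, 63918, 409, 8564, 4223]
example = build_causal_example(prompt_ids, target_ids, eos_id=0)
print(example["labels"])
```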

In [14]:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)


In [15]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 206476
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 22942
    })
})
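
The row counts above follow from the 90/10 split. As a quick check, assuming the sizes are computed the way scikit-learn does it (floor of `train_size * n` for the training set, the remainder for the held-out set):

```python
import math

n_total = 206_476 + 22_942          # rows in the original "train" split
n_train = math.floor(0.9 * n_total)  # train_size=0.9
n_validation = n_total - n_train     # whatever is left becomes validation

print(n_train, n_validation)
```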

In [32]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
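
With `mlm=False`, the collator pads each batch to a common length and builds `labels` as a copy of `input_ids` with padding positions replaced by -100. The sketch below mimics that behavior on plain lists. `collate_causal` is a hypothetical helper, not the library implementation.

```python
def collate_causal(batch, pad_id, padding_side="left"):
    """Pad a batch of token-id lists and derive causal-LM labels from the inputs."""
    width = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        pad = [pad_id] * (width - len(ids))
        if padding_side == "left":
            row = pad + ids
            mask = [0] * len(pad) + [1] * len(ids)
        else:
            row = ids + pad
            mask = [1] * len(ids) + [0] * len(pad)
        input_ids.append(row)
        attention_mask.append(mask)
        # Every position holding the pad id is excluded from the loss
        labels.append([-100 if tok == pad_id else tok for tok in row])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_causal([[11, 12, 13], [21]], pad_id=0)
print(batch)
```

Because the pad token is set to the EOS token here, genuine EOS tokens in a sequence are masked out of the loss as well.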

In [33]:
import evaluate

metric = evaluate.load("sacrebleu")

In [18]:
import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
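
`compute_metrics` decodes `preds` as token ids. When predictions arrive as raw logits (one score per vocabulary entry at each position), an argmax is needed first. Below is a minimal sketch of that step and of the -100 replacement, using plain lists, hypothetical helper names, and a toy vocabulary of size 3.

```python
def logits_to_ids(logits):
    """Pick the highest-scoring vocabulary id at every position."""
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in sequence]
        for sequence in logits
    ]


def unmask_labels(labels, pad_id):
    """Replace -100 with the pad id so the labels can be decoded."""
    return [[pad_id if tok == -100 else tok for tok in sequence] for sequence in labels]


# One sequence, two positions, three vocabulary entries (toy values).
logits = [[[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]]]
pred_ids = logits_to_ids(logits)
clean_labels = unmask_labels([[-100, 2]], pad_id=0)
print(pred_ids, clean_labels)
```

In `Trainer`, a tensor version of the argmax step can be supplied through the `preprocess_logits_for_metrics` argument, which receives the logits and labels before they are accumulated for `compute_metrics`.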

In [19]:
from huggingface_hub import notebook_login

notebook_login()


In [34]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="danhsf/Qwen2.5-1.5B-en-to-pt",
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=3,
    num_train_epochs=3,
    fp16=True,
    push_to_hub=False,
    report_to="none"  # the string "none" disables logging integrations; None does not
)
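
For a sense of scale, the number of optimizer steps implied by these arguments can be estimated as below. The estimate assumes a single device, no gradient accumulation, and that the last partial batch is kept, none of which is stated in the notebook.

```python
import math

train_rows = 206_476               # size of the tokenized training split
per_device_train_batch_size = 32
num_train_epochs = 3

steps_per_epoch = math.ceil(train_rows / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)
```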

In [35]:
from transformers import Trainer


In [38]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    processing_class=tokenizer,  # replaces the deprecated `tokenizer` argument
    data_collator=data_collator,
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [43]:
#trainer.train()