# **Fine-tuning of SLMs**

- SFT of an SLM (TinyLlama)
- DPO on top of SFT

José Ochoa Luna - Departamento de Computación - Universidad Católica San Pablo

Based on: https://github.com/HandsOnLLM/Hands-On-Large-Language-Models/tree/main/chapter12

Required installation

In [None]:
%%capture
!pip install -q accelerate==0.31.0 peft==0.11.1 bitsandbytes==0.43.1 transformers==4.41.2 trl==0.9.4 sentencepiece==0.2.0 triton==3.1.0

Main libraries

In [None]:
import torch
from datasets import load_dataset
# BitsAndBytesConfig is needed for quantization
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


# 1. Supervised Fine-Tuning (SFT)

## Data Processing

In [None]:
# Load a tokenizer to use its chat template
template_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")


Prompt format used by TinyLlama, applied to the examples that will be used for fine-tuning.

In [None]:
def format_prompt(example):
    """Format a conversation with the <|user|> chat template that TinyLlama uses."""

    # Format answers
    chat = example["messages"]
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)

    return {"text": prompt}



Data used for SFT (prompts and completions). Only 3,000 examples are selected, drawn at random from the test_sft split.

In [None]:
# Load and format the data using the template TinyLLama is using
dataset = (
    load_dataset("HuggingFaceH4/ultrachat_200k",  split="test_sft")
      .shuffle(seed=42)
      .select(range(3_000))
)
dataset = dataset.map(format_prompt)


Example of a prompt rendered in the TinyLlama format.

In [None]:
# Example of formatted prompt
print(dataset["text"][2576])

<|user|>
Given the text: Knock, knock. Who’s there? Hike.
Can you continue the joke based on the given text material "Knock, knock. Who’s there? Hike"?</s>
<|assistant|>
Sure! Knock, knock. Who's there? Hike. Hike who? Hike up your pants, it's cold outside!</s>
<|user|>
Can you tell me another knock-knock joke based on the same text material "Knock, knock. Who's there? Hike"?</s>
<|assistant|>
Of course! Knock, knock. Who's there? Hike. Hike who? Hike your way over here and let's go for a walk!</s>
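The layout of the printed example can be approximated by hand. The sketch below is inferred only from that output; it is not the tokenizer's actual template, and `render_chat` is a hypothetical helper introduced for illustration.

```python
def render_chat(messages):
    """Approximate the chat layout seen in the printed example.

    Assumption: every turn is rendered as a role tag, a newline, the
    content, the </s> end-of-sequence marker, and a final newline.
    The real template ships with the tokenizer and may differ in details.
    """
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}</s>\n")
    return "".join(parts)


# A made-up two-turn conversation in the same structure as example["messages"].
chat = [
    {"role": "user", "content": "Knock, knock."},
    {"role": "assistant", "content": "Who's there?"},
]
print(render_chat(chat))
```

In practice `apply_chat_template` should be preferred, since it stays in sync with whatever template the model was trained on.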



## Quantization Process (QLoRA)

In [None]:

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

# 4-bit quantization configuration - Q in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use 4-bit precision model loading
    bnb_4bit_quant_type="nf4",  # Quantization type
    bnb_4bit_compute_dtype="float16",  # Compute dtype
    bnb_4bit_use_double_quant=True,  # Apply nested quantization
)




This quantization procedure reduces the size of the original model while preserving most of the precision of the original weights. Loading the model now takes only about 1 GB of VRAM, compared with the roughly 4 GB it would need without quantization.
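A back-of-the-envelope estimate shows where these numbers come from. This is a rough sketch that counts weight storage only; the parameter count of 1.1 billion is an assumption taken from the model name, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 1.1e9  # assumption: roughly 1.1 billion parameters

fp32_gb = weight_memory_gb(NUM_PARAMS, 32)  # unquantized 32-bit weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"32-bit weights: {fp32_gb:.2f} GB")
print(f"4-bit weights:  {nf4_gb:.2f} GB")
```

The gap between the 4-bit estimate and the roughly 1 GB quoted above is plausibly due to quantization constants, layers kept in higher precision, and runtime overhead. That breakdown is an assumption, not a measurement.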

In [None]:
# Load the model to train on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",

    # Leave this out for regular SFT
    quantization_config=bnb_config,
)
model.config.use_cache = False
model.config.pretraining_tp = 1




In [None]:
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=False)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"


## Configuration

**LoRA Configuration**

The LoRA configuration is defined with the peft library; it holds the hyperparameters of the fine-tuning process:

In [None]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# Prepare LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=32,  # LoRA Scaling
    lora_dropout=0.1,  # Dropout for LoRA Layers
    r=64,  # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=  # Layers to target
     ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

- r: the rank of the compressed matrices. Increasing this value also increases the size of the compressed matrices, which results in less compression. Values typically range from 4 to 64.

- lora_alpha: controls the amount of change added to the original weights. In essence, it balances the knowledge of the original model against that of the new task.

- target_modules: defines which layers are updated. LoRA can skip specific layers, such as particular projection layers. Skipping layers can speed up training but reduce performance, and vice versa.
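The effect of r and lora_alpha can be made concrete with a little arithmetic. This is a minimal sketch: the hidden size `d` is a hypothetical value chosen for illustration, `lora_params` is a hypothetical helper, and the scaling factor lora_alpha / r follows the standard LoRA convention.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA learns two small matrices: A with shape (r, d_in) and
    B with shape (d_out, r).
    """
    return r * (d_in + d_out)


d = 2048                 # hypothetical hidden size, for illustration only
r, lora_alpha = 64, 32   # values from the LoraConfig above

full = d * d                      # weights in one square projection matrix
added = lora_params(d, d, r)      # weights LoRA trains instead
scaling = lora_alpha / r          # factor applied to the LoRA update

print(f"full matrix: {full:,} weights")
print(f"LoRA adds:   {added:,} trainable weights ({added / full:.1%})")
print(f"scaling:     {scaling}")
```

A lower r shrinks the adapter further, while a higher lora_alpha relative to r gives the adapter more influence over the frozen weights.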

**Training Configuration**

In [None]:
from transformers import TrainingArguments

output_dir = "./results"

# Training arguments
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True,
    report_to="none"
)

- num_train_epochs: the total number of training epochs. High values tend to degrade performance, so low values are generally preferred.
- learning_rate: determines the step size of each weight update. The QLoRA authors halved the learning rate for their largest models (33B parameters and above), so a small model such as this one can use a comparatively high value.
- lr_scheduler_type: a cosine-based scheduler that adjusts the learning rate dynamically.
- optim: the paged optimizer used in the original QLoRA paper.
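These settings determine how many optimizer steps one epoch takes and how the learning rate evolves. The sketch below assumes a single GPU and a plain cosine decay without warmup; `cosine_lr` is a hypothetical helper, not the scheduler implementation used by the library.

```python
import math

num_examples = 3_000
per_device_batch = 2
grad_accum = 4
epochs = 1

# Assumption: one GPU, so the effective batch is batch size x accumulation steps.
effective_batch = per_device_batch * grad_accum
total_steps = math.ceil(num_examples / effective_batch) * epochs


def cosine_lr(step, total, peak_lr=2e-4):
    """Cosine decay from peak_lr down to 0 (simplified, no warmup)."""
    progress = step / total
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print("optimizer steps:", total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

The computed step count agrees with the global_step=375 reported in the training output of this notebook.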

**Training**

In [None]:
from trl import SFTTrainer

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    max_seq_length=512,

    # Leave this out for regular SFT
    peft_config=peft_config,
)

# Train model
trainer.train()






Step,Training Loss
10,1.6701
20,1.4761
30,1.4512
40,1.4883
50,1.4778
60,1.3907
70,1.4949
80,1.4502
90,1.4274
100,1.4041


TrainOutput(global_step=375, training_loss=1.4169020805358887, metrics={'train_runtime': 1434.8629, 'train_samples_per_second': 2.091, 'train_steps_per_second': 0.261, 'total_flos': 9994755938844672.0, 'train_loss': 1.4169020805358887, 'epoch': 1.0})

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
url ="/content/drive/My Drive/Colab Notebooks/LASAI2025/"
#url ="/content/drive/My Drive/Colab Notebooks/"

In [None]:
# Save QLoRA weights
trainer.model.save_pretrained(url+"TinyLlama-1.1B-qlora")



**Merge Adapter**

After training the QLoRA weights, they still need to be combined with the original weights before they can be used. The model is reloaded in higher precision, instead of the quantized 4 bits, to merge the weights.

In [None]:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    url+"TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
)

# Merge LoRA and base model
merged_model = model.merge_and_unload()
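Conceptually, merging folds the low-rank update into the frozen weight matrix. The toy sketch below illustrates the standard LoRA merge rule W + (lora_alpha / r) * B @ A on plain Python lists; the matrices are made up, and `matmul` and `merge_lora` are hypothetical helpers, not the peft implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with a rank-1 adapter; all numbers are made up.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # shape (r, d_in)  = (1, 2)
B = [[0.5], [1.0]]     # shape (d_out, r) = (2, 1)

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

After the merge the adapter matrices are no longer needed at inference time, which is why the merged model can be used like a regular model.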

**Inference**

In [None]:
from transformers import pipeline

# Use our predefined prompt template
prompt = """<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""

# Run our instruction-tuned model
pipe = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])

<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
Large Language Models (LLMs) are a type of artificial intelligence (AI) that can generate human-like language. They are trained on large amounts of data, including text, audio, and video, and are capable of generating complex and nuanced language.

LLMs are used in a variety of applications, including natural language processing (NLP), machine translation, and chatbots. They can be used to generate text, speech, or images, and can be trained to understand different languages and dialects.

One of the most significant applications of LLMs is in the field of natural language generation (NLG). LLMs can be used to generate text in a variety of languages, including English, French, and German. They can also be used to generate speech, such as in chatbots or voice assistants.

LLMs have also been used in the field of machine translation (MT). LLMs can be trained to translate between different languages, and can be used

The model now follows instructions.

In [None]:

# Use our predefined prompt template
prompt = """<|user|>
Dime algo sobre el Perú.</s>
<|assistant|>
"""

# Run our instruction-tuned model
pipe = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])

<|user|>
Dime algo sobre el Perú.</s>
<|assistant|>
Peru is a country located in South America. It is bordered by Bolivia, Brazil, Colombia, Chile, and Ecuador. The capital city is Lima, which is located on the Pacific coast. The country has a diverse geography, with mountains, deserts, and rainforests. The main language is Spanish, but there are also indigenous languages spoken in the country. Peru is known for its culture, cuisine, and history.


# 2. Preference Tuning (DPO)

Preference tuning is surprisingly similar to SFT, with a few slight differences. We keep using TinyLlama, this time starting from the instruction-tuned model produced in the SFT step above (the base model plus its QLoRA adapter).

This section tests how to align that model further using DPO with a reward-based dataset.

## Data Processing

We use a dataset that contains, for each prompt, one accepted generation and one rejected generation. The dataset was generated in part by ChatGPT, with scores indicating which outputs should be accepted and which should be rejected.

In [None]:
from datasets import load_dataset

def format_prompt(example):
    """Format a preference example with the <|system|>/<|user|> template that TinyLlama uses."""

    # Format answers
    system = "<|system|>\n" + example['system'] + "</s>\n"
    prompt = "<|user|>\n" + example['input'] + "</s>\n<|assistant|>\n"
    chosen = example['chosen'] + "</s>\n"
    rejected = example['rejected'] + "</s>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

# Load the dataset and keep clear preferences only: no ties, chosen_score >= 8, not in the GSM8K train set
dpo_dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
dpo_dataset = dpo_dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)
dpo_dataset = dpo_dataset.map(format_prompt, remove_columns=dpo_dataset.column_names)
dpo_dataset


Dataset({
    features: ['chosen', 'rejected', 'prompt'],
    num_rows: 5922
})
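To see what one formatted record looks like, the same string logic can be applied to a made-up example. This is an illustrative sketch: `format_dpo_example` is a hypothetical copy of the formatting function above, repeated so the snippet runs on its own, and the `toy` record is invented.

```python
def format_dpo_example(example):
    """Same string logic as format_prompt above, repeated to be self-contained."""
    system = "<|system|>\n" + example["system"] + "</s>\n"
    prompt = "<|user|>\n" + example["input"] + "</s>\n<|assistant|>\n"
    return {
        "prompt": system + prompt,
        "chosen": example["chosen"] + "</s>\n",
        "rejected": example["rejected"] + "</s>\n",
    }


# A made-up record with the four fields the function reads.
toy = {
    "system": "You are a helpful assistant.",
    "input": "What is 2 + 2?",
    "chosen": "2 + 2 equals 4.",
    "rejected": "5",
}

formatted = format_dpo_example(toy)
print(formatted["prompt"] + formatted["chosen"])
```

The prompt ends with the assistant tag, so the chosen and rejected strings are both continuations of exactly the same context.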

**Model Quantization**

We load the base model together with the LoRA adapter created earlier.
As before, the model is quantized to reduce the VRAM needed for training:

In [None]:
from peft import AutoPeftModelForCausalLM
from transformers import BitsAndBytesConfig, AutoTokenizer

# 4-bit quantization configuration - Q in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use 4-bit precision model loading
    bnb_4bit_quant_type="nf4",  # Quantization type
    bnb_4bit_compute_dtype="float16",  # Compute dtype
    bnb_4bit_use_double_quant=True,  # Apply nested quantization
)

# Merge LoRA and base model
model = AutoPeftModelForCausalLM.from_pretrained(
    url+"TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
    quantization_config=bnb_config,
)
merged_model = model.merge_and_unload()

# Load LLaMA tokenizer
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=False)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"



**Configuration**

The same LoRA configuration as before is used for DPO training:

In [None]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# Prepare LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=32,  # LoRA Scaling
    lora_dropout=0.1,  # Dropout for LoRA Layers
    r=64,  # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=  # Layers to target
     ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

In [None]:
from trl import DPOConfig

output_dir = "./results"

# Training arguments
training_arguments = DPOConfig(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    max_steps=200,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True,
    warmup_ratio=0.1,
    report_to="none"
)

The training arguments largely match the earlier ones, with a lower learning_rate (1e-5) and two further changes. Instead of running a single epoch (which can take up to two hours), we run 200 steps as a demonstration. In addition, we add the warmup_ratio parameter, which increases the learning rate from 0 to the configured learning_rate value over the first 10% of the steps. Keeping the learning rate low at the start (the warmup period) lets the model adjust to the data before larger learning rates are applied, which avoids harmful divergence.
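The warmup behavior can be sketched as a small function. This is a simplified illustration of linear warmup followed by cosine decay, assuming the values from the configuration above; `lr_with_warmup` is a hypothetical helper, not the scheduler used internally by the trainer.

```python
import math


def lr_with_warmup(step, max_steps=200, peak_lr=1e-5, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to 0 (simplified sketch)."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: follow half a cosine wave from peak_lr down to 0.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 10, 20, 110, 200):
    print(step, f"{lr_with_warmup(step):.2e}")
```

With 200 steps and a ratio of 0.1, the first 20 steps are warmup and the peak learning rate is reached at step 20.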

In [None]:
from trl import DPOTrainer

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model,
    args=training_arguments,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=512,
    max_length=512,
)

# Fine-tune model with DPO
dpo_trainer.train()






Step,Training Loss
10,0.6924
20,0.6782
30,0.646
40,0.6069
50,0.595
60,0.6169
70,0.5934
80,0.532
90,0.5596
100,0.6396


TrainOutput(global_step=200, training_loss=0.6045445656776428, metrics={'train_runtime': 1297.9211, 'train_samples_per_second': 1.233, 'train_steps_per_second': 0.154, 'total_flos': 0.0, 'train_loss': 0.6045445656776428, 'epoch': 0.2701789935832489})
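The first logged loss of 0.6924 is close to ln 2 ≈ 0.6931, which is what the DPO objective gives when the trained policy and the reference model still agree. The sketch below illustrates this for a single preference pair under the standard DPO formulation; the log-probabilities are made up, and `dpo_loss` is a hypothetical helper, not the trl implementation.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # log(1 + exp(-x)) is the same as -log(sigmoid(x)).
    return math.log1p(math.exp(-beta * margin))


# Start of training: the policy equals the reference, so the margin is 0.
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Later: the policy favors the chosen answer more than the reference does.
later = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(f"start: {start:.4f}")
print(f"later: {later:.4f}")
```

The beta value scales how strongly deviations from the reference model are rewarded or penalized; 0.1 is the value passed to the trainer above.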

In [None]:
# Save adapter
dpo_trainer.model.save_pretrained(url+"TinyLlama-1.1B-dpo-qlora")

A second adapter has now been created. The two adapters are merged one after the other: first the SFT adapter into the base model, then the DPO adapter into the resulting SFT model:

In [None]:
from peft import AutoPeftModelForCausalLM, PeftModel

# Merge LoRA and base model
model = AutoPeftModelForCausalLM.from_pretrained(
    url+"TinyLlama-1.1B-qlora",
    low_cpu_mem_usage=True,
    device_map="auto",
)
sft_model = model.merge_and_unload()

# Merge DPO LoRA and SFT model
dpo_model = PeftModel.from_pretrained(
    sft_model,
    url+"TinyLlama-1.1B-dpo-qlora",
    device_map="auto",
)
dpo_model = dpo_model.merge_and_unload()

In [None]:
from transformers import pipeline

# Use our predefined prompt template
prompt = """<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""

# Run our instruction-tuned model
pipe = pipeline(task="text-generation", model=dpo_model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])

<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
Large Language Models (LLMs) are a type of artificial intelligence (AI) that can generate human-like language. They are trained on large amounts of data, including text, audio, and video, and are capable of generating complex and nuanced language.

LLMs are used in a variety of applications, including natural language processing (NLP), machine translation, and chatbots. They can be used to generate text, speech, or images, and can be trained to understand different languages and dialects.

One of the most significant applications of LLMs is in the field of natural language generation (NLG). LLMs can be used to generate text in a variety of languages, including English, French, and German. They can also be used to generate speech, such as in chatbots or voice assistants.

LLMs have also been used in the field of machine translation (MT). LLMs can be trained to translate between different languages, and can be used

In [None]:
# Use our predefined prompt template
prompt = """<|user|>
Dime algo sobre el Peru.</s>
<|assistant|>
"""

# Run our instruction-tuned model
pipe = pipeline(task="text-generation", model=dpo_model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])

<|user|>
Dime algo sobre el Peru.</s>
<|assistant|>
Peru is a country located in South America. It is bordered by Bolivia, Brazil, Colombia, Chile, Ecuador, and Mexico. The capital city is Lima, which is located on the Pacific coast. The country has a diverse geography, with mountains, deserts, and rainforests. The main language is Spanish, but there are also indigenous languages spoken in the country. The main religion is Catholicism, but there are also other religions such as Buddhism and Hinduism. The climate is tropical, with hot and humid summers and cool and dry winters. The main crops are corn, potatoes, and bananas. The main exports are textiles, coffee, and silver. The main tourist attractions are the ruins of the Inca Empire, the Amazon rainforest, and the Nazca lines.


- This SFT+DPO combination is an effective way to first fine-tune the model for basic conversation and then align its responses with human preferences.

- It comes at a cost, however: two training runs are needed, and hyperparameters may have to be tuned in both.

- Since the release of DPO, new preference-alignment methods have been developed. A notable one is Odds Ratio Preference Optimization (ORPO), which combines SFT and DPO into a single training process. It removes the need for two separate training runs, which simplifies training further, and it can also be used with QLoRA.
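To give a feel for how ORPO folds both objectives into one, here is a rough sketch of its loss for a single preference pair: a supervised term on the chosen answer plus a weighted odds-ratio term. It assumes length-normalized log-probabilities as inputs; the weight `lam`, the example values, and the helper names are illustrative assumptions, not the trl implementation.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)), where p = exp(avg_logp) is a length-normalized likelihood."""
    p = math.exp(avg_logp)
    return math.log(p) - math.log(1.0 - p)


def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """SFT term on the chosen answer plus a weighted odds-ratio penalty."""
    sft_term = -chosen_avg_logp
    ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    or_term = math.log1p(math.exp(-ratio))  # -log(sigmoid(ratio))
    return sft_term + lam * or_term


# Made-up values: the first model prefers the chosen answer, the second the rejected one.
good = orpo_loss(chosen_avg_logp=-0.5, rejected_avg_logp=-2.0)
bad = orpo_loss(chosen_avg_logp=-2.0, rejected_avg_logp=-0.5)
print(f"{good:.4f} {bad:.4f}")
```

Because the odds-ratio term only compares the model's own likelihoods, no separate reference model is required in this formulation.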