# Fine_Tuning_LLM

### Summary of the Fine-Tuning Process

This notebook fine-tunes pre-trained language models on an English-to-Spanish translation task (Seq2Seq). The workflow has the following key stages:

1. **Data Preparation:** Load and split the dataset (`eng.csv`), tokenize it, and prepend the task prefix (`"translate: "`) to every source sentence to guide the model.
2. **Training (T5-small):** Fine-tune an *Encoder-Decoder* model with TensorFlow/Keras for text generation.
3. **Performance Evaluation:** Compute, monitor, and plot the **ROUGE** metric (ROUGE-1, ROUGE-2, and ROUGE-L) across epochs to measure the n-gram and longest-common-subsequence overlap between the generated translations and the reference texts.
4. **Architecture Analysis:** Explore preprocessing with *Encoder-Only* models (such as DistilBERT and RoBERTa), illustrating conceptually the difference between sequence-classification architectures and text-generation architectures.
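
To make the ROUGE scores from step 3 more concrete, here is a minimal pure-Python sketch of ROUGE-1, ROUGE-2, and ROUGE-L F1 for a single prediction/reference pair. It assumes plain whitespace tokenization and no stemming, so its numbers will not necessarily match the `evaluate` implementation used later in the notebook. All function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams contained in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    """F1 over clipped n-gram matches (ROUGE-1 for n=1, ROUGE-2 for n=2)."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # '&' keeps the minimum count per n-gram
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """F1 based on the longest common subsequence of tokens (ROUGE-L)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


# Hypothetical pair, chosen only to illustrate the arithmetic.
prediction = "es verano es agradable ir a la playa"
reference = "es verano y es agradable ir a la playa"
print(round(rouge_n(prediction, reference, 1), 4))
print(round(rouge_n(prediction, reference, 2), 4))
print(round(rouge_l(prediction, reference), 4))
```

ROUGE-2 is usually the strictest of the three because it requires adjacent word pairs to match, while ROUGE-L only requires the words to appear in the same order.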

### Project Examples and Parameters

The following examples and settings were used for preprocessing, training, and testing the models in this notebook:

**1. Test Sentence (Inference):**
After training, the generative model (T5) was tested on the following input string:
* **Input:** `"translate: it's summer it is nice to go to the beach"`
* **Expected result:** A Spanish translation generated by the model (e.g. *"es verano, es agradable ir a la playa"*).

**2. Task Prefix:**
T5 works in a text-to-text format in which a short prefix at the start of the input tells the model which task to perform. The base prefix was corrected so that the fine-tuning inputs match the actual task:
* **Original prefix (incorrect):** `"summarize: "`
* **Corrected prefix (used):** `"translate: "`
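
As a rough illustration of the prefix step, the sketch below prepends the task prefix to every source sentence in a batch, mirroring the list comprehension used in `preprocess_function`. The helper name `add_task_prefix` and the sample batch are hypothetical.

```python
PREFIX = "translate: "


def add_task_prefix(batch, source_column="engl", prefix=PREFIX):
    """Prepend the task prefix to every source sentence of a column-oriented batch."""
    # str() guards against non-string cells, e.g. numbers read from a CSV.
    return [prefix + str(text) for text in batch[source_column]]


# Hypothetical batch in the column-oriented layout that a batched map passes in.
batch = {
    "engl": ["Good morning.", "Where is the station?"],
    "spa": ["Buenos días.", "¿Dónde está la estación?"],
}
print(add_task_prefix(batch))
```

The same prefix has to be used at inference time, since the model only sees the task through the text it was fine-tuned on.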

**3. Models and Tokenizers Explored:**
* **Text Generation (Seq2Seq):** `t5-small` (used for the main training run and the ROUGE computation).
* **Classification/Extraction (Encoder-Only):** `distilbert-base-uncased` (used as a practical case study for preprocessing and tokenization with the new prefix).

**4. Dataset:**
* **File:** `eng.csv` (full version, to maximize vocabulary coverage and improve the quality of the *fine-tuning*).
* **Columns used:** `engl` (English) and `spa` (Spanish).

In [None]:
!pip install datasets

In [None]:
pip install rouge_score

Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install evaluate

Note: you may need to restart the kernel to use updated packages.


In [None]:
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, TFAutoModelForSeq2SeqLM, AdamWeightDecay, pipeline
from datasets import Dataset
import tensorflow as tf
import evaluate
import numpy as np
import os

In [None]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


In [None]:
folder_path = r"/kaggle/input/datasets/bryanyamacruz/eng-small"
dataset_name = "eng.csv"
path = os.path.join(folder_path, dataset_name)
print(path)
data = Dataset.from_csv(path, encoding='utf-8')
data = data.train_test_split(test_size=0.1)
print(data)

/kaggle/input/datasets/bryanyamacruz/eng-small/eng.csv


DatasetDict({
    train: Dataset({
        features: ['engl', 'spa'],
        num_rows: 115275
    })
    test: Dataset({
        features: ['engl', 'spa'],
        num_rows: 12809
    })
})
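
The row counts printed above can be reproduced by hand. This is a small sketch, assuming the split rounds the test share up and the train share down (the convention that matches the numbers shown); the total of 128084 rows is simply 115275 + 12809.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the train share rounded down.
    """
    n_test = math.ceil(test_size * n_rows)
    n_train = math.floor((1 - test_size) * n_rows)
    return n_train, n_test


print(split_sizes(128084, 0.1))  # (115275, 12809)
```

Because `train_test_split` shuffles by default, the rows that land in each split change between runs unless a `seed` is passed.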


In [None]:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

# 1. Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# 2. Load the model with the 'use_safetensors=False' workaround
model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small", use_safetensors=False)

# THIS LINE WILL RAISE AN ERROR unless export_and_get_onnx_model is defined.
# If you do not need to export to ONNX, delete it or leave it commented out:
# model = export_and_get_onnx_model('t5-small')

prefix = "translate: "

def preprocess_function(examples):
    # Cast each entry to str so that non-string cells in 'engl' do not break the concatenation
    inputs = [prefix + str(doc) for doc in examples["engl"]]

    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=examples["spa"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [None]:
tokenized_data = data.map(preprocess_function, batched=True, remove_columns=["engl", "spa"])

In [None]:
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
# The learning rate was previously 2e-5 (with weight decay 1e-2); values between 1e-4 and 3e-4 typically work well for most problems.
optimizer = AdamWeightDecay(learning_rate=2e-4, weight_decay_rate=0.01)

Downloading builder script: 0.00B [00:00, ?B/s]
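
The two list-level steps inside `compute_metrics` (turning the `-100` label padding back into the pad token before decoding, and counting non-pad tokens per prediction) can be sketched without NumPy. The token ids below are made up for illustration, and the pad id of 0 is assumed to be the one used by the T5 tokenizer.

```python
IGNORE_INDEX = -100  # value the data collator writes into padded label positions
PAD_TOKEN_ID = 0     # assumption: pad token id of the T5 tokenizer


def restore_padding(label_rows, pad_id=PAD_TOKEN_ID):
    """Replace IGNORE_INDEX with the pad id so the rows can be decoded."""
    return [[pad_id if tok == IGNORE_INDEX else tok for tok in row] for row in label_rows]


def generated_lengths(prediction_rows, pad_id=PAD_TOKEN_ID):
    """Count the non-pad tokens in each predicted row."""
    return [sum(1 for tok in row if tok != pad_id) for row in prediction_rows]


# Hypothetical token ids, not taken from the notebook's data.
labels = [[71, 12, 1, -100, -100], [5, 9, 33, 7, 1]]
predictions = [[71, 12, 1, 0, 0], [5, 9, 1, 0, 0]]

print(restore_padding(labels))
print(generated_lengths(predictions))
```

Skipping the first step would hand negative ids to the tokenizer's decoder, which is why the replacement happens before `batch_decode`.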

In [None]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_data["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
tf_test_set = model.prepare_tf_dataset(
    tokenized_data["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
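
For a sense of how much work one epoch involves, the number of batches can be estimated from the row counts shown earlier. This sketch assumes that the final partial batch is dropped for the shuffled training set and kept for the unshuffled test set; that behavior is an assumption and should be checked against the installed `transformers` version.

```python
import math


def steps_per_epoch(n_examples, batch_size, drop_remainder):
    """Number of batches needed to cover n_examples once."""
    if drop_remainder:
        return n_examples // batch_size
    return math.ceil(n_examples / batch_size)


# Row counts from the split printed earlier; batch size as in the cell above.
train_steps = steps_per_epoch(115275, 16, drop_remainder=True)
test_steps = steps_per_epoch(12809, 16, drop_remainder=False)
print(train_steps, test_steps)  # 7204 801
```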

In [None]:
epochs = 5
model.compile(optimizer=optimizer)

In [None]:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=epochs, callbacks=None)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tf_keras.src.callbacks.History at 0x7e0633509c70>

In [None]:
# Save the trained model (save_pretrained writes a directory, so the '.h5' suffix below is just part of the folder name)
folder_path = 'model'
model_name = "NMT-epocs-" + str(epochs)
path = os.path.join(folder_path, model_name + ".h5")
model.save_pretrained(path)
del model

In [None]:
# Start here to run inference only.
model_name = "NMT-epocs-" + str(epochs)
path = os.path.join(folder_path, model_name + ".h5")

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = TFAutoModelForSeq2SeqLM.from_pretrained(path, pad_token_id=tokenizer.eos_token_id)

summarizer = pipeline("translation",
    model=model,
    tokenizer=tokenizer,
    framework="tf")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at model/NMT-epocs-5.h5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.
Device set to use 0


In [None]:
import timeit
start_time = timeit.default_timer()

text = "I really enjoy programming, and I hope to learn more about DevOps."
print(summarizer(text, min_length=4, max_length=100))

elapsed = timeit.default_timer() - start_time
print(f"time: {round(elapsed,2)} seconds")

[{'translation_text': 'Realmente disfruto la programación y espero aprender más acerca de DevOps.'}]
time: 7.92 seconds
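
A single timed call may include one-off setup work, so it can overstate the steady-state latency. As a hedged sketch, the helper below times several consecutive calls separately; `time_calls` and `fake_translate` are hypothetical names, and the stand-in function only exists so that the example runs without a model.

```python
import timeit


def time_calls(fn, *args, repeats=3, **kwargs):
    """Return the elapsed seconds of each of `repeats` consecutive calls."""
    timings = []
    for _ in range(repeats):
        start = timeit.default_timer()
        fn(*args, **kwargs)
        timings.append(timeit.default_timer() - start)
    return timings


def fake_translate(text):
    """Stand-in for the pipeline call; it does no real translation."""
    return text.upper()


timings = time_calls(fake_translate, "I really enjoy programming.", repeats=3)
print([round(t, 6) for t in timings])
```

With the real pipeline, comparing the first timing against the later ones shows how much of the measured time is warm-up.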


# Exercise

- This time, use the larger dataset (eng.csv), use the same sentence, and observe the results.
- Modify the code to plot and report the ROUGE metric (*)


In [None]:
# 1. INSTALLATION AND SETUP FOR KAGGLE
!pip install datasets rouge_score nltk evaluate tf-keras transformers sentencepiece -q
import os
os.environ["TF_USE_LEGACY_KERAS"] = "1"

import tensorflow as tf
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, TFAutoModelForSeq2SeqLM, AdamWeightDecay, pipeline
from transformers.keras_callbacks import KerasMetricCallback
from datasets import Dataset
import evaluate
import numpy as np
import matplotlib.pyplot as plt
import timeit

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

# ==========================================
# 2. LOAD THE LARGE DATASET (eng.csv)
# ==========================================
folder_path = r"/kaggle/input/datasets/bryanyamacruz/eng-small"
dataset_name = "eng.csv"
path = os.path.join(folder_path, dataset_name)
print("Cargando datos desde:", path)

data = Dataset.from_csv(path, encoding='utf-8')
data = data.train_test_split(test_size=0.1)
print(data)

# ==========================================
# 3. LOAD THE MODEL AND TOKENIZER (T5)
# ==========================================
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small", use_safetensors=False)

# ==========================================
# 4. PREPROCESSING
# ==========================================
prefix = "translate: "
def preprocess_function(examples):
    inputs = [prefix + str(doc) for doc in examples["engl"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=examples["spa"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_data = data.map(preprocess_function, batched=True, remove_columns=["engl", "spa"])

# ==========================================
# 5. METRIC SETUP (ROUGE)
# ==========================================
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}

# ==========================================
# 6. TRAINING AND CALLBACK
# ==========================================
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
optimizer = AdamWeightDecay(learning_rate=2e-4, weight_decay_rate=0.01)

tf_train_set = model.prepare_tf_dataset(
    tokenized_data["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
tf_test_set = model.prepare_tf_dataset(
    tokenized_data["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

model.compile(optimizer=optimizer)

metric_callback = KerasMetricCallback(
    metric_fn=compute_metrics,
    eval_dataset=tf_test_set,
    predict_with_generate=True,
    use_xla_generation=True
)

epochs = 5
print("Iniciando entrenamiento...")
history = model.fit(
    x=tf_train_set,
    validation_data=tf_test_set,
    epochs=epochs,
    callbacks=[metric_callback]
)

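# ==========================================
# 7. PLOT THE ROUGE RESULTS
# ==========================================
# KerasMetricCallback writes the keys returned by compute_metrics into the
# epoch logs as-is, so they are stored without a 'val_' prefix.
rouge1 = history.history.get('rouge1', [])
rouge2 = history.history.get('rouge2', [])
rougeL = history.history.get('rougeL', [])

if rouge1:
    # Use the number of recorded values so the x-axis always matches the data
    epoch_axis = range(1, len(rouge1) + 1)
    plt.figure(figsize=(10, 6))
    plt.plot(epoch_axis, rouge1, marker='o', label='ROUGE-1')
    plt.plot(epoch_axis, rouge2, marker='s', label='ROUGE-2')
    plt.plot(epoch_axis, rougeL, marker='^', label='ROUGE-L')

    plt.title('ROUGE Evolution - T5 Model')
    plt.xlabel('Epochs')
    plt.ylabel('ROUGE score')
    plt.xticks(epoch_axis)
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend()
    plt.show()
else:
    # Report the available keys instead of silently drawing an empty figure
    print("No ROUGE values found in history. Available keys:", list(history.history.keys()))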

# ==========================================
# 8. FINAL TRANSLATION TEST
# ==========================================
translator = pipeline("translation", model=model, tokenizer=tokenizer, framework="tf")
start_time = timeit.default_timer()

text = "translate: it's summer it is nice to go to the beach"
resultado = translator(text, min_length=4, max_length=100)

elapsed = timeit.default_timer() - start_time
print(f"\nTraducción: {resultado}")
print(f"Tiempo: {round(elapsed,2)} segundos")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Num GPUs Available:  1
Cargando datos desde: /kaggle/input/datasets/bryanyamacruz/eng-small/eng.csv
DatasetDict({
    train: Dataset({
        features: ['engl', 'spa'],
        num_rows: 115275
    })
    test: Dataset({
        features: ['engl', 'spa'],
        num_rows: 12809
    })
})


All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Iniciando entrenamiento...
Epoch 1/5

Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Device set to use 0



Traducción: [{'translation_text': 'En inglés: es verano es agradable ir a la playa!'}]
Tiempo: 6.18 segundos
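
This cell's figure output reports `0 Axes`, which suggests that the ROUGE lookups in the history returned nothing. As a small sketch, the helper below looks for a metric under its plain name and under a `val_` prefix, assuming the metric callback may store scores under the names returned by `compute_metrics`. The dictionary is placeholder data, not results from this run, and `find_metric_series` is a hypothetical name.

```python
def find_metric_series(history, name):
    """Return the per-epoch values stored under `name` or `val_<name>`, else []."""
    for key in (name, f"val_{name}"):
        values = history.get(key)
        if values:
            return list(values)
    return []


# Placeholder values only, shaped like the dict in `history.history`.
fake_history = {"loss": [1.0, 0.8], "val_loss": [0.9, 0.85], "rouge1": [0.1, 0.2]}

curves = {m: find_metric_series(fake_history, m) for m in ("rouge1", "rouge2", "rougeL")}
print(curves)
```

Printing `history.history.keys()` after training is a quick way to confirm which names are actually present before plotting.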


<Figure size 1000x600 with 0 Axes>

In [None]:
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["engl"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=examples["spa"], max_length=128, truncation=True) #max length was 128
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")

prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["engl"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=examples["spa"], max_length=128, truncation=True) #max length was 128
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["engl"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=examples["spa"], max_length=128, truncation=True) #max length was 128
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import ElectraTokenizer, ElectraModel

tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
mymodel = ElectraModel.from_pretrained("google/electra-small-discriminator")

prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["engl"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=examples["spa"], max_length=128, truncation=True) #max length was 128
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

### Exercise

- Change summarization to translation
- Plot the ROUGE metric


In [None]:
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
import matplotlib.pyplot as plt

# 1. Load the TensorFlow version of the model
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model_distil = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", use_safetensors=False)

# 2. EXERCISE SOLUTION: change "summarize: " to "translate: "
# The prefix holds only the task tag; the sentence to translate comes from the dataset.
prefix = "translate: "

def preprocess_function_distil(examples):
    inputs = [prefix + str(doc) for doc in examples["engl"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=examples["spa"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply to the data
tokenized_data_distil = data.map(preprocess_function_distil, batched=True, remove_columns=["engl", "spa"])

print("Preprocesamiento con DistilBert completado usando el prefijo 'translate:'")

# Note: training DistilBert for translation and plotting ROUGE would require an Encoder-Decoder architecture
# (such as T5 or MarianMT). DistilBert is "Encoder-only", so it is not the standard choice for Seq2Seq.
# The conceptual part of the exercise, changing the prefix and structuring the function, is nevertheless solved here.

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'activation_13', 'vocab_projector', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'dropout_489', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Preprocesamiento con DistilBert completado usando el prefijo 'translate:'


Preprocessing was completed and the prefix was changed to 'translate:'. However, training and *ROUGE* evaluation were not carried out, because models such as DistilBERT, ALBERT, and RoBERTa are 'Encoder-only' architectures. They are designed for tasks such as classification and have no decoder to generate output text, so they are not suited to Sequence-to-Sequence (Seq2Seq) text-generation tasks such as translation.
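
To illustrate what the decoder contributes, here is a toy sketch of the greedy, token-by-token loop that a generation model runs. The lookup table stands in for a learned decoder and is entirely hypothetical. An encoder-only model produces representations of its input and has no such loop.

```python
EOS = "</s>"

# Toy stand-in for a decoder: maps the last generated token to the next one.
TOY_NEXT_TOKEN = {"<start>": "es", "es": "verano", "verano": EOS}


def toy_next_token(tokens):
    """Pick the next token from the lookup table; stop on unknown context."""
    return TOY_NEXT_TOKEN.get(tokens[-1], EOS)


def greedy_decode(next_token_fn, max_new_tokens=10):
    """Generate tokens one at a time until EOS or the length limit is reached."""
    tokens = ["<start>"]
    for _ in range(max_new_tokens):
        nxt = next_token_fn(tokens)
        if nxt == EOS:
            break
        tokens.append(nxt)
    return tokens[1:]  # drop the start marker


print(greedy_decode(toy_next_token))  # ['es', 'verano']
```

Each new token depends on the ones already produced, which is the behavior a classification head on top of an encoder does not provide.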