<a href="https://colab.research.google.com/github/cbadenes/curso-pln/blob/main/notebooks/proyecto_apoyo/03_EntrenamientoModelos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Required Dependencies

In [None]:
# Install the required dependencies
!pip install transformers
!pip install -U datasets
!pip install torch

In [None]:
import torch

# Check whether a GPU is available before asking for its name
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))


True
NVIDIA A100-SXM4-40GB


#Text Generation (AutoModelForCausalLM)

##Before Fine-Tuning

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import Dataset
import torch

# Dataset with continuous text about Estopa (in Spanish)
datos = [
    {"text": "Estopa es un dúo musical español formado por los hermanos David y José Manuel Muñoz. La banda se fundó en 1999 en Cornellà de Llobregat, Barcelona. Su estilo musical combina flamenco, rock y rumba catalana, creando un sonido único que los ha llevado a la fama."},
    {"text": "El primer álbum de Estopa, titulado 'Estopa', se lanzó en  1999 y fue un éxito inmediato, vendiendo más de un millón de copias. Canciones como 'La raja de tu falda' y 'Como Camarón' se convirtieron en clásicos."},
    {"text": "A lo largo de su carrera, Estopa ha lanzado más de 10 discos de estudio, manteniendo su característico estilo y evolucionando con nuevas influencias. Su álbum 'Destrangis' consolidó aún más su éxito con canciones como 'Vino tinto'."},
    {"text": "Estopa ha ganado numerosos premios, incluidos los Premios Ondas y los 40 Principales, que reconocen su contribución a la música española. Sus conciertos son conocidos por su energía y conexión con el público."},
    {"text": "La ciudad natal de los hermanos, Cornellà de Llobregat, influyó profundamente en su música. La mezcla cultural y las tradiciones flamencas del lugar se reflejan en sus letras y melodías."},
    {"text": "Además de su música, Estopa es conocido por sus letras llenas de humor y referencias cotidianas. Estas características los han hecho destacar y conectar con una audiencia amplia y diversa."},
    {"text": "Estopa continúa siendo una de las bandas más queridas en España, manteniendo su esencia mientras exploran nuevas direcciones en su música. Su legado perdurará como un símbolo de creatividad y autenticidad en la música española."}
]

# Create a Hugging Face-compatible Dataset
dataset = Dataset.from_list(datos)

# Load the pretrained model and tokenizer
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
modelo = AutoModelForCausalLM.from_pretrained(model_name)
tokenizador = AutoTokenizer.from_pretrained(model_name)

# Reuse the EOS token as the padding token
tokenizador.pad_token = tokenizador.eos_token
modelo.config.pad_token_id = tokenizador.eos_token_id

# Tokenize the dataset
def procesar_datos(ejemplo):
    tokenizado = tokenizador(
        ejemplo["text"],              # Input text to tokenize
        max_length=128,               # Maximum length of the tokenized sequence
        truncation=True,              # Truncate the text if it exceeds max_length
        padding="max_length",         # Pad up to max_length (the pad token here is EOS)
        return_tensors="pt"           # Return PyTorch tensors (can also be "tf" or "np")
    )

    # Drop the leading batch dimension of size 1 added by return_tensors="pt"
    return {key: tensor.squeeze(0) for key, tensor in tokenizado.items()}

dataset_procesado = dataset.map(procesar_datos)

# Use DataCollatorForLanguageModeling (it builds the labels and ignores padding positions in the loss)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizador, mlm=False  # mlm=False because this is causal, not masked, language modeling
)

# Training configuration
argumentos = TrainingArguments(
    output_dir="./resultados",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    save_total_limit=1,
    logging_steps=10,
    report_to="none"  # Disable reporting integrations such as W&B
)

# Keep the embedding matrix in sync with the tokenizer size. No new token was added here
# (the pad token reuses EOS), so this call is a safeguard rather than a real resize.
modelo.resize_token_embeddings(len(tokenizador))

# Create the Trainer
trainer = Trainer(
    model=modelo,
    args=argumentos,
    data_collator=data_collator,
    train_dataset=dataset_procesado,
    tokenizer=tokenizador,
    eval_dataset=dataset_procesado
)

def generar_texto(prompt, modelo, tokenizador, max_length=100):
    """
    Generate text with the current model.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    inputs = tokenizador(prompt, return_tensors="pt").to(device)
    modelo.to(device)

    with torch.no_grad():
        output = modelo.generate(**inputs, max_length=max_length, do_sample=True, top_k=50, top_p=0.95)

    return tokenizador.decode(output[0], skip_special_tokens=True)

# Test before fine-tuning
prompt_test = "Estopa es una banda española conocida por"
print("\n🔹 **Text generation BEFORE fine-tuning:**")
print(generar_texto(prompt_test, modelo, tokenizador))

Estopa es una banda española conocida por su estilo hard rock y su sonido punk. Formada en 1982 en Oviedo por Estel Mena y Juan Pablo Estel. Entre 1987 y 1993 estuvo integrada por Estel Mena, el bajista de la banda, que pasaría luego a formar la banda Maná.

Historia 

Estoló fue
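
The `generate` call in this cell samples with `top_k=50` and `top_p=0.95`. As a rough illustration of what those two filters do to a next-token distribution, here is a minimal pure-Python sketch. The toy vocabulary, its probabilities, and the helper name `filter_top_k_top_p` are assumptions made for this example; the library itself works on logit tensors and its handling of ties and edge cases may differ.

```python
import random

def filter_top_k_top_p(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    # Step 1 (top-k): rank tokens by probability and keep at most top_k of them.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    ranked = [(token, p / mass) for token, p in ranked]  # renormalize after top-k

    # Step 2 (top-p / nucleus): keep tokens until their cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they form a valid distribution again.
    mass = sum(p for _, p in kept)
    return {token: p / mass for token, p in kept}

# Hypothetical next-token distribution (not taken from the model).
toy_probs = {"rumba": 0.5, "rock": 0.3, "punk": 0.15, "jazz": 0.05}
filtered = filter_top_k_top_p(toy_probs, top_k=3, top_p=0.8)
print(filtered)

# Sampling then draws only from the filtered distribution.
rng = random.Random(0)
print(rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0])
```

Lower values of either parameter shrink the candidate set, which usually makes the output less varied.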


##Training

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.699319
2,No log,0.28886
3,1.191700,0.172364


TrainOutput(global_step=12, training_loss=1.026526893178622, metrics={'train_runtime': 42.1278, 'train_samples_per_second': 0.498, 'train_steps_per_second': 0.285, 'total_flos': 16684615729152.0, 'train_loss': 1.026526893178622, 'epoch': 3.0})
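
The `global_step=12` value and the "No log" entries can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (the defaults for the arguments used here); the helper name `optimizer_steps` is made up for this example.

```python
import math

def optimizer_steps(num_examples, batch_size, epochs):
    """Steps per epoch and total optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# 7 training texts, per_device_train_batch_size=2, num_train_epochs=3
steps_per_epoch, total_steps = optimizer_steps(num_examples=7, batch_size=2, epochs=3)
print(steps_per_epoch, total_steps)

# With logging_steps=10, the training loss is first logged at step 10,
# which falls inside the third epoch (steps 9 to 12).
logging_steps = 10
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
print(first_logged_epoch)
```

This matches the table: the first two evaluations happen before any training loss has been logged, so they show "No log".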

##After Fine-Tuning

In [None]:
print("\n **Text generation AFTER fine-tuning:**")
print(generar_texto(prompt_test, modelo, tokenizador))


 **Text generation AFTER fine-tuning:**
Estopa es una banda española conocida por sus letras llenas de humor y referencias cotidianas. Comenzaron a sonar en los años 90, siendo un de los grupos más queridos y queridos en España. Su legado perdurará como un símbolo de creatividad y autenticidad en la música española. Su álbum 'Qué bueno' se llevó a una raja de éxitos in


#Text Classification (AutoModelForSequenceClassification)

##Before Fine-Tuning

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report
import torch

# Pretrained tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Positive and Negative

# EVALUATION
texts = ["Amazing movie!", "Terrible plot.", "Loved the characters!", "Not my taste."]
true_labels = [1, 0, 1, 0]  # Ground-truth labels

# Predictions
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model(**inputs)
predicted_classes = torch.argmax(outputs.logits, dim=1).tolist()

# Compute metrics
accuracy = accuracy_score(true_labels, predicted_classes)
print("Accuracy:", accuracy)
print("Classification report:")
print(classification_report(true_labels, predicted_classes, target_names=["Negative", "Positive"]))


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Accuracy: 0.25
Classification report:
              precision    recall  f1-score   support

    Negative       0.33      0.50      0.40         2
    Positive       0.00      0.00      0.00         2

    accuracy                           0.25         4
   macro avg       0.17      0.25      0.20         4
weighted avg       0.17      0.25      0.20         4
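
The figures in this report can be reproduced by hand from the four gold labels and the classes the untrained model predicted in this run (`[0, 1, 0, 0]`). The following is a small sketch in plain Python; the helper name `precision_recall_f1` is an assumption for this example and it is not how scikit-learn is implemented internally.

```python
true_labels = [1, 0, 1, 0]
predicted = [0, 1, 0, 0]  # predicted classes reported for this run

def precision_recall_f1(true, pred, label):
    """Per-class precision, recall and F1, returning 0.0 when a denominator is zero."""
    tp = sum(t == label and p == label for t, p in zip(true, pred))
    predicted_as_label = sum(p == label for p in pred)
    actual_label = sum(t == label for t in true)
    precision = tp / predicted_as_label if predicted_as_label else 0.0
    recall = tp / actual_label if actual_label else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
print("accuracy", accuracy)

for name, label in [("Negative", 0), ("Positive", 1)]:
    p, r, f = precision_recall_f1(true_labels, predicted, label)
    print(name, round(p, 2), round(r, 2), round(f, 2))
```

Only the last example is classified correctly, which gives the 0.25 accuracy; with four examples, each one moves the score by 25 points, so these numbers are illustrative rather than a real benchmark.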



In [None]:
print("PREDICTED CLASSES:", predicted_classes)

PREDICTED CLASSES: [0, 1, 0, 0]


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Test text
text = "This movie is amazing, I loved it!"

# Tokenization and prediction
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Move the inputs to the same device as the model (GPU if available)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model(**inputs)
logits = outputs.logits  # unnormalized predictions, i.e. the values before applying a function such as softmax
predicted_class = torch.argmax(logits, dim=1).item()

# Show the result
label_map = {0: "Negative", 1: "Positive"}  # Change to match your model's labels
print(f"Text: {text}")
print(f"Prediction: {label_map[predicted_class]}")

Text: This movie is amazing, I loved it!
Prediction: Negative
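
The cell above takes the `argmax` of the logits directly. To read the same logits as probabilities, softmax can be applied first. Here is a minimal sketch in plain Python; the logit values are invented for illustration and are not the ones produced by the model.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [0.31, 0.12]  # hypothetical logits for [Negative, Positive]
probs = softmax(example_logits)
print(probs)

# Softmax preserves the ordering, so argmax over probabilities equals argmax over logits.
predicted = max(range(len(probs)), key=lambda i: probs[i])
print(predicted)
```

When the two logits are this close, the probabilities stay near 50/50, which is a useful reminder that a prediction from a freshly initialized classification head carries little information.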


##Fine-Tuning with the IMDb Dataset

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding
from datasets import load_dataset

# Load the IMDb dataset (sentiment analysis)
dataset = load_dataset("stanfordnlp/imdb")
# Pretrained tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Positive and Negative

# Preprocess the data
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=False)

tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Use a data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    report_to="none"
)

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized_datasets["test"].shuffle(seed=42).select(range(500)),
    tokenizer=tokenizer,
    data_collator=data_collator
)
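
The cell above tokenizes with `padding=False` and leaves padding to `DataCollatorWithPadding`, so each batch is padded only to its own longest sequence. The idea can be sketched in plain Python as follows; the helper `pad_batch` and the token ids are hypothetical, and the pad id of 0 is an assumption that matches BERT's `[PAD]` token.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for three reviews of different lengths.
batch = [[101, 2307, 3185, 102], [101, 6659, 102], [101, 2025, 2026, 5510, 102]]
padded = pad_batch(batch)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

Compared with padding every review to a fixed `max_length`, this keeps batches of short texts small, which usually saves memory and compute.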




In [None]:
# Train the model
trainer.train()

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Test text
text = "This movie is amazing, I loved it!"

# Tokenization and prediction
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Move the inputs to the same device as the model (GPU if available)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model(**inputs)
logits = outputs.logits  # unnormalized predictions, i.e. the values before applying a function such as softmax
predicted_class = torch.argmax(logits, dim=1).item()

# Show the result
label_map = {0: "Negative", 1: "Positive"}  # Change to match your model's labels
print(f"Text: {text}")
print(f"Prediction: {label_map[predicted_class]}")


Text: This movie is amazing, I loved it!
Prediction: Positive


In [None]:
from sklearn.metrics import accuracy_score, classification_report

# EVALUATION
texts = ["Amazing movie!", "Terrible plot.", "Loved the characters!", "Not my taste."]
true_labels = [1, 0, 1, 0]  # Ground-truth labels

# Predictions
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model(**inputs)
predicted_classes = torch.argmax(outputs.logits, dim=1).tolist()

# Compute metrics
accuracy = accuracy_score(true_labels, predicted_classes)
print("Accuracy:", accuracy)
print("Classification report:")
print(classification_report(true_labels, predicted_classes, target_names=["Negative", "Positive"]))


Accuracy: 1.0
Classification report:
              precision    recall  f1-score   support

    Negative       1.00      1.00      1.00         2
    Positive       1.00      1.00      1.00         2

    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4



#Named Entity Recognition (AutoModelForTokenClassification)

##Before Training

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments, DataCollatorForTokenClassification
from datasets import load_dataset
import torch

# Load the CoNLL-2003 dataset
#-- CoNLL-2003 is a widely used dataset for Named Entity Recognition (NER) in Natural Language Processing.
#-- It was introduced for the CoNLL-2003 shared task; CoNLL is organized by SIGNLL, a special interest group of the Association for Computational Linguistics (ACL).
dataset = load_dataset("conll2003", trust_remote_code=True)

# Pretrained tokenizer and model
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(dataset["train"].features["ner_tags"].feature.names))

# Preprocess the data
def preprocess_function(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = [-100 if word_id is None else label[word_id] for word_id in word_ids]
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Use a data collator that pads inputs and labels to the same length
data_collator = DataCollatorForTokenClassification(tokenizer)

# Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    report_to="none"
)

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized_datasets["validation"].shuffle(seed=42).select(range(500)),
    tokenizer=tokenizer,
    data_collator=data_collator
)

# Test text
text = "My name is Wolfgang and I live in Berlin"

# Tokenization and prediction
inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model(**inputs)
logits = outputs.logits
predictions = torch.argmax(logits, dim=2)

# Get the predicted labels
predicted_labels = [dataset["train"].features["ner_tags"].feature.names[p] for p in predictions[0].tolist()]

# Print the tokens and their predicted labels
print("Tokens:", tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
print("Predicted labels:", predicted_labels)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Tokens: ['[CLS]', 'My', 'name', 'is', 'Wolfgang', 'and', 'I', 'live', 'in', 'Berlin', '[SEP]']
Predicted labels: ['I-ORG', 'I-ORG', 'I-MISC', 'I-ORG', 'I-MISC', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-MISC']




##Training

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.149108
2,No log,0.124064
3,No log,0.100066


TrainOutput(global_step=375, training_loss=0.16891073608398438, metrics={'train_runtime': 29.0765, 'train_samples_per_second': 206.352, 'train_steps_per_second': 12.897, 'total_flos': 152435476445472.0, 'train_loss': 0.16891073608398438, 'epoch': 3.0})
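
The training run above relies on the label alignment done in `preprocess_function`: word-level tags are copied to every sub-word token, and special tokens get `-100` so the loss skips them. A minimal sketch of that step in plain Python follows. The `word_ids` list and the sentence are hypothetical, and the tag ids assume the usual CoNLL-2003 mapping (`O`=0, `B-PER`=1, `B-LOC`=5).

```python
IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch's cross-entropy loss

def align_labels(word_ids, word_labels):
    """Copy each word's label to all of its sub-word tokens; mask special tokens."""
    return [IGNORE_INDEX if w is None else word_labels[w] for w in word_ids]

# Hypothetical example: "Wolfgang lives in Berlin", with "Wolfgang" split into two sub-words.
word_labels = [1, 0, 0, 5]              # B-PER, O, O, B-LOC (assumed CoNLL-2003 tag ids)
word_ids = [None, 0, 0, 1, 2, 3, None]  # None marks [CLS] and [SEP]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

One consequence of this simple scheme is that both pieces of a split word receive the `B-` tag. A common alternative is to label only the first sub-word and set the rest to `-100`.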

##After Fine-Tuning

In [22]:
import torch

# Test text
text = "My name is Wolfgang and I live in Berlin"

# Tokenization and prediction
inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model(**inputs)
logits = outputs.logits
predictions = torch.argmax(logits, dim=2)

# Get the predicted labels
predicted_labels = [dataset["train"].features["ner_tags"].feature.names[p] for p in predictions[0].tolist()]

# Print the tokens and their predicted labels
print("Tokens:", tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
print("Predicted labels:", predicted_labels)

Tokens: ['[CLS]', 'My', 'name', 'is', 'Wolfgang', 'and', 'I', 'live', 'in', 'Berlin', '[SEP]']
Predicted labels: ['O', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
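
A per-token tag list like the one above is usually turned into entity spans before being used. The following is a minimal sketch of that grouping in plain Python, applied to the tokens and tags from this run. The helper name `decode_entities` is an assumption, and joining tokens with a space is a simplification that does not merge `##` word pieces.

```python
def decode_entities(tokens, labels):
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    entities, current_tokens, current_type = [], [], None
    for token, label in zip(tokens, labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-"):
            current_tokens.append(token)  # continue the open entity
        else:
            if current_tokens:            # "O" closes any open entity
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ['[CLS]', 'My', 'name', 'is', 'Wolfgang', 'and', 'I', 'live', 'in', 'Berlin', '[SEP]']
labels = ['O', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
print(decode_entities(tokens, labels))
```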


#Question Answering (AutoModelForQuestionAnswering)

##Before Fine-Tuning

In [57]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments, DataCollatorWithPadding
from datasets import load_dataset
import torch

# Load the SQuAD v2 dataset
#-- SQuAD v2 (Stanford Question Answering Dataset, version 2.0) is a highly influential Question Answering (QA) dataset for extractive question answering over text; unlike v1.1, it also includes unanswerable questions.
dataset = load_dataset("squad_v2")

# Pretrained tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Preprocess the data
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        truncation=True,
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding=True
    )
    sample_mapping = inputs.pop("overflow_to_sample_mapping")
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []
    for i, offsets in enumerate(offset_mapping):
        input_ids = inputs["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        sequence_ids = inputs.sequence_ids(i)
        sample_index = sample_mapping[i]
        answer = answers[sample_index]
        if len(answer["answer_start"]) == 0:
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        else:
            start_char = answer["answer_start"][0]
            end_char = start_char + len(answer["text"][0])
            token_start_index = 0
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1
            if offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char:
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                start_positions.append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                end_positions.append(token_end_index + 1)
            else:
                start_positions.append(cls_index)
                end_positions.append(cls_index)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)

# Use DataCollatorWithPadding; max_length is intended to avoid out-of-memory errors
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding='longest', max_length=512)
# Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    report_to="none"
)

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(6000)),
    eval_dataset=tokenized_datasets["validation"].shuffle(seed=42).select(range(750)),
    tokenizer=tokenizer,
    data_collator=data_collator
)

# Example question and context
context = """Badajoz is a city located in the autonomous community of Extremadura, in southwestern Spain, near the Portuguese border.
It is the largest city in the region by population and surface area. Founded during the Moorish period, Badajoz preserves
important historical remains, such as its massive Alcazaba, one of the largest Arab citadels in Europe. The city's architecture
also reflects Gothic and Renaissance influences, visible in its cathedral and old quarters. Badajoz hosts several cultural events,
including the Carnival of Badajoz, considered one of the most important in Spain. It is also an academic and economic center,
home to one of the campuses of the University of Extremadura. The city has a hot-summer Mediterranean climate, with mild winters
and very hot, dry summers. The local economy relies on services, trade, agriculture, and cross-border cooperation with Portugal,
particularly with the nearby city of Elvas. Extremadura as a region is known for its natural parks, traditional cuisine, and
products such as Iberian ham and paprika from La Vera."""
question = "Where is Badajoz placed?"
# Tokenize the question and the context
inputs = tokenizer(question, context, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Get the most likely start and end positions of the answer
answer_start_index = torch.argmax(outputs.start_logits)
answer_end_index = torch.argmax(outputs.end_logits)


predict_answer_tokens = inputs["input_ids"][0, answer_start_index : answer_end_index + 1]
answer = tokenizer.decode(predict_answer_tokens)

# Print the answer
print(f"Question: {question}")
print(f"Context: {context}")
print(f"Answer: {answer}")

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Question: Where is Badajoz placed?
Context: Badajoz is a city located in the autonomous community of Extremadura, in southwestern Spain, near the Portuguese border.
It is the largest city in the region by population and surface area. Founded during the Moorish period, Badajoz preserves
important historical remains, such as its massive Alcazaba, one of the largest Arab citadels in Europe. The city's architecture
also reflects Gothic and Renaissance influences, visible in its cathedral and old quarters. Badajoz hosts several cultural events,
including the Carnival of Badajoz, considered one of the most important in Spain. It is also an academic and economic center,
home to one of the campuses of the University of Extremadura. The city has a hot-summer Mediterranean climate, with mild winters
and very hot, dry summers. The local economy relies on services, trade, agriculture, and cross-border cooperation with Portugal,
particularly with the nearby city of Elvas. Extremadura as a region i



##Training the Model

In [58]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,2.8906,1.882181
2,1.6294,1.981344
3,1.0706,1.998787
4,0.6129,2.42925
5,0.3768,2.744407


TrainOutput(global_step=3750, training_loss=1.2711611165364582, metrics={'train_runtime': 277.9042, 'train_samples_per_second': 107.951, 'train_steps_per_second': 13.494, 'total_flos': 2939694750720000.0, 'train_loss': 1.2711611165364582, 'epoch': 5.0})
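
In the log above, the training loss keeps falling while the validation loss rises after the first epoch, a pattern usually read as overfitting. The short sketch below picks the epoch with the lowest validation loss from the logged values. The numbers are copied from the table; the variable names are made up for this example.

```python
# (epoch, training_loss, validation_loss) copied from the training log
history = [
    (1, 2.8906, 1.882181),
    (2, 1.6294, 1.981344),
    (3, 1.0706, 1.998787),
    (4, 0.6129, 2.42925),
    (5, 0.3768, 2.744407),
]

# Epoch with the lowest validation loss
best_epoch, _, best_val = min(history, key=lambda row: row[2])
print(best_epoch, best_val)

# Gap between validation and training loss per epoch; a growing gap suggests overfitting
gaps = [val - train for _, train, val in history]
print([round(g, 3) for g in gaps])
```

Under this reading, the checkpoint from epoch 1 would be the one to keep. `TrainingArguments` offers `load_best_model_at_end` for that purpose, and `EarlyStoppingCallback` can stop training once the validation metric stops improving; neither is used in this notebook.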

##After Fine-Tuning

In [59]:
# Example question and context
context = """Badajoz is a city located in the autonomous community of Extremadura, in southwestern Spain, near the Portuguese border.
It is the largest city in the region by population and surface area. Founded during the Moorish period, Badajoz preserves
important historical remains, such as its massive Alcazaba, one of the largest Arab citadels in Europe. The city's architecture
also reflects Gothic and Renaissance influences, visible in its cathedral and old quarters. Badajoz hosts several cultural events,
including the Carnival of Badajoz, considered one of the most important in Spain. It is also an academic and economic center,
home to one of the campuses of the University of Extremadura.
The city has a hot-summer Mediterranean climate, with mild winters
and very hot, dry summers. The local economy relies on services, trade, agriculture, and cross-border cooperation with Portugal,
particularly with the nearby city of Elvas.
Extremadura as a region is known for its natural parks, traditional cuisine, and
products such as Iberian ham and paprika from La Vera."""
question = "where is Badajoz located?"
# Tokenize the question and the context
inputs = tokenizer(question, context, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Get the most likely start and end positions of the answer
answer_start_index = torch.argmax(outputs.start_logits)
answer_end_index = torch.argmax(outputs.end_logits)

predict_answer_tokens = inputs["input_ids"][0, answer_start_index : answer_end_index + 1]
answer = tokenizer.decode(predict_answer_tokens)

# Print the answer
print(f"Question: {question}")
print(f"Context: {context}")
print(f"Answer: {answer}")

Question: where is Badajoz located?
Context: Badajoz is a city located in the autonomous community of Extremadura, in southwestern Spain, near the Portuguese border.
It is the largest city in the region by population and surface area. Founded during the Moorish period, Badajoz preserves
important historical remains, such as its massive Alcazaba, one of the largest Arab citadels in Europe. The city's architecture
also reflects Gothic and Renaissance influences, visible in its cathedral and old quarters. Badajoz hosts several cultural events,
including the Carnival of Badajoz, considered one of the most important in Spain. It is also an academic and economic center,
home to one of the campuses of the University of Extremadura. 
The city has a hot-summer Mediterranean climate, with mild winters
and very hot, dry summers. The local economy relies on services, trade, agriculture, and cross-border cooperation with Portugal,
particularly with the nearby city of Elvas. 
Extremadura as a regio

In [68]:
question = "which is the most important cultural event in Badajoz? "
# Tokenize the question and the context
inputs = tokenizer(question, context, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Get the most likely start and end positions of the answer
answer_start_index = torch.argmax(outputs.start_logits)
answer_end_index = torch.argmax(outputs.end_logits)

predict_answer_tokens = inputs["input_ids"][0, answer_start_index : answer_end_index + 1]
answer = tokenizer.decode(predict_answer_tokens)

# Print the answer
print(f"Question: {question}")
print(f"Answer: {answer}")

Question: which is the most important cultural event in Badajoz? 
Answer: carnival
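
The inference cells take the `argmax` of the start and end logits independently, which can yield an end position before the start and therefore an empty answer. A common remedy is to search for the best-scoring pair with `start <= end`. Here is a minimal sketch in plain Python; the helper `best_span`, the `max_answer_len` default, and the logits are all hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logit + end_logit subject to start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start, s_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = s_logit + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

# Hypothetical logits where the independent argmaxes disagree (end before start).
start_logits = [0.1, 0.2, 0.3, 2.5, 0.4]
end_logits = [0.2, 2.0, 0.1, 0.3, 1.5]

naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
print(naive)                                # start after end: slicing would give an empty answer
print(best_span(start_logits, end_logits))  # a valid span
```

A fuller version would also restrict the search to context tokens so that the question and special tokens cannot be returned as the answer.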


#Masked Language Modeling (AutoModelForMaskedLM)

In [None]:
!pip install wikipedia


##Before Fine-Tuning

In [53]:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import load_dataset, Dataset
import wikipedia

wikipedia.set_lang("en")
# Dataset for masked language modeling: the same article repeated 1000 times
extremadura_text = wikipedia.page("Extremadura").content
ds = Dataset.from_dict({"text": [extremadura_text]*1000})


# Tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Tokenization (each copy is truncated to its first 128 tokens)
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=128)

tokenized_dataset = ds.map(tokenize_function, batched=True, remove_columns=["text"])


# Data collator for masked language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.5)

# Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    report_to="none"
)

# Check performance before fine-tuning
text = "The capital city of Extremadura is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
predicted_token_id = outputs.logits[0, inputs.input_ids[0].tolist().index(tokenizer.mask_token_id)].argmax().item()
print("Before fine-tuning:")
print(f"Text: {text}")
print(f"Prediction: {tokenizer.decode(predicted_token_id)}\n")

# Set up the Trainer (training itself runs in the next cell)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer)


Before fine-tuning:
Text: The capital city of Extremadura is [MASK].
Prediction: madrid





##Training

In [54]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.01332
2,No log,0.005356
3,No log,0.001164
4,No log,0.000873
5,No log,0.001346


TrainOutput(global_step=315, training_loss=0.10081304519895523, metrics={'train_runtime': 58.7365, 'train_samples_per_second': 85.126, 'train_steps_per_second': 5.363, 'total_flos': 329006016000000.0, 'train_loss': 0.10081304519895523, 'epoch': 5.0})
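
During this run, `DataCollatorForLanguageModeling` with `mlm=True` decides on the fly which tokens to hide. The sketch below imitates the usual BERT recipe in plain Python, which the collator is understood to follow by default: selected tokens become `[MASK]` 80% of the time, a random token 10% of the time, and stay unchanged 10% of the time, while unselected positions get the label `-100`. The helper name, the toy ids, and the ids used for `[CLS]`, `[SEP]` and `[MASK]` are assumptions for illustration.

```python
import random

IGNORE_INDEX = -100  # label for positions that do not contribute to the loss

def mask_tokens(input_ids, mask_id, vocab_size, special_ids, mlm_probability=0.5, seed=0):
    """Return (masked_ids, labels) following the 80/10/10 masking recipe."""
    rng = random.Random(seed)
    masked, labels = list(input_ids), [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        if token in special_ids or rng.random() >= mlm_probability:
            continue                                # not selected: ignored by the loss
        labels[i] = token                           # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                     # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)   # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return masked, labels

# Hypothetical ids: 101 and 102 stand for [CLS] and [SEP], 103 for [MASK].
ids = [101, 1996, 3007, 2103, 1997, 102]
masked, labels = mask_tokens(ids, mask_id=103, vocab_size=30522, special_ids={101, 102})
print(masked)
print(labels)
```

The notebook uses `mlm_probability=0.5`, well above the library default of 0.15, so roughly half of the non-special tokens are selected in every batch.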

##After Fine-Tuning

In [57]:
# Move the inputs to the same device as the fine-tuned model
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Check performance after fine-tuning
outputs = model(**inputs)
predicted_token_id = outputs.logits[0, inputs['input_ids'][0].tolist().index(tokenizer.mask_token_id)].argmax().item()
print("After fine-tuning:")
print(f"Text: {text}")
print(f"Prediction: {tokenizer.decode(predicted_token_id)}\n")


After fine-tuning:
Text: The capital city of Extremadura is [MASK].
Prediction: merida



#Multiple Choice (AutoModelForMultipleChoice)

##Before Fine-Tuning

In [81]:
# Clear all variables
%reset -f

In [107]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForMultipleChoice, Trainer, TrainingArguments
import torch

# 1. Load the PIQA dataset
dataset = load_dataset("piqa")

# 2. Use a base model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMultipleChoice.from_pretrained(model_name)

# 3. Preprocessing: convert to multiple-choice format
def preprocess_function(examples):
    # Each example has a 'goal' and two options: 'sol1' and 'sol2'
    first_sentences = []
    second_sentences = []
    for goal, sol1, sol2 in zip(examples["goal"], examples["sol1"], examples["sol2"]):
        first_sentences.extend([goal, goal])
        second_sentences.extend([sol1, sol2])

    # Tokenize the questions and choices
    tokenized = tokenizer(first_sentences, second_sentences, truncation=True, padding="max_length", max_length=128)

    # Group into pairs: (batch_size, num_choices, seq_len)
    result = {
        k: [tokenized[k][i:i+2] for i in range(0, len(tokenized[k]), 2)]
        for k in tokenized
    }
    result["labels"] = examples["label"]
    return result

# Apply the preprocessing to subsets of the train and validation splits
encoded_train = dataset["train"].select(range(2000)).map(preprocess_function, batched=True)
encoded_valid = dataset["validation"].select(range(500)).map(preprocess_function, batched=True)

# Data collator that reshapes multiple-choice batches for the Trainer
from dataclasses import dataclass
from typing import Dict, List, Optional
from transformers import PreTrainedTokenizerBase

@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: bool = True
    max_length: Optional[int] = None

    def __call__(self, features: List[Dict]) -> Dict[str, torch.Tensor]:
        labels = torch.tensor([f["labels"] for f in features], dtype=torch.long)
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])

        flattened_features = []
        for feature in features:
            for i in range(num_choices):
                flattened_features.append(
                    {k: feature[k][i] for k in ["input_ids", "attention_mask", "token_type_ids"] if k in feature}
                )

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            return_tensors="pt"
        )

        batch = {
            k: v.view(batch_size, num_choices, -1) for k, v in batch.items()
        }
        batch["labels"] = labels
        return batch

# 4. Training arguments
training_args = TrainingArguments(
    output_dir="./piqa_results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=5,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    report_to="none"
)

# Check the logits before training
prompt = ["Laura went to the kitchen and opened the fridge. She saw that the milk was expired."]
choices = [
    "She threw it away and took a juice instead.",         # coherent
    "She painted the fridge green.",                       # absurd
    "She called her teacher to ask about homework.",       # irrelevant
    "She put the expired milk in her backpack."            # odd and incorrect
]
inputs = tokenizer(prompt * len(choices), choices, return_tensors="pt", padding=True, truncation=True)
# Reshape to (batch_size=1, num_choices=4, seq_length)
for k in inputs:
    inputs[k] = inputs[k].unsqueeze(0)

outputs = model(**inputs)
predicted_idx = torch.argmax(outputs.logits, dim=1).item()
best_choice = choices[predicted_idx]
print("Antes del ajuste fino")
print(f"Logits: {outputs.logits}\n")
print("Opción elegida:", best_choice)

# Set up the Trainer (training runs in the next cell)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train,
    eval_dataset=encoded_valid,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer)
)


Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Antes del ajuste fino
Logits: tensor([[-0.1461, -0.1701,  0.0089, -0.1836]], grad_fn=<ViewBackward0>)

Opción elegida: She called her teacher to ask about homework.


##Training

In [108]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,No log,0.690737
2,0.691700,0.683393
3,0.634900,0.704613
4,0.468100,0.736609
5,0.371000,0.771213


TrainOutput(global_step=2000, training_loss=0.5414263687133789, metrics={'train_runtime': 153.0087, 'train_samples_per_second': 65.356, 'train_steps_per_second': 13.071, 'total_flos': 1315543464960000.0, 'train_loss': 0.5414263687133789, 'epoch': 5.0})
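
The `global_step=2000` reported above follows from the configuration: 2,000 training examples, a per-device batch size of 5, and 5 epochs. The sketch below reproduces that arithmetic, assuming a single device and no gradient accumulation. It also suggests why the first epoch shows `No log` for the training loss: `logging_steps` is not set in the arguments above, so the value of 500 used here is an assumption based on the `Trainer` default, and with 400 steps per epoch the first log entry only arrives during epoch 2.

```python
import math

# Values taken from the cell above.
num_examples = 2000   # dataset["train"].select(range(2000))
batch_size = 5        # per_device_train_batch_size
num_epochs = 5        # num_train_epochs

# Assumption: logging_steps is left at the Trainer default of 500.
logging_steps = 500

# Optimizer steps in one pass over the data (single device, no accumulation).
per_epoch = math.ceil(num_examples / batch_size)
total_steps = per_epoch * num_epochs

# Epoch during which the first training-loss log is written.
first_logged_epoch = math.ceil(logging_steps / per_epoch)

print(f"steps per epoch: {per_epoch}")
print(f"total steps: {total_steps}")
print(f"first training-loss log during epoch: {first_logged_epoch}")
```

Setting a smaller `logging_steps` in `TrainingArguments` would make the training loss appear from the first epoch onward.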

##After Fine-Tuning

In [109]:
# Check performance after fine-tuning
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model(**inputs)
predicted_idx = torch.argmax(outputs.logits, dim=1).item()
best_choice = choices[predicted_idx]
print("Después del ajuste fino:")
print(f"Logits: {outputs.logits}\n")
print("Opción elegida:", best_choice)

Después del ajuste fino:
Logits: tensor([[ 0.8619,  0.1487, -2.7114,  0.3739]], device='cuda:0',
       grad_fn=<ViewBackward0>)

Opción elegida: She threw it away and took a juice instead.
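
Raw logits are hard to interpret on their own. A softmax turns them into a probability distribution over the four choices. The sketch below applies it to the values printed above using only the standard library; the helper name `softmax` is ours and is not part of the notebook.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output after fine-tuning.
logits_after = [0.8619, 0.1487, -2.7114, 0.3739]
probs = softmax(logits_after)
best = max(range(len(probs)), key=probs.__getitem__)

for i, p in enumerate(probs):
    print(f"choice {i}: {p:.3f}")
print(f"best choice index: {best}")
```

On tensors, `torch.softmax(outputs.logits, dim=1)` gives the same distribution.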


#4. AutoModelForSeq2SeqLM.from_pretrained (Translation/Sequence-to-Sequence)

##Before Fine-Tuning

In [61]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Trainer, TrainingArguments
from datasets import load_dataset

# Load the translation dataset
dataset = load_dataset("opus_books", "en-es")
# Create a validation split (10% of train)
dataset = dataset["train"].train_test_split(test_size=0.1, seed=42)
dataset['validation'] = dataset['test']
# Tokenizer and model
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Preprocess the data for translation
def preprocess_function(examples):
    # "translation" is a dictionary of the form {"en": ..., "es": ...}
    # Collect the lists of English and Spanish texts
    en_texts = [item["en"] for item in examples["translation"]]
    es_texts = [item["es"] for item in examples["translation"]]

    # Build the prompt: "translate English to Spanish: <en_text>"
    inputs = [f"translate English to Spanish: {text}" for text in en_texts]
    # The target is the Spanish text itself
    targets = [text for text in es_texts]

    # Tokenize inputs and targets
    # Note: with padding="max_length", the pad tokens in the labels are not
    # replaced by -100, so they also contribute to the loss.
    model_inputs = tokenizer(inputs, truncation=True, padding="max_length", max_length=128)
    labels = tokenizer(targets, truncation=True, padding="max_length", max_length=128)

    # Add the labels to the token dictionary
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Use the seq2seq data collator
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    report_to="none"
)

# Check performance before fine-tuning
def test_translation(model, tokenizer, input_text):
    inputs = tokenizer(f"translate English to Spanish: {input_text}", return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(**inputs, max_length=50)
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translation

print("Antes del ajuste fino:")
input_text = "The book is on the table."
print(f"Entrada: {input_text}")
print(f"Traducción: {test_translation(model, tokenizer, input_text)}\n")

# Set up the Trainer (training runs in the next cell)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42),
    eval_dataset=tokenized_datasets["validation"].shuffle(seed=42),
    tokenizer=tokenizer,
    data_collator=data_collator,
)



Antes del ajuste fino:
Entrada: The book is on the table.
Traducción: Das Buch ist auf dem Tisch.
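
`preprocess_function` pads the labels to `max_length`, so every label sequence ends in a run of pad-token ids. The loss in Hugging Face seq2seq models ignores only positions labelled `-100`, which means those pad positions are scored like real tokens. A common adjustment, sketched here on plain lists, replaces the pad id with `-100` before training. `PAD_ID = 0` matches the T5 tokenizer's pad id, but treat it, the helper name `mask_padding` and the sample ids as assumptions made for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss
PAD_ID = 0           # assumed pad-token id (as in the T5 tokenizer)


def mask_padding(label_batch, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace pad ids with ignore_index in every label sequence."""
    return [
        [ignore_index if tok == pad_id else tok for tok in seq]
        for seq in label_batch
    ]


# Hypothetical label ids, padded on the right to a common length.
labels = [
    [1160, 3714, 1, 0, 0, 0],
    [8, 484, 19, 30, 1, 0],
]
masked = mask_padding(labels)
print(masked)
```

In the notebook this would be applied to `labels["input_ids"]` before assigning `model_inputs["labels"]`. Alternatively, dropping `padding="max_length"` from the label tokenization lets `DataCollatorForSeq2Seq` pad the labels with `-100` per batch.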





##Training

In [62]:
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,1.1053,0.994683
2,1.0522,0.923904
3,1.009,0.905965


TrainOutput(global_step=15774, training_loss=1.1174980832004753, metrics={'train_runtime': 1446.9772, 'train_samples_per_second': 174.411, 'train_steps_per_second': 10.901, 'total_flos': 8539018773921792.0, 'train_loss': 1.1174980832004753, 'epoch': 3.0})

##After Fine-Tuning

In [63]:
# Check performance after fine-tuning

print("Después del ajuste fino:")
print(f"Entrada: {input_text}")
print(f"Traducción: {test_translation(model, tokenizer, input_text)}\n")

Después del ajuste fino:
Entrada: The book is on the table.
Traducción: El libro está en la mesa.

