# Question Answering Fine-tuning

La siguiente tarea consiste en entrenar un modelo de HuggingFace (HF) para realizar la _task_ de _question_answering_. El dataset para entrenar dicho modelo está predefinido. Sin embargo, el modelo, el tokenizador y el trainer pueden ser totalmente personalizados. Es decir, que tendréis que realizar un trabajo de investigación, de prueba y error, para poder ir aprendiendo y ganando destreza con HF.

Recomendaciones:
- Durante este proceso, tendréis muchas dudas y encontraréis muchos errores. Tratad de resolverlas primero por vuestra cuenta, enteniendo la causa del error. Después con recursos online. Y, finalmente, siempre está el foro, que puede ser utilizado de forma participativa.
- No dejeis la tarea para el último día. Los modelos tardan en entrenar. Los problemas no se resuelven en la primera iteración.

Finalmente, se pide:
- Limpieza rigurosa en la presentación del notebook.
- El notebook se entrega con todas las celdas ejecutadas.
- Los comentarios (opcionales), mejor sobre el código con '#'.

Ánimo!

## Dataset

A continuación, descargarás un dataset llamado _squad_ que contiene 87599 filas en el dataset de train y 10570 registros en el dataset de validation.

Lo primero que tendrás que hacer es construir un dataset nuevo, llamado **ds_tarea**, que filtre el anterior dataset para quedarse con los registros que tengan el contenido de la columna _context_  con menos (estrictamente) de 300 caracteres.

In [44]:
!pip install datasets



In [45]:
!pip install transformers[torch]



In [46]:
!pip install accelerate -U



In [47]:
import torch
from transformers import Trainer, TrainingArguments

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [48]:
from datasets import load_dataset

dataset = load_dataset("squad")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [49]:
ds_tarea_train = dataset['train'].filter(lambda x: len(x['context']) < 300)
ds_tarea_validation = dataset['validation'].filter(lambda x: len(x['context']) < 300)

# Crear un nuevo dataset combinando los subsets filtrados
ds_tarea = {
    'train': ds_tarea_train,
    'validation': ds_tarea_validation
}

In [50]:
assert len(ds_tarea['train']) == 3466
assert len(ds_tarea['validation']) == 345

## EDA

Si tenéis que realizar alguna exploración del datos, utilizad esta sección.

In [51]:
# Celdas de libre uso

## Feature Engineering

Si tenéis que realizar alguna modificación de los datos (no siempre es necesaria, pero algunos modelos preentrenados lo piden), podéis utilizar esta sección.

Al finalizar la sección, bien si modificais el dataset, bien si no lo modificáis, lo guardaréis en un dataset llamado __ds_tarea_featured__.

In [52]:
# Celdas de libre uso

In [53]:
ds_tarea_featured = ds_tarea # Esta línea tiene sentido en caso de que no se modifique el dataset

In [54]:
assert len(ds_tarea_featured['train']) == 3466
assert len(ds_tarea_featured['validation']) == 345

## Model and Tokenizer

El modelo finalmente escogido para hacer el fine-tuning, declaradlo en la variable _model_checkpoint_. Con dicho modelo seleccionado, se pide guardar el modelo y el tokenizador en las variables _model_ y _tokenizer_.

In [55]:
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering, Trainer, TrainingArguments, DataCollatorWithPadding

# Initialize the tokenizer and model
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Fine-tuning

A continuación, de forma libre se pide entrenar un modelo de HuggingFace deseado. Se pide usar un Trainer de HuggingFace que tenga los siguientes argumentos como mínimo (puede haber más argumentos en todas las variables):

In [56]:

# Define the preprocess function to be used for evaluation
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation=True,
        padding="max_length",
        return_tensors="pt"
    )

    start_positions = []
    end_positions = []

    for i, answer in enumerate(examples["answers"]):
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])

        start_token_idx = inputs.char_to_token(i, start_char)
        end_token_idx = inputs.char_to_token(i, end_char - 1)

        # If there's a misalignment, set positions to max_length (ignores these cases)
        if start_token_idx is None or end_token_idx is None:
            start_token_idx = inputs.input_ids.shape[1] - 1
            end_token_idx = inputs.input_ids.shape[1] - 1

        start_positions.append(start_token_idx)
        end_positions.append(end_token_idx)

    inputs["start_positions"] = torch.tensor(start_positions)
    inputs["end_positions"] = torch.tensor(end_positions)

    return inputs

# Apply the preprocessing function to the datasets
tokenized_ds_tarea_train = ds_tarea["train"].map(preprocess_function, batched=True, remove_columns=ds_tarea["train"].column_names)
tokenized_ds_tarea_validation = ds_tarea["validation"].map(preprocess_function, batched=True, remove_columns=ds_tarea["validation"].column_names)


Map:   0%|          | 0/3466 [00:00<?, ? examples/s]

Map:   0%|          | 0/345 [00:00<?, ? examples/s]

In [57]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results_definitive",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_steps=500
)




In [58]:
from transformers import DataCollatorWithPadding


# Initialize the data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds_tarea_train,
    eval_dataset=tokenized_ds_tarea_validation,
    data_collator=data_collator,
    tokenizer=tokenizer
)


A continuación se entrena el modelo

In [59]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.953978
2,No log,0.911891
3,No log,0.986455
4,No log,1.045606
5,1.177200,1.285928


TrainOutput(global_step=545, training_loss=1.1230580706115163, metrics={'train_runtime': 641.0375, 'train_samples_per_second': 27.034, 'train_steps_per_second': 0.85, 'total_flos': 1698163667665920.0, 'train_loss': 1.1230580706115163, 'epoch': 5.0})

In [60]:
trainer.save_model("modelo_prueba_4")

## Evaluation

En este apartado, no vamos a entrar esta vez métrics. Lo que se va a pedir, es tomar dos ejemplos del dataset de evaluación.

Con ambos ejemplos, vamos a ver cómo responden a las preguntas.

In [61]:
sample1 = 100
sample2 = 159

In [62]:
# Extract contexts, questions, and answers for the selected samples
context1 = ds_tarea['validation'][sample1]['context']
question1 = ds_tarea['validation'][sample1]['question']
answer1 = ds_tarea['validation'][sample1]['answers']

ds_tarea['validation'][sample1]

{'id': '56d20650e7d4791d00902615',
 'title': 'Super_Bowl_50',
 'context': 'Six-time Grammy winner and Academy Award nominee Lady Gaga performed the national anthem, while Academy Award winner Marlee Matlin provided American Sign Language (ASL) translation.',
 'question': 'What actress did the ASL translation for the game?',
 'answers': {'text': ['Marlee Matlin', 'Marlee Matlin', 'Marlee Matlin'],
  'answer_start': [117, 117, 117]}}

In [63]:
context2 = ds_tarea['validation'][sample2]['context']
question2 = ds_tarea['validation'][sample2]['question']
answer2 = ds_tarea['validation'][sample2]['answers']

ds_tarea['validation'][sample2]

{'id': '56e11c24e3433e1400422c18',
 'title': 'Nikola_Tesla',
 'context': 'Tesla was 6 feet 2 inches (1.88 m) tall and weighed 142 pounds (64 kg), with almost no weight variance from 1888 to about 1926.:292 He was an elegant, stylish figure in New York City, meticulous in his grooming, clothing, and regimented in his daily activities.',
 'question': 'How much did Tesla weigh?',
 'answers': {'text': ['142 pounds', '142 pounds', '142 pounds (64 kg)'],
  'answer_start': [52, 52, 52]}}

Aquí se pide hacer la inferencia del modelo entrenado y poner los resultados en las variables _response1_ y _response2_.

In [72]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments
import torch

# Load the saved model and tokenizer
model_path = "modelo_prueba_4"
model = AutoModelForQuestionAnswering.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Preprocess the examples
example1 = {
    "question": [question1],
    "context": [context1],
    "answers": [answer1]
}

example2 = {
    "question": [question2],
    "context": [context2],
    "answers": [answer2]
}

# Extract tokenized examples from the validation dataset
processed_example1 = tokenized_ds_tarea_validation[sample1]
processed_example2 = tokenized_ds_tarea_validation[sample2]

# Convert lists to tensors
input_ids1 = torch.tensor(processed_example1["input_ids"]).unsqueeze(0)
attention_mask1 = torch.tensor(processed_example1["attention_mask"]).unsqueeze(0)

input_ids2 = torch.tensor(processed_example2["input_ids"]).unsqueeze(0)
attention_mask2 = torch.tensor(processed_example2["attention_mask"]).unsqueeze(0)


# Make predictions
model.eval()
with torch.no_grad():
    outputs1 = model(
        input_ids=input_ids1,
        attention_mask=attention_mask1,
    )
    outputs2 = model(
        input_ids=input_ids2,
        attention_mask=attention_mask2,
    )

# Obtain the predicted start and end positions
end_pred1 = torch.argmax(outputs1.end_logits, dim=1).item()
end_pred2 = torch.argmax(outputs2.end_logits, dim=1).item()

# Decode the predicted answers from the context
tokens1 = input_ids1[0][0:end_pred1 + 1]
tokens2 = input_ids2[0][0:end_pred2 + 1]

predicted_answer1 = tokenizer.decode(tokens1, skip_special_tokens=True)
predicted_answer2 = tokenizer.decode(tokens2, skip_special_tokens=True)

response1 = {
    "context": context1,
    "question": question1,
    "true_answer": answer1,
    "predicted_answer": predicted_answer1
}

response2 = {
    "context": context2,
    "question": question2,
    "true_answer": answer2,
    "predicted_answer": predicted_answer2
}

print("Response 1:", response1)
print("Response 2:", response2)

Response 1: {'context': 'Six-time Grammy winner and Academy Award nominee Lady Gaga performed the national anthem, while Academy Award winner Marlee Matlin provided American Sign Language (ASL) translation.', 'question': 'What actress did the ASL translation for the game?', 'true_answer': {'text': ['Marlee Matlin', 'Marlee Matlin', 'Marlee Matlin'], 'answer_start': [117, 117, 117]}, 'predicted_answer': 'what actress did the asl translation for the game? six - time grammy winner and academy award nominee lady gaga performed the national anthem, while academy award winner marlee matlin provided american sign language ( asl ) translation.'}
Response 2: {'context': 'Tesla was 6 feet 2 inches (1.88 m) tall and weighed 142 pounds (64 kg), with almost no weight variance from 1888 to about 1926.:292 He was an elegant, stylish figure in New York City, meticulous in his grooming, clothing, and regimented in his daily activities.', 'question': 'How much did Tesla weigh?', 'true_answer': {'text': 