### Task Summary

We will fine-tune **Qwen3-8B**, an 8-billion-parameter LLM, for **abstractive question answering** on Spanish questions, using the ieuniversity/abstractive-qa-ie-train dataset. Qwen3-8B is an instruction-tuned model, so it already has general question-answering ability before we adapt it.

**Fine-tuning** adapts the model to a specific task it was not explicitly trained for. We use LoRA (low-rank adaptation), which trains only a small fraction of the model's parameters (about 1% in this run). This keeps compute and memory costs low and reduces the risk of catastrophic forgetting.

We will use the **unsloth library and 4-bit quantization** to reduce memory use and speed up training. We convert the dataset into chat-formatted text with the model's chat template, taking care with the `<think>` tokens and the layout of the prompt.

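We evaluate the model's answers with the product SAS × Exact Match (EM). Semantic answer similarity (SAS) measures how close the prediction is to the reference in meaning rather than by word position or overlap, and EM is the fraction of predictions that equal the reference after normalization. SAS is computed with the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model.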
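As a rough illustration of how the two parts of the metric combine, the sketch below multiplies a mean cosine similarity by an exact-match rate. It assumes SAS is the mean cosine similarity between sentence embeddings of prediction and reference. The helper names and the toy 2-d vectors are hypothetical stand-ins for real embeddings, not part of the evaluation library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def exact_match_rate(predictions, references):
    """Fraction of predictions equal to their reference after strip + lowercase."""
    if not predictions:
        return 0.0
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(predictions)

def combined_score(pred_vecs, ref_vecs, predictions, references):
    """Mean pairwise cosine similarity (stand-in for SAS) times exact-match rate."""
    sims = [cosine_similarity(p, r) for p, r in zip(pred_vecs, ref_vecs)]
    mean_sas = sum(sims) / len(sims)
    return mean_sas * exact_match_rate(predictions, references)

# Toy 2-d "embeddings" (hypothetical values, not real model outputs)
pred_vecs = [[1.0, 0.0], [1.0, 1.0]]
ref_vecs = [[1.0, 0.0], [1.0, 0.0]]
predictions = ["el agua", "la energía solar"]
references = ["El agua ", "la energía"]

score = combined_score(pred_vecs, ref_vecs, predictions, references)
print(f"combined score: {score:.4f}")
```

Because the two factors are multiplied, a model with high semantic similarity but few exact matches still scores low, so the metric rewards reproducing the reference wording.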


### Libraries & Loading Data

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    !pip install git+https://github.com/lenguajenatural-ai/simple-llms-eval.git # simple llms eval library
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [2]:
from unsloth import FastLanguageModel  # import unsloth first so its patches apply
import os
import re
from difflib import ndiff

import pandas as pd
import torch
import simple_llms_eval
from datasets import load_dataset, Dataset
from trl import SFTTrainer, SFTConfig

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-09 09:32:06 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-09 09:32:06 [__init__.py:239] Automatically detected platform cuda.


In [3]:
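# Other checkpoints available through Unsloth (for reference only; this list is not used below)
fourbit_models = [
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Qwen3 1.7B, dynamic 4-bit quant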
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit"
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B",
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False
)

==((====))==  Unsloth 2025.4.7: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [4]:
# Load datasets
print("Loading datasets...")
train_dataset = load_dataset("ieuniversity/abstractive-qa-ie-train")
test_dataset = load_dataset("ieuniversity/abstractive-qa-ie-test")

Loading datasets...



In [5]:
train_dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'context', 'answer', 'id'],
        num_rows: 1599
    })
    validation: Dataset({
        features: ['question', 'context', 'answer', 'id'],
        num_rows: 200
    })
})

In [6]:
test_dataset

DatasetDict({
    test: Dataset({
        features: ['question', 'context', 'id'],
        num_rows: 200
    })
})

### Setting LoRA adapters

In [7]:
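model = FastLanguageModel.get_peft_model(
    model,
    r = 32,                  # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],  # attention and MLP projections
    lora_alpha = 64,         # LoRA scaling factor (alpha / r = 2)
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Unsloth's memory-saving checkpointing
    random_state = 3407,
    use_rslora = False,      # standard LoRA scaling, not rank-stabilized
    loftq_config = None,
)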

Unsloth 2025.4.7 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
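The number of trainable parameters that these settings produce can be estimated by hand. A LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs adds two matrices of rank `r`, for `r * (in_features + out_features)` extra parameters. The sketch below applies that formula to the seven target modules. The layer shapes are assumptions about the Qwen3-8B architecture (hidden size 4096, MLP size 12288, 32 query heads and 8 key/value heads of dimension 128); they are not printed anywhere in this notebook.

```python
# Assumed Qwen3-8B shapes (not shown in the notebook output)
HIDDEN = 4096
INTERMEDIATE = 12288
Q_OUT = 32 * 128    # 32 query heads x head dimension 128
KV_OUT = 8 * 128    # 8 key/value heads x head dimension 128
NUM_LAYERS = 36     # matches "patched 36 layers" in the log
RANK = 32           # r in get_peft_model

# (in_features, out_features) for each LoRA target module
module_shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(in_features, out_features, rank):
    """Parameters in A (rank x in_features) plus B (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_param_count(i, o, RANK) for i, o in module_shapes.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumed shapes the total is 87,293,952, which agrees with the trainable-parameter count that Unsloth reports when training starts.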


### Adapt to conversation format

In [8]:
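def generate_conversation(batch):
    """Turn a batch of (question, context, answer) rows into chat-style message lists."""
    questions  = batch["question"]
    contexts = batch["context"]
    answers = batch["answer"]
    conversations = []
    for question, context, answer in zip(questions, contexts, answers):
        # Note: the context is sent as a "system" turn *after* the user question;
        # inference prompts must use the same order.
        conversations.append([
            {"role" : "user",      "content" : question},
            {"role" : "system", "content" : context},
            {"role" : "assistant", "content" : answer},
        ])
    return { "conversations": conversations, }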

In [9]:
mapped_dataset = train_dataset.map(generate_conversation, batched = True)

conversations = tokenizer.apply_chat_template(
    mapped_dataset["train"]["conversations"],
    tokenize = False,
)


In [10]:
conversations[0]

'<|im_start|>user\n¿Qué hace que la gestión adecuada del uso del agua se encuentre entre sus objetivos medioambientales? <|im_end|>\n<|im_start|>system\nLa gestión adecuada del uso del agua se encuentra entre sus objetivos medioambientales ya que el procesado de materiales a elevadas temperaturas obliga a la utilización de este recurso para su refrigeración. Para reducir al máximo los vertidos, la compañía dispone de instalaciones y sistemas propios de tratamiento de aguas, que se encargan de su tratamiento y su recuperación.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\nel procesado de materiales a elevadas temperaturas obliga a la utilización de este recurso para su refrigeración<|im_end|>\n'

In [11]:
len(conversations)

1599
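To make the structure of the rendered string in `conversations[0]` easier to see, here is a simplified, hand-written renderer that produces the same layout. It is only an approximation of the output shown above, written for illustration; `render_chatml` is a hypothetical helper, and the tokenizer's real Jinja chat template handles more cases (tools, reasoning content, generation prompts).

```python
def render_chatml(messages):
    """Render messages in the ChatML-style layout seen in the training text."""
    parts = []
    for message in messages:
        content = message["content"]
        if message["role"] == "assistant":
            # The rendered training example shows an empty reasoning block
            # before the assistant's answer.
            content = "<think>\n\n</think>\n\n" + content
        parts.append(f"<|im_start|>{message['role']}\n{content}<|im_end|>\n")
    return "".join(parts)

# Placeholder texts, not taken from the dataset
example = [
    {"role": "user", "content": "¿Pregunta?"},
    {"role": "system", "content": "Contexto."},
    {"role": "assistant", "content": "respuesta"},
]
rendered = render_chatml(example)
print(rendered)
```

Each turn is wrapped in `<|im_start|>` and `<|im_end|>`, and the empty `<think>` block tells the model to answer directly without reasoning text.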

In [12]:
# One chat-formatted string per training example, stored in a "text" column
data = pd.Series(conversations, name = "text")

combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed = 3407)

In [13]:
combined_dataset

Dataset({
    features: ['text'],
    num_rows: 1599
})

### Training

In [14]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Effective batch size = 2 x 4 = 8
        warmup_steps = 5,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        report_to = "none",
    ),
)


In [15]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA L4. Max memory = 22.161 GB.
7.865 GB of memory reserved.


In [16]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,599 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 87,293,952/8,000,000,000 (1.09% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.9305
2,2.4742
3,2.5995
4,2.1611
5,2.398
6,1.8802
7,1.6558
8,1.7183
9,1.4264
10,1.3786
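The step count in the training banner follows from the batch settings. The sketch below reproduces the arithmetic, assuming the last partial batch is rounded up to a full optimizer step; the exact rounding can differ between trainer versions.

```python
import math

num_examples = 1599               # rows in the training split
per_device_batch_size = 2         # per_device_train_batch_size
gradient_accumulation_steps = 4
num_gpus = 1
num_epochs = 1

# Examples consumed per optimizer step
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Assumption: a final partial batch still counts as one step
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"Effective batch size: {effective_batch_size}")
print(f"Total optimizer steps: {total_steps}")
```

This gives an effective batch size of 8 and 200 optimizer steps, the same values as in the log.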


In [17]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

605.3117 seconds used for training.
10.09 minutes used for training.
Peak reserved memory = 8.906 GB.
Peak reserved memory for training = 1.041 GB.
Peak reserved memory % of max memory = 40.188 %.
Peak reserved memory for training % of max memory = 4.697 %.


### Validation

In [18]:
# --- Configuration for Inference and Evaluation ---
STUDENT_ID = "16491"
MAX_NEW_TOKENS_GEN = 400  # Max tokens for generated answer during inference
SAS_MODEL_NAME = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" # Specified for SAS

# --- Set model to evaluation mode ---
model.eval()

# --- Helper function for Exact Match ---
def calculate_exact_match(predictions, references):
    exact_match_count = 0
    for pred, ref in zip(predictions, references):
        if pred.strip().lower() == ref.strip().lower():
            exact_match_count += 1
    return exact_match_count / len(predictions) if len(predictions) > 0 else 0

In [19]:
# --- Validation Phase ---
print("\n--- Starting Validation Phase ---")
validation_data = train_dataset["validation"] # train_dataset contains both train and validation splits
val_predictions = []
val_references = []
val_ids = []

print(f"Processing {len(validation_data)} validation samples...")
for i, example in enumerate(validation_data):
    if i % 50 == 0 and i > 0:
        print(f"Validated {i}/{len(validation_data)} samples...")
    question = example["question"]
    context = example["context"]
    reference_answer = example["answer"]

    # Prepare prompt for Qwen chat model
    # The training data used: [{"role": "user", "content": q}, {"role": "system", "content": c}, {"role": "assistant", "content": a}]
    # For inference, we provide user and system, and model generates assistant's part.
    inference_chat = [
        {"role": "user", "content": question},
        {"role": "system", "content": context},
    ]

    inputs = tokenizer.apply_chat_template(
        inference_chat,
        tokenize = True,
        add_generation_prompt = True, # Important! Tells the model to generate a response.
        return_tensors = "pt",
    ).to("cuda")

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs,
            max_new_tokens=MAX_NEW_TOKENS_GEN,
            use_cache=True,
            do_sample=True, # Sampling makes outputs non-deterministic; do_sample=False would give reproducible greedy decoding
            temperature=0.6,
            top_p=0.9,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
        )

    # Decode the generated response, removing the prompt part
    # The input_ids are part of outputs[0]
    generated_text_raw = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

    # Clean the <think> tags
    cleaned_generated_text = re.sub(r"<think>.*?</think>\s*", "", generated_text_raw, flags=re.DOTALL).strip()

    val_predictions.append(cleaned_generated_text)
    val_references.append(reference_answer.strip())
    val_ids.append(example["id"])

    # Diagnostic print for the first 5 samples and for the first 5 mismatches
    sample_matches = calculate_exact_match([cleaned_generated_text], [reference_answer.strip()]) == 1
    mismatches_so_far = sum(
        p.strip().lower() != r.strip().lower()
        for p, r in zip(val_predictions, val_references)
    )
    if i < 5 or (not sample_matches and mismatches_so_far <= 5):
        print(f"\n--- Sample {i} (ID: {example['id']}) ---")
        print(f"Question: {question[:100]}...")
        print(f"Context: {context[:100]}...")
        print(f"RAW Decoded Text: [{generated_text_raw}]")
        print(f"Reference Answer (len {len(reference_answer.strip())}): [{reference_answer.strip()}]")
        print(f"CLEANED Generated Text (len {len(cleaned_generated_text)}): [{cleaned_generated_text}]")
        is_match = calculate_exact_match([cleaned_generated_text], [reference_answer.strip()])
        print(f"Exact Match for this sample: {bool(is_match)}")
        if not is_match:
            # Simple diff visualization
            diff = list(ndiff(reference_answer.strip().lower().split(), cleaned_generated_text.lower().split()))
            print("Diff (ref vs gen, word-level):")
            for d_item in diff:
                # Show only words removed from ("-") or added to ("+") the reference
                if d_item.startswith(("-", "+")):
                    print(f"  {d_item}")

print("Validation processing complete.")


--- Starting Validation Phase ---
Processing 200 validation samples...

--- Sample 0 (ID: validation-0) ---
Question: ¿Qué efecto tuvo la política de precios implementada en España en la segunda mitad del año?...
Context: El EBITDA ajustado descendió un 9,4% en 2017 hasta EUR568.6m, una caída del 8,9% a divisa constante....
RAW Decoded Text: [<think>

</think>

una erosión de 65 puntos básicos en el margen EBITDA ajustado, que fue del 6,6% en el año]
Reference Answer (len 89): [una erosión de 65 puntos básicos en el margen EBITDA ajustado, que fue del 6,6% en el año]
CLEANED Generated Text (len 89): [una erosión de 65 puntos básicos en el margen EBITDA ajustado, que fue del 6,6% en el año]
Exact Match for this sample: True

--- Sample 1 (ID: validation-1) ---
Question: ¿Cuál es el resultado de que la innovación sea uno de los motores del desarrollo de Gas Natural Feno...
Context: La innovación es uno de los motores del desarrollo de Gas Natural Fenosa, por lo que destina una par...
RA
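The `<think>` cleanup used in the validation loop can be checked in isolation. The sketch below applies the same regular expression to a few hand-written strings; `strip_think` is a hypothetical wrapper and the sample texts are illustrative, not model outputs.

```python
import re

# Same pattern as in the validation loop: a <think>...</think> block
# (possibly spanning several lines) plus any whitespace after it.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_think(text):
    """Remove complete <think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()

raw_empty = "<think>\n\n</think>\n\nuna erosión de 65 puntos básicos"
raw_reasoning = "<think>\nEl contexto menciona el margen.\n</think>\n\nuna erosión"
raw_plain = "sin etiquetas"
raw_unclosed = "<think>\nrazonamiento truncado"

for raw in (raw_empty, raw_reasoning, raw_plain, raw_unclosed):
    print(repr(strip_think(raw)))
```

An unclosed `<think>` block is left in place because the pattern needs a closing tag. This can happen if generation stops at `max_new_tokens` before the model closes its reasoning, and such a prediction would then fail the exact-match check.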

In [20]:
try:
    sas_metric = simple_llms_eval.BiEncoderScore() # Uses a sentence transformer
    sas_scores_list, sas_avg_score = sas_metric.compute(
        predictions=val_predictions,
        references=val_references,
        model_name=SAS_MODEL_NAME, # Ensure we use the assignment-specified model
        return_average=True
    )
    print(f"Average SAS (BiEncoderScore with {SAS_MODEL_NAME}) on Validation Set: {sas_avg_score:.4f}")
except Exception as e:
    print(f"Could not compute SAS score: {e}")
    sas_avg_score = 0.0 # Default if calculation fails

# Calculate Exact Match
em_score = calculate_exact_match(val_predictions, val_references)
print(f"Exact Match (EM) on Validation Set: {em_score:.4f}")

# Final Score (SAS * EM)
final_validation_score = sas_avg_score * em_score
print(f"Final Validation Score (SAS * EM): {final_validation_score:.4f}")


Average SAS (BiEncoderScore with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) on Validation Set: 0.9686
Exact Match (EM) on Validation Set: 0.8250
Final Validation Score (SAS * EM): 0.7991


### Inference

In [21]:
# --- Test Set Prediction Generation ---
print("\n--- Starting Test Set Prediction Generation ---")
test_data_for_pred = test_dataset["test"] # This is the actual test set for submission
submission_predictions = []

print(f"Generating predictions for {len(test_data_for_pred)} test samples...")
for i, example in enumerate(test_data_for_pred):
    if i % 50 == 0 and i > 0:
        print(f"Processed {i}/{len(test_data_for_pred)} test samples...")
    question = example["question"]
    context = example["context"]
    current_id = example["id"]

    inference_chat = [
        {"role": "user", "content": question},
        {"role": "system", "content": context},
    ]

    inputs = tokenizer.apply_chat_template(
        inference_chat,
        tokenize = True,
        add_generation_prompt = True,
        return_tensors = "pt",
    ).to("cuda")

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs,
            max_new_tokens=MAX_NEW_TOKENS_GEN,
            use_cache=True,
            do_sample=True,
            temperature=0.6,
            top_p=0.9,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
        )

    generated_text_raw = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

    # Clean the <think> tags for test predictions as well
    cleaned_generated_text = re.sub(r"<think>.*?</think>\s*", "", generated_text_raw, flags=re.DOTALL).strip()

    submission_predictions.append({"id": current_id, "answer": cleaned_generated_text})

print("Test set prediction generation complete.")


--- Starting Test Set Prediction Generation ---
Generating predictions for 200 test samples...
Processed 50/200 test samples...
Processed 100/200 test samples...
Processed 150/200 test samples...
Test set prediction generation complete.


### Output

In [22]:
# Save predictions to CSV
predictions_df = pd.DataFrame(submission_predictions)
output_csv_filename = f"predictions_{STUDENT_ID}.csv"
predictions_df.to_csv(output_csv_filename, index=False)
print(f"Predictions saved to {output_csv_filename}")
print(predictions_df.head().to_string())

Predictions saved to predictions_16491.csv
       id                                                                                                                                                                      answer
0  test-0                                     en 2017 se ha celebrado la segunda Semana de la Educación Vial, con el objetivo de formar y sensibilizar a la plantilla en esta materia
1  test-1                                                                                                           Somos conscientes del impacto de nuestra actividad en la sociedad
2  test-2       Vidrala incrementará su notoriedad en el mercado de envases de vidrio europeo, aportando tamaño, diversificación y futuro a una base de negocio de solidez demostrada
3  test-3                        el Grupo aplica la contabilidad de coberturas al objeto de mitigar la volatilidad que se produciría en la cuenta de pérdidas y ganancias consolidada
4  test-4  el índice Footsie 100 mostró una sub