<a href="https://colab.research.google.com/github/elizabethavargas/Llama-Fine-tuning/blob/main/curr_Contest_DL_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



<h1>Welcome to the Math Question Answer Verification Competition! 🚀</h1>

The goal is to fine-tune a Llama-3-8B model to predict if a given solution to a math problem is correct or not. Your model should output True if the solution is correct, and False otherwise.


## **Step 1: Install Necessary Libraries**

First, we need to install the required Python libraries. We'll be using the unsloth library, which provides memory-efficient training methods for large language models, making it possible to fine-tune an 8B-parameter model on a single free-tier GPU. The cell below installs unsloth pinned to a specific commit; a second line that pins xformers, trl, peft, accelerate, bitsandbytes, and transformers is left commented out.


In [None]:
# %%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git@31b667b54139962832ea2de890383eed14a0a17d"
# !pip install --no-deps "xformers<0.0.26" "trl<0.9.0" "peft<0.12.0" "accelerate<0.32.0" "bitsandbytes<0.44.0" "transformers<4.43.0"

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

## **Step 2: Load the Model and Tokenizer**

Next, we'll load the Llama-3-8B model, which is the only model permitted for this competition. We'll use Unsloth's FastLanguageModel to handle this efficiently.

A key technique we'll use is 4-bit quantization (`load_in_4bit = True`). Each weight is stored in 4 bits instead of the usual 16, so the weights take roughly a quarter of the GPU memory. This makes it possible to fine-tune this large model even on a free platform like Google Colab.
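
As a rough back-of-the-envelope check on why 4-bit loading matters, the sketch below estimates the memory taken by the weights alone. The 8-billion parameter count is a round-number assumption, and the estimate ignores quantization metadata, activations, and optimizer state.

```python
# Rough weight-memory estimate. PARAMS is an assumed round number, not a measured value.
PARAMS = 8_000_000_000

def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(PARAMS, 16)
int4_gb = weight_memory_gb(PARAMS, 4)
print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

Actual usage during fine-tuning will be higher than the 4-bit figure, since the LoRA adapters, gradients, and activations also need memory.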



In [None]:
import unsloth
from unsloth import FastLanguageModel
import torch

max_seq_length = 4096  # Choose any sequence length
dtype = None  # This will auto-detect the best data type for your GPU
load_in_4bit = True  # Use 4-bit quantization to save memory

checkpoint_path = "/content/drive/MyDrive/llama_finetune_checkpoints/checkpoint-625"

print(f"Loading model from: {checkpoint_path}")


# Load the model and tokenizer from the checkpoint saved on Google Drive.
# Note: to start from the base weights instead of resuming from a checkpoint,
# pass the commented-out Hugging Face model name below.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name =  checkpoint_path, #"unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

tokenizer.chat_template = """{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"""

## **Step 3: Prepare the Dataset**

This is a crucial step where we format our data into a structure the model can learn from. The process involves three parts:

1.  **Loading**: We'll load the official competition dataset from Hugging Face.
2.  **Splitting**: The full dataset is massive. For this notebook, we'll create a smaller, more manageable version to speed things up: **40,000 samples for training**, balanced evenly between correct and incorrect solutions, and **1,000 for validation**.
3.  **Prompting**: We will format each data sample into a clear instructional prompt. This helps the model understand its role as a mathematician verifying a solution.
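
To make the prompting step concrete, here is a plain-Python imitation of the chat template that Step 2 assigns to `tokenizer.chat_template`. It is only a sketch for inspecting the resulting string: the helper name `render_llama3_chat` is hypothetical, and `BOS_TOKEN` is assumed to be the usual Llama-3 beginning-of-text token rather than read from the tokenizer.

```python
# Plain-Python imitation of the chat template set on the tokenizer in Step 2.
# BOS_TOKEN is an assumption: the usual Llama-3 value of tokenizer.bos_token.
BOS_TOKEN = "<|begin_of_text|>"

def render_llama3_chat(messages, add_generation_prompt=False, bos_token=BOS_TOKEN):
    """Concatenate messages as header + trimmed content + end-of-turn marker."""
    parts = []
    for index, message in enumerate(messages):
        content = (
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip() + "<|eot_id|>"
        )
        if index == 0:
            # Only the first message is prefixed with the beginning-of-text token.
            content = bos_token + content
        parts.append(content)
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

example = [
    {"role": "system", "content": "Respond with only 'True' or 'False'."},
    {"role": "user", "content": "Question:\nWhat is 2 + 2?\n\nSolution:\n4"},
    {"role": "assistant", "content": "True"},
]
training_text = render_llama3_chat(example)
inference_prompt = render_llama3_chat(example[:2], add_generation_prompt=True)
print(training_text)
```

During training the assistant turn carries the label, while at inference time the prompt ends with an open assistant header and the model supplies `True` or `False`.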



In [None]:
from datasets import load_dataset, concatenate_datasets

# Load the full training dataset
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")

# Shuffle the dataset for randomness and create our smaller splits
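total_size = 40000  # number of training examples kept after balancing
total_size_buffered = int(total_size * 1.3)  # oversample by 30% so each class has enough examples
true_proportion = 0.50  # fraction of training examples labeled True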

shuffled_dataset = full_dataset.shuffle(seed=45)
train_dataset = shuffled_dataset.select(range(total_size_buffered))
validation_dataset = shuffled_dataset.select(range(total_size_buffered, total_size_buffered + 1000))

# Balance the training set: keep a fixed number of True and False examples, then reshuffle
true_examples = train_dataset.filter(lambda x: x["is_correct"] == True).select(range(int(total_size*true_proportion)))
false_examples = train_dataset.filter(lambda x: x["is_correct"] == False).select(range(int(total_size*(1-true_proportion))))
train_dataset = concatenate_datasets([true_examples, false_examples]).shuffle(seed=42)

In [None]:
EOS_TOKEN = tokenizer.eos_token

# This function formats our data samples using the Llama-3 chat template
def formatting_prompts_func(examples):
    questions = examples["question"]
    solutions = examples["solution"]
    outputs = examples["is_correct"] # These are True/False booleans
    texts = []
    for question, solution, output in zip(questions, solutions, outputs):
        # Define the conversation messages
        messages = [
            {"role": "system", "content": "You are a great mathematician. Your task is to verify if a given solution to a math problem is correct. Respond with only 'True' if the solution is correct, and only 'False' otherwise."},
            {"role": "user", "content": f"Question:\n{question}\n\nSolution:\n{str(solution)}"},
            {"role": "assistant", "content": str(output)}
        ]

        formatted_text = tokenizer.apply_chat_template(
            messages,
            tokenize=False, # We want the formatted string, not tokens yet
            add_generation_prompt=False # We are providing the assistant's response
        )
        texts.append(formatted_text)

    return { "text" : texts }

# Apply the formatting function to both splits
formatted_train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
formatted_validation_dataset = validation_dataset.map(formatting_prompts_func, batched=True)

In [None]:
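def data_collator(examples):
    # Pads pre-tokenized examples (with "input_ids" and "labels") into batch tensors.
    # Note: this collator is not passed to the SFTTrainer below.
    # Convert lists back to tensors and pad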
    ids  = [torch.tensor(e["input_ids"], dtype=torch.long) for e in examples]
    labs = [torch.tensor(e["labels"],    dtype=torch.long) for e in examples]

    ids  = torch.nn.utils.rnn.pad_sequence(
        ids, batch_first=True, padding_value=tokenizer.pad_token_id
    )
    labs = torch.nn.utils.rnn.pad_sequence(
        labs, batch_first=True, padding_value=-100  # ignore_index for loss
    )
    attn = (ids != tokenizer.pad_token_id).long()

    return {"input_ids": ids, "attention_mask": attn, "labels": labs}


## **Step 4: Configure LoRA and Set Up the Trainer**

### **LoRA Configuration**

Instead of training the entire model (which has billions of parameters), we'll use a technique called **Lo**w-**R**ank **A**daptation (LoRA). 🎛️

Think of it like this: rather than rewriting an entire textbook, we're just adding small, efficient "sticky notes" (the LoRA adapters) to update the model's knowledge. This is much faster and requires significantly less memory than full fine-tuning. We'll use a fairly large **rank** (`r = 128`, with `lora_alpha = 256`), which gives the adapters more capacity than a small rank such as 8, at the cost of more trainable parameters and memory.
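
To get a feel for how the rank affects adapter size, the sketch below counts LoRA parameters for the seven target modules used in the next cell, comparing a small rank of 8 with the `r = 128` passed to `get_peft_model`. Each adapted weight of shape `d_out x d_in` gains two low-rank factors with `r * (d_in + d_out)` parameters in total. The layer dimensions are assumptions taken from the published Llama-3-8B configuration, not values read from the loaded model.

```python
# Shapes below are assumptions based on the published Llama-3-8B configuration.
HIDDEN = 4096      # model hidden size
KV_DIM = 1024      # key/value projection width (8 KV heads x 128)
MLP_DIM = 14336    # feed-forward intermediate size
NUM_LAYERS = 32

# (input features, output features) for each adapted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Total LoRA parameters: rank * (d_in + d_out) per module, summed over layers."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

for rank in (8, 128):
    print(f"r = {rank:>3}: {lora_param_count(rank):,} trainable LoRA parameters")
```

The count grows linearly with the rank, so moving from 8 to 128 multiplies the number of trainable parameters by 16.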


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 256, # A common practice is to set alpha = 2 * r
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)


### **SFTTrainer Setup**

Now we'll set up the `SFTTrainer` (Supervised Fine-tuning Trainer). This is the main tool from the `trl` library that will handle the entire training loop for us. We'll give it our model, tokenizer, dataset, and a set of training instructions, such as the batch size and number of epochs.
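
One of those training instructions is the learning-rate schedule. The cell below sets `lr_scheduler_type = "linear"` with `warmup_steps = 10`, `max_steps = 625`, and `learning_rate = 5e-5`. The sketch here is a plain-Python approximation of the usual linear-warmup, linear-decay rule, written for illustration; it is not the library's own implementation, and the function name is hypothetical.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=10, total_steps=625):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 5, 10, 300, 625):
    print(f"step {step:>3}: lr = {linear_schedule_lr(step):.2e}")
```

The rate ramps up over the first 10 steps, peaks at the configured value, and then falls steadily until the final step.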

We will train for **625 steps** at a batch size of 64, which amounts to **one epoch** (a single pass over our 40,000-sample training set).
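
The relationship between dataset size, batch size, and step count can be checked with a short calculation. This sketch uses the values from this notebook (40,000 balanced training examples, `per_device_train_batch_size = 64`, `gradient_accumulation_steps = 1`) and assumes a single GPU; the helper name is hypothetical.

```python
import math

def steps_per_epoch(num_examples, per_device_batch_size, grad_accum_steps=1, num_devices=1):
    """Optimizer steps needed for one pass over the data."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_examples / effective_batch)

# Values from this notebook; a single GPU is an assumption.
steps = steps_per_epoch(num_examples=40_000, per_device_batch_size=64)
print(steps)
```

With these settings one epoch is 625 steps, which is the value given to `max_steps` in the trainer configuration.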

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_train_dataset,
    eval_dataset = formatted_validation_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 64,
        gradient_accumulation_steps = 1,
        warmup_steps = 10,
        max_steps = 625,
        learning_rate = 5e-5,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 300,
        eval_strategy = "steps",
        eval_steps = 300,
        save_strategy = "steps",
        save_steps = 500,
        optim = "adamw_bnb_8bit", #"adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "/content/drive/MyDrive/llama_finetune_checkpoints",
        report_to = "none",

        #gradient_checkpointing = True,             # reduces memory use
        dataloader_num_workers = 12,                # increase for faster loading
        group_by_length = True,
    ),
)

## **Step 5: Start Training!**

Now, we'll call the `train()` function on our `trainer` object. This will kick off the fine-tuning process. Based on our settings, this will run for 625 steps, one full epoch over our 40,000 training examples.


In [None]:
trainer.train()
#trainer.train(resume_from_checkpoint = True)


## **Step 6: Evaluation**


In [None]:
from datasets import load_dataset, concatenate_datasets

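# Load the training split and sample 1,000 labeled examples for a quick evaluation.
# Note: these come from the same split used for training (with a different shuffle seed),
# so some of them may also have been seen during training.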
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")
test_dataset = full_dataset.shuffle(seed=46).select(range(1000))


In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 4096  # Or whatever you used
dtype = None           # This will auto-detect
load_in_4bit = True    # Use 4-bit quantization

checkpoint_step = 625
checkpoint_path = f"/content/drive/MyDrive/llama_finetune_checkpoints/checkpoint-{checkpoint_step}"
print(f"Loading model from: {checkpoint_path}")

# Load the fine-tuned model and tokenizer from your checkpoint
model2, tokenizer2 = FastLanguageModel.from_pretrained(
    model_name = checkpoint_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print("Model loaded successfully!")

In [None]:
import torch
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset

# Prepare model for inference
FastLanguageModel.for_inference(model2)

# Get token IDs for stopping
eos_id = tokenizer2.eos_token_id
eot_id = tokenizer2.convert_tokens_to_ids("<|eot_id|>")

predictions = []

# Generate predictions for each test example
for example in tqdm(test_dataset, desc="Generating predictions"):
    question = example["question"]
    solution = example["solution"]

    # Format using chat template (same as training/validation)
    messages = [
        {
            "role": "system",
            "content": "You are a great mathematician. Your task is to verify if a given solution to a math problem is correct. Respond with only 'True' if the solution is correct, and only 'False' otherwise."
        },
        {
            "role": "user",
            "content": f"Question:\n{question}\n\nSolution:\n{str(solution)}"
        }
    ]

    # Apply chat template
    inputs = tokenizer2.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")

    # Generate with proper stopping
    with torch.no_grad():
        outputs = model2.generate(
            input_ids=inputs,
            max_new_tokens=5,
            do_sample=False,  # greedy decoding; temperature has no effect, so it is not set
            num_beams=1,
            eos_token_id=[eos_id, eot_id],
            pad_token_id=eos_id,
            use_cache=True,
        )

    # Extract only newly generated tokens
    prompt_len = inputs.shape[1]
    gen_ids = outputs[0, prompt_len:]

    # Cut at first EOT token
    eot_positions = (gen_ids == eot_id).nonzero(as_tuple=True)[0]
    if len(eot_positions) > 0:
        gen_ids = gen_ids[:eot_positions[0].item()]

    # Cut at first EOS token
    eos_positions = (gen_ids == eos_id).nonzero(as_tuple=True)[0]
    if len(eos_positions) > 0:
        gen_ids = gen_ids[:eos_positions[0].item()]

    # Decode and parse
    response_text = tokenizer2.decode(gen_ids, skip_special_tokens=True).strip()

    # Extract first word and normalize
    first_word = response_text.split()[0] if response_text else ""

    # Map to boolean (Kaggle expects True/False boolean values)
    if first_word.lower().startswith("true"):
        prediction = True
    elif first_word.lower().startswith("false"):
        prediction = False
    else:
        # Fallback: default to False if unparseable (shouldn't happen with good training)
        prediction = False
        print(f"Warning: Unexpected output '{response_text}' - defaulting to False")

    predictions.append(prediction)


In [None]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = predictions
y_true = list(test_dataset['is_correct'])

print(classification_report(y_true, y_pred))
print("Confusion:\n", confusion_matrix(y_true, y_pred))

## **Step 7: Generate Submission File**

This is the final step! We will now run our fine-tuned model on the official `test` dataset.

We will loop through each example in the test set, generate a prediction, and format the results into a CSV file with two columns: `ID` and `is_correct`, as required by the competition.
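
The parsing and file-writing logic can be tried in isolation before running the full model. The sketch below uses only the standard library and a handful of made-up model outputs; `parse_prediction` is a hypothetical helper that mirrors the first-word check in the cell further down, where anything that does not start with "true" or "false" falls back to `False`.

```python
import csv
import io

def parse_prediction(response_text: str) -> bool:
    """Map a generated string to a boolean; unparseable output falls back to False."""
    words = response_text.strip().split()
    first_word = words[0].lower() if words else ""
    return first_word.startswith("true")

# Hypothetical model outputs, for illustration only.
raw_outputs = ["True", "False", " true.", "", "I think so"]
predictions = [parse_prediction(text) for text in raw_outputs]

# Write the two required columns to an in-memory CSV instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ID", "is_correct"])
for row_id, prediction in enumerate(predictions):
    writer.writerow([row_id, prediction])
print(buffer.getvalue())
```

Using the row index as the `ID` assumes the test set is processed in its original order, which is also what the submission cell does.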


In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 4096  # Or whatever you used
dtype = None           # This will auto-detect
load_in_4bit = True    # Use 4-bit quantization

checkpoint_step = 625
checkpoint_path = f"/content/drive/MyDrive/llama_finetune_checkpoints/checkpoint-{checkpoint_step}"
print(f"Loading model from: {checkpoint_path}")

# Load the fine-tuned model and tokenizer from your checkpoint
model2, tokenizer2 = FastLanguageModel.from_pretrained(
    model_name = checkpoint_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print("Model loaded successfully!")

In [None]:
import torch
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset

# Prepare model for inference
FastLanguageModel.for_inference(model2)

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")

# Get token IDs for stopping
eos_id = tokenizer2.eos_token_id
eot_id = tokenizer2.convert_tokens_to_ids("<|eot_id|>")

predictions = []

# Generate predictions for each test example
for example in tqdm(test_dataset, desc="Generating predictions"):
    question = example["question"]
    solution = example["solution"]

    # Format using chat template (same as training/validation)
    messages = [
        {
            "role": "system",
            "content": "You are a great mathematician. Your task is to verify if a given solution to a math problem is correct. Respond with only 'True' if the solution is correct, and only 'False' otherwise."
        },
        {
            "role": "user",
            "content": f"Question:\n{question}\n\nSolution:\n{str(solution)}"
        }
    ]

    # Apply chat template
    inputs = tokenizer2.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")

    # Generate with proper stopping
    with torch.no_grad():
        outputs = model2.generate(
            input_ids=inputs,
            max_new_tokens=5,
            do_sample=False,  # greedy decoding; temperature has no effect, so it is not set
            num_beams=1,
            eos_token_id=[eos_id, eot_id],
            pad_token_id=eos_id,
            use_cache=True,
        )

    # Extract only newly generated tokens
    prompt_len = inputs.shape[1]
    gen_ids = outputs[0, prompt_len:]

    # Cut at first EOT token
    eot_positions = (gen_ids == eot_id).nonzero(as_tuple=True)[0]
    if len(eot_positions) > 0:
        gen_ids = gen_ids[:eot_positions[0].item()]

    # Cut at first EOS token
    eos_positions = (gen_ids == eos_id).nonzero(as_tuple=True)[0]
    if len(eos_positions) > 0:
        gen_ids = gen_ids[:eos_positions[0].item()]

    # Decode and parse
    response_text = tokenizer2.decode(gen_ids, skip_special_tokens=True).strip()

    # Extract first word and normalize
    first_word = response_text.split()[0] if response_text else ""

    # Map to boolean (Kaggle expects True/False boolean values)
    if first_word.lower().startswith("true"):
        prediction = True
    elif first_word.lower().startswith("false"):
        prediction = False
    else:
        # Fallback: default to False if unparseable (shouldn't happen with good training)
        prediction = False
        print(f"Warning: Unexpected output '{response_text}' - defaulting to False")

    predictions.append(prediction)

# Create submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})
checkpoint_step = 625  # matches the checkpoint loaded above
submission_path = f"/content/drive/MyDrive/llama_finetune_checkpoints/submission-{checkpoint_step}.csv"
submission.to_csv(submission_path, index=False)

print(f"\n✓ Submission file '{submission_path}' created successfully!")
print(f"Total predictions: {len(predictions)}")
print(f"True: {sum(predictions)}, False: {len(predictions) - sum(predictions)}")
print("\nFirst 10 predictions:")
print(submission.head(10))
print("\nYou can now download this file and submit it to the Kaggle competition.")

In [None]:
submission.is_correct.value_counts()