

# Welcome to the Math Question Answer Verification Competition! 🚀

The goal is to fine-tune a Llama-3-8B model to predict if a given solution to a math problem is correct or not. Your model should output True if the solution is correct, and False otherwise.

Good luck, and have fun! 🎉

## **Step 1: Install Necessary Libraries**

First, we need to install the required Python libraries. We'll be using the unsloth library, which provides highly efficient, memory-saving training methods for large language models, making it possible to fine-tune powerful models on a single free-tier GPU. We'll also install xformers for further optimization.


In [None]:
# %%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.26" "trl<0.9.0" "peft<0.12.0" "accelerate<0.32.0" "bitsandbytes<0.44.0" "transformers<4.43.0"

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-quhtr0ui/unsloth_c0d033f2d7fa4b3ebfcf3abd399cc048
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-quhtr0ui/unsloth_c0d033f2d7fa4b3ebfcf3abd399cc048
  Resolved https://github.com/unslothai/unsloth.git to commit 71172a6bd7160cb386d9f3630b2f8675f9338538
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting transformers!=4.52.0,!=4.52.1,!=4.52.2,!=4.52.3,!=4.53.0,!=4.54.0,!=4.55.0,!=4.55.1,<=4.56.2,>=4.51.3 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Using cached transformers-4.56.2-py3-none-any.whl.metadata (40 kB)
Co

## **Step 2: Load the Model and Tokenizer**

Next, we'll load the only model permitted for this competition, an 8B Llama 3 model (the code below loads the `unsloth/Meta-Llama-3.1-8B` checkpoint). We'll use Unsloth's `FastLanguageModel` to handle this efficiently.

A key technique we'll use is 4-bit quantization (`load_in_4bit = True`), which stores each weight in roughly 4 bits instead of 16. Think of this as compressing the model into a much smaller memory footprint. This significantly reduces the amount of GPU memory required, allowing us to fine-tune this large model even on a free platform like Google Colab.
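As a rough back-of-the-envelope check on why 4-bit loading matters, the sketch below estimates the memory taken by the weights alone at two precisions. The parameter count of 8 billion is a round-number assumption, and `weight_memory_gib` is a hypothetical helper introduced here; quantization metadata, activations, and optimizer state are ignored.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


NUM_PARAMS = 8_000_000_000  # assumption: roughly 8B parameters

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")  # ~14.9 GiB
print(f" 4-bit weights: ~{int4_gib:.1f} GiB")  # ~3.7 GiB
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit footprint, which is what leaves room for adapters and activations on a single GPU.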



In [1]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 * 2  # Choose any sequence length
dtype = None  # This will auto-detect the best data type for your GPU
load_in_4bit = True  # Use 4-bit quantization to save memory

# Load the model and tokenizer from Hugging Face
# Note: We use the base model, not a 4-bit pre-quantized one,
# to ensure we start from the official weights.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B", # Competition-approved model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


#### Unsloth: `hf_xet==1.1.10` and `ipykernel>6.30.1` breaks progress bars. Disabling for now in XET.
#### Unsloth: To re-enable progress bars, please downgrade to `ipykernel==6.30.1` or wait for a fix to
https://github.com/huggingface/xet-core/issues/526
Switching to PyTorch attention since your Xformers is broken.

('unterminated string literal (detected at line 997)', (997, 1))
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.9: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.617 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## **Step 3: Prepare the Dataset**

This is a crucial step where we format our data into a structure the model can learn from. The process involves three parts:

1.  **Loading**: We'll load the official competition dataset from Hugging Face.
2.  **Splitting**: The full dataset is massive (1,000,000 training examples). After shuffling, we take a **300,000-sample slice for training** and a separate **10,000-sample slice for validation**.
3.  **Prompting**: We will format each data sample into a clear instructional prompt. This helps the model understand its role as a mathematician verifying a solution.



In [None]:
from datasets import load_dataset

# Load the full training dataset
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")

# Shuffle the dataset for randomness and create our smaller splits
shuffled_dataset = full_dataset.shuffle(seed=42)
train_dataset = shuffled_dataset.select(range(500000, 800000))      # 300,000 shuffled examples (indices 500,000-799,999) for training

validation_dataset = shuffled_dataset.select(range(900000, 910000)) # 10,000 different examples (indices 900,000-909,999) for validation

print(f"{len(train_dataset)} / {len(shuffled_dataset)}")

300000 / 1000000
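
Both splits are slices of the same shuffled dataset, so it is worth confirming that the two index ranges cannot overlap. A small sketch using plain `range` objects (no dataset download needed); the variable names are illustrative.

```python
train_idx = range(500_000, 800_000)
val_idx = range(900_000, 910_000)

# Two half-open ranges overlap only if each starts before the other ends.
overlap = train_idx.start < val_idx.stop and val_idx.start < train_idx.stop

print(len(train_idx), len(val_idx), overlap)  # 300000 10000 False
```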


### I've used a template-switching technique to avoid overfitting

In [17]:
# The instructional prompt template for training
TEMPLATES1 = [
    # 1) Original template
    (
    "You are a strict math answer verifier. Reply with exactly one token: True or False.\n\n"
    "QUESTION:\n{q}\n\n"
    "STUDENT'S FINAL ANSWER:\n{a}\n\n"
    "STUDENT'S REASONING:\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),
    # 2) Minimal variation, same sections
    (
    "Decide if the student's answer is correct. Respond with exactly one token: True or False.\n\n"
    "QUESTION:\n{q}\n\n"
    "STUDENT'S FINAL ANSWER:\n{a}\n\n"
    "STUDENT'S REASONING:\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),
    # 3) Slightly different headings (keeps semantics)
    (
    "Act as an impartial grader. Output exactly one token: True or False.\n\n"
    "Problem:\n{q}\n\n"
    "Final Answer (student):\n{a}\n\n"
    "Reasoning (student):\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),
    # 4) Compact headings
    (
    "Judge correctness. Allowed outputs: True or False.\n\n"
    "Q:\n{q}\n\n"
    "A(final):\n{a}\n\n"
    "Work:\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),
    # 5) Emphasize strict output
    (
    "Strictly return one token (True or False) indicating whether the student's solution is correct.\n\n"
    "QUESTION:\n{q}\n\n"
    "STUDENT'S FINAL ANSWER:\n{a}\n\n"
    "STUDENT'S REASONING:\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),
    # 6) Another neutral phrasing
    (
    "Evaluate the student's solution and reply with one token: True or False.\n\n"
    "QUESTION:\n{q}\n\n"
    "STUDENT'S FINAL ANSWER:\n{a}\n\n"
    "STUDENT'S REASONING:\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),
]

In [18]:
# The instructional prompt template for training
# Stronger template set1
TEMPLATES2 = [
    # 1) Strict single-token, explicit casing & format (training usage)
    (
    "SYSTEM: You are a strict math verifier. You MUST respond with exactly one token: 'True' or 'False' (capitalized). "
    "Do NOT add punctuation, explanation, or extra whitespace. Output must be exactly one of: True / False.\n\n"
    "QUESTION:\n{q}\n\n"
    "STUDENT'S FINAL ANSWER:\n{a}\n\n"
    "STUDENT'S REASONING:\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 2) 1-shot exemplar (helps formatting consistency)
    (
    "You are a verifier. For each example, output exactly one token: 'True' or 'False'.\n\n"
    "EXAMPLE:\nQUESTION: 2 + 2 = ? | STUDENT ANSWER: 4 | REASONING: correct arithmetic.\nIS CORRECT: True\n\n"
    "Now evaluate:\nQUESTION:\n{q}\n\n"
    "STUDENT'S FINAL ANSWER:\n{a}\n\n"
    "STUDENT'S REASONING:\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 3) Contrastive instruction emphasising correctness criteria (good for boundary cases)
    (
    "Instruction: Determine whether the student's final answer is exactly correct for the given question and reasoning. "
    "If the final answer is mathematically correct and the reasoning supports it, return 'True'. Otherwise return 'False'. "
    "Return exactly one token: True or False.\n\n"
    "QUESTION:\n{q}\n\n"
    "STUDENT'S FINAL ANSWER:\n{a}\n\n"
    "STUDENT'S REASONING:\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 4) Compact minimal template (fast inference / short context)
    (
    "Judge correctness: output one token 'True' or 'False'.\n\n"
    "Q: {q}\n\n"
    "A: {a}\n\n"
    "Work: {s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 5) Negative-check template (explicitly check for common mistake types)
    (
    "Role: Math verifier. First check whether the numeric result matches the derivation and units. "
    "If there's an arithmetic, sign, off-by-one, or algebraic error in the work or final answer, output 'False'. "
    "Otherwise output 'True'. Exactly one token only.\n\n"
    "QUESTION:\n{q}\n\n"
    "STUDENT'S FINAL ANSWER:\n{a}\n\n"
    "STUDENT'S REASONING:\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 6) Enforced-label-format template (helps normalize different label encodings)
    (
    "Evaluator: Respond with EXACTLY 'True' or 'False' (capitalized, no trailing periods). "
    "Treat 'True' as correct and 'False' as incorrect. Do not output any other characters.\n\n"
    "Problem:\n{q}\n\n"
    "Final Answer (student):\n{a}\n\n"
    "Solution/Work:\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 7) Few-shot with a counterexample (teaches the model what 'False' looks like)
    (
    "You are a verifier. Examples:\n"
    "EX1: Q: 5+3=8 | Ans: 8 | Work: correct => IS CORRECT: True\n"
    "EX2: Q: 6/2=4 | Ans: 4 | Work: incorrect division => IS CORRECT: False\n\n"
    "Now evaluate:\nQUESTION:\n{q}\n\n"
    "STUDENT'S FINAL ANSWER:\n{a}\n\n"
    "STUDENT'S REASONING:\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),
]

In [19]:
# The instructional prompt template for training
# Stronger template set2
TEMPLATES3 = [
    # 1) Self-verification with backward checking
    (
    "Verify math correctness using backward validation.\n\n"
    "STEP 1: Check if the final answer is numerically correct.\n"
    "STEP 2: Verify the reasoning supports the answer.\n"
    "STEP 3: Check for arithmetic/algebraic errors.\n"
    "Output EXACTLY one token: True or False\n\n"
    "###QUESTION###\n{q}\n\n"
    "###FINAL ANSWER###\n{a}\n\n"
    "###STUDENT WORK###\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 2) Error-focused contrastive with specific checks
    (
    "Math Verifier: Check for these errors:\n"
    "- Arithmetic mistakes (addition, subtraction, multiplication, division)\n"
    "- Sign errors (positive/negative)\n"
    "- Unit mismatches\n"
    "- Off-by-one errors\n"
    "- Order of operations violations\n\n"
    "If ANY error exists: output False\n"
    "If all correct: output True\n"
    "Output exactly one token.\n\n"
    "Q: {q}\nA: {a}\nWork: {s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 3) Few-shot with diverse contrastive examples
    (
    "Evaluate mathematical correctness. Examples:\n\n"
    "EX1: Q: 15÷3=? | Ans: 5 | Work: 15/3=5 ✓ => True\n"
    "EX2: Q: 8+7=? | Ans: 16 | Work: 8+7=14 ✗ (arithmetic error) => False\n"
    "EX3: Q: 3x²=12, x=? | Ans: 2 | Work: x²=4, x=2 ✓ => True\n"
    "EX4: Q: -5×3=? | Ans: 15 | Work: -5×3=15 ✗ (sign error) => False\n\n"
    "Now verify:\n"
    "QUESTION: {q}\n"
    "ANSWER: {a}\n"
    "REASONING: {s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 4) Numerical precision validator
    (
    "Role: Precision Validator. Verify numerical accuracy.\n"
    "Check: (1) Final numeric value matches derivation\n"
    "      (2) All intermediate calculations correct\n"
    "      (3) Units/dimensions consistent\n"
    "Return single token True if perfectly correct, False otherwise.\n\n"
    "===PROBLEM===\n{q}\n\n"
    "===STUDENT ANSWER===\n{a}\n\n"
    "===SOLUTION===\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 5) Step-by-step decomposition
    (
    "Verification Protocol:\n"
    "1. Parse the question requirements\n"
    "2. Trace through each calculation step\n"
    "3. Compare final answer to question\n"
    "4. Confirm: Does answer = work output?\n"
    "Output True only if all checks pass, otherwise False.\n"
    "Single token response required.\n\n"
    "QUESTION:\n{q}\n\n"
    "CLAIMED ANSWER:\n{a}\n\n"
    "WORK SHOWN:\n{s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 6) Affirmative instruction style
    (
    "DO evaluate if the answer is mathematically correct.\n"
    "DO verify calculations match the final answer.\n"
    "DO check all arithmetic operations.\n"
    "DO confirm units and signs are correct.\n"
    "Output one token: True for correct, False for incorrect.\n\n"
    "Problem: {q}\n"
    "Student's Answer: {a}\n"
    "Student's Work: {s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 7) Tip-incentivized template
    (
    "I will tip you $200 for accurate verification!\n"
    "Task: Determine if the student's answer is exactly correct.\n"
    "Check: numerical accuracy, reasoning validity, calculation steps.\n"
    "Respond with ONLY True or False (no explanation).\n\n"
    "Question: {q}\n"
    "Answer: {a}\n"
    "Solution: {s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 8) Natural language verification
    (
    "Act as a math teacher grading student work. "
    "The answer is correct ONLY if both the final numeric value AND "
    "the reasoning steps are valid. Check for: "
    "calculation errors, logical flaws, wrong formula application. "
    "Reply with exactly one word: True or False.\n\n"
    "Problem Statement: {q}\n"
    "Student Submitted: {a}\n"
    "Student Reasoning: {s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 9) Binary decision tree
    (
    "Decision: Does final answer match correct result from work?\n"
    "If YES: Is work logically sound?\n"
    "  If YES: output True\n"
    "  If NO: output False\n"
    "If NO: output False\n\n"
    "Answer in one token only: True or False\n\n"
    "Q: {q}\nA: {a}\nWork: {s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 10) Meta-prompt with explicit format
    (
    "VERIFICATION TASK\n"
    "Input: Math question + student answer + solution steps\n"
    "Process: Validate numerical correctness and logical coherence\n"
    "Output Format: Single token - either True or False\n"
    "True = answer is correct | False = answer is incorrect\n\n"
    "QUESTION: {q}\n"
    "ANSWER: {a}\n"
    "SOLUTION: {s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 11) Zero-shot CoT variant
    (
    "Let's verify this step by step:\n"
    "First, understand what the question asks.\n"
    "Second, check if the work produces the given answer.\n"
    "Third, verify no calculation errors exist.\n"
    "Final output: True if correct, False if wrong.\n"
    "One token only.\n\n"
    "Q: {q}\nA: {a}\nWork: {s}\n\n"
    "IS CORRECT:\n{y}"
    ),

    # 12) Minimal signal-to-noise
    (
    "Verify: {q}\n"
    "Claimed: {a}\n"
    "Work: {s}\n\n"
    "IS CORRECT:\n{y}"
    ),
]

In [4]:

# We must add an End Of Sequence (EOS) token to tell the model when a completion is finished.
EOS_TOKEN = tokenizer.eos_token

# This function formats our data samples into the prompt template.
def formatting_prompts_func(examples):
    texts = []
    for q, sol, y, a in zip(examples["question"], examples["solution"], examples["is_correct"], examples["answer"]):
        # Pick one template per question. NOTE: Python salts str hashes per process
        # (unless PYTHONHASHSEED is set), so this choice is not reproducible across runs.
        t = TEMPLATES3[hash(q) % len(TEMPLATES3)]
        texts.append(t.format(q=q, s=sol, y=y, a=a) + EOS_TOKEN)
    return {"text": texts}

# Record the token length of each solution so we can sort by it
def add_len(example):
    example["sol_len"] = len(tokenizer.encode(example["solution"]))
    return example


train_len = train_dataset.map(add_len)
# Sort short→long as an easy→hard curriculum. NOTE: the Trainer's default sampler
# shuffles examples, so this order may not carry over into training.
formatted_train_dataset = train_len.sort("sol_len")
# Apply the formatting function to our training dataset
formatted_train_dataset = formatted_train_dataset.map(formatting_prompts_func, batched=True)

Map: 100%|██████████| 300000/300000 [00:53<00:00, 5566.85 examples/s]
Map: 100%|██████████| 300000/300000 [00:03<00:00, 93064.29 examples/s] 
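
One caveat with `hash(q)`: Python salts string hashes per process unless `PYTHONHASHSEED` is fixed, so the template picked for a given question can change between sessions. Below is a minimal sketch of a reproducible alternative, assuming a checksum such as `zlib.crc32` is acceptable for this purpose. `pick_template` and the toy template list are hypothetical and not part of the original notebook.

```python
import zlib


def pick_template(question: str, templates: list[str]) -> str:
    """Deterministically map a question to one template."""
    # crc32 of the UTF-8 bytes is the same in every Python process,
    # unlike the built-in hash() of a str.
    index = zlib.crc32(question.encode("utf-8")) % len(templates)
    return templates[index]


# Toy templates for illustration only.
toy_templates = [
    "Q:\n{q}\nIS CORRECT:\n{y}",
    "Problem:\n{q}\nIS CORRECT:\n{y}",
]
question = "What is 2 + 2?"
chosen = pick_template(question, toy_templates)
print(chosen.format(q=question, y=True))
```

The same helper could be reused at inference time so that a question is paired with the same template in training and in evaluation.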


In [6]:
example = formatted_train_dataset[-1]["text"]
print(example)

Decision: Does final answer match correct result from work?
If YES: Is work logically sound?
  If YES: output True
  If NO: output False
If NO: output False

Answer in one token only: True or False

Q: John has five children.  What is the probability that at least half of them are girls? (We can assume a boy is equally likely to be born as is a girl, and vice-versa.)
A: \frac{1}{0}
Work: To calculate the probability of having at least 3 girls, we need to calculate the sample space as well. We can either do it theoretically or empirically.

For empirical approach, we can simulate as below. And it gives us that it is 100 percent chance that there is at least half girls.

<llm-code>
import random

def simulate():
    sample_space = []
    boys_num = 0
    girls_num = 0
    for _ in range(5):
        num = random.randint(0,1)
        sample_space.append(num)
        if num == 0:
            boys_num += 1
        else:
            girls_num += 1
    return " ".join(map(str, sample_space)), 

## **Step 4: Configure LoRA and Set Up the Trainer**

### **LoRA Configuration**

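Instead of training the entire model (which has billions of parameters), we'll use a technique called **Lo**w-**R**ank **A**daptation (LoRA). 🎛️ LoRA keeps the base weights frozen and trains small low-rank adapter matrices attached to selected layers.

We use rank `r = 32` with `lora_alpha = 64`.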
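
To make the cost of this rank concrete, the sketch below counts the adapter parameters LoRA adds: a rank-`r` adapter on a linear layer with input width `d_in` and output width `d_out` adds `r * (d_in + d_out)` weights. The layer shapes are the published Llama 3 8B dimensions (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers), stated here as assumptions since the notebook does not print them. `lora_params` is a hypothetical helper.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A rank-r adapter adds A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Llama 3 8B shapes: (input width, output width) per projection.
HIDDEN, KV, MLP, LAYERS = 4096, 1024, 14336, 32
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

r = 32
per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
total = per_layer * LAYERS
print(f"{total:,}")  # 83,886,080
```

Under these assumptions the total matches the trainable-parameter count reported in the training banner further down (83,886,080).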


In [7]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # A small rank for lighter training
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 64, # A common practice is to set alpha = 2 * r
    lora_dropout = 0.1,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.10.9 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.



### **SFTTrainer Setup**

Now we'll set up the `SFTTrainer` (Supervised Fine-tuning Trainer). This is the main tool from the `trl` library that will handle the entire training loop for us. We'll give it our model, tokenizer, dataset, and a set of training instructions, such as the batch size and number of epochs.

We will train for **five epochs** (`num_train_epochs = 5`) over our 300,000-sample training set.

In [9]:
# Estimate the number of optimizer steps per epoch before creating the trainer
import math, torch

N = len(formatted_train_dataset)  # 300,000 for the split above
B = 4                              # per_device_train_batch_size (keep in sync with TrainingArguments; the trainer below uses 8)
G = 4                              # gradient_accumulation_steps
D = max(1, torch.cuda.device_count())  # number of GPUs (1 if none detected)

micro_batch_total = B * D
micro_batches = math.ceil(N / micro_batch_total)
steps_per_epoch = math.ceil(micro_batches / G)

print(f"TRAINING SIZE N = {N}")
print(f"Devices (D) = {D}, per-device batch (B) = {B}, grad_accum (G) = {G}")
print(f"micro_batches = ceil(N/(B*D)) = {micro_batches}")
print(f"steps_per_epoch = ceil(micro_batches / G) = {steps_per_epoch}")

# Recommended schedule (example)
desired_epochs = 3
max_steps_computed = steps_per_epoch * desired_epochs
print(f"For {desired_epochs} epochs -> max_steps = {max_steps_computed} optimizer steps")


TRAINING SIZE N = 300000
Devices (D) = 1, per-device batch (B) = 4, grad_accum (G) = 4
micro_batches = ceil(N/(B*D)) = 75000
steps_per_epoch = ceil(micro_batches / G) = 18750
For 3 epochs -> max_steps = 56250 optimizer steps


In [10]:
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from transformers import TrainingArguments
import numpy as np

response_template = "IS CORRECT:\n"
collator = DataCollatorForCompletionOnlyLM(
    tokenizer=tokenizer,
    response_template=response_template,
    mlm=False,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = collator,
    args = TrainingArguments(
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 5,
        learning_rate = 5e-5,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 100,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 42,
        output_dir = "outputs6",
        report_to = "none",
        gradient_checkpointing = True
    ),
)

Map: 100%|██████████| 300000/300000 [00:24<00:00, 12061.89 examples/s]
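
The completion-only collator means the loss is computed only on the tokens after the `IS CORRECT:\n` marker, so the model is trained on the label rather than on reproducing the question. The sketch below illustrates that masking idea on made-up token IDs. It is a simplified illustration, not the `trl` implementation, and `mask_prompt_tokens` plus the toy IDs are hypothetical. The value `-100` is the label index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def mask_prompt_tokens(input_ids: list[int], response_ids: list[int]) -> list[int]:
    """Return labels where everything up to and including the last
    occurrence of the response marker is replaced by IGNORE_INDEX."""
    n, m = len(input_ids), len(response_ids)
    # Scan from the right so the final marker wins.
    for start in range(n - m, -1, -1):
        if input_ids[start:start + m] == response_ids:
            cut = start + m
            return [IGNORE_INDEX] * cut + input_ids[cut:]
    # Marker not found: ignore the whole example.
    return [IGNORE_INDEX] * n


# Toy sequence: prompt tokens, marker (7, 8), then a label token and EOS.
toy_ids = [11, 12, 13, 7, 8, 42, 2]
labels = mask_prompt_tokens(toy_ids, [7, 8])
print(labels)  # [-100, -100, -100, -100, -100, 42, 2]
```

Only the last two positions (the label and the end-of-sequence token) contribute to the loss in this toy example.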


## **Step 5: Start Training!**

Now, we'll call the `train()` function on our `trainer` object. This will kick off the fine-tuning process. Based on our settings, this is scheduled for five epochs over our 300,000 examples (46,875 optimizer steps at an effective batch size of 32). The run logged below was stopped manually after about 1,000 steps.

Grab a coffee, as this will take a while! ☕
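
The training arguments combine a 10% warmup with a cosine schedule. As a rough sketch of what that implies for the learning rate over the 46,875 scheduled steps, the function below implements linear warmup followed by cosine decay to zero. It is a standalone approximation written for illustration: `lr_at_step` is a hypothetical name, and the exact warmup-step rounding used by the library may differ.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL = 46_875                    # 5 epochs * 300,000 / (8 * 4)
WARMUP = math.ceil(0.1 * TOTAL)   # assumption: warmup_ratio rounded up
PEAK = 5e-5                       # learning_rate from the training arguments

for step in (0, 1_000, WARMUP, TOTAL // 2, TOTAL):
    print(step, f"{lr_at_step(step, TOTAL, WARMUP, PEAK):.2e}")
```

Under this sketch, step 1,000 is still well inside the warmup phase, so the learning rate there is only a fraction of the configured peak.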


In [11]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 300,000 | Num Epochs = 5 | Total steps = 46,875
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 83,886,080 of 8,114,147,328 (1.03% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
100,2.6512
200,0.52
300,0.2677
400,0.2599
500,0.243
600,0.2236
700,0.2083
800,0.2075
900,0.1987
1000,0.1893


KeyboardInterrupt: 


## **Step 6: Inference and Evaluation**

Now that our model is trained, we need to test it on our validation set. We'll use a slightly different prompt for inference: one where we leave the `IS CORRECT:` section blank for the model to complete.

Let's test it on a single example from our validation set to see what it predicts.
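
The inference prompt is a training template with the label placeholder removed, and the prediction is whatever follows the final `IS CORRECT:` marker. Below is a small string-only sketch of both steps, using a toy template and a made-up decoded response. `to_inference_prompt` and `parse_label` are hypothetical helpers, and the exact-match parsing here is stricter than the substring check used in this notebook.

```python
MARKER = "IS CORRECT:\n"


def to_inference_prompt(train_template: str) -> str:
    """Cut the training template right after the marker, dropping {y}."""
    head, _sep, _tail = train_template.rpartition(MARKER)
    return head + MARKER


def parse_label(decoded: str, eos: str = "<|end_of_text|>") -> bool:
    """Read the text after the last marker and compare it to 'True'."""
    tail = decoded.rsplit(MARKER, 1)[-1]
    return tail.replace(eos, "").strip() == "True"


toy_template = "QUESTION:\n{q}\n\nSTUDENT'S FINAL ANSWER:\n{a}\n\nIS CORRECT:\n{y}"
prompt = to_inference_prompt(toy_template).format(q="2 + 2 = ?", a="5")
decoded = prompt + "False<|end_of_text|>"  # made-up model output

print(prompt.endswith(MARKER))  # True
print(parse_label(decoded))     # False
```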

In [21]:
model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "trained_models/model2",
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )

# Prepare the model for faster inference
FastLanguageModel.for_inference(model)

# Create the prompt template for inference (no answer included)
inference_prompt =(
    "Instruction: Determine whether the student's final answer is exactly correct for the given question and reasoning. "
    "If the final answer is mathematically correct and the reasoning supports it, return 'True'. Otherwise return 'False'. "
    "Return exactly one token: True or False.\n\n"
    "QUESTION:\n{q}\n\n"
    "STUDENT'S FINAL ANSWER:\n{a}\n\n"
    "STUDENT'S REASONING:\n{s}\n\n"
    "IS CORRECT:\n"
)

# Select a sample from the validation set
example = validation_dataset[10] # You can change the index (e.g., to 1, 2, 50)
q = example["question"]
s = example["solution"]
a = example["answer"]

# Format the prompt with the validation data
inputs = tokenizer(
[
    inference_prompt.format(q=q, s=s, a=a)
], return_tensors = "pt").to("cuda")

# Generate the model's response
outputs = model.generate(**inputs, max_new_tokens = 8, use_cache = True)
response = tokenizer.batch_decode(outputs)

# Print the results
print("#### QUESTION ####")
print(q)
print("\n#### SOLUTION ####")
print(s)
print("#### MODEL'S PREDICTION ####")
prediction_text = response[0].split("IS CORRECT:\n")[-1].strip()
print(prediction_text)  # may include the end-of-text token, e.g. False<|end_of_text|>

# Parse prediction
is_correct_prediction = 'true' in prediction_text.lower()
print(f"\nParsed as: {is_correct_prediction}")
print(f"Ground truth: {example['is_correct']}")
print(f"Match: {is_correct_prediction == example['is_correct']}")
print("\n#### CORRECT ANSWER ####")
print(example["is_correct"])

print(response)

==((====))==  Unsloth 2025.10.9: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.617 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
#### QUESTION ####
Evaluate: $\frac{10^{-2}5^0}{10^{-3}}$

#### SOLUTION ####
Let's use sympy to get the result of the division.
<llm-code>
from sympy import Symbol, Integer

M = Symbol('M')

expression = (M**(-2) * 5**0) / (M**(-3))
expression.simplify()
</llm-code>
<llm-code-output>
M
</llm-code-output>
So $10^{-2}5^0/10^{-3} = \boxed{M}$.
#### MODEL'S PREDICTION ####
False<|end_of_text|>

Parsed as: False
Ground truth: False
Match: True

#### CORRECT ANSWER ####
False
["<|begin_of_text|>Instruction: Determine whether the student

### We used a majority-voting technique. Each model is trained on a different subset of the data.
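
A compact sketch of the voting rule itself, separate from model loading: each checkpoint contributes one boolean vote per example and the majority wins, with a configurable fallback for ties. `majority_vote` is a hypothetical helper and the votes are made up for illustration.

```python
from collections import Counter


def majority_vote(votes: list[bool], tie_breaker: bool = False) -> bool:
    """Return the most common vote; fall back to tie_breaker on a tie
    or when there are no votes."""
    counts = Counter(votes)
    if counts[True] == counts[False]:
        return tie_breaker
    return counts[True] > counts[False]


print(majority_vote([True, False, True, True, False]))  # True
print(majority_vote([True, False], tie_breaker=False))  # False
```

With an odd number of checkpoints, such as the five used here, a tie cannot occur, so the tie-breaker only matters if the ensemble size changes.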

In [None]:
import gc
import torch

# The trailing comments show each model's accuracy on the validation set
model_checkpoint_list = [
    "trained_models/model2", #0.90
    "trained_models/model3", #0.91
    "trained_models/model4", #0.90
    "trained_models/model5", #0.91
    "trained_models/model6", #0.96
]

template_map = {
    "trained_models/model2": TEMPLATES1,
    "trained_models/model3": TEMPLATES1,
    "trained_models/model4": TEMPLATES2,
    "trained_models/model5": TEMPLATES3,
    "trained_models/model6": TEMPLATES3,
}

import math
from collections import Counter
from tqdm import tqdm

def parse_output(response_text):
    # Take the text after the last "IS CORRECT:\n" marker
    output_part = response_text.split("IS CORRECT:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    if 'true' in output_part.lower():
        return True
    return False
    
# --- Configurable tie-breaker: "False" (conservative) or "True" ---
TIE_BREAKER = False

# Prepare storage
n_examples = len(validation_dataset)
# votes[i] will be a list of booleans (one per checkpoint) for example i
votes = [[] for _ in range(n_examples)]
labels = [bool(ex["is_correct"]) for ex in validation_dataset]  # collect once

# Loop checkpoints and collect one vote per example
for model_checkpoint in model_checkpoint_list:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = model_checkpoint,
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    model.to("cuda")
    model.eval()

    template_set = template_map[model_checkpoint]

    # iterate with index so we can place vote into votes[i]
    for i, example in enumerate(tqdm(validation_dataset, desc=f"Validating {model_checkpoint}")):
        q = example["question"]
        s = example["solution"]
        a = example["answer"]

        template = template_set[hash(q) % len(template_set)]
        inference_prompt = template.split("IS CORRECT:\n{y}")[0] + "IS CORRECT:\n"

        prompt = inference_prompt.format(q=q, s=s, a=a)
        inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

        # Generate (use deterministic sampling for checkpoint voting)
        outputs = model.generate(**inputs, max_new_tokens=8, do_sample=False)
        response_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

        vote = parse_output(response_text)        # True / False boolean
        votes[i].append(bool(vote))

    del model
    del tokenizer
    torch.cuda.empty_cache()
    gc.collect()

# Now compute majority vote per example
majority_preds = []
for i, vote_list in enumerate(votes):
    if len(vote_list) == 0:
        # no votes -> abstain, choose tie-breaker to be safe
        majority = TIE_BREAKER
    else:
        cnt = Counter(vote_list)
        # cnt[True], cnt[False]
        if cnt[True] > cnt[False]:
            majority = True
        elif cnt[True] < cnt[False]:
            majority = False
        else:
            # tie -> use tie-breaker
            majority = TIE_BREAKER
    majority_preds.append(majority)

import numpy as np
# Compute metrics
labels_arr = np.array(labels, dtype=bool)
preds_arr  = np.array(majority_preds, dtype=bool)
accuracy = (labels_arr == preds_arr).mean()

tp = int(((preds_arr == True)  & (labels_arr == True)).sum())
tn = int(((preds_arr == False) & (labels_arr == False)).sum())
fp = int(((preds_arr == True)  & (labels_arr == False)).sum())
fn = int(((preds_arr == False) & (labels_arr == True)).sum())

print("\nValidation results (majority voting)")
print(f"Accuracy: {accuracy:.4f}")
print("Confusion Matrix (TN, FP, FN, TP):")
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")


==((====))==  Unsloth 2025.10.9: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.617 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Validating outputs6/checkpoint-22500: 100%|██████████| 10000/10000 [17:06<00:00,  9.74it/s]



Validation results (majority voting)
Accuracy: 0.9636
Confusion Matrix (TN, FP, FN, TP):
TN=5771, FP=200, FN=164, TP=3865
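
The confusion-matrix counts above also determine precision and recall for the `True` class, which accuracy alone does not show. A short sketch of that arithmetic using the reported counts (the variable names are illustrative):

```python
tn, fp, fn, tp = 5771, 200, 164, 3865  # counts reported above

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted True, how many are correct
recall = tp / (tp + fn)      # of actual True, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way reproduces the reported 0.9636 over 10,000 validation examples.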


model1 = 0.91
model2 (curriculum) = 0.89
model3 (lora16) = 0.90
model4 (stronger prompt) = 0.93
model5 (stronger prompt2) = 0.92

## **Step 7: Generate Submission File**

This is the final step! We will now run our fine-tuned model on the official `test` dataset.

We will loop through each example in the test set, generate a prediction, and format the results into a CSV file with two columns: `ID` and `is_correct`, as required by the competition.
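
For reference, the expected file layout can be sketched with the standard library alone. The predictions below are made up, and the `ID` values are assumed to be the row positions in the test set, as in the cell that follows.

```python
import csv
import io

toy_predictions = [True, False, True]  # made-up votes for three test rows

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ID", "is_correct"])
for row_id, pred in enumerate(toy_predictions):
    writer.writerow([row_id, pred])

csv_text = buffer.getvalue()
print(csv_text)
```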


In [20]:
import pandas as pd
from tqdm import tqdm
import gc
import torch

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

model_checkpoint_list = [
    "trained_models/model2", #0.90
    "trained_models/model3", #0.91
    "trained_models/model4", #0.90
    "trained_models/model5", #0.91
    "trained_models/model6",
]

template_map = {
    "trained_models/model2": TEMPLATES1,
    "trained_models/model3": TEMPLATES1,
    "trained_models/model4": TEMPLATES2,
    "trained_models/model5": TEMPLATES3,
    "trained_models/model6": TEMPLATES3,
}

import math
from collections import Counter
from tqdm import tqdm

def parse_output(response_text):
    # Take the text after the last "IS CORRECT:\n" marker
    output_part = response_text.split("IS CORRECT:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    if 'true' in output_part.lower():
        return True
    return False
    
# --- Configurable tie-breaker: "False" (conservative) or "True" ---
TIE_BREAKER = True

# Prepare storage
n_examples = len(test_dataset)
# votes[i] will be a list of booleans (one per checkpoint) for example i
votes = [[] for _ in range(n_examples)]
labels = [bool(ex["is_correct"]) for ex in test_dataset]  # collect once

# Loop checkpoints and collect one vote per example
for model_checkpoint in model_checkpoint_list:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = model_checkpoint,
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    model.to("cuda")
    model.eval()

    template_set = template_map[model_checkpoint]

    # iterate with index so we can place vote into votes[i]
    for i, example in enumerate(tqdm(test_dataset, desc=f"Validating {model_checkpoint}")):
        q = example["question"]
        s = example["solution"]
        a = example["answer"]

        template = template_set[hash(q) % len(template_set)]
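        # Drop the label placeholder so the prompt ends right after "IS CORRECT:\n".
        # (Assign to inference_prompt; otherwise a stale prompt from an earlier cell is used.)
        inference_prompt = template.split("IS CORRECT:\n{y}")[0] + "IS CORRECT:\n"

        prompt = inference_prompt.format(q=q, s=s, a=a)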
        inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

        # Generate (use deterministic sampling for checkpoint voting)
        outputs = model.generate(**inputs, max_new_tokens=8, do_sample=False)
        response_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

        vote = parse_output(response_text)        # True / False boolean
        votes[i].append(bool(vote))

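    # Release this checkpoint before loading the next one: drop the Python
    # references first, then return the cached GPU memory to the driver
    del model
    del tokenizer
    gc.collect()
    torch.cuda.empty_cache()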

# Now compute majority vote per example
majority_preds = []
for i, vote_list in enumerate(votes):
    if len(vote_list) == 0:
        # no votes -> abstain, choose tie-breaker to be safe
        majority = TIE_BREAKER
    else:
        cnt = Counter(vote_list)
        # cnt[True], cnt[False]
        if cnt[True] > cnt[False]:
            majority = True
        elif cnt[True] < cnt[False]:
            majority = False
        else:
            # tie -> use tie-breaker
            majority = TIE_BREAKER
    majority_preds.append(majority)

# Create the submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(majority_preds)),
    'is_correct': majority_preds
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created successfully!")
print("You can now download this file and submit it to the Kaggle competition.")

==((====))==  Unsloth 2025.10.9: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.617 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Validating trained_models/model2: 100%|██████████| 10000/10000 [16:22<00:00, 10.18it/s]


Validating trained_models/model3: 100%|██████████| 10000/10000 [16:20<00:00, 10.20it/s]


Validating trained_models/model4: 100%|██████████| 10000/10000 [15:35<00:00, 10.68it/s]


Validating trained_models/model5: 100%|██████████| 10000/10000 [16:15<00:00, 10.25it/s]


Validating trained_models/model6: 100%|██████████| 10000/10000 [16:17<00:00, 10.23it/s]



Submission file 'submission.csv' created successfully!
You can now download this file and submit it to the Kaggle competition.
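
A caveat on the template selection in the cell above: Python salts the built-in `hash()` of strings per process unless `PYTHONHASHSEED` is set, so `hash(q) % len(template_set)` may pick a different template for the same question on another run. The sketch below shows one process-independent alternative, assuming a CRC32 of the UTF-8 question text is an acceptable selector. `pick_template` and `demo_templates` are hypothetical names used only for illustration; they are not part of the notebook.

```python
import zlib

def pick_template(question, templates):
    """Pick a template deterministically from the question text (hypothetical helper)."""
    # crc32 depends only on the input bytes, so the index is the same in every process
    idx = zlib.crc32(question.encode("utf-8")) % len(templates)
    return templates[idx]

# Placeholder templates for illustration only
demo_templates = ["template A", "template B", "template C"]
choice = pick_template("What is 2 + 2?", demo_templates)
print(choice in demo_templates)
```

Switching selectors only makes sense if the same rule is applied consistently wherever templates are chosen, since it changes which template a given question receives.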

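
The voting logic can also be factored into a small function, which makes it easy to check the ensemble on a split that has trusted labels before submitting. This is a minimal sketch, not the notebook's own code: `majority_vote`, `ensemble_accuracy`, and the demo data are hypothetical, and it assumes `votes` has the same list-of-boolean-lists shape as in the cell above.

```python
from collections import Counter

def majority_vote(vote_list, tie_breaker=True):
    """Return the majority boolean, falling back to tie_breaker on a tie or no votes."""
    cnt = Counter(vote_list)
    if cnt[True] > cnt[False]:
        return True
    if cnt[True] < cnt[False]:
        return False
    return tie_breaker

def ensemble_accuracy(votes, labels, tie_breaker=True):
    """Fraction of examples where the majority vote matches the label."""
    preds = [majority_vote(v, tie_breaker) for v in votes]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# Toy data for illustration: three examples, the last one is a 1-1 tie
demo_votes = [[True, True, False], [False, False, True], [True, False]]
demo_labels = [True, False, False]
acc_true = ensemble_accuracy(demo_votes, demo_labels, tie_breaker=True)
acc_false = ensemble_accuracy(demo_votes, demo_labels, tie_breaker=False)
print(acc_true, acc_false)
```

With the five checkpoints used in the run above, every example receives an odd number of votes, so the tie-breaker only comes into play if the checkpoint list is changed to an even length or an example ends up with no votes.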