

 <h1>
The Math Question Answer Verification Competition! 🚀

The goal is to fine-tune a Llama-3-8B model to predict if a given solution to a math problem is correct or not. Your model should output True if the solution is correct, and False otherwise.


## **Step 1: Install Necessary Libraries**

First, we need to install the required Python libraries. We'll be using the unsloth library, which provides efficient, memory-saving training methods for large language models, making it possible to fine-tune powerful models on a single free-tier GPU. The command below installs Unsloth directly from its GitHub repository, together with its dependencies.


In [1]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-q0rd1cfh/unsloth_e13f940ce8614cb4b3a9ed9e6457fa29
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-q0rd1cfh/unsloth_e13f940ce8614cb4b3a9ed9e6457fa29
  Resolved https://github.com/unslothai/unsloth.git to commit 874b262b5da1e38160312e1b5689a7c01303a51e
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting unsloth_zoo>=2025.10.11 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2025.10.12-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.

## **Step 2: Load the Model and Tokenizer**

Next, we'll load the Llama-3-8B model (here the `unsloth/Meta-Llama-3.1-8B` checkpoint), which is the only model permitted for this competition. We'll use Unsloth's `FastLanguageModel` to handle this efficiently.

A key technique we'll use is 4-bit quantization (load_in_4bit = True). Think of this as compressing the model's knowledge into a much smaller file size. This significantly reduces the amount of GPU memory required, allowing us to fine-tune this large model even on a free platform like Google Colab.
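
To make the memory saving concrete, here is a rough back-of-the-envelope sketch. It assumes exactly 8 billion parameters and ignores the small per-block scaling constants that 4-bit formats store, so the figures are estimates rather than measurements; `NUM_PARAMS` and `weight_gib` are illustrative names, not part of Unsloth.

```python
# Rough weight-memory estimate for an 8B-parameter model.
# Assumption: exactly 8e9 parameters; real checkpoints differ slightly, and
# 4-bit formats also keep small per-block scaling constants (ignored here).
NUM_PARAMS = 8_000_000_000

def weight_gib(num_params: int, bits_per_param: float) -> float:
    """Return the storage needed for the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

bf16_gib = weight_gib(NUM_PARAMS, 16)   # 16-bit (bfloat16) weights
int4_gib = weight_gib(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: ~{bf16_gib:.1f} GiB")
print(f" 4-bit weights: ~{int4_gib:.1f} GiB")
```

Under these assumptions the weights shrink by a factor of four. Activations, LoRA gradients, and optimizer state still need additional memory on top of this.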



In [None]:
from unsloth import FastLanguageModel
import torch

#  Configuration 
max_seq_length = 2048          # Context length (inputs + solution + answer + label)
dtype = torch.bfloat16         # L4 supports bfloat16 natively (good perf / low memory)
load_in_4bit = True            # 4-bit quantization to fit 8B on Colab

#  Load base Llama3-8B model + tokenizer 
# NOTE:
# - We load the Unsloth-hosted Llama 3.1 8B base weights
#   (the model name below is the base checkpoint, not the Instruct variant).
# - We are NOT starting from any previously fine-tuned LoRA checkpoint; this is the base model.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B", # Competition-approved model
    max_seq_length  = max_seq_length,
    dtype           = dtype,
    load_in_4bit    = load_in_4bit,
)

#  Put model in train mode (enables grad, LoRA attach later, etc.) 
FastLanguageModel.for_training(model)

print("Loaded model and tokenizer for training on L4 (bf16, 4-bit).")


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.9: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



Loaded model and tokenizer for training on L4 (bf16, 4-bit).


## **Step 3: Prepare the Dataset**

This is a crucial step where we format our data into a structure the model can learn from. The process involves three parts:

1.  **Loading**: We'll load the official competition dataset (`ad6398/nyu-dl-teach-maths-comp`) from Hugging Face.
2.  **Splitting**: The full training split has 1,000,000 examples. We hold out a class-balanced validation set of **5,000 samples** and, to keep training time manageable, build a class-balanced training set capped at **30,000 samples per class** (60,000 in total).
3.  **Prompting**: We will format each data sample into a clear instructional prompt, truncating very long solutions to their first 800 and last 256 tokens. This helps the model understand its role as a mathematician verifying a solution.
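
The splitting step can be sketched on toy labels. This is a minimal illustration using only the standard library, not the notebook's implementation: `balanced_split` and `toy_labels` are hypothetical names, and drawing an equal number of examples per class is one reasonable way to build a held-out set.

```python
import random

def balanced_split(labels, val_size, seed=42):
    """Return (train_idx, val_idx) with val_size // 2 examples of each class in val.

    Raises ValueError if either class has fewer than val_size // 2 examples.
    """
    rng = random.Random(seed)
    true_idx = [i for i, y in enumerate(labels) if y]
    false_idx = [i for i, y in enumerate(labels) if not y]
    half = val_size // 2
    val_idx = sorted(rng.sample(true_idx, half) + rng.sample(false_idx, half))
    val_set = set(val_idx)
    train_idx = [i for i in range(len(labels)) if i not in val_set]
    return train_idx, val_idx

# Toy labels (hypothetical): 40 correct and 60 incorrect solutions
toy_labels = [True] * 40 + [False] * 60
train_idx, val_idx = balanced_split(toy_labels, val_size=20)

print("train:", len(train_idx), "| val:", len(val_idx))
print("True labels in val:", sum(toy_labels[i] for i in val_idx))
```

Because the validation indices are removed from the training pool, no example can appear in both sets, which keeps the validation accuracy an honest estimate.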



In [None]:
# Step 3 (Hardened): Data loading, stratified split, balance, formatting

from datasets import load_dataset, Dataset
import numpy as np, random
import math

SEED = 42
random.seed(SEED); np.random.seed(SEED)

#  0) Safety: tokenizer must exist & have pad_token
try:
    _ = tokenizer.encode("ok", add_special_tokens=False)
except NameError as e:
    raise RuntimeError("Tokenizer is not defined. Run the model/tokenizer cell first.") from e

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
if tokenizer.padding_side != "right":
    tokenizer.padding_side = "right"  # better with causal LMs + LoRA SFT

print(f"[Tokenizer] eos_token={tokenizer.eos_token!r} | pad_token={tokenizer.pad_token!r} | padding_side={tokenizer.padding_side}")

#  1) Load ONLY the official train split
# trust_remote_code=False to avoid executing custom code from HF repo
full_train = load_dataset(
    "ad6398/nyu-dl-teach-maths-comp",
    split="train",
    trust_remote_code=False,
)

#  2) STRATIFIED train/val split (deterministic 50/50 ratio for VAL)
VAL_SIZE = 5000  # can tune if you want a larger validation
y = np.array([bool(v) for v in full_train["is_correct"]], dtype=bool)
idx_all = np.arange(len(full_train))

true_idx  = idx_all[y]
false_idx = idx_all[~y]

rng = np.random.RandomState(SEED)

# we try to sample VAL_SIZE/2 positives and VAL_SIZE/2 negatives (or as close as possible)
half_val = VAL_SIZE // 2
val_true_sel  = rng.choice(true_idx,  size=min(half_val, len(true_idx)),  replace=False)
val_false_sel = rng.choice(false_idx, size=min(half_val, len(false_idx)), replace=False)

val_idx = np.concatenate([val_true_sel, val_false_sel])
# if we are still short because one class was too small, top up from the other
if len(val_idx) < VAL_SIZE:
    needed = VAL_SIZE - len(val_idx)
    # pick remaining from whichever class still has room
    remain_pool = np.setdiff1d(idx_all, val_idx, assume_unique=False)
    # deterministic top-up
    extra_idx = rng.choice(remain_pool, size=needed, replace=False)
    val_idx = np.concatenate([val_idx, extra_idx])

# final dedupe + sort for reproducibility
val_idx = np.unique(val_idx)
train_idx = np.setdiff1d(idx_all, val_idx, assume_unique=False)

raw_train = full_train.select(train_idx.tolist())
raw_val   = full_train.select(val_idx.tolist())

#  3) Balance TRAIN and cap per class for Colab budget
ys_train = np.array([bool(v) for v in raw_train["is_correct"]], dtype=bool)
idx_true_train  = np.where(ys_train)[0].tolist()
idx_false_train = np.where(~ys_train)[0].tolist()

n_min = min(len(idx_true_train), len(idx_false_train))

# You can raise CAP_PER_CLASS if VRAM/time allows (e.g. 30_000+ each for better accuracy).
# L4 + LoRA 4-bit can usually handle ~40k-60k total samples if you tune batch size / grad_accum.
CAP_PER_CLASS = min(30_000, n_min)

idx_true_train  = idx_true_train[:CAP_PER_CLASS]
idx_false_train = idx_false_train[:CAP_PER_CLASS]

balanced_local_idx = idx_true_train + idx_false_train
# Deterministic shuffle for reproducibility
rng_perm = np.random.RandomState(SEED)
rng_perm.shuffle(balanced_local_idx)

train_balanced = raw_train.select(balanced_local_idx)

#  4) Smart truncation for long solutions (keep head+tail)
# keep more head reasoning to help correctness judgment but still stay under context
HEAD_TOK, TAIL_TOK = 800, 256
EOS = tokenizer.eos_token

def smart_clip(text: str) -> str:
    if text is None:
        return ""
    s = str(text)
    toks = tokenizer.encode(s, add_special_tokens=False)
    if len(toks) <= (HEAD_TOK + TAIL_TOK):
        return s
    head = tokenizer.decode(toks[:HEAD_TOK], skip_special_tokens=True)
    tail = tokenizer.decode(toks[-TAIL_TOK:], skip_special_tokens=True)
    return head + "\n...\n" + tail

#  5) Concise instruction templates
# NOTE: we REMOVE randomness in template choice.
# Stochastic prompting during supervised fine-tuning can inject label noise.
TEMPLATES = [
    (
        "You are a rigorous mathematician. Determine if the provided solution correctly "
        "solves the problem. Reply strictly with 'True' if it is fully correct, "
        "otherwise reply 'False'.\n\n"
        "Question:\n{q}\n\nSolution:\n{s}\n\nAnswer:\n{y}"
    )
]

def lbl(v):
    return "True" if bool(v) else "False"

def format_batch(batch):
    qs, ss, ys = batch["question"], batch["solution"], batch["is_correct"]
    out = []
    for q, s, yv in zip(qs, ss, ys):
        q_str = "" if q is None else str(q)
        s_str = smart_clip("" if s is None else str(s))
        tmpl = TEMPLATES[0]  # deterministic
        out.append(tmpl.format(q=q_str, s=s_str, y=lbl(yv)) + EOS)
    return {"text": out}

# We keep originals intact; only drop columns in the formatted copies.
formatted_train = train_balanced.map(
    format_batch,
    batched=True,
    desc="Formatting balanced train",
).remove_columns([c for c in train_balanced.column_names if c != "text"])

formatted_val = raw_val.map(
    format_batch,
    batched=True,
    desc="Formatting validation",
).remove_columns([c for c in raw_val.column_names if c != "text"])

#  6) Reports
def class_report(ds, name):
    if "is_correct" in ds.column_names:
        yv = [bool(v) for v in ds["is_correct"]]
        t = sum(yv)
        f = len(yv) - t
        ratio = (t / len(yv)) if len(yv) else 0.0
        print(f"{name}: {len(yv)} samples | True={t} False={f} | True ratio={ratio:.3f}")
    else:
        print(f"{name}: no labels column")

print("=== Data summary ===")
class_report(raw_train, "Raw Train (pre-balance)")
class_report(train_balanced, "Train Balanced (used for SFT)")
class_report(raw_val, "Validation (held-out)")

print(f"Formatted sizes → train={len(formatted_train)} | val={len(formatted_val)}")

print("Sample formatted example:\n", formatted_train[0]["text"][:400])


[Tokenizer] eos_token='<|end_of_text|>' | pad_token='<|finetune_right_pad_id|>' | padding_side=right


README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00002.parquet:   0%|          | 0.00/195M [00:00<?, ?B/s]

data/train-00001-of-00002.parquet:   0%|          | 0.00/195M [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/3.65M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Formatting balanced train:   0%|          | 0/60000 [00:00<?, ? examples/s]

Formatting validation:   0%|          | 0/5000 [00:00<?, ? examples/s]

=== Data summary ===
Raw Train (pre-balance): 995000 samples | True=397500 False=597500 | True ratio=0.399
Train Balanced (used for SFT): 60000 samples | True=30000 False=30000 | True ratio=0.500
Validation (held-out): 5000 samples | True=2500 False=2500 | True ratio=0.500
Formatted sizes → train=60000 | val=5000
Sample formatted example:
 You are a rigorous mathematician. Determine if the provided solution correctly solves the problem. Reply strictly with 'True' if it is fully correct, otherwise reply 'False'.

Question:
How many of the positive divisors of 3240 are multiples of 3?

Solution:
We can use sympy to get all divisors and count the ones that are multiples of 3.
<llm-code>
from sympy import divisors

all_divisors = diviso


## **Step 4: Configure LoRA and Set Up the Trainer**

### **LoRA Configuration**

Instead of training the entire model (which has billions of parameters), we'll use a technique called **Lo**w-**R**ank **A**daptation (LoRA). 🎛️

Think of it like this: rather than rewriting an entire textbook, we're just adding small, efficient "sticky notes" (the LoRA adapters) to update the model's knowledge. This is much faster and requires significantly less memory. We'll use a **rank** of `r = 32` with `lora_alpha = 64`, which gives the adapters more capacity than a minimal setting at the cost of more trainable parameters.
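
To see how small the "sticky notes" are, the sketch below counts LoRA parameters for the seven projections adapted in the configuration cell. For a weight matrix with `d_in` inputs and `d_out` outputs, LoRA adds two low-rank matrices with `r * (d_in + d_out)` parameters in total. The layer shapes are assumptions based on the published Llama 3 8B architecture, and `lora_param_count` is an illustrative helper, not a library function.

```python
# Assumed Llama 3 8B shapes: hidden size 4096, MLP size 14336, 32 layers,
# and key/value projections of width 1024 (grouped-query attention).
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) of each adapted projection in one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for every adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS

for rank in (8, 32):
    print(f"r={rank}: {lora_param_count(rank):,} trainable parameters")
```

With `r = 32` this gives 83,886,080 parameters, which agrees with the trainable-parameter count printed in the training log (about 1% of the full model).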


In [None]:
# Step 4: Configure LoRA (High-capacity variant for accuracy)

# Assumes `model` and `tokenizer` are already loaded and prepped for training.

from unsloth import FastLanguageModel
import torch

#  LoRA configuration 
# Higher-capacity LoRA:
# r=32 / lora_alpha=64 gives the adapter more expressive power.
# We keep a small dropout to reduce overfitting on the balanced training set.
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,                              # LoRA rank (higher capacity)
    lora_alpha = 64,                     # typically ~2 * r
    lora_dropout = 0.05,                 # keep regularization to avoid overfit spikes
    bias = "none",
    target_modules = [
        "q_proj","k_proj","v_proj","o_proj",   # attention projections
        "gate_proj","up_proj","down_proj",     # MLP projections
    ],
    use_gradient_checkpointing = "unsloth",    # keeps memory in check
    random_state = 42,
)

#  Padding / special tokens safety
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "right"

model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id

print("[After fix] eos_token =", tokenizer.eos_token,
      "| pad_token =", tokenizer.pad_token,
      "| pad_id =", tokenizer.pad_token_id,
      "| eos_id =", tokenizer.eos_token_id)

#  Label token sanity check
true_ids  = tokenizer.encode("True",  add_special_tokens=False)
false_ids = tokenizer.encode("False", add_special_tokens=False)

print(f'"True" -> {true_ids} | "False" -> {false_ids}')

if len(true_ids) != 1 or len(false_ids) != 1:
    print("Warning: 'True' and/or 'False' are not single tokens. "
          "We'll handle comparison by string prefix during evaluation.")

print("LoRA configured (r=32, α=64, dropout=0.05) with gradient checkpointing.")


Unsloth: Already have LoRA adapters! We shall skip this step.


[After fix] eos_token = <|end_of_text|> | pad_token = <|end_of_text|> | pad_id = 128001 | eos_id = 128001
"True" -> [2575] | "False" -> [4139]
LoRA configured (r=32, α=64, dropout=0.05) with gradient checkpointing.



### **SFTTrainer Setup**

Now we'll set up the `SFTTrainer` (Supervised Fine-tuning Trainer). This is the main tool from the `trl` library that will handle the entire training loop for us. We'll give it our model, tokenizer, dataset, and a set of training instructions, such as the batch size and number of epochs.
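
One of those training instructions is the learning-rate schedule. The trainer cell uses `warmup_ratio = 0.05`, `max_steps = 3800`, a cosine scheduler, and a peak learning rate of `1e-4`. The sketch below approximates such a schedule (linear warmup followed by cosine decay) in plain Python; it is an assumption-level illustration, `lr_at` is a hypothetical helper, and the exact rounding inside the training library may differ.

```python
import math

def lr_at(step, base_lr=1e-4, max_steps=3800, warmup_ratio=0.05):
    """Approximate learning rate at a given optimizer step."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 100, 190, 1000, 2000, 3000, 3800):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

The warmup avoids large updates while the adapter weights are still near their initial values, and the decay lowers the step size as training approaches `max_steps`.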



## **Step 5: Start Training!**

Now, we'll call the `train()` function on our `trainer` object. This will kick off the fine-tuning process.

Grab a coffee, as this will take a few hours! ☕


In [7]:
# Steps 4-5 (Version-safe): SFTTrainer configuration and training

import math, torch
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

BF16_OK = is_bfloat16_supported()
print(f"bfloat16 supported: {BF16_OK}")

per_device_bs = 2
grad_accum    = 8
effective_bs  = per_device_bs * grad_accum

approx_steps_per_epoch = math.ceil(len(formatted_train) / effective_bs)

# IMPORTANT:
# - If CAP_PER_CLASS = 20_000 (≈40k train samples total), steps/epoch ≈ 2500.
#   We'll train ~3000 steps for ~1.2 epochs.
#
# - If CAP_PER_CLASS = 30_000 (≈60k train samples total), steps/epoch ≈ 3750.
#   We'll train ~3800 steps for ~1.0 epoch.
#
# We'll pick max_steps dynamically based on actual dataset size so we converge
# without severely undertraining or wildly overfitting.
if approx_steps_per_epoch <= 3000:
    target_max_steps = 3000   # ~1.2 epochs on ~40k samples
else:
    target_max_steps = 3800   # ~1 epoch on ~60k samples

print(f"Train size: {len(formatted_train)} | Val size: {len(formatted_val)}")
print(f"Effective batch: {effective_bs} | ~steps/epoch: {approx_steps_per_epoch} | max_steps: {target_max_steps}")


#  Common kwargs for both modern and legacy TrainingArguments
common_kwargs = dict(
    per_device_train_batch_size = per_device_bs,
    gradient_accumulation_steps = grad_accum,
    warmup_ratio                = 0.05,
    max_steps                   = target_max_steps,
    learning_rate               = 1e-4,          # safer LR for LoRA r=32
    weight_decay                = 0.01,
    lr_scheduler_type           = "cosine",
    fp16                        = (not BF16_OK),
    bf16                        = BF16_OK,
    logging_steps               = 20,
    save_strategy               = "steps",
    save_steps                  = 200,
    optim                       = "adamw_8bit",    # will use bitsandbytes if available
    max_grad_norm               = 0.3,
    seed                        = 42,
    output_dir                  = "outputs",
    report_to                   = "none",
    remove_unused_columns       = False,
)

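# Try to enable in-training evaluation; if this transformers version rejects
# these keyword arguments (TypeError), fall back to training without evaluation.
# Note: recent transformers releases name the first argument `eval_strategy`.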
try:
    train_args = TrainingArguments(
        **common_kwargs,
        evaluation_strategy      = "steps",
        eval_steps               = 250,
        load_best_model_at_end   = True,
        metric_for_best_model    = "eval_loss",
        greater_is_better        = False,
    )
    use_eval = True
    print("Using TrainingArguments with evaluation_strategy='steps'.")
except TypeError:
    train_args = TrainingArguments(**common_kwargs)
    use_eval = False
    print("Legacy TrainingArguments detected -> proceeding WITHOUT in-train evaluation.")

trainer = SFTTrainer(
    model               = model,
    tokenizer           = tokenizer,
    train_dataset       = formatted_train,
    eval_dataset        = (formatted_val if use_eval else None),
    dataset_text_field  = "text",
    max_seq_length      = max_seq_length,
    packing             = False,
    args                = train_args,
)

trainer.train()

# If we trained without eval (legacy fallback), at least save the last checkpoint.
if not use_eval:
    trainer.save_model("outputs-last")
    tokenizer.save_pretrained("outputs-last")
    print("Saved last checkpoint to outputs-last (no eval available).")
else:
    # Trainer will have already restored best checkpoint by eval_loss
    trainer.save_model("outputs-best")
    tokenizer.save_pretrained("outputs-best")
    print("✅ Training finished: best checkpoint (by eval_loss) loaded and saved to outputs-best.")


bfloat16 supported: True
Train size: 60000 | Val size: 5000
Effective batch: 16 | ~steps/epoch: 3750 | max_steps: 3800
Legacy TrainingArguments detected -> proceeding WITHOUT in-train evaluation.


Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/60000 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 128001}.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 60,000 | Num Epochs = 2 | Total steps = 3,800
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 83,886,080 of 8,114,147,328 (1.03% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
20,0.5905
40,0.5508
60,0.5808
80,0.5993
100,0.5655
120,0.5785
140,0.605
160,0.5654
180,0.575
200,0.5763


Saved last checkpoint to outputs-last (no eval available).



## **Step 6: Inference and Evaluation**




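Now that our model is trained, we need to test it on our validation set. We'll use a slightly different prompt for inference, one where we leave the `Answer:` section blank for the model to complete.

Rather than generating text, we compare the model's log-probabilities for the tokens `True` and `False` at that position, pick the decision threshold that maximizes validation accuracy, and then inspect a single example from our validation set to see what the model predicts.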
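
The scoring idea in the evaluation cell can be sketched without a model. Given next-token logits for the blank label slot, the decision score is `log p(True) - log p(False)`. The toy vocabulary, logit values, and token positions below are made up for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

# Hypothetical 4-token vocabulary; positions 1 and 2 stand in for
# the "True" and "False" tokens of the real tokenizer.
toy_logits = [0.3, 2.0, 3.5, -1.0]
TRUE_ID, FALSE_ID = 1, 2

logp = log_softmax(toy_logits)
margin = logp[TRUE_ID] - logp[FALSE_ID]   # > 0 means the model leans "True"

print(f"log p(True)  = {logp[TRUE_ID]:.4f}")
print(f"log p(False) = {logp[FALSE_ID]:.4f}")
print(f"margin       = {margin:.4f}")
```

Because both log-probabilities share the same normalizer, the margin equals the raw logit difference (here `2.0 - 3.5`). Computing the full log-softmax is only needed if you also want the individual log-probabilities.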

In [None]:
# Step 6: Inference and Evaluation

# We will:
# 1. Reload the fine-tuned model from checkpoint.
# 2. Compute log p("True") vs log p("False") for each validation example.
# 3. Find the best threshold and report validation accuracy.
# 4. Show one qualitative example.

import torch, torch.nn.functional as F
import numpy as np
from tqdm import tqdm
from unsloth import FastLanguageModel, is_bfloat16_supported

# 0. Reload the trained model from checkpoint

# NOTE:
# - We saved "outputs-last" as our fine-tuned LoRA checkpoint.

BF16_OK = is_bfloat16_supported()
reload_dtype = torch.bfloat16 if BF16_OK else torch.float16

max_seq_length = 2048  # match training config

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name        = "outputs-last",   # directory saved in Step 5
    max_seq_length    = max_seq_length,
    dtype             = reload_dtype,
    load_in_4bit      = True,
    trust_remote_code = True,
)

# safety: pad/eos consistency (must match what we enforced earlier)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "right"
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id

# put model in eval/inference mode
FastLanguageModel.for_inference(model)
model.eval()
device = next(model.parameters()).device
print(f"Model reloaded on device: {device}")


# 1. Validation data reference

# We assume `raw_val` from Step 3 is still in memory (the held-out, UNFORMATTED split
# that still has question / solution / is_correct).
# If it's not in memory in a fresh runtime, you must rebuild raw_val the same way as Step 3.
validation_dataset = raw_val
print(f"Validation examples: {len(validation_dataset)}")


# 2. Prompt builder for inference (must match training style)

# We trained with a deterministic instruction template that ends with the gold label.
# For inference, we STOP right before the label and let the model choose.
# We'll clip the solution text exactly like training.

HEAD_TOK, TAIL_TOK = 800, 256  # must match Step 3 final values

def smart_clip(text: str) -> str:
    if text is None:
        return ""
    s = str(text)
    toks = tokenizer.encode(s, add_special_tokens=False)
    if len(toks) <= (HEAD_TOK + TAIL_TOK):
        return s
    head = tokenizer.decode(toks[:HEAD_TOK], skip_special_tokens=True)
    tail = tokenizer.decode(toks[-TAIL_TOK:], skip_special_tokens=True)
    return head + "\n...\n" + tail

INFERENCE_TEMPLATE = (
    "You are a rigorous mathematician. Determine if the provided solution correctly "
    "solves the problem. Reply strictly with 'True' if it is fully correct, "
    "otherwise reply 'False'.\n\n"
    "Question:\n{q}\n\n"
    "Solution:\n{s}\n\n"
    "Answer:\n"
)

def build_prompt(question: str, solution: str) -> str:
    q_str = "" if question is None else str(question)
    s_str = smart_clip("" if solution is None else str(solution))
    return INFERENCE_TEMPLATE.format(q=q_str, s=s_str)


# 3. Score a single prompt via next-token logprobs
# We'll take the logits for the NEXT token after the prompt, then compare
# log p("True") vs log p("False").
# NOTE: We've already verified earlier that "True" and "False"
# each map to a single token ID in this tokenizer.

true_token_id  = tokenizer.encode("True",  add_special_tokens=False)[0]
false_token_id = tokenizer.encode("False", add_special_tokens=False)[0]

@torch.inference_mode()
def score_example(prompt: str):
    enc = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_seq_length,
    ).to(device)

    out = model(**enc)
    # logits shape: [1, seq_len, vocab]
    last_logits = out.logits[0, -1]          # distribution for the NEXT token
    logprobs = F.log_softmax(last_logits, dim=-1)

    logp_true  = logprobs[true_token_id].item()
    logp_false = logprobs[false_token_id].item()

    margin = logp_true - logp_false          # >0 => model leans "True"
    return margin, logp_true, logp_false


# 4. Score entire validation set, tune threshold, compute accuracy

margins = []
golds = []

for ex in tqdm(validation_dataset, desc="Scoring validation set"):
    q = ex["question"]
    s = ex["solution"]
    y = bool(ex["is_correct"])  # gold label
    prompt = build_prompt(q, s)
    margin, _, _ = score_example(prompt)
    margins.append(margin)
    golds.append(y)

margins = np.array(margins, dtype=np.float32)
golds   = np.array(golds,   dtype=bool)

# sweep candidate thresholds across the distribution of margins
candidate_thresholds = np.quantile(margins, np.linspace(0.02, 0.98, 25))

best_acc = -1.0
best_th  = 0.0

for th in candidate_thresholds:
    preds = margins >= th     # predict True if margin >= threshold
    acc = (preds == golds).mean()
    if acc > best_acc:
        best_acc = acc
        best_th = th

final_preds = margins >= best_th
final_acc = (final_preds == golds).mean()

true_rate_pred  = final_preds.mean()
true_rate_gold  = golds.mean()

print("=======================================")
print(f"Validation size: {len(golds)}")
print(f"Best threshold: {best_th:.6f}")
print(f"Validation accuracy: {final_acc*100:.2f}%")
print(f"Model 'True' rate after threshold: {true_rate_pred*100:.2f}%")
print(f"Ground truth 'True' rate: {true_rate_gold*100:.2f}%")
print("=======================================")

# 5. Inspect one qualitative example

idx = 10  # change to inspect other samples
ex = validation_dataset[idx]
q = ex["question"]
s = ex["solution"]
gold = bool(ex["is_correct"])

prompt = build_prompt(q, s)
margin, lp_t, lp_f = score_example(prompt)
pred_bool = (margin >= best_th)

print("\n#### QUESTION ####")
print(q)
print("\n#### SOLUTION ####")
print(s)
print("\n#### SCORES ####")
print(f"log p('True')  = {lp_t:.6f}")
print(f"log p('False') = {lp_f:.6f}")
print(f"margin(True-False) = {margin:.6f}")
print(f"threshold = {best_th:.6f}")
print("\n#### MODEL PREDICTION ####")
print(pred_bool)
print("\n#### CORRECT ANSWER ####")
print(gold)


Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.10.10: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
unsloth/meta-llama-3.1-8b-unsloth-bnb-4bit does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.
Model reloaded on device: cuda:0
Validation examples: 5000


Scoring validation set: 100%|██████████| 5000/5000 [12:37<00:00,  6.60it/s]


Validation size: 5000
Best threshold: 0.375000
Validation accuracy: 85.82%
Model 'True' rate after threshold: 47.22%
Ground truth 'True' rate: 50.00%

#### QUESTION ####
A five-digit integer will be chosen at random from all possible positive five-digit integers. What is the probability that the number's units digit will be less than 5? Express your answer as a common fraction.

#### SOLUTION ####
The probability that the units digit will be less than $5$ is equal to the ratio of numbers whose units digit is less than $5$ to the total number of $5$-digit numbers. 
There are $9$ possible units digits less than $5$.
There are $89999$ numbers less than $100000$ whose units digit is less than $5$.
So the answer is $\boxed{\frac{89999}{99999}}$.

#### SCORES ####
log p('True')  = -2.125000
log p('False') = -0.126953
margin(True-False) = -1.998047
threshold = 0.375000

#### MODEL PREDICTION ####
False

#### CORRECT ANSWER ####
False
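
The threshold sweep used in this step can be sketched on toy data. The margins and gold labels below are invented, and `best_threshold` is an illustrative helper that mirrors the idea of the evaluation loop (predict `True` when the margin is at least the threshold) rather than reproducing it exactly.

```python
def best_threshold(margins, golds, candidates):
    """Return (threshold, accuracy) maximizing accuracy over the candidates."""
    best_acc, best_th = -1.0, None
    for th in candidates:
        preds = [m >= th for m in margins]
        acc = sum(p == g for p, g in zip(preds, golds)) / len(golds)
        if acc > best_acc:          # keep the first threshold that improves accuracy
            best_acc, best_th = acc, th
    return best_th, best_acc

# Toy margins (log p(True) - log p(False)) with hypothetical gold labels
toy_margins = [-2.0, -0.5, 0.1, 0.4, 1.2, 2.5]
toy_golds = [False, False, False, True, True, True]

th, acc = best_threshold(toy_margins, toy_golds, candidates=sorted(toy_margins))
print(f"best threshold = {th}, accuracy = {acc:.2f}")
```

One caveat: when the threshold is chosen and the accuracy is measured on the same examples, the reported accuracy tends to be slightly optimistic. Tuning on one half of the validation set and scoring on the other half would give a cleaner estimate.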


## **Step 7: Generate Submission File**

This is the final step! We will now run our fine-tuned model on the official `test` dataset.

We will loop through each example in the test set, generate a prediction, and format the results into a CSV file with two columns: `ID` and `is_correct`, as required by the competition.
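
The submission layout can be sketched with the standard library alone. This is an illustration of the two-column format described above; the predictions are made up, `build_submission_csv` is a hypothetical helper, and the notebook itself uses pandas for this step.

```python
import csv
import io

def build_submission_csv(predictions):
    """Serialize predictions as CSV text with columns ID and is_correct."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["ID", "is_correct"])
    for row_id, pred in enumerate(predictions):
        # ID is the row index in the test set; the label is a plain bool
        writer.writerow([row_id, bool(pred)])
    return buf.getvalue()

toy_predictions = [False, True, True]   # hypothetical model outputs
csv_text = build_submission_csv(toy_predictions)
print(csv_text)
```

Casting each prediction with `bool(...)` keeps the written values as `True`/`False` even if the predictions arrive as NumPy booleans.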


In [10]:
# Step 7: Generate Submission File (Test Inference → CSV)

import torch
import torch.nn.functional as F
import numpy as np
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset

# 0. Safety / setup
# -----------------------------------------------------

# Ensure we're in eval mode and on the right device
model.eval()
device = next(model.parameters()).device
print(f"Inference device: {device}")

# We MUST reuse the same inference template
# that we used for validation thresholding in Step 6.
# Keep wording identical so the logits are calibrated.
INFERENCE_TEMPLATE = (
    "You are a rigorous mathematician. Determine if the provided solution correctly "
    "solves the problem. Reply strictly with 'True' if it is fully correct, "
    "otherwise reply 'False'.\n\n"
    "Question:\n{q}\n\n"
    "Solution:\n{s}\n\n"
    "Answer:\n"
)

# Use the same clip lengths and max_seq_length that were used
# for training/inference so distribution matches validation.
HEAD_TOK, TAIL_TOK = 800, 256  # must match Step 6
# max_seq_length should already be defined earlier (2048). We rely on that here.

def smart_clip(text: str) -> str:
    if text is None:
        return ""
    s = str(text)
    toks = tokenizer.encode(s, add_special_tokens=False)
    if len(toks) <= (HEAD_TOK + TAIL_TOK):
        return s
    head = tokenizer.decode(toks[:HEAD_TOK], skip_special_tokens=True)
    tail = tokenizer.decode(toks[-TAIL_TOK:], skip_special_tokens=True)
    return head + "\n...\n" + tail

def build_prompt(question: str, solution: str) -> str:
    q_str = "" if question is None else str(question)
    s_str = smart_clip("" if solution is None else str(solution))
    return INFERENCE_TEMPLATE.format(q=q_str, s=s_str)

# Reuse the same label tokenization as Step 6.
true_token_id  = tokenizer.encode("True",  add_special_tokens=False)[0]
false_token_id = tokenizer.encode("False", add_special_tokens=False)[0]

@torch.inference_mode()
def score_margin(prompt: str) -> float:
    """
    Compute margin = log p(True) - log p(False)
    at the *next token* after the prompt.
    Higher margin => lean 'True'.
    """
    enc = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_seq_length,
    ).to(device)

    out = model(**enc)
    logits = out.logits[0, -1]          # logits for NEXT token
    logprobs = F.log_softmax(logits, dim=-1)

    logp_true  = logprobs[true_token_id].item()
    logp_false = logprobs[false_token_id].item()
    return logp_true - logp_false       # decision score

# 1. Load the official test split
# -----------------------------------------------------

test_dataset = load_dataset(
    "ad6398/nyu-dl-teach-maths-comp",
    split="test",
    trust_remote_code=False,
)
print(f"Test examples: {len(test_dataset)}")

# 2. Predict each test example with tuned threshold
# -----------------------------------------------------
# VERY IMPORTANT:
# We use the SAME best_th found on validation in Step 6.
# That threshold aligns model logits to True/False balance.
# Do NOT recompute here (data leakage); just reuse best_th.

predictions = []
for ex in tqdm(test_dataset, desc="Scoring test"):
    q = ex["question"]
    s = ex["solution"]
    prompt = build_prompt(q, s)
    margin = score_margin(prompt)
    pred_bool = (margin >= best_th)
    predictions.append(pred_bool)

# 3. Build submission DataFrame
# -----------------------------------------------------
# Kaggle expects:
#   ID          -> row index in test set
#   is_correct  -> boolean (True/False)
#
# We map row index deterministically using enumerate to be explicit.

submission = pd.DataFrame({
    "ID": [i for i, _ in enumerate(predictions)],
    "is_correct": predictions,
})

# 4. Save to CSV
# -----------------------------------------------------

submission.to_csv("submission.csv", index=False)
print("submission.csv saved.")
print(submission.head())


Inference device: cuda:0
Test examples: 10000


Scoring test: 100%|██████████| 10000/10000 [25:09<00:00,  6.62it/s]

submission.csv saved.
   ID  is_correct
0   0       False
1   1       False
2   2        True
3   3        True
4   4       False
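
The decision rule used in `score_margin` can be illustrated without the model. The following is a minimal sketch on toy logits (the 4-token vocabulary, the token ids and `toy_threshold` are all made up for illustration and are not the notebook's values). It shows that the log-softmax normaliser cancels in `log p(True) - log p(False)`, so the margin equals the raw logit gap between the two label tokens, and that the prediction is simply that margin compared against a threshold.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a plain list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def margin(logits, true_id, false_id):
    """log p(True) - log p(False) for one next-token distribution."""
    lp = log_softmax(logits)
    return lp[true_id] - lp[false_id]

# Toy 4-token vocabulary; ids 1 and 2 stand in for "True" and "False".
toy_logits = [0.3, 2.0, 1.2, -0.5]
m = margin(toy_logits, true_id=1, false_id=2)

# The normaliser cancels, so the margin is just the raw logit gap.
assert abs(m - (toy_logits[1] - toy_logits[2])) < 1e-9

toy_threshold = 0.5  # hypothetical stand-in for best_th
prediction = m >= toy_threshold
print(round(m, 3), prediction)
```

Because only the gap between the two label logits matters, the threshold effectively shifts the point at which the model's preference for "True" over "False" counts as a positive prediction.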





# SAVE THE MODEL TO DRIVE AND RUN INFERENCE
This part saves the model checkpoint to Google Drive, reloads the model from that checkpoint, and generates the final submission CSV file.

## Mount google drive

### Subtask:
Mount Google Drive to save the model checkpoint.


**Reasoning**:
Files on the Colab runtime's local disk are lost when the session ends, so Drive is mounted first to give the checkpoint a persistent location.



In [11]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Save model checkpoint

### Subtask:
Save the trained model checkpoint to the specified path in Google Drive.


**Reasoning**:
Define the save path and save the model and tokenizer to Google Drive.



In [12]:
import os

# Define the path to save the model checkpoint in Google Drive
save_path = "/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint"

# Create the directory if it doesn't exist
os.makedirs(save_path, exist_ok=True)

# Save the model and tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model checkpoint and tokenizer saved to: {save_path}")

Model checkpoint and tokenizer saved to: /content/drive/MyDrive/llama3_8b_math_verifier_checkpoint
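
Before relying on the checkpoint in a later session, it can help to confirm what was actually written. The following is a minimal sketch using only the standard library; `list_checkpoint_files` is a hypothetical helper, not part of Unsloth or Transformers, and the demo runs on a throwaway directory so that it works outside Colab.

```python
import os
import tempfile

def list_checkpoint_files(path):
    """Return (relative_name, size_in_bytes) for every file under path, sorted by name."""
    entries = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            entries.append((os.path.relpath(full, path), os.path.getsize(full)))
    return sorted(entries)

# Demo on a temporary directory; in the notebook you would pass save_path instead.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "example.bin"), "wb") as fh:
        fh.write(b"\x00" * 16)
    listing = list_checkpoint_files(demo_dir)
    for name, size in listing:
        print(f"{name}: {size} bytes")
```

In the notebook, `list_checkpoint_files(save_path)` would show the files and sizes that `save_pretrained` produced. An empty listing would indicate that nothing was written to Drive.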


## Load model from checkpoint

### Subtask:
Load the model from the saved checkpoint.


**Reasoning**:
Load the model and tokenizer from the saved checkpoint path in Google Drive and prepare the model for inference.



In [4]:
from unsloth import FastLanguageModel

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint",
    max_seq_length = max_seq_length,
    dtype          = dtype,
    load_in_4bit   = load_in_4bit,
)

FastLanguageModel.for_inference(model)

print("✅ Model and tokenizer successfully loaded from checkpoint.")


==((====))==  Unsloth 2025.10.10: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/235 [00:00<?, ?B/s]

unsloth/meta-llama-3.1-8b-unsloth-bnb-4bit does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.


Unsloth 2025.10.10 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


✅ Model and tokenizer successfully loaded from checkpoint.


## Generate submission file

### Subtask:
Generate the submission CSV file using the loaded model.


**Reasoning**:
Iterate through the test dataset, generate a prediction for each example with the loaded model, collect the results in a pandas DataFrame, and write that DataFrame to `submission.csv`.



In [None]:
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

# Create the prompt template for inference (no answer included)
inference_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
"""

# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Keep only the text after the final "Output:" marker
    output_part = response_text.split("Output:\n")[-1].lower()
    # Use whichever label appears first; default to False if neither appears
    t, f = output_part.find("true"), output_part.find("false")
    return t != -1 and (f == -1 or t < f)

# Loop through the test dataset and generate a prediction for each example
for example in tqdm(test_dataset):
    question = example["question"]
    solution = example["solution"]

    # Format the prompt
    prompt = inference_prompt.format(question, str(solution))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate the prediction
    outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
    response_text = tokenizer.batch_decode(outputs)[0]

    # Parse the prediction and add it to our list
    prediction = parse_output(response_text)
    predictions.append(prediction)

# Create the submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created successfully!")
print("You can now download this file and submit it to the Kaggle competition.")

100%|██████████| 10000/10000 [1:33:05<00:00,  1.79it/s]


Submission file 'submission.csv' created successfully!
You can now download this file and submit it to the Kaggle competition.
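
A quick sanity check on the CSV before uploading can catch a wrong header, a missing row, or a non-boolean label. The following is a minimal sketch using only the standard library; `check_submission` is a hypothetical helper, and the expected column names and the 0..n-1 ID order are taken from the DataFrame built above rather than from any official Kaggle specification. The demo validates a tiny hand-written file.

```python
import csv
import os
import tempfile

def check_submission(path, expected_rows):
    """Validate header, row count, ID order and labels; return the share of True rows."""
    with open(path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    assert rows, "file has no data rows"
    assert list(rows[0].keys()) == ["ID", "is_correct"], "unexpected columns"
    assert len(rows) == expected_rows, "row count mismatch"
    assert [int(r["ID"]) for r in rows] == list(range(expected_rows)), "IDs not 0..n-1"
    assert all(r["is_correct"] in ("True", "False") for r in rows), "non-boolean label"
    return sum(r["is_correct"] == "True" for r in rows) / expected_rows

# Demo on a tiny hand-written file with two False and two True rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_submission.csv")
    with open(demo_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["ID", "is_correct"])
        writer.writerows([[0, False], [1, False], [2, True], [3, True]])
    true_share = check_submission(demo_path, expected_rows=4)
print(true_share)
```

In the notebook this would be called as `check_submission("submission.csv", expected_rows=10000)`, matching the 10000 test examples reported above. The returned share of `True` rows is a cheap way to notice a degenerate run that predicts almost entirely one label.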



