

# Welcome to the Math Question Answer Verification Competition! 🚀

The goal is to fine-tune a Llama-3-8B model to predict if a given solution to a math problem is correct or not. Your model should output True if the solution is correct, and False otherwise.

This notebook is a starter guide designed to get you up and running quickly. We'll walk through a simplified training process using a subset of the data (30,000 training examples and 3,000 validation examples) and LoRA fine-tuning on a 4-bit model. The main goal here is to understand the complete workflow, from loading data to generating a submission file, not to achieve a top score.

Good luck, and have fun! 🎉

## **Step 1: Install Necessary Libraries**

First, we need to install the required Python libraries. We'll be using the `unsloth` library, which provides memory-efficient training methods for large language models, making it possible to fine-tune powerful models on a single free-tier GPU. The commented-out line lists optional version pins for xformers, trl, peft, accelerate, bitsandbytes, and transformers in case you need them.


In [3]:
# %%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
#!pip install --no-deps "xformers<0.0.26" "trl<0.9.0" "peft<0.12.0" "accelerate<0.32.0" "bitsandbytes<0.44.0" "transformers<4.43.0"

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-ejm180tc/unsloth_dd5779072e9c4b2b843ea58da214aa12
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-ejm180tc/unsloth_dd5779072e9c4b2b843ea58da214aa12
  Resolved https://github.com/unslothai/unsloth.git to commit bcde35854b4840c9d8a6f6649b60cf25d9ffaeaf
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting unsloth_zoo>=2025.11.1 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2025.11.1-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.gi

In [4]:
# Clean install in Colab
%pip uninstall -y unsloth unsloth_zoo trl
%pip install --upgrade --force-reinstall --no-cache-dir "trl>=0.14,<0.17" unsloth unsloth_zoo

# IMPORTANT: After installs, Colab needs a restart for imports to use the new wheels
import sys
print("Installed. Python:", sys.version)
print("Please now go to: Runtime -> Restart runtime, then re-run the next cell.")

Found existing installation: unsloth 2025.11.1
Uninstalling unsloth-2025.11.1:
  Successfully uninstalled unsloth-2025.11.1
Found existing installation: unsloth_zoo 2025.11.1
Uninstalling unsloth_zoo-2025.11.1:
  Successfully uninstalled unsloth_zoo-2025.11.1
Found existing installation: trl 0.24.0
Uninstalling trl-0.24.0:
  Successfully uninstalled trl-0.24.0
Collecting trl<0.17,>=0.14
  Downloading trl-0.16.1-py3-none-any.whl.metadata (12 kB)
Collecting unsloth
  Downloading unsloth-2025.11.1-py3-none-any.whl.metadata (61 kB)
Collecting unsloth_zoo
  Downloading unsloth_zoo-2025.11.1-py3-none-any.whl.metadata (32 kB)
Collecting accelerate>=0.34.0 (from trl<0.17,>=0.14)
  Downloading accelerate-1.11.0-py3-none-any.whl.metadata (19 kB)
Collecting datasets>=3.0.0 (from trl<0.17,>=0.14)
  Downloading datasets-4.3.0-py3-none-any.whl.metadata (18 kB)
Collecting rich

Installed. Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Please now go to: Runtime -> Restart runtime, then re-run the next cell.


In [None]:
import sys
print("Python:", sys.version)

import unsloth, transformers, trl
print("unsloth =", getattr(unsloth, "__version__", "?"))
print("transformers =", transformers.__version__)
print("trl =", trl.__version__)

# Optional: confirm the RL replacement map includes the needed key
import importlib
rl = importlib.import_module("unsloth.models.rl")
print("has align_logprobs_with_mask:", "align_logprobs_with_mask" in rl.RL_REPLACEMENTS)

Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!
unsloth = 2025.10.8
transformers = 4.56.2
trl = 0.16.1
has align_logprobs_with_mask: False


## **Step 2: Load the Model and Tokenizer**

Next, we'll load the Llama 3 8B base model (the cell below uses the `unsloth/Meta-Llama-3.1-8B` checkpoint), which is the only model permitted for this competition. We'll use Unsloth's `FastLanguageModel` to handle this efficiently.

A key technique we'll use is 4-bit quantization (`load_in_4bit = True`). It stores each weight in 4 bits instead of 16, which shrinks the weights to roughly a quarter of their half-precision size. This significantly reduces the amount of GPU memory required, allowing us to fine-tune this large model even on a free platform like Google Colab.
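
As a rough back-of-the-envelope illustration of why 4-bit loading matters, the sketch below estimates the memory needed just to hold the weights. The 8-billion parameter count is a round-number assumption, and the helper `weight_memory_gib` is hypothetical. The estimate ignores quantization constants, activations, optimizer state, and the LoRA adapters, so real usage will be higher.

```python
# Rough estimate of weight memory at different precisions.
# ASSUMPTION: ~8e9 parameters (round number); all overheads are ignored.
NUM_PARAMS = 8_000_000_000


def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Return the memory needed to store the weights, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


for label, bits in [("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights need a quarter of the half-precision footprint, which is what leaves room for activations and adapters on a smaller GPU.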



In [1]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Choose any sequence length
dtype = None  # This will auto-detect the best data type for your GPU
load_in_4bit = True  # Use 4-bit quantization to save memory

# Load the model and tokenizer from Hugging Face
# Note: We use the base model, not a 4-bit pre-quantized one,
# to ensure we start from the official weights.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B", # Competition-approved model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## **Step 3: Prepare the Dataset**

This is a crucial step where we format our data into a structure the model can learn from. The process involves three parts:

1.  **Loading**: We'll load the official competition dataset from Hugging Face.
2.  **Splitting**: The full dataset is massive. For this starter notebook, we'll create a smaller, more manageable version to speed things up: **30,000 samples for training** and **3,000 for validation**.
3.  **Prompting**: We will format each data sample into a clear instructional prompt. This helps the model understand its role as a mathematician verifying a solution.
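
The splitting step boils down to shuffling once with a fixed seed and then taking two non-overlapping slices. Here is a plain-Python sketch of that idea. It is only an illustration: the notebook itself relies on `Dataset.shuffle` and `Dataset.select` from the `datasets` library, whose shuffle order will differ, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_total: int, n_train: int, n_val: int, seed: int = 42):
    """Shuffle indices once, then carve out disjoint train/validation slices."""
    if n_train + n_val > n_total:
        raise ValueError("Requested more samples than the dataset contains.")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    return train_idx, val_idx


# Toy sizes for illustration only (hypothetical, not the competition data).
train_idx, val_idx = split_indices(n_total=100, n_train=30, n_val=5)
print(len(train_idx), len(val_idx))
print("overlap:", set(train_idx) & set(val_idx))
```

Because the validation slice starts where the training slice ends, no example can land in both sets, which keeps the validation accuracy honest.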



In [2]:
from datasets import load_dataset

# Load the full training dataset
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")

# Shuffle the dataset for randomness and create our smaller splits
shuffled_dataset = full_dataset.shuffle(seed=42)
train_dataset = shuffled_dataset.select(range(30000))      # Use the first 30,000 for training
validation_dataset = shuffled_dataset.select(range(30000, 33000)) # Use the next 3,000 for validation

In [27]:
import re, unicodedata

# -------- Cleaning Function --------
def clean_text(x: str) -> str:
    if not isinstance(x, str):
        return ""
    s = x
    s = re.sub(r"</?llm-code>", " ", s)          # Remove the <llm-code> tags (keeping their contents)
    s = re.sub(r"```.*?```", " ", s, flags=re.S) # Remove Markdown code blocks (```...```), contents included
    # Map common math/Unicode symbols to ASCII equivalents
    for k, v in {"½":"1/2","⅓":"1/3","¼":"1/4","¾":"3/4","⅔":"2/3","π":"pi","√":"sqrt"}.items():
        s = s.replace(k, v)
    s = unicodedata.normalize("NFKC", s)         # Fold remaining compatibility characters
    s = re.sub(r"[ \t]+", " ", s)                # Collapse runs of spaces and tabs
    return s.strip()

def _process_row(ex):
    ex["question"] = clean_text(ex.get("question", ""))
    ex["solution"] = clean_text(ex.get("solution", ""))
    return ex

# -------- Clean only on the sampled subset --------
cleaned_train_dataset = train_dataset.map(_process_row)
cleaned_validation_dataset = validation_dataset.map(_process_row)

# -------- Quickly compare a sample --------
def _peek(orig, cleaned, idx=0):
    print("—— Before(question) ——\n", orig[idx]["question"][:200])
    print("\n—— After (question) ——\n", cleaned[idx]["question"][:200])
    print("\n—— Before(solution) ——\n", orig[idx]["solution"][:200])
    print("\n—— After (solution) ——\n", cleaned[idx]["solution"][:200])

print(f"[INFO] cleaned_train_dataset: {len(cleaned_train_dataset)}, cleaned_validation_dataset: {len(cleaned_validation_dataset)}")
_peek(train_dataset, cleaned_train_dataset, 0)

# Downstream training: use cleaned_train_dataset / cleaned_validation_dataset in place of train_dataset / validation_dataset.

Map:   0%|          | 0/30000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

[INFO] cleaned_train_dataset: 30000, cleaned_validation_dataset: 3000
—— Before(question) ——
 A line is parameterized by
\[\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \end{pmatrix} + t \begin{pmatrix} -1 \\ 5 \end{pmatrix}.\]A second line is parameterized by
\[\begin{pmatrix}

—— After (question) ——
 A line is parameterized by
\[\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \end{pmatrix} + t \begin{pmatrix} -1 \\ 5 \end{pmatrix}.\]A second line is parameterized by
\[\begin{pmatrix}

—— Before(solution) ——
 First, we need to solve the system of equations
\[
\begin{aligned}
2 - t &= s\\
3 + 5t &= 7 + 4s
\end{aligned}
\]
by eliminating s.
We'll use sympy.
<llm-code>
from sympy import symbols, solve

# defi

—— After (solution) ——
 First, we need to solve the system of equations
\[
\begin{aligned}
2 - t &= s\\
3 + 5t &= 7 + 4s
\end{aligned}
\]
by eliminating s.
We'll use sympy.
 
from sympy import symbols, solve

# define the va


In [1]:
# The instructional prompt template for training
training_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
{}"""

# We must add an End Of Sequence (EOS) token to tell the model when a completion is finished.
EOS_TOKEN = tokenizer.eos_token

# This function formats our data samples into the prompt template.
def formatting_prompts_func(examples):
    questions = examples["question"]
    solutions = examples["solution"]
    outputs = examples["is_correct"]
    texts = []
    for question, solution, output in zip(questions, solutions, outputs):
        # Format the prompt and add the EOS token
        text = training_prompt.format(question, str(solution), str(output)) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

# Apply the formatting function to our training dataset
formatted_train_dataset = cleaned_train_dataset.map(formatting_prompts_func, batched=True)
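
To make the result of the formatting step concrete, the sketch below fills a shortened stand-in for the template with a toy example. The template text, the sample question, and the helper `format_example` are hypothetical. The EOS string is an assumption based on the `<|end_of_text|>` marker that shows up in this notebook's decoded output; in the real cell it comes from `tokenizer.eos_token`.

```python
# Shortened stand-in for the notebook's template (same three slots).
template = "Question:\n{}\nSolution:\n{}\nOutput:\n{}"
# ASSUMPTION: EOS string; the notebook reads it from tokenizer.eos_token.
EOS_TOKEN = "<|end_of_text|>"


def format_example(question: str, solution: str, is_correct: bool) -> str:
    """Fill the template and append EOS so the model learns where to stop."""
    return template.format(question, solution, str(is_correct)) + EOS_TOKEN


text = format_example("What is 2 + 2?", "2 + 2 = 4, so the answer is 4.", True)
print(text)
```

The label is written as the literal string `True` or `False` right after `Output:`, so at inference time the model only has to continue the prompt with one of those two words.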

## **Step 4: Configure LoRA and Set Up the Trainer**

### **LoRA Configuration**

Instead of training the entire model (which has billions of parameters), we'll use a technique called **Lo**w-**R**ank **A**daptation (LoRA). 🎛️

Think of it like this: rather than rewriting an entire textbook, we're just adding small, efficient "sticky notes" (the LoRA adapters) to update the model's knowledge. This is much faster and requires significantly less memory. We'll use rank `r = 32`, which keeps the trainable adapter weights at about 1% of the model's parameters.
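
The size of those "sticky notes" can be worked out by hand. For a linear layer with `in` inputs and `out` outputs, a rank-`r` adapter adds two matrices with `r * (in + out)` weights in total. The sketch below applies that to the seven projection types targeted in this notebook. The layer shapes are taken from the published Llama 3 8B configuration and should be treated as assumptions if you swap models; `lora_param_count` is a hypothetical helper.

```python
# Llama 3 8B layer shapes (hidden size 4096, 8 KV heads x 128 dims = 1024,
# MLP intermediate size 14336, 32 decoder layers). ASSUMPTION: adjust for other models.
HIDDEN, KV_DIM, INTERMEDIATE, NUM_LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) of each projection that receives an adapter.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Each adapter adds A (rank x in) and B (out x rank) matrices."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


print(lora_param_count(32))  # 83886080
print(lora_param_count(8))   # 20971520
```

With rank 32 this gives 83,886,080 trainable weights, the same figure Unsloth reports in this notebook's training log. The count scales linearly with the rank, so rank 8 would train a quarter as many.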


In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # LoRA rank; higher values add adapter capacity and memory use
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 64, # A common practice is to set alpha = 2 * r
    lora_dropout = 0.02,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.02.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.10.8 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.



### **SFTTrainer Setup**

Now we'll set up the `SFTTrainer` (Supervised Fine-tuning Trainer). This is the main tool from the `trl` library that will handle the entire training loop for us. We'll give it our model, tokenizer, dataset, and a set of training instructions, such as the batch size and number of epochs.

We will train for **two epochs** (two passes over our 30,000-sample training set), saving a checkpoint every 100 steps.

In [32]:
# ✅ Step 1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# ✅ Step 2: Save to Drive
OUTPUT_DIR = "/content/drive/MyDrive/llama3_math_training_ckpt"
LOG_DIR = f"{OUTPUT_DIR}/logs"

import os
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(LOG_DIR, exist_ok=True)

Mounted at /content/drive


In [33]:
from trl import SFTTrainer
from transformers import TrainingArguments

training_args_stage1 = TrainingArguments(
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 8,
    warmup_steps = 20,
    learning_rate = 7e-5,
    lr_scheduler_type = "cosine",
    num_train_epochs = 2,
    bf16 = torch.cuda.is_bf16_supported(),
    logging_steps = 20,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    seed = 42,
    output_dir = OUTPUT_DIR,
    report_to = "none",
    save_strategy = "steps",
    save_steps = 100,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_train_dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = training_args_stage1,
)


## **Step 5: Start Training!**

Now, we'll call the `train()` function on our `trainer` object. This will kick off the fine-tuning process. Based on our settings, this will run for two full epochs over our 30,000 examples (938 optimizer steps at an effective batch size of 64).

Grab a coffee, as this takes a while: the run recorded below took about 2.8 hours on an A100! ☕
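
The number of optimizer steps follows from the values in the training arguments and the dataset split: 30,000 examples, a per-device batch size of 8, 8 gradient accumulation steps, and 2 epochs. The sketch below is one way to estimate it; `total_optimizer_steps` is a hypothetical helper, and the exact rounding of the last partial step can depend on the `transformers` version.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate optimizer steps: micro-batches per epoch, grouped by accumulation."""
    micro_batches = math.ceil(num_examples / (per_device_batch * num_gpus))
    steps_per_epoch = math.ceil(micro_batches / grad_accum)
    return steps_per_epoch * epochs


steps = total_optimizer_steps(num_examples=30_000, per_device_batch=8, grad_accum=8, epochs=2)
print(steps)  # 938
```

This matches the `Total steps = 938` line in the training banner, and it also explains why the last checkpoint is `checkpoint-938` rather than a multiple of 100.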


In [34]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 30,000 | Num Epochs = 2 | Total steps = 938
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 8 x 1) = 64
 "-____-"     Trainable parameters = 83,886,080 of 8,114,147,328 (1.03% trained)


Step,Training Loss
20,0.6997
40,0.6991
60,0.6844
80,0.695
100,0.6764
120,0.6722
140,0.6749
160,0.6793
180,0.6736
200,0.6495


TrainOutput(global_step=938, training_loss=0.5954048587823473, metrics={'train_runtime': 10068.3447, 'train_samples_per_second': 5.959, 'train_steps_per_second': 0.093, 'total_flos': 1.3931322649191383e+18, 'train_loss': 0.5954048587823473})


## **Step 6: Inference and Evaluation**

Now that our model is trained, we need to test it on our validation set. We'll use a slightly different prompt for inference—one where we leave the `Output:` section blank for the model to complete.

Let's test it on a single example from our validation set to see what it predicts.
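
Since the decoded output contains the prompt followed by the completion, the prediction has to be cut out of it. The sketch below shows one way to do that on a toy string. `extract_prediction` is a hypothetical helper, and the EOS string is an assumption based on the `<|end_of_text|>` marker seen in this notebook's output.

```python
# ASSUMPTION: EOS string as it appears in this notebook's decoded output.
EOS_TOKEN = "<|end_of_text|>"


def extract_prediction(decoded: str, marker: str = "Output:\n") -> str:
    """Return only the text generated after the last `marker`, without EOS."""
    completion = decoded.split(marker)[-1]
    return completion.replace(EOS_TOKEN, "").strip()


decoded = "Question:\n...\nSolution:\n...\nOutput:\nTrue<|end_of_text|>"
print(extract_prediction(decoded))  # True
```

Splitting on the last occurrence of the marker keeps the helper safe even if the question or solution text happens to contain the word "Output:" itself.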

In [6]:
# Prepare the model for faster inference
FastLanguageModel.for_inference(model)

# Create the prompt template for inference (no answer included)
inference_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
"""

# Select a sample from the validation set
example = validation_dataset[10] # You can change the index (e.g., to 1, 2, 50)
question = example["question"]
solution = example["solution"]

# Format the prompt with the validation data
inputs = tokenizer(
[
    inference_prompt.format(question, str(solution))
], return_tensors = "pt").to("cuda")

# Generate the model's response
outputs = model.generate(**inputs, max_new_tokens = 8, use_cache = True)
response = tokenizer.batch_decode(outputs)

# Print the results
print("#### QUESTION ####")
print(question)
print("\n#### SOLUTION ####")
print(solution)
print("\n#### MODEL'S PREDICTION ####")
# We process the output to show only the generated text
print(response[0].split("Output:\n")[1])
print("\n#### CORRECT ANSWER ####")
print(example["is_correct"])

#### QUESTION ####
Al, Betty, and Clare split $\$1000$ among them to be invested in different ways. Each begins with a different amount. At the end of one year they have a total of $\$1500$. Betty and Clare have both doubled their money, whereas Al has managed to lose $\$100$. What was Al's original portion?

#### SOLUTION ####
Let's write down the equations based on the given information using sympy and then solve them.
<llm-code>
from sympy import symbols, Eq, solve

# Al, Betty, Clare split 1000$ among them to be invested in different ways
# Each of them begin with a different amount
a, b, c = symbols('a b c')

# Equations:
equations = [
    Eq(a + b + c, 1000),
    Eq(a - 100 + 2 * (b + c), 1500)
]

# solve the equations
solutions = solve(equations, (a, b, c))
solutions[a]
</llm-code>
<llm-code-output>
400
</llm-code-output>
So Al's original portion is \boxed{400}.

#### MODEL'S PREDICTION ####
True<|end_of_text|>

#### CORRECT ANSWER ####
True


## **Step 6.1: Calculate Validation Accuracy**

Now that we have trained the model, let's evaluate its performance on the validation dataset to get an idea of how well it generalizes. We will calculate the accuracy of the model's predictions against the true labels in the validation set.
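
Stripped of batching and GPU details, the accuracy calculation is just label parsing plus counting matches. Here is a minimal sketch on toy data; `to_label` and `accuracy` are hypothetical helpers that mirror the "contains `true`" rule used in this notebook.

```python
def to_label(generated_text: str) -> bool:
    """Map free-form generated text to a boolean, defaulting to False."""
    return "true" in generated_text.lower()


def accuracy(generated_texts, true_labels) -> float:
    """Percentage of predictions that match the ground-truth booleans."""
    if not true_labels:
        raise ValueError("No labels to score.")
    correct = sum(to_label(t) == y for t, y in zip(generated_texts, true_labels))
    return 100.0 * correct / len(true_labels)


# Toy data (hypothetical): three of the four predictions match their labels.
print(accuracy(["True", " False", "true.", "maybe"], [True, False, False, False]))  # 75.0
```

Note that anything the rule cannot recognize as `True` is counted as `False`, so a model that produces garbage still scores whatever share of the labels is `False`.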

In [35]:
import glob
import os
import pandas as pd
import torch
from datasets import load_dataset
from tqdm.auto import tqdm
import numpy as np
from unsloth import FastLanguageModel

def run_optimized_validation(model, tokenizer, validation_dataset, inference_prompt, max_seq_length):
    """Greedy-decode every validation prompt in length-sorted batches and return accuracy (%)."""
    print("Running optimized validation...")
    model.eval().to("cuda")
    tokenizer.padding_side = "left"  # left padding keeps the generated tokens at the end of each row
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.set_float32_matmul_precision("high")
    USE_BF16 = torch.cuda.is_bf16_supported()
    MAX_INPUT = max_seq_length
    MAX_NEW = 8
    BATCH = 16
    BUCKET_SZ = 128

    def build_prompt(ex):
        return inference_prompt.format(ex["question"], str(ex["solution"]))

    prompts, labels = [], []
    for ex in validation_dataset:
        prompts.append(build_prompt(ex))
        labels.append(str(ex["is_correct"]).strip().lower())

    # Sort by token length so each batch pads to a similar length
    lengths = [len(tokenizer(p, add_special_tokens=True)["input_ids"]) for p in tqdm(prompts, desc="Tokenizing lengths")]
    order = sorted(range(len(prompts)), key=lambda i: lengths[i])
    buckets = [order[i:i+BUCKET_SZ] for i in range(0, len(order), BUCKET_SZ)]

    correct = 0
    total = 0
    amp_dtype = torch.bfloat16 if USE_BF16 else torch.float16
    # torch.autocast replaces the deprecated torch.cuda.amp.autocast
    autocast_ctx = torch.autocast(device_type="cuda", dtype=amp_dtype)
    with torch.no_grad(), autocast_ctx:
        for bucket in tqdm(buckets, total=len(buckets), desc="Evaluating Batches"):
            for s in range(0, len(bucket), BATCH):
                idxs = bucket[s:s+BATCH]
                batch_prompts = [prompts[i] for i in idxs]
                batch_labels = [labels[i] for i in idxs]
                enc = tokenizer(batch_prompts, return_tensors="pt", padding=True, truncation=True, max_length=MAX_INPUT).to("cuda", non_blocking=True)
                out = model.generate(**enc, max_new_tokens=MAX_NEW, do_sample=False, use_cache=True)
                # Keep only the newly generated tokens (everything after the padded prompt)
                in_len = enc["input_ids"].shape[1]
                gen = out[:, in_len:]
                texts = tokenizer.batch_decode(gen, skip_special_tokens=True)
                for pred_text, true_label in zip(texts, batch_labels):
                    pred_label = "true" if "true" in pred_text.lower() else "false"
                    if pred_label == true_label:
                        correct += 1
                total += len(batch_labels)
    accuracy = correct / total * 100
    print(f"✅ Validation Accuracy: {accuracy:.2f}% ({correct}/{total})")
    return accuracy


In [36]:
output_dir_stage1 = OUTPUT_DIR
checkpoint_dirs = sorted(
    glob.glob(f"{output_dir_stage1}/checkpoint-*"),
    key=lambda x: int(x.split('-')[-1])
)
print(f"Found {len(checkpoint_dirs)} checkpoints: {checkpoint_dirs}")

results = {}
for checkpoint_dir in checkpoint_dirs:
    print(f"\n--- Evaluating {checkpoint_dir} ---")
    model_checkpoint, tokenizer_checkpoint = FastLanguageModel.from_pretrained(
        model_name = checkpoint_dir, max_seq_length = max_seq_length,
        dtype = dtype, load_in_4bit = load_in_4bit,
    )
    accuracy = run_optimized_validation(
        model_checkpoint, tokenizer_checkpoint, validation_dataset,
        inference_prompt, max_seq_length
    )
    results[checkpoint_dir] = accuracy

best_checkpoint_5k = max(results, key=results.get)
print(f"\n--- Finished! Best checkpoint is: {best_checkpoint_5k} (acc: {results[best_checkpoint_5k]:.2f}%) ---")

--- [Evaluation stage 1] Start: finding the best model for the 5k data... ---
Found 10 checkpoints: ['/content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-100', '/content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-200', '/content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-300', '/content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-400', '/content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-500', '/content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-600', '/content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-700', '/content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-800', '/content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-900', '/content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-938']

--- Evaluating /content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-100 ---
==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. C

Tokenizing lengths:   0%|          | 0/3000 [00:00<?, ?it/s]


Evaluating Batches:   0%|          | 0/24 [00:00<?, ?it/s]

✅ Validation Accuracy: 70.20% (2106/3000)

--- Evaluating /content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-200 ---
==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Running optimized validation...


Tokenizing lengths:   0%|          | 0/3000 [00:00<?, ?it/s]

Evaluating Batches:   0%|          | 0/24 [00:00<?, ?it/s]

✅ Validation Accuracy: 70.60% (2118/3000)

--- Evaluating /content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-300 ---
==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Running optimized validation...


Tokenizing lengths:   0%|          | 0/3000 [00:00<?, ?it/s]

Evaluating Batches:   0%|          | 0/24 [00:00<?, ?it/s]

✅ Validation Accuracy: 70.60% (2118/3000)

--- Evaluating /content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-400 ---
==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Running optimized validation...


Tokenizing lengths:   0%|          | 0/3000 [00:00<?, ?it/s]

Evaluating Batches:   0%|          | 0/24 [00:00<?, ?it/s]

✅ Validation Accuracy: 70.83% (2125/3000)

--- Evaluating /content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-500 ---
==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Running optimized validation...


Tokenizing lengths:   0%|          | 0/3000 [00:00<?, ?it/s]

Evaluating Batches:   0%|          | 0/24 [00:00<?, ?it/s]

✅ Validation Accuracy: 71.17% (2135/3000)

--- Evaluating /content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-600 ---
==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Running optimized validation...


Tokenizing lengths:   0%|          | 0/3000 [00:00<?, ?it/s]

Evaluating Batches:   0%|          | 0/24 [00:00<?, ?it/s]

✅ Validation Accuracy: 70.83% (2125/3000)

--- Evaluating /content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-700 ---
==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Running optimized validation...


Tokenizing lengths:   0%|          | 0/3000 [00:00<?, ?it/s]

Evaluating Batches:   0%|          | 0/24 [00:00<?, ?it/s]

✅ Validation Accuracy: 71.00% (2130/3000)

--- Evaluating /content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-800 ---
==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Running optimized validation...


Tokenizing lengths:   0%|          | 0/3000 [00:00<?, ?it/s]

Evaluating Batches:   0%|          | 0/24 [00:00<?, ?it/s]

✅ Validation Accuracy: 71.43% (2143/3000)

--- Evaluating /content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-900 ---
==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Running optimized validation...


Tokenizing lengths:   0%|          | 0/3000 [00:00<?, ?it/s]

Evaluating Batches:   0%|          | 0/24 [00:00<?, ?it/s]

✅ Validation Accuracy: 71.57% (2147/3000)

--- Evaluating /content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-938 ---
==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Running optimized validation...


Tokenizing lengths:   0%|          | 0/3000 [00:00<?, ?it/s]

Evaluating Batches:   0%|          | 0/24 [00:00<?, ?it/s]

✅ Validation Accuracy: 71.43% (2143/3000)

--- [Stage 1] Evaluation complete! Best model (5k) is: /content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-900 (accuracy: 71.57%) ---


## **Step 7: Generate Submission File**

This is the final step! We will now run our fine-tuned model on the official `test` dataset.

We will loop through each example in the test set, generate a prediction, and format the results into a CSV file with two columns: `ID` and `is_correct`, as required by the competition.
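
For reference, the expected file layout can be reproduced with nothing but the standard library. The sketch below writes a three-row toy submission into an in-memory buffer; `write_submission` is a hypothetical helper, and the notebook itself uses `pandas.DataFrame.to_csv` for the real file.

```python
import csv
import io


def write_submission(ids, preds, handle) -> None:
    """Write the two-column submission format: ID, is_correct."""
    if len(ids) != len(preds):
        raise ValueError("ids and preds must have the same length.")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ID", "is_correct"])
    for row_id, pred in zip(ids, preds):
        writer.writerow([row_id, bool(pred)])


buffer = io.StringIO()
write_submission([0, 1, 2], [True, False, True], buffer)
print(buffer.getvalue())
```

Checking that the two lists have the same length before writing is a cheap guard against a batch being skipped or counted twice during generation.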


In [None]:
import torch
import pandas as pd
from datasets import load_dataset
from tqdm.auto import tqdm

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Find the text after "Output:"
    output_part = response_text.split("Output:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    return 'true' in output_part.lower()

# --------- Global Acceleration Settings ---------
torch.backends.cuda.matmul.allow_tf32 = True
torch.set_float32_matmul_precision("high")
USE_BF16 = torch.cuda.is_bf16_supported()

model.eval().to("cuda")
tokenizer.padding_side = "left"
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

# --------- Parameters ---------
MAX_INPUT = 2048            # Consistent with training, avoid passive truncation.
MAX_NEW   = 4
BATCH     =  16
BUCKET_SZ = 64
DATASET   = "ad6398/nyu-dl-teach-maths-comp"

# --------- 1) Load and pre-build prompts, estimate length ---------
ds = load_dataset(DATASET, split="test")

def build_prompt(ex):
    return inference_prompt.format(ex["question"], str(ex["solution"]))

prompts = [build_prompt(ex) for ex in ds]
# Perform a single “lightweight tokenization” to obtain the length, avoiding repeated overhead in the main loop.
lengths = [len(tokenizer(p, add_special_tokens=True)["input_ids"]) for p in prompts]

# --------- 2) Length-based bucketization/sorting to minimize padding waste ---------
order = sorted(range(len(ds)), key=lambda i: lengths[i])
buckets = [order[i:i+BUCKET_SZ] for i in range(0, len(order), BUCKET_SZ)]

preds, ids = [], []

# --------- 3) Batch Generation (bf16 autocast + left padding + truncation) ---------
amp_dtype = torch.bfloat16 if USE_BF16 else torch.float16
autocast_ctx = torch.autocast(device_type="cuda", dtype=amp_dtype)  # replaces the deprecated torch.cuda.amp.autocast

with torch.no_grad(), autocast_ctx:
    for bucket in tqdm(buckets, total=len(buckets), mininterval=0.5):
        for s in range(0, len(bucket), BATCH):
            idxs = bucket[s:s+BATCH]
            batch_prompts = [prompts[i] for i in idxs]

            enc = tokenizer(
                batch_prompts,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=MAX_INPUT,
            ).to("cuda", non_blocking=True)

            out = model.generate(
                **enc,
                max_new_tokens=MAX_NEW,
                do_sample=False,
                use_cache=True,
            )

            in_len = enc["input_ids"].shape[1]
            gen = out[:, in_len:]
            texts = tokenizer.batch_decode(gen, skip_special_tokens=True)

            preds.extend(parse_output(t) for t in texts)
            ids.extend(ds[i]["id"] if "id" in ds[i] else i for i in idxs)

# --------- 4) submission ---------
pd.DataFrame({"ID": ids, "is_correct": preds}).to_csv("submission.csv", index=False)
print(f"Saved full submission with {len(preds)} samples to 'submission.csv'")



  0%|          | 0/157 [00:00<?, ?it/s]

Saved full submission with 10000 samples to 'submission.csv'
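
The length-based bucketing used in the cell above can be tried in isolation. The following is a minimal sketch, assuming made-up token lengths (`toy_lengths` is hypothetical, not taken from the dataset) and two helper names, `make_length_buckets` and `padding_tokens`, that do not appear in the notebook. It compares how many pad tokens are needed with and without sorting by length first.

```python
def make_length_buckets(lengths, bucket_size):
    """Group example indices into buckets of similar length.

    Indices are sorted by length and then cut into consecutive chunks,
    so the examples inside one bucket need roughly the same padding.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + bucket_size] for i in range(0, len(order), bucket_size)]


def padding_tokens(lengths, batches):
    """Count pad tokens when every batch is padded to its longest member."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total


# Hypothetical token lengths for eight prompts (illustration only).
toy_lengths = [12, 90, 15, 88, 14, 91, 13, 87]

# Batches in original dataset order versus batches built from sorted lengths.
unsorted = [list(range(0, 4)), list(range(4, 8))]
bucketed = make_length_buckets(toy_lengths, bucket_size=4)

print("buckets:", bucketed)
print("pad tokens, original order:", padding_tokens(toy_lengths, unsorted))
print("pad tokens, length-sorted: ", padding_tokens(toy_lengths, bucketed))
```

Because the rows are processed in length order, the predictions come out in that order too, which is why the cell above stores each example's ID next to its prediction.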


# SAVE THE MODEL TO DRIVE AND RUN INFERENCE
This section saves the model checkpoint to Google Drive, reloads the model from a checkpoint, and generates the final submission CSV file.

## Mount Google Drive

### Subtask:
Mount Google Drive to save the model checkpoint.


**Reasoning**:
Colab's local disk is cleared when the session ends, so Google Drive must be mounted before the checkpoint can be saved somewhere persistent.



In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Save model checkpoint

### Subtask:
Save the trained model checkpoint to the specified path in Google Drive.


**Reasoning**:
Define the save path and save the model and tokenizer to Google Drive.



In [6]:
import os

# Define the path to save the model checkpoint in Google Drive
save_path = "/content/drive/MyDrive/llama3_math_training_ckpt"

# Create the directory if it doesn't exist
os.makedirs(save_path, exist_ok=True)

# Save the model and tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model checkpoint and tokenizer saved to: {save_path}")

Model checkpoint and tokenizer saved to: /content/drive/MyDrive/llama3_math_training_ckpt
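
Before relying on the saved checkpoint, it can help to look at what was actually written. The sketch below is an assumption-laden helper (`summarize_checkpoint` is a hypothetical name, not part of the notebook or of any library) that lists every file under a directory with its size. It makes no claim about which files a particular model type should produce.

```python
import os


def summarize_checkpoint(path):
    """Return sorted (relative filename, size in bytes) pairs for all files under path."""
    entries = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            entries.append((os.path.relpath(full, path), os.path.getsize(full)))
    return sorted(entries)


# Same Drive path as the save cell above; outside Colab this directory will not exist.
save_path = "/content/drive/MyDrive/llama3_math_training_ckpt"

files = summarize_checkpoint(save_path)
if not files:
    print(f"No files found under {save_path}")
for name, size in files:
    print(f"{size:>12,d}  {name}")
```

An empty listing, or files of zero bytes, would suggest the save did not complete or that Drive was not mounted when it ran.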


## Load model from checkpoint

### Subtask:
Load the model from the saved checkpoint.


**Reasoning**:
Load the model and tokenizer from the saved checkpoint path in Google Drive and prepare the model for inference.



In [4]:
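# Path of the checkpoint to load from Google Drive. Note that this points at the
# `checkpoint-900` subdirectory, not the top-level folder written by the save cell above.
save_path = "/content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-900"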

# Load the model and tokenizer from the saved path
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Prepare the loaded model for faster inference
FastLanguageModel.for_inference(model)

print(f"Model and tokenizer loaded from: {save_path}")

==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model and tokenizer loaded from: /content/drive/MyDrive/llama3_math_training_ckpt/checkpoint-900
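
The load cell above hard-codes the `checkpoint-900` subdirectory. As a rough sketch, and assuming the Drive folder contains subdirectories named `checkpoint-<step>` (the naming the Hugging Face `Trainer` uses), the most recent one could be picked up automatically. `latest_checkpoint` is a hypothetical helper, not something defined in the notebook.

```python
import os
import re


def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> subdirectory, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        full = os.path.join(output_dir, name)
        if match and os.path.isdir(full):
            step = int(match.group(1))  # compare numerically so 1000 beats 900
            if step > best_step:
                best_step, best_path = step, full
    return best_path


# Same Drive folder as in the cells above; returns None if it is not mounted.
ckpt_dir = "/content/drive/MyDrive/llama3_math_training_ckpt"
print("Latest checkpoint:", latest_checkpoint(ckpt_dir))
```

The step number is compared as an integer, since a plain string sort would rank `checkpoint-900` above `checkpoint-1000`. Transformers also provides `get_last_checkpoint` in `transformers.trainer_utils` for the same purpose.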


## Generate submission file

### Subtask:
Generate the submission CSV file using the loaded model.


**Reasoning**:
Generate the submission CSV file by iterating through the test dataset, generating predictions using the loaded model, and saving the results to a pandas DataFrame.



In [7]:
import torch
import pandas as pd
from datasets import load_dataset
from tqdm.auto import tqdm

# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Find the text after "Output:"
    output_part = response_text.split("Output:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    if 'true' in output_part.lower():
        return True
    return False

# --------- Global Acceleration Settings ---------
torch.backends.cuda.matmul.allow_tf32 = True
torch.set_float32_matmul_precision("high")
USE_BF16 = torch.cuda.is_bf16_supported()

model.eval().to("cuda")
tokenizer.padding_side = "left"
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

# --------- Parameters ---------
MAX_INPUT = 2048            # Same limit as training, so prompts are not truncated differently at inference.
MAX_NEW   = 4               # The answer is a single True/False, so a few new tokens are enough.
BATCH     = 16
BUCKET_SZ = 64
DATASET   = "ad6398/nyu-dl-teach-maths-comp"

# --------- Load and pre-build prompts, estimate length ---------
ds = load_dataset(DATASET, split="test")

def build_prompt(ex):
    return inference_prompt.format(ex["question"], str(ex["solution"]))

prompts = [build_prompt(ex) for ex in ds]
# Tokenize each prompt once up front to get its length, so the main loop does not repeat that work.
lengths = [len(tokenizer(p, add_special_tokens=True)["input_ids"]) for p in prompts]

# --------- 2) Length-based bucketization/sorting to minimize padding waste ---------
order = sorted(range(len(ds)), key=lambda i: lengths[i])
buckets = [order[i:i+BUCKET_SZ] for i in range(0, len(order), BUCKET_SZ)]

preds, ids = [], []

# --------- 3) Batch Generation (bf16/fp16 autocast + left padding + truncation) ---------
amp_dtype = torch.bfloat16 if USE_BF16 else torch.float16
# torch.cuda.amp.autocast is deprecated in recent PyTorch; torch.autocast is its replacement.
autocast_ctx = torch.autocast(device_type="cuda", dtype=amp_dtype)

with torch.no_grad(), autocast_ctx:
    for bucket in tqdm(buckets, total=len(buckets), mininterval=0.5):
        for s in range(0, len(bucket), BATCH):
            idxs = bucket[s:s+BATCH]
            batch_prompts = [prompts[i] for i in idxs]

            enc = tokenizer(
                batch_prompts,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=MAX_INPUT,
            ).to("cuda", non_blocking=True)

            out = model.generate(
                **enc,
                max_new_tokens=MAX_NEW,
                do_sample=False,
                use_cache=True,
            )

            in_len = enc["input_ids"].shape[1]
            gen = out[:, in_len:]
            texts = tokenizer.batch_decode(gen, skip_special_tokens=True)

            preds.extend(parse_output(t) for t in texts)
            ids.extend(ds[i]["id"] if "id" in ds[i] else i for i in idxs)

# --------- 4) submission ---------
pd.DataFrame({"ID": ids, "is_correct": preds}).to_csv("submission.csv", index=False)
print(f"Saved full submission with {len(preds)} samples to 'submission.csv'")



  0%|          | 0/157 [00:00<?, ?it/s]

Saved full submission with 10000 samples to 'submission.csv'
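
A quick sanity check on `submission.csv` before uploading can catch obvious mistakes. The sketch below assumes the file has exactly the two columns written by the cell above (`ID` and `is_correct`) and that booleans were serialized as `True`/`False`, which is how pandas writes them. `check_submission` is a hypothetical helper, and the competition's own format rules take precedence over anything checked here.

```python
import csv
import os


def check_submission(path, expected_rows):
    """Return a list of problems found in a submission CSV (empty list if none)."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["ID", "is_correct"]:
            return [f"unexpected columns: {reader.fieldnames}"]
        rows = list(reader)

    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    ids = [row["ID"] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate IDs")
    bad_values = {row["is_correct"] for row in rows} - {"True", "False"}
    if bad_values:
        problems.append(f"non-boolean predictions: {sorted(bad_values)}")
    return problems


# 10000 is the sample count reported in the output above.
if os.path.exists("submission.csv"):
    issues = check_submission("submission.csv", expected_rows=10000)
    print("OK" if not issues else issues)
```

Since inference ran in length-sorted order, the rows in the file are not in dataset order. The duplicate-ID check is therefore worth keeping even when the row count matches.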
