

# Welcome to the Math Question Answer Verification Competition! 🚀

The goal is to fine-tune a Llama-3.1-8B model to predict whether a given solution to a math problem is correct. Your model should output True if the solution is correct, and False otherwise.

This notebook is a starter guide designed to get you up and running quickly. We'll walk through a simplified training process using a small subset of the data (10,000 training examples plus 1,000 for validation) and lightweight parameters. The main goal here is to understand the complete workflow, from loading data to generating a submission file, not to achieve a top score.

Good luck, and have fun! 🎉

## **Step 1: Install Necessary Libraries**

First, we need to install the required Python libraries. We'll be using the unsloth library, which provides memory-efficient training methods for large language models, making it possible to fine-tune an 8B model on a single free-tier GPU. We'll also pin compatible versions of transformers, trl, accelerate, and peft, and install bitsandbytes for 4-bit quantization.


In [None]:
# %%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -U \
  "transformers<=4.57.2,!=4.57.0,>=4.56.0" \
  "trl==0.23.0" \
  "accelerate>=0.34.1" \
  "peft>=0.7.1,<0.18" \
  bitsandbytes


## **Step 2: Load the Model and Tokenizer**

Next, we'll load the Llama-3.1-8B base model (`unsloth/Meta-Llama-3.1-8B`), the model approved for this competition. We'll use Unsloth's FastLanguageModel to handle this efficiently.

A key technique we'll use is 4-bit quantization (load_in_4bit = True). Think of this as compressing the model's knowledge into a much smaller file size. This significantly reduces the amount of GPU memory required, allowing us to fine-tune this large model even on a free platform like Google Colab.
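To make the memory saving concrete, here is a back-of-the-envelope sketch of how much space the weights alone need at 16 bits versus 4 bits per parameter. The parameter count of roughly 8.03 billion is an assumption for an 8B-class model, and the real footprint is larger because some layers stay in higher precision and activations, gradients, and optimizer state also need memory.

```python
# Rough weight-memory estimate. NUM_PARAMS is an assumed, approximate
# parameter count for an 8B-class model, not a value read from the model.
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 8.03e9  # assumption for illustration

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
```

The 4x ratio is what moves an 8B model from "does not fit" to "fits with room for training state" on a GPU with around 15 GB of memory.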



In [None]:

from unsloth import FastLanguageModel
import torch

max_seq_length = 1536  # Choose any sequence length
dtype = None  # This will auto-detect the best data type for your GPU
load_in_4bit = True  # Use 4-bit quantization to save memory

# Load the model and tokenizer from Hugging Face
# Note: We use the base model, not a 4-bit pre-quantized one,
# to ensure we start from the official weights.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B", # Competition-approved model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.12: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



## **Step 3: Prepare the Dataset**

This is a crucial step where we format our data into a structure the model can learn from. The process involves three parts:

1.  **Loading**: We'll load the official competition dataset from Hugging Face.
2.  **Splitting**: The full dataset is massive (1,000,000 samples). For this starter notebook, we'll create a much smaller, more manageable version to speed things up: **10,000 samples for training** and **1,000 for validation**.
3.  **Prompting**: We will format each data sample into a clear instructional prompt. This helps the model understand its role as a mathematician verifying a solution.
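The prompting step can be sketched on a toy batch without loading any model. The template and EOS string below are placeholders chosen for illustration (the notebook uses a longer template and takes the EOS token from the tokenizer); the point is only to show how a column-oriented batch becomes a list of complete training strings.

```python
# Toy template and EOS marker: both are hypothetical stand-ins.
TOY_TEMPLATE = "Question:\n{}\nSolution:\n{}\nOutput:\n{}"
TOY_EOS = "</s>"  # the real value would come from tokenizer.eos_token


def format_batch(batch: dict) -> dict:
    """Turn a column-oriented batch into full training strings."""
    texts = []
    for question, solution, label in zip(
        batch["question"], batch["solution"], batch["is_correct"]
    ):
        # The label is part of the text, so the model learns to emit it
        # after "Output:", followed by the end-of-sequence marker.
        texts.append(TOY_TEMPLATE.format(question, str(solution), str(label)) + TOY_EOS)
    return {"text": texts}


toy_batch = {
    "question": ["What is 2 + 2?"],
    "solution": ["2 + 2 = 5"],
    "is_correct": [False],
}
formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Returning a dict of lists is what lets a function like this be used with `Dataset.map(..., batched=True)`.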



In [None]:
from datasets import load_dataset

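# ========== Step 1: Load the full dataset ==========
print("📦 Loading the dataset ...")
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")
print(f"✅ The length of whole dataset: {len(full_dataset):,} Samples")

# ========== Step 2: Split the data ==========
# To keep training fast, we use a subset:
# - Training set: 10,000 samples (about 1% of the 1,000,000-sample dataset)
# - Validation set: 1,000 samples (used to monitor training)
# 💡 For a real run, consider using more data (e.g. 50% or all of it)

TRAIN_SIZE = 10000    # Number of training samples
VAL_SIZE = 1000       # Number of validation samples
RANDOM_SEED = 42      # Random seed, so results are reproducible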

print(f"🔀 Shuffle the dataset...")
shuffled_dataset = full_dataset.shuffle(seed=RANDOM_SEED)

train_dataset = shuffled_dataset.select(range(TRAIN_SIZE))
validation_dataset = shuffled_dataset.select(range(TRAIN_SIZE, TRAIN_SIZE + VAL_SIZE))
print(f"✅ Done")
print(f"✅ The length of train dataset: {len(train_dataset):,} Samples")
print(f"✅ The length of validation dataset: {len(validation_dataset):,} Samples")

📦 Loading the dataset ...
✅ The length of whole dataset: 1,000,000 Samples
🔀 Shuffle the dataset...
✅ Done
✅ The length of train dataset: 10,000 Samples
✅ The length of validation dataset: 1,000 Samples


In [None]:
# The instructional prompt template for training
training_prompt = """You are a STRICT Math Answer Verifier.
Your job is to decide whether the provided SOLUTION correctly answers the QUESTION.
Think step by step internally, but ONLY print one token on the last line: True or False.
Using the
# Instructions (internal checklist; do not reveal)
1) Target match: Does the SOLUTION answer the exact quantity asked, with correct unit and, if specified, the required form (simplest radical, fraction, interval, etc.)?
2) Mathematical validity: Are key formulas, logic and computations correct? If code or code output appears, trust the mathematics and the problem requirements over raw prints. Recompute quickly when feasible.
3) Conditions and multipliers: Handle second coat, tax or discount, degree vs radian, unit conversions, boundary cases.
4) Consistency: Final conclusion must agree with intermediate reasoning. If there is a conflict, the answer is False.
5) Numerical equivalence: If no specific form is required, accept numerical equivalence within max(abs error, rel error) <= 1e-6.

# Output rule
Print exactly one line with either True or False. No other text.

# Data
Question:
{}
Solution:
{}
Output:
{}"""


# We must add an End Of Sequence (EOS) token to tell the model when a completion is finished.
EOS_TOKEN = tokenizer.eos_token

# This function formats our data samples into the prompt template.
def formatting_prompts_func(examples):
    questions = examples["question"]
    solutions = examples["solution"]
    outputs = examples["is_correct"]
    texts = []
    for question, solution, output in zip(questions, solutions, outputs):
        # Format the prompt and add the EOS token
        text = training_prompt.format(question, str(solution), str(output)) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

# Apply the formatting function to our training dataset
formatted_train_dataset = train_dataset.map(formatting_prompts_func, batched=True)

# Also format the validation dataset for evaluation during training
formatted_validation_dataset = validation_dataset.map(formatting_prompts_func, batched=True)


## **Step 4: Configure LoRA and Set Up the Trainer**

### **LoRA Configuration**

Instead of training the entire model (which has billions of parameters), we'll use a technique called **Lo**w-**R**ank **A**daptation (LoRA). 🎛️

Think of it like this: rather than rewriting an entire textbook, we're just adding small, efficient "sticky notes" (the LoRA adapters) to update the model's knowledge. This is much faster and requires significantly less memory. We'll use a small **rank** (`r = 16`) to keep the training process light and quick for this starter notebook.
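To see how small the "sticky notes" are, the sketch below counts LoRA parameters for the rank this notebook passes to `get_peft_model` (`r = 16`). A LoRA adapter on a weight of shape `d_out x d_in` adds two matrices with `r * (d_in + d_out)` parameters in total. The layer dimensions are assumptions based on the published Llama-3-8B architecture; they are not read from the loaded model here.

```python
# Assumed Llama-3-8B-style dimensions (not read from the model in this notebook).
HIDDEN = 4096         # model width
KV_DIM = 1024         # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336  # MLP inner width
NUM_LAYERS = 32
RANK = 16             # LoRA rank used in the get_peft_model call

# (input_dim, output_dim) of each adapted projection in one transformer layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the A (rank x d_in) and B (d_out x rank) matrices."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the total comes to 41,943,040, which matches the trainable-parameter count Unsloth reports when training starts (about 0.52% of the model).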


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # A small rank for lighter training
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32, # A common practice is to set alpha = 2 * r
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

Unsloth 2025.10.12 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.



### **SFTTrainer Setup**

Now we'll set up the `SFTTrainer` (Supervised Fine-tuning Trainer). This is the main tool from the `trl` library that will handle the entire training loop for us. We'll give it our model, tokenizer, dataset, and a set of training instructions, such as the batch size and number of epochs.

We will train for **three epochs** (three passes over our 10,000-sample training set) and evaluate on the validation set every 50 steps.
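It helps to know how many optimizer steps a configuration implies before launching it. The sketch below reproduces the arithmetic for the values in the `TrainingArguments` cell (batch size 12, gradient accumulation 4, one GPU, three epochs, 10,000 training examples). The helper name is hypothetical, and the trainer's own rounding may differ slightly between library versions.

```python
import math


def total_optimizer_steps(num_examples: int, per_device_batch: int,
                          grad_accum: int, num_gpus: int, epochs: int):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    # A partial final batch still triggers an optimizer step, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective_batch, total_steps = total_optimizer_steps(
    num_examples=10_000, per_device_batch=12, grad_accum=4, num_gpus=1, epochs=3
)
print(f"Effective batch size: {effective_batch}")
print(f"Total optimizer steps: {total_steps}")
```

With `eval_steps = 50` and `save_steps = 50`, a run of this length evaluates and checkpoints roughly a dozen times.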

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_train_dataset,
    eval_dataset = formatted_validation_dataset,  # Validation set, evaluated during training
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing=False,
    args = TrainingArguments(
        per_device_train_batch_size = 12,
        per_device_eval_batch_size = 16,  # Evaluation can use a larger batch size
        gradient_accumulation_steps = 4,
        warmup_steps = 50,  # Increased number of warmup steps
        num_train_epochs = 3,  # Train by epochs rather than a fixed step count
        learning_rate = 2e-4,  # Slightly higher learning rate
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),

        # Evaluation strategy: evaluate on the validation set periodically during training
        eval_strategy = "steps",
        eval_steps = 50,  # Evaluate every 50 steps

        # Checkpoint strategy: save regularly and restore the best checkpoint
        save_strategy = "steps",
        save_steps = 50,
        save_total_limit = 3,  # Keep at most 3 checkpoints on disk
        load_best_model_at_end = True,  # Reload the best checkpoint when training ends
        metric_for_best_model = "eval_loss",  # Pick the best checkpoint by validation loss

        logging_steps=10,
        logging_first_step=True,
        disable_tqdm=False,
        optim = "adamw_torch_fused",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 42,
        output_dir = "outputs",
        report_to = "none",
        dataloader_num_workers=4,
        dataloader_pin_memory=True,
        group_by_length=True,
    ),
)


## **Step 5: Start Training!**

Now, we'll call the `train()` function on our `trainer` object. This will kick off the fine-tuning process. Based on our settings, this will run for three full epochs over our 10,000 training examples.

Grab a coffee: the run logged here took about 75 minutes on an NVIDIA L4. ☕


In [None]:
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000 | Num Epochs = 3 | Total steps = 627
O^O/ \_/ \    Batch size per device = 12 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (12 x 4 x 1) = 48
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


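| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 0.3089        | 0.440728        |
| 100  | 0.2894        | 0.421779        |
| 150  | 0.2775        | 0.413986        |
| 200  | 0.2771        | 0.408363        |
| 250  | 0.319         | 0.397384        |
| 300  | 0.3123        | 0.39175         |
| 350  | 0.3051        | 0.387443        |
| 400  | 0.3059        | 0.383129        |
| 450  | 0.3208        | 0.382878        |
| 500  | 0.3121        | 0.379745        |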


Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


TrainOutput(global_step=627, training_loss=0.41367708667043296, metrics={'train_runtime': 4501.0035, 'train_samples_per_second': 6.665, 'train_steps_per_second': 0.139, 'total_flos': 6.50026206020567e+17, 'train_loss': 0.41367708667043296, 'epoch': 3.0})

## **Step 6: Inference and Evaluation**
Now that our model is trained, we need to check its accuracy on the validation set. We'll use a slightly different prompt for inference: one where the `Output:` section is left blank for the model to complete.

We'll first test a single example from the validation set to see what the model predicts, then measure accuracy over the full validation set.
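Because the model returns free text, the answer has to be parsed into a boolean. A minimal sketch of a first-word parser follows; the function name is hypothetical, and the choice to treat anything other than an exact `True` as `False` is an assumption worth revisiting if the model's outputs vary in casing.

```python
def parse_verdict(generated: str) -> bool:
    """Map the model's generated text to a boolean verdict.

    Only the first whitespace-separated word is inspected, with trailing
    punctuation removed. Empty or unexpected output counts as False.
    """
    words = generated.strip().split()
    if not words:
        return False
    return words[0].rstrip(".,;:!?") == "True"


for sample in ["True", " True.\nFalse", "False", "", "true"]:
    print(repr(sample), "->", parse_verdict(sample))
```

Note that the comparison is case-sensitive, so a lowercase `true` is read as `False` in this sketch.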

In [None]:
# ============================================================
# Detailed inference code: simplified parsing, with validation-set evaluation
# ============================================================

import torch
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset

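# Prepare the model for inference
FastLanguageModel.for_inference(model)

# Note: this template starts with a "# System" line that training_prompt
# does not have. Keep the two templates identical if you retrain.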
INFERENCE_PROMPT = """# System
You are a STRICT Math Answer Verifier.
Your job is to decide whether the provided SOLUTION correctly answers the QUESTION.
Think step by step internally, but ONLY print one token on the last line: True or False.
Using the
# Instructions (internal checklist; do not reveal)
1) Target match: Does the SOLUTION answer the exact quantity asked, with correct unit and, if specified, the required form (simplest radical, fraction, interval, etc.)?
2) Mathematical validity: Are key formulas, logic and computations correct? If code or code output appears, trust the mathematics and the problem requirements over raw prints. Recompute quickly when feasible.
3) Conditions and multipliers: Handle second coat, tax or discount, degree vs radian, unit conversions, boundary cases.
4) Consistency: Final conclusion must agree with intermediate reasoning. If there is a conflict, the answer is False.
5) Numerical equivalence: If no specific form is required, accept numerical equivalence within max(abs error, rel error) <= 1e-6.

# Output rule
Print exactly one line with either True or False. No other text.

# Data
Question:
{}
Solution:
{}
Output:
"""

def predict_one(question, solution):
    """Generate a True/False prediction for a single sample."""
    prompt = INFERENCE_PROMPT.format(question, str(solution))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=8,
            use_cache=True,
            do_sample=False,  # Greedy decoding; temperature is not used in this mode
            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens
    generated = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    ).strip()

    # Simplified parsing: return False for empty output, otherwise check the first word
    if not generated:
        return False

    first_word = generated.split()[0].rstrip('.,;:!?')
    return first_word == "True"


# ============================================================
# 1. Test a single sample from the validation set
# ============================================================
print("="*60)
print("🔍 Testing a single sample")
print("="*60)

test_example = validation_dataset[10]
test_pred = predict_one(test_example["question"], test_example["solution"])

print(f"Question: {test_example['question'][:100]}...")
print(f"Prediction: {test_pred}")
print(f"Actual: {test_example['is_correct']}")
print(f"Match: {'✅ Correct' if str(test_pred) == str(test_example['is_correct']) else '❌ Wrong'}")
print()


# ============================================================
# 2. Full evaluation on the validation set
# ============================================================
print("="*60)
print("📊 Validation-set evaluation")
print("="*60)

val_correct = 0
val_total = len(validation_dataset)
val_predictions = []
val_actuals = []

for i in tqdm(range(val_total), desc="Evaluating"):
    example = validation_dataset[i]
    pred = predict_one(example["question"], example["solution"])
    actual = example["is_correct"]

    val_predictions.append(pred)
    val_actuals.append(actual)

    if str(pred) == str(actual):
        val_correct += 1

val_accuracy = val_correct / val_total

print(f"\n[Results]")
print(f"  Accuracy: {val_correct}/{val_total} = {val_accuracy:.4f} ({val_accuracy*100:.2f}%)")

# Distribution of predicted labels
true_count = sum(val_predictions)
false_count = len(val_predictions) - true_count
print(f"\n[Prediction distribution]")
print(f"  Predicted True:  {true_count} ({true_count/val_total*100:.1f}%)")
print(f"  Predicted False: {false_count} ({false_count/val_total*100:.1f}%)")

# Distribution of actual labels
true_actual = sum(1 for x in val_actuals if x)
false_actual = len(val_actuals) - true_actual
print(f"\n[Actual distribution]")
print(f"  Actual True:  {true_actual} ({true_actual/val_total*100:.1f}%)")
print(f"  Actual False: {false_actual} ({false_actual/val_total*100:.1f}%)")

print()



🔍 Testing a single sample
Question: A circular spinner for a game has a radius of 10 cm. The probability of winning on one spin of this ...
Prediction: False
Actual: False
Match: ✅ Correct

📊 Validation-set evaluation


Evaluating: 100%|██████████| 1000/1000 [03:28<00:00,  4.79it/s]


[Results]
  Accuracy: 821/1000 = 0.8210 (82.10%)

[Prediction distribution]
  Predicted True:  420 (42.0%)
  Predicted False: 580 (58.0%)

[Actual distribution]
  Actual True:  405 (40.5%)
  Actual False: 595 (59.5%)
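Accuracy and label distributions do not show which kind of error the model makes. As a possible follow-up, the sketch below counts true/false positives and negatives from two parallel lists of booleans and derives precision and recall for the `True` class. The function names and the toy lists are hypothetical; in the notebook the inputs would be lists like `val_predictions` and `val_actuals`.

```python
def confusion_counts(predictions, actuals):
    """Count tp/fp/tn/fn, treating True as the positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for pred, actual in zip(predictions, actuals):
        if pred and actual:
            counts["tp"] += 1
        elif pred and not actual:
            counts["fp"] += 1
        elif not pred and not actual:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def summarize(counts):
    """Derive accuracy, precision, and recall, guarding against empty denominators."""
    tp, fp, tn, fn = counts["tp"], counts["fp"], counts["tn"], counts["fn"]
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy data for illustration only
toy_preds = [True, True, False, False, True]
toy_actuals = [True, False, False, True, True]
toy_counts = confusion_counts(toy_preds, toy_actuals)
toy_summary = summarize(toy_counts)
print(toy_counts)
print(toy_summary)
```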






## **Step 7: Generate Submission File**

This is the final step! We will now run our fine-tuned model on the official `test` dataset.

We will loop through each example in the test set, generate a prediction, and format the results into a CSV file with two columns: `ID` and `is_correct`, as required by the competition.
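The file layout is simple enough to sketch with the standard library alone. The notebook itself uses pandas; the `csv` module is swapped in here only so the example runs without extra dependencies. Using the row position as `ID` follows what the notebook does and is an assumption about what the competition expects, so check it against the official sample submission.

```python
import csv
import io


def write_submission(predictions, handle):
    """Write one row per prediction with the columns ID and is_correct."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ID", "is_correct"])
    for idx, pred in enumerate(predictions):
        # bool() normalizes values so the column only contains True/False
        writer.writerow([idx, bool(pred)])


# Write to an in-memory buffer so the sketch leaves no file behind
buffer = io.StringIO()
write_submission([True, False, True], buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with `open("submission.csv", "w", newline="")` instead of the buffer.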

In [None]:
# ============================================================
# 3. Generate the test-set submission file
# ============================================================
print("="*60)
print("🚀 Generating test-set submission")
print("="*60)

test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

for example in tqdm(test_dataset, desc="Generating submission"):
    pred = predict_one(example["question"], example["solution"])
    predictions.append(pred)

# Create the submission file
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

submission.to_csv('submission.csv', index=False)

print(f"\n[Submission file stats]")
print(f"  File name: submission.csv")
print(f"  Number of predictions: {len(predictions)}")
print(f"  True count: {sum(predictions)} ({sum(predictions)/len(predictions)*100:.1f}%)")
print(f"  False count: {len(predictions) - sum(predictions)} ({(len(predictions)-sum(predictions))/len(predictions)*100:.1f}%)")

print("\n" + "="*60)
print("✅ All done!")
print("="*60)

🚀 Generating test-set submission


Generating submission: 100%|██████████| 10000/10000 [34:22<00:00,  4.85it/s]


[Submission file stats]
  File name: submission.csv
  Number of predictions: 10000
  True count: 4032 (40.3%)
  False count: 5968 (59.7%)

✅ All done!





In [None]:
import json


def generate_training_report(metrics_path="training_metrics.json",
                             output_path="training_report.txt"):
    """
    Generate a human-readable training report from a JSON metrics file.
    """
    with open(metrics_path, 'r', encoding='utf-8') as f:
        metrics = json.load(f)

    report = []
    report.append("="*70)
    report.append("📊 Training Report")
    report.append("="*70)
    report.append(f"\nTrained at: {metrics['timestamp']}\n")

    # Basic information
    report.append("[Basic info]")
    fm = metrics['final_metrics']
    report.append(f"  Total steps: {fm['total_steps']}")
    report.append(f"  Training duration: {fm['training_time_formatted']}")
    report.append(f"  Training speed: {fm['steps_per_second']:.3f} it/s")

    # Loss information
    report.append("\n[Loss metrics]")
    report.append(f"  Final training loss: {fm['final_train_loss']:.4f}")
    report.append(f"  Best training loss: {fm['best_train_loss']:.4f}")
    report.append(f"  Final validation loss: {fm['final_eval_loss']:.4f}")
    report.append(f"  Best validation loss: {fm['best_eval_loss']:.4f}")
    report.append(f"  Validation improvement: {fm['eval_improvement']:.4f} ({fm['eval_improvement_percent']:.1f}%)")

    # Training configuration
    report.append("\n[Training config]")
    cfg = metrics['training_config']
    report.append(f"  Batch Size: {cfg['per_device_train_batch_size']} × {cfg['gradient_accumulation_steps']} = {cfg['effective_batch_size']}")
    report.append(f"  Learning Rate: {cfg['learning_rate']}")
    report.append(f"  Epochs: {cfg['num_train_epochs']}")
    report.append(f"  Optimizer: {cfg['optimizer']}")

    # LoRA configuration
    report.append("\n[LoRA config]")
    lora = metrics['lora_config']
    report.append(f"  Rank: {lora['rank']}")
    report.append(f"  Alpha: {lora['alpha']}")

    # Dataset
    report.append("\n[Dataset]")
    ds = metrics['dataset_info']
    report.append(f"  Training set: {ds['train_dataset_size']:,} samples")
    report.append(f"  Validation set: {ds['eval_dataset_size']:,} samples")

    report.append("\n" + "="*70)

    report_text = "\n".join(report)

    # Save to file
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(report_text)

    # Print
    print(report_text)
    print(f"\nReport saved to: {output_path}")

    return report_text


# Generate the report.
# Note: no cell in this notebook writes training_metrics.json, so this call
# raises FileNotFoundError unless that file is created first.
generate_training_report("training_metrics.json", "training_report.txt")

FileNotFoundError: [Errno 2] No such file or directory: 'training_metrics.json'
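The report cell fails because nothing in this notebook produces `training_metrics.json`. As one possible starting point, the sketch below derives the loss fields that the report reads from a list of log records shaped like `trainer.state.log_history` (dicts carrying either a `loss` or an `eval_loss` key). The function name is hypothetical, and defining "improvement" as the first validation loss minus the best one is an assumption; the remaining sections of the JSON (timestamp, config, dataset sizes) would still need to be filled in separately.

```python
def summarize_losses(log_history):
    """Build the loss fields of 'final_metrics' from trainer-style log records."""
    train_losses = [rec["loss"] for rec in log_history if "loss" in rec]
    eval_losses = [rec["eval_loss"] for rec in log_history if "eval_loss" in rec]
    if not train_losses or not eval_losses:
        raise ValueError("log_history needs both training and evaluation records")

    # Assumed definition: how far validation loss fell from its first value
    improvement = eval_losses[0] - min(eval_losses)
    return {
        "final_train_loss": train_losses[-1],
        "best_train_loss": min(train_losses),
        "final_eval_loss": eval_losses[-1],
        "best_eval_loss": min(eval_losses),
        "eval_improvement": improvement,
        "eval_improvement_percent": 100 * improvement / eval_losses[0],
    }


# Toy records for illustration; real ones would come from trainer.state.log_history
toy_history = [
    {"step": 50, "loss": 0.31},
    {"step": 50, "eval_loss": 0.44},
    {"step": 100, "loss": 0.29},
    {"step": 100, "eval_loss": 0.40},
]
summary = summarize_losses(toy_history)
print(summary)
```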

# Save the Model to Drive and Run Inference
This section saves the model checkpoint to Google Drive, reloads the model from that checkpoint, and generates the final submission CSV file.

## Mount Google Drive

Mount Google Drive so the model checkpoint can be saved there.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Save model checkpoint

Define the save path and save the trained model and tokenizer to Google Drive. Because the model is wrapped with LoRA, `save_pretrained` writes the adapter weights rather than a full copy of the base model.



In [None]:
import os

# Define the path to save the model checkpoint in Google Drive
save_path = "/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint"

# Create the directory if it doesn't exist
os.makedirs(save_path, exist_ok=True)

# Save the model and tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model checkpoint and tokenizer saved to: {save_path}")

Model checkpoint and tokenizer saved to: /content/drive/MyDrive/llama3_8b_math_verifier_checkpoint


## Load model from checkpoint

Load the model and tokenizer from the saved checkpoint path in Google Drive and prepare the model for inference. In a fresh session, first re-run the cell that defines `max_seq_length`, `dtype`, and `load_in_4bit`, since this cell reuses them.



In [None]:
# Define the path where the model checkpoint was saved in Google Drive
save_path = "/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint"

# Load the model and tokenizer from the saved path
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Prepare the loaded model for faster inference
FastLanguageModel.for_inference(model)

print(f"Model and tokenizer loaded from: {save_path}")

==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model and tokenizer loaded from: /content/drive/MyDrive/llama3_8b_math_verifier_checkpoint


## Generate submission file

Iterate through the test dataset, generate a prediction for each example with the loaded model, and save the results to a CSV file via a pandas DataFrame.



In [None]:
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

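# Create the prompt template for inference (no answer included).
# Note: this template differs from the training prompt used in Step 3;
# for consistent results, reuse the instructions the model was fine-tuned on.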
inference_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
"""

# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Find the text after "Output:"
    output_part = response_text.split("Output:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    if 'true' in output_part.lower():
        return True
    return False

# Loop through the test dataset and generate a prediction for each example
for example in tqdm(test_dataset):
    question = example["question"]
    solution = example["solution"]

    # Format the prompt
    prompt = inference_prompt.format(question, str(solution))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate the prediction
    outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
    response_text = tokenizer.batch_decode(outputs)[0]

    # Parse the prediction and add it to our list
    prediction = parse_output(response_text)
    predictions.append(prediction)

# Create the submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created successfully!")
print("You can now download this file and submit it to the Kaggle competition.")

100%|██████████| 10000/10000 [1:33:05<00:00,  1.79it/s]


Submission file 'submission.csv' created successfully!
You can now download this file and submit it to the Kaggle competition.



