# Math Question Answer Verification Competition

**Goal**: Fine-tune a Llama-3.1-8B model to predict whether a given solution to a math problem is correct.

**Try 2 Optimizations**:
- **50,000 training samples** (5% of full dataset) with stratified split (99.5/0.5)
- **LoRA rank 32** for good capacity
- **Max sequence length 2048** to prevent truncation
- **2 epochs** training (fast completion)
- **Learning rate 2e-4** (vs 1e-4) for faster convergence ⬆️
- **Warmup 300 steps** (vs 500) for quick start ⬇️
- **Improved prompt template** emphasizing solution reasoning
- **Constrained decoding** for reliable output parsing

**Note**: This notebook is optimized for Google Colab with an A100 GPU.

## **Step 1: Install Necessary Libraries**

First, install the required Python libraries. We use the unsloth library, which provides memory-efficient training methods for large language models and makes it practical to fine-tune an 8B-parameter model on a single GPU. We also install xformers for optimized attention kernels.


In [3]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9.]{3,}", torch.__version__).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2

## **Step 2: Hugging Face Login**

To access gated models such as Llama-3.1-8B, authenticate with Hugging Face using your access token.



In [4]:
from huggingface_hub import login

# Log in to Hugging Face using the token stored in Colab Secrets
# Set HF_TOKEN in Colab Secrets (Secrets → Add Secret)
try:
    from google.colab import userdata
    hf_token = userdata.get('HF_TOKEN')
    login(token=hf_token, add_to_git_credential=False)
    print("✅ Successfully logged in to Hugging Face")
except Exception:  # not running in Colab, missing secret, or failed login
    print("⚠️ Warning: HF_TOKEN not found. Make sure to set it in Colab Secrets if the model is gated.")

✅ Successfully logged in to Hugging Face


## **Step 3: Load the Model and Tokenizer**

Load Llama-3.1-8B using Unsloth's FastLanguageModel with optimized settings:
- **4-bit quantization**: Reduces GPU memory usage significantly
- **Max sequence length 2048**: Prevents truncation of long problems
- **Auto dtype detection**: Optimizes for your GPU
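
As a rough illustration of why 4-bit loading matters, the sketch below estimates weight-only memory at two precisions. The base parameter count is derived from the figures that `print_trainable_parameters()` reports in Step 5 (all parameters minus the LoRA adapter parameters). The helper name `weight_memory_gb` is hypothetical, and the estimate deliberately ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so real usage will be higher.

```python
# Rough weight-only memory estimate (illustrative; real usage is higher).
# Base parameter count = all params minus LoRA params, as reported in Step 5.
BASE_PARAMS = 8_114_147_328 - 83_886_080


def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Return the memory needed to store the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(BASE_PARAMS, 16)
int4_gb = weight_memory_gb(BASE_PARAMS, 4)

print(f"16-bit weights: ~{bf16_gb:.1f} GB")
print(f" 4-bit weights: ~{int4_gb:.1f} GB")
```

Under these assumptions the weights shrink by roughly a factor of four, which leaves more GPU memory for the longer 2048-token sequences.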



In [None]:
from unsloth import FastLanguageModel
import torch

# Try 2 settings
max_seq_length = 2048  # Increased from 1024 to prevent truncation
dtype = None  # Auto-detect best data type for GPU
load_in_4bit = True  # Use 4-bit quantization to save memory

# Load the model and tokenizer from Hugging Face
# Using Meta-Llama-3.1-8B
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print(f"✅ Model loaded with max_seq_length={max_seq_length}")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.12: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


✅ Model loaded with max_seq_length=2048


## **Step 4: Prepare the Dataset**

This step prepares the training data for fine-tuning. It consists of three parts:

1. **Load Dataset**: Load the full training set (1M samples) from Hugging Face and subsample 50,000 examples
2. **Split Dataset**: Create a 99.5/0.5 train/validation split using stratified sampling (maintains class balance)
3. **Format Prompts**: Convert data into instructional prompts with improved structure

---

### 4.1 Load and Split Dataset

**Why Stratified Split?**
- Ensures training and validation sets have the same True/False ratio
- Prevents bias in validation metrics
- Better representation of the overall dataset distribution
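
To make the idea concrete, here is a minimal pure-Python sketch of a stratified split. It is not the `datasets` implementation used in the cell below; the function name `stratified_split` and the 60/40 toy class ratio are assumptions chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed=42):
    """Split indices so each class keeps (approximately) the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels: 60% False, 40% True (an assumed ratio, not the competition's).
labels = [False] * 600 + [True] * 400
train_idx, test_idx = stratified_split(labels, test_fraction=0.1)

print("train:", Counter(labels[i] for i in train_idx))
print("test: ", Counter(labels[i] for i in test_idx))
```

Both parts keep the 60/40 ratio of the toy data. A purely random split of only 250 validation samples could drift from the true ratio, which is the bias the stratified split avoids.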

---

### 4.2 Format Training Prompts

**Key Improvements Over Baseline:**

1. **Includes Student Answer**: The baseline prompt only had Question and Solution, but the task requires comparing the student's answer with the correct solution. This is critical!
2. **Emphasizes Reasoning**: The prompt explicitly asks the model to analyze step-by-step reasoning from the solution
3. **Clear Task Definition**: Better structure helps the model understand what it needs to do

In [None]:
from datasets import load_dataset

# Load the full training dataset (1M samples)
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")

# Encode is_correct as ClassLabel for stratified split
# This ensures the train/val split maintains the same True/False ratio
full_dataset = full_dataset.class_encode_column("is_correct")

# 🛡️ Try 2: Limit to 50,000 samples for SAFE & fast training (5% of full dataset)
full_dataset = full_dataset.shuffle(seed=42).select(range(50000))

# Stratified split: 99.5% training, 0.5% validation (maintains class balance)
# Using smaller validation set for faster training
split_dataset = full_dataset.train_test_split(
    test_size=0.005,  # 0.5% for validation (~250 samples)
    seed=42,
    stratify_by_column="is_correct"
)

train_dataset = split_dataset["train"]  # ~49,750 samples
validation_dataset = split_dataset["test"]  # ~250 samples

print(f"✅ Training samples: {len(train_dataset):,}")
print(f"✅ Validation samples: {len(validation_dataset):,}")
print(f"🛡️ Try 1.5: Using 50,000 samples (5% of full dataset) for SAFE training")


✅ Training samples: 49,750
✅ Validation samples: 250
🛡️ Try 1.5: Using 50,000 samples (5% of full dataset) for SAFE training


In [7]:
# Improved prompt template emphasizing solution reasoning
training_prompt = """You are an expert mathematician verifying student answers.

Your task: Determine if the student's answer is correct by analyzing the problem, the provided solution's reasoning steps, and the student's answer.

Question:
{}

Correct Solution (with step-by-step reasoning):
{}

Student Answer:
{}

Based on the solution's reasoning, is the student's answer correct?
Respond with:
- 'True' if the answer is correct
- 'False' if the answer is incorrect

Output:
{}"""

# EOS token to mark completion
EOS_TOKEN = tokenizer.eos_token

# Format data samples into the prompt template
def formatting_prompts_func(examples):
    """Format dataset examples into training prompts.

    Args:
        examples: Batch of dataset examples containing question, solution, answer, is_correct

    Returns:
        Dictionary with formatted text prompts
    """
    questions = examples["question"]
    solutions = examples["solution"]
    answers = examples["answer"]  # Student answers
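    # After class_encode_column, is_correct holds integer class ids
    # (0 = False, 1 = True), so the target text written below is "0" or "1".
    outputs = examples["is_correct"]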
    texts = []
    for question, solution, answer, output in zip(questions, solutions, answers, outputs):
        # Format the prompt with all components and add EOS token
        text = training_prompt.format(
            question,
            str(solution),
            str(answer),
            str(output)
        ) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

# Apply formatting to both training and validation datasets
formatted_train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
formatted_validation_dataset = validation_dataset.map(formatting_prompts_func, batched=True)

print(f"✅ Datasets formatted: {len(formatted_train_dataset):,} training, {len(formatted_validation_dataset):,} validation")

✅ Datasets formatted: 49,750 training, 250 validation



## **Step 5: Configure LoRA**

**LoRA (Low-Rank Adaptation)** allows us to efficiently fine-tune the model by training only a small number of adapter parameters instead of the full model.

**Try 2 Settings**:
- **Rank 32**: Good capacity for the 50K-sample dataset (balances speed, safety, and accuracy)
- **Alpha 64**: Set to 2×rank, a common heuristic; the LoRA update is scaled by alpha/rank = 2
- **Dropout 0.05**: Light regularization to prevent overfitting
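
The trainable parameter count implied by these settings can be worked out by hand. The sketch below assumes the layer shapes from the published Llama-3.1-8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 decoder layers) and the seven target modules used in the cell below; `lora_params` is a hypothetical helper name.

```python
# Llama-3.1-8B shapes, assumed from the published model config.
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP)
KV_DIM = 1024          # num_key_value_heads (8) * head_dim (128)
NUM_LAYERS = 32
RANK, ALPHA = 32, 64

# (in_features, out_features) of each targeted projection in one decoder layer
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted weight matrix."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK

print(f"Trainable LoRA parameters: {total:,}")
print(f"LoRA scaling (alpha / rank): {scaling}")
```

The total works out to 83,886,080, which agrees with the `trainable params` figure that `print_trainable_parameters()` reports after the cell below runs.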

In [None]:
# Configure LoRA with Try 2 parameters
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,  # Good capacity
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 64,  # Typically 2×rank
    lora_dropout = 0.05,  # Light regularization
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

model.print_trainable_parameters()

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.10.12 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


trainable params: 83,886,080 || all params: 8,114,147,328 || trainable%: 1.0338


## **Step 6: Set Up SFTTrainer**

Configure the training process with **Try 2 Training Settings**:
- **2 epochs**: Two full passes through the 50K-sample dataset
- **Learning rate 2e-4** (vs 1e-4): Higher LR for faster convergence on the smaller dataset ⬆️
- **Warmup 300 steps** (vs 500): Shorter warmup to start learning faster ⬇️
- **No validation during training**: For maximum speed
- **Save checkpoints at each epoch**: Keep last 2 checkpoints for model recovery
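
To see how the warmup and the `linear` scheduler interact, the following sketch reproduces the usual shape of a linear-warmup, linear-decay schedule. It is an illustration rather than the trainer's own code: the function name `linear_schedule_lr` is hypothetical, and the total of 12,438 steps is taken from the training log in Step 7.

```python
def linear_schedule_lr(step: int, base_lr: float = 2e-4,
                       warmup_steps: int = 300, total_steps: int = 12_438) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Start, mid-warmup, peak, halfway through the decay, and the final step.
for step in (0, 150, 300, 6_369, 12_438):
    print(f"step {step:>6}: lr = {linear_schedule_lr(step):.2e}")
```

With these numbers the warmup covers about 2.4% of training (300 of 12,438 steps), so almost the entire run is spent on the decaying part of the schedule.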

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
import torch

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 300,  # Try 2: Reduced from 500 for faster start
        num_train_epochs = 2,
        learning_rate = 2e-4,  # Try 2: Increased from 1e-4 for faster convergence
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 100,
        eval_strategy = "no",  # Disable validation for maximum speed
        save_strategy = "epoch",  # Save only at end of each epoch
        save_total_limit = 2,  # Keep only 2 checkpoints
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
        report_to = "none",
    ),
)

print("✅ Try 1.5 Trainer configured: 50K samples, LR 2e-4, Warmup 300 (SAFE & FAST!)")

✅ Try 1.5 Trainer configured: 50K samples, LR 2e-4, Warmup 300 (SAFE & FAST!)



## **Step 7: Start Training**

Train the model for **2 epochs** over **49,750 samples**.
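
The number of optimizer steps follows from the batch settings in Step 6. The sketch below is plain arithmetic and assumes the final partial batch of each epoch is rounded up to a full step, which is consistent with the total the training log reports.

```python
import math

num_samples = 49_750
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
num_epochs = 2

# One optimizer step consumes this many samples.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_samples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer steps per epoch: {steps_per_epoch:,}")
print(f"Total optimizer steps: {total_steps:,}")
```

This gives an effective batch size of 8 and 12,438 total steps.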

In [10]:
# Start training
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 49,750 | Num Epochs = 2 | Total steps = 12,438
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 83,886,080 of 8,114,147,328 (1.03% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 100 | 0.905 |
| 200 | 0.6118 |
| 300 | 0.6185 |
| 400 | 0.6277 |
| 500 | 0.6198 |
| 600 | 0.605 |
| 700 | 0.5956 |
| 800 | 0.6 |
| 900 | 0.5929 |
| 1000 | 0.5745 |


TrainOutput(global_step=12438, training_loss=0.435036080580161, metrics={'train_runtime': 30742.5538, 'train_samples_per_second': 3.237, 'train_steps_per_second': 0.405, 'total_flos': 1.8071066875691336e+18, 'train_loss': 0.435036080580161, 'epoch': 2.0})

## **Step 8: Prepare for Inference**

Prepare the trained model for faster inference and test on a validation example to verify it's working correctly.
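
Because the model was trained to emit the label immediately after `Output:` and a newline, the inference prompt should be exactly the training text with the label removed. The sketch below shows one way to check that property. The two templates are shortened, hypothetical stand-ins for the notebook's real prompts, and `is_prefix_consistent` is an illustrative helper name.

```python
# Shortened stand-ins for the notebook's templates (hypothetical, for illustration only).
train_template = "Question:\n{}\n\nSolution:\n{}\n\nStudent Answer:\n{}\n\nOutput:\n{}"
infer_template = "Question:\n{}\n\nSolution:\n{}\n\nStudent Answer:\n{}\n\nOutput:\n"


def is_prefix_consistent(train_tpl: str, infer_tpl: str, fields, label: str) -> bool:
    """Check that the inference prompt is exactly the training text minus the label."""
    train_text = train_tpl.format(*fields, label)
    infer_text = infer_tpl.format(*fields)
    return train_text == infer_text + label


fields = ("What is 2 + 2?", "2 + 2 = 4", "4")
print(is_prefix_consistent(train_template, infer_template, fields, "1"))

# A stray trailing space after "Output:" would break the match.
bad_template = infer_template.replace("Output:\n", "Output: \n")
print(is_prefix_consistent(train_template, bad_template, fields, "1"))
```

Even a small whitespace difference changes the tokens the model sees right before the label, so a check like this is cheap insurance before running inference on the full test set.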


In [11]:
# Prepare model for inference
FastLanguageModel.for_inference(model)

# Improved inference prompt (matching training prompt structure)
inference_prompt = """You are an expert mathematician verifying student answers.

Your task: Determine if the student's answer is correct by analyzing the problem, the provided solution's reasoning steps, and the student's answer.

Question:
{}

Correct Solution (with step-by-step reasoning):
{}

Student Answer:
{}

Based on the solution's reasoning, is the student's answer correct?
Respond with:
- 'True' if the answer is correct
- 'False' if the answer is incorrect

Output:
"""

# Test on a validation example
example = validation_dataset[10]
question = example["question"]
solution = example["solution"]
answer = example["answer"]

# Format the prompt
inputs = tokenizer(
    [inference_prompt.format(question, str(solution), str(answer))],
    return_tensors="pt",
    truncation=True,
    max_length=max_seq_length
).to("cuda")

# Generate prediction
outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
response = tokenizer.batch_decode(outputs)[0]

# Display results
print("#### QUESTION ####")
print(question)
print("\n#### SOLUTION ####")
solution_text = str(solution)  # Convert once so slicing and len() agree
print(solution_text[:200] + "..." if len(solution_text) > 200 else solution_text)
print("\n#### STUDENT ANSWER ####")
print(answer)
print("\n#### MODEL'S PREDICTION ####")
output_part = response.split("Output:\n")[-1]
print(output_part[:50])
print("\n#### CORRECT ANSWER ####")
print(example["is_correct"])

#### QUESTION ####
Compute $\cos 150^\circ$.

#### SOLUTION ####
For this problem, we can simply rely on Python's mathematics libraries.
<llm-code>
from math import cos, radians

rad = radians(150)
cos(rad)
</llm-code>
<llm-code-output>
-0.8660254037844387
</llm-co...

#### STUDENT ANSWER ####
-0.8660254037844387

#### MODEL'S PREDICTION ####
0<|end_of_text|>

#### CORRECT ANSWER ####
1


## **Step 9: Generate Submission File**

Generate predictions for all test samples and create the submission CSV file.
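
The cell below relies on constrained decoding: every token outside a small allowed set is masked to negative infinity before the greedy pick, so the model can only answer with one of the label tokens. The pure-Python sketch below illustrates that idea together with the first-character parsing rule. The toy vocabulary, its token ids, and the helper names `constrained_argmax` and `parse_label` are assumptions for illustration and do not reflect the real tokenizer.

```python
import math


def constrained_argmax(logits, allowed_ids):
    """Greedy pick among allowed token ids; all other ids are masked to -inf."""
    masked = [
        score if idx in allowed_ids else -math.inf
        for idx, score in enumerate(logits)
    ]
    return max(range(len(masked)), key=masked.__getitem__)


def parse_label(text: str) -> bool:
    """Map a decoded token to a boolean: '1' or 'True' -> True, anything else -> False."""
    first = text.strip()[:1].lower()
    return first in ("1", "t")


# Toy vocabulary (hypothetical ids): the unconstrained winner would be id 0 ("the").
vocab = {0: "the", 1: "True", 2: "False", 3: "1", 4: "0"}
logits = [9.0, 1.5, 0.5, 2.0, 3.5]
allowed = {1, 2, 3, 4}

best = constrained_argmax(logits, allowed)
print(vocab[best], parse_label(vocab[best]))
```

With the mask in place, a single generated token is enough to read off the prediction, which is why `max_new_tokens` can be reduced to 1.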

In [12]:
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
from transformers import LogitsProcessor, GenerationConfig
import torch

# Constrained Decoding: Force model to generate only "True"/"False" or "1"/"0" tokens
class AllowedTokensLogitsProcessor(LogitsProcessor):
    """Logits Processor that forces model to generate only allowed tokens"""
    def __init__(self, allowed_token_ids):
        self.allowed = set(allowed_token_ids)

    def __call__(self, input_ids, scores):
        # Set probability of disallowed tokens to -inf
        mask = torch.full_like(scores, float("-inf"))
        for tid in self.allowed:
            if tid < scores.shape[-1]:
                mask[..., tid] = 0.0
        return scores + mask

def get_allowed_token_ids(tokenizer):
    """Return token IDs for allowed tokens (True/False or 1/0)"""
    tokens = ["True", "False", "1", "0"]
    ids = []
    for token in tokens:
        token_id = tokenizer.convert_tokens_to_ids(token)
        if token_id != tokenizer.unk_token_id:
            ids.append(token_id)
    return ids

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

# Setup constrained decoding
allowed_ids = get_allowed_token_ids(tokenizer)
logits_proc = AllowedTokensLogitsProcessor(allowed_ids)

# Optimized Generation configuration
gen_config = GenerationConfig(
    max_new_tokens=1,        # Reduced from 8 to 1 (faster and more accurate)
    do_sample=False,          # Deterministic generation
    temperature=0.0,          # Has no effect here: do_sample=False means greedy decoding
    top_p=1.0,
    eos_token_id=[tokenizer.eos_token_id],
    pad_token_id=tokenizer.eos_token_id,
)

# Generate predictions for all test samples
print(f"Generating predictions for {len(test_dataset):,} test samples...")
allowed_tokens = [tokenizer.convert_ids_to_tokens(tid) for tid in allowed_ids if tid != tokenizer.unk_token_id]
print(f"Using constrained decoding with allowed tokens: {allowed_tokens}")

for i, example in enumerate(tqdm(test_dataset)):
    question = example["question"]
    solution = example["solution"]
    answer = example["answer"]  # Student answer

    # Format the prompt
    prompt = inference_prompt.format(question, str(solution), str(answer))
    inputs = tokenizer(
        [prompt],
        return_tensors="pt",
        truncation=True,
        max_length=max_seq_length
    ).to("cuda")

    # Generate prediction with constrained decoding
    outputs = model.generate(
        **inputs,
        generation_config=gen_config,
        logits_processor=[logits_proc],  # Apply constrained decoding
        use_cache=True
    )

    # Decode only newly generated tokens (more accurate parsing)
    new_tokens = outputs[0, inputs["input_ids"].shape[1]:]
    response_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

    # Check first character only (0 or 1, True/False)
    ch = response_text.strip()[:1] if response_text.strip() else "0"

    # Parse: "1" or first letter "T" of "True" means True
    if ch.lower() in ["1", "t"]:
        prediction = True
    else:
        prediction = False

    predictions.append(prediction)

# Create submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

# Save to CSV
submission.to_csv('submission.csv', index=False)

print(f"\n✅ Submission file created successfully!")
print(f"   Total predictions: {len(predictions):,}")
print(f"   True predictions: {sum(predictions):,}")
print(f"   False predictions: {len(predictions) - sum(predictions):,}")
print("\n📁 File saved as 'submission.csv'")
print("   You can now download this file and submit it to the Kaggle competition.")

Generating predictions for 10,000 test samples...
Using constrained decoding with allowed tokens: ['True', 'False', '1', '0']


100%|██████████| 10000/10000 [31:50<00:00,  5.24it/s]



✅ Submission file created successfully!
   Total predictions: 10,000
   True predictions: 3,042
   False predictions: 6,958

📁 File saved as 'submission.csv'
   You can now download this file and submit it to the Kaggle competition.


## **Step 10: Validate Submission File**

Validate the submission CSV file format before submitting to Kaggle.

**Checks**:
- File has correct columns (`ID`, `is_correct`)
- All values are `True` or `False`
- No duplicate IDs
- Exactly 10,000 rows


In [13]:
# Validate submission file format
# This helps catch errors before submitting to Kaggle

import pandas as pd

submission_path = 'submission.csv'
df = pd.read_csv(submission_path)

# Display first few rows
print("First 5 rows:")
print(df.head())

# Check format
print(f"\nRows: {len(df)}")
print(f"Unique IDs: {df['ID'].nunique()}")
print(f"\nValue counts:")
print(df['is_correct'].value_counts(dropna=False))

# Validate format
assert set(df['is_correct'].unique()).issubset({True, False}), "is_correct must only contain True/False"
assert df['ID'].nunique() == len(df), "ID column must have unique values"
assert len(df) == 10000, f"Expected 10,000 rows, got {len(df)}"

print("\n✅ Format validation passed!")
print("   Ready to submit to Kaggle competition.")


First 5 rows:
   ID  is_correct
0   0       False
1   1       False
2   2       False
3   3        True
4   4       False

Rows: 10000
Unique IDs: 10000

Value counts:
is_correct
False    6958
True     3042
Name: count, dtype: int64

✅ Format validation passed!
   Ready to submit to Kaggle competition.
