# Training Notebook

ECE-GY 7123 / CS-GY 6953 / Deep Learning - Fall '25 - Midterm Project

**Team:** Spline Reticulator

**Author/Member:** Thanh Do (qd2121@nyu.edu)

## Step 1: Install Necessary Libraries

Instead of the installation step from the starter notebook, I used Aryan's fix for installing the dependencies, posted at https://campuswire.com/c/GF164CBA5/feed/70:


In [None]:
# %%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
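
To make the version check in the install cell easier to follow, here is a minimal sketch of the same logic pulled out into a function. The helper name `pick_xformers` is hypothetical and not part of the original cell; the regex and the two pinned versions are copied from it.

```python
import re

def pick_xformers(torch_version: str) -> str:
    """Mirror the xformers pin selection from the install cell (illustrative helper)."""
    # Keep only the leading numeric part, e.g. "2.8.0+cu126" -> "2.8.0".
    v = re.match(r"[0-9.]{3,}", torch_version).group(0)
    return "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")

print(pick_xformers("2.8.0+cu126"))  # xformers==0.0.32.post2
print(pick_xformers("2.6.0+cu124"))  # xformers==0.0.29.post3
```

Any Torch version other than 2.8.0 falls through to the older pin, so the check is worth revisiting if Colab upgrades Torch.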



## Step 2: Defining Model and Constants

In this step, we will load the base model as required by the competition. I also defined a few global constants here so that we can easily modify the training hyperparameters as required.

In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # 1024 is not enough for some questions
dtype = None  # This will auto-detect the best data type for your GPU
# We have a lot of VRAM on the A100 so this might save some quantization time
load_in_4bit = False

# Some constants
global_seed = 313337
global_inference_prompt = """You are an expert mathematician and a meticulous verifier.
Your task is to evaluate a proposed solution to a math problem and determine if it's correct or not.
Carefully read the Question and the Solution. Determine if the Solution is a correct reasoning process to solve the Question.
Your response should be 'True' if the solution is correct, otherwise 'False'.
Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
"""
global_training_prompt = global_inference_prompt + "{}" # include the correct answer
global_checkpoint_path = "/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint_less_naive_8k5_steps"
global_max_steps = 8500 # Just right to use all of my remaining compute units :(

# Load the model and tokenizer from Hugging Face
# Note: We use the base model, not a 4-bit pre-quantized one,
# to ensure we start from the official weights.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B", # Competition-approved model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.12: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

## Step 3: Prepare the Dataset

Like the starter notebook, the process here involves three parts:

1.  **Loading**: We'll load the official competition dataset from Hugging Face.
2.  **Splitting**: The full dataset is massive. We shuffle the train split with our global seed, then hold out the **last 500 samples for validation** and use **the rest for training**.
3.  **Prompting**: We will format each data sample using the globally-defined training prompt.



In [None]:
from datasets import load_dataset

# Load the full training dataset
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")

# Shuffle the dataset for randomness and create our smaller splits
shuffled_dataset = full_dataset.shuffle(seed=global_seed)
n = len(full_dataset)
# Takes the last 500 samples for internal validation
n_validation = 500
train_dataset = shuffled_dataset.select(range(n - n_validation))
validation_dataset = shuffled_dataset.select(range(n - n_validation, n))
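
The split logic can be illustrated without downloading the dataset. The sketch below uses plain Python lists and `random.Random` as a stand-in for `Dataset.shuffle` and `Dataset.select`, so the actual ordering will differ from what `datasets` produces; the function name `holdout_split` and the toy size of 2,000 are assumptions for illustration only.

```python
import random

def holdout_split(n_total: int, n_validation: int, seed: int):
    """Shuffle indices with a fixed seed, then hold out the last n_validation."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # stand-in for Dataset.shuffle(seed=...)
    train = indices[: n_total - n_validation]
    validation = indices[n_total - n_validation :]
    return train, validation

train_idx, val_idx = holdout_split(n_total=2_000, n_validation=500, seed=313337)
print(len(train_idx), len(val_idx))        # 1500 500
print(set(train_idx).isdisjoint(val_idx))  # True
```

Because the seed is fixed, rerunning the cell gives the same two subsets, which keeps the validation accuracy comparable across runs.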

Here, I define the data cleanup helpers detailed in Section 5.3 of my midterm report. They are meant to be applied to the prompts during both training and inference.

In [None]:
import re

def round_floats_in_text(text: str, n_digits: int = 4) -> str:
    """
    Finds all floating-point numbers in a string and rounds them
    to a specified number of decimal places.
    """

    # Matches an optional minus sign, digits, a decimal point, and more digits.
    float_pattern = re.compile(r"[-]?\d+\.\d+")

    # match.group(0) is the matched text (e.g., "0.666666667")
    def replacer(match):
        # Convert the matched string to a float
        number = float(match.group(0))
        # Round it to 'n_digits'
        rounded_number = round(number, n_digits)
        # Convert it back to a string
        return str(rounded_number)

    return float_pattern.sub(replacer, text)

def clean_text(text: str) -> str:
  return round_floats_in_text(text)
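
To see what the cleanup does in practice, here is a compact restatement of `round_floats_in_text` applied to a few made-up strings. The sample sentences are hypothetical; the regex and the rounding logic are the same as in the cell above.

```python
import re

FLOAT_PATTERN = re.compile(r"[-]?\d+\.\d+")

def round_floats_in_text(text: str, n_digits: int = 4) -> str:
    # Replace every matched float with its rounded string form.
    return FLOAT_PATTERN.sub(
        lambda m: str(round(float(m.group(0)), n_digits)), text
    )

print(round_floats_in_text("The ratio is 0.666666667"))  # The ratio is 0.6667
print(round_floats_in_text("x = -3.14159"))              # x = -3.1416
print(round_floats_in_text("cost = 12.50 dollars"))      # cost = 12.5 dollars
print(round_floats_in_text("There are 42 apples"))       # There are 42 apples
```

Two side effects are worth keeping in mind: trailing zeros are dropped (`12.50` becomes `12.5`), and integers are left untouched because the pattern requires a decimal point.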

Prepare the training data:

In [None]:
# The instructional prompt template for training
training_prompt = global_training_prompt

# We must add an End Of Sequence (EOS) token to tell the model when a completion is finished.
EOS_TOKEN = tokenizer.eos_token

# This function formats our data samples into the prompt template.
def formatting_prompts_func(examples):
    questions = examples["question"]
    solutions = examples["solution"]
    outputs = examples["is_correct"]
    texts = []
    for question, solution, output in zip(questions, solutions, outputs):
        # Format the prompt and add the EOS token
        text = training_prompt.format(question, str(solution), str(output))
        # Clean data
        text = clean_text(text)
        text += EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

# Apply the formatting function to our training dataset
formatted_train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
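
With `batched=True`, the mapping function receives a dictionary of columns (lists) rather than a single row. The sketch below shows that shape on two made-up rows. The template is an abbreviated, hypothetical version of the real prompt, `"<eos>"` is a placeholder for `tokenizer.eos_token`, and the cleaning step is left out to keep it short.

```python
TEMPLATE = "Question:\n{}\nSolution:\n{}\nOutput:\n{}"  # abbreviated, hypothetical
EOS = "<eos>"  # placeholder for tokenizer.eos_token

def format_batch(examples: dict) -> dict:
    """Turn a batch of columns into one training string per row."""
    texts = []
    for question, solution, label in zip(
        examples["question"], examples["solution"], examples["is_correct"]
    ):
        texts.append(TEMPLATE.format(question, str(solution), str(label)) + EOS)
    return {"text": texts}

batch = {
    "question": ["What is 2 + 2?", "What is 3 * 3?"],
    "solution": ["2 + 2 = 4", "3 * 3 = 6"],
    "is_correct": [True, False],
}
for text in format_batch(batch)["text"]:
    print(repr(text))
```

Each string ends with the label (`True` or `False`) followed by the EOS marker, which is what teaches the model to stop right after the verdict.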

## **Step 4: Configure LoRA and Set Up the Trainer**


### **LoRA Configuration**

Instead of training the entire model (which has billions of parameters), we'll use a technique called **Lo**w-**R**ank **A**daptation (LoRA). 🎛️

Think of it like this: rather than rewriting an entire textbook, we're just adding small, efficient "sticky notes" (the LoRA adapters) to update the model's knowledge. This is much faster and requires significantly less memory.


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Seems to be a decent rank value
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 64, # A common practice is to set alpha = 2 * r
    lora_dropout = 0, # Unsloth will complain about perf if not zero
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = global_seed,
)

Unsloth 2025.10.12 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
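
As a sanity check on this configuration, the number of LoRA parameters can be worked out by hand. A rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below assumes the layer shapes from the public Llama 3.1 8B config (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these numbers are not stated in this notebook.

```python
# Assumed Llama 3.1 8B shapes (from the public model config, not this notebook).
HIDDEN = 4096         # hidden_size
INTERMEDIATE = 14336  # MLP intermediate_size
KV_DIM = 8 * 128      # key/value heads x head_dim (grouped-query attention)
N_LAYERS = 32
RANK, ALPHA = 32, 64  # values used in get_peft_model above

# (d_in, d_out) for each targeted projection.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # Matrix A is (r x d_in) and matrix B is (d_out x r).
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in SHAPES.values())
total = per_layer * N_LAYERS
print(per_layer)     # 2621440
print(total)         # 83886080
print(ALPHA / RANK)  # 2.0
```

The total matches the `Trainable parameters = 83,886,080` line that the trainer prints in Step 5, and `ALPHA / RANK` is the factor by which the adapter update is scaled in standard LoRA.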



### **SFTTrainer Setup**

Now we'll set up the `SFTTrainer` (Supervised Fine-tuning Trainer). This is the main tool from the `trl` library that will handle the entire training loop for us. We'll give it our model, tokenizer, dataset, and a set of training instructions, such as the batch size and number of epochs.

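I also changed some arguments to make better use of our A100 runtime: a larger per-device batch size, more dataloader workers, and PyTorch's AdamW optimizer. Note that `max_steps` takes precedence over `num_train_epochs`, so training stops after `global_max_steps` steps, well before a full epoch.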

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        # max_steps (set below) takes precedence over num_train_epochs
        num_train_epochs=1,
        # Use as much VRAM as possible
        per_device_train_batch_size = 32,
        gradient_accumulation_steps = 1,
        # Also try to make use of the system RAM. This does not seem to be the right knob, though
        dataloader_num_workers = 8,
        warmup_steps = 5,
        max_steps = global_max_steps,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 10,
        # Since we're not using quantization, just use PyTorch's regular (non-8-bit) AdamW
        optim = "adamw_torch",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = global_seed,
        output_dir = "outputs",
        report_to = "none",
    ),
)
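
To give a feel for what `warmup_steps = 5` and `lr_scheduler_type = "linear"` mean together, here is a minimal sketch of a linear warmup followed by linear decay to zero. It is written from my understanding of the shape of that schedule, not copied from `transformers`, so exact values inside the trainer may differ slightly; the function name `linear_lr` is hypothetical.

```python
def linear_lr(step: int, base_lr: float = 2e-4,
              warmup_steps: int = 5, total_steps: int = 8500) -> float:
    """Learning rate at a given optimizer step (illustrative only)."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 1, 5, 4250, 8500):
    print(step, linear_lr(step))
```

With only 5 warmup steps out of 8,500, the schedule is effectively a straight line from `2e-4` down to zero.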

## **Step 5: Start Training!**

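Now, we'll call the `train()` function on our `trainer` object to kick off fine-tuning. Passing `resume_from_checkpoint = True` makes the trainer continue from the latest checkpoint in `output_dir` rather than from step 0; it expects such a checkpoint to exist, so drop the argument on a first run.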
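
One way to make the training cell work for both a fresh run and a resumed one is to look for a checkpoint first. The helper below is a hypothetical, standard-library-only sketch that relies on the `checkpoint-<step>` folder naming used by the Hugging Face trainer; `transformers.trainer_utils.get_last_checkpoint` offers similar functionality.

```python
import os
import re
import tempfile

def find_last_checkpoint(output_dir: str):
    """Return the path of the highest-numbered checkpoint-<step> folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Demo on a throwaway directory with three fake checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 7500, 1000):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    print(os.path.basename(find_last_checkpoint(tmp)))         # checkpoint-7500
    print(find_last_checkpoint(os.path.join(tmp, "missing")))  # None
```

The result could then be passed as `trainer.train(resume_from_checkpoint = find_last_checkpoint("outputs"))`, where `None` means starting from scratch.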


In [None]:
trainer.train(resume_from_checkpoint = True)

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 999,500 | Num Epochs = 1 | Total steps = 8,500
O^O/ \_/ \    Batch size per device = 32 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (32 x 1 x 1) = 32
 "-____-"     Trainable parameters = 83,886,080 of 8,114,147,328 (1.03% trained)


Step,Training Loss
7510,0.3736
7520,0.3638
7530,0.3525
7540,0.3747
7550,0.3624
7560,0.3746
7570,0.3368
7580,0.3553
7590,0.3511
7600,0.3537


Unsloth: Will smartly offload gradients to save VRAM!


TrainOutput(global_step=8500, training_loss=0.041673962957718795, metrics={'train_runtime': 5513.8415, 'train_samples_per_second': 49.33, 'train_steps_per_second': 1.542, 'total_flos': 9.016436772337877e+18, 'train_loss': 0.041673962957718795, 'epoch': 0.2721306226988955})
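
The `epoch` value in the output above can be reproduced from the numbers in the training banner. This is a small arithmetic sketch; the variable names are mine, and the inputs are the example count, batch size, and step count printed by the trainer.

```python
import math

num_examples = 999_500  # "Num examples" from the banner
batch_size = 32         # per-device batch size, single GPU
grad_accum = 1          # gradient accumulation steps
max_steps = 8_500       # global_max_steps

steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
samples_seen = max_steps * batch_size * grad_accum
epoch_fraction = max_steps / steps_per_epoch

print(steps_per_epoch)           # 31235
print(samples_seen)              # 272000
print(round(epoch_fraction, 4))  # 0.2721
```

So the run covers roughly 27% of the training split, i.e. about 272,000 of the 999,500 examples.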


## **Step 6: Inference and Evaluation**

Now that our model is trained, we need to test it on our validation set. We'll use a slightly different prompt for inference, one where the `Output:` section is left blank for the model to complete.

Here, we'll run the model over our internal validation set to obtain a "validation accuracy" value, which we can use for evaluating our current approach.

In [None]:
# Prepare the model for faster inference
FastLanguageModel.for_inference(model)

# Create the prompt template for inference (no answer included)
inference_prompt = global_inference_prompt

# Evaluate accuracy on our validation set.
# Note: unlike training and the test-set loop in Step 8, this loop does not
# pass the prompt through clean_text().
num_samples = len(validation_dataset)
count_correct = 0
for i in range(num_samples):
    example = validation_dataset[i]
    question = example["question"]
    solution = example["solution"]
    prompt = inference_prompt.format(question, str(solution))
    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens = 8, use_cache = True)
    response = tokenizer.batch_decode(outputs)
    # Keep only the text the model generated after the "Output:" marker
    prediction: str = response[0].split("Output:\n")[1]
    if prediction.startswith(str(example["is_correct"])):
        count_correct += 1
print("Validation Accuracy =", count_correct / num_samples)

Validation Accuracy = 0.892
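
Since the validation set has only 500 samples, it helps to know how much this number can move by chance. The sketch below applies the usual normal approximation for a binomial proportion; treating the samples as independent draws is an assumption, and the interval is a rough guide rather than an exact bound.

```python
import math

n = 500           # validation samples
accuracy = 0.892  # value printed above

correct = round(accuracy * n)
std_err = math.sqrt(accuracy * (1 - accuracy) / n)
low, high = accuracy - 1.96 * std_err, accuracy + 1.96 * std_err

print(correct)                        # 446
print(round(std_err, 4))              # 0.0139
print(round(low, 3), round(high, 3))  # 0.865 0.919
```

In other words, differences of a point or two between runs on this validation set are within noise.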


## **Step 7: Save the Model to Drive & Reload**



Save the model checkpoint to Google Drive, load the model from the checkpoint, and generate the final submission CSV file.

### Mount google drive

#### Subtask:
Mount Google Drive to save the model checkpoint.





In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Save model checkpoint

#### Subtask:
Save the trained model checkpoint to the specified path in Google Drive.


**Reasoning**:
Define the save path and save the model and tokenizer to Google Drive.



In [None]:
import os

# Define the path to save the model checkpoint in Google Drive
save_path = global_checkpoint_path

# Create the directory if it doesn't exist
os.makedirs(save_path, exist_ok=True)

# Save the model and tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model checkpoint and tokenizer saved to: {save_path}")

Model checkpoint and tokenizer saved to: /content/drive/MyDrive/llama3_8b_math_verifier_checkpoint_less_naive_8k5_steps


### Load model from checkpoint

#### Subtask:
Load the model from the saved checkpoint.


**Reasoning**:
Load the model and tokenizer from the saved checkpoint path in Google Drive and prepare the model for inference.



In [None]:
# Define the path where the model checkpoint was saved in Google Drive
save_path = global_checkpoint_path

# Load the model and tokenizer from the saved path
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Prepare the loaded model for faster inference
FastLanguageModel.for_inference(model)

print(f"Model and tokenizer loaded from: {save_path}")

==((====))==  Unsloth 2025.10.12: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Model and tokenizer loaded from: /content/drive/MyDrive/llama3_8b_math_verifier_checkpoint_less_naive_8k5_steps


## Step 8: Generate submission file

### Subtask:
Generate the submission CSV file using the loaded model.


**Reasoning**:
Generate the submission CSV file by iterating through the test dataset, generating predictions using the loaded model, and saving the results to a pandas DataFrame.



In [None]:
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

# Create the prompt template for inference (no answer included)
inference_prompt = global_inference_prompt

# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Find the text after "Output:"
    output_part = response_text.split("Output:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    if 'true' in output_part.lower():
        return True
    return False

# Loop through the test dataset and generate a prediction for each example
for example in tqdm(test_dataset):
    question = example["question"]
    solution = example["solution"]

    # Format the prompt
    prompt = inference_prompt.format(question, str(solution))
    # Use the same data cleaning step that we used previously during training
    prompt = clean_text(prompt)
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate the prediction
    outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
    response_text = tokenizer.batch_decode(outputs)[0]

    # Parse the prediction and add it to our list
    prediction = parse_output(response_text)
    predictions.append(prediction)

# Create the submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created successfully!")
print("You can now download this file and submit it to the Kaggle competition.")

100%|██████████| 10000/10000 [31:28<00:00,  5.30it/s]



Submission file 'submission.csv' created successfully!
You can now download this file and submit it to the Kaggle competition.
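
The validation loop in Step 6 and `parse_output` in this step read the model's answer in slightly different ways. The sketch below puts both rules side by side on made-up decoded strings; the function names and the sample strings are hypothetical, and `<|end_of_text|>` stands in for whatever end-of-sequence text the tokenizer decodes to.

```python
def parse_startswith(response_text: str, label: bool) -> bool:
    # Step 6 rule: the generated text must start with the expected label.
    prediction = response_text.split("Output:\n")[1]
    return prediction.startswith(str(label))

def parse_contains_true(response_text: str) -> bool:
    # Step 8 rule: predict True if "true" appears anywhere after the marker.
    output_part = response_text.split("Output:\n")[-1]
    return "true" in output_part.lower()

clean_true = "Question:\n...\nOutput:\nTrue<|end_of_text|>"
clean_false = "Question:\n...\nOutput:\nFalse<|end_of_text|>"
rambling = "Question:\n...\nOutput:\nFalse. It is not true"

print(parse_contains_true(clean_true))         # True
print(parse_contains_true(clean_false))        # False
print(parse_contains_true(rambling))           # True
print(parse_startswith(rambling, label=True))  # False
```

On clean outputs the two rules agree. They can diverge when the model keeps generating after its verdict, as in the third string, so the substring rule is the more permissive of the two.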
