# Reproducer Notebook

ECE-GY 7123 / CS-GY 6953 / Deep Learning - Fall '25 - Midterm Project

**Team:** Spline Reticulator

**Author/Member:** Thanh Do (qd2121@nyu.edu)

**Purpose:** Regenerate the validation accuracy and the Kaggle submission file from the weights saved on Google Drive.

## Step 1: Install Necessary Libraries

First, we need to install the required Python libraries. We'll be using the unsloth library, which provides highly efficient, memory-saving training methods for large language models, making it possible to fine-tune powerful models on a single free-tier GPU. We'll also install xformers for further optimization.


In [None]:
# %%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
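
Because the cell above pins `transformers` and `trl`, it can help to confirm which versions actually ended up installed before loading the checkpoint. The following is a minimal sketch using only the standard library; the helper name `report_versions` and the package list are illustrative and not part of the original notebook.

```python
from importlib import metadata

def report_versions(packages):
    """Return {package: installed version, or None if the package is missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# The package list is an assumption; adjust it to what the install cell pulled in.
for name, version in report_versions(["unsloth", "transformers", "trl", "xformers"]).items():
    print(f"{name:<14}{version or 'not installed'}")
```

If the printed `transformers` or `trl` version differs from the pinned one, restarting the runtime after installation is usually the first thing to try.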



## Step 2: Defining Model and Constants

Set `global_checkpoint_path` to the path of the weights folder in your Google Drive. Note that the prompts are not saved as part of the checkpoint, so you must keep track of which prompt each checkpoint was trained with and set it here yourself.



In [None]:
from unsloth import FastLanguageModel
import torch

global_checkpoint_path = "/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint_naive"

max_seq_length = 2048  # 1024 is not enough for some questions
dtype = None  # This will auto-detect the best data type for your GPU
# We have a lot of VRAM on the A100 so this might save some quantization time
load_in_4bit = False

# Some constants
generate_submission_file = True # Run on the actual test dataset & create CSV
global_seed = 313337
global_inference_prompt = """You are an expert mathematician and a meticulous verifier.
Your task is to evaluate a proposed solution to a math problem and determine if it's correct or not.
Carefully read the Question and the Solution. Determine if the Solution is a correct reasoning process to solve the Question.
Your response should be 'True' if the solution is correct, otherwise 'False'.
Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
"""
global_training_prompt = global_inference_prompt + "{}" # include the correct answer

==((====))==  Unsloth 2025.10.12: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!




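The two templates differ only in the trailing `{}` slot that holds the label. The sketch below uses a shortened stand-in template (the `demo_*` names are hypothetical) to show how each one is filled. The label is written as `str(True)` here on the assumption that training used the same string form that the validation cell compares against.

```python
# Shortened stand-in for global_inference_prompt (assumption: same slot layout).
demo_inference_prompt = "Question:\n{}\nSolution:\n{}\nOutput:\n"
demo_training_prompt = demo_inference_prompt + "{}"  # extra slot for the label

# Braces inside the arguments are left untouched; only the template is parsed.
question = r"Simplify \frac{6}{8}."
solution = "6/8 = 3/4"

inference_text = demo_inference_prompt.format(question, solution)
training_text = demo_training_prompt.format(question, solution, str(True))

print(inference_text)
print(training_text)
```

Since `str.format` only interprets placeholders in the template itself, LaTeX braces in a question or solution do not need escaping.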

## Step 3: Prepare the Dataset


In [None]:
from datasets import load_dataset

# Load the full training dataset
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")

# Shuffle with a fixed seed so that the validation split is reproducible
shuffled_dataset = full_dataset.shuffle(seed=global_seed)
n = len(full_dataset)
# Take the last 500 shuffled samples for internal validation
n_validation = 500
validation_dataset = shuffled_dataset.select(range(n - n_validation, n))
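
A validation accuracy is easier to interpret next to the majority-class baseline of the same split. The sketch below computes that baseline from a list of labels; `majority_baseline` is a hypothetical helper, and the toy labels stand in for the `is_correct` column of `validation_dataset`.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    most_common_label, most_common_count = counts.most_common(1)[0]
    return most_common_label, most_common_count / len(labels)

# Toy labels; in the notebook this could be validation_dataset["is_correct"].
toy_labels = [True, False, False, True, False]
label, baseline = majority_baseline(toy_labels)
print(f"Always predicting {label} scores {baseline:.2f}")
```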

Lightweight cleaner to round floats in text:

In [None]:
import re

def round_floats_in_text(text: str, n_digits: int = 4) -> str:
    """
    Finds all floating-point numbers in a string and rounds them
    to a specified number of decimal places.
    """

    # Matches an optional minus sign, digits, a decimal point, and more digits.
    float_pattern = re.compile(r"[-]?\d+\.\d+")

    # match.group(0) is the matched text (e.g., "0.666666667")
    def replacer(match):
        # Convert the matched string to a float
        number = float(match.group(0))
        # Round it to 'n_digits'
        rounded_number = round(number, n_digits)
        # Convert it back to a string
        return str(rounded_number)

    return float_pattern.sub(replacer, text)

def clean_text(text: str) -> str:
  return round_floats_in_text(text)
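
To make the effect of the cleaner concrete, the sketch below restates it compactly (under the hypothetical name `round_floats`, so the block runs on its own) and applies it to a few sample strings.

```python
import re

_FLOAT = re.compile(r"-?\d+\.\d+")

def round_floats(text, n_digits=4):
    # Compact restatement of round_floats_in_text for illustration only.
    return _FLOAT.sub(lambda m: str(round(float(m.group(0)), n_digits)), text)

print(round_floats("x = 0.666666667"))  # x = 0.6667
print(round_floats("total 2.50 USD"))   # total 2.5 USD
print(round_floats("3 apples"))         # 3 apples
```

Note that integers are left alone and that trailing zeros are dropped, because each match goes through `float` and back to `str`.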

## Mount Google Drive

### Subtask:
Mount Google Drive so the saved model checkpoint can be loaded.


**Reasoning**:
The checkpoint at `global_checkpoint_path` lives in Google Drive, so the drive must be mounted before the model is loaded.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
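
Once the drive is mounted, a quick check of the checkpoint folder can catch a wrong path before the slower model load starts. This is a minimal sketch: `checkpoint_problems` is a hypothetical helper, and the expected file names assume a Hugging Face style folder holding either a full model (`config.json`) or a PEFT adapter (`adapter_config.json`) with safetensors weights.

```python
from pathlib import Path

def checkpoint_problems(checkpoint_dir):
    """Return a list of problems found; an empty list means the folder looks usable."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return [f"{root} is not a directory (is Google Drive mounted?)"]
    problems = []
    # Assumption: a full model ships config.json, a PEFT adapter ships adapter_config.json.
    if not any((root / name).is_file() for name in ("config.json", "adapter_config.json")):
        problems.append("neither config.json nor adapter_config.json found")
    # Assumption: weights are stored as safetensors files.
    if not any(root.glob("*.safetensors")):
        problems.append("no *.safetensors weight files found")
    return problems

print(checkpoint_problems("/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint_naive"))
```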


## Load model from checkpoint

### Subtask:
Load the model from the saved checkpoint.


**Reasoning**:
Load the model and tokenizer from the saved checkpoint path in Google Drive and prepare the model for inference.



In [None]:
# Define the path where the model checkpoint was saved in Google Drive
save_path = global_checkpoint_path

# Load the model and tokenizer from the saved path
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Prepare the loaded model for faster inference
FastLanguageModel.for_inference(model)

print(f"Model and tokenizer loaded from: {save_path}")


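With `dtype = None`, the loader chooses a precision for the available GPU. The Unsloth banner in this notebook reports `Bfloat16 = FALSE` for the Tesla T4 (CUDA capability 7.5), which is consistent with the general rule that bfloat16 needs compute capability 8.0 or newer. The helper below is a hypothetical illustration of that rule, not Unsloth's actual code.

```python
def pick_dtype_name(compute_capability):
    """Map a (major, minor) CUDA compute capability to a half-precision dtype name."""
    major, _minor = compute_capability
    # bfloat16 is natively supported from Ampere (capability 8.0) onwards.
    return "bfloat16" if major >= 8 else "float16"

print(pick_dtype_name((7, 5)))  # Tesla T4 -> float16
print(pick_dtype_name((8, 0)))  # A100     -> bfloat16
```

In a live session the capability tuple can be read with `torch.cuda.get_device_capability()`.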
In [None]:
# Create the prompt template for inference (no answer included)
inference_prompt = global_inference_prompt

# Evaluate accuracy on our validation set
num_samples = len(validation_dataset)
count_correct = 0
for i in range(num_samples):
  example = validation_dataset[i]
  question = example["question"]
  solution = example["solution"]
  inputs = tokenizer(
  [
      inference_prompt.format(question, str(solution))
  ], return_tensors = "pt").to("cuda")
  outputs = model.generate(**inputs, max_new_tokens = 8, use_cache = True)
  response = tokenizer.batch_decode(outputs)
  # Take the text after the last "Output:" marker, in case the question or solution contains one
  prediction: str = response[0].split("Output:\n")[-1]
  if prediction.startswith(str(example["is_correct"])):
    count_correct += 1
print("Validation Accuracy =", count_correct / num_samples)
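
A single accuracy figure hides whether the errors are false positives or false negatives. The sketch below tallies (label, prediction) pairs from decoded outputs. The helper names and the toy strings are hypothetical, and as a simplification any answer that does not start with `True` is treated as a `False` prediction, which is slightly more lenient than the loop above.

```python
from collections import Counter

def extract_label(decoded, marker="Output:\n"):
    """Return the text generated after the last marker, stripped of whitespace."""
    return decoded.split(marker)[-1].strip()

def tally(decoded_outputs, labels):
    """Count (label, predicted) pairs; predicted is True only if the answer starts with 'True'."""
    counts = Counter()
    for decoded, label in zip(decoded_outputs, labels):
        predicted = extract_label(decoded).startswith("True")
        counts[(label, predicted)] += 1
    return counts

# Toy decoded strings standing in for tokenizer.batch_decode(outputs)[0].
decoded_outputs = [
    "Question:\n1+1\nSolution:\n2\nOutput:\nTrue",
    "Question:\n1+1\nSolution:\n3\nOutput:\nFalse",
    "Question:\n2*2\nSolution:\n5\nOutput:\nTrue",
]
labels = [True, False, False]
counts = tally(decoded_outputs, labels)
accuracy = sum(n for (label, predicted), n in counts.items() if label == predicted) / len(labels)
print(dict(counts), f"accuracy={accuracy:.3f}")
```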

## Generate submission file

### Subtask:
Generate the submission CSV file using the loaded model.


**Reasoning**:
Generate the submission CSV file by iterating through the test dataset, generating predictions using the loaded model, and saving the results to a pandas DataFrame.



In [None]:
if generate_submission_file:

  import pandas as pd
  from tqdm import tqdm
  from datasets import load_dataset

  # Load the official test set
  test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
  predictions = []

  # Create the prompt template for inference (no answer included)
  inference_prompt = global_inference_prompt

  # A simple function to parse 'True' or 'False' from the model's raw output
  def parse_output(response_text):
      # Find the text after "Output:"
      output_part = response_text.split("Output:\n")[-1]
      # Check if "True" is in that part, case-insensitively
      if 'true' in output_part.lower():
          return True
      return False

  # Loop through the test dataset and generate a prediction for each example
  for example in tqdm(test_dataset):
      question = example["question"]
      solution = example["solution"]

      # Format the prompt
      prompt = inference_prompt.format(question, str(solution))
      # Use the same data cleaning step that we used previously during training
      prompt = clean_text(prompt)
      inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

      # Generate the prediction
      outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
      response_text = tokenizer.batch_decode(outputs)[0]

      # Parse the prediction and add it to our list
      prediction = parse_output(response_text)
      predictions.append(prediction)

  # Create the submission DataFrame
  submission = pd.DataFrame({
      'ID': range(len(predictions)),
      'is_correct': predictions
  })

  # Save the DataFrame to a CSV file
  # Add a suffix to denote which weight file we're using
  suffix = global_checkpoint_path.split('/')[-1]
  submission.to_csv(f'submission_{suffix}.csv', index=False)

  # Calculate the submission hash for quick comparison
  import hashlib
  with open(f'submission_{suffix}.csv', 'rb') as f:
      submission_hash = hashlib.sha1(f.read()).hexdigest()
  print(f"Submission SHA1 hash: {submission_hash}")

  print(f"\nSubmission file 'submission_{suffix}.csv' created successfully!")
  print("You can now download this file and submit it to the Kaggle competition.")
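
Before uploading, it may be worth checking the row count, the column names, and the label distribution of the file, alongside the SHA-1 hash. The following is a minimal sketch using only the standard library; `summarize_submission` is a hypothetical helper, and the demo file stands in for the real `submission_{suffix}.csv`.

```python
import csv
import hashlib
import os
import tempfile
from collections import Counter

def summarize_submission(csv_path):
    """Return (row_count, label_counts, sha1_hex) for a CSV with ID,is_correct columns."""
    with open(csv_path, "rb") as f:
        sha1_hex = hashlib.sha1(f.read()).hexdigest()
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["ID", "is_correct"]:
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        labels = [row["is_correct"] for row in reader]
    return len(labels), Counter(labels), sha1_hex

# Demo on a tiny stand-in file; point csv_path at the real submission instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "submission_demo.csv")
    with open(demo_path, "w", newline="") as f:
        f.write("ID,is_correct\n0,True\n1,False\n2,True\n")
    n_rows, label_counts, digest = summarize_submission(demo_path)

print(n_rows, dict(label_counts), digest[:8])
```

A row count that differs from the size of the test split, or a label distribution that is entirely one class, usually points to a parsing problem rather than a model problem.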