# Project Summary: Fine-Tuning an LLM for Mathematical Problem Classification

The core objective of this project is to fine-tune a small, efficient Large Language Model (LLM) to classify mathematical word problems into three distinct categories based on their solvability.

The methodology is divided into three main phases:

### 1. Rigorous Dataset Generation via Code Formalization

The primary challenge is creating a high-quality, verifiably correct dataset. This is addressed by converting each natural language math problem (from a source like GSM8K) into a parameterized Python function.

*   A powerful generator LLM is used to translate the problem's text and step-by-step solution into a generalized `solve()` function.
*   This function acts as a formal, executable representation of the problem's underlying logic.
*   By making the problem's numerical values the function's arguments, the logic becomes testable and easy to manipulate.
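
As a minimal sketch of this idea, the block below formalizes a hypothetical toy problem (not taken from GSM8K) in the same style as the generated functions: every number stated in the text becomes a keyword argument with its original value as the default. The problem wording, argument names, and index are assumptions made for illustration only.

```python
def solve(
    num_boxes: int = 3,        # "A shop has 3 boxes"
    apples_per_box: int = 12,  # "each box holds 12 apples"
    apples_sold: int = 10,     # "10 apples are sold"
):
    """Index: 0.
    Returns: the number of apples left in the shop.
    """
    #: L1
    total_apples = num_boxes * apples_per_box

    #: L2
    answer = total_apples - apples_sold  # FINAL ANSWER
    return answer


# Default call reproduces the answer of the original problem text.
default_answer = solve()
# Overriding one argument re-runs the same logic on perturbed inputs.
variant_answer = solve(apples_sold=40)
print(default_answer, variant_answer)
```

Because the logic is separated from the literals, a single override is enough to produce a variant whose result can be checked programmatically.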

### 2. Creating a Labeled Dataset with Three Solvability Classes

Using the verified Python functions from Phase 1, the final labeled dataset is constructed by programmatically modifying the original problems to fit into one of three classes:

*   **Class 1: Has a Unique Solution**
    *   This is the original, verified problem where all parameters are defined, leading to a single correct answer.

*   **Class 2: Has Multiple Solutions**
    *   Generated by taking a Class 1 problem and removing a key piece of numerical information from the problem statement. This makes the problem underspecified, as different values for the now-missing parameter would lead to different valid solutions.

*   **Class 0: Has No Solution**
    *   Generated by manipulating the parameters of the Python function to yield a logically or physically absurd result (e.g., a negative count of objects) or by introducing a direct contradiction into the problem statement.
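
The three labeling rules above can be sketched as a small helper. This is an illustrative assumption rather than the project's actual labeling code: `label_variant`, the `removed` argument, and the "negative result is absurd" test are hypothetical and only cover the simplest cases described in the list.

```python
from typing import Callable, Dict, Optional

# Class ids as used in the list above.
UNIQUE, MULTIPLE, NO_SOLUTION = 1, 2, 0


def label_variant(
    solve: Callable[..., float],
    overrides: Dict[str, float],
    removed: Optional[str] = None,
) -> int:
    """Assign a solvability class to one variant of a problem."""
    if removed is not None:
        # A value deleted from the problem text leaves the answer
        # underdetermined, so the variant is underspecified.
        return MULTIPLE
    result = solve(**overrides)
    # Assumption: the answer is a count or an amount, so a negative
    # value is treated as logically absurd.
    return NO_SOLUTION if result < 0 else UNIQUE


def apples_left(total: int = 36, sold: int = 10) -> int:
    """Hypothetical formalized problem used only for this demo."""
    return total - sold


print(label_variant(apples_left, {}))                  # original problem
print(label_variant(apples_left, {"sold": 40}))        # absurd parameters
print(label_variant(apples_left, {}, removed="sold"))  # missing value
```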

### 3. Fine-Tuning the Classifier LLM

The resulting dataset, with its high-confidence labels, is used to fine-tune a smaller, more efficient LLM. The final model will be trained to take a new math problem as input and output its classification (Class 1, 2, or 0), having learned the underlying patterns of solvability, ambiguity, and contradiction from the generated data.
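
One way to serialize the labeled problems for fine-tuning is one JSON object per line. The sketch below assumes a prompt/completion record layout; the field names and the `to_record` helper are hypothetical, since the training format is not specified here.

```python
import json

# Class ids from Phase 2 mapped to readable names (for validation only).
LABEL_NAMES = {1: "unique solution", 2: "multiple solutions", 0: "no solution"}


def to_record(question: str, label: int) -> str:
    """Serialize one labeled problem as a single JSONL line."""
    if label not in LABEL_NAMES:
        raise ValueError(f"unknown class id: {label}")
    return json.dumps({"prompt": question.strip(), "completion": str(label)})


line = to_record("  A shop has some boxes of 12 apples. How many apples?  ", 2)
print(line)
```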

In [1]:
from code_generator import *

indices = [310, 3822, 7371]

In [3]:
code_strings = get_code_strings(indices)
for index, code_string in code_strings.items():
    print(f"Index: {index}")
    print("Code String:")
    print(code_string)
    print("-" * 40)

Index: 310
Code String:
def solve(
        num_employees: int = 6, # Janet hires six employees
        num_warehouse_workers: int = 4, # Four of them are warehouse workers
        num_managers: int = 2, # the other two are managers
        hourly_wage_warehouse: int = 15, # warehouse workers make $15/hour
        hourly_wage_manager: int = 20, # managers make $20/hour
        fica_tax_rate: float = 0.1, # FICA tax rate is 10%
        days_per_month: int = 25, # everyone works 25 days a month
        hours_per_day: int = 8 # everyone works 8 hours a day
    ):
    """Index: 310.
    Returns: the monthly total wage bill, including FICA taxes.
    """
    #: L1
    hours_per_month = days_per_month * hours_per_day

    #: L2
    monthly_wage_warehouse = hourly_wage_warehouse * hours_per_month

    #: L3
    total_wage_warehouse = monthly_wage_warehouse * num_warehouse_workers

    #: L4
    monthly_wage_manager = hourly_wage_manager * hours_per_month

    #: L5
    total_wage_manager = mon

In [4]:
example_with_code = format_prompt_query(index=310, 
                    code_strings=code_strings, 
                    with_code=True)
print(example_with_code)

*Index*: 
310

*Question*: 
Janet hires six employees. Four of them are warehouse workers who make $15/hour, and the other two are managers who make $20/hour. Janet has to pay 10% of her workers' salaries in FICA taxes. If everyone works 25 days a month and 8 hours a day, how much does Janet owe total for their wages and taxes for one month?

*Solution*: 
{"L1": "First figure out how many hours each worker works per month by multiplying the number of days they work by the number of hours a day they work: 25 days * 8 hours/day = [[25*8=200]]200 hours", "L2": "Then calculate how much one warehouse worker makes per month by multiplying their hourly rate by the number of hours they work: 200 hours * $15/hour = $[[200*15=3000]]3000", "L3": "Then multiply that number by 4 to find out how much all the warehouse workers make: $3000/worker * 4 workers = $[[3000*4=12000]]12,000", "L4": "Now multiply the hours each manager works (also 200) by their hourly wage to find out how much one manager mak

In [5]:
user_prompt = craft_user_prompt(
    index=4483, 
    example_indices=indices,
    code_examples=get_code_strings(indices)
)
print(user_prompt)

### Guidelines

0. **Output wrapping**
   Return the code inside a single ```python … ``` block, and nothing else.

1.  **Function Naming & Docstring:** The function must be named `solve`. It must begin with a docstring that has exactly two lines:
    *   The first line must be: "Index: [Index]." using the index from the task header.
    *   The second line must be a succinct, one-sentence description of what the function returns (e.g., "Returns: the total cost of wages and taxes.").

2.  **Function Arguments:** The function arguments must be derived from the 'Question' text. 
    *   Create a distinct argument for every numerical value that is directly stated in the text.
    *   The arguments should be created **in the same order in which they appear in the question**.
    *   **Note:** Some of these arguments may end up not being used in the function body. This is expected. Do not worry about this and leave the unused arguments in the function signature.

3.  **Argument Formatting:*

In [6]:
model_dict = \
{
  "anthropic": [
    "claude-3-5-haiku-20241022"
  ],
  "openai": [
    "o4-mini",
    "gpt-4.1-mini"
  ],
  "google": [
    "gemini-2.0-flash-thinking-exp",
    "gemini-2.5-flash-lite-preview-06-17"
  ]
}

In [7]:
df = generate_GSM8K_code(
    model_dict=model_dict,
    indices_to_generate=[3779, 4483, 6237, 1202, 2345],
    example_indices=indices)


Output directory: code_generation_outputs/3779
Crafting user prompt...

--- Calling Anthropic model: claude-3-5-haiku-20241022 ---
  Response received in 5.71 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/anthropic_claude-3-5-haiku-20241022.txt

--- Calling Openai model: o4-mini ---
  Response received in 28.46 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/openai_o4-mini.txt

--- Calling Openai model: gpt-4.1-mini ---
  Response received in 5.11 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/openai_gpt-4.1-mini.txt

--- Calling Google model: gemini-2.0-flash-thinking-exp ---
  Response received in 3.49 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/google_gemini-2.0-flash-thinking-exp.txt

--- Calling Google model: gemini-2.5-flash-lite-preview-06-17 ---
  Response received in 1.04 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/google_gemini-2.5-flas
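
The source of `generate_GSM8K_code` is not shown in this notebook. Judging only from the log above, it loops over providers and models, times each call, and saves the raw response to `<output dir>/<index>/<provider>_<model>.txt`. The sketch below mirrors that behaviour under those assumptions; `run_models` and the `call_model` callback are hypothetical stand-ins, and no real API is called.

```python
import tempfile
import time
from pathlib import Path
from typing import Callable, Dict, List


def run_models(
    model_dict: Dict[str, List[str]],
    index: int,
    call_model: Callable[[str, str], str],
    out_root: str = "code_generation_outputs",
) -> List[dict]:
    """Call every model once and save each raw response to disk."""
    out_dir = Path(out_root) / str(index)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = []
    for provider, models in model_dict.items():
        for model in models:
            start = time.perf_counter()
            text = call_model(provider, model)  # hypothetical API wrapper
            elapsed = time.perf_counter() - start
            (out_dir / f"{provider}_{model}.txt").write_text(text, encoding="utf-8")
            records.append(
                {"provider": provider, "model": model,
                 "index": index, "time_taken": elapsed}
            )
    return records


# Demo with a stub instead of a real model call, written to a temp directory.
tmp_root = tempfile.mkdtemp()
records = run_models(
    {"openai": ["o4-mini"]}, 3779,
    call_model=lambda provider, model: "stub output",
    out_root=tmp_root,
)
saved = Path(tmp_root) / "3779" / "openai_o4-mini.txt"
print(saved.read_text(encoding="utf-8"), records[0]["model"])
```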

In [8]:
df

|   | provider | model | index | time_taken |
|---|---|---|---|---|
| 0 | anthropic | claude-3-5-haiku-20241022 | 3779 | 5.712822 |
| 1 | openai | o4-mini | 3779 | 28.46453 |
| 2 | openai | gpt-4.1-mini | 3779 | 5.113441 |
| 3 | google | gemini-2.0-flash-thinking-exp | 3779 | 3.486094 |
| 4 | google | gemini-2.5-flash-lite-preview-06-17 | 3779 | 1.041422 |
| 5 | anthropic | claude-3-5-haiku-20241022 | 4483 | 5.092986 |
| 6 | openai | o4-mini | 4483 | 17.325825 |
| 7 | openai | gpt-4.1-mini | 4483 | 2.743717 |
| 8 | google | gemini-2.0-flash-thinking-exp | 4483 | 3.169473 |
| 9 | google | gemini-2.5-flash-lite-preview-06-17 | 4483 | 1.048883 |
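
To compare models at a glance, the timings can be averaged per model. The sketch below uses only the standard library and the ten rows shown above (copied by hand); `mean_latency` is a hypothetical helper, not part of `code_generator`.

```python
from collections import defaultdict
from statistics import mean

# (model, index, time_taken in seconds) copied from the table above.
rows = [
    ("claude-3-5-haiku-20241022", 3779, 5.712822),
    ("o4-mini", 3779, 28.46453),
    ("gpt-4.1-mini", 3779, 5.113441),
    ("gemini-2.0-flash-thinking-exp", 3779, 3.486094),
    ("gemini-2.5-flash-lite-preview-06-17", 3779, 1.041422),
    ("claude-3-5-haiku-20241022", 4483, 5.092986),
    ("o4-mini", 4483, 17.325825),
    ("gpt-4.1-mini", 4483, 2.743717),
    ("gemini-2.0-flash-thinking-exp", 4483, 3.169473),
    ("gemini-2.5-flash-lite-preview-06-17", 4483, 1.048883),
]


def mean_latency(rows):
    """Return {model: mean seconds per call}."""
    by_model = defaultdict(list)
    for model, _index, seconds in rows:
        by_model[model].append(seconds)
    return {model: mean(values) for model, values in by_model.items()}


summary = mean_latency(rows)
# Print fastest model first.
for model, seconds in sorted(summary.items(), key=lambda kv: kv[1]):
    print(f"{model:40s} {seconds:6.2f} s")
```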


In [11]:
model_dict_2 = \
{
  "anthropic": [
    "claude-3-5-haiku-20241022"
  ],
  "openai": [
    "o3-mini",
    "gpt-4.1"
  ],
  "google": [
    "gemini-2.5-pro",
    "gemini-2.5-flash",
  ]
}

df = generate_GSM8K_code(
    model_dict=model_dict_2,
    indices_to_generate=[3779, 4483, 6237, 1202, 2345],
    example_indices=indices)


Output directory: code_generation_outputs/3779
Crafting user prompt...

--- Calling Anthropic model: claude-3-5-haiku-20241022 ---
  Response received in 5.67 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/anthropic_claude-3-5-haiku-20241022.txt

--- Calling Openai model: o3-mini ---
  Response received in 11.59 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/openai_o3-mini.txt

--- Calling Openai model: gpt-4.1 ---
  Response received in 2.75 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/openai_gpt-4.1.txt

--- Calling Google model: gemini-2.5-pro ---
  Response received in 22.52 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/google_gemini-2.5-pro.txt

--- Calling Google model: gemini-2.5-flash ---
  Response received in 5.33 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/google_gemini-2.5-flash.txt

Output directory: code_generation_outputs/4483
Craf

In [12]:
df

|   | provider | model | index | time_taken |
|---|---|---|---|---|
| 0 | anthropic | claude-3-5-haiku-20241022 | 3779 | 5.670871 |
| 1 | openai | o3-mini | 3779 | 11.586511 |
| 2 | openai | gpt-4.1 | 3779 | 2.745864 |
| 3 | google | gemini-2.5-pro | 3779 | 22.51925 |
| 4 | google | gemini-2.5-flash | 3779 | 5.330314 |
| 5 | anthropic | claude-3-5-haiku-20241022 | 4483 | 6.077964 |
| 6 | openai | o3-mini | 4483 | 9.689164 |
| 7 | openai | gpt-4.1 | 4483 | 2.457451 |
| 8 | google | gemini-2.5-pro | 4483 | 16.586376 |
| 9 | google | gemini-2.5-flash | 4483 | 4.655331 |


In [14]:
model_dict_3 = \
{
  "anthropic": [
    "claude-3-5-haiku-20241022"
  ],
  "openai": [
    "o3-mini",
    "gpt-4.1",
    "gpt-4.1-mini"
  ],
  "google": [
    "gemini-2.5-pro",
    "gemini-2.5-flash",
    "gemini-2.5-flash-lite-preview-06-17",
    "gemini-2.0-flash-thinking-exp"
  ]
}

# [3779, 4483, 6237, 1202, 2345]

df_3 = generate_GSM8K_code(
    model_dict=model_dict_3,
    indices_to_generate=[3331, 1647, 636, 399, 4670, 5918, 1531, 7364, 5464, 1205, 3518, 6732],
    example_indices=indices)


Output directory: code_generation_outputs/3331
Crafting user prompt...

--- Calling Anthropic model: claude-3-5-haiku-20241022 ---
  Response received in 5.18 seconds.
  Successfully saved raw output to: code_generation_outputs/3331/anthropic_claude-3-5-haiku-20241022.txt

--- Calling Openai model: o3-mini ---
  Response received in 11.44 seconds.
  Successfully saved raw output to: code_generation_outputs/3331/openai_o3-mini.txt

--- Calling Openai model: gpt-4.1 ---
  Response received in 2.27 seconds.
  Successfully saved raw output to: code_generation_outputs/3331/openai_gpt-4.1.txt

--- Calling Openai model: gpt-4.1-mini ---
  Response received in 3.74 seconds.
  Successfully saved raw output to: code_generation_outputs/3331/openai_gpt-4.1-mini.txt

--- Calling Google model: gemini-2.5-pro ---
  Response received in 17.23 seconds.
  Successfully saved raw output to: code_generation_outputs/3331/google_gemini-2.5-pro.txt

--- Calling Google model: gemini-2.5-flash ---
  Response r

In [15]:
import os
import re
import shutil

def process_and_clean_outputs(
    source_dir: str = 'code_generation_outputs',
    dest_dir: str = 'code_generation_outputs_cleaned'
):
    """
    Traverses a source directory, cleans raw .txt model outputs, and saves
    them as .py files in a new destination directory with the same structure.

    This function will:
    1. Walk through all subdirectories of the source_dir.
    2. Find all files ending in .txt.
    3. Create a corresponding subdirectory in the dest_dir.
    4. Read the .txt content and extract the Python code from ```python ... ``` blocks.
    5. Save the cleaned code to a new file with a .py extension in the destination.
    6. The original source directory and its files will be left untouched.

    Args:
        source_dir: The top-level directory containing the raw generated outputs.
        dest_dir: The top-level directory where cleaned .py files will be saved.
    """
    print(f"Starting processing from '{source_dir}'...")
    print(f"Cleaned files will be saved in '{dest_dir}'.")
    files_processed = 0

    if not os.path.isdir(source_dir):
        print(f"Error: Source directory '{source_dir}' not found.")
        return

    # os.walk efficiently traverses the entire directory tree
    for dirpath, _, filenames in os.walk(source_dir):
        for filename in filenames:
            # Process only the .txt files
            if filename.endswith(".txt"):
                source_filepath = os.path.join(dirpath, filename)
                print(f"\nProcessing file: {source_filepath}")

                try:
                    # 1. Mirror the source subdirectory under dest_dir.
                    # os.path.relpath avoids a string replace, which could match
                    # source_dir anywhere in the path and not only at its start.
                    rel_subdir = os.path.relpath(dirpath, source_dir)
                    dest_subdir = os.path.normpath(os.path.join(dest_dir, rel_subdir))
                    
                    # 2. Create the destination subdirectory if it doesn't exist
                    os.makedirs(dest_subdir, exist_ok=True)
                    
                    # 3. Read the raw content from the source file
                    with open(source_filepath, 'r', encoding='utf-8') as f:
                        raw_content = f.read()

                    # 4. Extract the code from the markdown block
                    cleaned_code = ""
                    code_match = re.search(r"```python\n(.*?)\n```", raw_content, re.DOTALL)
                    
                    if code_match:
                        cleaned_code = code_match.group(1).strip()
                        print("  Successfully extracted code from markdown block.")
                    else:
                        cleaned_code = raw_content.strip()
                        print("  Warning: No python markdown block found. Using raw content as code.")

                    # 5. Define the new .py filename and full path
                    base_name = os.path.splitext(filename)[0]
                    py_filename = f"{base_name}.py"
                    dest_filepath = os.path.join(dest_subdir, py_filename)

                    # 6. Write the cleaned code to the new .py file in the destination
                    with open(dest_filepath, 'w', encoding='utf-8') as f:
                        f.write(cleaned_code)
                    print(f"  Saved cleaned code to: {dest_filepath}")
                    
                    files_processed += 1

                except Exception as e:
                    print(f"  An error occurred while processing {source_filepath}: {e}")

    print(f"\nProcessing complete. Processed {files_processed} files.")

process_and_clean_outputs()

Starting processing from 'code_generation_outputs'...
Cleaned files will be saved in 'code_generation_outputs_cleaned'.

Processing file: code_generation_outputs/7364/google_gemini-2.5-flash-lite-preview-06-17.txt
  Successfully extracted code from markdown block.
  Saved cleaned code to: code_generation_outputs_cleaned/7364/google_gemini-2.5-flash-lite-preview-06-17.py

Processing file: code_generation_outputs/7364/google_gemini-2.5-pro.txt
  Saved cleaned code to: code_generation_outputs_cleaned/7364/google_gemini-2.5-pro.py

Processing file: code_generation_outputs/7364/openai_gpt-4.1.txt
  Successfully extracted code from markdown block.
  Saved cleaned code to: code_generation_outputs_cleaned/7364/openai_gpt-4.1.py

Processing file: code_generation_outputs/7364/openai_o3-mini.txt
  Successfully extracted code from markdown block.
  Saved cleaned code to: code_generation_outputs_cleaned/7364/openai_o3-mini.py

Processing file: code_generation_outputs/7364/anthropic_claude-3-5-haiku
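
The extraction step above expects a fence of exactly three backticks followed by `python` and a newline. Model outputs do not always follow that format, so a more tolerant pattern may help. The sketch below is an assumption about what such a variant could look like: it also accepts `py`, a capitalized tag, an untagged fence, and trailing spaces. `extract_code` is a hypothetical helper, and the fence string is built programmatically only so that this example contains no literal fence.

```python
import re
from typing import Tuple

FENCE = "`" * 3  # three backticks, built without writing a literal fence

_CODE_BLOCK_RE = re.compile(
    FENCE + r"[ \t]*(?:python|py)?[ \t]*\r?\n(.*?)\r?\n?[ \t]*" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_code(raw: str) -> Tuple[str, bool]:
    """Return (code, found_block); fall back to the raw text if no fence."""
    match = _CODE_BLOCK_RE.search(raw)
    if match:
        return match.group(1).strip(), True
    return raw.strip(), False


raw_output = (
    "Here is the function:\n"
    + FENCE + "Python \n"
    + "def solve():\n    return 1\n"
    + FENCE + "\nHope this helps."
)
code, found = extract_code(raw_output)
print(found)
print(code)
```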

In [9]:
"""
gsm8k_validator.py
==================

End-to-end checker for **three independently generated** `solve()` files
that formalise the same GSM8K problem.  The script enforces the _compact
format_ we finally adopted:

* 2-line docstring header (`Index: …`, `Returns: …`)
* Semantic argument names (no numeric prefix)
* Trace comments that start **exactly** with `#: L<n>`
* No calculator annotations inside the code
* Final answer stored in `answer` with `# FINAL ANSWER`

The pipeline performs four orthogonal tests:

UT-0  – Default run returns the official numeric answer  
UT-1  – Trace-comment lists are identical across files  
UT-2  – Arguments match in default literals **and** comment semantics  
UT-3  – Hypothesis fuzzing confirms functional equivalence

Files that pass all four tests are optionally rewritten into a *canonical*
form (`X1 …, Y1 …`) suitable for automated error-injection experiments.

Dependencies
------------
black                – whitespace-stable formatting  
libcst               – reliable CST traversal with comment access  
hypothesis           – property-based fuzzing  
sentence-transformers (mpnet-base) – SBERT cosine for comment semantics
"""

from __future__ import annotations

import hashlib
import importlib.util
import inspect
import re
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import List, Dict, Any

import black
import hypothesis.strategies as st
import libcst as cst
from hypothesis import given, settings
from sentence_transformers import SentenceTransformer

# ---------------------------------------------------------------------- #
#  Global constants & regex helpers
# ---------------------------------------------------------------------- #

_MODEL = SentenceTransformer("all-mpnet-base-v2")
_COS_THRESHOLD = 0.90          # SBERT cosine ≥ 0.9 ⇒ semantic match
_FUZZ_EXAMPLES  = 60           # Hypothesis draws

_TRACE_RE = re.compile(r"^#: L(\d+)\b")          # matches trace comment
_DOC_INDEX_RE = re.compile(r"^Index:\s*(\d+)")   # first docstring line


# ---------------------------------------------------------------------- #
#  ParsedFile dataclass
# ---------------------------------------------------------------------- #

@dataclass
class ParsedFile:
    """
    Normalised representation of a generated code file.

    Attributes
    ----------
    index         : str
        Problem ID, e.g. ``"Q310"``.
    path          : pathlib.Path
        Path to the original file on disk.
    module_code   : str
        Black-formatted source code.
    trace_keys    : List[str]
        Ordered list like ``["L1", "L2", …]`` extracted from comments.
    arg_defaults  : List[str]
        Default literal strings in the function signature.
    arg_comments  : List[str]
        Trailing comments for each argument.
    func          : callable
        Dynamically imported ``solve`` function ready for execution.
    """
    index: str
    path: Path
    module_code: str
    trace_keys: List[str]
    arg_defaults: List[str]
    arg_comments: List[str]
    func: Any


# ---------------------------------------------------------------------- #
#  Parsing utilities
# ---------------------------------------------------------------------- #

def parse_file(path: Path) -> ParsedFile:
    """
    Parse one generated `.py` file and extract all information needed
    by the validators.

    Why a dedicated parser?
    -----------------------
    * Centralises Black formatting (whitespace normalisation).
    * Collects trace comments, argument literals, and their semantic
      comments in **one** traversal (libcst).
    * Dynamically imports the `solve()` function so UT-0 and UT-3 can
      execute it in memory without writing temp files.
    """
    src_raw = path.read_text()
    src_fmt = black.format_str(src_raw, mode=black.FileMode())

    # ------- docstring extraction & index check ------------------- #
    doc_match = re.search(r'"""(.*?)"""', src_fmt, re.S)
    if not doc_match:
        raise ValueError(f"{path}   » missing two-line docstring")
    first_line = doc_match.group(1).strip().splitlines()[0]
    idx_match = _DOC_INDEX_RE.match(first_line)
    if not idx_match:
        raise ValueError(f"{path}   » first docstring line must start 'Index:'")
    index = f"Q{idx_match.group(1)}"

    # ------- trace comment list ----------------------------------- #
    trace_keys = [
        f"L{m.group(1)}"
        for line in src_fmt.splitlines()
        if (m := _TRACE_RE.match(line.lstrip()))
    ]
    if not trace_keys:
        raise ValueError(f"{path}   » no trace comments found")

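    # ------- CST walk for args ------------------------------------ #
    mod     = cst.parse_module(src_fmt)
    func_nd = next(
        n for n in mod.body
        if isinstance(n, cst.FunctionDef) and n.name.value == "solve"
    )
    arg_defaults, arg_comments = [], []
    for param in func_nd.params.params:
        # libcst nodes have no `.code` attribute; render them via the module
        arg_defaults.append(
            mod.code_for_node(param.default) if param.default else None
        )
        # a trailing comment is stored in the whitespace after the comma
        # (or after the parameter itself when no comma follows)
        cmt = ""
        comma_ws = getattr(param.comma, "whitespace_after", None)
        for ws in (comma_ws, param.whitespace_after_param):
            first_line = getattr(ws, "first_line", None)
            if first_line is not None and first_line.comment is not None:
                cmt = first_line.comment.value.lstrip("#").strip()
                break
        arg_comments.append(cmt)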

    # ------- dynamic import --------------------------------------- #
    spec = importlib.util.spec_from_loader(f"gsm8k_{index}_{hash(path)}", loader=None)
    mod_dyn = importlib.util.module_from_spec(spec)
    exec(src_fmt, mod_dyn.__dict__)

    return ParsedFile(
        index=index,
        path=path,
        module_code=src_fmt,
        trace_keys=trace_keys,
        arg_defaults=arg_defaults,
        arg_comments=arg_comments,
        func=mod_dyn.solve,
    )


# ---------------------------------------------------------------------- #
#  Unit Test 0 – default answer match
# ---------------------------------------------------------------------- #

def ut0_answer_match(files: List[ParsedFile], gold: int | float) -> List[ParsedFile]:
    """
    Keep only files whose `solve()` returns the official answer when
    called *with default arguments*.

    Rationale
    ---------
    Eliminates obvious errors: hard-coded wrong literals, mis-parsed
    numbers, or runtime exceptions.
    """
    ok = []
    for pf in files:
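        try:
            result = pf.func()
            # tolerate float rounding instead of requiring exact equality
            if abs(result - gold) <= 1e-9 * max(1.0, abs(gold)):
                ok.append(pf)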
        except Exception as e:
            print(f"[UT-0] {pf.path.name} raised {e!r}", file=sys.stderr)
    return ok


# ---------------------------------------------------------------------- #
#  Unit Test 1 – identical trace lists
# ---------------------------------------------------------------------- #

def ut1_trace_lists_equal(files: List[ParsedFile]) -> bool:
    """
    Verify that every surviving file has the *same* ordered list of trace
    keys (`L1`, `L2`, …).

    Why:
    ----
    Ensures no model skipped or re-ordered steps; makes later
    canonicalisation deterministic.
    """
    hashes = {
        hashlib.sha1(",".join(p.trace_keys).encode()).hexdigest()
        for p in files
    }
    return len(hashes) == 1


# ---------------------------------------------------------------------- #
#  Unit Test 2 – argument semantic alignment
# ---------------------------------------------------------------------- #

def ut2_arg_semantics(files: List[ParsedFile]) -> bool:
    """
    Check two things:
    1. Argument *default literals* match exactly across files.
    2. Trailing comments are semantically equivalent (SBERT cosine ≥ 0.9).

    Rationale
    ---------
    Detects swapped wage vs. hours, wrong default numbers, or comment /
    code drift that numeric fuzzing may not catch immediately.
    """
    ref_def  = files[0].arg_defaults
    ref_emb  = _MODEL.encode(files[0].arg_comments, normalize_embeddings=True)

    for pf in files[1:]:
        if pf.arg_defaults != ref_def:
            return False
        emb = _MODEL.encode(pf.arg_comments, normalize_embeddings=True)
        if ((emb * ref_emb).sum(axis=1) < _COS_THRESHOLD).any():
            return False
    return True


# ---------------------------------------------------------------------- #
#  Unit Test 3 – property-based fuzz equivalence
# ---------------------------------------------------------------------- #

def ut3_fuzz_equivalent(files: List[ParsedFile], n=_FUZZ_EXAMPLES) -> bool:
    """
    Use Hypothesis to generate random overrides for **all arguments** and
    require that every file returns the same value.

    Rationale
    ---------
    Guards against coincidental agreement on default literals.  Divergent
    internal logic will surface when you vary inputs.
    """
    funcs = [pf.func for pf in files]
    sig   = inspect.signature(funcs[0])
    strat_map = {}
    for param in sig.parameters.values():
        literal = param.default
        strat_map[param.name] = (
            st.floats(1, 30) if isinstance(literal, float) else st.integers(1, 30)
        )

    @settings(max_examples=n, deadline=None)
    @given(st.fixed_dictionaries(strat_map))
    def _check(kwargs):
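        ref = funcs[0](**kwargs)
        for fn in funcs[1:]:
            # equivalent formulas may round differently on float inputs,
            # so compare with a relative tolerance rather than `==`
            assert abs(fn(**kwargs) - ref) <= 1e-9 * max(1.0, abs(ref))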

    try:
        _check()
        return True
    except Exception as e:
        print(f"[UT-3] fuzz divergence: {e!r}", file=sys.stderr)
        return False


# ---------------------------------------------------------------------- #
#  Canonical renamer (optional post-step)
# ---------------------------------------------------------------------- #

_RE_IDENT = re.compile(r"\b([A-Za-z_]\w*)\b")

def canonicalise(source: str, arg_order: List[str], trace_keys: List[str]) -> str:
    """
    Return a *canonical* version of `source` where:

    * Arguments become `X1`, `X2`, … following their order in `arg_order`.
    * The variable assigned immediately after `#: L<n>` becomes `Y<n>`.

    This deterministic form is ideal for later error-injection because it
    decouples variable names from human semantics.

    Parameters
    ----------
    source     : str
        Original source code.
    arg_order  : List[str]
        Ordered list of argument names in the `solve` signature.
    trace_keys : List[str]
        Ordered list of trace keys extracted earlier.

    Returns
    -------
    str
        Source code with canonical identifiers.
    """
    # map args → Xi
    arg_map = {old: f"X{i+1}" for i, old in enumerate(arg_order)}

    # map intermediates via trace comments
    lines = source.splitlines()
    interm_map: Dict[str, str] = {}
    for key in trace_keys:
        # match "#: L<n>" exactly, so that "L1" does not also match "L10"
        marker = re.compile(rf"#: {re.escape(key)}\b")
        # skip the last line: a trace comment there has no following assignment
        for i, line in enumerate(lines[:-1]):
            if marker.match(line.lstrip()):
                # next line should have assignment
                lhs_match = re.match(r"\s*([A-Za-z_]\w*)\s*=", lines[i + 1])
                if lhs_match:
                    interm_map[lhs_match.group(1)] = f"Y{key[1:]}"
                break

    name_map = {**arg_map, **interm_map}
    return _RE_IDENT.sub(lambda m: name_map.get(m.group(1), m.group(1)), source)


# ---------------------------------------------------------------------- #
#  Orchestration function
# ---------------------------------------------------------------------- #

def validate_triplet(paths: List[Path], gold: int | float) -> List[Path]:
    """
    Run UT-0…UT-3 on a list of *three* files that claim to solve the same
    GSM8K item.  If at least two survive **all** tests, write canonical
    versions (`canon_<name>.py`) and return their original paths.

    Side-effects:
    -------------
    • Prints status lines for each stage.  
    • Writes canonical files next to originals.

    Returns
    -------
    List[Path]
        Paths of files that passed, or an empty list if quorum fails.
    """
    parsed = [parse_file(p) for p in paths]

    parsed = ut0_answer_match(parsed, gold)
    if len(parsed) < 2:
        print("Quorum failed at UT-0")
        return []

    if not ut1_trace_lists_equal(parsed):
        print("Trace lists differ (UT-1)")
        return []

    if not ut2_arg_semantics(parsed):
        print("Arg mismatch (UT-2)")
        return []

    if not ut3_fuzz_equivalent(parsed):
        print("Functional divergence (UT-3)")
        return []

    # optional canonicalise
    arg_order = list(inspect.signature(parsed[0].func).parameters.keys())
    for pf in parsed:
        canon = canonicalise(pf.module_code, arg_order, pf.trace_keys)
        (pf.path.parent / f"canon_{pf.path.name}").write_text(canon)

    print("All tests passed; canonical files written.")
    return [pf.path for pf in parsed]
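
The trace-comment check (UT-1) can be tried in isolation without Black, libcst, or the embedding model. The block below is a standalone re-implementation sketch using the same `#: L<n>` pattern as `_TRACE_RE`; the helper names `trace_keys` and `same_trace` and the two sample sources are hypothetical.

```python
import re
from typing import Iterable, List

TRACE_RE = re.compile(r"^#: L(\d+)\b")  # same pattern as _TRACE_RE above


def trace_keys(source: str) -> List[str]:
    """Return the ordered trace keys ("L1", "L2", ...) found in source."""
    return [
        f"L{m.group(1)}"
        for line in source.splitlines()
        if (m := TRACE_RE.match(line.lstrip()))
    ]


def same_trace(sources: Iterable[str]) -> bool:
    """True when every source has an identical ordered list of trace keys."""
    return len({tuple(trace_keys(s)) for s in sources}) == 1


full = (
    "def solve(x=1):\n"
    "    #: L1\n"
    "    y = x + 1\n"
    "    #: L2\n"
    "    answer = y * 2  # FINAL ANSWER\n"
    "    return answer\n"
)
# Same computation, but the second step lost its trace comment.
skipped = full.replace("    #: L2\n", "")

print(trace_keys(full))
print(same_trace([full, full]), same_trace([full, skipped]))
```

Comparing tuples of keys is equivalent to comparing the SHA-1 hashes used in `ut1_trace_lists_equal`, and it keeps the keys readable when a mismatch has to be debugged.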