### Cell 1: Setup and Validator Code

This first cell contains all the necessary imports and the full code for the validator. It also defines `BASE_DIR`, the root directory of the generated files. `PROBLEM_INDEX` and `GOLD_ANSWER` are set later, in the execution cells that run the tests.

**Action:**
1.  Paste the entire content of your `gsm8k_validator.py` script into the marked section below.
2.  Check that `BASE_DIR` points to the directory containing the generated solutions.
3.  Run this cell once to set up your environment.

In [29]:
"""
gsm8k_validator_v2.py
======================

Performs a flexible, score-based validation of multiple LLM-generated `solve()`
functions for a given GSM8K problem. Instead of a rigid filter, this script
analyzes all pairs of models to find the most robust and comprehensive consensus.

The final output is a confidence score (0.0-1.0+) that reflects:
1.  The completeness of parameter alignment between models.
2.  The semantic clarity of the aligned parameters.
3.  The level of consensus (number of models in agreement).

Core Logic:
-----------
1.  **Parse & Pre-filter (UT-0):** All generated Python files for a problem are
    parsed and filtered, keeping only those that produce the correct answer
    for the default values.

2.  **Pairwise Alignment (UT-2):** For every pair of surviving models, find the
    best possible alignment between their function arguments based on semantic
    similarity (SBERT) and matching default values. Order is ignored.

3.  **Pairwise Fuzzing (UT-3):** For each aligned pair, fuzz-test for functional
    equivalence using the "Fuzz Aligned, Freeze Unaligned" strategy. This
    ensures logical soundness even with partially matching signatures.

4.  **Scoring & Consensus:**
    - Each successfully validated pair receives a `PairwiseQualityScore` based
      on its Alignment Ratio and Semantic Strength.
    - The script identifies the largest "clique" of models that are all
      mutually validated.
    - A final `ConfidenceScore` is computed from the clique's average quality,
      boosted by a bonus for the size of the consensus.

Dependencies
------------
black                – whitespace-stable formatting
libcst               – reliable CST traversal with comment access
hypothesis           – property-based fuzzing
sentence-transformers (mpnet-base) – SBERT cosine for comment semantics
numpy                – fast vector operations
"""

from __future__ import annotations

import importlib.util
import inspect
import itertools
import json
import re
import sys
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Any, Tuple, Dict

import black
import hypothesis.strategies as st
import libcst as cst
import numpy as np
from hypothesis import given, settings, HealthCheck
from sentence_transformers import SentenceTransformer

import pprint

from collections import defaultdict

# ---------------------------------------------------------------------- #
#  Global constants & Configuration
# ---------------------------------------------------------------------- #

BASE_DIR = Path("code_generation_outputs_cleaned")

_MODEL = SentenceTransformer("all-mpnet-base-v2")
_COS_THRESHOLD = 0.70  # SBERT cosine ≥ 0.7 ⇒ semantic match
_FUZZ_EXAMPLES = 50  # Hypothesis draws
_MIN_ALIGNMENT_FOR_FUZZ = 1 # A pair must align on at least one arg to be fuzzed

# --- Scoring Weights --- #
W_ALIGNMENT = 0.7
W_SEMANTIC = 0.3

# --- Consensus Bonus Multipliers --- #
CONSENSUS_BONUS = {
    2: 1.0,  # Baseline for a pair
    3: 1.1,  # 10% bonus for a 3-way consensus
    4: 1.2,  # 20% bonus for a 4-way consensus
    5: 1.3,
}

_TRACE_RE = re.compile(r"^#: L(\d+)\b")
_DOC_INDEX_RE = re.compile(r"^Index:\s*(\d+)")


# ---------------------------------------------------------------------- #
#  Dataclasses for Structured Data
# ---------------------------------------------------------------------- #

@dataclass(frozen=True)
class Argument:
    """Represents a single argument from a function signature."""
    name: str
    arg_type: str  # type hint as written in the signature, e.g. "int"
    default: Any
    comment: str


@dataclass(frozen=True)
class ParsedFile:
    """Normalised representation of a generated code file."""
    path: Path
    module_code: str
    func: Any
    args: List[Argument] = field(default_factory=list)


@dataclass(frozen=True)
class AlignmentResult:
    """Stores the result of aligning two ParsedFiles."""
    aligned_pairs: List[Tuple[Argument, Argument]]
    unaligned_A: List[Argument]
    unaligned_B: List[Argument]
    semantic_scores: List[float]


@dataclass
class PairwiseValidation:
    """Stores the full result of a successful pairwise validation."""
    file_A: ParsedFile
    file_B: ParsedFile
    alignment: AlignmentResult
    quality_score: float


# ---------------------------------------------------------------------- #
#  Core Logic Implementation
# ---------------------------------------------------------------------- #

def parse_file(path: Path) -> ParsedFile | None:
    """
    Parse one generated .py file into a structured ParsedFile object using
    a direct, regex-based approach.
    """
    try:
        src_raw = path.read_text(encoding="utf-8")
        src_fmt = black.format_str(src_raw, mode=black.FileMode())

        signature_match = re.search(r"def solve\s*\((.*?)\):", src_fmt, re.DOTALL)
        if not signature_match:
            raise ValueError("Could not find a 'def solve(...):' signature.")
        
        signature_content = signature_match.group(1)

        args = []
        # Each argument line has the form `name: type = default,  # comment`.
        # Groups: (1) name, (2) type hint, (3) default expression, (4) optional comment.
        arg_pattern = re.compile(
            r"^\s*([a-zA-Z_]\w*)\s*:\s*(\w+)\s*=\s*(.*?)\s*,?\s*(?:#\s*(.*))?$"
        )

        for line in signature_content.splitlines():
            if not line.strip():
                continue
            match = arg_pattern.match(line)
            if match:
                name = match.group(1)
                arg_type = match.group(2)
                default_str = match.group(3).strip()
                # NOTE: eval() executes the default expression; only run on trusted files.
                default_val = eval(default_str)
                comment = match.group(4).strip() if match.group(4) else ""

                args.append(Argument(
                    name=name,
                    arg_type=arg_type,
                    default=default_val,
                    comment=comment
                ))

        spec = importlib.util.spec_from_loader(f"gsm8k_{path.stem}_{hash(path)}", loader=None)
        mod_dyn = importlib.util.module_from_spec(spec)
        exec(src_fmt, mod_dyn.__dict__)

        return ParsedFile(
            path=path,
            module_code=src_fmt,
            func=mod_dyn.solve,
            args=args
        )
    except Exception as e:  # covers read, formatting, parsing, and exec failures
        print(f"[Parser Error] Skipping {path.name}: {e!r}", file=sys.stderr)
        return None


def ut0_answer_match(files: List[ParsedFile], gold: float) -> List[ParsedFile]:
    """Keep only files whose solve() returns the official answer with default args."""
    ok_files = []
    for pf in files:
        try:
            if np.isclose(pf.func(), gold):
                ok_files.append(pf)
        except Exception as e:
            print(f"[UT-0 Fail] {pf.path.name} raised {e!r}", file=sys.stderr)
    return ok_files

def find_best_alignment(file_A: ParsedFile, file_B: ParsedFile) -> AlignmentResult:
    """
    Finds the best argument alignment using a 'bucket and match' strategy.
    1. Groups args from each file into buckets by (type, default_value).
    2. Only performs semantic comparison on args within matching buckets.
    """
    # --- 1. Create buckets for each file's arguments ---
    buckets_A = defaultdict(list)
    for arg in file_A.args:
        buckets_A[(arg.arg_type, arg.default)].append(arg)
        
    buckets_B = defaultdict(list)
    for arg in file_B.args:
        buckets_B[(arg.arg_type, arg.default)].append(arg)

    aligned_pairs = []
    semantic_scores = []
    
    # --- 2. Iterate through buckets that exist in BOTH files ---
    common_keys = set(buckets_A.keys()) & set(buckets_B.keys())
    
    for key in common_keys:
        args_in_bucket_A = buckets_A[key]
        args_in_bucket_B = buckets_B[key]
        
        # --- 3. Perform semantic alignment ONLY within the current bucket ---
        texts_A = [arg.name.replace("_", " ") + " | " + arg.comment for arg in args_in_bucket_A]
        texts_B = [arg.name.replace("_", " ") + " | " + arg.comment for arg in args_in_bucket_B]
        
        embeddings_A = _MODEL.encode(texts_A, normalize_embeddings=True)
        embeddings_B = _MODEL.encode(texts_B, normalize_embeddings=True)
        similarity_matrix = embeddings_A @ embeddings_B.T

        # Use a greedy matching strategy within the bucket
        sorted_indices = np.argsort(similarity_matrix, axis=None)[::-1]
        flat_indices = np.atleast_1d(sorted_indices)
        rows, cols = np.unravel_index(flat_indices, similarity_matrix.shape)

        used_in_bucket_A = set()
        used_in_bucket_B = set()

        for i, j in zip(rows, cols):
            if i in used_in_bucket_A or j in used_in_bucket_B:
                continue

            similarity_score = similarity_matrix[i, j]
            if similarity_score >= _COS_THRESHOLD:
                aligned_pairs.append((args_in_bucket_A[i], args_in_bucket_B[j]))
                semantic_scores.append(similarity_score)
                used_in_bucket_A.add(i)
                used_in_bucket_B.add(j)

    # --- 4. Calculate the final unaligned sets ---
    final_aligned_A = {p[0] for p in aligned_pairs}
    final_aligned_B = {p[1] for p in aligned_pairs}
    unaligned_A = [arg for arg in file_A.args if arg not in final_aligned_A]
    unaligned_B = [arg for arg in file_B.args if arg not in final_aligned_B]

    return AlignmentResult(aligned_pairs, unaligned_A, unaligned_B, semantic_scores)


def fuzz_aligned_pair(alignment: AlignmentResult, func_A: callable, func_B: callable) -> bool:
    """Fuzz-test an aligned pair using the 'Fuzz Aligned, Freeze Unaligned' strategy."""
    if len(alignment.aligned_pairs) < _MIN_ALIGNMENT_FOR_FUZZ:
        return False

    strat_map = {}
    for i, (arg_A, _) in enumerate(alignment.aligned_pairs):
        literal = arg_A.default
        strat = st.floats if isinstance(literal, float) else st.integers
        strat_map[f"pair_{i}"] = strat(min_value=1, max_value=50)

    # Freeze unaligned args to their defaults
    frozen_kwargs_A = {arg.name: arg.default for arg in alignment.unaligned_A}
    frozen_kwargs_B = {arg.name: arg.default for arg in alignment.unaligned_B}

    @settings(max_examples=_FUZZ_EXAMPLES, deadline=None, suppress_health_check=[HealthCheck.too_slow])
    @given(st.fixed_dictionaries(strat_map))
    def _check(fuzzed_values):
        kwargs_A = frozen_kwargs_A.copy()
        kwargs_B = frozen_kwargs_B.copy()

        for i, (arg_A, arg_B) in enumerate(alignment.aligned_pairs):
            fuzzed_val = fuzzed_values[f"pair_{i}"]
            kwargs_A[arg_A.name] = fuzzed_val
            kwargs_B[arg_B.name] = fuzzed_val

        assert np.isclose(func_A(**kwargs_A), func_B(**kwargs_B))

    try:
        _check()
        return True
    except Exception:
        return False


def calculate_pairwise_score(alignment: AlignmentResult, file_A: ParsedFile, file_B: ParsedFile) -> float:
    """Calculate the quality score for a single validated pair."""
    num_aligned = len(alignment.aligned_pairs)
    
    # Alignment Ratio: aligned pairs divided by all distinct arguments across both files.
    total_unique_args = num_aligned + len(alignment.unaligned_A) + len(alignment.unaligned_B)
    alignment_ratio = num_aligned / total_unique_args if total_unique_args > 0 else 1.0

    # Semantic Strength: mean SBERT cosine similarity over the aligned pairs.
    semantic_strength = np.mean(alignment.semantic_scores) if alignment.semantic_scores else 1.0

    return (W_ALIGNMENT * alignment_ratio) + (W_SEMANTIC * semantic_strength)


# ---------------------------------------------------------------------- #
#  Orchestration and Reporting
# ---------------------------------------------------------------------- #

def analyze_problem_outputs(problem_dir: Path, gold_answer: float):
    """Main orchestrator to analyze all model outputs for a single problem."""
    print(f"\n{'='*20} Analyzing Problem: {problem_dir.name} {'='*20}")
    
    all_files = list(problem_dir.glob("*.py"))
    if not all_files:
        print("No Python files found in this directory.")
        return

    parsed_files = [pf for pf in [parse_file(p) for p in all_files] if pf is not None]
    print(f"Found and parsed {len(parsed_files)} files.")

    survivors_ut0 = ut0_answer_match(parsed_files, gold_answer)
    print(f"{len(survivors_ut0)} files passed UT-0 (correct default answer).")
    if len(survivors_ut0) < 2:
        print("Not enough models passed UT-0 to find a pair. Aborting.")
        return

    # --- Pairwise Validation ---
    validated_pairs: List[PairwiseValidation] = []
    for file_A, file_B in itertools.combinations(survivors_ut0, 2):
        alignment = find_best_alignment(file_A, file_B)
        
        if fuzz_aligned_pair(alignment, file_A.func, file_B.func):
            score = calculate_pairwise_score(alignment, file_A, file_B)
            validated_pairs.append(PairwiseValidation(file_A, file_B, alignment, score))
            print(f"  ✓ Validated Pair: ({file_A.path.name}, {file_B.path.name}), Score: {score:.3f}")

    if not validated_pairs:
        print("\nNo functionally equivalent pairs found after fuzzing.")
        return

    # --- Find Best Consensus Clique ---
    nodes = survivors_ut0
    adj = {pf.path.name: set() for pf in nodes}
    for vp in validated_pairs:
        adj[vp.file_A.path.name].add(vp.file_B.path.name)
        adj[vp.file_B.path.name].add(vp.file_A.path.name)

    best_clique_names = []
    # Check for cliques of decreasing size
    for size in range(len(nodes), 1, -1):
        # Use a list of names for combinations, not the ParsedFile objects
        node_names = [pf.path.name for pf in nodes]
        for combo_names in itertools.combinations(node_names, size):
            is_clique = all(
                combo_names[j] in adj[combo_names[i]] for i in range(size) for j in range(i + 1, size)
            )
            if is_clique:
                best_clique_names = list(combo_names)
                break
        if best_clique_names:
            break
    
    # --- Calculate Final Score and Report ---
    if not best_clique_names and validated_pairs:
        # Defensive fallback: the search above already covers size-2 cliques, so this
        # branch is not expected to run while validated pairs exist.
        print("\nNo consensus clique found. Reporting score for the single best pair.")
        best_pair = max(validated_pairs, key=lambda vp: vp.quality_score)
        final_score = best_pair.quality_score
        clique_size = 2
        best_clique_names = sorted([best_pair.file_A.path.name, best_pair.file_B.path.name])
        avg_quality = final_score
        bonus = CONSENSUS_BONUS.get(clique_size, 1.0)
    elif best_clique_names:
        # A clique was found
        clique_size = len(best_clique_names)
        clique_pairs_scores = [
            vp.quality_score for vp in validated_pairs 
            if vp.file_A.path.name in best_clique_names and vp.file_B.path.name in best_clique_names
        ]
        avg_quality = np.mean(clique_pairs_scores) if clique_pairs_scores else 0
        bonus = CONSENSUS_BONUS.get(clique_size, max(CONSENSUS_BONUS.values()))
        final_score = avg_quality * bonus
    else:
        # This case handles when there are no validated pairs at all
        clique_size = 0
        avg_quality = 0
        bonus = 0
        final_score = 0

    print("\n" + "-"*50)
    print("                 VALIDATION SUMMARY")
    print("-"*50)
    print(f"Best Consensus Found: {clique_size}-way agreement")
    if best_clique_names:
        print(f"Models in Best Clique:")
        for name in sorted(best_clique_names):
            print(f"  - {name}")
    else:
        print("Models in Best Clique: None")

    print(f"\nAverage Pairwise Quality in Clique: {avg_quality:.4f}")
    print(f"Consensus Bonus Multiplier: x{bonus}")
    print(f"FINAL CONFIDENCE SCORE: {final_score:.4f}")
    print("-"*50)

### Cell 2: Test File Parsing

This cell defines `run_parsing_test`, which finds all `.py` files in the directory for a given `PROBLEM_INDEX` and runs `parse_file` on each one. It reports the total number of files found and how many were parsed successfully, listing their names, and returns the parsed files for the next test to use.
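
As a rough illustration of what `parse_file` extracts, the sketch below applies the same `arg_pattern` regex to two hypothetical signature lines. The example lines and the names `ticket_price` and `num_tickets` are invented for illustration. It assumes one argument per line, and it uses `ast.literal_eval` in place of the validator's `eval` call so that the sketch is safe to run.

```python
import ast
import re

# Same pattern as `arg_pattern` in `parse_file`.
ARG_PATTERN = re.compile(
    r"^\s*([a-zA-Z_]\w*)\s*:\s*(\w+)\s*=\s*(.*?)\s*,?\s*(?:#\s*(.*))?$"
)

# Hypothetical signature lines (one argument per line, optional trailing comment).
lines = [
    "    ticket_price: int = 12,  # cost of one ticket in dollars",
    "    num_tickets: int = 3,",
]

parsed = []
for line in lines:
    match = ARG_PATTERN.match(line)
    name, arg_type, default_str, comment = match.groups()
    parsed.append({
        "name": name,
        "arg_type": arg_type,
        # literal_eval is a stand-in for eval(); it only accepts Python literals.
        "default": ast.literal_eval(default_str),
        "comment": comment.strip() if comment else "",
    })

for entry in parsed:
    print(entry)
```

Lines that do not fit the `name: type = default` shape, such as an argument without a type hint or without a default, do not match the pattern and are silently skipped by the parser.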

In [16]:
def run_parsing_test(base_dir_str: str, problem_index: int) -> list[ParsedFile]:
    """Finds and parses all source files for a given problem index."""
    print("\n" + "="*20 + " Test 1: File Parsing " + "="*20)
    problem_dir = Path(base_dir_str) / str(problem_index)
    if not problem_dir.is_dir():
        print(f"Error: Directory not found at {problem_dir}")
        return []
    
    all_source_files = list(problem_dir.glob("*.py"))
    print(f"Found {len(all_source_files)} files in '{problem_dir}'.")
    
    all_parsed_files = [pf for pf in [parse_file(p) for p in all_source_files] if pf]
    print(f"\nSuccessfully parsed {len(all_parsed_files)} files:")
    for pf in all_parsed_files:
        print(f"  - {pf.path.name}")
    return all_parsed_files

### Cell 3: Test UT-0 (Answer Match)

This cell defines `run_ut0_test`, which takes the list of successfully parsed files from the previous step and runs `ut0_answer_match` on it. It keeps only the files whose `solve()` function returns the correct `GOLD_ANSWER` when called with default arguments, and returns those survivors for the next test.
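
The UT-0 filter can be sketched in isolation as follows. The three `solve_*` functions and the helper `answer_matches` are hypothetical, and `math.isclose` is used here as a standard-library stand-in for `np.isclose`; the tolerances only roughly mirror NumPy's defaults.

```python
import math

# Hypothetical candidate solutions for a problem whose gold answer is 52.0.
def solve_ok(apples: int = 4, price: float = 13.0) -> float:
    return apples * price

def solve_wrong(apples: int = 4, price: float = 13.0) -> float:
    return apples + price

def solve_crash(apples: int = 4, price: float = 13.0) -> float:
    return apples / 0  # raises ZeroDivisionError

def answer_matches(func, gold, rel_tol=1e-5, abs_tol=1e-8) -> bool:
    """Call func() with its defaults; a crash counts as a failed check."""
    try:
        return math.isclose(func(), gold, rel_tol=rel_tol, abs_tol=abs_tol)
    except Exception:
        return False

candidates = {"ok": solve_ok, "wrong": solve_wrong, "crash": solve_crash}
survivors = [name for name, func in candidates.items() if answer_matches(func, 52.0)]
print(survivors)
```

Only the candidate that reproduces the gold answer survives; a wrong answer and a raised exception are both treated as failures, as in `ut0_answer_match`.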

In [17]:
def run_ut0_test(parsed_files: list[ParsedFile], gold_answer: float) -> list[ParsedFile]:
    """Runs UT-0 (Answer Match) on a list of parsed files."""
    print("\n" + "="*20 + " Test 2: UT-0 (Answer Match) " + "="*20)
    if not parsed_files:
        print("No parsed files to test. Skipping.")
        return []
    
    print(f"--- Running UT-0 against Gold Answer: {gold_answer} ---")
    survivors = ut0_answer_match(parsed_files, gold_answer)
    print(f"\n{len(survivors)} files passed UT-0:")
    for pf in survivors:
        print(f"  - {pf.path.name}")
    return survivors

### Cell 4: Note on UT-1 (Trace Lists)

Our new flexible validation strategy does not use `UT-1` (trace-list comparison) as a hard filter. It prioritizes *functional equivalence* (tested by fuzzing) over the exact sequence of intermediate steps, so there is no dedicated test cell for UT-1.

### Cell 5: Test UT-2 (Argument Alignment)

This cell tests the core alignment logic (`find_best_alignment`) on the files that passed UT-0. It iterates through every possible pair, runs the "bucket and match" strategy, and prints a debug report listing which arguments were aligned and which were left unaligned in each file.
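
The bucketing half of "bucket and match" can be sketched without the embedding model. In the sketch below, `Arg` is a hypothetical stand-in for `Argument`, and the argument lists are invented; the SBERT similarity step that the validator runs inside each shared bucket is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Arg:
    """Hypothetical, reduced stand-in for the validator's Argument dataclass."""
    name: str
    arg_type: str
    default: object

def bucket(args):
    """Group argument names by their (type, default value) key."""
    buckets = defaultdict(list)
    for arg in args:
        buckets[(arg.arg_type, arg.default)].append(arg.name)
    return buckets

args_a = [Arg("price", "int", 10), Arg("num_items", "int", 5), Arg("tax", "int", 2)]
args_b = [Arg("unit_cost", "int", 10), Arg("quantity", "int", 5), Arg("discount", "float", 0.1)]

buckets_a, buckets_b = bucket(args_a), bucket(args_b)

# Only keys present in BOTH files produce candidate matches.
common = sorted(set(buckets_a) & set(buckets_b))
candidates = {key: (buckets_a[key], buckets_b[key]) for key in common}
print(candidates)
```

Here `tax` and `discount` never reach the semantic comparison because no argument in the other file shares their type and default, so they end up in the unaligned lists.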

In [18]:
def run_ut2_alignment_test(survivors_ut0: list[ParsedFile]):
    """Runs UT-2 (Argument Alignment) on all pairs of UT-0 survivors."""
    print("\n" + "="*20 + " Test 3: UT-2 (Argument Alignment) " + "="*20)
    if not survivors_ut0 or len(survivors_ut0) < 2:
        print("Fewer than 2 survivors from UT-0. Cannot perform alignment.")
        return

    print(f"--- Running alignment report on {len(survivors_ut0)} files ---")
    pp = pprint.PrettyPrinter(indent=2)
    
    for file_A, file_B in itertools.combinations(survivors_ut0, 2):
        print("\n" + "#"*60)
        print(f"### Aligning: {file_A.path.name} (A) vs. {file_B.path.name} (B)")
        print("#"*60)

        alignment_result = find_best_alignment(file_A, file_B)
        
        print("\n--- Summary of Alignment ---")
        print(f"Found {len(alignment_result.aligned_pairs)} aligned pairs:")
        pp.pprint(sorted([(p[0].name, p[1].name) for p in alignment_result.aligned_pairs]))
        
        print("\n--- Unaligned Arguments ---")
        print("Unaligned in A:", sorted([arg.name for arg in alignment_result.unaligned_A]))
        print("Unaligned in B:", sorted([arg.name for arg in alignment_result.unaligned_B]))

print("Setup complete. Test case functions are defined.")

Setup complete. Test case functions are defined.


In [23]:
PROBLEM_INDEX = 4483
GOLD_ANSWER = 100.0

# --- Execute Tests in Sequence ---
# 1. Parsing
parsed_files = run_parsing_test(BASE_DIR, PROBLEM_INDEX)

# 2. UT-0
ut0_survivors_4483 = run_ut0_test(parsed_files, GOLD_ANSWER)

# 3. UT-2
run_ut2_alignment_test(ut0_survivors_4483)


Found 9 files in 'code_generation_outputs_cleaned/4483'.

Successfully parsed 9 files:
  - anthropic_claude-3-5-haiku-20241022.py
  - google_gemini-2.5-flash.py
  - google_gemini-2.5-flash-lite-preview-06-17.py
  - openai_o3-mini.py
  - openai_gpt-4.1.py
  - google_gemini-2.0-flash-thinking-exp.py
  - openai_o4-mini.py
  - openai_gpt-4.1-mini.py
  - google_gemini-2.5-pro.py

--- Running UT-0 against Gold Answer: 100.0 ---

9 files passed UT-0:
  - anthropic_claude-3-5-haiku-20241022.py
  - google_gemini-2.5-flash.py
  - google_gemini-2.5-flash-lite-preview-06-17.py
  - openai_o3-mini.py
  - openai_gpt-4.1.py
  - google_gemini-2.0-flash-thinking-exp.py
  - openai_o4-mini.py
  - openai_gpt-4.1-mini.py
  - google_gemini-2.5-pro.py

--- Running alignment report on 9 files ---

############################################################
### Aligning: anthropic_claude-3-5-haiku-20241022.py (A) vs. google_gemini-2.5-flash.py (B)
############################################################



In [24]:
PROBLEM_INDEX = 636
GOLD_ANSWER = 52.0

# --- Execute Tests in Sequence ---
# 1. Parsing
parsed_files_636 = run_parsing_test(BASE_DIR, PROBLEM_INDEX)

# 2. UT-0
ut0_survivors_636 = run_ut0_test(parsed_files_636, GOLD_ANSWER)

# 3. UT-2
run_ut2_alignment_test(ut0_survivors_636)


Found 8 files in 'code_generation_outputs_cleaned/636'.

Successfully parsed 8 files:
  - anthropic_claude-3-5-haiku-20241022.py
  - google_gemini-2.5-flash.py
  - google_gemini-2.5-flash-lite-preview-06-17.py
  - openai_o3-mini.py
  - openai_gpt-4.1.py
  - google_gemini-2.0-flash-thinking-exp.py
  - openai_gpt-4.1-mini.py
  - google_gemini-2.5-pro.py

--- Running UT-0 against Gold Answer: 52.0 ---

8 files passed UT-0:
  - anthropic_claude-3-5-haiku-20241022.py
  - google_gemini-2.5-flash.py
  - google_gemini-2.5-flash-lite-preview-06-17.py
  - openai_o3-mini.py
  - openai_gpt-4.1.py
  - google_gemini-2.0-flash-thinking-exp.py
  - openai_gpt-4.1-mini.py
  - google_gemini-2.5-pro.py

--- Running alignment report on 8 files ---

############################################################
### Aligning: anthropic_claude-3-5-haiku-20241022.py (A) vs. google_gemini-2.5-flash.py (B)
############################################################

--- Summary of Alignment ---
Found 3 aligned p

### Cell: Test UT-3 (Fuzzing Equivalence)

This cell tests the `fuzz_aligned_pair` function. It defines two test cases:

1.  **A "Passing" Case:** Two functions (`equivalent_A`, `equivalent_B`) that are mathematically identical but use different variable names. This pair should pass the fuzz test.
2.  **A "Failing" Case:** Two functions (`different_A`, `different_C`) that produce the same result for their default values but have different underlying logic. This pair should fail the fuzz test.

This will confirm that the fuzzer can correctly distinguish between truly equivalent and merely coincidentally correct functions.
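
The "Fuzz Aligned, Freeze Unaligned" idea can be sketched with the standard library alone. The version below replaces Hypothesis with `random.Random` purely to stay self-contained; the functions `solve_a`, `solve_b`, `solve_c` and the helper `fuzz_equivalent` are hypothetical and are not part of the validator.

```python
import math
import random

def solve_a(price=10, quantity=2, shipping=0):
    return price * quantity + shipping

def solve_b(unit_cost=10, count=2):
    return unit_cost * count

def solve_c(unit_cost=10, count=2):
    return unit_cost * 2  # ignores `count`; agrees with solve_a only at the defaults

def fuzz_equivalent(func_a, func_b, aligned, frozen_a, frozen_b, trials=50, seed=0):
    """Feed identical random values to aligned args; keep unaligned args at defaults."""
    rng = random.Random(seed)
    for _ in range(trials):
        kwargs_a, kwargs_b = dict(frozen_a), dict(frozen_b)
        for name_a, name_b in aligned:
            value = rng.randint(1, 50)
            kwargs_a[name_a] = value
            kwargs_b[name_b] = value
        if not math.isclose(func_a(**kwargs_a), func_b(**kwargs_b)):
            return False
    return True

aligned = [("price", "unit_cost"), ("quantity", "count")]
frozen_a = {"shipping": 0}  # unaligned in A, frozen to its default

same_logic = fuzz_equivalent(solve_a, solve_b, aligned, frozen_a, {})
diff_logic = fuzz_equivalent(solve_a, solve_c, aligned, frozen_a, {})
print(same_logic, diff_logic)
```

Freezing `shipping` at its default lets two functions with different signatures be compared on the arguments they share, which is the property the real `fuzz_aligned_pair` relies on.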

In [21]:
# Cell for testing UT-3

def run_ut3_fuzzing_test(file_A: ParsedFile, file_B: ParsedFile):
    """
    A dedicated function to test the fuzzing logic on a given pair of ParsedFile objects.
    """
    print("\n" + "#"*60)
    print(f"### Fuzz-Testing: {file_A.path.name} vs. {file_B.path.name}")
    print("#"*60)

    # Step 1: Align the pair first, as this is a prerequisite for fuzzing.
    alignment = find_best_alignment(file_A, file_B)
    
    if len(alignment.aligned_pairs) < _MIN_ALIGNMENT_FOR_FUZZ:
        print("Alignment failed or found too few common arguments. Skipping fuzz test.")
        return

    print(f"Alignment successful. Found {len(alignment.aligned_pairs)} pairs to fuzz.")
    
    # Step 2: Run the fuzz test.
    is_equivalent = fuzz_aligned_pair(alignment, file_A.func, file_B.func)
    
    # Step 3: Report the result.
    if is_equivalent:
        print("\n✅ RESULT: PASSED. The functions are functionally equivalent.")
    else:
        print("\n❌ RESULT: FAILED. The functions have different logic.")

# --- Test Case 1: Functionally EQUIVALENT Pair (Should PASS) ---
print("="*20 + " Test Case 1: Equivalent Functions " + "="*20)

# Define two functions that are identical in logic but have different arg names
def equivalent_A(price: int = 10, num_items: int = 5, tax: int = 2):
    return price * num_items + tax

def equivalent_B(unit_cost: int = 10, quantity: int = 5, flat_fee: int = 2):
    return unit_cost * quantity + flat_fee

# Create mock ParsedFile objects for them
mock_file_A = ParsedFile(
    path=Path("mock_equivalent_A.py"), module_code="", func=equivalent_A,
    args=[
        Argument("price", "int", 10, "item price"),
        Argument("num_items", "int", 5, "number of items"),
        Argument("tax", "int", 2, "sales tax")
    ]
)
mock_file_B = ParsedFile(
    path=Path("mock_equivalent_B.py"), module_code="", func=equivalent_B,
    args=[
        Argument("unit_cost", "int", 10, "cost per unit"),
        Argument("quantity", "int", 5, "amount of items"),
        Argument("flat_fee", "int", 2, "service fee")
    ]
)

# Run the test on the equivalent pair
run_ut3_fuzzing_test(mock_file_A, mock_file_B)


# --- Test Case 2: Functionally DIFFERENT Pair (Should FAIL) ---
print("\n" + "="*20 + " Test Case 2: Different Functions " + "="*20)

# different_A and different_C return the same result (20) with the defaults, but use different logic.
def different_A(price: int = 10, quantity: int = 2):
    # Logic is price * quantity
    return price * quantity

def different_B(price: int = 10, quantity: int = 2):
    # Repeated addition: equivalent to price * quantity for the fuzzed integer range,
    # so this function would NOT fail against different_A. It is not used below.
    return sum([price for _ in range(quantity)])

# Create mock ParsedFile objects
mock_file_C = ParsedFile(
    path=Path("mock_different_A.py"), module_code="", func=different_A,
    args=[
        Argument("price", "int", 10, "item price"),
        Argument("quantity", "int", 2, "number of items")
    ]
)

# different_A and different_B agree for every fuzzed quantity (for example, both
# return 30 when quantity=3), so they are not a failing case. different_C below
# is a genuine mismatch.
def different_C(price: int = 10, quantity: int = 2):
    # Always returns price * 2; `quantity` is ignored, so it matches different_A only when quantity == 2.
    return price * 2

mock_file_D = ParsedFile(
    path=Path("mock_different_C.py"), module_code="", func=different_C,
    args=[
        Argument("price", "int", 10, "item price"),
        Argument("quantity", "int", 2, "number of items")
    ]
)

# Run the test on the different pair
run_ut3_fuzzing_test(mock_file_C, mock_file_D)


############################################################
### Fuzz-Testing: mock_equivalent_A.py vs. mock_equivalent_B.py
############################################################
Alignment successful. Found 1 pairs to fuzz.

✅ RESULT: PASSED. The functions are functionally equivalent.


############################################################
### Fuzz-Testing: mock_different_A.py vs. mock_different_C.py
############################################################
Alignment successful. Found 2 pairs to fuzz.

❌ RESULT: FAILED. The functions have different logic.


### Test 4: UT-3 (Fuzzing for Functional Equivalence)

This cell defines the final validation test. It redefines `run_ut3_fuzzing_test`, replacing the single-pair version from the previous cell: the new version takes the list of UT-0 survivors, tests every possible pair for functional equivalence, and returns a list of the pairs that passed the fuzz test.
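
Because every pair is tested, the workload grows quadratically with the number of survivors. The sketch below estimates it for the two file counts reported in the earlier test output (9 and 8 files), assuming `_FUZZ_EXAMPLES = 50` as an upper bound on draws per pair; the helper name `pairwise_workload` is hypothetical.

```python
import itertools
import math

FUZZ_EXAMPLES = 50  # mirrors _FUZZ_EXAMPLES; an upper bound on draws per pair

def pairwise_workload(num_files: int) -> tuple[int, int]:
    """Return (number of pairs, maximum number of paired solve() evaluations)."""
    pairs = len(list(itertools.combinations(range(num_files), 2)))
    assert pairs == math.comb(num_files, 2)  # same count via n choose 2
    return pairs, pairs * FUZZ_EXAMPLES

workload_9 = pairwise_workload(9)
workload_8 = pairwise_workload(8)
print(workload_9, workload_8)
```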

In [25]:
# Cell for UT-3 Fuzzing Test

def run_ut3_fuzzing_test(survivors_ut0: list[ParsedFile]) -> list[tuple[ParsedFile, ParsedFile]]:
    """
    Takes UT-0 survivors, finds all aligned pairs, and runs UT-3 fuzzing on them.
    Returns a list of all pairs that passed the fuzzing test.
    """
    print("\n" + "="*20 + " Test 4: UT-3 (Fuzzing Equivalence) " + "="*20)
    if not survivors_ut0 or len(survivors_ut0) < 2:
        print("Fewer than 2 survivors from UT-0. Cannot perform fuzzing.")
        return []

    print(f"--- Running fuzzing on all possible pairs from {len(survivors_ut0)} files ---")
    
    validated_pairs = []

    for file_A, file_B in itertools.combinations(survivors_ut0, 2):
        print("\n" + "#"*60)
        print(f"### Testing: {file_A.path.name} (A) vs. {file_B.path.name} (B)")
        print("#"*60)

        # 1. Align the pair
        alignment = find_best_alignment(file_A, file_B)
        if len(alignment.aligned_pairs) < _MIN_ALIGNMENT_FOR_FUZZ:
            print("  - SKIPPED: Alignment found too few common arguments.")
            continue
        
        print(f"  - Alignment found {len(alignment.aligned_pairs)} pairs. Proceeding to fuzz.")

        # 2. Run the fuzz test
        is_equivalent = fuzz_aligned_pair(alignment, file_A.func, file_B.func)
        
        # 3. Report result and store successful pairs
        if is_equivalent:
            print("  - ✅ RESULT: PASSED. The functions are functionally equivalent.")
            validated_pairs.append((file_A, file_B))
        else:
            print("  - ❌ RESULT: FAILED. The functions have different logic.")
            
    print("\n" + "="*60)
    print(f"Fuzzing complete. Found {len(validated_pairs)} functionally equivalent pairs.")
    print("="*60)
    
    return validated_pairs

In [26]:
# --- Execute the Fuzzing Test ---
# This uses the `ut0_survivors_4483` variable created by the previous test cells.
if 'ut0_survivors_4483' in locals() and ut0_survivors_4483:
    # This will run the fuzz test on all pairs for problem 4483.
    # It might take a minute or two depending on the number of files and pairs.
    fully_validated_pairs = run_ut3_fuzzing_test(ut0_survivors_4483)
else:
    print("Variable 'ut0_survivors_4483' not found or empty. Please run the preceding test cells for this problem first.")


--- Running fuzzing on all possible pairs from 9 files ---

############################################################
### Testing: anthropic_claude-3-5-haiku-20241022.py (A) vs. google_gemini-2.5-flash.py (B)
############################################################
  - Alignment found 4 pairs. Proceeding to fuzz.
  - ✅ RESULT: PASSED. The functions are functionally equivalent.

############################################################
### Testing: anthropic_claude-3-5-haiku-20241022.py (A) vs. google_gemini-2.5-flash-lite-preview-06-17.py (B)
############################################################
  - Alignment found 3 pairs. Proceeding to fuzz.
  - ✅ RESULT: PASSED. The functions are functionally equivalent.

############################################################
### Testing: anthropic_claude-3-5-haiku-20241022.py (A) vs. openai_o3-mini.py (B)
############################################################
  - Alignment found 3 pairs. Proceeding to fuzz.
  - ✅ RESUL

In [27]:
# --- Execute the Fuzzing Test ---
# This uses the `ut0_survivors_636` variable created by the previous test cells.
if 'ut0_survivors_636' in locals() and ut0_survivors_636:
    # This will run the fuzz test on all pairs for problem 636.
    # It might take a minute or two depending on the number of files and pairs.
    fully_validated_pairs = run_ut3_fuzzing_test(ut0_survivors_636)
else:
    print("Variable 'ut0_survivors_636' not found or empty. Please run the preceding test cells for this problem first.")


--- Running fuzzing on all possible pairs from 8 files ---

############################################################
### Testing: anthropic_claude-3-5-haiku-20241022.py (A) vs. google_gemini-2.5-flash.py (B)
############################################################
  - Alignment found 3 pairs. Proceeding to fuzz.
  - ❌ RESULT: FAILED. The functions have different logic.

############################################################
### Testing: anthropic_claude-3-5-haiku-20241022.py (A) vs. google_gemini-2.5-flash-lite-preview-06-17.py (B)
############################################################
  - Alignment found 3 pairs. Proceeding to fuzz.
  - ✅ RESULT: PASSED. The functions are functionally equivalent.

############################################################
### Testing: anthropic_claude-3-5-haiku-20241022.py (A) vs. openai_o3-mini.py (B)
############################################################
  - Alignment found 4 pairs. Proceeding to fuzz.
  - ❌ RESULT: FAIL
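
The PASSED/FAILED verdicts above come from the "Fuzz Aligned, Freeze Unaligned" strategy. The following is a minimal sketch of that idea, not the code of `run_ut3_fuzzing_test`: the toy `solve_*` functions, the `alignment` dictionary, the value range, and the tolerance are all assumptions made for illustration. Aligned parameters receive the same random value in both functions, while unaligned parameters stay at their defaults.

```python
import inspect
import math
import random

def defaults_of(fn):
    """Map each parameter name of fn to its default value."""
    return {name: p.default for name, p in inspect.signature(fn).parameters.items()}

def fuzz_equivalent(fn_a, fn_b, alignment, trials=100, seed=0):
    """Fuzz aligned parameters with shared random values; freeze the rest at defaults."""
    rng = random.Random(seed)
    base_a, base_b = defaults_of(fn_a), defaults_of(fn_b)
    for _ in range(trials):
        args_a, args_b = dict(base_a), dict(base_b)
        for name_a, name_b in alignment.items():
            value = rng.randint(1, 100)  # assumed fuzz range
            args_a[name_a] = value
            args_b[name_b] = value
        if not math.isclose(fn_a(**args_a), fn_b(**args_b), rel_tol=1e-9, abs_tol=1e-9):
            return False  # outputs diverge: the logic differs
    return True

# Hypothetical solutions that all return 5 for their default values.
def solve_a(apples=3, price=2, discount=1):
    return apples * price - discount

def solve_b(unit_cost=2, num_apples=3):
    return num_apples * unit_cost - 1  # discount is hard-coded

def solve_c(unit_cost=2, num_apples=3):
    return num_apples + unit_cost  # right default answer, wrong logic

alignment = {"apples": "num_apples", "price": "unit_cost"}
print(fuzz_equivalent(solve_a, solve_b, alignment))
print(fuzz_equivalent(solve_a, solve_c, alignment))
```

`solve_c` passes a default-value check like UT-0 yet fails the fuzz test, which is the kind of disagreement a FAILED result flags.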

### Final Test: Run End-to-End Analysis and Scoring

This cell executes the entire validation pipeline for a single problem by calling the main `analyze_problem_outputs` function. It performs all steps automatically:

1.  Parses all files for the given problem index.
2.  Filters them with the UT-0 answer check.
3.  Tests every possible pair for alignment (UT-2) and functional equivalence (UT-3).
4.  Calculates a `PairwiseQualityScore` for every successfully validated pair.
5.  Finds the best consensus clique among the models.
6.  Computes and prints the final `ConfidenceScore` in a summary report.

To test a new problem, you only need to change the `PROBLEM_INDEX` and `GOLD_ANSWER` variables in this cell and re-run it.
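
Steps 5 and 6 can be illustrated with a small sketch. It is not the code of `analyze_problem_outputs`: it brute-forces the largest clique, which is only practical for a handful of models, and breaks ties by taking the first clique in sorted order. The scores are copied from the six validated pairs of the 4-way consensus shown later in this notebook; `other_model.py` is a hypothetical fifth file with no validated pairs, and the function names are illustrative. The consensus bonus is left out because its formula is not shown here.

```python
from itertools import combinations

def largest_clique(models, scores):
    """Return the largest group in which every pair has a validated score."""
    for size in range(len(models), 1, -1):
        for group in combinations(sorted(models), size):
            if all(frozenset(pair) in scores for pair in combinations(group, 2)):
                return group
    return ()

def mean_pairwise_score(clique, scores):
    """Average the pairwise quality scores inside the clique."""
    values = [scores[frozenset(pair)] for pair in combinations(clique, 2)]
    return sum(values) / len(values) if values else 0.0

o3 = "openai_o3-mini.py"
gpt = "openai_gpt-4.1.py"
gem = "google_gemini-2.0-flash-thinking-exp.py"
mini = "openai_gpt-4.1-mini.py"
other = "other_model.py"  # hypothetical: passed UT-0 but validated with no one

# Pairwise quality scores keyed by unordered pair.
scores = {
    frozenset({o3, gpt}): 0.970,
    frozenset({o3, gem}): 0.970,
    frozenset({o3, mini}): 0.974,
    frozenset({gpt, gem}): 0.992,
    frozenset({gpt, mini}): 0.993,
    frozenset({gem, mini}): 0.986,
}

clique = largest_clique([o3, gpt, gem, mini, other], scores)
average = mean_pairwise_score(clique, scores)
print(f"{len(clique)}-way agreement, average pairwise quality {average:.4f}")
```

With these inputs the sketch reports a 4-way agreement and an average of 0.9808, matching the "Average Pairwise Quality in Clique" line of that summary.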

In [30]:
PROBLEM_INDEX = 4483
GOLD_ANSWER = 100.0

# --- Execute the Full Pipeline ---
problem_dir = BASE_DIR / str(PROBLEM_INDEX)
if not problem_dir.is_dir():
    print(f"Error: Directory not found at {problem_dir}")
else:
    # This single function call runs the entire validation and scoring process
    analyze_problem_outputs(problem_dir, GOLD_ANSWER)


Found and parsed 9 files.
9 files passed UT-0 (correct default answer).
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, google_gemini-2.5-flash.py), Score: 0.817
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, google_gemini-2.5-flash-lite-preview-06-17.py), Score: 0.678
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, openai_o3-mini.py), Score: 0.656
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, openai_gpt-4.1.py), Score: 0.970
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, google_gemini-2.0-flash-thinking-exp.py), Score: 0.817
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, openai_gpt-4.1-mini.py), Score: 0.391
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, google_gemini-2.5-pro.py), Score: 0.958
  ✓ Validated Pair: (google_gemini-2.5-flash.py, google_gemini-2.5-flash-lite-preview-06-17.py), Score: 0.838
  ✓ Validated Pair: (google_gemini-2.5-flash.py, openai_o3-mini.py), Score: 0.609
  ✓ Va

In [31]:
PROBLEM_INDEX = 636
GOLD_ANSWER = 52.0

# --- Execute the Full Pipeline ---
problem_dir = BASE_DIR / str(PROBLEM_INDEX)
if not problem_dir.is_dir():
    print(f"Error: Directory not found at {problem_dir}")
else:
    # This single function call runs the entire validation and scoring process
    analyze_problem_outputs(problem_dir, GOLD_ANSWER)


Found and parsed 8 files.
8 files passed UT-0 (correct default answer).
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, google_gemini-2.5-flash-lite-preview-06-17.py), Score: 0.784
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, openai_gpt-4.1-mini.py), Score: 0.784
  ✓ Validated Pair: (google_gemini-2.5-flash.py, openai_o3-mini.py), Score: 0.726
  ✓ Validated Pair: (google_gemini-2.5-flash.py, openai_gpt-4.1.py), Score: 0.820
  ✓ Validated Pair: (google_gemini-2.5-flash.py, google_gemini-2.0-flash-thinking-exp.py), Score: 0.963
  ✓ Validated Pair: (google_gemini-2.5-flash.py, google_gemini-2.5-pro.py), Score: 0.743
  ✓ Validated Pair: (google_gemini-2.5-flash-lite-preview-06-17.py, openai_gpt-4.1-mini.py), Score: 0.953
  ✓ Validated Pair: (openai_o3-mini.py, openai_gpt-4.1.py), Score: 0.660
  ✓ Validated Pair: (openai_o3-mini.py, google_gemini-2.0-flash-thinking-exp.py), Score: 0.736
  ✓ Validated Pair: (openai_o3-mini.py, google_gemini-2.5-pro.py), Score: 0

In [33]:
from datasets import load_dataset

# Load the GSM8K dataset (train split)
gsm8k_train = load_dataset("gsm8k", "main", split="train")

def extract_gsm8k_answer(gsm8k_data, index):
    """
    Extracts the final numerical answer from a GSM8K sample.
    Args:
        gsm8k_data: The loaded GSM8K dataset (e.g., gsm8k_train).
        index: The integer index of the sample.
    Returns:
        The answer as a float if possible, else as a string.
    """
    answer_text = gsm8k_data[index]['answer']
    # The answer is after the last '####'
    if '####' in answer_text:
        answer = answer_text.split('####')[-1].strip()
    else:
        answer = answer_text.strip()
    # Convert to float (dropping thousands separators); fall back to the raw string
    try:
        return float(answer.replace(',', ''))
    except ValueError:
        return answer
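
The parsing rule can be checked without downloading the dataset. The sketch below applies the same logic to a made-up answer string (not a real GSM8K record); `parse_final_answer` is a hypothetical stand-alone helper that takes the answer text directly.

```python
def parse_final_answer(answer_text):
    """Return the value after the last '####' as a float, or the raw text if not numeric."""
    final = answer_text.split("####")[-1].strip()
    try:
        return float(final.replace(",", ""))
    except ValueError:
        return final

# Made-up sample in the GSM8K answer layout: reasoning, then '#### <answer>'.
sample = "She sells 1,200 cookies at $2 each.\n1200 * 2 = 2400\n#### 2,400"
print(parse_final_answer(sample))  # 2400.0
```

Because `split('####')[-1]` returns the whole string when the marker is absent, the sketch needs no separate branch for that case.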

In [37]:
ans = extract_gsm8k_answer(gsm8k_train, 4483)
print(ans)

100.0


In [38]:
output_indices = sorted([3331, 1647, 636, 399, 4670, 5918, 1531, 7364, 5464, 1205, 3518, 6732, 3779, 4483, 6237, 1202, 2345])

# --- Execute the Full Pipeline for all samples ---

for PROBLEM_INDEX in output_indices:
    GOLD_ANSWER = extract_gsm8k_answer(gsm8k_train, PROBLEM_INDEX)
    print(f"\nProcessing Problem Index: {PROBLEM_INDEX}, Gold Answer: {GOLD_ANSWER}")

    problem_dir = BASE_DIR / str(PROBLEM_INDEX)
    if not problem_dir.is_dir():
        print(f"Error: Directory not found at {problem_dir}")
    else:
        # This single function call runs the entire validation and scoring process
        analyze_problem_outputs(problem_dir, GOLD_ANSWER)


Processing Problem Index: 399, Gold Answer: 30.0

Found and parsed 8 files.
6 files passed UT-0 (correct default answer).


[UT-0 Fail] google_gemini-2.5-flash-lite-preview-06-17.py raised ZeroDivisionError('float division by zero')
[UT-0 Fail] openai_gpt-4.1.py raised ZeroDivisionError('float division by zero')


  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, google_gemini-2.5-flash.py), Score: 0.721
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, openai_o3-mini.py), Score: 0.866
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, google_gemini-2.0-flash-thinking-exp.py), Score: 0.722
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, openai_gpt-4.1-mini.py), Score: 0.868
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, google_gemini-2.5-pro.py), Score: 0.730
  ✓ Validated Pair: (google_gemini-2.5-flash.py, google_gemini-2.0-flash-thinking-exp.py), Score: 0.682
  ✓ Validated Pair: (google_gemini-2.5-flash.py, openai_gpt-4.1-mini.py), Score: 0.659
  ✓ Validated Pair: (google_gemini-2.5-flash.py, google_gemini-2.5-pro.py), Score: 0.671
  ✓ Validated Pair: (openai_o3-mini.py, google_gemini-2.0-flash-thinking-exp.py), Score: 0.779
  ✓ Validated Pair: (openai_o3-mini.py, openai_gpt-4.1-mini.py), Score: 0.779
  ✓ Validated Pair: (openai_o3-min

[Parser Error] Skipping google_gemini-2.5-pro.py: InvalidInput('Cannot parse: 1:9: ```python')


Found and parsed 6 files.
5 files passed UT-0 (correct default answer).
  ✓ Validated Pair: (openai_o3-mini.py, openai_gpt-4.1.py), Score: 0.970
  ✓ Validated Pair: (openai_o3-mini.py, google_gemini-2.0-flash-thinking-exp.py), Score: 0.970
  ✓ Validated Pair: (openai_o3-mini.py, openai_gpt-4.1-mini.py), Score: 0.974
  ✓ Validated Pair: (openai_gpt-4.1.py, google_gemini-2.0-flash-thinking-exp.py), Score: 0.992
  ✓ Validated Pair: (openai_gpt-4.1.py, openai_gpt-4.1-mini.py), Score: 0.993
  ✓ Validated Pair: (google_gemini-2.0-flash-thinking-exp.py, openai_gpt-4.1-mini.py), Score: 0.986

--------------------------------------------------
                 VALIDATION SUMMARY
--------------------------------------------------
Best Consensus Found: 4-way agreement
Models in Best Clique:
  - google_gemini-2.0-flash-thinking-exp.py
  - openai_gpt-4.1-mini.py
  - openai_gpt-4.1.py
  - openai_o3-mini.py

Average Pairwise Quality in Clique: 0.9808
Consensus Bonus Multiplier: x1.2
FINAL CONFIDENCE 

[Parser Error] Skipping google_gemini-2.5-pro.py: InvalidInput('Cannot parse: 1:9: ```python')


Found and parsed 7 files.
6 files passed UT-0 (correct default answer).
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, google_gemini-2.5-flash.py), Score: 0.748
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, openai_o3-mini.py), Score: 0.793
  ✓ Validated Pair: (anthropic_claude-3-5-haiku-20241022.py, google_gemini-2.0-flash-thinking-exp.py), Score: 0.571
  ✓ Validated Pair: (google_gemini-2.5-flash.py, openai_o3-mini.py), Score: 0.658
  ✓ Validated Pair: (google_gemini-2.5-flash.py, openai_gpt-4.1.py), Score: 0.837
  ✓ Validated Pair: (google_gemini-2.5-flash.py, google_gemini-2.0-flash-thinking-exp.py), Score: 0.967
  ✓ Validated Pair: (google_gemini-2.5-flash.py, openai_gpt-4.1-mini.py), Score: 0.831
  ✓ Validated Pair: (openai_o3-mini.py, openai_gpt-4.1.py), Score: 0.756
  ✓ Validated Pair: (openai_o3-mini.py, google_gemini-2.0-flash-thinking-exp.py), Score: 0.641
  ✓ Validated Pair: (openai_o3-mini.py, openai_gpt-4.1-mini.py), Score: 0.747
  ✓ Validated 