## Welcome to the Second Lab - Week 1, Day 3

Today we will work with lots of models! This is a way to get comfortable with APIs.

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Important point - please read</h2>
            <span style="color:#ff7800;">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations.<br/><br/>If you have time, I'd love it if you submit a PR for changes in the community_contributions folder - instructions in the resources. Also, if you have a GitHub account, use this to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...
            </span>
        </td>
    </tr>
</table>

In [1]:
# Start with imports - ask ChatGPT to explain any package that you don't know

import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic
from IPython.display import Markdown, display

In [2]:
# Always remember to do this!
load_dotenv(override=True)

True

In [3]:
# Print the key prefixes to help with any debugging

openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')
deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
groq_api_key = os.getenv('GROQ_API_KEY')

if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")
    
if anthropic_api_key:
    print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
else:
    print("Anthropic API Key not set (and this is optional)")

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:2]}")
else:
    print("Google API Key not set (and this is optional)")

if deepseek_api_key:
    print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
else:
    print("DeepSeek API Key not set (and this is optional)")

if groq_api_key:
    print(f"Groq API Key exists and begins {groq_api_key[:4]}")
else:
    print("Groq API Key not set (and this is optional)")

OpenAI API Key exists and begins sk-proj-
Anthropic API Key exists and begins sk-ant-
Google API Key exists and begins AI
DeepSeek API Key exists and begins sk-
Groq API Key exists and begins gsk_
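
The five near-identical checks above can also be written as a single loop. This is only a sketch of an alternative layout: the names `KEY_SPECS`, `describe_key` and `describe_all` are mine, not part of the lab, and the environment is passed in as a plain mapping so you can try it without real keys.

```python
import os

# (label, environment variable, prefix characters to show, optional?)
# This table mirrors the checks in the cell above; the names are illustrative.
KEY_SPECS = [
    ("OpenAI", "OPENAI_API_KEY", 8, False),
    ("Anthropic", "ANTHROPIC_API_KEY", 7, True),
    ("Google", "GOOGLE_API_KEY", 2, True),
    ("DeepSeek", "DEEPSEEK_API_KEY", 3, True),
    ("Groq", "GROQ_API_KEY", 4, True),
]


def describe_key(label, value, prefix_len, optional):
    """Return the same status message the original cell prints for one key."""
    if value:
        return f"{label} API Key exists and begins {value[:prefix_len]}"
    suffix = " (and this is optional)" if optional else ""
    return f"{label} API Key not set{suffix}"


def describe_all(env=os.environ):
    """Build one status line per provider from a mapping of environment variables."""
    return [
        describe_key(label, env.get(var), prefix_len, optional)
        for label, var, prefix_len, optional in KEY_SPECS
    ]


for line in describe_all():
    print(line)
```

Adding another provider then means adding one row to the table rather than another `if`/`else` block.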


In [4]:
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. "
request += "Answer only with the question, no explanation."
messages = [{"role": "user", "content": request}]

In [5]:
messages

[{'role': 'user',
  'content': 'Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. Answer only with the question, no explanation.'}]

In [7]:
openai = OpenAI()
response = openai.chat.completions.create(
    model="gpt-5-mini",
    messages=messages,
)
question = response.choices[0].message.content
display(Markdown(question))


Design a comprehensive, adversarial‑resistant test-suite that reliably distinguishes between black‑box LLMs that rely primarily on pattern matching and memorization and LLMs that demonstrate genuine compositional reasoning and robust generalization; for your test‑suite, provide (1) at least six concrete task types with example inputs and expected outputs that probe different aspects of compositional reasoning (e.g., systematic generalization, causal reasoning, abstraction, counterfactuals, planning, variable‑binding), (2) a clear scoring rubric for each task, (3) protocols to prevent or detect memorized answers and prompt‑engineering gaming (including generation of unseen variants and statistical controls), (4) proposed sample sizes and statistical tests to assert with high confidence a difference in capability, (5) possible adversarial behaviors the LLM might use to appear capable and how your design addresses them, and (6) an experimental validation plan including metrics for false positives/negatives and how to iterate on the suite based on results?

In [8]:
competitors = []
answers = []
messages = [{"role": "user", "content": question}]

## Note - update since the videos

I've updated the model names to use the latest models below, like GPT 5 and Claude Sonnet 4.5. It's worth noting that these models can be quite slow - like 1-2 minutes - but they do a great job! Feel free to switch them for faster models if you'd prefer, like the ones I use in the video.

In [None]:
# The API we know well
# This cell uses gpt-5-mini so that it responds reasonably quickly
# Swap in a larger model such as gpt-5 if you don't mind waiting 1-2 mins while it thinks

model_name = "gpt-5-mini"

response = openai.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

Below is a complete, practical test‑suite design for reliably distinguishing LLMs that primarily pattern‑match / memorize from LLMs that show genuine compositional reasoning and robust generalization. It is written to be implementable against black‑box LLMs (only I/O access). The suite mixes synthetic, randomized and structured tasks, uses statistical controls and adversarial checks, and includes validation and iteration plans.

Summary of components
- (1) Six task types (with concrete examples and expected outputs) that probe different aspects of compositional reasoning.
- (2) A clear scoring rubric for each task.
- (3) Protocols to prevent/detect memorized answers and prompt‑engineering gaming (randomization, held‑out vocab, paraphrase invariance, adversarial decoys, statistical controls).
- (4) Sample sizes and statistical tests for confident capability differences.
- (5) Likely adversarial behaviors and mitigations.
- (6) Experimental validation plan with metrics and iteration strategy.

1) Task types — descriptions, concrete inputs and expected outputs
Each task type includes rationale (what aspect it probes), a canonical task template, 2 example items and their expected outputs. All tasks are designed to support automatic grading by canonicalization or deterministic execution where feasible.

Important: In production tests, instantiate each template with many randomized variants (random tokens, symbol renamings, unseen combinations). Use synthetic vocabulary (random strings) in many items (see protocols below) to avoid memorization.

Task A — Systematic generalization (composition of primitive instructions)
- What it probes: ability to apply known primitive operations in novel combinations, i.e., SCAN‑style compositionality / systematicity.
- Template: Define a small set of primitive actions and composition operators. Ask model to produce ground‑truth composed actions.
- Example primitives (in an item):
  "Primitive actions: WALK (W) = move forward 1; JUMP (J) = move forward 2; TURN_LEFT (L) = rotate left. Composition: X and then Y means do X then Y; twice(X) means do X two times; opposite(X) means do TURN_LEFT twice then X then TURN_LEFT twice."
  Input 1: "Instruction: opposite(twice(JUMP)) then WALK."
  Expected output 1: The sequence of primitive actions expanded: "TURN_LEFT, TURN_LEFT, JUMP, JUMP, TURN_LEFT, TURN_LEFT, WALK" (canonicalized as comma‑sep).
  Input 2 (novel composition): "Instruction: twice(opposite(WALK)) then JUMP."
  Expected output 2: "TURN_LEFT, TURN_LEFT, WALK, TURN_LEFT, TURN_LEFT, TURN_LEFT, TURN_LEFT, WALK, TURN_LEFT, TURN_LEFT, JUMP"
- Why hard for memorization: create many primitives and composition rules and withhold many combinations from training set; test on novel nesting depths and permutations.
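
As a rough way to compute ground truth for Task A items, here is a minimal sketch of an expander for the rules quoted in the example. It assumes that `then` only appears at the top level of an instruction (never inside parentheses); the function names are hypothetical.

```python
PRIMITIVES = {"WALK", "JUMP", "TURN_LEFT"}
TURN_AROUND = ["TURN_LEFT", "TURN_LEFT"]


def expand_term(term):
    """Expand one term such as 'opposite(twice(JUMP))' into primitive actions."""
    term = term.strip()
    if term.startswith("twice(") and term.endswith(")"):
        # twice(X): do X two times
        return expand_term(term[len("twice("):-1]) * 2
    if term.startswith("opposite(") and term.endswith(")"):
        # opposite(X): TURN_LEFT twice, then X, then TURN_LEFT twice
        return TURN_AROUND + expand_term(term[len("opposite("):-1]) + TURN_AROUND
    if term in PRIMITIVES:
        return [term]
    raise ValueError(f"Unknown term: {term!r}")


def expand(instruction):
    """Expand 'X then Y then ...' into a comma-separated action sequence."""
    actions = []
    for term in instruction.split(" then "):  # assumes no nested 'then'
        actions.extend(expand_term(term))
    return ", ".join(actions)


print(expand("opposite(twice(JUMP)) then WALK"))
print(expand("twice(opposite(WALK)) then JUMP"))
```

Under these rules, `twice(opposite(WALK)) then JUMP` expands to eleven actions, because each copy of `opposite(WALK)` contributes its own pair of turns before and after `WALK`.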

Task B — Variable binding and long‑range reference (symbolic variables)
- What it probes: ability to bind variables to values, carry and reuse bindings across steps, disambiguate references like "the one referenced two steps earlier".
- Template: Provide assignments, transformations and queries referencing earlier variables (use randomized variable names).
- Example:
  Input 1:
    "Let a = 7, b = a + 5, c = 2*b. Now update a = c - 3. What is b + a?"
  Expected output 1: compute stepwise:
    b = a + 5 (using original a=7) => b=12
    c = 2*b => c=24
    a = c - 3 => a=21
    b + a = 12 + 21 = 33
    Output: "33"
  Input 2 (using randomly named vars, novel referencing):
    "Let X1 = 4, Y_2 = X1 * 3, Z = Y_2 - X1. Then set X1 = Z + 1. What is Y_2 - X1?"
  Expected output 2:
    Y_2 = 12, Z = 8, X1 becomes 9. Y_2 - X1 = 3. Output: "3"
- Why hard for memorization: use long sequences (10+ assignments) and random token names; require maintaining bindings across updates.
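
The expected answers for Task B items can be computed mechanically. The sketch below is one way to do that, assuming each item has already been parsed into `(name, expression)` pairs and that expressions only use `+`, `-` and `*`; the helper names are hypothetical.

```python
import ast
import operator

# Only the arithmetic operators that appear in the Task B examples.
OPERATORS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def evaluate(expression, env):
    """Evaluate an arithmetic expression using the current variable bindings."""
    def walk(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            return env[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in OPERATORS:
            return OPERATORS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"Unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval").body)


def run_item(assignments, query):
    """Apply assignments in order (later ones overwrite earlier bindings), then answer the query."""
    env = {}
    for name, expression in assignments:
        env[name] = evaluate(expression, env)
    return evaluate(query, env)


item_1 = [("a", "7"), ("b", "a + 5"), ("c", "2*b"), ("a", "c - 3")]
print(run_item(item_1, "b + a"))  # 33

item_2 = [("X1", "4"), ("Y_2", "X1 * 3"), ("Z", "Y_2 - X1"), ("X1", "Z + 1")]
print(run_item(item_2, "Y_2 - X1"))  # 3
```

Note that `b` keeps the value computed from the original `a`; updating `a` later does not re-evaluate `b`, which is exactly the binding behavior the task is probing.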

Task C — Causal reasoning and counterfactuals (graph / structural causal model)
- What it probes: causal inference and do‑interventions, not mere correlation or associative retrieval.
- Template: Give a small causal graph or set of structural equations. Ask consequence of interventions or counterfactual statements.
- Example:
  Input 1:
    "Variables: A -> B (B = A + noise), B -> C (C = 2*B). If we set A = 0 (do(A=0)), what happens to C? Provide final numeric relation."
  Expected output 1:
    After do(A=0): B = 0 (+noise mean 0) → C = 0. Output: "C becomes 0 (given deterministic functions, C = 2*B, so C = 0)."
  Input 2 (counterfactual):
    "Structural model: X = 3*U, Y = X + V. Observed U=2, V=1 (so observed X=6, Y=7). Counterfactual: if U had been 1 (but V unchanged), what would Y have been?"
  Expected output 2:
    New X = 3*1 = 3. Y = X + V = 3 + 1 = 4. Output: "4"
- Why hard for memorization: Use freshly generated graphs and numeric values, ask do‑intervention vs observation differences; require understanding intervention semantics.
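
The counterfactual in Input 2 follows the usual three steps: recover the background terms from the observation, change the one that is hypothesized to differ, and recompute. A minimal sketch for that specific structural model (X = 3*U, Y = X + V); the function name is hypothetical:

```python
def counterfactual_y(observed_x, observed_y, new_u):
    """Counterfactual Y for the model X = 3*U, Y = X + V when U is set to new_u."""
    # Step 1 (abduction): recover the background terms from what was observed.
    u = observed_x / 3           # from X = 3*U
    v = observed_y - observed_x  # from Y = X + V
    # Step 2 (action): replace U with its counterfactual value, keep V unchanged.
    # Step 3 (prediction): push the change through the structural equations.
    x_cf = 3 * new_u
    y_cf = x_cf + v
    return {"U": u, "V": v, "X_cf": x_cf, "Y_cf": y_cf}


result = counterfactual_y(observed_x=6, observed_y=7, new_u=1)
print(result["Y_cf"])  # 4
```

A generator for fresh items would vary the coefficients and observed values, then use the same three steps to produce the expected answer.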

Task D — Abstraction and analogical mapping (learn mapping rule, apply to novel exemplars)
- What it probes: ability to induce abstract relations from examples and apply them to new tokens not seen in training.
- Template: Provide a few mapping examples using a tiny invented "language", then ask to map new items applying the same relation.
- Example:
  Input 1:
    "Rule examples: hefo -> jopi, goro -> luma. Now apply the same transformation to zeta."
    (Transformation pattern: prepend 'j' and substitute vowels mapping e->o, o->u — but hidden to the model; only examples show behavior.)
  Expected output 1:
    If pattern mapping produces j + (vowel shift) then zeta -> juta (example result). But to make deterministic and unambiguous, use explicitly definable transformations in generation pipeline so expected output can be computed.
  Input 2 (using symbols):
    "Examples: blim -> ba-lim, srun -> sa-run. Apply to krup."
    Expected output 2: "ka-rup" (first letter, then "a-", then the remaining letters, as in both examples; or whatever canonical transformation is defined by the example set).
- Why hard for memorization: use arbitrary synthetic token spaces and withhold large portions of mapping space; require generalizing relational rule.

Task E — Planning and hierarchical problem solving (constrained optimization)
- What it probes: ability to plan multi-step actions under constraints, to generalize plan construction rules to larger/new configurations.
- Template: Provide a grid or a pickup/delivery planning problem with constraints (capacity, order) and ask for a (near‑)optimal step sequence or cost.
- Example:
  Input 1:
    "Agent at (0,0) on 3x3 grid. Boxes at (1,0) (A), (2,2) (B). Deliver A then B to goal (0,2). Agent can carry 1 box. Moves: Up/Down/Left/Right cost 1; Pickup/Drop cost 0. Plan minimal steps."
  Expected output 1:
    One optimal plan, canonicalized (reading coordinates as (x, y), with Right increasing x and Up increasing y): "Right, Pickup, Left, Up, Up, Drop, Right, Right, Pickup, Left, Left, Drop", or a canonical minimal step count and sequence. Expected minimal length numeric also acceptable: "Minimal steps = 8 (sequence: ...)"
  Input 2 (scale-up generalization):
    Same rules but 5x5 grid and three boxes with constraint that order must be A then third then B. Ask for plan.
- Why hard for memorization: use random placements, require true planning, and scale tests to larger grids / more boxes than in training.
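
Plans for Task E are easiest to grade by executing them. Below is a minimal sketch of such a simulator for the single-carry rules in Input 1. It assumes coordinates are (x, y) with Right increasing x and Up increasing y, that moving off the grid invalidates the plan, and that a box only counts as delivered when dropped on the goal; the function and argument names are hypothetical.

```python
MOVES = {"Right": (1, 0), "Left": (-1, 0), "Up": (0, 1), "Down": (0, -1)}


def simulate(plan, start, boxes, goal, size, order):
    """Return (is_valid, move_cost) for a plan on a size x size grid."""
    position = start
    carrying = None
    remaining = dict(boxes)  # box name -> position of boxes still on the grid
    delivered = []
    cost = 0
    for step in plan:
        if step in MOVES:
            dx, dy = MOVES[step]
            position = (position[0] + dx, position[1] + dy)
            if not (0 <= position[0] < size and 0 <= position[1] < size):
                return False, cost  # walked off the grid
            cost += 1  # each move costs 1; Pickup/Drop cost 0
        elif step == "Pickup":
            here = [name for name, pos in remaining.items() if pos == position]
            if carrying is not None or not here:
                return False, cost  # already carrying, or nothing to pick up
            carrying = here[0]
            del remaining[carrying]
        elif step == "Drop":
            if carrying is None:
                return False, cost
            if position == goal:
                delivered.append(carrying)
            else:
                remaining[carrying] = position  # dropped short of the goal
            carrying = None
        else:
            return False, cost  # unknown action
    return delivered == order, cost


plan = ["Right", "Pickup", "Left", "Up", "Up", "Drop",
        "Right", "Right", "Pickup", "Left", "Left", "Drop"]
print(simulate(plan, start=(0, 0), boxes={"A": (1, 0), "B": (2, 2)},
               goal=(0, 2), size=3, order=["A", "B"]))  # (True, 8)
```

Checking optimality would need a separate search (for example breadth-first search over agent and box states); this sketch only checks validity and reports the cost.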

Task F — Nested recursion / compositional evaluation (interpreted mini‑language)
- What it probes: ability to parse and evaluate expressions under user‑defined semantics, including nested composition and recursion, and to generalize to deeper nesting than seen in examples.
- Template: Define a tiny functional language (e.g., inc(x) = x+1, dbl(x) = 2*x, swap(a,b) returns pair reversed), then evaluate nested expressions.
- Example:
  Input 1:
    "Definitions: inc(x) = x+1; dbl(x) = 2*x; compose(f,g)(x) = f(g(x)). Evaluate compose(inc, dbl)(3)."
  Expected output 1:
    dbl(3)=6; inc(6)=7. Output: "7"
  Input 2 (deeper / novel nesting with random function names):
    Randomly name functions: f_z(x)=3*x, g_q(x)=x-2. Evaluate f_z(g_q(g_q(10))).
    Expected output 2:
    g_q(10)=8; g_q(8)=6; f_z(6)=18. Output: "18"
- Why hard for memorization: use randomized function names and nesting depths beyond training examples.

Cross‑task consistency probes (meta‑checks)
- After initial answer, rephrase problem, permute variable names or ask logically equivalent forms (alpha‑renaming). A reasoning model should give consistent answers; a memorizer or pattern matcher will often fail.

2) Scoring rubrics
For each task we use a combination of exact match (for deterministic outputs), graded structural equivalence, partial credit for intermediate step correctness, and consistency checks. All scoring should be automated where possible.

General guidelines
- Normalize outputs: strip punctuation, collapse whitespace, canonicalize commutative orders if problem allows, convert numbers to canonical numeric form.
- Where multiple correct sequences exist (planning), accept any plan that meets constraints and minimal (or near‑minimal) length. Validate by deterministic simulator.
- Require explanation optionally as corroboration — explanations are graded separately and used to detect shallow patterning (see later).
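
The normalization step can be sketched as a pair of small canonicalizers, one for action sequences and one for numeric answers. These are illustrative helpers with hypothetical names; a real grader would need more cases (units, fractions, answers embedded in sentences).

```python
import re


def canonical_sequence(answer):
    """Canonicalize an action list: split on commas/whitespace, uppercase, rejoin."""
    tokens = re.split(r"[,\s]+", answer.strip().strip("."))
    return ", ".join(token.upper() for token in tokens if token)


def canonical_number(answer):
    """Canonicalize a numeric answer so that '33', ' 33.0 ' and '"33."' compare equal."""
    cleaned = answer.strip().strip("\"'").rstrip(".")
    value = float(cleaned)  # raises ValueError for non-numeric answers
    return str(int(value)) if value.is_integer() else repr(value)


print(canonical_sequence("turn_left,  Jump ,WALK."))  # TURN_LEFT, JUMP, WALK
print(canonical_number(' "33.0" '))                   # 33
```

Exact-match scoring is then a string comparison between the canonicalized model answer and the canonicalized ground truth.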

Detailed rubrics per task

Task A — Systematic generalization
- 0/1 exact correctness: full credit (1.0) if produced canonical expanded action sequence exactly matches ground truth.
- Partial credit 0.5 if sequence is correct up to reordering of independent commuting actions or contains only superficial tokenization differences.
- 0 if wrong action types or wrong ordering violating composition semantics.
- Bonus +0.2 if model also outputs a correct short reasoning trace (e.g., shows stepwise expansion).

Task B — Variable binding
- Full credit (1.0) if final numeric (or symbolic) answer matches.
- Partial credit 0.5 if intermediate steps are internally inconsistent but final answer reachable by plausible alternative interpretation; give 0 if incorrect.
- Additionally, ask for an explanation of each assignment; give +0.1 if explanation justifies final answer.

Task C — Causal reasoning / counterfactuals
- Full credit (1.0) for correct intervention answer plus correct reasoning (identifying difference between observation and do()).
- Partial credit 0.5 if numeric outcome correct but model incorrectly describes the causal/non‑causal distinction.
- 0 for wrong or inconsistent counterfactual.

Task D — Abstraction / analogy
- Full credit (1.0) for correct mapped output on held‑out tokens.
- Partial credit 0.5 if pattern partially matched (e.g., one of two transformation components correct).
- Zero if incorrect mapping.

Task E — Planning
- Full credit (1.0) if plan meets all constraints and is optimal (or within predetermined optimality gap, e.g., +0 steps for optimal or +1 allowance for near‑optimal).
- Partial credit 0.75 if plan valid but suboptimal within allowed slack; 0.4 if partially valid or violates minor constraints; 0 if plan invalid/unexecutable.
- Also score plan length and feasibility automatically using a simulator.

Task F — Nested evaluation
- Full credit (1.0) if evaluated result correct.
- Partial credit 0.5 if calculation partially correct or correct for a different but plausible semantics.
- 0 if wrong.

Consistency / adversarial checks (applies across tasks)
- Self‑consistency score: ask the same question twice with variable renaming and paraphrase. Award extra trust when answers are stable. If model flips answers > threshold (e.g., >10% of items), mark suspicious.

Aggregate scoring
- Report per‑task accuracy and an overall composite score weighted equally or by task importance.
- Also report error types: arithmetic errors, reference errors, plausibility but contradiction, inconsistent explanation, etc.

3) Protocols to prevent/detect memorized answers and prompt‑engineering gaming
Use a multi‑layered approach: (A) avoid giving any canonical public benchmark examples in test items; (B) randomization and synthetic languages; (C) paraphrase invariance; (D) decoys and traps; (E) statistical controls and cross‑validation; (F) output verification via external simulators.

A. Synthetic, randomized, and held‑out elements
- Random tokens and names: generate variable, function, and symbol names randomly (e.g., strings of 4–6 chars drawn from letters not composing common words). Example: X1 -> "qerf", function "f_x".
- Random numeric seeds: numbers, positions, and graphs sampled uniformly from ranges beyond typical training corpora.
- Use on‑the‑fly generated domain rules (e.g., transformation rules) so items are unlikely to appear in model training.
- For each template, reserve a held‑out set of compositions (combinations of primitives, deeper nestings) not revealed anywhere else. Test on these held‑outs.

B. Paraphrase invariance & renaming checks
- For each item, probe with several paraphrases and alpha‑renamings (rename all symbols consistently). A reasoning model gives consistent answers; a pattern matcher that memorized specific surface forms will often fail on renamed forms.
- Examples:
  Item 1: original problem
  Item 1a: same problem with variable renaming
  Item 1b: logically equivalent restatement
  If answers disagree, down‑weight confidence.
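
Alpha-renaming can be generated automatically. The sketch below renames a given list of symbols consistently using a seeded random generator, so each variant is reproducible. It assumes the symbols appear as whole words in the prompt; the `v_` prefix and the function name are illustrative choices.

```python
import random
import re
import string


def alpha_rename(text, names, seed):
    """Consistently replace each symbol in `names` with a fresh random name."""
    rng = random.Random(seed)
    used = set(names)
    mapping = {}
    for name in names:
        while True:
            candidate = "v_" + "".join(rng.choice(string.ascii_lowercase) for _ in range(4))
            if candidate not in used:
                break
        used.add(candidate)
        mapping[name] = candidate
    # Match whole words only, longest names first, and substitute in one pass
    # so a replacement is never itself renamed.
    alternatives = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text), mapping


prompt = "Let a = 7, b = a + 5, c = 2*b. Now update a = c - 3. What is b + a?"
renamed, mapping = alpha_rename(prompt, ["a", "b", "c"], seed=0)
print(renamed)
print(mapping)
```

Because the renaming does not change the arithmetic, the expected answer for the renamed item is the same as for the original.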

C. Held‑out composition splits (systematic generalization protocol)
- Use compositional splits analogous to SCAN: train/evaluate pairs are constructed so primitives are seen in training but some combinations are held out and used only for testing. Similarly for mapping rules and function compositions. This tests systematic generalization rather than memorization.

D. Trap items and decoys
- Embed "publicly memorizable" variants that look like standard benchmarks but with small modifications. A pattern‑matcher that only recognizes the public form will output the public memorized answer rather than adapt; detect this by comparing against the valid answer for the modified item.
- Adversarial decoys: ask for outputs in both canonical and unusual formats to catch prompt‑tuned models that only work in one style.

E. Explain‑and‑verify protocol
- Request both an answer and a concise 1–3 step justification. Use an automated verifier to check whether the explanation logically entails the answer (e.g., compute intermediate values or check stated steps). Pattern matchers often hallucinate plausible justifications; the verifier will catch mismatches between claimed steps and actual output.

F. Multiple independent samplings + self‑consistency
- Query the model multiple times with temperature sampling to observe variability. Pattern matching models tuned to be deterministic may produce the same memorized output; reasoning models may produce either stable or explainable variations. Use statistics of answer distribution to infer brittleness.

G. Cross‑model and cross‑item statistical controls
- Randomly split items into multiple forms; estimate item difficulty using item‑response theory (IRT) to control for item variance when comparing models.

H. Avoid prompting that guides the model to memorize the test pattern
- Randomize prompt templates; do not leak examples from the test set in prompts. Use standardized prompt skeletons with random syntactic surfaces.

I. Time & leakage controls
- Run tests before and after major model updates; if possible, time‑stamp datasets and store seeds to detect leak. Rotate new item sets periodically.

4) Sample sizes and statistical tests
Goal: confidently assert a difference in capability between two black‑box LLMs (Model A and Model B).

Definitions and assumptions:
- Treat each model's response to an item as correct/incorrect (binary) for primary analysis. Use per‑task accuracy as primary metric; composite score as secondary.
- Tests are paired (same items evaluated by both models). Use paired tests (McNemar’s test or paired permutation) to leverage within‑item correlations.

Power & sample size rules of thumb
- For a two‑proportion comparison (unpaired) with modest effect sizes (difference = 10–20 percentage points), recommended sample per model per task ≈ 300–500 items to get power ≈ 0.8 at alpha = 0.05. But because tests are paired, required number of distinct items is lower.

Paired sample: approximate calculation
- If expecting Model A accuracy = 50% and Model B = 70% on a task (difference 20 ppt), and assuming moderate within‑item correlation, ~150–250 paired items suffices for 80% power (McNemar or paired permutation). If difference expected smaller (e.g., 10 ppt), need ~400–800 items.
- Conservative recommended sample sizes:
  - Per task: 400 items (distinct problems) sampled from randomized template variants.
  - Per composite suite (6 tasks): 6 * 400 = 2400 items total. You can distribute effort (e.g., 400 per high‑importance task, 200 per less critical).

Statistical tests and controls
- Primary test: paired permutation test on per‑item difference (nonparametric, robust) or McNemar’s test for binary paired data.
- Secondary tests:
  - Mixed‑effects logistic regression (item as random effect, model as fixed effect) to control for item difficulty and estimate model effect across heterogeneous items.
  - Bootstrapped confidence intervals (resample items) for per‑task accuracy difference.
  - Multiple comparisons correction: Benjamini–Hochberg when testing multiple tasks.
- Report effect sizes (difference in proportions, odds ratio), 95% CI, and p‑values.
- Power analyses: compute required sample size for expected minimal detectable effect size BEFORE running full test; pilot with small N to estimate item variance.
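
For binary paired results, McNemar's test only looks at the discordant items (those where exactly one model is correct). Here is a minimal standard-library sketch of the exact (binomial) version; the function names are hypothetical, and a statistics package would normally be used in practice.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count items only model A got right (b) and items only model B got right (c)."""
    b = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if a_ok and not b_ok)
    c = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if b_ok and not a_ok)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant items, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant item favors either model with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


model_a = [True] * 10 + [False] * 5
model_b = [False] * 10 + [False] * 5
b, c = discordant_counts(model_a, model_b)
print(b, c, mcnemar_exact(b, c))  # 10 0 0.001953125
```

Items that both models get right, or both get wrong, do not affect the p-value, which is why pairing on the same items increases power.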

Decision thresholds
- Define thresholds for claiming superiority. Example: model A is better than B on task if:
  - Paired difference in accuracy > δ (e.g., δ = 0.10) AND
  - p < 0.01 after multiple comparisons correction AND
  - Effect robust to bootstrap and mixed‑effects regression controlling for item variance.

5) Possible adversarial behaviors and countermeasures
Below are adversarial strategies a model might use to appear capable, and corresponding mitigations in the suite.

Adversarial: Surface retrieval / memorized sequence regurgitation
- Behavior: Model stores fixed outputs for common instructions; returns memorized outputs for superficially similar prompts.
- Mitigation: synthetic tokens, held‑out compositions, trap items that mimic public benchmarks with slight changes; paraphrase and renaming consistency checks; require correct computation on non‑natural synthetic data.

Adversarial: Template matching / prompt engineering (exploiting fixed prompt formats)
- Behavior: Model trained to respond well to specific prompt templates; game tests by matching those templates exactly.
- Mitigation: randomize prompt phrasing and structure; test with out‑of‑template paraphrases; hide test instructions across several rewordings.

Adversarial: Post‑hoc plausible explanations (hallucinated chain‑of‑thought)
- Behavior: Model produces a plausible‑looking explanation that does not correspond to internal computation.
- Mitigation: automatic verification of explanation steps; require computation that can be executed in a simulator and check consistency between explanation and final answer. Ask for explicit intermediate numeric values that can be validated.

Adversarial: Strategic stochasticity (choose most socially pleasing answer)
- Behavior: Model outputs an answer that appears reasonable but isn't derived by correct reasoning.
- Mitigation: use items with unique numeric/structural answers and build simulators to check exactness. Evaluate per‑answer consistency across multiple samples.

Adversarial: Learning to memorize test suite by repeated exposure (test leakage)
- Behavior: Providers fine‑tune model on leaked items and then pass tests.
- Mitigation: maintain a large pool of test items, rotate, and ensure many items are generated on the fly. Keep some items only as one‑time use. Audit and track tests over time to detect sudden accuracy jumps.

Adversarial: Exploiting world knowledge
- Behavior: Model uses web knowledge (e.g., named entities) instead of reasoning.
- Mitigation: use synthetic domains and names that aren't in corpus; prefer abstract variables and gibberish tokens.

Adversarial: Exploiting consistent wrong heuristics (e.g., always answer "no" to avoid errors)
- Behavior: Model learns a default safe answer that is occasionally correct by chance.
- Mitigation: measure baselines and chance rates and compare; include control items where a naive heuristic fails frequently.

6) Experimental validation plan, metrics for false positives/negatives, and iteration
Validation phases
- Phase 0 — Unit tests: small sample (N=50 per task) to verify item generation, canonicalization, and grading logic.
- Phase 1 — Pilot: evaluate 3–5 diverse models (known baselines: simple pattern models, older LLM, and an advanced recent LLM) on a larger pilot (N=200 per task). Use pilot to estimate item difficulties and variance.
- Phase 2 — Full evaluation: run full suite (recommended N=400 per task) across models under test.
- Phase 3 — Adversarial robustness checks: invite models purposely trained to game tests; analyze failures and iterate.

Metrics
- Primary: per‑task accuracy (binary), composite accuracy.
- Secondary: explanation correctness rate; self‑consistency rate (fraction of paraphrases with same answer); variability under sampling (entropy of answers).
- Discrimination metrics: Area Under ROC curve if defining continuous scoring; effect sizes and odds ratios.
- Error analysis metrics:
  - False positives (FP): model flagged as reasoning‑capable but actually relying on memorization. Estimate by hand‑inspecting suspicious items and adversarially designed holdouts. Also measure cases where model passes synthetic tasks but fails renamed/scrubbed variants.
  - False negatives (FN): model truly capable but fails tests (e.g., due to prompt format, minor ambiguity). Detect by analyzing cases where model yields correct reasoning traces but output form differs; retrain canonicalizer.

Estimating FP/FN rates
- Use validation with ground‑truth labeled model types:
  - Create/collect models we know: (i) a trained memorization baseline (e.g., n‑gram or memorization‑augmented seq2seq), (ii) a symbolic reasoner (deterministic oracle), (iii) a hybrid model.
  - Run full suite and measure classification (declared as reasoning vs not) vs known ground truth. Compute sensitivity (TPR) and specificity (1−FPR). Tune decision thresholds to maintain acceptable FP (e.g., <5%) while maximizing sensitivity.
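
Once the suite has labeled each reference model as "reasoning" or "not reasoning", sensitivity and specificity follow from the confusion counts. A minimal sketch, with a hypothetical function name and made-up labels purely for illustration:

```python
def sensitivity_specificity(truth, predicted):
    """Compute (sensitivity, specificity) from boolean ground truth and predictions.

    True means "reasoning-capable". Sensitivity is TP / (TP + FN);
    specificity is TN / (TN + FP), which equals 1 - FPR.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("Need at least one positive and one negative reference model")
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative labels only, not real results.
truth = [True, True, False, False, False]
predicted = [True, False, False, False, True]
print(sensitivity_specificity(truth, predicted))
```

Sweeping the decision threshold and recomputing these two numbers gives the trade-off curve used to pick an acceptable false positive rate.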

Iterative refinement based on results
- Analyze item‑level statistics: identify low discrimination items (very easy or very hard) using IRT or item discrimination index. Remove or rework items with poor discrimination.
- Increase adversarial item proportion if many models game the suite.
- Expand synthetic vocabulary and composition depth if memorization leakage is observed.
- Automate item generation pipelines to produce fresh held‑out sets periodically.
- Maintain an audit log of test runs and items to detect potential leakage and overfitting by model providers.

Example iteration cycle
1. Pilot run → compute per‑task accuracies, item difficulty.
2. Identify items where memorizing baselines scored as well as strong reasoning oracle → mark for redesign (add randomization, deeper nesting, or rename tokens).
3. Add paraphrase/renaming tests for items showing potential prompt‑template exploitation.
4. Rebalance item pool to maintain discrimination and rerun.

Practical considerations and recommended implementation checklist
- Implement automatic generators for each template with reproducible random seeds.
- Implement deterministic simulators/validators for each task (action sequence executor, arithmetic evaluator, planner checker) so grading is exact.
- For each item store (seed, tokens, canonical answer, distractors).
- Maintain multiple prompt forms and sample at random per query.
- Log all outputs, probabilistic sampling states (temperature), and timestamps.
- Use paired testing (same items across all models) to maximize power and fairness.
- Publicly release small sanitized subsets for community verification but keep main test pools private to reduce overfitting.

Concluding guidance
- No single test proves genuine reasoning forever; use this suite as a robust, multi‑task battery that measures “behavioral evidence” of compositional reasoning and generalization.
- Emphasize cross‑task consistency, randomness/held‑out composition, and rigorous statistical testing.
- Make verification automated where possible (simulators) and include human review for ambiguous failures.
- Iterate the pool and keep fresh synthetic items to reduce memorization/leakage risk.

If you’d like, I can:
- Produce a reference implementation (Python) that generates randomized items for each template and includes canonicalizers and simulators for automated scoring.
- Compute precise sample size calculations for specific expected effect sizes you care about and provide a ready‑to‑run experimental script.

In [10]:
# Anthropic has a slightly different API, and Max Tokens is required

model_name = "claude-sonnet-4-5"

claude = Anthropic()
response = claude.messages.create(model=model_name, messages=messages, max_tokens=1000)
answer = response.content[0].text

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

# Comprehensive Test Suite for Distinguishing Pattern Matching from Compositional Reasoning in LLMs

## Executive Summary

This test suite employs six task categories designed to probe genuine compositional reasoning while resisting memorization and gaming. Each task incorporates systematic variation, procedural generation, and adversarial controls to ensure that success requires compositional understanding rather than pattern matching.

---

## 1. Task Types with Examples

### Task 1: Systematic Compositional Generalization (Novel Recombinations)

**Objective**: Test ability to combine known primitives in novel configurations never seen during training.

**Design Principle**: Use a microworld with defined rules, then test combinations statistically unlikely to appear in training data.

**Example Input**:
```
Rules of the GLORP system:
- FLIM(x) reverses the order of elements in x
- DRAK(x, y) interleaves elements from x and y
- ZOOP(x) duplicates every second element in x

Training examples:
FLIM([A, B, C]) = [C, B, A]
DRAK([1, 2], [3, 4]) = [1, 3, 2, 4]
ZOOP([P, Q, R]) = [P, Q, Q, R]

Test question:
What is FLIM(DRAK(ZOOP([X, Y]), [M, N]))?
```

**Expected Output**: 
```
Step-by-step reasoning:
1. ZOOP([X, Y]) = [X, Y, Y]
2. DRAK([X, Y, Y], [M, N]) = [X, M, Y, N, Y]
3. FLIM([X, M, Y, N, Y]) = [Y, N, Y, M, X]

Answer: [Y, N, Y, M, X]
```

**Scoring Rubric**:
- 4 points: Correct answer with valid step-by-step reasoning
- 3 points: Correct answer with minor reasoning errors
- 2 points: Incorrect answer but demonstrates understanding of composition
- 1 point: Partially correct intermediate steps
- 0 points: Incorrect with no valid reasoning

**Variants for Anti-Memorization**:
- Generate 10,000 unique microworlds with different operation names, symbols, and rule sets
- Use procedural generation with random seeds
- Operation names drawn from pronounceable non-words (e.g., BLICKET, WUGGY, FEPS)
- Vary depth of composition (2-5 levels)
- Test both symbolic and numeric domains
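
The GLORP example above can be checked mechanically. A minimal sketch, assuming that `DRAK` appends any leftover elements of the longer list (the rules do not say so explicitly, but step 2 of the worked answer behaves that way) and that "every second element" means positions 2, 4, and so on. The function names `flim`, `drak` and `zoop` are just lowercase stand-ins for the operations in the prompt:

```python
def flim(xs):
    """FLIM: reverse the order of elements."""
    return list(reversed(xs))

def drak(xs, ys):
    """DRAK: interleave xs and ys.

    Assumption: leftover elements of the longer list are appended,
    which matches step 2 of the worked example.
    """
    out = []
    for i in range(max(len(xs), len(ys))):
        if i < len(xs):
            out.append(xs[i])
        if i < len(ys):
            out.append(ys[i])
    return out

def zoop(xs):
    """ZOOP: duplicate every second element (positions 2, 4, ...)."""
    out = []
    for i, x in enumerate(xs):
        out.append(x)
        if i % 2 == 1:
            out.append(x)
    return out

result = flim(drak(zoop(["X", "Y"]), ["M", "N"]))
print(result)  # ['Y', 'N', 'Y', 'M', 'X']
```

A reference implementation like this is what makes procedural generation practical: once the primitives are executable, the expected answer for any randomly generated composition can be computed rather than written by hand.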

---

### Task 2: Causal Reasoning Under Intervention (Counterfactual Inference)

**Objective**: Distinguish correlation from causation and reason about interventions.

**Design Principle**: Present causal graphs implicitly through scenarios, then test counterfactual reasoning that requires understanding causal structure.

**Example Input**:
```
Scenario: In the town of Millbrook, the following patterns have been observed over 10 years:

- When the reservoir level is high, the water treatment plant runs at full capacity
- When the treatment plant runs at full capacity, downtown water pressure is strong
- When downtown water pressure is strong, the fountain in Central Park operates
- The reservoir level depends only on rainfall
- Rainfall also directly affects whether street cleaning happens (rain = no cleaning)

Historical data shows that on days when the fountain operates, streets are usually dirty.

Question 1: The town installs a new pump that allows the fountain to operate regardless of downtown water pressure. After this intervention, will the streets be cleaner or dirtier on days when the fountain operates, compared to before?

Question 2: Explain your reasoning using the causal structure.
```

**Expected Output**:
```
Answer: The streets will be CLEANER on fountain-operating days after the intervention.

Reasoning: 
Before intervention: Fountain operating → High reservoir → Recent rainfall → No street cleaning → Dirty streets

The correlation between fountain operation and dirty streets was due to a common cause (rainfall), not because the fountain caused dirty streets.

After intervention: The fountain can operate without rainfall, breaking the spurious correlation. Fountain operation is now independent of rainfall, so there's no reason to expect systematically dirtier streets.
```

**Scoring
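
Note that the Claude response above stops mid-heading, most likely because it reached the `max_tokens=1000` cap set in the call. The Anthropic Messages API reports why generation ended in `response.stop_reason`, so a small check can flag this. A minimal sketch, where `was_truncated` is a hypothetical helper (not part of the SDK) and the stand-in objects only mimic the one attribute it reads:

```python
from types import SimpleNamespace

def was_truncated(response) -> bool:
    """Return True if the reply was cut off by the max_tokens limit.

    Assumes `response` exposes a `stop_reason` attribute, as Anthropic
    Messages API responses do.
    """
    return getattr(response, "stop_reason", None) == "max_tokens"

# Stand-in objects for illustration only; in the lab you would pass the real `response`.
cut_off = SimpleNamespace(stop_reason="max_tokens")
finished = SimpleNamespace(stop_reason="end_turn")

print(was_truncated(cut_off))   # True
print(was_truncated(finished))  # False
```

If the check returns `True` for the real response, raising `max_tokens` and re-running the cell should produce the full answer.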

In [13]:
gemini = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
model_name = "gemini-3-flash-preview"

response = gemini.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

This test-suite, titled **CORE-Eval (Compositional & Operational Reasoning Evaluation)**, is designed to bypass the "stochastic parrot" effect by focusing on out-of-distribution (OOD) tasks that require the dynamic manipulation of novel variables and rules.

---

### 1. Concrete Task Types

#### Task A: Recursive Nested State Tracking (Variable Binding)
*   **Concept:** Track the state of objects through recursive "if-then" swaps and containment changes.
*   **Input:** "There are three boxes: Red, Blue, and Green. Red contains a 'Glint'. Blue contains a 'Spark'. Green is empty. Rule 1: If an object is moved to an empty box, its name reverses. Rule 2: If two boxes swap, their contents swap. Step 1: Swap Red and Blue. Step 2: Move contents of Blue to Green. Step 3: Rule 1 applies to Green. What is in Green?"
*   **Expected Output:** "tnilG" (The 'Glint' moved from Red to Blue in Step 1, then from Blue to Green in Step 2, triggering Rule 1).

#### Task B: Counterfactual Physics Reasoning (Causal Reasoning)
*   **Concept:** Apply logical deductions in a world where one fundamental law of physics is altered.
*   **Input:** "In this world, gravity acts as a repellent for liquids but a vacuum for solids. If I tip a glass of water upside down over a table, and there is a wooden block on that table, what happens to the water and the block?"
*   **Expected Output:** The water moves upward (away from the center of mass/floor) and the block is pulled toward the ceiling (vacuum effect).

#### Task C: The "Zylophon" Syntax (Systematic Generalization)
*   **Concept:** Learn a 3-rule pseudo-grammar and apply it to a 20-word sentence.
*   **Input:** "Grammar: (1) Nouns end in '-ox'. (2) Verbs precede nouns. (3) Adjectives follow the noun they modify and must be repeated twice. Translate: 'The fast cat chases a small mouse' into Zylophon."
*   **Expected Output:** "Chases catox fast fast mouseox small small." (Requires consistent rule application over lexical substitution).

#### Task D: Strategic Pathfinding with Dynamic Obstacles (Planning)
*   **Concept:** Solve a grid-based navigation task where the "cost" of movement changes based on the history of moves.
*   **Input:** "Grid 4x4. Start (0,0), Goal (3,3). Moving East costs 1. Moving South costs 2. However, every time you move South, the cost of the next East move doubles. Provide the sequence of moves for the lowest cost."
*   **Expected Output:** A specific path (e.g., E, E, E, S, S, S) with a calculated total cost.
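
Task D is small enough to brute-force, which is one way to produce the ground-truth answer for scoring. A minimal sketch under one reading of the rule, which is an assumption: each South move doubles the cost of the next East move only, and the penalty resets once that East move is made. `path_cost` is a hypothetical helper name:

```python
from itertools import permutations

def path_cost(path):
    """Cost of a path of 'E'/'S' moves.

    Assumption: each South move doubles the cost of the next East move,
    and the penalty resets after that East move is taken.
    """
    total, east_multiplier = 0, 1
    for move in path:
        if move == "E":
            total += 1 * east_multiplier
            east_multiplier = 1
        else:  # "S"
            total += 2
            east_multiplier *= 2
    return total

# Every shortest route from (0,0) to (3,3) uses exactly three E and three S moves.
candidates = {"".join(p) for p in permutations("EEESSS")}
best = min(candidates, key=path_cost)
print(best, path_cost(best))  # EEESSS 9
```

Under this reading, any East move taken after a South move costs at least 2, so making all East moves first is the unique optimum. A different reading of the rule would need a different `path_cost`.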

#### Task E: Functional Abstraction (Abstraction)
*   **Concept:** Identify a latent function from input-output pairs and apply it to a complex, non-obvious case.
*   **Input:** "f(apple) = 1, f(banana) = 3, f(kiwi) = 2. Rule: f(x) = count of vowels in word. Apply f to the result of f('pomegranate') expressed as a word."
*   **Expected Output:** "f(5) -> f('five') -> 2."

#### Task F: Adversarial Syllogisms (Logic vs. Pattern Matching)
*   **Concept:** Logical structures where the conclusion is factually false in the real world but logically valid within the prompt.
*   **Input:** "All surgeons are cats. Some cats are astronauts. Therefore, are some surgeons astronauts? Explain the necessity of the conclusion."
*   **Expected Output:** No. (Standard logical fallacy: the 'undistributed middle'). Memorization-based models often default to "Yes" because the semantic "vibes" of surgeons and astronauts are high-competence.

---

### 2. Scoring Rubric
Each task is scored on a 0-3 scale:
*   **0 (Failure):** Incorrect final answer and flawed logic or hallucinations.
*   **1 (Partial):** Correct logic but arithmetic/minor execution error; or correct answer but "hollow" explanation.
*   **2 (Successful):** Correct final answer and clear, step-by-step reasoning (Chain of Thought).
*   **3 (Robust):** Correct answer even when the prompt includes "distractor" information (e.g., "Note: the boxes are made of wood," which is irrelevant).

---

### 3. Protocols to Prevent Memorization/Gaming
*   **The DSL Generator:** Use a Domain Specific Language (DSL) script to generate 1,000 unique variations of each task. This ensures no specific instance exists in the training data.
*   **Nonce-Token Substitution:** Replace all semantic nouns (e.g., "cat") with random strings (e.g., "blorp-7"). This strips the model of "semantic priming" where it guesses based on word associations rather than logic.
*   **Permutation Testing:** For every prompt, provide a version where the order of rules is changed. A reasoning model will adapt; a pattern-matcher will often fail if it memorized a specific sequence.

---

### 4. Sample Size and Statistical Tests
*   **Sample Size:** $N=200$ unique prompts per task type (Total 1,200).
*   **Metric:** Accuracy (binary) and Mean Score (0-3).
*   **Statistical Test:** 
    *   **McNemar’s Test:** To compare two models on the same set of binary outcomes (to see if the performance difference is statistically significant).
    *   **Standard Deviation of Error:** If a model’s performance drops significantly when nouns are replaced with nonce-tokens, it is flagged as relying on **pattern matching** (High Semantic Sensitivity).
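
McNemar's test only looks at the items where the two models disagree. A minimal sketch of the exact (binomial) version using only the standard library; `mcnemar_exact` is a hypothetical helper name, and the counts in the usage lines are made up for illustration:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = items model A got right and model B got wrong
    c = items model A got wrong and model B got right
    Under the null hypothesis the discordant pairs split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(mcnemar_exact(0, 5))    # 0.0625
print(mcnemar_exact(10, 10))  # 1.0
```

Items that both models get right, or both get wrong, do not enter the calculation, which is why the test controls for item difficulty.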

---

### 5. Adversarial Behaviors & Design Counter-Measures
*   **Behavior: Hedging/Refusal.** The model says "This depends on the laws of physics" to avoid commitment.
    *   *Counter-measure:* Explicitly state "Assume this hypothetical world is the only reality for this calculation."
*   **Behavior: Chain-of-Thought Mimicry.** The model writes "Step 1... Step 2..." but the steps don't lead to the conclusion.
    *   *Counter-measure:* Use **Conclusion-to-Step Verification**. Ask the model for the answer first, then the steps, then provide a *different* answer and ask it to find the error.
*   **Behavior: Semantic Drift.** The model starts with the counterfactual but drifts back to real-world logic mid-sentence.
    *   *Counter-measure:* **Long-Chain Dependency.** Ensure the final answer requires the output of Step 1 to be used in Step 10.

---

### 6. Experimental Validation & Iteration
*   **Gold Standard:** Run the suite on a "Ground Truth" group of human logic experts.
*   **False Positive (FP) Metric:** If a known "shallow" model (e.g., a 1B parameter model) passes, the task is too simple. Increase the recursive depth.
*   **False Negative (FN) Metric:** If humans fail the task, it is a measure of complexity, not reasoning. Simplify until human accuracy is $>95\%$.
*   **Iteration Loop:** 
    1.  Deploy $N=50$. 
    2.  Check for "Shortcut Cues" (e.g., if the answer is always the last item mentioned). 
    3.  If a model identifies a shortcut, re-randomize the prompt generator to eliminate the cue. 
    4.  Scale to $N=1,200$.

In [14]:
deepseek = OpenAI(api_key=deepseek_api_key, base_url="https://api.deepseek.com/v1")
model_name = "deepseek-chat"

response = deepseek.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

# **Adversarial‑Resistant Test Suite for Compositional Reasoning vs. Pattern‑Matching in Black‑Box LLMs**

## **1. Core Philosophy**
The suite is designed to be **variants‑rich, systematically generated, and dynamically updated** to prevent memorization. It tests **out‑of‑distribution generalization** by requiring recombination of known primitives in novel ways, and uses **control tasks** to detect shallow heuristics.

---

## **2. Six Task Types with Examples**

### **Task Type 1: Systematic Generalization (Syntax‑Semantics Mismatch)**
**Goal:** Test if the model can apply known syntactic rules to novel lexical items or semantic domains.

**Example Input:**
```
Rule: "Every X that Ys is Z." → "All Xs that Y are Z."
Apply to: "Every blimp that froods is tink."
```
**Expected Output:** `"All blimps that frood are tink."`

**Scoring Rubric:**
- 2 pts: Correct transformation with morphological adjustment (frood→frood, not froods→frood).
- 1 pt: Correct transformation but minor morphological error.
- 0 pts: Failure to apply rule, or parroting input.

**Adversarial Defense:** Use **nonce words** (blimp, frood, tink) with no web occurrence; generate 100 variants per test with random nonce triples.

---

### **Task Type 2: Causal Reasoning with Novel Mechanisms**
**Goal:** Test understanding of causal chains, distinguishing causation from correlation, and handling novel causal verbs.

**Example Input:**
```
In a novel system: "Grobling causes flumps to shank, but shanking prevents grobling. If you grobl a flump, what happens after two cycles?"
```
**Expected Output:** `"After one cycle: flump shanks. After two cycles: shanking prevents grobling, so grobling stops, flump no longer shanks."`

**Scoring Rubric:**
- 2 pts: Correct step‑by‑step trace with state updates.
- 1 pt: Correct final state but missing intermediate reasoning.
- 0 pts: Wrong final state or contradictory reasoning.

**Adversarial Defense:** Randomize causal graphs (A→B, B inhibits A, etc.) with random nonce names; require step‑by‑step explanation.

---

### **Task Type 3: Abstraction & Analogical Mapping**
**Goal:** Test ability to extract relational schema and apply it to a new domain.

**Example Input:**
```
"A glorp is to a fleeb as a trank is to a ______."  
Context: "Glorps modify fleebs; tranks modify…"
```
**Expected Output:** `"spindle"` (given novel mapping: tranks modify spindles).

**Scoring Rubric:**
- 2 pts: Correct answer + correct relation stated.
- 1 pt: Correct answer only.
- 0 pts: Wrong answer.

**Adversarial Defense:** Use **randomly generated analogies** with arbitrary relations (modifies, contains, opposes, etc.) and novel object names.

---

### **Task Type 4: Counterfactual Reasoning with Novel Premises**
**Goal:** Test reasoning about what would be true if a known fact were different.

**Example Input:**
```
"In world W: All zingers are fribble. Moops are not zingers. Are moops fribble?  
Now consider: If zingers were not fribble, would moops be fribble?"
```
**Expected Output:** `"In actual world: unknown (moops may or may not be fribble). In counterfactual: moops are not fribble (because only zingers were fribble, now nothing is)."`

**Scoring Rubric:**
- 2 pts: Correct both actual and counterfactual with justification.
- 1 pt: Correct counterfactual only.
- 0 pts: Wrong.

**Adversarial Defense:** Randomize logical forms (∀x(Z(x)→F(x)), ¬∀x(M(x)→Z(x)), etc.) with nonce predicates.

---

### **Task Type 5: Hierarchical Planning with Novel Constraints**
**Goal:** Test ability to decompose a goal into sub‑goals under new constraints.

**Example Input:**
```
"Goal: Get a sprocket from the grove. Actions: you can ‘zazz’ (moves sprockets to the grove if they are in the vault), ‘vinn’ (moves you to the grove), ‘lork’ (locks the vault if you are there). The sprocket is in the vault, which is locked. You are outside. What sequence works?"
```
**Expected Output:** `"Vinn to vault, lork to unlock, zazz sprocket to grove, vinn to grove."`

**Scoring Rubric:**
- 2 pts: Correct minimal‑step sequence.
- 1 pt: Correct but redundant steps.
- 0 pts: Incorrect or impossible sequence.

**Adversarial Defense:** Generate random **STRIPS‑like planning problems** with novel action names and preconditions; vary initial/goal states.

---

### **Task Type 6: Variable Binding & Quantifier Scope**
**Goal:** Test handling of quantifier scope ambiguities with novel predicates.

**Example Input:**
```
"Every blip is in a cloop. A cloop is damp. Does it follow that every blip is in a damp cloop? Explain."
```
**Expected Output:** `"No. Every blip is in some cloop, but possibly different cloops; not all those cloops must be damp."`

**Scoring Rubric:**
- 2 pts: Correct answer with correct logical explanation.
- 1 pt: Correct answer with vague explanation.
- 0 pts: Wrong answer.

**Adversarial Defense:** Randomize quantifier order (∀∃ vs ∃∀), use nonce predicates, ask for explanation to avoid guessing.

---

## **3. Protocols Against Memorization & Gaming**

### **Dynamic Variant Generation**
- Each task instance is generated from a **seed‑based procedural algorithm** with nonce words/relations.
- Maintain a **held‑out variant pool** not used during public benchmark releases.
- For each task type, generate **10,000 unique variants**; use random subsets per test session.
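
A seed-based generator of this kind can be quite small. A minimal sketch for Task Type 1, where the syllable lists and the helper names `nonce_word` and `make_variant` are illustrative assumptions, and pluralisation is simplified to appending "s":

```python
import random

# Small syllable inventories (illustrative); a real suite would use larger ones.
ONSETS = ["bl", "fr", "gr", "t", "z", "m", "sp", "cl"]
VOWELS = ["a", "i", "o", "u", "ee", "oo"]
CODAS = ["p", "d", "nk", "b", "rg", "g"]

def nonce_word(rng: random.Random) -> str:
    """Build a pronounceable made-up word from onset + vowel + coda."""
    return rng.choice(ONSETS) + rng.choice(VOWELS) + rng.choice(CODAS)

def make_variant(seed: int) -> dict:
    """Generate one Task Type 1 item; the same seed always yields the same item."""
    rng = random.Random(seed)
    noun, verb, adj = (nonce_word(rng) for _ in range(3))
    return {
        "prompt": 'Rule: "Every X that Ys is Z." → "All Xs that Y are Z."\n'
                  f'Apply to: "Every {noun} that {verb}s is {adj}."',
        "expected": f"All {noun}s that {verb} are {adj}.",
    }

item = make_variant(seed=42)
print(item["prompt"])
print(item["expected"])
```

Because the item is a pure function of the seed, a held-out pool is just a set of seeds that is never published.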

### **Statistical Controls**
- Include **“pattern‑matching catch trials”**: tasks that look superficially similar but require different reasoning; memorizing‑based models will fail.
- Example: A sentence with same surface form as a known training example but with reversed logic.

### **Answer Consistency Checks**
- For each task, ask **the same underlying question in two different surface forms** (paraphrase, different nonce words). A reasoning model should give consistent answers; a memorizing model may not.
- Use **temporal probing**: ask follow‑up questions that require maintaining variable bindings across turns.

### **Explanations Required**
- Force chain‑of‑thought (CoT) explanations. Use **explanation‑consistency scoring**: if the final answer contradicts the explanation, penalize heavily.

### **Adversarial Training Data Detection**
- Compare responses to **web‑search snippets** for nonce words; if any match appears, flag potential data contamination.
- Deploy **canary tokens** in public descriptions of the suite; if models output these tokens, they have seen the test suite in training.

---

## **4. Sample Sizes & Statistical Tests**

### **Per‑Task Design**
- **Minimum sample**: 200 instances per task type (1200 total), balanced across sub‑types.
- Each instance is scored 0‑2, yielding a **mean score per task**.

### **Statistical Tests**
- **Primary metric**: Mean composite score (sum over tasks) compared between two models.
- Use **Mann‑Whitney U test** (non‑parametric, doesn’t assume normality) for score distributions.
- **Confidence intervals** for per‑task score differences via bootstrap (1000 resamples).
- **Effect size**: Cliff’s delta for robust interpretation.
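
Two of the statistics above are easy to compute without extra libraries. A minimal sketch of Cliff's delta and a percentile bootstrap confidence interval for a difference in mean scores; the helper names and the scores are made up for illustration:

```python
import random

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs, in [-1, 1]."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def bootstrap_mean_diff_ci(xs, ys, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(xs) - mean(ys).

    Each group is resampled independently, with replacement.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        rx = rng.choices(xs, k=len(xs))
        ry = rng.choices(ys, k=len(ys))
        diffs.append(sum(rx) / len(rx) - sum(ry) / len(ry))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Made-up 0-2 scores for two models on one task type (illustration only).
model_a = [2, 2, 1, 2, 1, 2, 2, 0, 2, 1]
model_b = [1, 0, 1, 2, 0, 1, 1, 0, 1, 0]
print(cliffs_delta(model_a, model_b))
print(bootstrap_mean_diff_ci(model_a, model_b))
```

If the interval excludes zero, the per-task difference is unlikely to be a resampling artefact. Cliff's delta then says how often one model's score beats the other's across item pairs.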

### **Power Analysis**
- To detect a difference of **0.3 points** (out of 12 total) with 80% power and α=0.05, need ~100 instances per task for two‑group comparison.
- Our 200 per task provides power to detect smaller effects.

---

## **5. Possible Adversarial Behaviors & Mitigations**

| **Adversarial Behavior** | **Mitigation in Our Design** |
|--------------------------|-------------------------------|
| **Prompt‑engineering to guess pattern** | Require explanation; randomize surface forms; use nonce words. |
| **Fine‑tuning on similar benchmarks** | Use dynamic generation; deploy canary tokens; test on held‑out variant pool. |
| **Retrieval‑augmented generation (RAG)** | Nonce words ensure no matching documents; novel compositions unseen. |
| **Over‑producing multiple answers** | Evaluate only first coherent answer; consistency checks across paraphrases. |
| **Exploiting task‑specific heuristics** | Include “catch trials” that break heuristics; vary task parameters widely. |
| **Memorizing from explanations in training** | Generate novel reasoning chains; randomize order of steps in explanations. |

---

## **6. Experimental Validation Plan**

### **Phase 1: Baselines**
- Test known **pattern‑matching‑heavy models** (e.g., smaller LLMs, n‑gram baselines) vs. **alleged reasoning models** (e.g., GPT‑4, Claude 3).
- Expect clear separation in scores.

### **Phase 2: Contamination Check**
- For each model, test on **canary instances**; if performance is perfect on those but poor on random variants, flag contamination.
- Compare performance on **public variant set** vs. **held‑out variant set**; a contaminated model will show a large drop.

### **Phase 3: Iteration**
- **False Positive Rate (FPR)**: Rate at which pattern‑matchers score above threshold. Estimate by testing on pure n‑gram/retrieval baselines.
- **False Negative Rate (FNR)**: Rate at which true reasoners score below threshold. Estimate by testing humans (experts) on a subset.
- **Iteration**: If FPR > 5% or FNR > 10%, analyze failure cases, add new task variants or new “catch” subtasks, and adjust scoring rubric.

### **Metrics for Suite Quality**
- **Discrimination index**: (Mean score of top known reasoners – mean score of pattern‑matchers) / pooled SD. Target > 2.0.
- **Internal consistency**: Cronbach’s alpha across task instances; target > 0.8.
- **Test‑retest reliability**: Same model tested on different variant sets; correlation target > 0.9.

### **Long‑Term Upkeep**
- **Quarterly updates** of nonce words and relation sets.
- **Community contribution** of new task templates via controlled repository.
- **Continuous adversarial red‑teaming** to find gaming strategies and patch them.

---

## **Conclusion**
This test suite is **generative, scalable, and adversarial‑resistant** by design. It focuses on **systematic generalization** with novel primitives, requires **explanatory consistency**, and uses **statistical controls** to separate memorization from reasoning. With proper deployment and iteration, it can serve as a robust benchmark for genuine compositional reasoning in black‑box LLMs.

In [15]:
# Updated with the latest Open Source model from OpenAI

groq = OpenAI(api_key=groq_api_key, base_url="https://api.groq.com/openai/v1")
model_name = "openai/gpt-oss-120b"

response = groq.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)


Below is a **complete, adversarial‑resistant test‑suite** that can be used to separate **“pattern‑matching / memorisation‑heavy”** language models from those that truly **compose, reason and generalise**.  
The design follows best‑practice guidelines from the psychometrics of AI evaluation (e.g., Item‑Response Theory, bootstrap hypothesis testing) and includes concrete items, scoring rules, anti‑gaming protocols, statistical power calculations, threat‑modelling, and a validation plan.

---

## 1. Six Concrete Task Types (with examples)

| # | Task Type | What it probes | Core compositional demand | Example Prompt (input) | Expected Answer (output) | Why a memorisation‑only model should fail |
|---|-----------|----------------|---------------------------|------------------------|--------------------------|------------------------------------------|
| 1 | **Systematic Generalisation (SCG)** – novel slot‑filling on a known grammar | Ability to apply a learned rule to unseen lexical items | “<subject> <verb‑past> the <object>” → “<object> was <verb‑past‑by‑subject>” | *Prompt:* “Translate to passive voice: *The *florists* *arranged* the *vases*.” | “The vases were arranged by the florists.” | The model must recombine known syntactic transformations with *new* nouns and verbs it has never seen together. |
| 2 | **Causal‑Chain Reasoning (CCR)** – multi‑step cause → effect inference | Understanding and chaining causal relations | *Prompt:* “If the thermostat is set to 22 °C, the heater turns on. If the heater turns on, the room warms up. What will happen if the thermostat is set to 22 °C?” | “The heater will turn on, and the room will warm up.” | Requires chaining two rules; a pure pattern‑matcher would need the exact whole‑sentence pattern in its training data, which is unlikely. |
| 3 | **Abstract Symbol Manipulation (ASM)** – variable binding & substitution in a tiny “programming” language | Binding variables, applying functions, preserving scope | *Prompt:* “In a language where `F(x)=x+2` and `G(y)=y*3`, compute `F(G(4))`.” | “22” | The model must treat `F` and `G` as *functions* and apply them compositionally, not retrieve a memorised answer for “F(G(4))”. |
| 4 | **Counterfactual Reasoning (CFR)** – “what‑if” world changes | Evaluating a scenario under a hypothetical change while keeping other facts constant | *Prompt:* “John is taller than Mary. If John were 5 cm shorter, would he still be taller than Mary? (Mary is 165 cm tall.)” | “Yes, because John would still be 166 cm tall (originally 171 cm).” | Requires keeping the original facts, applying the counterfactual transformation, and re‑evaluating the comparison. |
| 5 | **Planning & Constraint Satisfaction (PCS)** – generate a sequence that satisfies a set of constraints | Multi‑step planning, maintaining state, respecting constraints | *Prompt:* “Place three red, two blue and one green token on a line of six cells so that no two tokens of the same colour are adjacent.” | Any valid ordering, e.g., “R B R G R B”. | The answer must be **constructed**; a memorised list of valid strings is improbable because the colour‑counts are randomised per item. |
| 6 | **Relational Analogy with Variable Binding (RAVB)** – map relational structure from one domain to another | Abstract relational mapping, not surface similarity | *Prompt:* “In the story, the rabbit hides the carrot, and the fox steals the rabbit. Which of the following statements preserves the same relational pattern?  A) The cat eats the mouse, and the dog chases the cat.  B) The king crowns the queen, and the queen advises the king.  C) The painter paints a portrait, and the portrait hangs on the wall.” | “C) The painter paints a portrait, and the portrait hangs on the wall.” | Requires recognizing the *agent‑patient‑action* chain (A → B) and mapping it, not just picking the answer that shares surface words. |

**Key design properties**

* **Randomised lexical content** (nouns, verbs, numbers, colours) per item → eliminates exact‑match memorisation.  
* **Controlled grammar** – the underlying rule stays the same while surface tokens change.  
* **Multiple correct solutions** (e.g., PCS) → prevents a model from outputting a single memorised string.  
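
Because PCS items admit many correct answers, they are scored by checking constraints rather than by comparing strings. A minimal sketch of such a checker, assuming answers are whitespace-separated colour initials as in the example; `check_pcs` is a hypothetical helper name:

```python
from collections import Counter

def check_pcs(answer: str, required: dict) -> bool:
    """Return True if the arrangement uses exactly the required tokens
    and no two neighbouring tokens share a colour."""
    tokens = answer.split()
    if Counter(tokens) != Counter(required):
        return False
    return all(a != b for a, b in zip(tokens, tokens[1:]))

required = {"R": 3, "B": 2, "G": 1}
print(check_pcs("R B R G R B", required))  # True
print(check_pcs("R R B G R B", required))  # False (two reds are adjacent)
```

The same function works unchanged when the colour counts are randomised per item, since `required` is passed in.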

---

## 2. Scoring Rubric per Task

| Task | Scoring granularity | Full credit (1.0) | Partial credit (0.5) | Zero (0) | Comments |
|------|--------------------|-------------------|----------------------|----------|----------|
| SCG | Exact string match after normalisation (case, punctuation) | Exact passive‑voice transformation | Minor grammatical slip (e.g., “by the florists” missing) | Wrong voice / wrong nouns | Normalise via regex. |
| CCR | Proposition‑level correctness (two propositions) | Both cause‑effect statements correct | One of two statements correct | Neither correct or contradictory | Use automated entailment check (e.g., NLI model) + regex fallback. |
| ASM | Numeric answer | Exact integer | Off‑by‑1 (e.g., due to rounding) | Wrong number / non‑numeric | Evaluate with `int()` conversion. |
| CFR | Boolean + justification | Correct truth value **and** correct numeric comparison | Correct truth value but missing justification | Wrong truth value | Automatic parsing of “Yes/No” and numeric reasoning. |
| PCS | Constraint satisfaction check | All constraints satisfied | Exactly one adjacency violation | More than one violation or length mismatch | Verify with a simple script that checks adjacency & colour counts. |
| RAVB | Correct choice letter | Exact correct option | Selecting an answer that shares one relational component (e.g., same agent) | Incorrect relational mapping | Use string‑matching of relational pattern; optional human audit. |

**Overall Score** – sum of task scores (max = 6). For statistical analysis we treat each task as an **item** with a binary/partial score and compute an *Item‑Response* vector per model.

---

## 3. Protocols to Prevent / Detect Memorised Answers & Prompt‑Engineering Gaming  

| Threat | Counter‑measure (implementation) |
|--------|-----------------------------------|
| **Exact‑match memorisation** | *Dynamic item generation*: each evaluation run draws fresh random lexical items from a large curated pool (≥ 10 000 nouns, 5 000 verbs, etc.). The prompt is assembled on‑the‑fly; the exact string never appears in training data. |
| **Few‑shot prompt injection** | Disallow any demonstration examples in the test prompt. The test harness supplies *only* the task description and the input. If a model internally “hallucinates” a few‑shot context, it will be penalised (no credit for self‑generated examples). |
| **Self‑referential “I know the answer”** | Post‑hoc regex checks that the response does **not** contain the original prompt verbatim (or a near‑duplicate). Any such copy is flagged and the item is re‑issued. |
| **Exploiting temperature / stochasticity** | Run **three independent generations** per item (different seeds) and aggregate via majority vote. High variance across runs is recorded as a *stability* metric; models that rely on randomness to “guess” get lower stability scores. |
| **Prompt‑engineering to force chain‑of‑thought** | The evaluation harness **strips** any model‑generated “let me think step‑by‑step” text before scoring; only the final answer is considered. Optionally, a separate “chain‑of‑thought” sub‑task can be added to reward explicit reasoning, but it is **scored separately** from the core compositional test. |
| **Data‑leak via test‑set publication** | Keep the **item seed list** secret. Publish only the *generation code* (open‑source) but not the random seeds used for a particular evaluation round. After a round is finished, the seed list can be released for reproducibility, but not before the run. |
| **Model‑specific tokenisation tricks** | Normalise all outputs using Unicode NFKC, lower‑casing, and whitespace collapsing before scoring. This removes superficial token‑level tricks. |
| **Batch‑size / context‑window cheating** | Enforce a strict **single‑turn** interaction: the model receives only the current prompt, no history. The harness discards any hidden system‑prompt that the model might have inserted (e.g., via system‑prompt injection). |
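
The output normalisation step in the table can be done with the standard library. A minimal sketch, where `normalise` is a hypothetical helper name:

```python
import re
import unicodedata

def normalise(text: str) -> str:
    """NFKC-normalise, lower-case, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

print(normalise("  Ｙｅｓ,\n  it   DOES. "))  # yes, it does.
```

NFKC folds compatibility characters such as full-width letters into their ordinary forms, so two answers that look the same also compare equal before scoring.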

**Statistical Controls**

* For each task generate **N = 200** items per evaluation round (see Section 4).  
* Randomly split the 200 items into **10 folds**; compute per‑fold scores to estimate variance and detect outliers (e.g., a fold where many items are accidentally repeated).  
* Use **bootstrapped 95 % confidence intervals** on the mean item score to verify that the observed performance is not due to chance.

---

## 4. Sample Sizes & Statistical Tests

### 4.1 Power analysis (binary/partial scores)

Assume we want to detect a **Δ = 0.15** absolute improvement in mean item score (e.g., 0.70 vs 0.55) with **α = 0.05** and **power = 0.90**.

* For a two‑sample **t‑test** on proportions (or on continuous scores 0‑1), the required per‑group sample size is roughly **n ≈ 140** items per model per task.  
* To be conservative and to allow per‑task breakdowns, we use **n = 200** items per task (≈ 1 200 total items across the six tasks).  

### 4.2 Hypothesis tests

| Comparison | Test | Rationale |
|------------|------|-----------|
| **Model A vs Model B** overall compositional ability | Two‑sample **Welch’s t‑test** on the aggregated item‑score vector (6 × 200 = 1 200 observations) | Handles unequal variances; works for continuous 0‑1 scores. |
| **Task‑specific advantage** (e.g., Model A better on PCS) | **McNemar’s test** on binary‑correct vs incorrect per‑item (paired because same items are given to both models) | Controls for item difficulty. |
| **Stability across runs** (low variance) | **Levene’s test** on per‑item variance across the three generation seeds | Detects heteroscedasticity. |
| **Effect of randomisation** (ensuring no memorisation) | **Permutation test**: shuffle model labels on the item‑score vector 10 000 times, compute empirical p‑value. | Non‑parametric safeguard. |

**Effect‑size reporting** – alongside p‑values, always report **Cohen’s d** (or Hedges g for unequal N) and **95 % CI**.
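
The permutation test in the table needs only shuffling and counting. A minimal sketch for the unpaired case, using the absolute difference of mean item scores as the test statistic; `permutation_test` is a hypothetical helper name, and the add-one correction is a common convention rather than something the plan specifies:

```python
import random

def permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of mean item scores."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(scores_a) - mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign model labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Made-up scores for illustration only.
print(permutation_test([1, 1, 0.5, 1, 1, 0.5], [0, 0.5, 0, 0, 0.5, 0], n_permutations=2000))
```

Since both models answer the same items, a paired variant that randomly swaps the two scores within each item would use the design more fully; the unpaired version here follows the label-shuffling description in the table.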

---

## 5. Possible Adversarial Behaviours & How the Suite Handles Them

| Adversarial behaviour | Why it could fool a naïve test | Mitigation in our suite |
|-----------------------|--------------------------------|--------------------------|
| **“Lookup‑table” memorisation** (store every possible SCG transformation) | If the lexical pool were tiny, the model could pre‑compute all combos. | Use a **large, randomly sampled lexical pool** (≥ 10 000 items) and **different random seeds** each run → combinatorial explosion makes full coverage impossible. |
| **Prompt injection of chain‑of‑thought** (model internally asks for examples) | Could produce step‑by‑step reasoning that looks correct without true compositional ability. | Score only the **final answer**; optional chain‑of‑thought credit is a separate metric. |
| **Self‑prompted few‑shot creation** (model fabricates a training example then uses it) | May artificially raise accuracy on tasks like PCS. | Run **three independent generations** and require **majority agreement**; fabricated examples rarely survive across seeds. |
| **Exploiting test‑set leaks** (model was fine‑tuned on a released seed list) | Could produce perfect scores. | Keep the **seed list secret** until after the evaluation round; publish only generation code. |
| **Using external tools (search, calculators)** via tool‑use APIs | Might compute ASM numerics perfectly. | For the pure‑LLM benchmark, **disable tool‑use**; run the model in a “no‑tools” sandbox. A separate “augmented‑LLM” track can be created where tool‑use is allowed, but the baseline must be tool‑free. |
| **Biasing temperature to output “I don’t know”** (to avoid penalty) | Could inflate partial‑credit by abstaining. | **Zero‑score** for any non‑answer (e.g., “I don’t know”, empty response). Also enforce a **minimum length** (≥ 1 token) to prevent empty output. |
| **Manipulating tokenisation to hide correct answer** (e.g., splitting “yes” into sub‑tokens) | Might bypass regex checks. | Normalise output using Unicode NFKC and **token‑agnostic string matching**; also run a **semantic classifier** (tiny NLI) to double‑check “yes/no” answers. |
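
The last two mitigations (zero-scoring non-answers, NFKC normalisation with string matching) could look roughly like the sketch below. The `NON_ANSWERS` set and both function names are assumptions made for illustration, and the NLI double-check mentioned in the table is not included.

```python
import re
import unicodedata

# Hypothetical list of abstentions that should score zero.
NON_ANSWERS = {"", "i don't know", "i do not know", "unknown", "n/a"}


def normalise(text):
    """NFKC-normalise, lower-case, straighten apostrophes and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower().replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip(" .!?\"'")


def score_yes_no(response, expected):
    """Return 1 for a matching yes/no answer, 0 for anything else (including abstentions)."""
    cleaned = normalise(response)
    if cleaned in NON_ANSWERS:
        return 0
    match = re.match(r"(yes|no)\b", cleaned)
    return int(bool(match) and match.group(1) == expected)


print(score_yes_no("ＹＥＳ.", "yes"))          # full-width letters are folded by NFKC
print(score_yes_no("I don\u2019t know", "yes"))  # abstention
```

Matching on the normalised string, rather than on tokens, means the scorer does not care how the model's tokenizer split the answer.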

---

## 6. Experimental Validation Plan  

### 6.1 Phases

| Phase | Goal | Procedure | Success criteria |
|-------|------|-----------|-------------------|
| **Pilot** | Verify that item generation, parsing and scoring pipelines work reliably. | Run a **small LLM (e.g., GPT‑2‑XL)** on 30 items per task. Manually audit 10 % of outputs for parsing errors. | < 5 % parsing failures; clear separation of correct/incorrect. |
| **Baseline Establishment** | Obtain performance distribution of *known pattern‑matching models*. | Evaluate 5 publicly available “large‑mem‑only” models (e.g., GPT‑2‑large, LLaMA‑7B, Falcon‑7B) on the full suite (200 × 6 items). | Mean score ≤ 0.55 (i.e., below chance for tasks requiring composition). |
| **Target Model Evaluation** | Test a candidate *compositional* model (e.g., a fine‑tuned T5‑XXL or a transformer with explicit modular architecture). | Same protocol as baseline, three independent runs per item. | Mean score ≥ 0.75 and statistically significantly higher than baselines (p < 0.01, d ≥ 0.8). |
| **Adversarial Stress Test** | Verify robustness to gaming strategies. | For each model, run an extra batch where the **temperature** is set to 0.9, and where we *prepend* a “few‑shot” instruction (e.g., “Answer step‑by‑step”). | Scores should not improve > 0.03 relative to the standard run; variance should increase, indicating instability. |
| **Generalisation Check** | Ensure the test is not over‑fitted to a particular set of lexical items. | Regenerate the entire suite with a **new random seed** (different nouns/verbs) and re‑run the top‑performing model. | Score drop ≤ 0.05, confirming true compositional ability. |

### 6.2 Metrics for False Positives / Negatives

| Metric and definition | How it is measured | Target / interpretation |
|--------|------------|--------------------|
| **False Positive Rate (FPR)** – proportion of pattern‑matching models that achieve “high” compositional score (≥ 0.70). | Count models meeting threshold / total pattern‑matching models. | Desired FPR < 0.05. |
| **False Negative Rate (FNR)** – proportion of genuinely compositional models (e.g., models with known modular architectures) that fall below the threshold. | Count such models below threshold / total compositional models. | Desired FNR < 0.10. |
| **Stability Index** – average pairwise disagreement across the three seeds per item. | `1 - (agreement_rate)`. | Low instability (< 0.07) signals reliable reasoning rather than random guessing. |
| **Item Discrimination (I‑D)** – point‑biserial correlation between item score and overall model score. | Compute per‑item correlation; discard items with I‑D < 0.1. | Ensures each item differentiates models. |
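
To make the metrics in this table concrete, here is a small sketch of how three of them might be computed. The function names and input shapes are assumptions; the thresholds come from the table.

```python
from itertools import combinations
from statistics import mean, pstdev


def false_positive_rate(pattern_model_scores, threshold=0.70):
    """Share of known pattern-matching models scoring at or above the threshold."""
    flagged = sum(score >= threshold for score in pattern_model_scores)
    return flagged / len(pattern_model_scores)


def stability_index(item_answers):
    """Average pairwise disagreement (1 - agreement rate) across seeds, averaged over items."""
    per_item = []
    for answers in item_answers:  # each entry holds the answers from the seeds for one item
        pairs = list(combinations(answers, 2))
        agreement = sum(a == b for a, b in pairs) / len(pairs)
        per_item.append(1 - agreement)
    return mean(per_item)


def item_discrimination(item_scores, overall_scores):
    """Point-biserial correlation, i.e. Pearson correlation with a 0/1 item score."""
    sx, sy = pstdev(item_scores), pstdev(overall_scores)
    if sx == 0 or sy == 0:
        return 0.0  # an item every model passes (or fails) cannot discriminate
    mx, my = mean(item_scores), mean(overall_scores)
    cov = mean((x - mx) * (y - my) for x, y in zip(item_scores, overall_scores))
    return cov / (sx * sy)


# Hypothetical inputs for illustration only.
print(false_positive_rate([0.40, 0.50, 0.72, 0.30]))
print(stability_index([["yes", "yes", "yes"], ["yes", "no", "yes"]]))
print(item_discrimination([0, 0, 1, 1], [0.2, 0.4, 0.7, 0.9]))
```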

### 6.3 Iteration Loop

1. **Collect results** → compute all metrics.  
2. **Identify weak items** (low discrimination, high variance, high memorisation suspicion).  
3. **Modify**: change lexical pool, increase constraint complexity, or add a small “distractor” clause.  
4. **Regenerate** the affected items and re‑run the pilot.  
5. **Repeat** until the suite shows **≥ 0.80** average discrimination and **≤ 0.05** FPR on a held‑out set of pattern‑matching models.

---

## 7. Summary Checklist (what to implement)

| Component | Implementation |
|-----------|----------------|
| **Item generator** | Python script: random noun/verb pools, template filling, seed‑control. |
| **Scorer** | Regex‑based normalisers, task‑specific validators, majority‑vote aggregator. |
| **Evaluation harness** | API wrapper that forces single‑turn interaction, three‑seed repeats, temperature logging. |
| **Statistical analysis notebook** | Computes t‑tests, effect sizes, bootstrapped CIs, item discrimination. |
| **Adversarial‑behavior detector** | Copy‑detection, output‑length filter, variance logger. |
| **Documentation** | Protocol for secret seed handling, reproducibility instructions, reporting template (mean ± CI, d, FPR/FNR). |

With these six tasks, a rigorous scoring rubric, anti‑gaming safeguards, statistically powered sample sizes, and a clear validation pipeline, the suite can **reliably differentiate** between LLMs that merely stitch together memorised patterns and those that truly **compose, reason and generalise**.

In [16]:
# The API we know well
# This cell uses gpt-4.1-mini, which responds quickly.
# Swap in a larger reasoning model if you like, but expect to wait 1-2 mins because it likes to think!

model_name = "gpt-4.1-mini"

response = openai.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

Certainly! Below is a comprehensive design of an adversarial-resistant test-suite to differentiate black-box large language models (LLMs) that rely primarily on pattern matching and memorization from those that demonstrate genuine compositional reasoning and robust generalization.

---

# 1. Task Types with Examples

Each task type probes a core aspect of compositional reasoning, using concrete example inputs and expected outputs.

---

### Task Type 1: **Systematic Generalization**  
**Goal:** Test ability to apply learned rules or compositions to novel combinations never seen during training.

**Example:**  
- **Prompt:**  
  "In a made-up language, the suffix '-lam' means plural and the prefix 'bri-' means 'red'.  
  What is the plural form of ‘bri-tak’?"  
- **Expected Answer:**  
  "bri-tak-lam"

**Explanation:** The model must systematically combine a novel prefix and suffix on a base word to form a correct compositional output.

---

### Task Type 2: **Causal Reasoning**  
**Goal:** Assess model's ability to infer cause-effect relationships that require multi-step reasoning beyond pattern matching.

**Example:**  
- **Prompt:**  
  "If the street is wet and it did not rain, what is a plausible cause?"  
- **Expected Answers (any one):**  
  "Someone sprayed water," OR "A pipe burst," OR "The sprinkler system was on."

---

### Task Type 3: **Abstraction and Variable Binding**  
**Goal:** Check if the model can manipulate variables and abstract concepts rather than fixed phrases.

**Example:**  
- **Prompt:**  
  "There are three boxes: A, B, and C. If box A is heavier than box B, and box B is heavier than box C, which box is lightest?"  
- **Expected Answer:**  
  "Box C"

---

### Task Type 4: **Counterfactual Reasoning**  
**Goal:** Test reasoning about alternate realities and consequences.

**Example:**  
- **Prompt:**  
  "If humans could fly naturally, how would cities be designed differently?"  
- **Expected Answer (any reasonable answer):**  
  "Cities would have multi-level buildings designed for landing and takeoff, with less need for roads."

---

### Task Type 5: **Planning and Multi-step Reasoning**  
**Goal:** Evaluate multi-step planning ability requiring chaining of actions.

**Example:**  
- **Prompt:**  
  "You want to bake a cake but forgot eggs. List the steps to bake the cake using a substitute."  
- **Expected Answer:**  
  A multi-step plan including identifying substitutes (like applesauce), mixing ingredients, baking, etc.

---

### Task Type 6: **Novel Compositional Logic Puzzles**  
**Goal:** Test the capability to combine logic and compositional understanding on novel puzzles.

**Example:**  
- **Prompt:**  
  "In a tribe, all who wear hats can speak a secret language. If Joe can speak the secret language, does he wear a hat?"  
- **Expected Answer:**  
  "Not necessarily. The rule says everyone who wears a hat can speak the language, not that every speaker wears a hat."

---

# 2. Scoring Rubrics

| Task Type                  | Scoring Criteria                             | Points | Notes                                                  |
|----------------------------|---------------------------------------------|--------|--------------------------------------------------------|
| Systematic Generalization  | Exact compositional construction correctness | 0 or 1 | 1 if correct suffix/prefix placement; otherwise 0     |
| Causal Reasoning           | Plausibility and causal correctness          | 0-2    | 2 for fully valid cause; 1 for partial; 0 otherwise    |
| Abstraction/Variable Binding | Correct relational output                    | 0 or 1 | 1 if correct variable reference; 0 otherwise           |
| Counterfactual Reasoning   | Reasonableness & coherence of counterfactual | 0-2    | 2 for insightful, plausible; 1 for partial; 0 for nonsense |
| Planning                  | Completeness, logical order, and core elements | 0-3    | 3 full plan; 2 partial but logical; 1 minimal; 0 none  |
| Logic Puzzles              | Logical correctness                           | 0 or 1 | 1 if logically correct inference; else 0               |

- Scoring is done by **human raters** using the rubric, or by automated semantic similarity with thresholds for partial credit.

---

# 3. Protocols Against Memorization & Gaming

**(a) Variant Generation:**  
- For each task, generate many variants by changing surface details while preserving reasoning demands, e.g., swap entity names, alter numbers, use paraphrases.  
- Use programmatic generation or crowd-sourcing for variants, generating >100 unique examples per task type unseen in training data.

**(b) Statistical Controls:**  
- Include “catch” memorized-items known from common benchmarks and obscured by paraphrase to detect rote memorization.  
- Randomly mix in distractor prompts asking unrelated questions to detect answer pattern repetition.

**(c) Cross-Prompt Paraphrasing:**  
- Use multiple prompt wordings per test instance to rule out exploitation of any single prompt wording.

**(d) Zero-shot and Few-shot Tests:**  
- Test variants with no or minimal in-context examples to check for reasoning independent of prompt engineering.
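
The variant-generation idea in (a) can be sketched as a seeded template filler. This is a minimal illustration built on the box-ordering example from Task Type 3; the label pool, relation list and function names are hypothetical.

```python
import random

NAMES = ["A", "B", "C", "P", "Q", "R", "X", "Y", "Z"]  # hypothetical label pool
RELATIONS = [("heavier", "lightest"), ("taller", "shortest"), ("older", "youngest")]


def make_transitivity_item(rng):
    """Build one ordering puzzle and its answer from randomly chosen labels."""
    first, second, third = rng.sample(NAMES, 3)
    comparative, superlative = rng.choice(RELATIONS)
    # List the boxes alphabetically so the listing order does not leak the answer.
    listed = sorted([first, second, third])
    prompt = (
        f"There are three boxes: {listed[0]}, {listed[1]}, and {listed[2]}. "
        f"If box {first} is {comparative} than box {second}, and box {second} is "
        f"{comparative} than box {third}, which box is {superlative}?"
    )
    return {"prompt": prompt, "answer": f"Box {third}"}


def generate_variants(n, seed):
    """Return n variants; the same seed always reproduces the same items."""
    rng = random.Random(seed)
    return [make_transitivity_item(rng) for _ in range(n)]


for item in generate_variants(3, seed=42):
    print(item["prompt"], "->", item["answer"])
```

Because every item is derived from a seed, a fresh seed yields a fresh suite while old runs stay reproducible.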

---

# 4. Sample Sizes and Statistical Tests

- **Sample size:**  
  - At least 100 unique test instances per task type to cover broad variant space.  
  - Multiple prompt formulations per instance (3+), total ~300 prompts per task type.

- **Statistical Test:**  
  - Conduct statistical comparisons between models using nonparametric tests (e.g., Mann-Whitney U) on scored outputs to detect capability difference.  
  - Compute Cohen’s d effect size to measure practical significance.

- **Confidence:**  
  - Power analysis targeting ≥80% power to detect medium effect sizes (d = 0.5) at α = 0.05.
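
For a feel of what that power target implies, here is a sketch using the usual normal approximation for a two-sided, two-sample comparison. The function name is hypothetical, and a rank-based test such as Mann-Whitney U would typically need somewhat more samples than this estimate.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided, two-sample comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


print(n_per_group(0.5))  # → 63
```

So the suggested 100+ instances per task type leaves some headroom over this approximate minimum.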

---

# 5. Addressing Adversarial Behaviors

| Adversarial Strategy                  | Mitigation Approach                               |
|-------------------------------------|--------------------------------------------------|
| Memorizing benchmark outputs         | Variant generation, paraphrasing, catch items    |
| Prompt-engineering to “game” test    | Multi-prompt design and zero-shot tests          |
| Producing generic plausible answers  | Scoring by detailed correctness, not plausibility alone |
| Copying large memorized paraphrases  | Novel, synthetic tasks with no prior examples    |
| Pattern matching on keywords only    | Require multi-step reasoning and combinatorial answers |
| Ignoring variable binding             | Task design enforcing explicit variable reference |

---

# 6. Experimental Validation Plan

**Step 1:** Baseline testing on known models with expected capability gradients (e.g., smaller pattern-memorization-based vs. advanced reasoning-based LLMs).

**Step 2:** Calculate:  
- **True positives (TP):** Correctly identified reasoning-capable models  
- **True negatives (TN):** Correctly identified pattern-matchers  
- **False positives (FP):** Pattern-matchers wrongly classified as reasoning-capable  
- **False negatives (FN):** Reasoning-capable models missed

**Metrics:**  
- Precision = TP / (TP + FP)  
- Recall = TP / (TP + FN)  
- F1 score = harmonic mean of precision and recall  
- ROC-AUC (if applicable to continuous scoring)
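
These formulas translate directly into code. The sketch below uses made-up confusion-matrix counts purely for illustration; the function name is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, F1 and false-positive rate from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical counts: 10 reasoning-capable models and 10 pattern-matchers.
metrics = classification_metrics(tp=8, fp=1, tn=9, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```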

**Iteration:**  
- Analyze failure cases for FP and FN to identify unsound prompts or scoring ambiguities.  
- Refine and augment task variants or scoring rubric accordingly.  
- Re-test and measure performance improvements.

**Human-in-the-loop:**  
- Incorporate expert review of ambiguous or borderline answers.  
- Adjust rubric based on inter-rater consistency.

---

# Summary

The multi-dimensional test suite combines diverse, adversarially-generated tasks probing essential cognitive faculties in LLMs beyond memorization. Systematic statistical evaluation and iterative refinement ensure reliable distinction of genuine compositional reasoning capabilities from superficial pattern matching.

If you want, I can also provide sample code for generating prompt variants or scoring templates. Let me know!

## For the next cell, we will use Ollama

Ollama runs a local web service that gives an OpenAI compatible endpoint,  
and runs models locally using high performance C++ code.

If you don't have Ollama, install it by visiting https://ollama.com, pressing Download and following the instructions.

After it's installed, you should be able to visit here: http://localhost:11434 and see the message "Ollama is running"
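
If you'd rather check from Python than from the browser, something like this sketch should work. It simply fetches the same URL and looks for the same message; the helper name `ollama_is_running` is made up for this example.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the root endpoint answers with the 'Ollama is running' message."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar all mean "not reachable".
        return False
    return "Ollama is running" in body


print("Ollama reachable:", ollama_is_running())
```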

You might need to restart Cursor (and maybe reboot). Then open a Terminal (control+\`) and run `ollama serve`

Useful Ollama commands (run these in the terminal, or with an exclamation mark in this notebook):

`ollama pull <model_name>` downloads a model locally  
`ollama ls` lists all the models you've downloaded  
`ollama rm <model_name>` deletes the specified model from your downloads

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Super important - ignore me at your peril!</h2>
            <span style="color:#ff7800;">The model called <b>llama3.3</b> is FAR too large for home computers - it's not intended for personal computing and will consume all your resources! Stick with the nicely sized <b>llama3.2</b> or <b>llama3.2:1b</b> and if you want larger, try llama3.1 or smaller variants of Qwen, Gemma, Phi or DeepSeek. See <a href="https://ollama.com/models">the Ollama models page</a> for a full list of models and sizes.
            </span>
        </td>
    </tr>
</table>

In [17]:
!ollama pull llama3.2

pulling manifest
pulling dde5aa3fc5ff: 100% ▕██████████████████▏ 2.0 GB
pulling 966de95ca8a6: 100% ▕██████████████████▏ 1.4 KB
pulling fcc5a6bec9da: 100% ▕██████████████████▏ 7.7 KB
pulling a70ff7e570d9: 100% ▕██████████████████▏ 6.0 KB
pulling 56bb8bd477a5: 100% ▕██████████████████▏   96 B
pulling 34bb5ab01051: 100% ▕██████████████████▏  561 B
verifying sha256 digest
writing manifest
success


In [18]:
ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
model_name = "llama3.2"

response = ollama.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

Designing a comprehensive test-suite to evaluate black-box LLMs requires careful consideration of various task types, scoring rubrics, and protocols to prevent memorized answers and prompt-engineering gaming. Here's a proposed design for such a test-suite:

**Task Types (6)**

1. **Classification of Analogies**: Given an analogy statement like "cats have whiskers," what category does [animal] belong to?
	* Example input: `[cat, whisker]`, `class: cat`
	* Expected output: `[animal], class: feline`
2. **Systematic Generalization**: Predict the result of a generalized version of a rule (e.g., "if A > B, then C = D").
	* Example input: "[mathematical expression]", "if A > 3 and C > 1"
	* Expected output: "[mathematical expression substitution with A > C and new_value]"
3. **Causal Reasoning**: Given a sequence of events, predict the next event (e.g., "what did John do after meeting Alice?").
	* Example input: `[John , met Alice]`
	* Expected output: `"he talked to Alice"`
4. **Abstraction**: Identify the underlying principle behind a concept or idea.
	* Example input: "[human emotion with synonyms]"
	* Expected output: "concept/emotion underlying it, e.g., [happiness]"

5. **Counterfactuals**: Predict an alternative outcome given a hypothetical scenario (e.g., what would have happened if John had not met Alice?).
	* Example input: `[scenario, variable John, predicted_action, consequence]`
	* Expected output: `[variable John prediction without consequences to that particular situation]`
6. **Planning with Limited Information**: Plan a sequence of actions given incomplete or uncertain information (e.g., plan the best way for John to get from one city to another knowing only route lengths).
	* Example input: `[starting location, ending location, routes information and any other related data]`
	* Expected output: `"steps taken from the starting location, going through various other relevant destinations, to the final desired location"`

**Scoring Rubric**

For each task, a clear scoring rubric will be developed to assess the LLM's performance. The rubrics will consider factors such as:

* Accuracy
* Completeness
* Novelty (correctness of responses that don't correspond to expected solutions)
* Coherence

The scores for each answer will be calculated using statistical measures such as mean and standard deviation.

**Protocols to Prevent/ Detect Memorized Answers and Prompt-Engineering Gaming**

1.  **Exhaustive variation generation**: Generate multiple permutations of possible correct answers or solution paths to cover unseen variants.
2.  **Statistical analysis with variance control**: Implement a data analysis framework that considers factors like frequency, consistency and patterns that may indicate cheating.
3.  **Input normalization and randomization**: Use randomized test examples for each task type so that output comparisons across testing scenarios are fair, rather than reusing the same inputs and prompts.
4.  **Adversarial testing using carefully crafted inputs**: Develop a set of artificially created input variants that exploit the LLM's pattern-matching vulnerabilities while maintaining coherence and relevance.

**Sample Sizes and Statistical Tests**

To establish confidence in detecting a difference in capabilities between pattern-matching-like and genuinely compositional reasoning LLMs:

*   **Large sample sizes:** Evaluate at least 100-500 tests for each task type, ensuring sufficient statistical power to detect even small differences.
*   **Permutation tests or bootstrap samples:** Use techniques like permutation distributions or resampling to assess the probability of observing differences by chance.
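
As one possible reading of the resampling suggestion, here is a percentile-bootstrap sketch for the difference in mean score between two models. The function name and the scores are hypothetical, not real results.

```python
import random


def bootstrap_ci(scores_a, scores_b, n_resamples=2_000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each model's scores with replacement, keeping the sample sizes.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(sum(resample_a) / len(resample_a) - sum(resample_b) / len(resample_b))
    diffs.sort()
    tail = (1 - confidence) / 2
    lower = diffs[int(tail * n_resamples)]
    upper = diffs[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical 0/1 scores on 100 items per model.
model_a = [1] * 80 + [0] * 20
model_b = [1] * 50 + [0] * 50
low, high = bootstrap_ci(model_a, model_b)
print(f"95% CI for the difference in accuracy: [{low:.2f}, {high:.2f}]")
```

If the interval excludes zero, the difference is unlikely to be due to chance alone at that confidence level.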

**Adversarial Behaviors**

Potential adversarial behaviors include:

1.  **Pattern-filling**: The LLM may fill in patterns it has learned from training data, even when the task requires genuine reasoning.
2.  **Overfitting to specific prompts**: The AI model might perform well on a particular input but struggle with novel or similar prompts.
3.  **Information-seeking behavior**: The model could be designed to gather additional information that isn't present in the prompt.

**Designing Against these adversarial behaviors:**

1.  **Introducing 'distractor' subtasks:** Add unrelated smaller questions between two questions to disrupt the model's pattern-detection shortcuts.
2.  **Adding 'unanswerable' questions**: Include questions the model should decline to answer because information is missing or uncertain; if it never declines, that indicates a pattern-matching solution is at work.
3.  **Randomizing some answers as correct/incorrect examples**: This way you can detect the AI trying to infer things it's not supposed to.

**Testing Methodology Experimental Validation:**

Run a series of experiments with large test sets for each task type, using a combination of human evaluators (who evaluate the output) and automated testing protocols (which check whether LLM outputs are correct or incorrect).

In [34]:
ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
model_name = "gemma3:4b"

response = ollama.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

## Adversarial-Resistant Test Suite for Evaluating LLM Reasoning Capabilities

This design outlines a comprehensive test suite aimed at distinguishing between LLMs relying on pattern matching/memorization and those exhibiting genuine reasoning. The suite focuses on robustness through diverse tasks, strict controls, and statistical analysis.

**1. Task Types & Examples:**

Here are six task types designed to probe different aspects of reasoning:

| Task Type        | Description                               | Example Input                                                              | Expected Output (demonstrating reasoning)                                                                                                        |
|------------------|------------------------------------------|---------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| **1. Systematic Generalization (SGM)** | Extending patterns with quantitative rules. | "If a car travels at 60 mph for 2 hours, how far does it travel? If a train travels at 80 mph for 3.5 hours, how far does it travel?" | "60 mph * 2 hours = 120 miles. 80 mph * 3.5 hours = 280 miles." (Correct application of distance = speed * time) |
| **2. Causal Reasoning (CR)** | Predicting consequences of actions.      | "John spilled water on the table. What's likely to happen?"                | "The table will likely become wet. The floor may become wet if the water isn't cleaned up."                                                   |
| **3. Abstraction (AB)** |  Conceptual mapping and reformulation.   | “A robin is a bird. A sparrow is a bird. Which statement is most similar: ‘A robin is a songbird’ or ‘A sparrow is a songbird’?” | “‘A robin is a songbird’.” (Recognizing the abstraction of 'songbird' which applies to both robin and sparrow) |
| **4. Counterfactuals (CF)** | Considering alternative scenarios.        | “If the sky were green, what color would the grass be?”                      | “The grass would be its normal color (green).” (Requires understanding of natural laws and relationships – grass is green independently of sky color) |
| **5. Planning (PL)** |  Sequential reasoning with a goal.      | “I want to bake a cake. List the steps you would take.”                     | “1. Preheat the oven. 2. Gather ingredients (flour, sugar, eggs, etc.). 3. Mix the ingredients. 4. Pour batter into a pan. 5. Bake for [duration].”                     |
| **6. Variable Binding (VB)** | Handling variables and relationships. | “If I have 3 apples and give 1 away, how many do I have?” ; “If x = 5 and y = 2, what is x + y?” | “2 apples.” ; “7” (Correct use of arithmetic and/or symbolic manipulation)                                                                              |


**2. Scoring Rubric:**

Each task will use a three-point rubric:

* **3 Points:**  The response is entirely correct and demonstrates a clear understanding of the underlying reasoning.
* **1 Point:** The response demonstrates a partial understanding or contains a minor error, but the core logic is present.
* **0 Points:** The response is incorrect, nonsensical, or completely unrelated to the input.

**3. Controls & Prevention Protocols:**

* **Unseen Variant Generation:** Automatically generate variations of input prompts (synonyms, rephrasing, slightly altered contexts) to avoid memorized answers. Utilise paraphrasing models specifically trained for this purpose.
* **Statistical Controls:**
    * **Temperature Scaling:** Vary the LLM’s temperature setting (influencing randomness) across different runs. This adds a degree of noise and reduces reliance on memorized templates.
    * **Prompt Length Limits:** Strict limits on prompt length to discourage overly verbose, memorization-driven responses.
    * **Response Length Limits:** Similarly, set limits to discourage unnecessary elaboration.
* **“Un-promptable” Data Injection:** Introduce subtle, non-obvious constraints into the input – e.g., “The answer must be a prime number,” or “The answer cannot be a multiple of 5.”
* **Chain-of-Thought Debugging:** Integrate a Chain-of-Thought (CoT) model (e.g., a smaller, more reliable model) to automatically analyze the LLM’s reasoning chain and flag potential issues.
* **Self-Reflection:**  Prompt the LLM to critically assess its own response – “Is your answer logically sound?  Explain your reasoning in detail.” This can identify where memorization is masking incorrect understanding.



**4. Sample Sizes & Statistical Tests:**

* **Sample Size:**  Run each task at least 100 times for each LLM being evaluated.  Larger sample sizes (200-300) would improve statistical power.
* **Statistical Tests:**
    * **Chi-Square Test:** Compare the distribution of 3, 1, and 0 scores across the different LLMs.  A significant Chi-Square value indicates a difference in performance.
    * **Wilcoxon Signed-Rank Test:**  If the distributions are not normally distributed, use the Wilcoxon test for comparing paired samples (e.g., comparing the average score of LLM A to the average score of LLM B on the same task).
    * **Cohen's d:**  Provides an effect size measure of the difference between means.
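
To illustrate the chi-square comparison of 3/1/0 score distributions, here is a standard-library sketch for the two-model case. The function name and the counts are hypothetical; it relies on the fact that with 2 degrees of freedom the chi-square p-value has the closed form exp(-x/2).

```python
from math import exp


def chi_square_2x3(counts_a, counts_b):
    """Pearson chi-square for two models' counts of scores (3, 1, 0).

    Returns (statistic, p_value). Assumes every score category occurs at least once.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand_total = total_a + total_b
    statistic = 0.0
    for observed_a, observed_b in zip(counts_a, counts_b):
        column_total = observed_a + observed_b
        if column_total == 0:
            raise ValueError("each score category needs at least one observation")
        for observed, row_total in ((observed_a, total_a), (observed_b, total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    # (2 - 1) * (3 - 1) = 2 degrees of freedom, where the survival function is exp(-x / 2).
    return statistic, exp(-statistic / 2)


# Hypothetical counts of 3-, 1- and 0-point responses over 100 runs per model.
stat, p_value = chi_square_2x3([70, 20, 10], [40, 30, 30])
print(f"chi-square = {stat:.2f}, p = {p_value:.5f}")
```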

**5. Adversarial Behaviors & Mitigation:**

| Adversarial Behavior           | Mitigation Strategy                                              |
|-------------------------------|------------------------------------------------------------------|
| **Rule-Following Template Replication** | Frequent variant generation, statistical temperature scaling. |
| **Obfuscated Reasoning**        | Chain-of-Thought debugging, “Self-Reflection” prompts.         |
| **Overly Detailed, but Incorrect Responses** | Response length limits, constrained input, outlier detection. |
| **Contextual Framing as Knowledge** | Unseen variant generation, tests specifically designed to expose this. |



**6. Experimental Validation Plan:**

* **Phase 1: Baseline Evaluation:** Conduct preliminary runs to identify prompt engineering vulnerabilities and refine the test suite.
* **Phase 2: Focused Testing:** Execute the full test suite across multiple LLMs (e.g., GPT-4, Claude, open-source models).
* **Phase 3: Iteration & Refinement:**  Analyze results from Phase 2 to identify where the suite needs improvement. Specifically, add tasks targeted at the weaknesses identified. Generate new unseen variants.
* **Metrics for False Positives/Negatives:** Track the percentage of times an LLM receives a 3 (correct) or 0 (incorrect) score. Analyze which task types lead to the highest error rates.
* **Continuous Monitoring:** Monitor the LLM’s responses over time to detect shifts in performance and potential gaming attempts.



**Key Considerations & Future Work:**

* **Ground Truth Data:** High-quality, expertly annotated ground truth data is crucial for accurate scoring.
* **Model Architectures:**  The suite’s effectiveness will be heavily influenced by the underlying LLM architectures being evaluated.
* **Dynamic Testing:**  Implement a system to automatically adapt the test suite based on LLM performance – focusing more on areas where the LLM is struggling.

This detailed design provides a robust framework for evaluating LLM reasoning capabilities. By combining diverse tasks, stringent controls, and rigorous statistical analysis, we can gain valuable insights into the true level of understanding exhibited by these powerful models.  The iterative validation plan ensures the test suite remains relevant and effective as LLMs continue to evolve.

In [35]:
ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
model_name = "gemma3:12b"

response = ollama.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

## Adversarial-Resistant Test-Suite for LLM Reasoning Capabilities

This design focuses on distinguishing between "pattern-matching" LLMs and those exhibiting genuine compositional reasoning, emphasizing robustness and adversarial resistance.

**I. Task Types & Examples (Probing Compositional Reasoning)**

Here are six task types, each targeting a specific aspect of reasoning, with example inputs and expected outputs:

**1. Systematic Generalization (SG):** Tests ability to generalize patterns beyond observed instances.
    * **Input:** "A blue block sits on a red block. A green block sits on a blue block. What happens when a yellow block sits on a green block?"
    * **Expected Output:** "A yellow block sits on a green block." (Demonstrates transfer of the pattern 'X sits on Y')
    * **Variant:**  Changing block colors, shapes, and adding distraction statements.

**2. Causal Reasoning (CR):**  Tests understanding of cause-and-effect.
    * **Input:** "Rain often makes the ground wet. The ground is wet. Could it be raining?"
    * **Expected Output:** "It could be raining, but there might be other reasons (e.g., sprinklers)."  (Shows understanding of correlation vs. causation)
    * **Variant:** Introducing misleading information or complex causal chains.

**3. Abstraction (AB):** Tests extracting underlying principles and applying them to novel contexts.
    * **Input:**  Present a series of stories following a pattern (e.g., "The baker made bread, and the village had food. The builder built houses, and the town had shelter.").  Then ask: "The musician played music, what does the village now have?"
    * **Expected Output:** "The village now has joy/entertainment/art."  (Demonstrates abstracting the relationship between profession and societal benefit)
    * **Variant:**  Varying professions, social structures, and complexity of the relationship.

**4. Counterfactuals (CT):** Tests ability to reason about "what if" scenarios.
    * **Input:** "The cat chased the mouse. If the cat had been asleep, what would have happened?"
    * **Expected Output:** "The mouse might have escaped/found food/taken a different path." (Demonstrates mental simulation)
    * **Variant:**  Complex counterfactuals incorporating multiple variables and constraints.

**5. Planning (PL):** Tests formulating and executing a sequence of actions to achieve a goal.
    * **Input:** "You need to cross a river with a fox, a chicken, and a sack of grain. You have a boat that can only carry you and one other thing. What should you do?"
    * **Expected Output:** (A logical sequence) "1. Take the chicken across. 2. Return alone. 3. Take the fox across. 4. Bring the chicken back. 5. Take the grain across. 6. Return alone. 7. Take the chicken across."
    * **Variant:**  Increasing the number of items and complexities of constraints.

**6. Variable Binding (VB):**  Tests understanding and manipulation of variables in abstract statements.
    * **Input:**  "Consider objects A and B.  If A is larger than B, and B is larger than C, then is A larger than C?"
    * **Expected Output:** "Yes." (Demonstrates transitivity and variable relationship understanding.)
    * **Variant:**  Introducing complex relationships, inequalities, and quantifiers.



**II. Scoring Rubric (Example: Systematic Generalization)**

* **0 Points:** Completely incorrect or gibberish.
* **1 Point:** Identifies the presence of something on top. (e.g., "A block sits on another block"). Partial understanding.
* **2 Points:** Correctly identifies the block color. (e.g., "A yellow block sits on a green block"). Shows pattern recognition.
* **3 Points:** Correctly states the new configuration and acknowledges the inferred pattern.  (e.g. "A yellow block sits on a green block, because the pattern is that one block sits on another"). Demonstrates generalization.

Similar rubrics would be developed for each task, prioritizing *reasoning steps* and avoiding reward for mere superficial correctness.



**III. Adversarial Mitigation & Detection**

* **Prompt Engineering Resistance:**
    * **Unseen Variants:** Automatically generate numerous variants of each input using synonym replacement, paraphrasing, and structural alterations (e.g., reordering sentences, adding/removing context).  This challenges memorization reliance.
    * **Noise Injection:** Add noise to the input (typos, grammatical errors, irrelevant information) to test robustness to imperfections.
    * **Negative Constraints:** Explicitly forbid common "cheating" techniques in instructions (e.g., "Do not simply repeat information" or "Do not try to find a direct answer online").
* **Memorization Detection:**
    * **Statistical Controls:** Monitor next-token probabilities.  Low entropy or predictable next-token distributions strongly suggest memorization.
    * **External Search Correlation:** Examine likelihood of extracted phrases from LLM's response appearing verbatim within top search results for the input prompt. High correlation points to retrieved factoids.  Implement a "blurring" technique to remove exact matches.
    * **Cross-Task Consistency:**  Evaluate consistency between performance across different tasks probing similar underlying skills (e.g., Causal Reasoning & Counterfactuals). Large inconsistencies could show task-specific memorization.



**IV. Sample Size & Statistical Tests**

* **Sample Size:** Begin with **N=200-300** prompts per task. This allows for reasonable power (estimated using pilot data). Increase if variance is high.
* **Statistical Tests:**
    * **t-tests/Mann-Whitney U-tests:** Compare mean scores of LLMs for each task.
    * **ANOVA/Kruskal-Wallis tests:** Compare performance across multiple LLMs simultaneously.
    * **Bayesian Hypothesis Testing:**  Provides probability estimates of LLMs exhibiting superior reasoning ability, accommodating prior beliefs about the task difficulty and model complexity.
    * **Effect Size Calculation:** Cohen's d or similar metrics capture the practical significance of observed differences.



**V. Potential Adversarial Behaviors & Mitigation**

* **"Hallucinating" Justifications:** LLMs might produce superficially sound explanations while failing to reason correctly. (Mitigation: Rubric focuses on reasoning steps, not *only* output correctness.)
* **"Exploiting" Prompts:** LLMs might find a narrow prompt structure yielding high scores, not demonstrating true generalization. (Mitigation: Extensive prompt variants & noise injection)
* **"Circumventing" Instructions:**  LLMs might ignore negative constraints or semantic reinterpretation of the prompt. (Mitigation: Re-phrasing instruction and negative constraints frequently to avoid easy circumvention)

**VI. Experimental Validation & Iteration**

* **Human Baseline:** Establish a human baseline score (N=30) for each task.  LLMs must demonstrably surpass this baseline to be considered capable.
* **False Positive/Negative Rates:**
    * **False Positives:** LLMs achieving high scores despite lacking true reasoning skills. This necessitates rigorous rubric validation and adversarial testing.
    * **False Negatives:** LLMs possessing genuine reasoning skills being incorrectly classified as pattern-matchers. This requires re-evaluation of task design and scoring.
* **Iteration:**
    1. **Analysis of Failure Cases:**  Deep dive into why individual LLMs failed on specific prompts. Categorize failures by type (memorization, misunderstandings, etc.).
    2. **Task Enhancement:** Modify tasks to directly address weaknesses revealed in the analysis.
    3. **Rubric Refinement:**  Adjust the scoring rubric to be more granular and discriminatory.
    4. **Adversarial Prompt Expansion:** Generate new adversarial prompts specifically targeting identified vulnerabilities.



**Conclusion:**

This test-suite design goes beyond simple accuracy comparisons. By focusing on compositional reasoning through targeted tasks, rigorous adversarial mitigation, and thorough validation, it aims to reliably differentiate true reasoning capabilities from superficial pattern-matching in LLMs, ultimately driving progress in AI safety and alignment.  This is an iterative process, constantly adapting to new LLM innovations and adversarial strategies.

In [36]:
# So where are we?

print(competitors)
print(answers)


['gpt-5-mini', 'claude-sonnet-4-5', 'gemini-3-flash-preview', 'deepseek-chat', 'openai/gpt-oss-120b', 'gpt-4.1-mini', 'llama3.2', 'gemma3:4b', 'gemma3:12b']
['Below is a complete, practical test‑suite design for reliably distinguishing LLMs that primarily pattern‑match / memorize from LLMs that show genuine compositional reasoning and robust generalization. It is written to be implementable against black‑box LLMs (only I/O access). The suite mixes synthetic, randomized and structured tasks, uses statistical controls and adversarial checks, and includes validation and iteration plans.\n\nSummary of components\n- (1) Six task types (with concrete examples and expected outputs) that probe different aspects of compositional reasoning.\n- (2) A clear scoring rubric for each task.\n- (3) Protocols to prevent/detect memorized answers and prompt‑engineering gaming (randomization, held‑out vocab, paraphrase invariance, adversarial decoys, statistical controls).\n- (4) Sample sizes and statistic

In [37]:
# It's nice to know how to use "zip"
for competitor, answer in zip(competitors, answers):
    print(f"Competitor: {competitor}\n\nAnswer:\n{answer}")


Competitor: gpt-5-mini

Answer:
Below is a complete, practical test‑suite design for reliably distinguishing LLMs that primarily pattern‑match / memorize from LLMs that show genuine compositional reasoning and robust generalization. It is written to be implementable against black‑box LLMs (only I/O access). The suite mixes synthetic, randomized and structured tasks, uses statistical controls and adversarial checks, and includes validation and iteration plans.

Summary of components
- (1) Six task types (with concrete examples and expected outputs) that probe different aspects of compositional reasoning.
- (2) A clear scoring rubric for each task.
- (3) Protocols to prevent/detect memorized answers and prompt‑engineering gaming (randomization, held‑out vocab, paraphrase invariance, adversarial decoys, statistical controls).
- (4) Sample sizes and statistical tests for confident capability differences.
- (5) Likely adversarial behaviors and mitigations.
- (6) Experimental validation plan

In [38]:
# Let's bring this together - note the use of "enumerate"

together = ""
for index, answer in enumerate(answers):
    together += f"# Response from competitor {index+1}\n\n"
    together += answer + "\n\n"

In [39]:
print(together)

# Response from competitor 1

Below is a complete, practical test‑suite design for reliably distinguishing LLMs that primarily pattern‑match / memorize from LLMs that show genuine compositional reasoning and robust generalization. It is written to be implementable against black‑box LLMs (only I/O access). The suite mixes synthetic, randomized and structured tasks, uses statistical controls and adversarial checks, and includes validation and iteration plans.

Summary of components
- (1) Six task types (with concrete examples and expected outputs) that probe different aspects of compositional reasoning.
- (2) A clear scoring rubric for each task.
- (3) Protocols to prevent/detect memorized answers and prompt‑engineering gaming (randomization, held‑out vocab, paraphrase invariance, adversarial decoys, statistical controls).
- (4) Sample sizes and statistical tests for confident capability differences.
- (5) Likely adversarial behaviors and mitigations.
- (6) Experimental validation plan w
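
Note that the responses always reach the judge in the same order. LLM judges can be sensitive to the order in which responses appear, so one variation worth trying is to shuffle the order and map the ranking back afterwards. The sketch below is one way to do that under stated assumptions: `build_shuffled_block` and `to_original_indices` are hypothetical helpers, not part of this lab, and the answers are toy strings.

```python
import random


def build_shuffled_block(answers, seed=None):
    """Return (text, order), where order[i] is the original index of the
    answer shown to the judge as competitor i + 1."""
    rng = random.Random(seed)
    order = list(range(len(answers)))
    rng.shuffle(order)
    text = ""
    for shown_number, original_index in enumerate(order, start=1):
        text += f"# Response from competitor {shown_number}\n\n"
        text += answers[original_index] + "\n\n"
    return text, order


def to_original_indices(ranks, order):
    """Map the judge's shown numbers (strings, best first) back to
    zero-based indices into the original answers list."""
    return [order[int(shown) - 1] for shown in ranks]


# Toy usage with made-up answers.
toy_answers = ["alpha", "beta", "gamma"]
shuffled_text, shown_order = build_shuffled_block(toy_answers, seed=7)
print(shuffled_text)
print(to_original_indices(["1", "2", "3"], shown_order))
```

With this in place, `competitors[index]` for each returned index gives the real model name, whatever order the judge saw.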

In [48]:
judge = f"""You are judging a competition between {len(competitors)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}}

Here are the responses from each competitor:

{together}

Now respond with the JSON with the ranked order of the competitors, nothing else. Do not include markdown formatting or code blocks."""
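
A quick note on the doubled braces in the prompt above: inside an f-string, `{{` and `}}` produce literal `{` and `}`, while single braces interpolate a value. A tiny sketch with a made-up `toy_count` variable shows the difference:

```python
toy_count = 3  # hypothetical value, standing in for len(competitors)

toy_prompt = f"""You are judging a competition between {toy_count} competitors.
Respond with JSON in this format:
{{"results": ["best competitor number", "second best competitor number"]}}"""

# The printed prompt contains single braces around the JSON example.
print(toy_prompt)
```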


In [49]:
print(judge)

You are judging a competition between 9 competitors.
Each model has been given this question:

Design a comprehensive, adversarial‑resistant test-suite that reliably distinguishes between black‑box LLMs that rely primarily on pattern matching and memorization and LLMs that demonstrate genuine compositional reasoning and robust generalization; for your test‑suite, provide (1) at least six concrete task types with example inputs and expected outputs that probe different aspects of compositional reasoning (e.g., systematic generalization, causal reasoning, abstraction, counterfactuals, planning, variable‑binding), (2) a clear scoring rubric for each task, (3) protocols to prevent or detect memorized answers and prompt‑engineering gaming (including generation of unseen variants and statistical controls), (4) proposed sample sizes and statistical tests to assert with high confidence a difference in capability, (5) possible adversarial behaviors the LLM might use to appear capable and how yo

In [50]:
judge_messages = [{"role": "user", "content": judge}]

In [51]:
# Judgement time!

openai = OpenAI()
response = openai.chat.completions.create(
    model="gpt-5-mini",
    messages=judge_messages,
)
results = response.choices[0].message.content
print(results)


{"results": ["1", "5", "4", "3", "9", "8", "6", "7", "2"]}
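
The judge returned clean JSON this time, but a model can still wrap its answer in a code block or rank the same competitor twice, despite the instructions. Below is a defensive sketch, not the approach this lab uses: `parse_ranking` is a hypothetical helper that pulls out the outermost `{...}` span and checks that the ranking is a permutation of 1 to N.

```python
import json


def parse_ranking(raw_text, expected_count):
    """Extract the "results" list from judge output and validate it.

    Returns the ranking as a list of ints (best first), or raises ValueError.
    """
    # Keep only the outermost {...} span, ignoring any text or fences around it.
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in judge output")
    parsed = json.loads(raw_text[start:end + 1])
    ranks = [int(item) for item in parsed["results"]]
    # Every competitor number must appear exactly once.
    if sorted(ranks) != list(range(1, expected_count + 1)):
        raise ValueError(f"Not a permutation of 1..{expected_count}: {ranks}")
    return ranks


sample = '{"results": ["1", "5", "4", "3", "9", "8", "6", "7", "2"]}'
print(parse_ranking(sample, expected_count=9))
```

Something like `parse_ranking(results, len(competitors))` could then stand in for the bare `json.loads` call when you want a failure to be loud rather than subtle.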


In [52]:
# OK let's turn this into results!

results_dict = json.loads(results)
ranks = results_dict["results"]
for index, result in enumerate(ranks):
    competitor = competitors[int(result)-1]
    print(f"Rank {index+1}: {competitor}")

Rank 1: gpt-5-mini
Rank 2: openai/gpt-oss-120b
Rank 3: deepseek-chat
Rank 4: gemini-3-flash-preview
Rank 5: gemma3:12b
Rank 6: gemma3:4b
Rank 7: gpt-4.1-mini
Rank 8: llama3.2
Rank 9: claude-sonnet-4-5


In [53]:
# Gemini 3 Flash as the judge

response = gemini.chat.completions.create(
    model="gemini-3-flash-preview",
    messages=judge_messages,
)
results = response.choices[0].message.content
print(results)

{"results": ["1", "4", "5", "3", "6", "9", "8", "7", "2"]}


In [54]:
# Turn Gemini's ranking into results, same as before

results_dict = json.loads(results)
ranks = results_dict["results"]
for index, result in enumerate(ranks):
    competitor = competitors[int(result)-1]
    print(f"Rank {index+1}: {competitor}")

Rank 1: gpt-5-mini
Rank 2: deepseek-chat
Rank 3: openai/gpt-oss-120b
Rank 4: gemini-3-flash-preview
Rank 5: gpt-4.1-mini
Rank 6: gemma3:12b
Rank 7: gemma3:4b
Rank 8: llama3.2
Rank 9: claude-sonnet-4-5
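
The two judges agree on the top and bottom but disagree in the middle. One simple way to combine them is to average each competitor's position across judges. This is a sketch of that idea using the two rankings printed above; `mean_positions` is a hypothetical helper, and averaging positions is just one of several reasonable aggregation choices.

```python
# Copied from the outputs above.
competitor_names = ['gpt-5-mini', 'claude-sonnet-4-5', 'gemini-3-flash-preview',
                    'deepseek-chat', 'openai/gpt-oss-120b', 'gpt-4.1-mini',
                    'llama3.2', 'gemma3:4b', 'gemma3:12b']
judge_rankings = {
    "gpt-5-mini": ["1", "5", "4", "3", "9", "8", "6", "7", "2"],
    "gemini-3-flash-preview": ["1", "4", "5", "3", "6", "9", "8", "7", "2"],
}


def mean_positions(names, rankings):
    """Average the 1-based position each competitor received across judges."""
    totals = {name: 0 for name in names}
    for ranks in rankings.values():
        for position, number in enumerate(ranks, start=1):
            totals[names[int(number) - 1]] += position
    return {name: total / len(rankings) for name, total in totals.items()}


averages = mean_positions(competitor_names, judge_rankings)
for name, average in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{average:.1f}  {name}")
```

A lower average is better. With these two judges, gpt-5-mini averages 1.0 and claude-sonnet-4-5 averages 9.0, while deepseek-chat and openai/gpt-oss-120b tie at 2.5.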


<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/exercise.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Exercise</h2>
            <span style="color:#ff7800;">Which agentic design pattern(s) did this lab use? Try extending it to add another agentic design pattern.
            </span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/business.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#00bfff;">Commercial implications</h2>
            <span style="color:#00bfff;">Patterns like these - sending a task to multiple models and evaluating the results -
            are common when you need to improve the quality of your LLM responses. The approach applies broadly
            to business projects where accuracy is critical.
            </span>
        </td>
    </tr>
</table>