avnithv/psetter


psetter (pset)

Turn a competitive-programming problem statement into a valid, usable problem package — statement, validator, checker, reference + brute solutions, real test data with answers, and a verdict table — by asking an LLM for each artifact in order and assembling the outputs.

The tool does no iteration. It generates each artifact once, compiles it, checks it against the statement's samples (which it treats as ground truth), and assembles the package. It stops only at mechanical gates or a ground-truth failure — never to "make the LLM try again." Reasoning about why something failed is intentionally left to later versions.


What it produces

A problem directory containing:

<problem>/
  statement.md                 # the input
  spec.json                    # parsed: limits, checker, constraints, num_tests
  samples/01.in, 01.out, …     # samples extracted verbatim from the statement
  validator/validator.cpp      # + compiled binary
  checker/checker.cpp          # standard testlib checker or a custom one
  solutions/reference.cpp      # source of truth for answers
  solutions/brute.cpp          # slow, obviously-correct second opinion
  generator/gen.cpp            # testlib generator
  tests/01.in, 01.ans, …       # validated tests + reference-generated answers
  tests/matrix.txt             # last verdict matrix
  manifest.json                # what's built, when, and whether it's verified

The gates (the only hard stops)

  • Gate A: compilation. Any generated C++ that doesn't compile stops the run and prints the compiler error.
  • Sample gates: ground truth. The samples in the statement are authoritative. A validator that rejects a sample, a checker that won't accept a sample's official output, or a solution that's wrong on a sample stops the run.
  • Gate B: test validity. Every generated test must pass the validator and be answerable by the reference; otherwise the run stops and names the offending test.

A stop prints STOP: … and exits non-zero. Fix the cause and re-run that step.
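The gate behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Stop` exception, the function names, and the choice of stderr for the message are hypothetical and are not taken from psetter's source.

```python
import sys
from typing import Callable, Iterable


class Stop(Exception):
    """A failed gate. Hypothetical name, not taken from psetter's source."""


def gate_a(artifact: str, returncode: int, compiler_output: str) -> None:
    # Gate A: a non-zero compiler exit status stops the run and surfaces the error.
    if returncode != 0:
        raise Stop(f"{artifact} failed to compile\n{compiler_output}")


def sample_gate(artifact: str, failed_samples: list[str]) -> None:
    # Sample gates: samples are ground truth, so any disagreement is fatal.
    if failed_samples:
        raise Stop(f"{artifact} disagrees with sample(s): {', '.join(failed_samples)}")


def gate_b(
    tests: Iterable[str],
    is_valid: Callable[[str], bool],
    can_answer: Callable[[str], bool],
) -> None:
    # Gate B: every generated test must pass the validator and be answerable
    # by the reference; the message names the first offending test.
    for test in tests:
        if not is_valid(test):
            raise Stop(f"validator rejected {test}")
        if not can_answer(test):
            raise Stop(f"reference could not answer {test}")


def run(steps: Iterable[Callable[[], None]]) -> int:
    # Run steps in order; the first failed gate prints "STOP: ..." and
    # yields a non-zero exit status. There is no retry.
    try:
        for step in steps:
            step()
    except Stop as exc:
        print(f"STOP: {exc}", file=sys.stderr)
        return 1
    return 0
```

A caller would pass the result to `sys.exit`, for example `sys.exit(run([...]))`. Modelling every gate as an exception keeps the "no iteration" rule visible: nothing in `run` catches a failure in order to try again.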


Requirements

  • Compiler: g++-15 (Homebrew GCC). Override with PSETTER_CXX.
  • testlib: ships in context/testlib/ (the header + the 19 standard checkers).
  • Python: 3.14 via uv; deps anthropic, google-genai, typer, pydantic, pytest.
  • API key: at least one of ANTHROPIC_API_KEY / GEMINI_API_KEY.
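
As a rough illustration of the compiler override, the lookup could resemble the sketch below. Only `PSETTER_CXX` and the `g++-15` default come from this README; the function names and the compiler flags are assumptions made for the example.

```python
import os
from typing import Mapping, Optional

DEFAULT_CXX = "g++-15"  # default named in the requirements above


def resolve_compiler(env: Optional[Mapping[str, str]] = None) -> str:
    # PSETTER_CXX wins when it is set and non-empty; otherwise use the default.
    env = os.environ if env is None else env
    return env.get("PSETTER_CXX") or DEFAULT_CXX


def compile_command(
    source: str, output: str, env: Optional[Mapping[str, str]] = None
) -> list[str]:
    # The flags are illustrative assumptions, not psetter's real command line.
    return [resolve_compiler(env), "-std=c++17", "-O2", "-o", output, source]
```

For example, running with `PSETTER_CXX=clang++` in the environment would make `compile_command("validator/validator.cpp", "validator/validator")` start with `clang++`.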

Setup

uv sync                       # install deps + register the `pset` command
cp .env.example .env          # then add your API key(s)

Verify the toolchain: bash tests/smoke/run.sh.

Run either way:

uv run pset <cmd> ...                 # no activation needed
source .venv/bin/activate && pset ... # or activate once

LLM provider & model

Default provider is gemini (gemini-2.5-flash); claude is also supported (default claude-opus-4-8). On a transient Gemini error the provider falls back through PSETTER_GEMINI_FALLBACKS (default gemini-2.5-flash-lite); any provider failure becomes a clean one-line STOP:.
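
The fallback behaviour might be pictured as follows. This is a sketch, not psetter's code: `TransientError`, `model_chain`, and `generate_with_fallback` are hypothetical names, and treating `PSETTER_GEMINI_FALLBACKS` as a comma-separated list is an assumption.

```python
import os
from typing import Callable, Mapping, Optional


class TransientError(Exception):
    """Stand-in for a retryable provider error (hypothetical)."""


def model_chain(primary: str, env: Optional[Mapping[str, str]] = None) -> list[str]:
    # Assumption: the variable holds a comma-separated list of model names.
    env = os.environ if env is None else env
    raw = env.get("PSETTER_GEMINI_FALLBACKS", "gemini-2.5-flash-lite")
    fallbacks = [name.strip() for name in raw.split(",") if name.strip()]
    return [primary] + [name for name in fallbacks if name != primary]


def generate_with_fallback(
    call: Callable[[str], str],
    primary: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    # Try each model once, in order; only transient errors move to the next one.
    last_error: Optional[Exception] = None
    for model in model_chain(primary, env):
        try:
            return call(model)
        except TransientError as exc:
            last_error = exc
    # Every model failed: exit non-zero with a one-line STOP message.
    raise SystemExit(f"STOP: provider failed: {last_error}")
```

Note that this is a fallback across models, not a repair loop: the same prompt is sent to a different model only because the first one was unavailable.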

Set the provider/model for future commands (persists to .env):

pset config provider claude              # use Claude
pset config model claude-haiku-4-5       # set the model for the current provider
pset config model gemini-2.5-pro -p gemini
pset config show                         # what will be used

Or override per run with -p/--provider, or set env vars directly (PSETTER_PROVIDER, PSETTER_CLAUDE_MODEL, PSETTER_GEMINI_MODEL). After each generating command, pset prints token usage and an estimated cost.
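
One plausible reading of the override order is "flag first, then environment, then built-in default". The sketch below assumes that precedence; the `resolve` function is hypothetical, while the environment variable names and default models are the ones listed in this section.

```python
from typing import Mapping, Optional

# Defaults as stated in this section.
DEFAULT_MODELS = {"gemini": "gemini-2.5-flash", "claude": "claude-opus-4-8"}


def resolve(flag_provider: Optional[str], env: Mapping[str, str]) -> tuple[str, str]:
    # Assumed precedence: -p/--provider, then PSETTER_PROVIDER, then "gemini".
    provider = flag_provider or env.get("PSETTER_PROVIDER") or "gemini"
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"unknown provider: {provider}")
    # The model comes from PSETTER_CLAUDE_MODEL or PSETTER_GEMINI_MODEL if set.
    model = env.get(f"PSETTER_{provider.upper()}_MODEL") or DEFAULT_MODELS[provider]
    return provider, model
```

Since `pset config` persists its settings to .env, a value written there would reach this lookup through the environment, assuming .env is loaded before resolution.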


Commands

Every command except new operates on the current directory; target another with -C/--dir DIR. Generating commands take -p/--provider claude|gemini.

| Command | What it does |
| --- | --- |
| `pset new <dir> [--from FILE]` | Create a problem dir; make statement.md (or seed it from FILE). |
| `pset fill` | LLM-complete an under-specified statement.md: infer the missing IO format / constraints, marking each guess `<!-- inferred -->`. Backs up the original to statement_backups/ first; does not touch spec.json. Review the result, then run pset parse. |
| `pset parse` | statement.md → spec.json + samples/. Extracts samples and constraints verbatim; stops if samples/IO-format are missing, a sample was invented, or any input variable lacks a complete constraint (suggests pset fill). |
| `pset make validator [--no-verify]` | LLM-write the validator, compile, and verify it accepts every sample. |
| `pset make checker [--no-verify]` | Use a standard testlib checker (chosen during parse) or LLM-write a custom one; compile; verify on samples. |
| `pset make solutions [--no-verify]` | LLM-write reference + brute (one call); compile; verify both AC on every sample. |
| `pset make generators [--no-verify]` | LLM-write the test generator; compile; smoke-run it. |
| `pset verify <validator\|checker\|solutions\|generators\|tests\|all>` | Re-run verification only (no LLM) on the source/data on disk. |
| `pset generate tests` | Run the generator, validate every test (Gate B), and have the reference write each .ans. No LLM call. |
| `pset run tests` | Run reference + brute on all tests at the real time/memory limits; print the verdict matrix. |
| `pset all` | Do everything in order (parse → make … → generate tests → run tests), stopping at the first error. |
| `pset config <provider\|model\|show>` | Set the LLM provider/model for future commands (writes .env). |
| `pset status` | Table of what exists, whether it's verified, and last-built timestamps. |
| `pset show <thing>` | Print one of: spec, statement, samples, validator, checker, solutions, tests, matrix. |

make vs verify

  • make X: LLM-generate the source, then verify it. --no-verify writes the raw source only (marked unverified) so you can hand-edit it.
  • verify X: verify only, no LLM. It compiles the source on disk and runs the ground-truth sample checks against it.

So you can write or edit validator/validator.cpp by hand and then just run pset verify validator.

Dependencies are verification-aware: a prerequisite counts as met only if its verification passed, so a --no-verify artifact blocks everything downstream until you verify it.
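
The rule can be illustrated with a small sketch. The dependency graph and the manifest layout below are hypothetical, invented for the example; psetter's real manifest.json may be organised differently.

```python
# Hypothetical prerequisite graph; the real one may differ.
PREREQS = {
    "validator": ["spec"],
    "checker": ["spec"],
    "solutions": ["checker"],
    "generators": ["validator"],
    "tests": ["validator", "solutions", "generators"],
}


def blocked_by(step: str, manifest: dict) -> list[str]:
    # A prerequisite counts as met only if it is both built and verified,
    # so an artifact written with --no-verify keeps blocking until verified.
    unmet = []
    for prereq in PREREQS.get(step, []):
        entry = manifest.get(prereq, {})
        if not (entry.get("built") and entry.get("verified")):
            unmet.append(prereq)
    return unmet
```

With a manifest in which the validator is built but unverified, `blocked_by("generators", manifest)` returns `["validator"]`; after a successful `pset verify validator` flips the flag, it returns an empty list.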


Typical run

pset new problems/two-sum --from ~/statements/two-sum.md
cd problems/two-sum
pset all                     # or run the steps individually:

# pset parse
# pset make validator
# pset make checker
# pset make solutions
# pset make generators
# pset generate tests
# pset run tests
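
For scripting, the same sequence can be driven from Python. This is a minimal sketch that assumes `pset` is on PATH and that each step exits non-zero on a STOP, as described under the gates; the `run_pipeline` helper is hypothetical and simply mirrors what `pset all` is documented to do.

```python
import subprocess
from typing import Callable, Optional

# The individual steps, in the order listed above.
PIPELINE = [
    ["pset", "parse"],
    ["pset", "make", "validator"],
    ["pset", "make", "checker"],
    ["pset", "make", "solutions"],
    ["pset", "make", "generators"],
    ["pset", "generate", "tests"],
    ["pset", "run", "tests"],
]


def run_pipeline(
    run: Optional[Callable[[list[str]], int]] = None,
) -> Optional[list[str]]:
    # Execute each step once; return the first command that failed, or None.
    if run is None:
        run = lambda cmd: subprocess.run(cmd).returncode
    for cmd in PIPELINE:
        if run(cmd) != 0:
            return cmd  # stop at the first error, no retry
    return None
```

Passing a custom `run` callable makes the driver testable without invoking the real tool.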

Hand-edit instead of regenerating:

$EDITOR validator/validator.cpp
pset verify validator        # compile + sample checks, no LLM

Inspect anytime: pset status, pset show spec, pset show matrix.


How many tests?

generate tests produces num_tests generator tests (default 10) on top of the samples. num_tests is a field in spec.json: run pset show spec to see it, edit the file to change it, then re-run pset generate tests.
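
If you prefer to script the edit, a sketch along these lines would work. It assumes spec.json is a JSON object with a top-level `num_tests` key, as the description above suggests; the helper names are hypothetical.

```python
import json
from pathlib import Path


def with_num_tests(spec: dict, num_tests: int) -> dict:
    # Return a copy of the spec with num_tests replaced; other fields untouched.
    if not isinstance(num_tests, int) or num_tests < 0:
        raise ValueError("num_tests must be a non-negative integer")
    return {**spec, "num_tests": num_tests}


def set_num_tests(problem_dir: str, num_tests: int) -> None:
    # Read spec.json, update the field, and write it back.
    # Note: this re-serialises the file, so its formatting may change.
    path = Path(problem_dir) / "spec.json"
    spec = json.loads(path.read_text())
    path.write_text(json.dumps(with_num_tests(spec, num_tests), indent=2) + "\n")
```

After calling `set_num_tests(".", 25)`, run pset generate tests again to regenerate the test set.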

Reading the verdict matrix

reference is AC by construction — it generated the answer files — so that column is not signal. The brute column is what to read: it may diverge, TLE, or MLE on stronger vs. weaker tests. Nothing judges these; they're shown for your inspection.
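
To make "read the brute column" concrete, here is a small sketch over an in-memory matrix. The dictionary layout is an assumption for illustration only; it is not the format of tests/matrix.txt, which this README does not specify.

```python
def brute_findings(matrix: dict[str, dict[str, str]]) -> list[tuple[str, str]]:
    # Collect every test where the brute solution did not get AC.
    # The reference column is ignored because it is AC by construction.
    findings = []
    for test in sorted(matrix):
        verdict = matrix[test]["brute"]
        if verdict != "AC":
            findings.append((test, verdict))
    return findings
```

The function only lists the rows worth a look. In keeping with the tool's design it draws no conclusion: a TLE on a large test may be expected for a brute force, while a wrong answer on a small one deserves attention.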


Tests

uv run pytest -q                       # all network-free tests
uv run python tests/manual_llm_check.py   # LIVE provider check (needs .env keys)

tests/test_stages.py drives the whole pipeline offline with a fake provider (canned C++/JSON), exercising the gates, the generate/verify split, and parse's extract-or-fail. It compiles and runs real C++, so it needs g++-15 + testlib but no API keys.
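
The idea of a fake provider can be sketched as below. The `Provider` protocol, the `complete` method, and the keyword matching are assumptions made for this example; the interface used in tests/test_stages.py may look different.

```python
from typing import Protocol


class Provider(Protocol):
    # Hypothetical minimal interface: one prompt in, one completion out.
    def complete(self, prompt: str) -> str: ...


class FakeProvider:
    """Returns canned text instead of calling a network API."""

    def __init__(self, canned: dict[str, str]) -> None:
        self.canned = canned
        self.calls: list[str] = []  # recorded so tests can assert on usage

    def complete(self, prompt: str) -> str:
        self.calls.append(prompt)
        # Pick the first canned response whose key appears in the prompt.
        for key, response in self.canned.items():
            if key in prompt:
                return response
        raise KeyError(f"no canned response for prompt: {prompt!r}")
```

Because the fake returns real C++ or JSON text, everything downstream of the LLM call (compilation, sample checks, gates) still runs for real, which is why such tests need a compiler but no API keys.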


What it deliberately does NOT do

No retries or repair loops, no verdict enforcement, no wrong-solution discrimination, no subtasks, no correctness cross-check. The MVP proves that LLM outputs can be mechanically assembled into a valid, answer-complete package; the reasoning layer that handles why an artifact is wrong is the next version's job.
