Turn a competitive-programming problem statement into a valid, usable problem package — statement, validator, checker, reference + brute solutions, real test data with answers, and a verdict table — by asking an LLM for each artifact in order and assembling the outputs.
The tool does no iteration. It generates each artifact once, compiles it, checks it against the statement's samples (which it treats as ground truth), and assembles the package. It stops only at mechanical gates or a ground-truth failure — never to "make the LLM try again." Reasoning about why something failed is intentionally left to later versions.
A problem directory containing:

    <problem>/
      statement.md                 # the input
      spec.json                    # parsed: limits, checker, constraints, num_tests
      samples/01.in, 01.out, …     # samples extracted verbatim from the statement
      validator/validator.cpp      # + compiled binary
      checker/checker.cpp          # standard testlib checker or a custom one
      solutions/reference.cpp      # source of truth for answers
      solutions/brute.cpp          # slow, obviously-correct second opinion
      generator/gen.cpp            # testlib generator
      tests/01.in, 01.ans, …       # validated tests + reference-generated answers
      tests/matrix.txt             # last verdict matrix
      manifest.json                # what's built, when, and whether it's verified

- Gate A — compilation. Any generated C++ that doesn't compile stops the run and prints the compiler error.
- Sample gates — ground truth. The samples in the statement are authoritative. A validator that rejects a sample, a checker that won't accept a sample's official output, or a solution that's wrong on a sample stops the run.
- Gate B — test validity. Every generated test must pass the validator and be answerable by the reference; otherwise the run stops naming the offending test.
A stop prints STOP: … and exits non-zero. Fix the cause and re-run that step.
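
The gates share one shape: check something mechanical and, on failure, raise a stop that names the cause. The sketch below illustrates that shape for the sample gate. It is not pset's actual code: `Stop`, `gate_samples`, and `run_step` are hypothetical names, and writing the message to stderr is an assumption.

```python
import sys
from typing import Callable, Iterable, Tuple


class Stop(Exception):
    """Hypothetical: raised by any gate, carrying a one-line reason."""


def gate_samples(accepts: Callable[[str], bool],
                 samples: Iterable[Tuple[str, str]]) -> None:
    # Sample gate (sketch): samples are ground truth, so the first
    # rejected sample stops the run and is named in the message.
    for name, text in samples:
        if not accepts(text):
            raise Stop(f"validator rejected sample {name}")


def run_step(step: Callable[[], None]) -> int:
    # Map a Stop onto the documented behaviour: print "STOP: ..." and
    # return a non-zero exit code. (Using stderr is an assumption.)
    try:
        step()
    except Stop as exc:
        print(f"STOP: {exc}", file=sys.stderr)
        return 1
    return 0
```

A caller would hand the return value to `sys.exit`, so a failed gate ends the process with a non-zero status.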
- Compiler: `g++-15` (Homebrew GCC). Override with `PSETTER_CXX`.
- testlib: ships in `context/testlib/` (the header + the 19 standard checkers).
- Python: 3.14 via `uv`; deps `anthropic`, `google-genai`, `typer`, `pydantic`, `pytest`.
- API key: at least one of `ANTHROPIC_API_KEY` / `GEMINI_API_KEY`.

Then install and configure:

    uv sync                 # install deps + register the `pset` command
    cp .env.example .env    # then add your API key(s)

Verify the toolchain: `bash tests/smoke/run.sh`.

Run either way:

    uv run pset <cmd> ...                    # no activation needed
    source .venv/bin/activate && pset ...    # or activate once

Default provider is `gemini` (`gemini-2.5-flash`); `claude` is also supported
(default `claude-opus-4-8`). On a transient Gemini error the provider falls back
through `PSETTER_GEMINI_FALLBACKS` (default `gemini-2.5-flash-lite`); any provider
failure becomes a clean one-line `STOP:`.
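
The fallback behaviour can be sketched as a loop over a model chain. This is an illustration under stated assumptions, not the tool's implementation: `call_model`, `TransientError`, `Stop`, `model_chain`, and `generate` are hypothetical names, and treating `PSETTER_GEMINI_FALLBACKS` as a comma-separated list is an assumption about its format.

```python
import os
from typing import Callable, List, Mapping, Optional


class TransientError(Exception):
    """Hypothetical: a retryable provider error (e.g. overload)."""


class Stop(Exception):
    """Hypothetical: surfaces to the user as a one-line STOP."""


def model_chain(primary: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    # Assumption: the env var holds a comma-separated list of model names.
    env = os.environ if env is None else env
    raw = env.get("PSETTER_GEMINI_FALLBACKS", "gemini-2.5-flash-lite")
    fallbacks = [m.strip() for m in raw.split(",") if m.strip()]
    return [primary] + [m for m in fallbacks if m != primary]


def generate(prompt: str,
             call_model: Callable[[str, str], str],
             primary: str = "gemini-2.5-flash",
             env: Optional[Mapping[str, str]] = None) -> str:
    last: Optional[Exception] = None
    for model in model_chain(primary, env):
        try:
            return call_model(model, prompt)
        except TransientError as exc:
            last = exc  # transient: try the next model in the chain
        except Exception as exc:
            # Any other provider failure becomes a stop immediately.
            raise Stop(f"provider error on {model}: {exc}") from exc
    raise Stop(f"all models failed; last error: {last}")
```

Note that each model is called once; the chain is a fallback, not a retry loop, which keeps it consistent with the no-iteration design.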
Set the provider/model for future commands (persists to `.env`):

    pset config provider claude              # use Claude
    pset config model claude-haiku-4-5       # set the model for the current provider
    pset config model gemini-2.5-pro -p gemini
    pset config show                         # what will be used

Or override per run with `-p/--provider`, or set env vars directly
(`PSETTER_PROVIDER`, `PSETTER_CLAUDE_MODEL`, `PSETTER_GEMINI_MODEL`). After each
generating command, `pset` prints token usage and an estimated cost.

Every command except `new` operates on the current directory; target another
with `-C/--dir DIR`. Generating commands take `-p/--provider claude|gemini`.

| Command | What it does |
|---|---|
| `pset new <dir> [--from FILE]` | Create a problem dir; make `statement.md` (or seed it from `FILE`). |
| `pset fill` | LLM-complete an under-specified `statement.md`: infer the missing IO format / constraints, marking each guess `<!-- inferred -->`. Backs up the original to `statement_backups/` first; does not touch `spec.json`. Review the result, then run `pset parse`. |
| `pset parse` | `statement.md` → `spec.json` + `samples/`. Extracts samples and constraints verbatim; stops if samples/IO-format are missing, a sample was invented, or any input variable lacks a complete constraint (suggests `pset fill`). |
| `pset make validator [--no-verify]` | LLM-write the validator, compile, and verify it accepts every sample. |
| `pset make checker [--no-verify]` | Use a standard testlib checker (chosen during parse) or LLM-write a custom one; compile; verify on samples. |
| `pset make solutions [--no-verify]` | LLM-write reference + brute (one call); compile; verify both AC on every sample. |
| `pset make generators [--no-verify]` | LLM-write the test generator; compile; smoke-run it. |
| `pset verify <validator\|checker\|solutions\|generators\|tests\|all>` | Re-run verification only (no LLM) on the source/data on disk. |
| `pset generate tests` | Run the generator, validate every test (Gate B), and have the reference write each `.ans`. No LLM call. |
| `pset run tests` | Run reference + brute on all tests at the real time/memory limits; print the verdict matrix. |
| `pset all` | Do everything in order (parse → make … → generate tests → run tests), stopping at the first error. |
| `pset config <provider\|model\|show>` | Set the LLM provider/model for future commands (writes `.env`). |
| `pset status` | Table of what exists, whether it's verified, and last-built timestamps. |
| `pset show <thing>` | Print `spec` \| `statement` \| `samples` \| `validator` \| `checker` \| `solutions` \| `tests` \| `matrix`. |

`make X` = LLM-generate the source, then verify it. `--no-verify` writes the
raw source only (marked unverified) so you can hand-edit it. `verify X` =
verify only, no LLM: compile + the ground-truth sample checks on whatever
source is on disk. So you can write or edit `validator/validator.cpp` by hand and
just run `pset verify validator`.

Dependencies are verification-aware: a prerequisite counts as met only if its
verification passed, so a `--no-verify` artifact blocks everything downstream until
you verify it.
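
One way to picture the verification-aware rule is a prerequisite check that looks at a `verified` flag rather than at whether a file exists. The sketch below is illustrative only: `Artifact`, `PREREQS`, and `unmet` are hypothetical names, and the prerequisite map is a guess for the sake of the example, not the tool's real dependency graph or `manifest.json` schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Artifact:
    built: bool = False      # source exists on disk
    verified: bool = False   # compile + sample checks passed


# Illustrative prerequisite map only; the real graph may differ.
PREREQS: Dict[str, List[str]] = {
    "tests": ["validator", "solutions", "generators"],
}


def unmet(step: str, manifest: Dict[str, Artifact]) -> List[str]:
    # A prerequisite counts only if it was verified; merely existing
    # (e.g. written with --no-verify) is not enough.
    missing = []
    for name in PREREQS.get(step, []):
        if not manifest.get(name, Artifact()).verified:
            missing.append(name)
    return missing
```

With a validator that was built using `--no-verify`, `unmet("tests", manifest)` would list `validator` until verification succeeds and the flag flips.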

    pset new problems/two-sum --from ~/statements/two-sum.md
    cd problems/two-sum
    pset all                  # or run the steps individually:
    #   pset parse
    #   pset make validator
    #   pset make checker
    #   pset make solutions
    #   pset make generators
    #   pset generate tests
    #   pset run tests

Hand-edit instead of regenerating:

    $EDITOR validator/validator.cpp
    pset verify validator     # compile + sample checks, no LLM

Inspect anytime: `pset status`, `pset show spec`, `pset show matrix`.

`generate tests` produces `num_tests` generator tests (default 10) on top of
the samples. `num_tests` is a field in `spec.json`: run `pset show spec` to see it,
edit the file to change it, then rerun `pset generate tests`.
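
If you would rather script the change than open an editor, a small helper along these lines would work. It is a sketch, not part of pset: `set_num_tests` is a hypothetical name, and it assumes `num_tests` is a top-level key of the JSON object in `spec.json`.

```python
import json
from pathlib import Path


def set_num_tests(spec_path: str, n: int) -> dict:
    # Sketch: rewrite only num_tests and leave every other key untouched.
    # Assumes num_tests is a top-level key in spec.json.
    if n < 0:
        raise ValueError("num_tests must be non-negative")
    path = Path(spec_path)
    spec = json.loads(path.read_text(encoding="utf-8"))
    spec["num_tests"] = n
    path.write_text(json.dumps(spec, indent=2) + "\n", encoding="utf-8")
    return spec
```

After calling it, run `pset generate tests` again so the test set matches the new count.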

`reference` is AC by construction, since it generated the answer files, so that
column carries no signal. The `brute` column is the one to read: it may diverge, TLE,
or MLE on stronger vs. weaker tests. Nothing judges these verdicts; they are shown
for your inspection.
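
Reading the matrix amounts to filtering for tests where the brute solution is not AC. The sketch below works on an in-memory mapping, since the on-disk layout of `tests/matrix.txt` is not described here. `brute_findings` is a hypothetical helper, and the verdict labels (with `WA` standing in for "diverges") are assumptions.

```python
from typing import Dict


def brute_findings(matrix: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Return {test: brute verdict} for every test where brute is not AC.

    `matrix` maps a test name to {"reference": verdict, "brute": verdict}.
    The reference column is ignored: it is AC by construction.
    """
    return {
        test: row["brute"]
        for test, row in sorted(matrix.items())
        if row["brute"] != "AC"
    }


# Example with made-up verdicts, purely for illustration.
example = {
    "01": {"reference": "AC", "brute": "AC"},
    "07": {"reference": "AC", "brute": "TLE"},
    "09": {"reference": "AC", "brute": "WA"},
}
print(brute_findings(example))
```

A `TLE` or `MLE` on a large test is usually expected from a brute solution; a `WA` on a small test is the case worth investigating by hand, because either the reference or the brute is wrong.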

    uv run pytest -q                           # all network-free tests
    uv run python tests/manual_llm_check.py    # LIVE provider check (needs .env keys)

`tests/test_stages.py` drives the whole pipeline offline with a fake provider
(canned C++/JSON), exercising the gates, the generate/verify split, and parse's
extract-or-fail. It compiles and runs real C++, so it needs `g++-15` + testlib but
no API keys.
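
A fake provider of this kind can be very small. The following is a sketch of the idea, not the class used in `tests/test_stages.py`: `FakeProvider`, its `generate` method, and the artifact keys are all hypothetical.

```python
from typing import Dict, List


class FakeProvider:
    """Hypothetical stand-in for an LLM provider: returns canned text."""

    def __init__(self, canned: Dict[str, str]) -> None:
        self.canned = dict(canned)
        self.calls: List[str] = []  # records which artifacts were requested

    def generate(self, artifact: str, prompt: str) -> str:
        # No network: look the answer up instead of calling a model.
        self.calls.append(artifact)
        if artifact not in self.canned:
            raise KeyError(f"no canned output for {artifact!r}")
        return self.canned[artifact]
```

Because the canned output is fixed, a test can feed in deliberately broken C++ and assert that the compilation gate stops the run, all without an API key.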
No retries or repair loops, no verdict enforcement, no wrong-solution discrimination, no subtasks, no correctness cross-check. The MVP proves that LLM outputs can be mechanically assembled into a valid, answer-complete package; the reasoning layer that handles why an artifact is wrong is the next version's job.