An LLM-assisted platform for government forms: upload a PDF, extract a structured spec, deliver a form experience (static or conversational), and generate a filled PDF back out.
The core idea: separate what to collect from how to present it. A
`DataCollectionSpec` describes the fields and their semantics; a `FormSpec`
describes how they are presented. Swap the presentation (static page,
conversational chat, review layout) without touching the extraction pipeline,
and swap the extraction strategy without touching delivery. Every LLM-powered
step is a pluggable variant that can be selected per user at runtime.
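A minimal sketch of the split, with illustrative (not actual) type and field shapes:

```ts
// Illustrative shapes only; the real specs live in
// src/services/data-collection and src/services/forms.

// What to collect: fields and their semantics, presentation-agnostic.
interface DataCollectionSpec {
  id: string;
  fields: Array<{
    name: string;                 // stable key, e.g. "applicantName"
    type: "text" | "date" | "boolean" | "choice";
    label: string;                // human-readable prompt
    required: boolean;
    sensitive?: boolean;          // e.g. an SSN: affects handling, not layout
  }>;
}

// How to present it: deliveries reference fields by name only.
type FormSpec =
  | { kind: "static"; pages: Array<{ title: string; fieldNames: string[] }> }
  | { kind: "conversational"; fieldOrder: string[] }
  | { kind: "review"; fieldNames: string[] };
```

Because a delivery only references fields by name, adding a presentation is a new `FormSpec` variant rather than a change to the extraction pipeline.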
Final project deployment for LLM Class 2026 Winter Cohort.
- Slide deck — https://ec2-34-197-222-16.compute-1.amazonaws.com/presentation
- Catalog — https://ec2-34-197-222-16.compute-1.amazonaws.com/catalog (architecture, decisions, experiments, personas, stories, design system)
- Application — https://ec2-34-197-222-16.compute-1.amazonaws.com/ (browse branch deployments)
- "Production deployment" — https://ec2-34-197-222-16.compute-1.amazonaws.com/main/ (main branch)
Each active branch is deployed at /<branch>/ alongside main.
Full write-ups live in the catalog. Headlines:
- **Hybrid-v1 Pareto-dominates prompt-only extraction.** `sonnet-hybrid-v1` (one short instruction, one inline exemplar, temperature=0) wins four of five metrics outright — precision 99.2%, recall 72.6%, sensitivity 51.1% — and ties on type accuracy, at the same cost as baseline Sonnet. It is now the production default. The prompt shape that topped the Assignment 10 tool-calling leaderboard on Mistral 8B reproduces on Claude Sonnet 4 for a completely different task.
- **Tool-use is the structural precision/sensitivity lever.** `tool-use-sonnet` forces typed tool calls instead of free JSON: sensitivity accuracy jumps 27% → 79% (+51pp) and precision reaches 96.3%. Recall is step-limited at 20 rounds, so it shines on short-to-moderate forms. (A minimal sketch of the typed-tool shape follows this list.)
- **Nova Pro marks the non-Claude capability boundary for extraction.** `nova-pro` scored 97% on the homework's 10-field tool-calling task but extracts at 0.6% recall here — it summarizes sections instead of enumerating fields. Prompt engineering does not recover this. Model selection dominates prompt engineering once the task is outside the model's capability range.
- **Model size is not the dominant lever for shaping.** In the shaping model comparison, Opus/Sonnet/Haiku all cluster around 67-73% command-kind precision. Three intents are at ceiling across all three models; two fail across all three. Prompt disambiguation (e.g. `renamePage` vs `renameGroup`) is the bottleneck, not parameter count.
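Why typed tool calls move precision: the model can only emit a field through a schema-validated call, so malformed or free-form output is structurally impossible. A minimal sketch with the Anthropic SDK (tool name, schema, and model id are illustrative, not the production variant's exact definitions):

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// One typed tool per extracted field: the model cannot report a field
// except as a schema-validated tool call. (The production variant loops
// for up to 20 rounds, which is what caps recall on long forms.)
const recordField = {
  name: "record_field",
  description: "Record one field found in the form.",
  input_schema: {
    type: "object" as const,
    properties: {
      name: { type: "string", description: "Stable field identifier" },
      type: { type: "string", enum: ["text", "date", "boolean", "choice"] },
      sensitive: { type: "boolean", description: "PII such as an SSN" },
    },
    required: ["name", "type", "sensitive"],
  },
};

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514", // illustrative model id
  max_tokens: 1024,
  tools: [recordField],
  tool_choice: { type: "tool", name: "record_field" },
  messages: [
    { role: "user", content: "Enumerate every fillable field in this form: ..." },
  ],
});

// Fields arrive as tool_use blocks with validated inputs, not free JSON.
const fields = response.content.flatMap((block) =>
  block.type === "tool_use" ? [block.input] : [],
);
```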
Suite indexes: PDF extraction · Shaping · Authoring pipeline · Roadmap
Prerequisites: Bun 1.x or later.
```sh
bun install
bun run dev    # dev server at http://localhost:3000
bun test       # tests
bun run check  # lint + type check + tests (run before push)
```

See CLAUDE.md for the full command reference, session workflow, deployment architecture, and contribution conventions.
```
src/
├── entrypoints/          # Hono servers and CLI
│   ├── app/              # Forms platform web app (routes, middleware, public)
│   ├── dashboard/        # Deployment dashboard (homepage service)
│   ├── webhook/          # GitHub webhook listener
│   ├── notify/           # Notification delivery
│   └── cli/              # CLI commands
├── services/             # Domain services (one public entry per service)
│   ├── data-collection/  # Core model: what to collect
│   ├── forms/            # Resolution, delivery, sessions, shaping, filling
│   ├── form-documents/   # PDF extraction, field mapping, filling
│   ├── extraction/       # Extraction variant registry
│   ├── evaluation/       # Evaluation harness and LLM-as-judge kinds
│   ├── projects/         # Project service and form-project git repo
│   ├── auth/             # GitHub OAuth, sessions
│   ├── deployment/       # Deploy orchestration
│   └── ...
├── design-system/        # flex-* components (server-rendered JSX)
└── shared/               # Pure utilities
catalog/                  # Versioned catalog content (markdown/JSON)
├── personas/             # Who the system serves
├── stories/              # GitHub Issue copies (user stories)
├── architecture/         # System docs
├── decisions/            # ADRs
└── experiments/          # LLM experiment suites + runs
projects/                 # Form project directories (specs, assets)
infrastructure/           # Pulumi (EC2) + NixOS (server config)
test/                     # Bun test suite
```
Dependencies flow one way: shared → services/design-system → entrypoints.
Each service exposes its public API through `src/services/<name>/index.ts`,
and the rule is enforced by `test/architecture/dependency-rule.test.ts`. See the
architecture principles.
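A sketch of how a dependency-rule test of this shape can work (the rule table and string matching here are illustrative; the real logic lives in `test/architecture/dependency-rule.test.ts`):

```ts
import { describe, expect, test } from "bun:test";
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Lower layers must not import higher ones. Illustrative rule table;
// the real test may resolve imports rather than match path fragments.
const forbiddenImports: Record<string, string[]> = {
  "src/shared": ["services/", "design-system/", "entrypoints/"],
  "src/services": ["entrypoints/"],
  "src/design-system": ["entrypoints/"],
};

const sourceFiles = (dir: string): string[] =>
  readdirSync(dir, { recursive: true })
    .map(String)
    .filter((f) => f.endsWith(".ts") || f.endsWith(".tsx"))
    .map((f) => join(dir, f));

describe("dependency rule", () => {
  for (const [layer, banned] of Object.entries(forbiddenImports)) {
    test(`${layer} does not import upward`, () => {
      for (const file of sourceFiles(layer)) {
        // Collect every import specifier in the file.
        const source = readFileSync(file, "utf8");
        const imports = source.match(/from\s+"([^"]+)"/g) ?? [];
        for (const fragment of banned) {
          expect(imports.some((spec) => spec.includes(fragment))).toBe(false);
        }
      }
    });
  }
});
```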
- Runtime: Bun
- Framework: Hono (server-rendered JSX, no client runtime)
- Language: TypeScript
- Testing: Bun test
- Linting: Biome + Stylelint (design-token enforcement)
- Persistence: Git-based — specs and catalog content live in the repo
- LLMs: Claude (Opus/Sonnet/Haiku) via Anthropic SDK and AWS Bedrock; Amazon Nova Pro for cross-provider comparison
- Deployment: Pulumi + NixOS on EC2, branch-per-deployment via GitHub webhook (sketched below)
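In miniature, the webhook half of that flow might look like this in Hono; the route path, payload fields used, and deploy helpers are assumptions rather than the actual `src/entrypoints/webhook` code:

```ts
import { Hono } from "hono";

// Hypothetical stand-ins for the real deploy orchestration
// (src/services/deployment).
async function deployBranch(branch: string): Promise<void> {}
async function teardownDeployment(branch: string): Promise<void> {}

const app = new Hono();

// GitHub sends one push event per branch; each push (re)deploys that
// branch under /<branch>/ alongside main. Signature check omitted here.
app.post("/webhook", async (c) => {
  if (c.req.header("x-github-event") !== "push") return c.text("ignored");
  const payload = await c.req.json<{ ref: string; deleted: boolean }>();
  const branch = payload.ref.replace("refs/heads/", "");
  if (payload.deleted) await teardownDeployment(branch);
  else await deployBranch(branch);
  return c.text(`queued ${branch}`);
});

export default app;
```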
Session commands (installed in .claude/commands/):
```
/create-story               # New user story (GitHub issue + notes dir)
/start-story <description>  # Initialize a session (worktree, context)
/finish-story               # Run checks, code review, open PR
/review-story <PR>          # Review another session's PR
```

Infrastructure changes branch from main as `infra/YYYY-MM-DD-description`;
feature work branches as `story-N/name` or `docs/<description>`. See
CLAUDE.md for commit conventions, the stacked branch workflow,
and deployment details.
Class project for LLM Class 2026 Winter Cohort.
Development follows vertical slicing: each user story delivers a complete,
demoable capability through all layers. Tests are required for new
functionality; `bun run check` must pass before push.