An LLM-assisted platform for government forms: upload a PDF, extract a structured spec, deliver a form experience (static or conversational), and generate a filled PDF back out.
The core idea: separate what to collect from how to present it. A
`DataCollectionSpec` describes the fields and their semantics; a `FormSpec`
describes how they are presented. Swap the presentation (static page,
conversational chat, review layout) without touching the extraction pipeline,
and swap the extraction strategy without touching delivery. Every LLM-powered
step is a pluggable variant that can be selected per user at runtime.
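A minimal sketch of the split, with illustrative (not actual) type and field shapes:

```ts
// Illustrative shapes only; the real specs live in
// src/services/data-collection and src/services/forms.

// What to collect: fields and their semantics, presentation-agnostic.
interface DataCollectionSpec {
  id: string;
  fields: Array<{
    name: string;                 // stable key, e.g. "applicantName"
    type: "text" | "date" | "boolean" | "choice";
    label: string;                // human-readable prompt
    required: boolean;
    sensitive?: boolean;          // e.g. an SSN: affects handling, not layout
  }>;
}

// How to present it: deliveries reference fields by name only.
type FormSpec =
  | { kind: "static"; pages: Array<{ title: string; fieldNames: string[] }> }
  | { kind: "conversational"; fieldOrder: string[] }
  | { kind: "review"; fieldNames: string[] };
```

Because a delivery only references fields by name, adding a presentation is a new `FormSpec` variant rather than a change to the extraction pipeline.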
Final project deployment for LLM Class 2026 Winter Cohort.
- Slide deck — https://ec2-34-197-222-16.compute-1.amazonaws.com/presentation
- Catalog — https://ec2-34-197-222-16.compute-1.amazonaws.com/catalog (architecture, decisions, experiments, personas, stories, design system)
- Application — https://ec2-34-197-222-16.compute-1.amazonaws.com/ (browse branch deployments)
- "Production deployment" — https://ec2-34-197-222-16.compute-1.amazonaws.com/main/ (main branch)
Each active branch is deployed at /<branch>/ alongside main.
Full write-ups live in the catalog. Headlines:
- **Hybrid-v1 Pareto-dominates prompt-only extraction.** `sonnet-hybrid-v1` (one short instruction, one inline exemplar, temperature=0) wins four of five metrics outright — precision 99.2%, recall 72.6%, sensitivity 51.1% — and ties on type accuracy, at the same cost as baseline Sonnet. It is now the production default. The prompt shape that topped the Assignment 10 tool-calling leaderboard on Mistral 8B reproduces on Claude Sonnet 4 for a completely different task.
- **Tool-use is the structural precision/sensitivity lever.** `tool-use-sonnet` forces typed tool calls instead of free JSON: sensitivity accuracy jumps 27% → 79% (+51pp) and precision reaches 96.3%. Recall is step-limited at 20 rounds, so it shines on short-to-moderate forms. (A minimal sketch of the typed-tool shape follows this list.)
- **Nova Pro marks the non-Claude capability boundary for extraction.** `nova-pro` scored 97% on the homework's 10-field tool-calling task but extracts at 0.6% recall here — it summarizes sections instead of enumerating fields. Prompt engineering does not recover this. Model selection dominates prompt engineering once the task is outside the model's capability range.
- **Model size is not the dominant lever for shaping.** In the shaping model comparison, Opus/Sonnet/Haiku all cluster around 67-73% command-kind precision. Three intents are at ceiling across all three models; two fail across all three. Prompt disambiguation (e.g. `renamePage` vs `renameGroup`) is the bottleneck, not parameter count.
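Why typed tool calls move precision: the model can only emit a field through a schema-validated call, so malformed or free-form output is structurally impossible. A minimal sketch with the Anthropic SDK (tool name, schema, and model id are illustrative, not the production variant's exact definitions):

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// One typed tool per extracted field: the model cannot report a field
// except as a schema-validated tool call. (The production variant loops
// for up to 20 rounds, which is what caps recall on long forms.)
const recordField = {
  name: "record_field",
  description: "Record one field found in the form.",
  input_schema: {
    type: "object" as const,
    properties: {
      name: { type: "string", description: "Stable field identifier" },
      type: { type: "string", enum: ["text", "date", "boolean", "choice"] },
      sensitive: { type: "boolean", description: "PII such as an SSN" },
    },
    required: ["name", "type", "sensitive"],
  },
};

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514", // illustrative model id
  max_tokens: 1024,
  tools: [recordField],
  tool_choice: { type: "tool", name: "record_field" },
  messages: [
    { role: "user", content: "Enumerate every fillable field in this form: ..." },
  ],
});

// Fields arrive as tool_use blocks with validated inputs, not free JSON.
const fields = response.content.flatMap((block) =>
  block.type === "tool_use" ? [block.input] : [],
);
```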
Suite indexes: PDF extraction · Shaping · Authoring pipeline · Roadmap
Prerequisites: Bun 1.x or later.
```sh
bun install
bun run dev    # dev server at http://localhost:3000
bun test       # tests
bun run check  # lint + type check + tests (run before push)
```

See CLAUDE.md for the full command reference, session workflow, deployment architecture, and contribution conventions.
```
src/
├── entrypoints/          # Hono servers and CLI
│   ├── app/              # Forms platform web app (routes, middleware, public)
│   ├── dashboard/        # Deployment dashboard (homepage service)
│   ├── webhook/          # GitHub webhook listener
│   ├── notify/           # Notification delivery
│   └── cli/              # CLI commands
├── services/             # Domain services (one public entry per service)
│   ├── data-collection/  # Core model: what to collect
│   ├── forms/            # Resolution, delivery, sessions, shaping, filling
│   ├── form-documents/   # PDF extraction, field mapping, filling
│   ├── extraction/       # Extraction variant registry
│   ├── evaluation/       # Evaluation harness and LLM-as-judge kinds
│   ├── projects/         # Project service and form-project git repo
│   ├── auth/             # GitHub OAuth, sessions
│   ├── deployment/       # Deploy orchestration
│   └── ...
├── design-system/        # flex-* components (server-rendered JSX)
└── shared/               # Pure utilities
catalog/                  # Versioned catalog content (markdown/JSON)
├── personas/             # Who the system serves
├── stories/              # GitHub Issue copies (user stories)
├── architecture/         # System docs
├── decisions/            # ADRs
└── experiments/          # LLM experiment suites + runs
projects/                 # Form project directories (specs, assets)
infrastructure/           # Pulumi (EC2) + NixOS (server config)
test/                     # Bun test suite
```
Dependencies flow one way: shared → services/design-system → entrypoints.
Each service exposes its public API through `src/services/<name>/index.ts`,
and the rule is enforced by `test/architecture/dependency-rule.test.ts`. See the
architecture principles.
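A sketch of how a dependency-rule test of this shape can work (the rule table and string matching here are illustrative; the real logic lives in `test/architecture/dependency-rule.test.ts`):

```ts
import { describe, expect, test } from "bun:test";
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Lower layers must not import higher ones. Illustrative rule table;
// the real test may resolve imports rather than match path fragments.
const forbiddenImports: Record<string, string[]> = {
  "src/shared": ["services/", "design-system/", "entrypoints/"],
  "src/services": ["entrypoints/"],
  "src/design-system": ["entrypoints/"],
};

const sourceFiles = (dir: string): string[] =>
  readdirSync(dir, { recursive: true })
    .map(String)
    .filter((f) => f.endsWith(".ts") || f.endsWith(".tsx"))
    .map((f) => join(dir, f));

describe("dependency rule", () => {
  for (const [layer, banned] of Object.entries(forbiddenImports)) {
    test(`${layer} does not import upward`, () => {
      for (const file of sourceFiles(layer)) {
        // Collect every import specifier in the file.
        const source = readFileSync(file, "utf8");
        const imports = source.match(/from\s+"([^"]+)"/g) ?? [];
        for (const fragment of banned) {
          expect(imports.some((spec) => spec.includes(fragment))).toBe(false);
        }
      }
    });
  }
});
```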
- Runtime: Bun
- Framework: Hono (server-rendered JSX, no client runtime)
- Language: TypeScript
- Testing: Bun test
- Linting: Biome + Stylelint (design-token enforcement)
- Persistence: Git-based — specs and catalog content live in the repo
- LLMs: Claude (Opus/Sonnet/Haiku) via Anthropic SDK and AWS Bedrock; Amazon Nova Pro for cross-provider comparison
- Deployment: Pulumi + NixOS on EC2, branch-per-deployment via GitHub webhook (sketched below)
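In miniature, the webhook half of that flow might look like this in Hono; the route path, payload fields used, and deploy helpers are assumptions rather than the actual `src/entrypoints/webhook` code:

```ts
import { Hono } from "hono";

// Hypothetical stand-ins for the real deploy orchestration
// (src/services/deployment).
async function deployBranch(branch: string): Promise<void> {}
async function teardownDeployment(branch: string): Promise<void> {}

const app = new Hono();

// GitHub sends one push event per branch; each push (re)deploys that
// branch under /<branch>/ alongside main. Signature check omitted here.
app.post("/webhook", async (c) => {
  if (c.req.header("x-github-event") !== "push") return c.text("ignored");
  const payload = await c.req.json<{ ref: string; deleted: boolean }>();
  const branch = payload.ref.replace("refs/heads/", "");
  if (payload.deleted) await teardownDeployment(branch);
  else await deployBranch(branch);
  return c.text(`queued ${branch}`);
});

export default app;
```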
Session commands (installed in .claude/commands/):
```
/create-story               # New user story (GitHub issue + notes dir)
/start-story <description>  # Initialize a session (worktree, context)
/finish-story               # Run checks, code review, open PR
/review-story <PR>          # Review another session's PR
```

Infrastructure changes branch from main as `infra/YYYY-MM-DD-description`;
feature work branches as `story-N/name` or `docs/<description>`. See
CLAUDE.md for commit conventions, the stacked branch workflow,
and deployment details.
Class project for LLM Class 2026 Winter Cohort.
Development follows vertical slicing: each user story delivers a complete,
demoable capability through all layers. Tests are required for new
functionality; `bun run check` must pass before push.