Golden Demo: executable reference + behavioral drift detection (complementary to Architecture Guard) #3078

jasstt · 2026-06-21T13:33:27Z

jasstt
Jun 21, 2026

Problem

In Spec-Driven Development, specs guide code generation, but there's
little automatic assurance that the implemented code behaves exactly
as the spec intended. Existing extensions (Architecture Guard,
Security Review, CDD enforcement tools) catch architectural rule
violations, security anti-patterns, and documentation drift — all
static checks on text/code structure. None of them execute the code
and compare it against a runnable reference.

This isn't a hypothetical gap — see #1686, where a user describes
exactly this pain: no tool currently distinguishes "issue in the spec"
from "issue in the implementation," and asks whether something like
SRE's "golden signals" could exist for spec-to-implementation fidelity.

Proposed solution — v1 scope (intentionally narrow)

A "Golden Demo" extension that, for pure functions with explicit
input/output examples in spec.md (no I/O, no side effects, no
randomness/time dependency — out of scope for v1):

- after_plan hook: generates a minimal, deterministic reference
  implementation + test vectors from the spec's acceptance criteria.
- after_implement hook: runs both the golden reference and the
  real implementation against the same test vectors, and produces a
  pass/fail drift report (no LLM-as-judge; deterministic execution
  only, to avoid noisy false positives).

Scope is deliberately limited to pure functions for v1 to keep
generation reliable and avoid false-positive drift on legitimate
design changes. Side-effecting code, multi-service features, and
non-deterministic behavior are explicitly out of scope until the
core mechanism is validated.
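
To make the after_implement comparison concrete, here is a minimal sketch of what a deterministic drift report could look like. It is not the extension's actual code: the names Vector, Result, drift_report, golden_discount and impl_discount are hypothetical, and the discount rule ("round down to whole cents") is an invented example spec. The three statuses are one possible way to separate "the reference disagrees with the spec" from "the implementation drifted".

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class Vector:
    """One input/output example taken from the spec (hypothetical shape)."""
    name: str
    args: tuple
    expected: Any


@dataclass(frozen=True)
class Result:
    name: str
    status: str  # "pass", "drift", or "bad-reference"
    detail: str


def _call(fn: Callable[..., Any], args: tuple) -> Any:
    """Return fn(*args), or the exception instance if the call raised."""
    try:
        return fn(*args)
    except Exception as exc:  # record the failure instead of aborting the run
        return exc


def drift_report(
    golden: Callable[..., Any],
    impl: Callable[..., Any],
    vectors: Sequence[Vector],
) -> list[Result]:
    """Run both functions on every vector and classify each outcome."""
    results = []
    for v in vectors:
        g, i = _call(golden, v.args), _call(impl, v.args)
        if g != v.expected:
            status = "bad-reference"  # the golden reference itself misses the spec
        elif i != v.expected:
            status = "drift"          # reference agrees with the spec, impl does not
        else:
            status = "pass"
        detail = f"expected={v.expected!r} golden={g!r} impl={i!r}"
        results.append(Result(v.name, status, detail))
    return results


# Invented example spec: "apply a percentage discount, rounding down to whole cents".
def golden_discount(price_cents: int, percent: int) -> int:
    return price_cents * (100 - percent) // 100


def impl_discount(price_cents: int, percent: int) -> int:
    # Uses float rounding rather than rounding down, so it can disagree.
    return round(price_cents * (1 - percent / 100))


VECTORS = [
    Vector("ten percent", (1000, 10), 900),
    Vector("half off, odd price", (995, 50), 497),
]

report = drift_report(golden_discount, impl_discount, VECTORS)
for r in report:
    print(f"{r.status:14} {r.name}: {r.detail}")
```

Everything here is plain function calls and equality checks, so two runs over the same vectors give the same report, which is the property the "no LLM-as-judge" constraint is after.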

How this differs from Architecture Guard / CDD

Those tools validate structure and documentation against rules.
This extension validates behavior against an executable reference —
complementary, not competing. Happy to coordinate scope with both
maintainers if there's interest in overlap.

@raccioly tagging you since this overlaps conceptually with your CDD (Canonical-Driven Development) work — curious if you'd see value in this as a separate extension or as an addition to yours.

Open questions for the community

- Is .spec-kit/golden/ a reasonable place to store reference
  artifacts, or should this live under .specify/?
- Any prior art / failed attempts at this within the ecosystem I
  should know about before building?

mnriem · 2026-06-22T12:49:46Z

mnriem
Jun 22, 2026
Maintainer

From the core SDD process perspective, this was delivered by #3001, so I am closing this out.

1 reply

jasstt Jun 22, 2026
Author

Thanks for pointing to #3001 — Converge is a great addition to the workflow. However, the gap I'm describing is complementary: converge checks whether the implementation matches the spec textually, while Golden Demo executes the code against test vectors derived from the spec's acceptance criteria. One is a structural review, the other is a behavioral oracle. Happy to clarify the distinction further if useful.

mnriem · 2026-06-22T21:13:28Z

mnriem
Jun 22, 2026
Maintainer

You've drawn a real distinction — converge does structural/textual fidelity, and what you're describing is behavioral execution. I shouldn't have implied #3001 fully subsumes it on its own, so let me sharpen the reasoning.

The behavioral oracle you're after is already reachable today through the supported TDD path + converge, not converge alone:

- With TDD opted in, tasks.md generates test tasks derived directly from the spec's acceptance criteria, and implement executes those tests against the real code (the red/green loop). That's spec-derived behavioral assertions, actually executed.
- converge then reasons over spec.md / plan.md / tasks.md and the codebase to close any remaining gap between intent and implementation.

Together that triad is a behavioral oracle for anyone who opts into testing. Stripped down, the genuinely new pieces Golden Demo adds on top are: (1) an auto-generated second reference implementation as a differential oracle, and (2) synthesizing and running these vectors even when the user hasn't opted into TDD.

That second piece is exactly why this shouldn't be a core process step: SDD treats tests as optional by design (tasks.md only emits test tasks on explicit/TDD request; implement runs the TDD path conditionally). Making behavioral test generation unconditional would impose a test-first requirement the core deliberately avoids. The differential-oracle idea (piece 1) is a nice enhancement — but an opt-in one.

The encouraging part: your design already fits the extension model perfectly — it's defined as after_plan / after_implement hooks, which is precisely the seam the extension system provides. I'd genuinely encourage building this as a community extension (see extensions/EXTENSION-DEVELOPMENT-GUIDE.md and the community catalog); that keeps it opt-in for users who want differential drift detection without changing the default flow. Happy to help review the hook design if you go that route.

0 replies

jasstt · 2026-06-22T21:33:09Z

jasstt
Jun 22, 2026
Author

That framing really helps — I'd been thinking about TDD and converge as
parallel tracks rather than a triad, so the mental model correction is
genuinely useful.

You're right that the two things Golden Demo would actually add on top are
the differential oracle and the no-TDD path. The second one is probably
the more interesting case to me — there's a real audience of people who
build with spec-kit but never opt into testing, and right now they have no
behavioral feedback loop at all.

Makes complete sense to keep that out of core though. Extension hooks are
exactly where I'd want this to live anyway.

I'll dig into the EXTENSION-DEVELOPMENT-GUIDE and start putting something
together. If you're still open to a look at the hook interface once I have
a rough draft, I'll post back here — that offer means a lot at this stage.

6 replies

jasstt Jun 22, 2026
Author

Oh that's a nice point — so the opt-in becomes almost invisible for the
user once the extension is installed, the hook just fires when the model
plays along. That actually removes the main friction I was worried about
(people forgetting to run it manually).

Good to know. That changes how I'd think about the UX of the thing.

jasstt Jun 22, 2026
Author

The draft is up at github.com/jasstt/spec-kit-golden-demo. Current state is exactly what we discussed: after_plan extracts acceptance criteria from spec.md / plan.md into .spec-kit/golden/test-vectors.md, after_implement does a dry-run report against those vectors. Both hooks are optional: true with explicit prompts — no unconditional execution.
A few things I'd specifically like your eyes on before I build the execution logic:

- Hook interface: is after_plan / after_implement the right seam, or should after_tasks also be in the chain (e.g. to refine vectors based on generated tasks)?
- Artifact location: I'm writing to .spec-kit/golden/test-vectors.md. Is .spec-kit/ the right namespace for extension artifacts, or is there a preferred path?
- requires block: I dropped requires.commands since speckit.plan and speckit.implement are core commands the extension just hooks into, not dependencies. I confirmed this against the guide; let me know if I'm reading that wrong.

Happy to adjust anything before moving to production logic. Thanks again.
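
For readers following along, the after_plan extraction step described above could be sketched roughly as below. This is an illustration under stated assumptions, not the code in the linked repository: it assumes acceptance scenarios in spec.md are written as numbered "**Given** ..., **When** ..., **Then** ..." lines, and the helper names (extract_vectors, render_vectors, write_vectors) and the table layout of test-vectors.md are hypothetical.

```python
import re
from pathlib import Path

# Assumed scenario format (one per line), for example:
#   1. **Given** a price of 1000, **When** a 10% discount is applied, **Then** the result is 900
SCENARIO = re.compile(
    r"^\s*\d+\.\s*\*\*Given\*\*\s*(?P<given>.+?),\s*"
    r"\*\*When\*\*\s*(?P<when>.+?),\s*"
    r"\*\*Then\*\*\s*(?P<then>.+?)\s*$",
    re.MULTILINE,
)


def extract_vectors(spec_text: str) -> list[dict[str, str]]:
    """Collect every Given/When/Then scenario found in the spec text."""
    return [m.groupdict() for m in SCENARIO.finditer(spec_text)]


def render_vectors(vectors: list[dict[str, str]]) -> str:
    """Render the scenarios as a Markdown table (hypothetical layout)."""
    def cell(text: str) -> str:
        return text.replace("|", "\\|")  # keep literal pipes from breaking the table

    lines = [
        "# Golden test vectors",
        "",
        "| # | Given | When | Then |",
        "|---|---|---|---|",
    ]
    for n, v in enumerate(vectors, start=1):
        lines.append(f"| {n} | {cell(v['given'])} | {cell(v['when'])} | {cell(v['then'])} |")
    return "\n".join(lines) + "\n"


def write_vectors(spec_path: Path, out_path: Path) -> int:
    """Read the spec, write the vectors file, and return how many were found."""
    vectors = extract_vectors(spec_path.read_text(encoding="utf-8"))
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(render_vectors(vectors), encoding="utf-8")
    return len(vectors)
```

A hook script could then call write_vectors(Path("spec.md"), Path(".spec-kit/golden/test-vectors.md")), where the spec path is a placeholder for wherever the feature's spec.md actually lives. Scenarios that do not match the assumed pattern are silently skipped, so a real implementation would probably want to report them.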

mnriem Jun 23, 2026
Maintainer

How you structure your extension, including where the artifacts are stored, is up to you. Note that most extensions put their content in a subdirectory of .specify, but it is not a hard-and-fast rule. Also note that since you are now hosting it yourself, the only thing we (Spec Kit maintainers) require is the correct metadata at extension submission issue time.

jasstt Jun 23, 2026
Author

That makes perfect sense. I appreciate the heads-up on the .specify convention — I've just updated the extension to output the artifacts to .specify/golden-demo/ instead to align with the ecosystem standard.

Since the metadata is ready and the hook interface is validated, I'll go ahead and open an Extension Submission Issue to get it added to the catalog. Thanks for your guidance throughout this.

mnriem Jun 23, 2026
Maintainer

Looking forward to it :)
