feat(prd): recursive decomposition stress test (#421)#446
Conversation
#421) Add `cf prd stress-test` command that recursively decomposes PRD goals using tri-state classification (atomic/composite/ambiguous) to surface requirements gaps and generate a technical specification. - New core module: `codeframe/core/prd_stress_test.py` - CLI command with --max-depth, --output, --interactive flags - Interactive mode resolves ambiguities and creates new PRD version - 23 tests covering models, core functions, renderers, and CLI
WalkthroughAdds a PRD stress-test feature: new headless core module for recursive LLM-guided decomposition with ambiguity surfacing and spec rendering, a CLI command (declared twice in the diff) to run it (including interactive resolution), plus tests and a small node-count helper. Changes
Sequence Diagram(s)sequenceDiagram
participant User as User/CLI
participant App as CLI App
participant Stress as stress_test_prd()
participant LLM as LLM Provider
participant Render as Renderer
User->>App: cf prd stress-test [--interactive]
App->>Stress: stress_test_prd(prd_content, provider, max_depth)
Stress->>LLM: extract_goals(prd_content)
LLM-->>Stress: [goals...]
loop per goal
Stress->>Stress: recursive_decompose(goal, depth=0)
loop decompose until leaf or max_depth
Stress->>LLM: classify_and_decompose(node, lineage, prd_context)
alt atomic
LLM-->>Stress: ATOMIC (leaf)
else composite
LLM-->>Stress: COMPOSITE + children
Stress->>Stress: recurse into children
else ambiguous
LLM-->>Stress: AMBIGUOUS + questions
Stress->>Stress: collect ambiguity
end
end
end
Stress->>Render: render_tech_spec(tree, ambiguities)
Render-->>Stress: markdown spec
Stress->>Render: render_ambiguity_report(ambiguities)
Render-->>Stress: ambiguity report
Stress-->>App: StressTestResult (spec + report + tree)
App->>User: display spec + ambiguity report
alt --interactive
User->>App: provide answers
App->>LLM: resolve_ambiguities_into_prd(answers)
LLM-->>App: updated PRD content
App->>App: emit PRD_UPDATED event
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
|
test comment |
|
Deleting test comment |
PR #446 Review - feat(prd): recursive decomposition stress testOverall this is a clean addition that fits the Think/Build/Prove/Ship pipeline well. The core module is properly headless, the tri-state classification extends task_tree.py nicely, and test coverage is solid. A few issues worth addressing. Bug: Patch target is wrong in CLI tests The CLI tests patch codeframe.adapters.llm.anthropic.AnthropicProvider, but the CLI does from codeframe.adapters.llm.anthropic import AnthropicProvider, which binds the name in codeframe.cli.app namespace. The patch will not intercept it there. The correct target is @patch("codeframe.cli.app.AnthropicProvider"). Affects test_stress_test_with_prd and test_stress_test_output_flag. These may pass today only due to missing API keys, but do not reliably validate mock substitution. Missing type hints on provider parameters All core functions accept an untyped provider arg. adapters/llm/base.py already defines a LLMProvider Protocol - using it makes the contract explicit and catches adapter mismatches earlier: from codeframe.adapters.llm.base import LLMProvider; def extract_goals(prd_content: str, provider: LLMProvider) -> list[str]: ... Silent failure on goal extraction If extract_goals returns [] (JSON parse error or empty LLM response), stress_test_prd returns an empty tree silently. The CLI prints Summary: 0 goals, 0 nodes, 0 ambiguities - looks like success but nothing was analyzed. A warning when goals is empty would prevent silent no-ops. Tech spec silently truncated at 2000 chars in CLI output console.print(Panel(result.tech_spec_markdown[:2000], ...)) silently cuts off the spec for any reasonably sized PRD. If --output is the recommended path, indicate when truncation occurs so users know to use it. Minor: misleading comment in test_max_depth_forces_atomic The comment says Should NOT have called the provider then immediately hedges it away. Either add the actual assertion (mock_provider.complete.assert_not_called()) or remove the comment - the test intent is clear from the assertion alone. Minor: missing tests for resolve_ambiguities_into_prd and resolved answer rendering resolve_ambiguities_into_prd has no unit test, and render_ambiguity_report with a resolved answer (the Resolved: branch) is also uncovered. The interactive PRD update flow depends on both. No concern: AnthropicProvider imported in CLI not core - Correct pattern, keeps core provider-agnostic per architecture rules. No concern: mutable ambiguities accumulator parameter - Fine as written, always called with an explicit [] and the accumulator intent is clear from all call sites. |
|
Please disregard this comment - posted in error. |
PR #446 Review — feat(prd): recursive decomposition stress testOverall this is a clean addition that fits the Think/Build/Prove/Ship pipeline well. The core module is properly headless, the tri-state classification extends task_tree.py nicely, and test coverage is solid. A few issues worth addressing: Bug: Patch target is wrong in CLI testsThe CLI tests patch Affects Missing type hints on
|
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@codeframe/cli/app.py`:
- Around line 1605-1609: The CLI option max_depth currently allows zero or
negative values which breaks recursive_decompose's expected invariant; add
validation at the option layer to reject values < 1. Implement a small validator
(e.g., validate_max_depth) and wire it into the typer.Option for max_depth (or
use a click/typer range type) so that passing <= 0 raises a Typer/Click
BadParameter with a clear message; keep recursive_decompose unchanged but ensure
its contract (depth >= max_depth) is respected by the CLI.
- Around line 1706-1735: After successfully creating a new PRD version with
prd_module.create_new_version (new_record), reload the updated PRD (e.g., fetch
record by new_record.id) and re-run the same generation/analysis step that
produced result so that result.tech_spec_markdown, result.tree and
result.ambiguities reflect the resolved content before printing/writing the
panel and summary; also handle the falsy create_new_version case by logging a
warning instead of silently continuing when interactive is true.
In `@codeframe/core/prd_stress_test.py`:
- Around line 253-257: The ambiguity map currently uses a.source_node_title as
the key which can collide for multiple nodes with the same title; change amb_map
to use a stable per-node unique key (e.g., a.node_id, a.source_node_path, or
a.lineage) so each ambiguity gets a distinct number and update usages that pass
amb_map into _render_spec_node (and any other call sites around lines 272-274)
to look up by that unique key instead of source_node_title; ensure the Ambiguity
objects (ambiguities) expose the chosen unique identifier and that the map
creation: amb_map = {<unique_key>: i + 1 for i, a in enumerate(ambiguities)} is
consistent with the lookup inside _render_spec_node.
- Around line 326-340: The function returns raw model output (response.content)
with no validation; add a guard after the provider.complete call that trims
whitespace and validates the returned string (e.g., non-empty, above a
configurable minimum length, and not obviously truncated—reject outputs that end
mid-sentence or with an ellipsis/partial token). If validation fails, either
retry the provider.complete call up to N times or raise/return a clear error so
downstream interactive mode doesn't accept a broken PRD; reference the local
variables response and provider.complete and the function
resolve_ambiguities_into_prd (or wherever this return occurs) when adding the
checks and retry/error logic.
- Around line 119-125: extract_goals() currently swallows JSON parse/type errors
and returns an empty list (and classify_and_decompose() similarly returns ATOMIC
on parse failure); instead either raise a clear exception with context (e.g.,
ValueError/ParseError including the raw response.content and the underlying
exception) so the run fails loudly, or return a clearly labeled synthesized
ambiguity token (e.g., ["UNPARSEABLE_MODEL_OUTPUT"] or an Ambiguity object) so
downstream logic can treat it as ambiguous; update extract_goals() and
classify_and_decompose() to implement one of these behaviors and include the raw
response content in the error/ambiguity payload for debugging.
In `@tests/core/test_prd_stress_test.py`:
- Around line 393-406: The tests patch AnthropicProvider but fail in CI because
the CLI checks ANTHROPIC_API_KEY before instantiation; update
test_stress_test_with_prd (and the other similar test at 409-426) to set a dummy
ANTHROPIC_API_KEY in the environment before calling runner.invoke — e.g., use
monkeypatch.setenv("ANTHROPIC_API_KEY","test") or
os.environ["ANTHROPIC_API_KEY"]="test" at the start of the test so the CLI
proceeds to use the patched AnthropicProvider class and the assert on
result.exit_code remains valid.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 289b1941-6534-41a7-99e6-9bcdab32b407
📒 Files selected for processing (3)
codeframe/cli/app.pycodeframe/core/prd_stress_test.pytests/core/test_prd_stress_test.py
| try: | ||
| goals = json.loads(response.content) | ||
| if isinstance(goals, list): | ||
| return [str(g) for g in goals] | ||
| except (json.JSONDecodeError, TypeError): | ||
| pass | ||
| return [] |
There was a problem hiding this comment.
Don't turn provider parse failures into "no issues found".
Invalid model output currently degrades to [] in extract_goals() and ATOMIC in classify_and_decompose(). That can make a broken run look like a clean PRD with zero ambiguities. Please surface a hard failure or synthesize an explicit ambiguity instead of silently succeeding.
Also applies to: 158-167
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@codeframe/core/prd_stress_test.py` around lines 119 - 125, extract_goals()
currently swallows JSON parse/type errors and returns an empty list (and
classify_and_decompose() similarly returns ATOMIC on parse failure); instead
either raise a clear exception with context (e.g., ValueError/ParseError
including the raw response.content and the underlying exception) so the run
fails loudly, or return a clearly labeled synthesized ambiguity token (e.g.,
["UNPARSEABLE_MODEL_OUTPUT"] or an Ambiguity object) so downstream logic can
treat it as ambiguous; update extract_goals() and classify_and_decompose() to
implement one of these behaviors and include the raw response content in the
error/ambiguity payload for debugging.
|
"""PRD stress test via recursive decomposition. Recursively decomposes PRD goals using a tri-state classification This module is headless — no FastAPI or HTTP dependencies. import json from codeframe.adapters.llm.base import Purpose ---------------------------------------------------------------------------Data Models---------------------------------------------------------------------------class Classification(str, Enum): @DataClass @DataClass @DataClass ---------------------------------------------------------------------------Prompt Constants---------------------------------------------------------------------------GOAL_EXTRACTION_SYSTEM = ( CLASSIFY_AND_DECOMPOSE_SYSTEM = """
Return a JSON object with these fields: Return ONLY valid JSON. No markdown wrapping.""" AMBIGUITY_RESOLUTION_SYSTEM = ( ---------------------------------------------------------------------------Core Functions---------------------------------------------------------------------------def extract_goals(prd_content: str, provider) -> list[str]: def classify_and_decompose( def recursive_decompose( ---------------------------------------------------------------------------Renderers---------------------------------------------------------------------------def render_tech_spec( def _render_spec_node( def render_ambiguity_report(ambiguities: list[Ambiguity]) -> str: def resolve_ambiguities_into_prd( ---------------------------------------------------------------------------Top-level Orchestrator---------------------------------------------------------------------------def stress_test_prd( |
|
Please disregard - posted in error. |
PR #446 Review — feat(prd): recursive decomposition stress testOverall this is a clean addition that fits the Think/Build/Prove/Ship pipeline well. The core module is properly headless, the tri-state classification extends task_tree.py nicely, and test coverage is solid. A few issues worth addressing: Bug: Patch target is wrong in CLI testsThe CLI tests patch @patch("codeframe.cli.app.AnthropicProvider")Affects Missing type hints on
|
- Validate --max-depth range (1-10) at CLI option layer - Re-run stress test after interactive PRD update for fresh results - Log warnings on LLM parse failures instead of silent degradation - Use ambiguity ID (not title) as unique key in tech spec cross-refs - Validate rewritten PRD isn't truncated before accepting - Set ANTHROPIC_API_KEY in CLI test env for proper mock isolation
|
Code Review: PR 446 - cf prd stress-test. Overall this is a well-structured addition to the THINK phase. Full review posted separately due to formatting constraints. |
|
CRITICAL - Unbounded LLM fan-out: stress_test_prd can issue O(children^depth) API calls. With max_depth=3 and branching factor of 3 that is 40+ serial calls with no progress feedback and no concurrency cap. Recommend adding a max_nodes CLI flag, printing per-goal progress, and documenting expected call volume in help text. IMPORTANT (1) - _count_nodes function is in app.py but operates on DecompositionNode - it belongs in prd_stress_test.py alongside the models it works with, and currently has no test coverage. IMPORTANT (2) - Tech spec panel silently truncates at 2000 chars via Panel(result.tech_spec_markdown[:2000]) with no indication to the user. Add a visible truncation notice or use a pager. IMPORTANT (3) - Interactive mode: multiple questions per ambiguity are comma-joined into a single prompt. Either prompt per question or clarify in help text that one consolidated answer per group is expected. IMPORTANT (4) - resolve_ambiguities_into_prd uses len(updated) < len(prd_content) // 2 as a guard which will silently discard valid rewrites of short PRDs. The floor should be relative to resolved content, not a blanket 50 percent rule. NICE TO HAVE: (a) detect empty tree from extract_goals failure and warn the user explicitly; (b) DecompositionNode.children and .lineage lack field(default_factory=list); (c) resolved_answer rendering branch and resolve_ambiguities_into_prd flow have no test coverage; (d) workspace vs tmp_path fixture usage in CLI tests needs a comment. STRENGTHS: core module is headless with provider injected in CLI layer per CLAUDE.md conventions; tri-state classification model and prompts are well-designed and auditable; 23 tests with mocked LLM and CLI integration coverage are solid; re-running after interactive resolution is a good UX detail. |
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (3)
codeframe/core/prd_stress_test.py (2)
123-130:⚠️ Potential issue | 🟠 MajorDon’t treat unparseable model output as a successful run.
On parse failure, returning
[](Line 130) or defaulting toATOMIC(Line 167) can hide provider failures and incorrectly report a PRD as well-specified.💡 Suggested fix
@@ - except (json.JSONDecodeError, TypeError) as exc: - logger.warning("Failed to parse goal extraction response: %s", exc) - return [] + except (json.JSONDecodeError, TypeError) as exc: + logger.warning("Failed to parse goal extraction response: %s", exc) + raise ValueError("Unparseable goal extraction response") from exc @@ - except (json.JSONDecodeError, TypeError) as exc: - logger.warning("Failed to parse classification for '%s': %s", title, exc) - return Classification.ATOMIC, [], None, "Low" + except (json.JSONDecodeError, TypeError) as exc: + logger.warning("Failed to parse classification for '%s': %s", title, exc) + ambiguity = Ambiguity( + id=str(uuid.uuid4()), + label="UNPARSEABLE_MODEL_OUTPUT", + source_node_title=title, + questions=["Could not parse model decomposition output. Can you restate this requirement?"], + recommendation="Clarify this requirement with explicit decomposition details.", + ) + return Classification.AMBIGUOUS, [], ambiguity, "Unknown"Also applies to: 163-167
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/core/prd_stress_test.py` around lines 123 - 130, The current goal-extraction code treats JSON parse failures as successful runs by returning an empty list (variable goals) and the code later defaulting to ATOMIC; instead, surface parse failures so the caller can mark the provider run as failed. Change the handler for json.JSONDecodeError/TypeError (and the non-list branch) to raise a specific exception or return None (and update call sites) rather than returning [], and stop silently defaulting to ATOMIC; update the code paths that inspect the goal extraction result to treat None/exception as a provider failure. Ensure references to response.content, the goals variable, and any code that currently defaults to ATOMIC are updated to handle the new failure signal.
347-348:⚠️ Potential issue | 🟠 MajorGuard
response.contentbefore calling.strip().If provider returns
Nonecontent, Line 347 raisesAttributeErrorand aborts interactive flow.💡 Suggested fix
- updated = response.content.strip() + updated = (response.content or "").strip()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/core/prd_stress_test.py` around lines 347 - 348, The code calls response.content.strip() without guarding for None; update the logic in the block that sets updated to first validate response.content (the variable response from the provider) before calling .strip(). Replace the direct call with a safe pattern (e.g., treat None as empty string or check isinstance/None and early-handle) so that updated is computed from a non-None string and the subsequent length check against prd_content is safe; adjust any downstream handling to account for the empty/None case.codeframe/cli/app.py (1)
1717-1729:⚠️ Potential issue | 🟠 MajorFail fast when interactive PRD version creation fails.
In interactive mode, Line 1728 only warns and continues with exit code 0. That can mislead users into thinking their resolved ambiguities were persisted.
💡 Suggested fix
- else: - console.print("[yellow]Warning:[/yellow] Failed to create new PRD version.") + else: + console.print("[red]Error:[/red] Failed to create new PRD version.") + raise typer.Exit(1)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/cli/app.py` around lines 1717 - 1729, The interactive flow currently treats a failed PRD version creation (when new_record is falsy) as a warning, which continues execution with exit code 0; change this to fail fast by aborting with a non-zero exit: in the branch where new_record is false (the block that now calls console.print("[yellow]Warning...")), raise or call a non-zero exit (e.g., raise RuntimeError or sys.exit(1)) with a clear message so the CLI returns failure to the caller; locate the logic around new_record, console.print, emit_for_workspace and stress_test_prd to implement this behavior so unresolved persistence failures in interactive mode do not proceed silently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@codeframe/core/prd_stress_test.py`:
- Around line 176-177: The code assumes children is a list of dicts when cls ==
Classification.COMPOSITE and then calls .get on each child in
recursive_decompose; to fix, validate and normalize the payload before
recursion: in the block that sets children (and in the recursive_decompose
function), check that data.get("children") is a list and replace non-list or
None with an empty list, then filter the list to only include dict items (e.g.,
[c for c in children if isinstance(c, dict)]); if any invalid items are detected
optionally log a warning or ignore them so subsequent calls to child.get(...)
are safe. Ensure you update the assignment and any recursion points that rely on
children (the children variable and the recursive_decompose function) to use the
validated/filtered list.
- Around line 114-404: Import the core event API from core.events and instrument
major state transitions: call the event emitter at start/end of stress_test_prd,
after extract_goals returns (in extract_goals), when classify_and_decompose
detects an ambiguity (in classify_and_decompose), when recursive_decompose hits
max depth and when it finishes building a node (in recursive_decompose), and
when resolve_ambiguities_into_prd produces an updated PRD (in
resolve_ambiguities_into_prd); include meaningful event names (e.g.,
"prd.stress_test.started"/".completed", "prd.goals.extracted",
"prd.ambiguity.detected", "prd.node.leaf_forced", "prd.node.decomposed",
"prd.prd.updated") and attach contextual payloads (prd_title or snippet,
goal/title, node id/title, ambiguity id/details, complexity_hint, and whether
rewrite was truncated) so auditors can correlate events — use the event API
exported by core/events.py (e.g., import emit_event or the module’s emitter) and
ensure calls are non-blocking and tolerate emitter failures (log and continue).
---
Duplicate comments:
In `@codeframe/cli/app.py`:
- Around line 1717-1729: The interactive flow currently treats a failed PRD
version creation (when new_record is falsy) as a warning, which continues
execution with exit code 0; change this to fail fast by aborting with a non-zero
exit: in the branch where new_record is false (the block that now calls
console.print("[yellow]Warning...")), raise or call a non-zero exit (e.g., raise
RuntimeError or sys.exit(1)) with a clear message so the CLI returns failure to
the caller; locate the logic around new_record, console.print,
emit_for_workspace and stress_test_prd to implement this behavior so unresolved
persistence failures in interactive mode do not proceed silently.
In `@codeframe/core/prd_stress_test.py`:
- Around line 123-130: The current goal-extraction code treats JSON parse
failures as successful runs by returning an empty list (variable goals) and the
code later defaulting to ATOMIC; instead, surface parse failures so the caller
can mark the provider run as failed. Change the handler for
json.JSONDecodeError/TypeError (and the non-list branch) to raise a specific
exception or return None (and update call sites) rather than returning [], and
stop silently defaulting to ATOMIC; update the code paths that inspect the goal
extraction result to treat None/exception as a provider failure. Ensure
references to response.content, the goals variable, and any code that currently
defaults to ATOMIC are updated to handle the new failure signal.
- Around line 347-348: The code calls response.content.strip() without guarding
for None; update the logic in the block that sets updated to first validate
response.content (the variable response from the provider) before calling
.strip(). Replace the direct call with a safe pattern (e.g., treat None as empty
string or check isinstance/None and early-handle) so that updated is computed
from a non-None string and the subsequent length check against prd_content is
safe; adjust any downstream handling to account for the empty/None case.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ea99d2b3-ce26-4784-be1d-df1020ec85bd
📒 Files selected for processing (3)
codeframe/cli/app.pycodeframe/core/prd_stress_test.pytests/core/test_prd_stress_test.py
| def extract_goals(prd_content: str, provider) -> list[str]: | ||
| """Extract high-level deliverable goals from a PRD.""" | ||
| response = provider.complete( | ||
| messages=[{"role": "user", "content": prd_content}], | ||
| purpose=Purpose.PLANNING, | ||
| system=GOAL_EXTRACTION_SYSTEM, | ||
| max_tokens=1024, | ||
| temperature=0.0, | ||
| ) | ||
| try: | ||
| goals = json.loads(response.content) | ||
| if isinstance(goals, list): | ||
| return [str(g) for g in goals] | ||
| logger.warning("Goal extraction returned non-list: %s", type(goals).__name__) | ||
| except (json.JSONDecodeError, TypeError) as exc: | ||
| logger.warning("Failed to parse goal extraction response: %s", exc) | ||
| return [] | ||
|
|
||
|
|
||
| def classify_and_decompose( | ||
| title: str, | ||
| description: str, | ||
| lineage: list[str], | ||
| prd_content: str, | ||
| depth: int, | ||
| provider, | ||
| ) -> tuple[Classification, list[dict], Optional[Ambiguity], str]: | ||
| """Classify a goal node and optionally decompose or flag ambiguity.""" | ||
| lineage_ctx = "" | ||
| if lineage: | ||
| lineage_ctx = "\n\nAncestor context:\n" + "\n".join( | ||
| f"- {desc}" for desc in lineage | ||
| ) | ||
|
|
||
| user_msg = ( | ||
| f"Goal: {title}\n" | ||
| f"Description: {description}\n" | ||
| f"Depth: {depth}{lineage_ctx}\n\n" | ||
| f"PRD:\n{prd_content}" | ||
| ) | ||
|
|
||
| response = provider.complete( | ||
| messages=[{"role": "user", "content": user_msg}], | ||
| purpose=Purpose.PLANNING, | ||
| system=CLASSIFY_AND_DECOMPOSE_SYSTEM, | ||
| max_tokens=2048, | ||
| temperature=0.0, | ||
| ) | ||
|
|
||
| try: | ||
| data = json.loads(response.content) | ||
| except (json.JSONDecodeError, TypeError) as exc: | ||
| logger.warning("Failed to parse classification for '%s': %s", title, exc) | ||
| return Classification.ATOMIC, [], None, "Low" | ||
|
|
||
| raw_cls = data.get("classification", "atomic").lower() | ||
| try: | ||
| cls = Classification(raw_cls) | ||
| except ValueError: | ||
| cls = Classification.ATOMIC | ||
|
|
||
| complexity = data.get("complexity_hint", "Low") | ||
| children = data.get("children", []) if cls == Classification.COMPOSITE else [] | ||
|
|
||
| ambiguity = None | ||
| if cls == Classification.AMBIGUOUS: | ||
| ambiguity = Ambiguity( | ||
| id=str(uuid.uuid4()), | ||
| label=data.get("ambiguity_label", "UNSPECIFIED"), | ||
| source_node_title=title, | ||
| questions=data.get("questions", []), | ||
| recommendation=data.get("recommendation", ""), | ||
| ) | ||
|
|
||
| return cls, children, ambiguity, complexity | ||
|
|
||
|
|
||
| def recursive_decompose( | ||
| title: str, | ||
| description: str, | ||
| lineage: list[str], | ||
| prd_content: str, | ||
| depth: int, | ||
| max_depth: int, | ||
| ambiguities: list[Ambiguity], | ||
| provider, | ||
| ) -> DecompositionNode: | ||
| """Recursively decompose a goal, collecting ambiguities along the way.""" | ||
| # Force leaf at max depth | ||
| if depth >= max_depth: | ||
| return DecompositionNode( | ||
| id=str(uuid.uuid4()), | ||
| title=title, | ||
| description=description, | ||
| classification=Classification.ATOMIC, | ||
| children=[], | ||
| lineage=lineage, | ||
| depth=depth, | ||
| complexity_hint="Unknown", | ||
| ) | ||
|
|
||
| cls, child_dicts, ambiguity, complexity = classify_and_decompose( | ||
| title, description, lineage, prd_content, depth, provider, | ||
| ) | ||
|
|
||
| if ambiguity: | ||
| ambiguities.append(ambiguity) | ||
|
|
||
| children = [] | ||
| if cls == Classification.COMPOSITE: | ||
| child_lineage = lineage + [title] | ||
| for child_dict in child_dicts: | ||
| child_node = recursive_decompose( | ||
| child_dict.get("title", "Untitled"), | ||
| child_dict.get("description", ""), | ||
| child_lineage, | ||
| prd_content, | ||
| depth + 1, | ||
| max_depth, | ||
| ambiguities, | ||
| provider, | ||
| ) | ||
| children.append(child_node) | ||
|
|
||
| return DecompositionNode( | ||
| id=str(uuid.uuid4()), | ||
| title=title, | ||
| description=description, | ||
| classification=cls, | ||
| children=children, | ||
| lineage=lineage, | ||
| depth=depth, | ||
| complexity_hint=complexity, | ||
| ambiguity_id=ambiguity.id if ambiguity else None, | ||
| ) | ||
|
|
||
|
|
||
| # --------------------------------------------------------------------------- | ||
| # Renderers | ||
| # --------------------------------------------------------------------------- | ||
|
|
||
|
|
||
| def render_tech_spec( | ||
| tree: list[DecompositionNode], ambiguities: list[Ambiguity] | ||
| ) -> str: | ||
| """Render the decomposition tree as a markdown technical specification.""" | ||
| amb_map = {a.id: i + 1 for i, a in enumerate(ambiguities)} | ||
| lines = ["# Technical Specification\n"] | ||
|
|
||
| for node in tree: | ||
| _render_spec_node(node, lines, amb_map, header_level=2) | ||
|
|
||
| return "\n".join(lines) | ||
|
|
||
|
|
||
| def _render_spec_node( | ||
| node: DecompositionNode, | ||
| lines: list[str], | ||
| amb_map: dict[str, int], | ||
| header_level: int, | ||
| ) -> None: | ||
| """Recursively render a node into the tech spec.""" | ||
| prefix = "#" * min(header_level, 6) | ||
| lines.append(f"{prefix} {node.title}") | ||
|
|
||
| if node.classification == Classification.AMBIGUOUS: | ||
| amb_num = amb_map.get(node.ambiguity_id, "?") if node.ambiguity_id else "?" | ||
| lines.append(f"**[NEEDS CLARIFICATION — see ambiguity #{amb_num}]**\n") | ||
| elif node.classification == Classification.ATOMIC: | ||
| lines.append(f"- {node.description}") | ||
| if node.complexity_hint: | ||
| lines.append(f"- Estimated complexity: {node.complexity_hint}") | ||
| lines.append("") | ||
| else: | ||
| lines.append("") | ||
|
|
||
| for child in node.children: | ||
| _render_spec_node(child, lines, amb_map, header_level + 1) | ||
|
|
||
|
|
||
| def render_ambiguity_report(ambiguities: list[Ambiguity]) -> str: | ||
| """Render the ambiguity list as a human-readable report.""" | ||
| if not ambiguities: | ||
| return "No ambiguities found — PRD is well-specified for decomposition." | ||
|
|
||
| lines = [ | ||
| f"PRD Stress Test — {len(ambiguities)} ambiguities found:\n", | ||
| ] | ||
|
|
||
| for i, amb in enumerate(ambiguities, 1): | ||
| lines.append( | ||
| f"{i}. {amb.label} (from decomposing \"{amb.source_node_title}\")" | ||
| ) | ||
| lines.append(" The PRD doesn't specify:") | ||
| for q in amb.questions: | ||
| lines.append(f" - {q}") | ||
| lines.append(f" → Recommendation: {amb.recommendation}") | ||
| if amb.resolved_answer: | ||
| lines.append(f" ✓ Resolved: {amb.resolved_answer}") | ||
| lines.append("") | ||
|
|
||
| return "\n".join(lines) | ||
|
|
||
|
|
||
| def resolve_ambiguities_into_prd( | ||
| prd_content: str, | ||
| ambiguities: list[Ambiguity], | ||
| provider, | ||
| ) -> str: | ||
| """Use LLM to update PRD content with resolved ambiguity answers.""" | ||
| resolved = [a for a in ambiguities if a.resolved_answer] | ||
| if not resolved: | ||
| return prd_content | ||
|
|
||
| resolution_text = "\n".join( | ||
| f"- {a.label}: {', '.join(a.questions)} → Answer: {a.resolved_answer}" | ||
| for a in resolved | ||
| ) | ||
|
|
||
| response = provider.complete( | ||
| messages=[{ | ||
| "role": "user", | ||
| "content": ( | ||
| f"Original PRD:\n{prd_content}\n\n" | ||
| f"Resolved ambiguities:\n{resolution_text}\n\n" | ||
| "Update the PRD to incorporate these answers." | ||
| ), | ||
| }], | ||
| purpose=Purpose.PLANNING, | ||
| system=AMBIGUITY_RESOLUTION_SYSTEM, | ||
| max_tokens=8192, | ||
| temperature=0.0, | ||
| ) | ||
| updated = response.content.strip() | ||
| if not updated or len(updated) < len(prd_content) // 2: | ||
| logger.warning( | ||
| "PRD rewrite looks truncated (%d chars vs original %d), returning original", | ||
| len(updated), len(prd_content), | ||
| ) | ||
| return prd_content | ||
| return updated | ||
|
|
||
|
|
||
| # --------------------------------------------------------------------------- | ||
| # Top-level Orchestrator | ||
| # --------------------------------------------------------------------------- | ||
|
|
||
|
|
||
| def stress_test_prd( | ||
| prd_content: str, provider, max_depth: int = 3 | ||
| ) -> StressTestResult: | ||
| """Run the full PRD stress test: extract goals → recursive decompose → render.""" | ||
| goals = extract_goals(prd_content, provider) | ||
|
|
||
| tree: list[DecompositionNode] = [] | ||
| ambiguities: list[Ambiguity] = [] | ||
|
|
||
| for goal in goals: | ||
| node = recursive_decompose( | ||
| title=goal, | ||
| description=goal, | ||
| lineage=[], | ||
| prd_content=prd_content, | ||
| depth=0, | ||
| max_depth=max_depth, | ||
| ambiguities=ambiguities, | ||
| provider=provider, | ||
| ) | ||
| tree.append(node) | ||
|
|
||
| tech_spec = render_tech_spec(tree, ambiguities) | ||
| amb_report = render_ambiguity_report(ambiguities) | ||
|
|
||
| # Extract title from PRD (first heading or first line) | ||
| prd_title = "Untitled" | ||
| for line in prd_content.splitlines(): | ||
| stripped = line.strip() | ||
| if stripped.startswith("# "): | ||
| prd_title = stripped[2:].strip() | ||
| break | ||
| if stripped: | ||
| prd_title = stripped[:80] | ||
| break | ||
|
|
||
| return StressTestResult( | ||
| prd_title=prd_title, | ||
| tree=tree, | ||
| ambiguities=ambiguities, | ||
| tech_spec_markdown=tech_spec, | ||
| ambiguity_report=amb_report, | ||
| ) |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Add core event emission for stress-test state transitions.
This module performs stateful transitions (goal extraction, ambiguity detection/resolution, decomposition completion) but emits no core events, which weakens auditability/observability.
As per coding guidelines, "All core modules must emit events for state transitions via core/events.py for audit and observability".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@codeframe/core/prd_stress_test.py` around lines 114 - 404, Import the core
event API from core.events and instrument major state transitions: call the
event emitter at start/end of stress_test_prd, after extract_goals returns (in
extract_goals), when classify_and_decompose detects an ambiguity (in
classify_and_decompose), when recursive_decompose hits max depth and when it
finishes building a node (in recursive_decompose), and when
resolve_ambiguities_into_prd produces an updated PRD (in
resolve_ambiguities_into_prd); include meaningful event names (e.g.,
"prd.stress_test.started"/".completed", "prd.goals.extracted",
"prd.ambiguity.detected", "prd.node.leaf_forced", "prd.node.decomposed",
"prd.prd.updated") and attach contextual payloads (prd_title or snippet,
goal/title, node id/title, ambiguity id/details, complexity_hint, and whether
rewrite was truncated) so auditors can correlate events — use the event API
exported by core/events.py (e.g., import emit_event or the module’s emitter) and
ensure calls are non-blocking and tolerate emitter failures (log and continue).
Guard against malformed LLM output in classify_and_decompose by filtering children to only dicts with title or description keys.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
codeframe/core/prd_stress_test.py (3)
123-130:⚠️ Potential issue | 🟠 MajorFail loud on unparseable model output instead of silently “succeeding.”
Line 123 and Line 165 currently degrade parse failures to empty goals /
ATOMIC, which can produce false-clean stress-test results.Suggested fix
def extract_goals(prd_content: str, provider) -> list[str]: @@ - try: - goals = json.loads(response.content) + try: + goals = json.loads(response.content) if isinstance(goals, list): return [str(g) for g in goals] - logger.warning("Goal extraction returned non-list: %s", type(goals).__name__) + raise ValueError(f"Goal extraction returned {type(goals).__name__}, expected list") except (json.JSONDecodeError, TypeError) as exc: - logger.warning("Failed to parse goal extraction response: %s", exc) - return [] + raise ValueError(f"Failed to parse goal extraction response: {exc}") from exc @@ - try: - data = json.loads(response.content) - except (json.JSONDecodeError, TypeError) as exc: - logger.warning("Failed to parse classification for '%s': %s", title, exc) - return Classification.ATOMIC, [], None, "Low" + try: + data = json.loads(response.content) + except (json.JSONDecodeError, TypeError) as exc: + raise ValueError(f"Failed to parse classification for '{title}': {exc}") from excAlso applies to: 165-167
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/core/prd_stress_test.py` around lines 123 - 130, The code currently swallows parse failures and non-list outputs by logging a warning and returning an empty list (or falling back to "ATOMIC") which hides real failures; instead, when json.loads(response.content) raises JSONDecodeError/TypeError or when the decoded value is not a list, raise an explicit exception (e.g., ValueError or RuntimeError) that includes the raw response.content and the parse error/type info, so callers fail loudly; update both the block around the json.loads(response.content)/logger.warning non-list check and the similar fallback at lines 165-167 to raise with context rather than returning empty/ATOMIC.
114-122:⚠️ Potential issue | 🟠 MajorAdd core event emission for stress-test state transitions.
This module performs multiple state transitions (start/completion, goal extraction outcomes, ambiguity detection, recursion cutoffs, PRD rewrite outcomes) but emits no
core/events.pyevents.As per coding guidelines, "All core modules must emit events for state transitions via
core/events.pyfor audit and observability".Also applies to: 155-161, 196-253, 323-359, 367-409
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/core/prd_stress_test.py` around lines 114 - 122, This module currently performs many state transitions but doesn't emit observability events; import the core/events emitter (e.g. from core.events import emit_event) and add well-named event emissions around each state change: emit a "prd_stress_test.started" event at the beginning of the top-level stress-test runner and "prd_stress_test.completed" on exit (include duration/status), emit "prd_goals.extracted" or "prd_goals.failed" in extract_goals with metadata (prd identifier, goals count or error), emit "prd_ambiguity.detected" where ambiguity logic runs, emit "prd_recursion.cutoff" when recursion limits trigger, and emit "prd_rewrite.result" for rewrite outcomes; include relevant context (prd id, timestamps, error/message) and ensure these calls are added in the functions handling those transitions (e.g., extract_goals and the ambiguity/recursion/rewrite handlers referenced in the diff ranges).
352-353:⚠️ Potential issue | 🟠 MajorHandle nullable provider content before
.strip().Line 352 assumes
response.contentis always a string;Nonewill raiseAttributeError.Suggested fix
- updated = response.content.strip() + updated = (response.content or "").strip()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@codeframe/core/prd_stress_test.py` around lines 352 - 353, The code assumes response.content is a non-null string before calling .strip(), which will raise AttributeError if it's None; update the logic around response and updated (used with prd_content) to first handle a nullable response.content (e.g., treat None as empty string or explicitly check for None) before calling .strip(), then apply the existing length check comparing updated to prd_content; ensure you modify the block that sets updated and the subsequent if that references prd_content to use the safe/normalized updated value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@codeframe/core/prd_stress_test.py`:
- Around line 163-173: The code assumes parsed JSON `data` is a dict and that
`data.get("classification")` returns a string before calling `.lower()`, which
can fail if the model returns an array or non-string; update the block around
`data = json.loads(response.content)` / `raw_cls = data.get("classification",
"atomic").lower()` to first ensure `data` is a dict and that `classification` is
a str (e.g., check isinstance(data, dict) and isinstance(cls_val, str)); if
those checks fail, fall back to "atomic" (or the default) so
`Classification(raw_cls)` never receives invalid input, and keep the existing
error handling using `Classification.ATOMIC` when ValueError occurs.
---
Duplicate comments:
In `@codeframe/core/prd_stress_test.py`:
- Around line 123-130: The code currently swallows parse failures and non-list
outputs by logging a warning and returning an empty list (or falling back to
"ATOMIC") which hides real failures; instead, when json.loads(response.content)
raises JSONDecodeError/TypeError or when the decoded value is not a list, raise
an explicit exception (e.g., ValueError or RuntimeError) that includes the raw
response.content and the parse error/type info, so callers fail loudly; update
both the block around the json.loads(response.content)/logger.warning non-list
check and the similar fallback at lines 165-167 to raise with context rather
than returning empty/ATOMIC.
- Around line 114-122: This module currently performs many state transitions but
doesn't emit observability events; import the core/events emitter (e.g. from
core.events import emit_event) and add well-named event emissions around each
state change: emit a "prd_stress_test.started" event at the beginning of the
top-level stress-test runner and "prd_stress_test.completed" on exit (include
duration/status), emit "prd_goals.extracted" or "prd_goals.failed" in
extract_goals with metadata (prd identifier, goals count or error), emit
"prd_ambiguity.detected" where ambiguity logic runs, emit "prd_recursion.cutoff"
when recursion limits trigger, and emit "prd_rewrite.result" for rewrite
outcomes; include relevant context (prd id, timestamps, error/message) and
ensure these calls are added in the functions handling those transitions (e.g.,
extract_goals and the ambiguity/recursion/rewrite handlers referenced in the
diff ranges).
- Around line 352-353: The code assumes response.content is a non-null string
before calling .strip(), which will raise AttributeError if it's None; update
the logic around response and updated (used with prd_content) to first handle a
nullable response.content (e.g., treat None as empty string or explicitly check
for None) before calling .strip(), then apply the existing length check
comparing updated to prd_content; ensure you modify the block that sets updated
and the subsequent if that references prd_content to use the safe/normalized
updated value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 93dba181-568a-47ff-af81-a36691e60f6e
📒 Files selected for processing (1)
codeframe/core/prd_stress_test.py
| try: | ||
| data = json.loads(response.content) | ||
| except (json.JSONDecodeError, TypeError) as exc: | ||
| logger.warning("Failed to parse classification for '%s': %s", title, exc) | ||
| return Classification.ATOMIC, [], None, "Low" | ||
|
|
||
| raw_cls = data.get("classification", "atomic").lower() | ||
| try: | ||
| cls = Classification(raw_cls) | ||
| except ValueError: | ||
| cls = Classification.ATOMIC |
There was a problem hiding this comment.
Guard JSON shape before using .get() / .lower().
If the model returns a JSON array or a non-string classification, this block can fail or behave unpredictably.
Suggested fix
- try:
- data = json.loads(response.content)
+ try:
+ data = json.loads(response.content)
+ if not isinstance(data, dict):
+ raise TypeError(f"Expected object, got {type(data).__name__}")
except (json.JSONDecodeError, TypeError) as exc:
logger.warning("Failed to parse classification for '%s': %s", title, exc)
return Classification.ATOMIC, [], None, "Low"
- raw_cls = data.get("classification", "atomic").lower()
+ raw_cls = data.get("classification", "atomic")
+ if not isinstance(raw_cls, str):
+ raw_cls = "atomic"
+ raw_cls = raw_cls.lower()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@codeframe/core/prd_stress_test.py` around lines 163 - 173, The code assumes
parsed JSON `data` is a dict and that `data.get("classification")` returns a
string before calling `.lower()`, which can fail if the model returns an array
or non-string; update the block around `data = json.loads(response.content)` /
`raw_cls = data.get("classification", "atomic").lower()` to first ensure `data`
is a dict and that `classification` is a str (e.g., check isinstance(data, dict)
and isinstance(cls_val, str)); if those checks fail, fall back to "atomic" (or
the default) so `Classification(raw_cls)` never receives invalid input, and keep
the existing error handling using `Classification.ATOMIC` when ValueError
occurs.
PR #446 Follow-up ReviewBuilding on my prior reviews — the earlier issues (patch target, type hints, silent empty-goals, 2000-char truncation) are still present. Two additional bugs found in the latest diff that were not covered before: Bug: prd_module.store() called with swapped content/title args in tests In tests/core/test_prd_stress_test.py, both setUp calls pass arguments in the wrong order. The actual prd.store() signature is store(workspace, content, title=None, ...) but the tests call it as store(workspace, 'Test PRD', sample_prd, {}). This stores the literal string 'Test PRD' as the PRD body and sample_prd as the title. Tests pass today because they only validate rendered output, not the stored record fields — but any assertion on record.content will silently return the wrong value. Fix: swap to store(workspace, sample_prd, 'Test PRD', {}). Bug: output.write_text() missing encoding parameter output.write_text(result.tech_spec_markdown) at line 151 uses the platform default encoding. On Windows this is typically cp1252, not UTF-8. The tech spec contains Unicode characters (arrows, checkmarks). prd.py's export_to_file already passes encoding='utf-8' — this write_text call should match that convention. Minor: child validation uses OR and can silently produce Untitled nodes The child filter uses ("title" in c or "description" in c). A dict with only a description key passes through and gets title = child_dict.get('title', 'Untitled') downstream. Using AND instead of OR is more defensive and matches the LLM prompt contract which specifies both keys are required. |
Summary
Implements #421: Recursive Decomposition as PRD Stress Test and Technical Spec Generator
cf prd stress-testcommand that recursively decomposes PRD goals using tri-state classification (atomic/composite/ambiguous)--interactive) resolves ambiguities inline and creates a new PRD version--max-depth, default 3)--output) writes tech spec to diskFiles Changed
codeframe/core/prd_stress_test.py— Core module with data models, prompts, recursive engine, rendererscodeframe/cli/app.py— Addedprd stress-testCLI commandtests/core/test_prd_stress_test.py— 23 tests covering models, functions, renderers, CLIAcceptance Criteria
cf prd stress-testrecursively decomposes PRD goals--interactivemode allows inline ambiguity resolutionprd.create_new_version())--outputwrites tech spec to filecf prd generateTest Plan
Implementation Notes
PrdDiscoverySessionpatterns for LLM interaction andtask_tree.pyfor recursive decompositionClassificationenum) extends beyondtask_tree.py's binary classifyAnthropicProviderimported in CLI (not core) to keep core provider-agnosticCloses #421
Summary by CodeRabbit