fix(prompts): LLM prompting audit fixes (8 findings) #218
Merged
From the prompting audit:

#2 Acting Coach was at SectionPriority.IMPORTANT while its source signals (causal_context, valence_context, body_state) were CRITICAL. Under token pressure the interpreted view dropped before the raw signals it synthesises — backwards. Bumped to CRITICAL so the agent safety annotations survive alongside the bio data they summarise.

#4 Added regression tests for the deliberation_transcript suppression contract in _add_perception_sections + _add_working_memory_section. When a transcript is present, bio_enrichment_context and working_memory_thoughts are intentionally suppressed (the transcript is assumed to subsume them). The risk is silent staleness: a fresh percept arriving after transcript construction would have its bio_enrichment dropped. The tests pin both branches so any future producer/consumer divergence surfaces loudly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
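The suppression contract in #4 can be sketched as a toy model. The function and key names below mirror the commit message; the real logic lives inside PromptBuilder's section methods, so treat this as an illustration of the contract, not the implementation:

```python
# Toy model of the suppression contract: when a deliberation transcript
# is present, the interpreted bio / working-memory views are assumed to
# be subsumed by it and must not be emitted as separate sections.
def select_sections(signals: dict) -> list[str]:
    sections = []
    has_transcript = "deliberation_transcript" in signals
    if has_transcript:
        sections.append("deliberation_transcript")
    for key in ("bio_enrichment_context", "working_memory_thoughts"):
        # suppressed whenever the transcript subsumes them
        if key in signals and not has_transcript:
            sections.append(key)
    return sections

# Both branches of the contract, pinned:
assert select_sections(
    {"deliberation_transcript": "t", "bio_enrichment_context": "b"}
) == ["deliberation_transcript"]
assert select_sections(
    {"bio_enrichment_context": "b", "working_memory_thoughts": "w"}
) == ["bio_enrichment_context", "working_memory_thoughts"]
```

The staleness risk named above falls out directly: a `bio_enrichment_context` produced after the transcript was built is still dropped by the first branch.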
… auto-sense logging

#7 Bumped the current-observation section from NICE_TO_HAVE to IMPORTANT in PromptBuilder._add_perception_sections. Tools cannot be safely invoked without knowing the current state, so the observation should outlast conversation history under token pressure.

#3 SensePresenceTool.input_schema migrated from the legacy description-as-value pattern to JSONSchema directly. The old form emitted required: ["context"] in the export despite the prose saying "Optional", which strict MCP / Anthropic clients reject. execute() already handles a missing context gracefully and the auto-sense caller never passes one. Added a regression test in test_tool_discovery.py.

Auto-sense logging in agent_loop.py:1017-1057 — split the blanket `except (KeyError, Exception): pass` into:
- KeyError on the registry lookup: silent (an unregistered tool is a legitimate config, e.g. agents without entity_map).
- Anything else from execute(): routed through log_swallowed_exception, so silent blindness can no longer hide a real bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
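The shape of that split can be sketched as follows. The function name, registry shape, and logger signature are assumptions for illustration; only the asymmetry is the point:

```python
# Sketch of the auto-sense exception split (names assumed from the
# commit message): a missing registration is expected configuration,
# a failure inside execute() is not.
def run_auto_sense(registry, log_swallowed_exception):
    try:
        tool = registry["sense_presence"]  # KeyError => tool not registered
    except KeyError:
        return None  # legitimate config, e.g. agents without an entity_map
    try:
        return tool.execute()
    except Exception as exc:
        log_swallowed_exception(exc)  # real bugs now surface in the log
        return None
```

Only the narrowly expected KeyError stays silent; anything raised by execute() is recorded before being swallowed, which is exactly what the old blanket handler failed to do.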
…e plan B1
The substrate plan's B1 ("PromptAssembler — single composition point")
was archived as shipped, but the migration was never completed. Exit
criterion (grep -r system_message=f outside the assembler returns
nothing) was never met. agents/prompt_builder.py is the actual
production builder at 1658 lines; prompts/assembler.py was a 184-line
abandoned scaffold.
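The unmet exit criterion above can be checked mechanically. A minimal sketch, assuming a repo root with the paths named in this commit (filenames illustrative):

```shell
# B1 exit criterion: no raw f-string system_message construction
# outside the assembler module. Any hit means the migration is unfinished.
if grep -rn 'system_message=f' --include='*.py' . \
     | grep -v 'prompts/assembler.py'; then
  echo "exit criterion NOT met: raw composition sites remain"
else
  echo "exit criterion met"
fi
```

Running this before archiving B1 would have shown prompt_builder.py (and dm_runtime.py) still composing prompts directly.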
What was dead:
- PromptAssembler class (no production caller, never instantiated)
- compose_memory_section, compose_observation_section methods
- MemorySummary.from_atl classmethod
- MemoryHub.get_memory_summary (only consumer of from_atl; itself had
zero production callers, only one test asserting it returned None
when substrate disabled)
Removed: prompts/assembler.py, prompts/__init__.py exports for the
three dead symbols, MemoryHub.get_memory_summary, related tests in
test_substrate_recognition.py (TestMemorySummary, TestPromptAssembler,
test_get_memory_summary_none_when_disabled), and the stale "PromptAssembler
will replace this" comment in dm_runtime.py.
Updated the CLAUDE.md "Prompt composition" key-files row: prompt_builder.py
is now listed as the canonical builder. The labels were inverted — the file
carrying the misleading "(legacy)" annotation was the one in production,
and the file labelled canonical is the one just deleted.
If a future plan wants to replace prompt_builder.py with a structured
composition layer, start fresh — the previous attempt is not a useful
foundation to build on.
Test count: 6303 passed (was 6312; 9 tests removed for the dead code).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or truncation + AgentPool stash cleanup

#5 Two prompt-builder budget fixes:
- The conversation-history turn cap was hardcoded `3 if acting_coach else 12`, ignoring n_ctx entirely. Embodied agents on 32K-context models still got 3 turns; non-embodied agents on 4K still got 12. Replaced with proportional formulas that scale with n_ctx and floor at sensible minimums: embodied min(12, max(3, n_ctx // 2000)), non-embodied min(20, max(6, n_ctx // 800)).
- The motor-programs section truncated by line count (`m // 15`), which left orphan continuation lines (Steps:/Known risks:) when the owning name line got dropped. Each motor program is 1-3 lines; line-count slicing doesn't respect that structure. Replaced with an entry-aware truncate_fn that drops full entries from the tail using a char-budget estimate (`max_tok * 4`, the same heuristic as the deliberation-transcript truncator), and stops once only the header remains.

The deliberation-transcript budget at line 1277 (`min(2000, available * 0.3)`) was flagged in the audit but is actually a correct proportional pattern (cap + fraction); left unchanged.

#6 AgentPool.remove() now drops the per-agent bio_integration stash entries (`_episode_ticks`, `_latest_pain_intensity`, `_latest_substrate_nodes`) so a future agent reusing the same id doesn't inherit stale tick counters, pain intensity, or substrate node refs from the removed one. No bug today, since multi-agent runs use unique ids, but the cleanup is cheap insurance against NPC respawn or pool-recycling patterns.

Tests: +4 (test_remove_clears_per_agent_stash, plus three motor-program truncation cases pinning no-orphans, tail-drop, and head-preservation). Total: 6307 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
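The proportional cap formulas above are simple enough to pin down in a few lines. The function name and call shape are illustrative; the formulas are taken verbatim from the commit:

```python
# Sketch of the proportional conversation-history turn caps.
# Embodied prompts carry extra bio/coach sections, hence the lower cap.
def history_turn_cap(n_ctx: int, embodied: bool) -> int:
    if embodied:
        return min(12, max(3, n_ctx // 2000))
    return min(20, max(6, n_ctx // 800))

# 32K embodied: full 12 turns (was hardcoded to 3);
# 4K embodied still floors at 3, matching the old behaviour there.
assert history_turn_cap(32768, embodied=True) == 12
assert history_turn_cap(4096, embodied=True) == 3
assert history_turn_cap(4096, embodied=False) == 6
```

The max(...) floor guarantees tiny contexts never drop below a usable minimum, and the min(...) cap keeps huge contexts from flooding the prompt with history.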
Summary
Fixes from the LLM-prompting audit (8 findings, all closed). Net: 4 commits, +422/−350 LOC, 6307 tests pass (+10 new regression tests for the contracts that were silently relied on).
- Acting Coach bumped to CRITICAL and the current-observation section to IMPORTANT, so the interpreted safety view and current state survive token pressure alongside their source signals.
- `sense_presence.input_schema` migrated from the legacy description-as-value pattern to JSONSchema, so the export marks `context` as optional (matching the implementation) instead of required (matching the prose). Strict MCP / Anthropic clients no longer reject calls that omit `context`.
- `agent_loop.py` auto-sense split: `KeyError` (tool not registered) stays silent, anything else routes through `log_swallowed_exception` so silent blindness can no longer hide a real bug.
- `prompts/assembler.py` was an abandoned scaffold whose only consumer-of-a-consumer (`MemoryHub.get_memory_summary`) had zero production callers. Removed; the CLAUDE.md "legacy" label corrected.
- Conversation-history turn caps now scale with context: `min(12, max(3, n_ctx // 2000))` for embodied / `min(20, max(6, n_ctx // 800))` for non-embodied. Embodied agents on 32K-context models no longer get clipped to 3 turns regardless of context.
- `AgentPool.remove()` now drops module-level bio_integration stash entries (`_episode_ticks`, `_latest_pain_intensity`, `_latest_substrate_nodes`) so a future agent reusing the same id can't inherit stale state. No bug today — multi-agent runs use unique ids — but cheap insurance against NPC respawn / pool recycling patterns.
- Plus regression tests pinning the deliberation-transcript suppression contract (`bio_enrichment_context` and `working_memory_thoughts` are intentionally suppressed when a transcript is present) so future producer/consumer divergence surfaces loudly instead of as a stale-prompt regression.

Test plan
- `ruff check` + `ruff format` on every touched file
- New tests: `AgentPool` stash cleanup (1), `sense_presence` schema export (1)
- `MAXIM_LOG_FILE` to confirm Acting Coach now appears under token pressure on a 4K-ctx model
- `sense_presence` calls without `context` no longer get rejected

🤖 Generated with Claude Code