Problem
Terse or informal Slack messages that clearly match a skill are not reliably triggering skill loads. Confirmed miss:
User: "pr has some feedback, cleanup"
Expected: loadSkill("pr-cleanup")
Actual: skill not loaded
The pr-cleanup description includes "address review feedback", "clean up a PR", and "make an existing PR ready for review". All three match the request semantically, but the model did not map the colloquial shorthand onto them.
Root Causes
1. Skill routing is advisory, not mandatory
The authoritative <skill-policy> block only prohibits invalid skill loading (only load listed ones, one at a time). It imposes no positive obligation to scan skills before proceeding with an actionable request.
2. The strongest positive directive is in a weak location
"Scan before answering. Load the most specific matching skill..." lives in the <available-skills> block header, not in <skill-policy>. Models typically weight policy blocks higher than metadata/headers in long prompts.
3. Descriptions lack terse Slack-native phrasings
Current description trigger phrases are well-written but full-sentence. Slack messages like "pr has feedback", "PR has comments", "cleanup pr" are shorthand that doesn't match the description's register.
4. Explicit-name-match bias
The current hint — "A request that names a skill, plugin, provider, or account matching a skill name is a skill match" — biases the model toward exact name matches over semantic task matching.
5. No harness-level routing enforcement or eval
No router/classifier, no forced skill preflight, no regression test for terse trigger phrasings. Routing depends entirely on one generation step self-selecting the right skill.
Research: How Other Harnesses Handle This
- Anthropic / Claude Skills: Description is the primary routing signal + progressive disclosure. Eval-driven development is recommended. Description quality is critical but insufficient alone for reliability.
- OpenAI function calling: Descriptions are the primary signal, but production systems often add tool_choice constraints or application-side routing when reliability matters.
- LangGraph / workflow agents: Separate router/classifier node decides next skill/tool before execution — routing is a distinct control-plane step, not a self-selection by the worker.
- Cursor/Codex AGENTS.md / CLAUDE.md: Explicit task rules ("for PR review comments, use Y workflow") as routing tables — not pure semantic matching.
The common pattern across reliable harnesses: routing is a separate, explicit, mechanically enforced step, not hoped-for self-selection.
Proposed Fixes
Fix 1: Strengthen <skill-policy> (highest impact, fix first)
Move the positive obligation into the authoritative block:
<skill-policy>
- Before attempting any actionable request, check available skills for a semantic match.
- If any listed skill matches the user's task or likely intent, load the most specific matching skill before answering or taking other action.
- Do not answer from memory, inspect files, call unrelated tools, or ask broad clarifying questions when a skill fits.
- Match skills semantically, including terse Slack phrasing, abbreviations, and thread context. Exact skill-name matches are sufficient but not required.
- Slack messages are often terse fragments; interpret them using current thread context when determining skill matches.
- Only load skills listed in <available-skills>, <user-callable-skills>, or named by <explicit-skill-trigger>. Never guess or invent a skill name.
- Load one skill at a time. After loadSkill, follow the instructions returned by that tool result.
</skill-policy>
Fix 2: Add a compact routing table for high-confusion skills
A <skill-routing-hints> block focused on commonly confused cases:
<skill-routing-hints>
- Existing PR + feedback / comments / review / requested changes / cleanup / CI / readiness → pr-cleanup
Short forms: "pr has feedback", "cleanup the pr", "PR has comments", "address review feedback", "ci failed on the pr"
- New code, implementation, debugging, repo edits not specifically iterating an existing PR → github-code
</skill-routing-hints>
Keep this table compact — only high-value confusion pairs, not a duplication of all descriptions.
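One way to keep the table compact and reviewable is to hold it as data and render the prompt block from it. The following is a minimal sketch, not the harness's actual mechanism: `ROUTING_HINTS` and `render_routing_hints` are hypothetical names, and the entries are copied from the block above.

```python
# Hypothetical data structure: one entry per high-confusion routing rule.
ROUTING_HINTS = [
    {
        "signals": "Existing PR + feedback / comments / review / requested changes / cleanup / CI / readiness",
        "skill": "pr-cleanup",
        "short_forms": [
            "pr has feedback",
            "cleanup the pr",
            "PR has comments",
            "address review feedback",
            "ci failed on the pr",
        ],
    },
    {
        "signals": "New code, implementation, debugging, repo edits not specifically iterating an existing PR",
        "skill": "github-code",
        "short_forms": [],
    },
]


def render_routing_hints(hints):
    """Render the hint entries as a <skill-routing-hints> prompt block."""
    lines = ["<skill-routing-hints>"]
    for hint in hints:
        lines.append(f"- {hint['signals']} → {hint['skill']}")
        if hint["short_forms"]:
            quoted = ", ".join(f'"{form}"' for form in hint["short_forms"])
            lines.append(f"Short forms: {quoted}")
    lines.append("</skill-routing-hints>")
    return "\n".join(lines)


print(render_routing_hints(ROUTING_HINTS))
```

Keeping the hints as data also makes it easy to cap the number of entries in review, so the table does not grow into a second copy of every description.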
Fix 3: Add terse Slack phrasings to pr-cleanup description
Iterate on an existing pull request until it is review-ready. Use for terse or explicit requests involving a PR/pull request plus cleanup, feedback, review comments, requested changes, CI failures, polishing, or making the PR ready. Examples: "pr has feedback", "cleanup the pr", "PR has comments", "address review feedback", "fix requested changes", "fix PR CI", "make this ready for review".
Fix 4: Evals / regression tests
Add routing evals where expected first action is loadSkill("pr-cleanup"):
Should trigger:
- "pr has some feedback, cleanup"
- "pr has feedback"
- "cleanup the pr"
- "PR comments came in, fix them"
- "address review comments"
- "ci failed on the pr"
- "make the PR ready"
- "polish this PR"
- "there are requested changes"
Should NOT trigger:
- "write a cleanup script" → not pr-cleanup
- "clean up this function" → github-code
- "what is a PR?" → no skill
- "open a new PR" → github-code / pr-writer
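The cases above could be encoded as a small regression table. This is a sketch under assumptions: `first_skill_loaded` is a hypothetical callable that runs one turn in the harness and returns the name of the first skill loaded (or `None` if no skill was loaded); how that is wired to the real model is not specified here.

```python
PR_CLEANUP = "pr-cleanup"


def only(*skills):
    """Accept the result only if it is one of the given skills (None = no skill)."""
    return lambda got: got in skills


def anything_but(skill):
    """Accept any result except the given skill."""
    return lambda got: got != skill


# (message, acceptance check) pairs taken from the lists above.
CASES = [
    ("pr has some feedback, cleanup", only(PR_CLEANUP)),
    ("pr has feedback", only(PR_CLEANUP)),
    ("cleanup the pr", only(PR_CLEANUP)),
    ("PR comments came in, fix them", only(PR_CLEANUP)),
    ("address review comments", only(PR_CLEANUP)),
    ("ci failed on the pr", only(PR_CLEANUP)),
    ("make the PR ready", only(PR_CLEANUP)),
    ("polish this PR", only(PR_CLEANUP)),
    ("there are requested changes", only(PR_CLEANUP)),
    ("write a cleanup script", anything_but(PR_CLEANUP)),
    ("clean up this function", only("github-code")),
    ("what is a PR?", only(None)),
    ("open a new PR", only("github-code", "pr-writer")),
]


def run_routing_evals(first_skill_loaded):
    """Return (message, got) for every case whose first loaded skill is unacceptable."""
    failures = []
    for message, accept in CASES:
        got = first_skill_loaded(message)
        if not accept(got):
            failures.append((message, got))
    return failures


# Usage with a stand-in router that never loads a skill, mimicking the reported miss.
for message, got in run_routing_evals(lambda message: None):
    print(f"MISS: {message!r} -> {got}")
```

Because model output is not deterministic, a real eval would likely run each case several times and track a pass rate rather than a single pass/fail.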
Fix 5 (longer term): Harness-level router
For high-value routing, add a lightweight classifier that runs before the executor:
user message + thread context
→ router classifier (skill | none)
→ executor with skill preloaded
This makes routing a control-plane step instead of relying on the executor to self-select. It is especially useful for skills that handle terse or implicit Slack messages, skills that are frequently missed, and skills with adjacent, overlapping descriptions.
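The flow above can be sketched as a two-stage pipeline. Everything here is illustrative: `RoutedRequest`, `route_then_execute`, and `keyword_classifier` are hypothetical names, and the keyword matcher is only a stand-in for whatever classifier the harness would actually use (most likely a small model call). It knows about pr-cleanup and nothing else.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class RoutedRequest:
    message: str
    thread_context: tuple
    skill: Optional[str]  # None means no skill is preloaded


# Stand-in classifier (assumption): a PR mention plus a cleanup-style signal.
PR_PATTERN = re.compile(r"\b(pr|pull request)\b")
CLEANUP_SIGNALS = (
    "feedback", "comments", "cleanup", "clean up", "review",
    "requested changes", "ci failed", "polish", "ready",
)


def keyword_classifier(message: str, thread_context: Sequence[str]) -> Optional[str]:
    """Return a skill name or None, using the message plus thread context."""
    text = " ".join([*thread_context, message]).lower()
    if PR_PATTERN.search(text) and any(signal in text for signal in CLEANUP_SIGNALS):
        return "pr-cleanup"
    return None


def route_then_execute(
    message: str,
    thread_context: Sequence[str],
    classify: Callable[[str, Sequence[str]], Optional[str]],
    execute: Callable[[RoutedRequest], object],
):
    """Run the router first, then hand the executor a request with the skill decided."""
    skill = classify(message, thread_context)
    return execute(RoutedRequest(message, tuple(thread_context), skill))


# Usage: the executor here just reports what it was given.
result = route_then_execute(
    "there are requested changes",
    ["I opened a pr for the login fix"],
    keyword_classifier,
    lambda request: f"executor started with skill={request.skill}",
)
print(result)
```

The point of the split is that the executor never decides whether to load a skill; it receives the decision. The classifier can then be evaluated and improved on its own.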
Priority
- <skill-policy> policy strengthening (prompt change, zero-risk, global fix)
- pr-cleanup description update (description change, low-risk)
- Compact <skill-routing-hints> block (prompt change, low-risk)
- Routing evals (new test coverage)
- Harness-level router (architecture change, longer term)
Impact
This miss is likely not isolated to pr-cleanup. The same policy weakness affects all skills when users phrase requests tersely or colloquially — which is most Slack messages.
--
View Junior Session in Sentry