Skip to content

fix(api): AIN-173 · soften code-capability inference + empty-route diagnostics#32

Merged
hizrianraz merged 1 commit into
mainfrom
fix/ain-173-soft-capability-inference
May 18, 2026
Merged

fix(api): AIN-173 · soften code-capability inference + empty-route diagnostics#32
hizrianraz merged 1 commit into
mainfrom
fix/ain-173-soft-capability-inference

Conversation

@hizrianraz
Copy link
Copy Markdown
Contributor

Summary

  • Role-weighted code detection: code capability only inferred from user/assistant content, not system role. Addresses Manwe's 2026-05-18 over-rejection (2KB SOUL → 400 no_models_match_constraints even though user just said "hi").
  • Empty-route diagnostic logging: structured WARNING with per-filter elimination counts so incident triage can read off "capability=8, quality=2, cost=2" without per-call debug logging.

Hypothesis C fix only (Hypothesis A/B deferred)

Per ticket §"Recommended: Combined fix (defense in depth)" — this PR ships #3 (soften capability inference) only.

Why not A (catalog audit): Per Memory #5 verify-the-plan, verifying code capability flags on top 5 frontier models against provider docs needs founder lock. SQL UPDATE without that lock is too risky. Filing as founder-action recommended for next session.

Why not B (cost recalibration): Needs 30-day rolling actual-cost data which lives in AIN-154 Phase E (ATS scoring) pipeline. Deferred to that phase.

Why not #4 (downgrade header): Changing 400 → 200 + x-ainfera-capability-downgraded is a behavioral contract change deserving wider review.

Tests added (4 regressions)

  • test_code_not_detected_when_only_in_system_prompt — exact Manwe reproducer pattern
  • test_code_detected_when_user_turn_has_code_even_with_neutral_system — legitimate code request guard
  • test_code_detected_when_assistant_turn_has_code — multi-turn (only system is descriptive-only)
  • test_long_context_still_role_agnostic — scope guard (fix is code only)

Test plan

  • 4 new assertions pass
  • Pre-commit hooks (ruff / mypy --strict / pytest -x) green
  • CI green
  • Manwe replay: same 2KB SOUL system + "hi" user → 200 (post-merge)
  • Tulkas probe (AIN-178): 10 capability-trigger prompts → 0 unexpected 400s

Refs

  • AIN-173 (parent · BUG 1 from Manwe production dogfood)
  • AIN-154 (router hardening epic)
  • AIN-178 (Tulkas activation — verification path)

🤖 Generated with Claude Code

…e diagnostics

Per Manwe v0.14.0 production dogfood 2026-05-18 BUG 1. Hypothesis C
fix only — addresses over-aggressive inference where descriptive
system prompts triggered the `code` capability requirement that
filtered out all candidate models.

## Changes

### 1. Role-weighted code detection

`detect_capabilities` now extracts text into (system_text, other_text)
buckets and only triggers `code` when the syntax markers appear in
user/assistant content. System-only mentions of code keywords are
treated as descriptive context, not capability requests.

Rationale: SOUL system prompts (Manwe's 2KB pattern) often describe
"this agent can review code / write scripts / etc." with example
snippets. The user's actual turn might be "hi" — no need to filter
to code-capable models for a greeting.

`_CODE_SYNTAX_MARKERS` unchanged; only matching substrate moves
from "full_text" to "other_text".

### 2. Empty-route diagnostic logging

`auto_route` now emits a structured WARNING with per-filter
elimination counts when returning empty:

    auto_route_empty agent=<id> required_caps=[code, text]
    quality_floor=good per_call_cap=2.00 rows_inspected=12
    dropped_by={capability=8, quality=2, cost=2}

Manwe's bug surfaced as a generic 400 with no per-filter breakdown
— this fixes the visibility gap.

## Deferred with explicit notes

- Hypothesis A catalog audit (model capability flag verification
  against provider docs) — founder lock required per Memory #5
- Hypothesis B cost-estimate recalibration — needs 30d rolling data
  via AIN-154 Phase E
- Recommendation #4 downgrade header (200 + x-ainfera-capability-
  downgraded) — behavioral contract change deserves wider review

## Tests added

4 regression cases in tests/unit/test_t9_auto_routing.py:
- code NOT inferred from system-only mention (Manwe pattern)
- code IS inferred when user turn has code (regression guard)
- code IS inferred when assistant turn has code (multi-turn)
- long_context still role-agnostic (scope guard)

Pre-commit hooks green.

## Refs

- AIN-173 (parent · BUG 1)
- AIN-154 (router hardening epic)
- AIN-178 (Tulkas activation — verify 10 capability-trigger prompts)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented May 18, 2026

You have used all Bugbot PR reviews included in your free trial for your GitHub account on this workspace.

To continue using Bugbot reviews, enable Bugbot for your team in the Cursor dashboard.

@linear-code
Copy link
Copy Markdown

linear-code Bot commented May 18, 2026

AIN-173 🔴 BUG 1: /v1/inference no_models_match_constraints over-rejects code-capable requests (hard-fails turn)

Severity: URGENT 🔴

Filed from Manwe (hermes-agent v0.14.0) production dogfood 2026-05-18. Customer #1 evidence: hard-fails agent turns, framework retries 3× then aborts with empty response. Opaque error from consumer perspective.

Symptom

POST /v1/inference returns HTTP 400 with:

{
  "detail": {
    "code": "no_models_match_constraints",
    "message": "no active model satisfies the requested capabilities, quality_floor, and per_call_cap_usd",
    "context": {
      "capabilities_required": ["code", "text"],
      "quality_floor": "good",
      "per_call_cap_usd": "2"
    }
  }
}

Reproducer

# Yesterday's request worked
curl -X POST https://api.ainfera.ai/v1/inference \
  -H "Authorization: Bearer $MANWE_KEY" \
  -d '{
    "model": "ainfera-auto",
    "messages": [
      {"role": "system", "content": "<512-byte SOUL without code keywords>"},
      {"role": "user", "content": "hi"}
    ]
  }'
# → 200 OK with claude-opus-4-7

# Today's request fails
curl -X POST https://api.ainfera.ai/v1/inference \
  -H "Authorization: Bearer $MANWE_KEY" \
  -d '{
    "model": "ainfera-auto",
    "messages": [
      {"role": "system", "content": "<2KB SOUL mentioning coding/system admin/code execution>"},
      {"role": "user", "content": "hi"}
    ]
  }'
# → 400 no_models_match_constraints

Root cause hypothesis

Auto-router infers capabilities_required=["code", "text"] from prompt keywords. Catalog state changed when AIN-141 migration ran (2026-05-18 14:55-14:58 UTC). One of:

A. claude-opus-4-7 registration in models table is missing the code capability flag (catalog drift from AIN-165 case study — same pattern: SKILL.md vs daemon config drift, this time model_catalog vs router_capability_inference)

B. claude-opus-4-7 per-call cost estimate exceeds the $2 per_call_cap_usd. Measured per-call cost ~$0.035, but estimate may overshoot. Catalog cost_per_call_estimate_usd column may be miscalibrated.

C. Routing layer's capability inference is too aggressive — promotes "coding" → code capability when the prompt is descriptive ("my agent is good at coding") rather than directive ("write me code"). Should distinguish discussion-of-code from request-for-code.

Cross-framework impact

Agent Hit probability Reason
Manwe (hermes) 🔴 HITS NOW Confirmed via this bug report
Varda (Claude SDK + OpenClaw) 🔴 HITS System prompts mention code review, PR triage, audit — all keyword triggers
Aule (Claude SDK + Opus) 🔴 HITS git_pr_create, bash_execute_sandboxed tools — entire purpose is code
Yavanna (LangGraph) 🟡 LOW-MED Tweet drafts mention competitors who do "code" (LangChain, etc.) → may trigger
Namo (Letta + Gemini) 🟡 LOW Research synthesis may discuss code-related papers → keyword trigger
Tulkas (Garak + Mistral) 🟡 LOW Single-prompt probes; system prompts mention "drain proof" not code

Net: 3 of 6 fleet agents at production risk. Manwe confirmed broken. Varda + Aule pending validation.

Fix recommendation (pick one OR combine)

Recommended: Combined fix (defense in depth)

  1. Audit models catalog — verify every model has correct capability flags:

    SELECT provider, slug, capabilities, cost_per_call_estimate_usd, active 
    FROM models 
    WHERE active = true 
    ORDER BY provider, slug;

    Cross-reference with provider API docs. Fix any code-capable models marked otherwise. PRIORITY: claude-opus-4-7, claude-sonnet-4-6, gpt-5-5, gpt-5-5-mini, gemini-3-1-pro, grok-4.

  2. Audit cost_per_call_estimate_usd — measured values vs catalog estimates. If overshoot >2×, recalibrate from last-30-days p95 actual cost.

  3. Soften capability inference — distinguish:

    • "code" capability inferred from EXPLICIT directives ("write code", "implement", "fix this bug", code blocks in user message) → strict
    • "code" capability inferred from DESCRIPTIVE prompts ("my agent does code review", "I work on a codebase") → soft (informational, doesn't gate routing)
  4. Header/extension fallback hint — when constraint over-rejects but text-only routing would succeed, return 200 with x-ainfera-capability-downgraded: true header instead of 400. Consumer can opt out via X-Strict-Capabilities: true request header.

Acceptance gates

  • Migration: catalog audit + fix for top 5 frontier models (claude-opus-4-7 first)
  • Test: replay Manwe's failing request → 200 OK
  • Test: code-explicit user message → still routes to code-capable model
  • Test: descriptive system prompt with "code" keyword + text-only user message → 200 OK (downgrade header set)
  • Tulkas probe (AIN-178): replay 10 capability-triggering prompts post-fix → 0 unexpected 400s
  • Audit event added: inference.capability_downgraded when fallback fires
  • Per When-stuck fix(api): AIN-141 strip retired AAMC model literals (code-side only; DB migration deferred) #19: STAGING canary before prod
  • PR opens off fix/ain-173-code-capability-rejection branch (Aule author override)

Connection to existing tickets

  • AIN-154 (router hardening) — this is a router-layer bug, parents here
  • AIN-165 (SKILL.md vs daemon config drift) — same Discipline fix(orm): use postgresql.ARRAY so capabilities.contains() emits @> #11 corollary pattern: catalog vs router_inference drift
  • AIN-141 (catalog migration that triggered state change) — root cause analysis lesson
  • AIN-171 (auto-route asymmetric exception coverage) — same auto-route path; verify Choice B exception coverage handles no_models_match_constraints correctly

Workaround (Manwe already using)

Pin model via consumer env var: MANWE_AINFERA_DEFAULT_MODEL=claude-opus-4-7 bypasses auto-router entirely. This is the AIN-172 SPOF mitigation pattern — proves the env-override pattern is load-bearing for incident response.

Founder authorization

Per "Fix this error for and check from all frameworks. Tulkas need to start working now." (2026-05-18 PM)

Review in Linear

@hizrianraz hizrianraz merged commit 6809731 into main May 18, 2026
3 checks passed
@hizrianraz hizrianraz deleted the fix/ain-173-soft-capability-inference branch May 18, 2026 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant