test(round22): TD-192 live latency regression — fractal-entry 2 → 1 calls#32

Merged
engkimo merged 1 commit into main from test/round22-td192-latency
May 15, 2026
@engkimo engkimo commented May 15, 2026

Owner

Summary

  • New live E2E tests/integration/test_round22_td192_latency.py pins TD-192's LLM round-trip reduction.
  • Spy LLMGateway counts complete() calls; runs against real Ollama (qwen3:8b).
  • Pre-TD-192 baseline: 2 calls/goal (bypass classifier + output classifier).
  • Post-TD-192: 1 call/goal — 50% reduction at the fractal entry point.
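The call-counting approach described above can be sketched roughly as follows. This is only an illustration, not the test's actual code: `FakeGateway`, `SpyGateway`, and `classify_once` are hypothetical names, and the fake gateway stands in for the real Ollama-backed one so the sketch runs offline.

```python
import asyncio
from typing import Any


class FakeGateway:
    """Hypothetical stand-in for the real gateway (assumption, not project code)."""

    async def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class SpyGateway:
    """Wraps a gateway and counts complete() calls while delegating to it."""

    def __init__(self, inner: Any) -> None:
        self._inner = inner
        self.call_count = 0

    async def complete(self, *args: Any, **kwargs: Any) -> Any:
        self.call_count += 1  # record the round trip before delegating
        return await self._inner.complete(*args, **kwargs)


async def classify_once(gateway: Any, goal: str) -> Any:
    # Hypothetical classifier entry point: a single LLM round trip per goal.
    return await gateway.complete(goal)


spy = SpyGateway(FakeGateway())
result = asyncio.run(classify_once(spy, "What is 2+2?"))
print(spy.call_count, result)
```

A test would then assert on `spy.call_count` after classification, which pins the number of round trips independently of how long each one takes.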

Live measurement (Round 22)

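| Goal | Calls | Elapsed | Bypass | Output | Complexity |
|---|---|---|---|---|---|
| 氷川神社のスライドを作って ("Make slides about Hikawa Shrine") | 1 | 7.80s | False | file | medium |
| What is 2+2? | 1 | 1.08s | True | text | simple |
| 2-goal total | 2 | 2.41s | (baseline 4, saved 2) | | |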
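The baseline and savings figures follow from simple arithmetic on the per-goal call counts. A minimal sketch, with constant names chosen here for illustration only:

```python
# Per-goal call counts taken from the PR summary; the names are illustrative.
GOALS = 2
CALLS_PER_GOAL_BEFORE = 2  # bypass classifier + output classifier
CALLS_PER_GOAL_AFTER = 1   # single folded classifier call

baseline = GOALS * CALLS_PER_GOAL_BEFORE
measured = GOALS * CALLS_PER_GOAL_AFTER
saved = baseline - measured
reduction = saved / baseline

print(f"baseline={baseline} measured={measured} saved={saved} reduction={reduction:.0%}")
```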

TD-191 guard re-verified

  • Artifact goal still takes fractal path (output != TEXT clamps bypass=False).
  • Plain text Q&A still bypasses (TD-167 latency win preserved).
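One way to read the guard: whatever the classifier suggests, bypass is only allowed when the required output is plain text. The sketch below assumes a hypothetical `OutputKind` enum and `clamp_bypass` helper; the project's real types and names may differ.

```python
from enum import Enum


class OutputKind(Enum):
    """Hypothetical output requirement values (assumption for this sketch)."""

    TEXT = "text"
    FILE = "file"


def clamp_bypass(suggested_bypass: bool, output: OutputKind) -> bool:
    """Force bypass off whenever the goal needs a non-text artifact."""
    return suggested_bypass and output is OutputKind.TEXT


# An artifact goal stays on the fractal path even if bypass was suggested.
artifact_bypass = clamp_bypass(True, OutputKind.FILE)
# A plain text question keeps its bypass.
text_bypass = clamp_bypass(True, OutputKind.TEXT)
print(artifact_bypass, text_bypass)
```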

Test plan

  • pytest tests/integration/test_round22_td192_latency.py -v -s → 3/3 PASS against real Ollama.
  • ruff check tests/integration/test_round22_td192_latency.py clean.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Added integration tests verifying improved LLM call efficiency for classification workflows, with skip conditions for unavailable dependencies.

… → 1

Pin TD-192's LLM round-trip reduction with a real Ollama (qwen3:8b) test:
spy LLMGateway counts complete() invocations during fractal-entry
classification. Pre-TD-192 baseline = 2 calls/goal (bypass + output
classifiers); post-TD-192 = 1 call/goal. Round 22 measured 2 LLM calls
across 2 goals (artifact + text), saving 2 calls vs baseline. TD-191
guard re-verified end-to-end: artifact goal stays on fractal path,
plain text Q&A still bypasses.

coderabbitai Bot commented May 15, 2026

Walkthrough

This PR adds a new Ollama-backed integration test suite that measures LLM call counts to validate TD-192 fractal-entry classification latency improvements. The tests confirm FractalBypassClassifier makes exactly one LLM call per goal and verify end-to-end call reduction matches the expected TD-192 target.

Changes

TD-192 Call Count Reduction Test Suite

| Layer / File(s) | Summary |
|---|---|
| Test context and setup<br>tests/integration/test_round22_td192_latency.py | Module docstring documents TD-192/TD-191 intent and expected call reduction target. Imports, Ollama CLI detection, pytest skip markers, and constants define pre/post TD-192 call counts for artifact and text goals. |
| Test infrastructure and fixtures<br>tests/integration/test_round22_td192_latency.py | In-memory cost repository provides async cost tracking methods. _SpyLLMGateway wrapper increments call_count on each complete() call while delegating to real LiteLLMGateway. Module-scoped fixtures manage asyncio event loop, Ollama availability validation with qwen3 model check, and LiteLLMGateway construction with CostTracker and Settings. |
| Test cases for call count validation<br>tests/integration/test_round22_td192_latency.py | TestTD192CallCountReduction runs three async tests: separate artifact goal and text Q&A goal tests each assert exactly one LLM call with correct bypass behavior and output requirements; test_round22_summary runs both goals end-to-end, measures elapsed time, asserts combined call count matches TD-192 reduced total, and prints baseline vs. saved call summary. |
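The Ollama availability check mentioned in the walkthrough could look something like the sketch below. It is an assumption about the approach, not the test file's code: `model_listed` and `ollama_has_model` are hypothetical helpers, the sample output is illustrative, and the parsing assumes `ollama list` prints a header row followed by one model name per row in the first column.

```python
import shutil
import subprocess


def model_listed(list_output: str, prefix: str) -> bool:
    """Return True if any model name in the listing starts with prefix."""
    lines = list_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[0].startswith(prefix):
            return True
    return False


def ollama_has_model(prefix: str, cli: str = "ollama") -> bool:
    """Return False if the CLI is missing, fails, or lacks the model."""
    if shutil.which(cli) is None:
        return False
    try:
        proc = subprocess.run(
            [cli, "list"], capture_output=True, text=True, timeout=10, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0 and model_listed(proc.stdout, prefix)


# Illustrative listing only; the ID, size, and date are made up.
SAMPLE = """NAME        ID            SIZE      MODIFIED
qwen3:8b    0123456789ab  5.2 GB    2 days ago
"""
print(model_listed(SAMPLE, "qwen3"))
```

A result like this could feed a `pytest.mark.skipif` condition so the suite skips cleanly on machines without Ollama or the qwen3 model.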

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A spy counts calls with whispered care,
One per goal floats through the air,
TD-192's latency saved,
Integration tests now, bravely paved! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a test for TD-192 that verifies fractal-entry LLM call reduction from 2 to 1 calls. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
tests/integration/test_round22_td192_latency.py (1)

49-67: ⚡ Quick win

Add type annotations to test double.

The _InMemoryCostRepo class is missing type hints on the record parameter and the _records list. Adding these would improve type safety and make the interface contract clearer.

♻️ Proposed type annotations
+from typing import Any
+
 class _InMemoryCostRepo:
     def __init__(self) -> None:
-        self._records: list = []
+        self._records: list[Any] = []
 
-    async def save(self, record) -> None:
+    async def save(self, record: Any) -> None:
         self._records.append(record)

If the actual cost record type is available (e.g., from a domain model), use that instead of Any.
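If a concrete record type were available, the typed test double might look like the sketch below. `CostRecord` here is a hypothetical dataclass invented for illustration; the project's real cost record type and its fields are not shown in this PR.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRecord:
    """Hypothetical record shape (assumption); replace with the domain type."""

    model: str
    cost_usd: float


class InMemoryCostRepo:
    """Test double that keeps cost records in a typed in-memory list."""

    def __init__(self) -> None:
        self._records: list[CostRecord] = []

    async def save(self, record: CostRecord) -> None:
        self._records.append(record)

    @property
    def records(self) -> tuple[CostRecord, ...]:
        # Expose an immutable view so tests cannot mutate internal state.
        return tuple(self._records)


repo = InMemoryCostRepo()
asyncio.run(repo.save(CostRecord(model="qwen3:8b", cost_usd=0.0)))
print(repo.records)
```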

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_round22_td192_latency.py` around lines 49 - 67, The
test double _InMemoryCostRepo should declare types for its internal list and the
save parameter: annotate self._records as list[CostRecord] (or list[Any] if
CostRecord isn't available) and change save(self, record) to save(self, record:
CostRecord) -> None (or record: Any); add the appropriate typing import (from
typing import Any) or the domain CostRecord import so the signatures on
_InMemoryCostRepo, save, and the _records field clearly express the expected
record type.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 81545cb0-4417-4e18-9605-761722353db7

📥 Commits

Reviewing files that changed from the base of the PR and between e7d4fc9 and d730396.

📒 Files selected for processing (1)
  • tests/integration/test_round22_td192_latency.py

@engkimo engkimo merged commit 97f11a7 into main May 15, 2026
6 checks passed
@engkimo engkimo deleted the test/round22-td192-latency branch May 15, 2026 09:08
engkimo added a commit that referenced this pull request May 18, 2026
Covers 33 commits since v0.6.1:
- TD-194 Council Pilot full merge (#20) + 5 post-merge fix-ups (#22-#26)
- TD-189 steps 1-4: per-task cache_hit_rate plumbing (#27-#30)
- TD-192: fold OutputRequirementClassifier into FractalBypassClassifier (#31)
- Round 22 live latency regression — fractal-entry 2 → 1 LLM calls (#32)
- Haiku 4.5 cache threshold pinned at ~4096 tokens via --pad-entries (#33)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>