[AAASM-195] ✅ (bench): Add Python-side performance benchmarks for latency contracts #21

Merged: Chisanan232 merged 15 commits into master from v0.0.0/AAASM-195/add_python_benchmarks on May 1, 2026

Conversation

@Chisanan232

Description

Add a comprehensive Python-side performance benchmark suite to verify the <2ms per-call (AAASM-45) and <50ms detection (AAASM-47) latency contracts. This establishes measurable baselines so future regressions are detectable.

Type of Change

  • ✨ New feature
  • 🔧 Bug fix
  • ♻️ Refactoring
  • 🍀 Performance improvement
  • 📚 Documentation update
  • 🚀 Release

Breaking Changes

Does this PR introduce any breaking changes?

  • No
  • Yes (please describe below)

Related Issues

  • Related JIRA ticket: AAASM-195
  • Related stories: AAASM-45 (<2ms per-call), AAASM-47 (<50ms detection)

What Changed

  • Added pytest-benchmark to dev dependencies
  • Created test/bench/ directory with shared fixtures and latency contract constants
  • Added benchmark pytest marker to pytest.ini
  • Benchmarks for all 6 adapter hook register/unregister cycles (LangChain, LangGraph, CrewAI, Pydantic AI, OpenAI Agents, MCP)
  • Benchmarks for AdapterRegistry.auto_detect() scaling with 0/1/2/4 frameworks
  • Benchmark for init_assembly() cold-start time
  • Conditional benchmark for report_llm_call() PyO3 round-trip (skips when native module not built)
  • Latency contract enforcement tests using time.perf_counter_ns() with P50/P95/P99 percentile reporting
  • Baseline results documented in test/bench/BASELINE.md
  • CI workflow (.github/workflows/benchmarks.yml) for automated regression detection
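The contract enforcement approach listed above (timing with time.perf_counter_ns() and reporting P50/P95/P99) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `hooked_call`, `measure_ns`, `percentile`, and the iteration counts are hypothetical, and only the 2 ms budget comes from the description.

```python
import time

# Contract threshold taken from the PR description (AAASM-45): < 2 ms per call.
PER_CALL_BUDGET_NS = 2_000_000


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an ascending-sorted, non-empty list."""
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(sorted_samples) + 99) // 100)
    return sorted_samples[rank - 1]


def measure_ns(func, iterations=1_000, warmup=100):
    """Time `func` once per call and return P50/P95/P99 in nanoseconds."""
    for _ in range(warmup):  # warm caches before recording samples
        func()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        func()
        samples.append(time.perf_counter_ns() - start)
    samples.sort()
    return {p: percentile(samples, p) for p in (50, 95, 99)}


def hooked_call():
    """Hypothetical stand-in for a patched adapter call."""
    return sum(range(10))


stats = measure_ns(hooked_call)
print({f"P{p}": f"{ns / 1_000:.2f}us" for p, ns in stats.items()})
assert stats[99] < PER_CALL_BUDGET_NS, "P99 exceeds the 2 ms contract"
```

Asserting on P99 rather than the mean keeps a handful of slow outliers from hiding behind a fast average, which matches the percentile reporting described in the list.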

Baseline Results

All benchmarks pass well within contract thresholds:

  • Adapter hooks: 0.6–2.7µs mean (contract: <2ms)
  • Detection: ~1.3ms mean (contract: <50ms)
  • Cold start: ~1.5ms mean

Testing

Describe the testing performed for this PR:

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No tests required (explain why)

Run benchmarks: pytest test/bench/ --benchmark-only
Run contract tests: pytest test/bench/test_latency_contracts.py --benchmark-disable
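When the suite is run with the commands above, the report_llm_call() benchmark is skipped if the native module has not been built. One way such a guard could be written is sketched below. The module name `hypothetical_native_core` and the helper `native_module_available` are assumptions for illustration; the PR does not show its actual skip logic.

```python
import importlib.util

# Hypothetical name: the PR does not state what the PyO3 extension is called.
NATIVE_MODULE = "hypothetical_native_core"


def native_module_available(name=NATIVE_MODULE):
    """Return True when `name` can be located, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised for a missing parent package or a module with a broken spec.
        return False


# In a pytest module this flag could feed a skip marker, for example:
#   pytestmark = pytest.mark.skipif(
#       not native_module_available(), reason="native module not built"
#   )
print("native module built:", native_module_available())
```

pytest.importorskip() is a shorter alternative when importing the module at collection time is acceptable.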

Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Comments added for complex logic
  • Documentation updated if needed
  • All tests passing

Chisanan232 and others added 12 commits May 1, 2026 18:03
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov

codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Chisanan232 and others added 3 commits May 1, 2026 18:44
Benchmark the governance interception overhead on each tool/function
call when hooks are active (the hot path). Covers all 6 adapters:
CrewAI, LangChain, LangGraph, Pydantic AI, OpenAI Agents, MCP.

Addresses AAASM-195 AC1: per-call overhead of each framework adapter hook.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace register/unregister cycle measurement with actual per-call
patched function overhead for the <2ms P99 contract. Each adapter
now benchmarks its real hot path: CrewAI BaseTool.run(), LangChain
callback dispatch, LangGraph wrapped node, and async adapters
(Pydantic AI, OpenAI Agents, MCP) measured inside event loops.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Document patched-call benchmark results for all 6 adapters.
Sync adapters ~1-2us, async adapters ~30-40us (includes event-loop
scheduling overhead from benchmark harness). All well under 2ms.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
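The commit notes above attribute part of the async adapter figures to event-loop scheduling in the benchmark harness. The sketch below illustrates that effect under stated assumptions: `hooked_async_call` is a hypothetical stand-in for a patched async adapter call, and driving each call through `loop.run_until_complete()` is only a guess at how a synchronous harness might invoke coroutines, not the PR's actual setup.

```python
import asyncio
import time


async def hooked_async_call():
    """Hypothetical stand-in for a patched async adapter call."""
    return 42


async def time_inside_loop(iterations):
    """Await the call from within a running loop; only the await is timed."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        await hooked_async_call()
        samples.append(time.perf_counter_ns() - start)
    return samples


def time_via_sync_harness(iterations):
    """Drive each call from synchronous code, so loop scheduling is timed too."""
    loop = asyncio.new_event_loop()
    samples = []
    try:
        for _ in range(iterations):
            start = time.perf_counter_ns()
            loop.run_until_complete(hooked_async_call())
            samples.append(time.perf_counter_ns() - start)
    finally:
        loop.close()
    return samples


def median(values):
    """Middle element of the sorted samples (upper median for even counts)."""
    return sorted(values)[len(values) // 2]


in_loop = asyncio.run(time_inside_loop(1_000))
via_harness = time_via_sync_harness(1_000)
print(f"in-loop median: {median(in_loop)} ns")
print(f"sync-harness median: {median(via_harness)} ns")
```

Comparing the two medians on a given machine gives a rough sense of how much of an async figure is harness overhead rather than hook overhead.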
@sonarqubecloud

sonarqubecloud Bot commented May 1, 2026

@Chisanan232 Chisanan232 merged commit 330a813 into master May 1, 2026
23 checks passed
@Chisanan232 Chisanan232 deleted the v0.0.0/AAASM-195/add_python_benchmarks branch May 1, 2026 10:49