
feat(cpp): VLM image support in C++ SDK #858

Merged
itomek merged 4 commits into main from feat/issue-785-cpp-vlm-support
Apr 24, 2026

Conversation

@itomek
Collaborator

@itomek itomek commented Apr 24, 2026

Closes #785

Changes

  • Adds gaia::Image with fromBytes / fromFile factories, RFC 4648 base64 encoding, magic-byte MIME detection (PNG/JPEG/GIF/WebP/BMP), 20 MiB size cap, O_NOFOLLOW + post-open fstat TOCTOU guard on POSIX, and a whitelist enforcing only the five supported MIME types
  • Adds gaia::ContentPart (text / image_url parts with toJson() producing the OpenAI vision wire format)
  • Extends gaia::Message with an additive std::optional<std::vector<ContentPart>> parts field; toJson() dispatches to array or string accordingly — fully backward-compatible with existing aggregate-init sites
  • Adds two new processQuery overloads (string + vector<Image> ergonomic; vector<Message> caller-composed) unified through a private processQueryInternal that is the sole writer of conversationHistory_
  • Image parts are stripped from history at end-of-turn (base64 never retained across calls)
  • RAII InFlightGuard via std::atomic<bool> inFlight_ and compare_exchange_strong — concurrent processQuery calls on the same Agent throw std::runtime_error
  • Empty-input validation fires before ensureModelLoaded so no /load fires on invalid input
  • Lifts cpp/benchmarks/mock_llm_server.h to cpp/tests/support/mock_llm_server.h and extends it with receivedBodies(), loadRequestCount(), holdNextResponse(), and a reportModelLoaded constructor flag; benchmark header is now a thin shim
  • Adds cpp/examples/vlm_agent.cpp end-to-end demo
  • Adds cpp/tests/integration/test_integration_vlm.cpp (3 tests: live Lemonade VLM smoke, messages-list overload, ctx-overflow error surface) with Lemonade version-pin probe via GAIA_PINNED_LEMONADE_VERSION
  • Updates docs/cpp/api-reference.mdx, docs/cpp/overview.mdx, docs/cpp/quickstart.mdx with VLM section, new overloads, thread-safety update, and example invocation
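
The base64 step and the data: URI it feeds can be illustrated with a minimal sketch. This is not the SDK's code: `base64EncodeSketch` and `toDataUriSketch` are hypothetical names, and the only things taken from the description above are the RFC 4648 standard alphabet, `=` padding, and the fact that images travel as `data:` URIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not the SDK's base64Encode): RFC 4648 standard
// alphabet with '=' padding.
std::string base64EncodeSketch(const std::vector<std::uint8_t>& in) {
    static const char kAlphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((in.size() + 2) / 3) * 4);

    std::size_t i = 0;
    // Every full 3-byte group becomes 4 output characters.
    for (; i + 3 <= in.size(); i += 3) {
        const std::uint32_t v = (static_cast<std::uint32_t>(in[i]) << 16) |
                                (static_cast<std::uint32_t>(in[i + 1]) << 8) |
                                static_cast<std::uint32_t>(in[i + 2]);
        out.push_back(kAlphabet[(v >> 18) & 0x3F]);
        out.push_back(kAlphabet[(v >> 12) & 0x3F]);
        out.push_back(kAlphabet[(v >> 6) & 0x3F]);
        out.push_back(kAlphabet[v & 0x3F]);
    }

    // A trailing 1- or 2-byte group is padded to 4 characters with '='.
    const std::size_t rem = in.size() - i;
    if (rem == 1) {
        const std::uint32_t v = static_cast<std::uint32_t>(in[i]) << 16;
        out.push_back(kAlphabet[(v >> 18) & 0x3F]);
        out.push_back(kAlphabet[(v >> 12) & 0x3F]);
        out += "==";
    } else if (rem == 2) {
        const std::uint32_t v = (static_cast<std::uint32_t>(in[i]) << 16) |
                                (static_cast<std::uint32_t>(in[i + 1]) << 8);
        out.push_back(kAlphabet[(v >> 18) & 0x3F]);
        out.push_back(kAlphabet[(v >> 12) & 0x3F]);
        out.push_back(kAlphabet[(v >> 6) & 0x3F]);
        out.push_back('=');
    }
    return out;
}

// Hypothetical helper: wraps encoded bytes in a data: URI for an image_url part.
std::string toDataUriSketch(const std::string& mime,
                            const std::vector<std::uint8_t>& bytes) {
    return "data:" + mime + ";base64," + base64EncodeSketch(bytes);
}
```

The padding branches are where hand-rolled encoders usually go wrong, so the RFC 4648 test vectors ("f", "fo", "foo", and so on) are a convenient check for any implementation of this shape.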

Test coverage

| Layer | Tests | Status |
|---|---|---|
| Unit (MIME / base64 / Image) | 20 tests in test_image.cpp | ✅ 331/331 pass |
| Unit (ContentPart / Message) | 15 new tests in test_types.cpp | |
| Agent-level (mock HTTP) | 13 tests in test_agent_vlm.cpp | |
| Integration (live Lemonade) | 3 tests in test_integration_vlm.cpp | gated, opt-in |

Reviewer notes

  • The Message::parts field is additive — all existing code compiles unchanged; consumers linked against a prebuilt gaia_core must rebuild (noted in docs)
  • detectImageMimeType returns "" (empty string) for ≥ 12-byte buffers with unrecognized magic; returns "image/png" only for null/short buffers (AC-15e safe-fallback contract)
  • Integration tests require -DGAIA_BUILD_INTEGRATION_TESTS=ON and a live Lemonade server with Qwen3-VL-4B-Instruct-GGUF; they are non-blocking in CI
  • The pre-existing errorCount unused-variable warning in agent.cpp:692 is not introduced by this PR
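
The detection contract in the second note can be sketched as follows. This is a hedged illustration only: `detectMimeSketch` is a hypothetical name, the order of the checks is an assumption, and the magic numbers are the publicly documented signatures for the five formats rather than anything copied from types.cpp.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for detectImageMimeType, following the contract
// described above: "image/png" for null/short buffers, "" for buffers of
// 12 bytes or more with unrecognized magic.
std::string detectMimeSketch(const unsigned char* data, std::size_t size) {
    // Short-buffer fallback; also prevents out-of-bounds reads below.
    if (data == nullptr || size < 12) return "image/png";

    static const unsigned char kPng[8] = {0x89, 'P', 'N', 'G',
                                          0x0D, 0x0A, 0x1A, 0x0A};
    if (std::memcmp(data, kPng, sizeof kPng) == 0) return "image/png";

    if (data[0] == 0xFF && data[1] == 0xD8 && data[2] == 0xFF) return "image/jpeg";

    if (std::memcmp(data, "GIF87a", 6) == 0 || std::memcmp(data, "GIF89a", 6) == 0)
        return "image/gif";

    // WebP is a RIFF container: "RIFF" at offset 0 and "WEBP" at offset 8,
    // which is why 12 bytes is the minimum useful buffer size.
    if (std::memcmp(data, "RIFF", 4) == 0 && std::memcmp(data + 8, "WEBP", 4) == 0)
        return "image/webp";

    // Two-byte magic, so this check is prone to false positives.
    if (data[0] == 'B' && data[1] == 'M') return "image/bmp";

    return "";  // Unrecognized: callers are expected to reject the input.
}
```

Under this contract a caller treats an empty result as "reject", which is how the factories are described as behaving.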

itomek added 3 commits April 24, 2026 09:41
Adds gaia::Image, ContentPart, and two new processQuery overloads
(text+images ergonomic, vector<Message>), so all three public overloads
are unified through a private processQueryInternal. All overloads share
stateful history (image parts stripped after each turn), a RAII
re-entrancy guard, and validate-before-load semantics.

New types: Image::fromBytes/fromFile with magic-byte MIME detection,
O_NOFOLLOW + post-open fstat TOCTOU guard, 20 MiB cap. base64Encode is
RFC 4648 standard alphabet with correct padding. Message gains an
additive std::optional<vector<ContentPart>> parts field (toJson
dispatches array vs string accordingly, defined out-of-line in types.cpp).

Test additions:
- cpp/tests/test_image.cpp (18 tests: base64, Image fromBytes/fromFile)
- cpp/tests/test_agent_vlm.cpp (13 agent-level mock-server tests)
- cpp/tests/test_types.cpp (15 new cases: MIME detection, ContentPart,
  Message backward-compat, fromUser)
- cpp/tests/support/mock_llm_server.h (lifted + extended: receivedBodies,
  loadRequestCount, holdNextResponse; benchmark shim updated)
- cpp/tests/integration/test_integration_vlm.cpp (3 tests: live Lemonade
  VLM smoke, messages-list overload, ctx-overflow error surface; AC-23
  version-pin probe via GAIA_PINNED_LEMONADE_VERSION)

Examples: cpp/examples/vlm_agent.cpp
Docs: api-reference.mdx VLM section, overview/quickstart mentions

Closes #785
- detectImageMimeType: return empty string for unrecognized full-sized
  buffers; callers (fromBytes, fromFile) now throw invalid_argument with
  a clear message instead of silently mislabeling data as image/png
- image.cpp: wrap POSIX fd in FdGuard RAII so it is closed even when
  std::vector allocation throws std::bad_alloc after the 20 MiB cap check
- processQueryInternal: tighten empty-input check to reject parts vectors
  containing only empty-text ContentParts (not just an empty vector)
- mock_llm_server.h: store receivedBodies_ before incrementing requestCount_
  so observers polling requestCount() see body already available
- summarizeUserInput: remove unsafe last-element fallback that could use
  non-USER message content; fall back to first non-empty content instead
- test_integration_vlm.cpp: narrow catch(…) in envInt to only catch
  invalid_argument and out_of_range from std::stoi
…-magic test

- types.h: update doc comment for detectImageMimeType to accurately
  reflect that full-sized (>= 12 byte) buffers with unrecognized magic
  now return empty string (not "image/png")
- test_image.cpp: add ImageFromBytesAutoDetectUnrecognizedThrows to
  cover the new throw path for unrecognized magic without explicit MIME
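
The POSIX read path described in these commits (O_NOFOLLOW, post-open fstat, size cap, RAII close) might look roughly like the sketch below. It is an illustration under assumptions, not the contents of image.cpp: `FdGuardSketch` and `readRegularFileSketch` are hypothetical names, and throwing `std::runtime_error` is a placeholder for whatever exception types the SDK actually uses.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical RAII wrapper: closes the descriptor on every exit path,
// including an exception thrown while the buffer is being allocated.
class FdGuardSketch {
public:
    explicit FdGuardSketch(int fd) : fd_(fd) {}
    ~FdGuardSketch() { if (fd_ >= 0) ::close(fd_); }
    FdGuardSketch(const FdGuardSketch&) = delete;
    FdGuardSketch& operator=(const FdGuardSketch&) = delete;
    int get() const { return fd_; }
private:
    int fd_;
};

std::vector<std::uint8_t> readRegularFileSketch(const std::string& path,
                                                std::size_t maxBytes) {
    // O_NOFOLLOW makes open() fail if the final path component is a symlink.
    const int fd = ::open(path.c_str(), O_RDONLY | O_NOFOLLOW);
    if (fd < 0) throw std::runtime_error("cannot open " + path);
    FdGuardSketch guard(fd);

    // fstat() on the open descriptor checks the file we actually hold,
    // not whatever the path points at by now.
    struct stat st {};
    if (::fstat(guard.get(), &st) != 0 || !S_ISREG(st.st_mode))
        throw std::runtime_error("not a regular file: " + path);
    if (st.st_size < 0 || static_cast<std::uintmax_t>(st.st_size) > maxBytes)
        throw std::runtime_error("file exceeds size cap: " + path);

    // May throw std::bad_alloc; the guard still closes the descriptor.
    std::vector<std::uint8_t> bytes(static_cast<std::size_t>(st.st_size));
    std::size_t done = 0;
    while (done < bytes.size()) {
        const ssize_t n = ::read(guard.get(), bytes.data() + done, bytes.size() - done);
        if (n < 0) {
            if (errno == EINTR) continue;  // Interrupted: retry the read.
            throw std::runtime_error("read failed: " + path);
        }
        if (n == 0) break;  // File shrank after fstat.
        done += static_cast<std::size_t>(n);
    }
    bytes.resize(done);
    return bytes;
}
```

Checking the size before allocating is what keeps the allocation bounded by the cap; the guard covers the remaining failure mode, where the allocation itself throws.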
@github-actions github-actions Bot added the documentation (Documentation changes) and cpp labels Apr 24, 2026
@itomek
Collaborator Author

itomek commented Apr 24, 2026

@claude review this pull request

@github-actions
Contributor

PR Review — VLM image support in C++ SDK

Reviewed the diff (~2350 lines, 19 files). This is solid, well-scoped work. Highlights first, then a handful of concerns.

🟢 Strengths

  • Clean layering. processQueryInternal as the single owner of concurrency guard, validation, history prepend, LLM loop, and end-of-turn write is exactly right. All three public processQuery overloads delegate as their first action — no partial write paths (cpp/src/agent.cpp:483-502, cpp/src/agent.cpp:573-597).
  • RAII concurrency guard. InFlightGuard + compare_exchange_strong on std::atomic<bool> inFlight_ is the right primitive (cpp/src/agent.cpp:506-520). Test at cpp/tests/test_agent_vlm.cpp:1748 exercises it deterministically via holdNextResponse, which is a nice upgrade to the mock server.
  • POSIX file hardening. O_NOFOLLOW + post-open fstat re-check of S_ISREG is defense-in-depth against TOCTOU. FdGuard closes even on bad_alloc from the vector ctor — nicely thought through (cpp/src/image.cpp:763-798).
  • Additive Message::parts. std::optional<std::vector<ContentPart>> keeps existing aggregate-init sites compiling unchanged. Out-of-line toJson() definition in types.cpp avoids ODR hazards for prebuilt-linked consumers — good call.
  • History contract. stripImageParts at end-of-turn guarantees base64 data URIs never accumulate across turns. AgentVlmTest.ProcessQueryWithImagesStripsAcrossThreeTurns asserts via turn3.find("data:image/") == npos, which is a robust invariant to lock down.
  • Mock server lift. Moving mock_llm_server.h to tests/support/ with a thin re-include shim under benchmarks/ is the right direction. The reportModelLoaded flag, loadRequestCount(), receivedBodies(), and holdNextResponse additions are all test-driven and keep the benchmark surface unchanged.
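
For readers unfamiliar with the pattern, a re-entrancy guard of the kind praised above can be sketched in a few lines. This is a hedged illustration, not the code in agent.cpp: `InFlightGuardSketch` and `AgentSketch` are hypothetical, and only the `std::atomic<bool>` plus `compare_exchange_strong` mechanism and the `std::runtime_error` on concurrent entry come from the PR description.

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical guard: claims the flag in the constructor, releases it in
// the destructor. If the claim fails the constructor throws, so the
// destructor never runs and the current owner's flag is left untouched.
class InFlightGuardSketch {
public:
    explicit InFlightGuardSketch(std::atomic<bool>& flag) : flag_(flag) {
        bool expected = false;
        if (!flag_.compare_exchange_strong(expected, true)) {
            throw std::runtime_error("processQuery already in flight on this Agent");
        }
    }
    ~InFlightGuardSketch() { flag_.store(false); }
    InFlightGuardSketch(const InFlightGuardSketch&) = delete;
    InFlightGuardSketch& operator=(const InFlightGuardSketch&) = delete;
private:
    std::atomic<bool>& flag_;
};

// Hypothetical stand-in for an agent with a single guarded entry point.
struct AgentSketch {
    std::atomic<bool> inFlight{false};

    int run(bool fail) {
        InFlightGuardSketch guard(inFlight);  // Released on every exit path.
        if (fail) throw std::runtime_error("simulated downstream failure");
        return 42;
    }
};
```

The property worth testing is the one the review calls out: after a throwing call, the flag is clear again and the next call proceeds.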

🟡 Concerns

1. detectImageMimeType short-buffer contract is surprising. Returning "image/png" for nullptr or size < 12 (cpp/src/types.cpp:862-868) is a "safe fallback" that silently mislabels arbitrary bytes as PNG. The MIME whitelist later accepts image/png, so a caller who fed 11 bytes of random data through Image::fromBytes with that result would get a nominally-valid Image — except fromBytes itself rejects the short buffer path because it calls detectImageMimeType only when mimeType is empty and non-empty bytes are present. Net: the contract only saves you from OOB reads; it cannot actually produce a bad Image today because Image::fromBytes gates on bytes.empty(). The docstring says "AC-15e safe-fallback contract" but doesn't explain why returning "image/png" is safer than returning "" and letting callers decide. Consider returning "" for short buffers too — the Image::fromBytes / Image::fromFile paths already throw when detection yields empty, and the current asymmetry is the kind of thing that drifts over time. Not a blocker.

2. Pin-check "non-fatal" is actually a test failure. expectPinnedLemonadeVersion() at cpp/tests/integration/test_integration_vlm.cpp:1063-1090 says "Non-fatal ... does not skip the test body, so the VLM contract still runs." But GTEST_MESSAGE_(... kNonFatalFailure) marks the test as failed (it's the primitive EXPECT_* uses). If the Lemonade server omits the version field, every VLM integration test will fail the pin check even though the VLM body ran fine. Intent seems to be "log + continue" — use GTEST_SKIP_("...") (skips cleanly) or just a std::cerr log + SUCCEED() if you want it purely informational. Same for the catch-all throw branch. 🟡 worth tightening before this lands in CI.

3. Message::fromUser sets m.content = text as an aside. cpp/src/types.cpp:959 stores text into .content for "callers that read .content" even though .parts is the authoritative form. stripImageParts in agent.cpp:525-537 later overwrites that with "\n"-joined text parts when it drains .parts. For the common single-text+images case these are identical, so there's no observable bug today — but a caller who constructs a Message with multiple TEXT parts via fromUser won't hit that code path (it only accepts one text string), so the divergence is latent. Either (a) don't set .content at all in fromUser (make .parts the sole source of truth when present), or (b) drop the redundant assignment and rely on stripImageParts for retention. Minor — comment even flags it ("retained for callers that read .content; JSON uses parts"). 🟢

🟢 Minor nits

  • BMP magic ('B','M') has notorious false-positive potential (any text starting "BM..."). Acceptable trade-off given the whitelist is tight and Image::fromBytes + fromFile both validate size/MIME downstream, but worth a one-line comment in detectImageMimeType.
  • #define GAIA_MAX_IMAGE_BYTES works, but constexpr std::size_t kMaxImageBytes = ... (with macro override fallthrough) reads more like modern C++. Not worth a re-roll.
  • cpp/tests/fixtures/tiny.png and tiny.jpg are binary — make sure they're .gitattributes'd or otherwise excluded from text filters if the repo has any; otherwise LF/CRLF munging could corrupt them.

Test plan verification

  • Unit coverage is thorough: 20 tests in test_image.cpp, 15 new cases in test_types.cpp, 13 agent-level tests in test_agent_vlm.cpp. The backward-compat test (ProcessQueryTextOnlyParsedJsonEqualsBaseline) locks down the string-content wire format — exactly the regression surface I'd worry about.
  • Integration tests gated behind -DGAIA_BUILD_INTEGRATION_TESTS=ON — good, non-blocking in CI as described.

Summary

Recommend merge after addressing (2) — the pin-check will flake CI otherwise. (1) and (3) are improvements worth a follow-up but not blocking. Nice work on the InFlightGuard and the history-stripping contract.

- test_integration_vlm.cpp: replace GTEST_MESSAGE_(kNonFatalFailure)
  in expectPinnedLemonadeVersion() with std::cerr + SUCCEED() so missing
  version field or /system-info probe failure is purely informational and
  does not mark VLM tests as failed; real version mismatches still fail
  via EXPECT_EQ (blocking concern from review)
- types.cpp: expand detectImageMimeType short-buffer comment to explain
  why "image/png" fallback exists for null/<12 byte buffers (AC-15e
  contract, test fixtures) vs empty-string return for full-sized
  unrecognized magic (non-blocking concern from review)
- types.cpp: add BMP "BM" false-positive trade-off comment (minor nit)
- .gitattributes: mark cpp/tests/fixtures/*.png and *.jpg as binary to
  prevent LF/CRLF text-filter corruption (minor nit)
@itomek
Collaborator Author

itomek commented Apr 24, 2026

Thanks for the thorough review! All concerns addressed in 75e7a52:

🟡 Issue 2 (blocking) — pin-check non-fatal flake
Fixed. GTEST_MESSAGE_(kNonFatalFailure) replaced with std::cerr + SUCCEED() throughout expectPinnedLemonadeVersion(). Missing version field or /system-info probe failure now logs [VLM version pin]-prefixed output to stderr and continues without marking the test failed. Real mismatches (version field present but wrong) still fail via EXPECT_EQ.

🟡 Issue 1 (non-blocking) — short-buffer fallback rationale
Added an expanded comment in detectImageMimeType explaining the asymmetry: the "image/png" fallback for nullptr/size < 12 is the AC-15e contract (test fixtures call this directly with 1/5/11-byte stubs), and neither Image::fromBytes nor Image::fromFile can produce a mislabeled Image through this path in practice. Full-sized unrecognized buffers continue to return "" causing callers to throw. The contract is now self-documenting.

🟢 Minor nits

  • BMP "BM" false-positive trade-off comment added.
  • .gitattributes created with cpp/tests/fixtures/*.png binary and *.jpg binary to guard against LF/CRLF corruption.

Issue 3 (Message::fromUser setting .content alongside .parts) — left as-is with the existing comment // retained for callers that read .content; JSON uses parts. The redundancy is intentional for backward compat and stripImageParts already handles the authoritative form; removing it would be a separate cleanup. Happy to address in a follow-up if preferred.

331/331 unit tests pass; integration target builds clean.

@itomek
Collaborator Author

itomek commented Apr 24, 2026

@claude review this PR

@itomek itomek self-assigned this Apr 24, 2026
@itomek itomek marked this pull request as ready for review April 24, 2026 15:41
@itomek itomek requested a review from kovtcharov-amd as a code owner April 24, 2026 15:41
@github-actions
Contributor

PR Review (follow-up after 75e7a52) — VLM image support in C++ SDK

Re-reviewed with focus on the fixups in 75e7a52 and a fresh pass over the full 20-file diff. Verified against pr-diff.txt and the GitHub API (mergeable: true, 20 changed files, +1898/−164).

✅ Prior review concerns — all addressed correctly

  • 🟡 Pin-check flake (blocking) — Fixed cleanly. expectPinnedLemonadeVersion() at cpp/tests/integration/test_integration_vlm.cpp:1080-1110 now routes missing-version-field and probe-failure paths through std::cerr + SUCCEED(), so informational cases no longer mark the test failed. Real mismatches still fail via EXPECT_EQ. The <iostream> include was added in the same commit — compiles clean. One small upside of the change I hadn't anticipated: because these paths now report success, a real failure from the actual VLM call downstream is no longer masked by an earlier pin-check non-fatal failure mixed into the same test result. Strictly better than GTEST_SKIP_.
  • 🟡 detectImageMimeType short-buffer rationale — Expanded comment at cpp/src/types.cpp:871-879 clearly explains that the "image/png" return is the AC-15e contract for header-stub tests, and that Image::fromBytes/Image::fromFile can't produce a mislabeled Image through this path in practice (gated by bytes.empty() / full-file read upstream). Self-documenting, no behavior change.
  • 🟢 BMP false-positive — Two-line comment at cpp/src/types.cpp:902-904 explaining the trade-off. Correct.
  • 🟢 Binary fixtures — New .gitattributes at repo root pins cpp/tests/fixtures/*.{png,jpg} as binary. Prevents LF/CRLF corruption. No existing .gitattributes to conflict with.
  • 🟢 Issue 3 (Message::fromUser retaining .content): explicitly deferred with clear reasoning. Fine as-is; not a correctness issue.

🟢 Fresh observations (non-blocking)

  1. InFlightGuard lifetime is correct under ensureModelLoaded throw. Walked the code — guard is acquired after empty-input validation and covers ensureModelLoaded, the LLM loop, and the history write. If ensureModelLoaded throws (server unreachable), the guard destructor resets inFlight_ and the next processQuery on the same Agent can proceed. RAII semantics hold on every exit path. Confirmed by ProcessQueryConcurrentEntryThrows (cpp/tests/test_agent_vlm.cpp:1769) — the test deterministically gates on mock.requestCount() >= 1 (i.e. thread 1 is already inside the handler, holding the flag) before the second processQuery call.
  2. Empty-input validation logic handles the edge case I was curious about. A Message with empty content and no parts keeps anyNonEmpty=false and throws std::invalid_argument before any HTTP call. cpp/tests/test_agent_vlm.cpp:1657 (ProcessQueryAllEmptyUserMessagesThrows) locks this down, and also asserts loadRequestCount()==0 — i.e. no spurious /load. Exactly what you want.
  3. stripImageParts is idempotent across turns. ProcessQueryWithImagesStripsAcrossThreeTurns + ProcessQueryInternalOwnsAllHistoryWrites (cpp/tests/test_agent_vlm.cpp:1704, 1724) together assert no base64 data URI appears in turn 2 or turn 3 outbound bodies. Good regression surface.
  4. Minor doc suggestion. docs/cpp/api-reference.mdx:369 reads "recommended minimum 32768" and "Smaller values (e.g. 2048) will surface a raw server error". The integration test ContextOverflowSurfacesError (cpp/tests/integration/test_integration_vlm.cpp:1182) intentionally tolerates the no-throw case because newer Lemonade builds may auto-grow ctx. Docs currently say "will surface" — consider softening to "may surface" to match the test's reality. Micro nit.
  5. Observed in passing: cpp/benchmarks/mock_llm_server.h is correctly reduced to a thin #include "../tests/support/mock_llm_server.h" shim — so benchmark sources continue compiling without change while unit tests get the new receivedBodies() / loadRequestCount() / holdNextResponse() / reportModelLoaded ctor flag. Clean lift.
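
The end-of-turn stripping from observation 3 can be sketched as below. The types and the function are hypothetical simplifications (`PartSketch`, `MessageSketch`, `stripImagePartsSketch`); the behaviors taken from this thread are that image parts are dropped, text parts are joined with "\n" into `.content`, and the operation is safe to repeat.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for ContentPart and Message.
struct PartSketch {
    bool isImage;       // true: value holds a data: URI; false: plain text
    std::string value;
};

struct MessageSketch {
    std::string role;
    std::string content;
    std::optional<std::vector<PartSketch>> parts;
};

// Drops image parts, folds the remaining text parts into .content
// (newline-joined), and clears .parts so no base64 payload is retained.
void stripImagePartsSketch(MessageSketch& m) {
    if (!m.parts) return;  // Already plain text, so a second call is a no-op.

    std::string joined;
    bool first = true;
    for (const PartSketch& p : *m.parts) {
        if (p.isImage) continue;
        if (!first) joined += '\n';
        first = false;
        joined += p.value;
    }
    m.content = joined;
    m.parts.reset();
}
```

The same substring check the tests use, searching the stored content for "data:image/", works against this sketch as well.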

Test coverage check

  • tests_mock now builds with httplib + GAIA_TEST_FIXTURES_DIR (CMakeLists.txt:48-63) — correctly conditioned on httplib_FOUND with fallback include dir. OpenSSL pulled in when present.
  • Unit (20 image tests + 15 types tests + 13 agent-VLM tests) all pass per author report; the integration target builds with -DGAIA_BUILD_INTEGRATION_TESTS=ON and is non-blocking in CI. Structure matches the AC grid in the PR description.

Recommendation

Ready to merge. Blocking concern is fully resolved; remaining items are cosmetic/future-improvement-class. Nice work on the InFlightGuard + end-of-turn stripImageParts contract — those two together give you genuinely clean turn isolation without relying on caller discipline.

Review by automated @claude responder. For human sign-off, a maintainer still needs to approve.

@github-actions
Contributor

Summary

Solid, cohesive VLM implementation for the C++ SDK. The refactor unifies all processQuery overloads through a single private processQueryInternal that is the sole writer of conversationHistory_ — a clean architectural improvement that will pay for itself well beyond this PR. Security hardening on Image::fromFile (POSIX O_NOFOLLOW + post-open fstat TOCTOU guard, 20 MiB cap, MIME whitelist, regular-file pre/post check) is the right bar for a data: URI producer. Test coverage is thorough across all three layers (type/encode, agent-with-mock, live-Lemonade integration) and the mock-server lift-and-extend from benchmarks/ to tests/support/ is well-motivated.

The single most important thing to flag: there are no 🔴 issues, and only a handful of 🟢 minors worth considering. Approve with suggestions.

Issues Found

🟢 Minor

1. Anonymous-namespace helpers could insert instead of loop-push (cpp/src/agent.cpp:681-684)

Small perf nit — micro but essentially free:

    // Append caller-supplied user messages verbatim (may contain image parts).
    messages.insert(messages.end(), userMessages.begin(), userMessages.end());

2. namespace bench kept in cpp/tests/support/mock_llm_server.h (cpp/tests/support/mock_llm_server.h:24)

The mock now serves unit tests AND benchmarks, but the enclosing namespace is still bench. Keeping it preserves the benchmark's existing include path (a stated goal), so leaving it is defensible — just worth a TODO for a future rename to gaia::test (or gaia::mock) when there's an opportunity to touch benchmark sources in the same pass. Not blocking.

3. expectPinnedLemonadeVersion silently SUCCEEDs when no version field is found (cpp/tests/integration/test_integration_vlm.cpp:1093-1099)

When /system-info returns none of version/server_version/lemonade_version, the helper logs "informational only" and returns. This is arguably soft-fallback behavior in a version-pin guard. For a best-effort, opt-in integration test this is defensible (CLAUDE.md's no-silent-fallback rule targets production paths), but if the intent of AC-23 is "fail loudly on drift," consider making the missing field case a FAIL() — or at minimum log at a higher-visibility level. Keep the try/catch around the HTTP call itself (that's a genuine "server not reachable ⇒ skip" case). Leaving as informational is also acceptable given the comment; flagging for your call.

4. MIME detection fallback semantics are subtle (cpp/src/types.cpp:870-882)

detectImageMimeType returning "image/png" for null/short buffers and "" for full-sized unrecognized buffers is an AC-15e test contract that is well-documented at the call site and (as the comment notes) unreachable from the public factories. The asymmetry is still a sharp edge for future readers. Not an action item — the inline justification is one of the better comments in the PR — but consider whether a single return value (either always "" or always "image/png" for the safe-fallback case) would be simpler in a future iteration.

5. silentMode = false (cpp/examples/vlm_agent.cpp:24)

Most other cpp/examples/*.cpp default to silent or don't set it. Not wrong (the example shows the agent's own output), just worth noting for consistency. Leave as-is if intentional.

Strengths

  • Single-writer invariant for conversationHistory_. processQueryInternal owns it end-to-end (validation, guard, history prepend, LLM loop, end-of-turn strip-and-store). The comment at cpp/src/agent.cpp:492-493 ("Public overloads delegate ... as their FIRST action. No partial delegation.") is exactly the right rule to commit to.
  • Security posture on Image::fromFile. O_NOFOLLOW + post-open fstat TOCTOU guard + MIME whitelist + 20 MiB cap + regular-file check + RAII FdGuard is a textbook-correct path for ingesting user-supplied files. Bonus for flagging the std::bad_alloc case in the comment at cpp/src/image.cpp:798-799.
  • Deterministic concurrent-entry test. The holdNextResponse(std::shared_future<void>) mechanism in MockLlmServer plus compare_exchange_strong on inFlight_ lets ProcessQueryConcurrentEntryThrows test re-entrancy without sleeps/races. Clean and reusable.
  • End-of-turn image stripping is proven by test, not just asserted in docs — ProcessQueryWithImagesStripsAcrossThreeTurns and ProcessQueryInternalOwnsAllHistoryWrites check the turn-2 body for data:image/ as a substring. That's the right granularity.
  • Backward compatibility is mechanically verified: ProcessQueryTextOnlyParsedJsonEqualsBaseline asserts the text-only path still emits content as a string, which catches any accidental regression to array-always.
  • Binary fixtures via .gitattributes — the *.png binary / *.jpg binary entries prevent LF/CRLF mangling on Windows clones. Small detail, easy to miss, done right.
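
The string-versus-array dispatch that the backward-compatibility test protects can be illustrated with a small serializer. This is a sketch under assumptions: `WirePart`, `WireMessage`, `jsonQuoteSketch`, and `toJsonSketch` are hypothetical, the JSON is assembled by hand to stay dependency-free (the SDK presumably uses a JSON library), and the object shapes follow the OpenAI chat-completions vision format.

```cpp
#include <cstdio>
#include <optional>
#include <string>
#include <vector>

// Hypothetical minimal JSON string quoting (quotes, backslashes, controls).
std::string jsonQuoteSketch(const std::string& s) {
    std::string out = "\"";
    for (unsigned char c : s) {
        if (c == '"') out += "\\\"";
        else if (c == '\\') out += "\\\\";
        else if (c == '\n') out += "\\n";
        else if (c < 0x20) {
            char buf[7];
            std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
            out += buf;
        } else out.push_back(static_cast<char>(c));
    }
    out += '"';
    return out;
}

enum class WireKind { Text, ImageUrl };

struct WirePart {
    WireKind kind;
    std::string value;  // text, or a data: URI for ImageUrl
};

struct WireMessage {
    std::string role;
    std::string content;
    std::optional<std::vector<WirePart>> parts;
};

// No parts: content stays a plain JSON string (the pre-VLM wire format).
// With parts: content becomes an array of typed part objects.
std::string toJsonSketch(const WireMessage& m) {
    std::string out = "{\"role\":" + jsonQuoteSketch(m.role) + ",\"content\":";
    if (!m.parts) return out + jsonQuoteSketch(m.content) + "}";

    out += "[";
    bool first = true;
    for (const WirePart& p : *m.parts) {
        if (!first) out += ",";
        first = false;
        if (p.kind == WireKind::Text) {
            out += "{\"type\":\"text\",\"text\":" + jsonQuoteSketch(p.value) + "}";
        } else {
            out += "{\"type\":\"image_url\",\"image_url\":{\"url\":" +
                   jsonQuoteSketch(p.value) + "}}";
        }
    }
    return out + "]}";
}
```

Keeping the no-parts branch byte-for-byte identical to the old output is what lets a baseline-equality test catch a regression to array-always.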

Verdict

Approve with suggestions. All 🟢 items above are optional polish — none block merge. The architecture work (single-writer processQueryInternal), the security work on Image::fromFile, and the layered test coverage together make this a high-quality addition. Nice closer on #785.

@itomek itomek added this pull request to the merge queue Apr 24, 2026
Merged via the queue into main with commit c677a91 Apr 24, 2026
44 of 46 checks passed
@itomek itomek deleted the feat/issue-785-cpp-vlm-support branch April 24, 2026 17:16
pull Bot pushed a commit to bhardwajRahul/gaia that referenced this pull request Apr 28, 2026
## Summary

Bump the regression threshold for `memory_per_step_growth_kb` from the
generic 15% to 75%, so legitimate single-digit-KB feature growth doesn't
false-fail the C++ Benchmarks gate. This is the only metric at a
single-digit-KB scale, where ordinary feature work produces large
*percent* swings on small *absolute* numbers.

## Root cause

`memory_per_step_growth_kb` measures KB of RSS growth per agent loop
step. Pre-VLM main was ~7.2 KB/step. PR amd#858 (VLM image support, merged
2026-04-24) added the `Image` type, content-block JSON parsing, and new
per-message storage — bumping the metric to ~11.4 KB/step. That's +4 KB
absolute, a perfectly reasonable cost for a major new feature, but it
reads as +58% on the percent scale, well past the 15% generic threshold.

The cached baseline never refreshed because the workflow only saves a
new baseline on a successful main push (`if: github.ref ==
'refs/heads/main' && github.event_name == 'push'`), and every main push
since amd#858 has either failed (this same gate) or been cancelled. So
**every PR that touches C++ keeps inheriting the stale pre-VLM
baseline** and failing on this one metric.

## Why a threshold change, not a baseline bump

A baseline bump alone is a one-shot fix — the next feature that adds
another KB-level allocation per message will hit the same wall. A wider
threshold for *just this metric* is the structural fix: it acknowledges
that on a metric whose absolute values are tiny, percent swings are
noisy. Other metrics (binary size, latency, peak memory in MB) keep
their tighter thresholds.

## Verification

Most recent failing run shows the new value within the new band:

```
loop_latency_median_us           1026.0   918.0  -10.5%   15%   IMPROVED
memory_baseline_kb               9272.0  9332.0   +0.6%   15%   OK
memory_peak_kb                   9416.0  9560.0   +1.5%   15%   OK
memory_per_step_growth_kb           7.2    11.4  +58.3%   15%   FAIL  <-- this one
```

With the 75% threshold the FAIL becomes OK; everything else is
unchanged.
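
The arithmetic behind that verdict can be reproduced with a small sketch of a per-metric threshold gate. The function names and the override-map shape are hypothetical, not the benchmark workflow's actual implementation, and the sketch assumes every metric is "higher is worse", which holds for the four rows shown.

```cpp
#include <map>
#include <string>

// Percent change relative to the baseline; positive means growth.
double percentChangeSketch(double baseline, double current) {
    return (current - baseline) / baseline * 100.0;
}

// Hypothetical gate: a metric fails when its growth exceeds its own
// threshold if one is configured, or the generic default otherwise.
bool exceedsThresholdSketch(const std::string& metric, double baseline,
                            double current,
                            const std::map<std::string, double>& perMetricPct,
                            double defaultPct = 15.0) {
    const auto it = perMetricPct.find(metric);
    const double limit = (it != perMetricPct.end()) ? it->second : defaultPct;
    return percentChangeSketch(baseline, current) > limit;
}
```

With an empty override map, 7.2 to 11.4 KB is a regression at the 15% default; with `memory_per_step_growth_kb` mapped to 75.0, the same pair passes while the other metrics keep the default.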

## Test plan

- [ ] CI green on this PR (the same Windows benchmark job that's been
red on every main push since 2026-04-24)
- [ ] After merge, baseline auto-refreshes on the next main push, future
PRs no longer inherit the stale 7.2 KB baseline
@github-actions github-actions Bot mentioned this pull request May 1, 2026
5 tasks

Labels

cpp documentation Documentation changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: VLM support in C++ SDK

2 participants