feat: 20.1 binary protocol server — hot-path operations#158
feat: 20.1 binary protocol server — hot-path operations#158vieiralucas merged 15 commits intomainfrom
Conversation
There was a problem hiding this comment.
9 issues found across 20 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/fila-fibp/src/frame.rs">
<violation number="1" location="crates/fila-fibp/src/frame.rs:147">
P1: `put_string` truncates lengths > u16::MAX, which can corrupt frame decoding by misaligning subsequent payload fields.</violation>
<violation number="2" location="crates/fila-fibp/src/frame.rs:159">
P1: `put_string_map` can wrap map length to u16 and produce malformed payloads for large maps.</violation>
<violation number="3" location="crates/fila-fibp/src/frame.rs:168">
P1: `put_string_array` can truncate element count to u16, causing payload desynchronization on decode.</violation>
</file>
<file name="crates/fila-server/src/binary_server.rs">
<violation number="1" location="crates/fila-server/src/binary_server.rs:476">
P2: Consumer leak when a duplicate `request_id` is used for a second Consume request. `HashMap::insert` silently overwrites the old `consumer_id`, which is never unregistered from the broker. Either reject a duplicate `request_id` with an error, or unregister the previous consumer before inserting the new one.</violation>
</file>
<file name="crates/fila-core/src/broker/config.rs">
<violation number="1" location="crates/fila-core/src/broker/config.rs:92">
P3: The new `binary_addr` doc comment is incorrect: default is `None` (disabled), not `0.0.0.0:5555`.</violation>
</file>
<file name="_bmad-output/implementation-artifacts/epic-execution-state.yaml">
<violation number="1" location="_bmad-output/implementation-artifacts/epic-execution-state.yaml:7">
P3: Set the open-PR story status to `review` instead of `in-progress`.
(Based on your team's feedback about keeping story status as `review` while a PR is open.) [FEEDBACK_USED]</violation>
</file>
<file name="crates/fila-server/tests/binary_protocol.rs">
<violation number="1" location="crates/fila-server/tests/binary_protocol.rs:86">
P2: Handle EOF in the receive loop; otherwise a closed connection can cause an infinite wait/hang.</violation>
<violation number="2" location="crates/fila-server/tests/binary_protocol.rs:107">
P1: `send_and_recv` should filter by `request_id` (or expected opcode) before returning; otherwise unsolicited frames can be mistaken for the response and make tests flaky.</violation>
</file>
<file name="crates/fila-fibp/src/types.rs">
<violation number="1" location="crates/fila-fibp/src/types.rs:206">
P1: Bound-check decoded item counts before `Vec::with_capacity(count)` to prevent memory-exhaustion DoS from malicious frame counts.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
… consumer leak, test robustness
There was a problem hiding this comment.
1 issue found across 7 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/fila-server/tests/binary_protocol.rs">
<violation number="1" location="crates/fila-server/tests/binary_protocol.rs:117">
P2: Discarding non-matching frames in `send_and_recv` can lose delivery events on the shared connection, causing flaky/failing consume tests when delivery arrives before the command response.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
…ndling in batch_ack_and_nack
There was a problem hiding this comment.
2 issues found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="_bmad-output/implementation-artifacts/epic-execution-state.yaml">
<violation number="1" location="_bmad-output/implementation-artifacts/epic-execution-state.yaml:7">
P2: Keep story 20.1 in `review` while the PR is open; the epic-review workflow sets it to done after merge.
(Based on your team's feedback about keeping story status as `review` until merge.) [FEEDBACK_USED]</violation>
</file>
<file name="_bmad-output/implementation-artifacts/sprint-status.yaml">
<violation number="1" location="_bmad-output/implementation-artifacts/sprint-status.yaml:205">
P2: Keep the story status as `review` until the PR is merged; the epic-review workflow will set it to `done` automatically.
(Based on your team's feedback about keeping story status in review until merge.) [FEEDBACK_USED]</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| status: review | ||
| - id: "20.1" | ||
| title: "Binary Protocol Server — Hot-Path Operations" | ||
| status: completed |
There was a problem hiding this comment.
P2: Keep story 20.1 in review while the PR is open; the epic-review workflow sets it to done after merge.
(Based on your team's feedback about keeping story status as review until merge.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At _bmad-output/implementation-artifacts/epic-execution-state.yaml, line 7:
<comment>Keep story 20.1 in `review` while the PR is open; the epic-review workflow sets it to done after merge.
(Based on your team's feedback about keeping story status as `review` until merge.) </comment>
<file context>
@@ -4,8 +4,8 @@ startedAt: "2026-03-30"
title: "Binary Protocol Server — Hot-Path Operations"
- status: review
- currentPhase: "pr-ci"
+ status: completed
+ currentPhase: ""
branch: "feat/20.1-binary-protocol-server-hot-path"
</file context>
| status: completed | |
| status: review |
| epic-20: backlog | ||
| 20-1-binary-protocol-server-hot-path-operations: backlog | ||
| epic-20: in-progress | ||
| 20-1-binary-protocol-server-hot-path-operations: done |
There was a problem hiding this comment.
P2: Keep the story status as review until the PR is merged; the epic-review workflow will set it to done automatically.
(Based on your team's feedback about keeping story status in review until merge.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At _bmad-output/implementation-artifacts/sprint-status.yaml, line 205:
<comment>Keep the story status as `review` until the PR is merged; the epic-review workflow will set it to `done` automatically.
(Based on your team's feedback about keeping story status in review until merge.) </comment>
<file context>
@@ -202,7 +202,7 @@ development_status:
# Full protocol migration: server, auth, Rust SDK, CLI, cluster, gRPC removal.
epic-20: in-progress
- 20-1-binary-protocol-server-hot-path-operations: review
+ 20-1-binary-protocol-server-hot-path-operations: done
20-2-admin-operations-auth-on-binary-protocol: backlog
20-3-rust-sdk-binary-protocol-client: backlog
</file context>
| 20-1-binary-protocol-server-hot-path-operations: done | |
| 20-1-binary-protocol-server-hot-path-operations: review |
vieiralucas
left a comment
There was a problem hiding this comment.
I don't think we implemented the continuation pattern per spec in docs/protocol.md did we?
| debug!(%peer, "new TCP connection"); | ||
| let server = Arc::clone(&server); | ||
| let tls = tls_acceptor.clone(); | ||
| tokio::spawn(async move { |
There was a problem hiding this comment.
Shouldn't we be tracking these tasks that we spawn here somewhere somehow? Is it really ok to do it like we are doing?
Like, if we get to the shutdown branch, wont we exit the loop and return the future immediately potentianelly leaving tasks handling connections in a limbo, and not properly gracefully shutdown?
There was a problem hiding this comment.
Addressed — connection tasks are now tracked in a tokio::task::JoinSet. On shutdown: abort_all() + drain. Completed tasks are reaped in the select loop to prevent unbounded JoinHandle growth. Commit 65821fd.
| // All delivery senders dropped — shouldn't happen | ||
| // since we hold delivery_tx. Ignore. |
There was a problem hiding this comment.
So maybe add unreachable! ?
There was a problem hiding this comment.
Addressed — replaced comment with unreachable!("delivery_tx is held by ConnectionState"). Commit 65821fd.
- admin opcodes renumbered from 0xFD downward (was 0x20 upward) so hot-path and admin ranges grow independently without colliding - Stream enum now implements AsyncRead + AsyncWrite traits instead of manual method delegation - connection tasks tracked via JoinSet for proper graceful shutdown (abort + drain on shutdown signal, reap completed in accept loop) - delivery_rx None branch uses unreachable!() since delivery_tx is held - added unit tests for ConnectionError Display and Stream trait bounds - updated docs/protocol.md with new admin opcode values throughout
|
RE: continuation frame support — confirmed this is not implemented in any story in Epic 20. The codec defines FLAG_CONTINUATION and is_continuation() but neither the server nor SDK ever uses them. With 16 MiB max frame size it's unlikely to matter in practice, but it's a spec conformance gap to track for a future epic. |
continuation frames: - ContinuationAssembler in fila-fibp: tracks per-request_id reassembly, validates opcode consistency, enforces max reassembled size - integrated into binary_server ConnectionState frame_loop - 8 unit tests (passthrough, 2/3-frame, interleaved, errors, clear) - 1 integration test (split enqueue over 2 continuation frames) tls e2e fixes: - drain both stdout AND stderr pipes in all test server helpers (was only draining stderr, causing server hang on full stdout pipe) - add missing [telemetry] otlp_endpoint="" to tls test config - add assert on readiness check in start_tls_server
There was a problem hiding this comment.
2 issues found across 8 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/fila-server/tests/binary_protocol.rs">
<violation number="1" location="crates/fila-server/tests/binary_protocol.rs:489">
P2: Handle EOF in the read loop so the test fails immediately when the connection closes unexpectedly.</violation>
</file>
<file name="crates/fila-fibp/src/frame.rs">
<violation number="1" location="crates/fila-fibp/src/frame.rs:195">
P1: Add a bound on in-flight continuation states (and/or total buffered bytes) to prevent memory-exhaustion via many unfinished request IDs.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
…in test - cap in-flight continuation streams to 64 (DEFAULT_MAX_PENDING_STREAMS) to prevent memory exhaustion from many unfinished request IDs - handle EOF in continuation test read loop to fail fast on unexpected connection close instead of hanging
Summary
fila-fibpcrate: shared binary protocol codec with frame types, opcodes (28), error codes (18), typed request/response structs, and encode/decode primitivesfila-serverwith handshake, all hot-path operation handlers (enqueue, consume, ack, nack), and streaming delivery via multiplexed connection looptokio-rustlswith mTLS whenca_fileis configured[server].binary_addrconfig; gRPC remains onlisten_addrfor backward compatibilityTest plan
tls_valid_cert_connects_successfullyfails on main too)cargo clippy --workspace -- -D warningscleancargo fmt --checkcleancargo bench -p fila-bench --bench systempasses — system benchmarks healthyBenchmark numbers (baseline, gRPC SDK)
Note: Direct binary vs gRPC throughput comparison requires SDK migration (Story 20.3). These are baseline numbers using the gRPC SDK.
🤖 Generated with Claude Code
Summary by cubic
Adds a binary protocol TCP server for hot-path ops with a shared codec and bounded continuation reassembly. Completes Story 20.1; opt-in via
[server].binary_addr, with TLS/mTLS and strict size/batch guards, and fixes a rustls dual-provider panic.New Features
fila-fibp: codec with frames/opcodes/errors, typed requests/responses, encode/decode, and continuation reassembly (perrequest_id) with max-size checks.fila-server: binary TCP server (handshake, ping/pong, streaming consume) and batch enqueue/ack/nack; TLS viatokio-rustlswith mTLS whenca_fileis set; enable with[server].binary_addr(gRPC stays onlisten_addr).0xFDdownward; error opcode at0xFE.Bug Fixes
JoinSet;Streamnow implementsAsyncRead/AsyncWrite.CryptoProviderat startup to avoid dual-provider panic; drain stdout/stderr in test helpers; added[telemetry].otlp_endpoint = ""; increased timeouts for flaky TLS tests; added EOF handling in continuation tests.Written for commit d715b9f. Summary will update on new commits.
Benchmark Results (median of 3 runs, no baseline yet)
Commit:
86ef5d6Time: 2026-04-01T11:39:05Z