feat: 20.1 binary protocol server — hot-path operations#158

Merged
vieiralucas merged 15 commits into main from
feat/20.1-binary-protocol-server-hot-path
Apr 1, 2026
Conversation

Member

@vieiralucas vieiralucas commented Mar 30, 2026

Summary

  • Created fila-fibp crate: shared binary protocol codec with frame types, opcodes (28), error codes (18), typed request/response structs, and encode/decode primitives
  • Added binary protocol TCP server to fila-server with handshake, all hot-path operation handlers (enqueue, consume, ack, nack), and streaming delivery via multiplexed connection loop
  • TLS support via tokio-rustls with mTLS when ca_file is configured
  • Server is opt-in via [server].binary_addr config; gRPC remains on listen_addr for backward compatibility

Test plan

  • 14 unit tests for FIBP codec round-trips (all frame types)
  • 8 integration tests: handshake, wrong version, enqueue+consume round-trip, batch enqueue (100 msgs), batch ack/nack, ping/pong, unknown opcode error
  • All 455+ existing tests pass (the only failure is the pre-existing flaky tls_valid_cert_connects_successfully, which also fails on main)
  • cargo clippy --workspace -- -D warnings clean
  • cargo fmt --check clean
  • cargo bench -p fila-bench --bench system passes — system benchmarks healthy

Benchmark numbers (baseline, gRPC SDK)

| Metric | Value |
| --- | --- |
| enqueue_throughput_1kb | 2,197.96 msg/s |
| e2e_latency_p50_light | 0.18 ms |
| e2e_latency_p99_light | 0.81 ms |
| fairness_overhead_pct | 2.59% |

Note: Direct binary vs gRPC throughput comparison requires SDK migration (Story 20.3). These are baseline numbers using the gRPC SDK.

🤖 Generated with Claude Code


Summary by cubic

Adds a binary protocol TCP server for hot-path ops with a shared codec and bounded continuation reassembly. Completes Story 20.1; opt-in via [server].binary_addr, with TLS/mTLS and strict size/batch guards, and fixes a rustls dual-provider panic.

  • New Features

    • fila-fibp: codec with frames/opcodes/errors, typed requests/responses, encode/decode, and continuation reassembly (per request_id) with max-size checks.
    • fila-server: binary TCP server (handshake, ping/pong, streaming consume) and batch enqueue/ack/nack; TLS via tokio-rustls with mTLS when ca_file is set; enable with [server].binary_addr (gRPC stays on listen_addr).
    • Protocol docs updated: admin opcodes renumbered to 0xFD downward; error opcode at 0xFE.
  • Bug Fixes

    • Added 16 MiB frame cap, early EOF checks, and decoded batch-count limits; fixed interleaved frame handling and consumer cleanup; tracked connection tasks with JoinSet; Stream now implements AsyncRead/AsyncWrite.
    • Robust continuation handling with opcode validation, size limits, and a cap of 64 in-flight streams to prevent truncation, mixed-frame, and memory-exhaustion errors.
    • TLS/startup stability: install rustls ring CryptoProvider at startup to avoid dual-provider panic; drain stdout/stderr in test helpers; added [telemetry].otlp_endpoint = ""; increased timeouts for flaky TLS tests; added EOF handling in continuation tests.
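To make the size and count guards above concrete, here is a minimal sketch of how such checks might look on the decode side. It is not the fila-fibp code: the wire layout (a 4-byte big-endian frame length, a u32 item count, u16-length-prefixed items), the `MAX_BATCH_ITEMS` value, and the function names are all assumptions chosen for illustration. Only the 16 MiB cap is taken from the summary.

```rust
use std::io::{self, Read};

// The 16 MiB cap comes from the PR summary; the batch limit value is an assumption.
const MAX_FRAME_LEN: usize = 16 * 1024 * 1024;
const MAX_BATCH_ITEMS: usize = 10_000;

fn invalid(msg: &'static str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

/// Reads one frame body behind a hypothetical 4-byte big-endian length prefix.
fn read_frame<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    reader.read_exact(&mut len_buf)?; // a short read surfaces as UnexpectedEof
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > MAX_FRAME_LEN {
        return Err(invalid("frame exceeds 16 MiB cap")); // reject before allocating
    }
    let mut body = vec![0u8; len];
    reader.read_exact(&mut body)?;
    Ok(body)
}

/// Splits `n` bytes off the front of `buf`, failing on truncated input.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> io::Result<&'a [u8]> {
    let cur: &'a [u8] = *buf;
    if cur.len() < n {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "truncated payload"));
    }
    let (head, rest) = cur.split_at(n);
    *buf = rest;
    Ok(head)
}

/// Decodes a u32 item count followed by u16-length-prefixed items (assumed layout).
fn decode_batch(mut payload: &[u8]) -> io::Result<Vec<Vec<u8>>> {
    let c = take(&mut payload, 4)?;
    let count = u32::from_be_bytes([c[0], c[1], c[2], c[3]]) as usize;
    // Every item needs at least its 2-byte length prefix, so a legitimate
    // count can never exceed remaining_bytes / 2.
    if count > MAX_BATCH_ITEMS || count > payload.len() / 2 {
        return Err(invalid("batch count exceeds limit"));
    }
    let mut items = Vec::with_capacity(count); // count is bounded by the checks above
    for _ in 0..count {
        let l = take(&mut payload, 2)?;
        let len = u16::from_be_bytes([l[0], l[1]]) as usize;
        items.push(take(&mut payload, len)?.to_vec());
    }
    Ok(items)
}
```

The point of checking the count against the remaining payload is that an attacker-controlled count cannot force an allocation larger than the bytes actually received.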

Written for commit d715b9f. Summary will update on new commits.

Benchmark Results (median of 3 runs, no baseline yet)

Commit: 86ef5d6
Time: 2026-04-01T11:39:05Z

| Benchmark | Value | Unit |
| --- | --- | --- |
| compaction_active_p99 | 0.445426 | ms |
| compaction_idle_p99 | 0.45324 | ms |
| compaction_p99_delta | -0.008195000000000008 | ms |
| consumer_concurrency_100_throughput | 1816.0 | msg/s |
| consumer_concurrency_10_throughput | 1880.0 | msg/s |
| consumer_concurrency_1_throughput | 328.0 | msg/s |
| e2e_latency_p50_light | 0.378707 | ms |
| e2e_latency_p95_light | 0.414601 | ms |
| e2e_latency_p99_light | 0.51096 | ms |
| enqueue_throughput_1kb | 3096.513940406098 | msg/s |
| enqueue_throughput_1kb_mbps | 3.02393939492783 | MB/s |
| fairness_accuracy_max_deviation | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-1 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-2 | 0.1999999999999988 | % deviation |
| fairness_accuracy_tenant-3 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-4 | 0.099999999999989 | % deviation |
| fairness_accuracy_tenant-5 | 0.099999999999989 | % deviation |
| fairness_overhead_fair_throughput | 1421.7255041395565 | msg/s |
| fairness_overhead_fifo_throughput | 1473.3016706071303 | msg/s |
| fairness_overhead_pct | 3.500720015224046 | % |
| key_cardinality_10_throughput | 2695.782639045814 | msg/s |
| key_cardinality_10k_throughput | 583.2305900717149 | msg/s |
| key_cardinality_1k_throughput | 1117.3258740813487 | msg/s |
| lua_on_enqueue_overhead_us | 24.62797105146092 | us |
| lua_throughput_with_hook | 1181.30796907616 | msg/s |
| memory_per_message_overhead | 99.5328 | bytes/msg |
| memory_rss_idle | 162.77734375 | MB |
| memory_rss_loaded_10k | 164.125 | MB |


@cubic-dev-ai cubic-dev-ai Bot left a comment

9 issues found across 20 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/fila-fibp/src/frame.rs">

<violation number="1" location="crates/fila-fibp/src/frame.rs:147">
P1: `put_string` truncates lengths > u16::MAX, which can corrupt frame decoding by misaligning subsequent payload fields.</violation>

<violation number="2" location="crates/fila-fibp/src/frame.rs:159">
P1: `put_string_map` can wrap map length to u16 and produce malformed payloads for large maps.</violation>

<violation number="3" location="crates/fila-fibp/src/frame.rs:168">
P1: `put_string_array` can truncate element count to u16, causing payload desynchronization on decode.</violation>
</file>

<file name="crates/fila-server/src/binary_server.rs">

<violation number="1" location="crates/fila-server/src/binary_server.rs:476">
P2: Consumer leak when a duplicate `request_id` is used for a second Consume request. `HashMap::insert` silently overwrites the old `consumer_id`, which is never unregistered from the broker. Either reject a duplicate `request_id` with an error, or unregister the previous consumer before inserting the new one.</violation>
</file>

<file name="crates/fila-core/src/broker/config.rs">

<violation number="1" location="crates/fila-core/src/broker/config.rs:92">
P3: The new `binary_addr` doc comment is incorrect: default is `None` (disabled), not `0.0.0.0:5555`.</violation>
</file>

<file name="_bmad-output/implementation-artifacts/epic-execution-state.yaml">

<violation number="1" location="_bmad-output/implementation-artifacts/epic-execution-state.yaml:7">
P3: Set the open-PR story status to `review` instead of `in-progress`.

(Based on your team's feedback about keeping story status as `review` while a PR is open.) [FEEDBACK_USED]</violation>
</file>

<file name="crates/fila-server/tests/binary_protocol.rs">

<violation number="1" location="crates/fila-server/tests/binary_protocol.rs:86">
P2: Handle EOF in the receive loop; otherwise a closed connection can cause an infinite wait/hang.</violation>

<violation number="2" location="crates/fila-server/tests/binary_protocol.rs:107">
P1: `send_and_recv` should filter by `request_id` (or expected opcode) before returning; otherwise unsolicited frames can be mistaken for the response and make tests flaky.</violation>
</file>

<file name="crates/fila-fibp/src/types.rs">

<violation number="1" location="crates/fila-fibp/src/types.rs:206">
P1: Bound-check decoded item counts before `Vec::with_capacity(count)` to prevent memory-exhaustion DoS from malicious frame counts.</violation>
</file>
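
The three frame.rs findings above share one root cause: a `usize` length silently narrowed to `u16`. A minimal sketch of one possible fix follows, assuming a u16 big-endian length prefix. The function names mirror those in the findings, but the signatures, the `EncodeError` type, and the scratch-buffer approach are assumptions rather than the actual fila-fibp API.

```rust
use std::convert::TryFrom;

/// Hypothetical error type; the real crate's error codes are not shown in this thread.
#[derive(Debug, PartialEq)]
enum EncodeError {
    TooLong { len: usize },
}

/// Writes a u16 big-endian length prefix followed by the UTF-8 bytes.
/// Fails instead of truncating when the string does not fit in a u16.
fn put_string(buf: &mut Vec<u8>, s: &str) -> Result<(), EncodeError> {
    let len = u16::try_from(s.len()).map_err(|_| EncodeError::TooLong { len: s.len() })?;
    buf.extend_from_slice(&len.to_be_bytes());
    buf.extend_from_slice(s.as_bytes());
    Ok(())
}

/// Writes a u16 element count followed by each element as a prefixed string.
fn put_string_array(buf: &mut Vec<u8>, items: &[&str]) -> Result<(), EncodeError> {
    let count =
        u16::try_from(items.len()).map_err(|_| EncodeError::TooLong { len: items.len() })?;
    // Encode into a scratch buffer so a failure midway leaves `buf` untouched.
    let mut scratch = Vec::new();
    scratch.extend_from_slice(&count.to_be_bytes());
    for s in items {
        put_string(&mut scratch, s)?;
    }
    buf.extend_from_slice(&scratch);
    Ok(())
}
```

With `as u16`, a 70,000-byte string would be written with a prefix of 4,464 and desynchronize every field after it. Returning an error keeps the decoder and encoder in agreement.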

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
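
For the binary_server.rs finding above (a duplicate `request_id` overwriting an existing consumer), one of the two suggested remedies is to reject the duplicate. The sketch below shows that option with the standard `HashMap` entry API. `ConsumerRegistry`, `ConsumeError`, the `u32` request ID, and the `String` consumer ID are hypothetical stand-ins, not the server's real types.

```rust
use std::collections::hash_map::{Entry, HashMap};

#[derive(Debug, PartialEq)]
enum ConsumeError {
    DuplicateRequestId(u32),
}

/// Hypothetical per-connection map from request_id to the broker consumer_id.
#[derive(Default)]
struct ConsumerRegistry {
    by_request: HashMap<u32, String>,
}

impl ConsumerRegistry {
    /// Registers a consumer, refusing to overwrite an existing entry so the
    /// earlier consumer_id is never dropped without being unregistered.
    fn register(&mut self, request_id: u32, consumer_id: String) -> Result<(), ConsumeError> {
        match self.by_request.entry(request_id) {
            Entry::Occupied(_) => Err(ConsumeError::DuplicateRequestId(request_id)),
            Entry::Vacant(slot) => {
                slot.insert(consumer_id);
                Ok(())
            }
        }
    }

    /// Removes the entry and hands back the consumer_id so the caller can
    /// unregister it from the broker.
    fn unregister(&mut self, request_id: u32) -> Option<String> {
        self.by_request.remove(&request_id)
    }
}
```

The alternative remedy, unregistering the previous consumer before inserting, would use the value returned by `HashMap::insert` instead of discarding it.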

Comment thread crates/fila-fibp/src/frame.rs
Comment thread crates/fila-fibp/src/frame.rs
Comment thread crates/fila-fibp/src/frame.rs
Comment thread crates/fila-server/tests/binary_protocol.rs Outdated
Comment thread crates/fila-fibp/src/types.rs
Comment thread crates/fila-server/src/binary_server.rs Outdated
Comment thread crates/fila-server/tests/binary_protocol.rs
Comment thread crates/fila-core/src/broker/config.rs Outdated
Comment thread _bmad-output/implementation-artifacts/epic-execution-state.yaml Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 7 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/fila-server/tests/binary_protocol.rs">

<violation number="1" location="crates/fila-server/tests/binary_protocol.rs:117">
P2: Discarding non-matching frames in `send_and_recv` can lose delivery events on the shared connection, causing flaky/failing consume tests when delivery arrives before the command response.</violation>
</file>
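
One way to address this finding is to stash non-matching frames rather than drop them, so a delivery that arrives before the command response can still be read afterwards. The sketch below models the socket as an iterator of already-decoded frames to stay self-contained. `Frame`, `TestClient`, and the opcode values used in the usage note are hypothetical and do not reflect the real test helper.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
struct Frame {
    request_id: u32,
    opcode: u8,
    payload: Vec<u8>,
}

/// Test-side client; `incoming` stands in for the socket's read half.
struct TestClient<I: Iterator<Item = Frame>> {
    incoming: I,
    stashed: VecDeque<Frame>, // frames read while waiting for a different one
}

impl<I: Iterator<Item = Frame>> TestClient<I> {
    fn new(incoming: I) -> Self {
        TestClient { incoming, stashed: VecDeque::new() }
    }

    /// Returns the frame matching `request_id` and `opcode`, keeping every
    /// other frame for later. `None` models EOF, so callers fail fast
    /// instead of waiting forever on a closed connection.
    fn recv_response(&mut self, request_id: u32, opcode: u8) -> Option<Frame> {
        let wanted = |f: &Frame| f.request_id == request_id && f.opcode == opcode;
        if let Some(pos) = self.stashed.iter().position(|f| wanted(f)) {
            return self.stashed.remove(pos);
        }
        loop {
            let frame = self.incoming.next()?;
            if wanted(&frame) {
                return Some(frame);
            }
            self.stashed.push_back(frame);
        }
    }

    /// Returns the next frame of any kind, draining the stash first so
    /// arrival order is preserved.
    fn recv_any(&mut self) -> Option<Frame> {
        match self.stashed.pop_front() {
            Some(frame) => Some(frame),
            None => self.incoming.next(),
        }
    }
}
```

In a consume test, `recv_response` would be used for the ack or enqueue reply and `recv_any` for the delivery, regardless of which frame the server happened to send first.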

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/fila-server/tests/binary_protocol.rs

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="_bmad-output/implementation-artifacts/epic-execution-state.yaml">

<violation number="1" location="_bmad-output/implementation-artifacts/epic-execution-state.yaml:7">
P2: Keep story 20.1 in `review` while the PR is open; the epic-review workflow sets it to done after merge.

(Based on your team's feedback about keeping story status as `review` until merge.) [FEEDBACK_USED]</violation>
</file>

<file name="_bmad-output/implementation-artifacts/sprint-status.yaml">

<violation number="1" location="_bmad-output/implementation-artifacts/sprint-status.yaml:205">
P2: Keep the story status as `review` until the PR is merged; the epic-review workflow will set it to `done` automatically.

(Based on your team's feedback about keeping story status in review until merge.) [FEEDBACK_USED]</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

status: review
- id: "20.1"
title: "Binary Protocol Server — Hot-Path Operations"
status: completed

@cubic-dev-ai cubic-dev-ai Bot Mar 30, 2026

P2: Keep story 20.1 in review while the PR is open; the epic-review workflow sets it to done after merge.

(Based on your team's feedback about keeping story status as review until merge.)

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At _bmad-output/implementation-artifacts/epic-execution-state.yaml, line 7:

<comment>Keep story 20.1 in `review` while the PR is open; the epic-review workflow sets it to done after merge.

(Based on your team's feedback about keeping story status as `review` until merge.) </comment>

<file context>
@@ -4,8 +4,8 @@ startedAt: "2026-03-30"
     title: "Binary Protocol Server — Hot-Path Operations"
-    status: review
-    currentPhase: "pr-ci"
+    status: completed
+    currentPhase: ""
     branch: "feat/20.1-binary-protocol-server-hot-path"
</file context>
Suggested change
- status: completed
+ status: review

- epic-20: backlog
- 20-1-binary-protocol-server-hot-path-operations: backlog
+ epic-20: in-progress
+ 20-1-binary-protocol-server-hot-path-operations: done

@cubic-dev-ai cubic-dev-ai Bot Mar 30, 2026

P2: Keep the story status as review until the PR is merged; the epic-review workflow will set it to done automatically.

(Based on your team's feedback about keeping story status in review until merge.)

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At _bmad-output/implementation-artifacts/sprint-status.yaml, line 205:

<comment>Keep the story status as `review` until the PR is merged; the epic-review workflow will set it to `done` automatically.

(Based on your team's feedback about keeping story status in review until merge.) </comment>

<file context>
@@ -202,7 +202,7 @@ development_status:
   # Full protocol migration: server, auth, Rust SDK, CLI, cluster, gRPC removal.
   epic-20: in-progress
-  20-1-binary-protocol-server-hot-path-operations: review
+  20-1-binary-protocol-server-hot-path-operations: done
   20-2-admin-operations-auth-on-binary-protocol: backlog
   20-3-rust-sdk-binary-protocol-client: backlog
</file context>
Suggested change
- 20-1-binary-protocol-server-hot-path-operations: done
+ 20-1-binary-protocol-server-hot-path-operations: review

Member Author

@vieiralucas vieiralucas left a comment

I don't think we implemented the continuation pattern per the spec in docs/protocol.md, did we?

Comment thread crates/fila-fibp/src/opcode.rs Outdated
Comment thread crates/fila-server/src/binary_server.rs Outdated
debug!(%peer, "new TCP connection");
let server = Arc::clone(&server);
let tls = tls_acceptor.clone();
tokio::spawn(async move {
Member Author

Shouldn't we be tracking the tasks we spawn here somewhere? Is it really OK to spawn them like this?
If we hit the shutdown branch, won't we exit the loop and return the future immediately, potentially leaving connection-handling tasks in limbo instead of shutting them down gracefully?

Member Author

Addressed — connection tasks are now tracked in a tokio::task::JoinSet. On shutdown: abort_all() + drain. Completed tasks are reaped in the select loop to prevent unbounded JoinHandle growth. Commit 65821fd.

Comment thread crates/fila-server/src/binary_server.rs Outdated
Comment thread crates/fila-server/src/binary_server.rs Outdated
Comment on lines +278 to +279
// All delivery senders dropped — shouldn't happen
// since we hold delivery_tx. Ignore.
Member Author

So maybe add unreachable!()?

Member Author

Addressed — replaced comment with unreachable!("delivery_tx is held by ConnectionState"). Commit 65821fd.

Comment thread crates/fila-server/src/binary_server.rs
- admin opcodes renumbered from 0xFD downward (was 0x20 upward) so
  hot-path and admin ranges grow independently without colliding
- Stream enum now implements AsyncRead + AsyncWrite traits instead of
  manual method delegation
- connection tasks tracked via JoinSet for proper graceful shutdown
  (abort + drain on shutdown signal, reap completed in accept loop)
- delivery_rx None branch uses unreachable!() since delivery_tx is held
- added unit tests for ConnectionError Display and Stream trait bounds
- updated docs/protocol.md with new admin opcode values throughout
@vieiralucas
Member Author

RE: continuation frame support — confirmed this is not implemented in any story in Epic 20. The codec defines FLAG_CONTINUATION and is_continuation() but neither the server nor SDK ever uses them. With 16 MiB max frame size it's unlikely to matter in practice, but it's a spec conformance gap to track for a future epic.

continuation frames:
- ContinuationAssembler in fila-fibp: tracks per-request_id reassembly,
  validates opcode consistency, enforces max reassembled size
- integrated into binary_server ConnectionState frame_loop
- 8 unit tests (passthrough, 2/3-frame, interleaved, errors, clear)
- 1 integration test (split enqueue over 2 continuation frames)

tls e2e fixes:
- drain both stdout AND stderr pipes in all test server helpers
  (was only draining stderr, causing server hang on full stdout pipe)
- add missing [telemetry] otlp_endpoint="" to tls test config
- add assert on readiness check in start_tls_server

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 8 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/fila-server/tests/binary_protocol.rs">

<violation number="1" location="crates/fila-server/tests/binary_protocol.rs:489">
P2: Handle EOF in the read loop so the test fails immediately when the connection closes unexpectedly.</violation>
</file>

<file name="crates/fila-fibp/src/frame.rs">

<violation number="1" location="crates/fila-fibp/src/frame.rs:195">
P1: Add a bound on in-flight continuation states (and/or total buffered bytes) to prevent memory-exhaustion via many unfinished request IDs.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/fila-fibp/src/frame.rs
Comment thread crates/fila-server/tests/binary_protocol.rs Outdated
…in test

- cap in-flight continuation streams to 64 (DEFAULT_MAX_PENDING_STREAMS)
  to prevent memory exhaustion from many unfinished request IDs
- handle EOF in continuation test read loop to fail fast on unexpected
  connection close instead of hanging
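
The continuation handling described in these commits (per-request_id reassembly, opcode consistency, a maximum reassembled size, and a cap on in-flight streams) can be sketched roughly as below. This is an illustration under stated assumptions, not the `ContinuationAssembler` from fila-fibp: the `more` flag is assumed to mean "further frames follow", and the `Assembler` type, its `push` signature, the error names, and the default size limit are hypothetical. Only `DEFAULT_MAX_PENDING_STREAMS = 64` is taken from the commit message.

```rust
use std::collections::HashMap;

const DEFAULT_MAX_PENDING_STREAMS: usize = 64; // value stated in the commit message
const DEFAULT_MAX_REASSEMBLED: usize = 16 * 1024 * 1024; // assumption: reuse the frame cap

#[derive(Debug, PartialEq)]
enum AssembleError {
    OpcodeMismatch,
    TooLarge,
    TooManyStreams,
}

struct Pending {
    opcode: u8,
    data: Vec<u8>,
}

struct Assembler {
    pending: HashMap<u32, Pending>,
    max_size: usize,
    max_streams: usize,
}

impl Assembler {
    fn new(max_size: usize, max_streams: usize) -> Self {
        Assembler { pending: HashMap::new(), max_size, max_streams }
    }

    fn with_defaults() -> Self {
        Self::new(DEFAULT_MAX_REASSEMBLED, DEFAULT_MAX_PENDING_STREAMS)
    }

    /// Feeds one frame. Returns `Ok(Some((opcode, payload)))` once a message
    /// is complete and `Ok(None)` while more frames are expected.
    fn push(
        &mut self,
        request_id: u32,
        opcode: u8,
        more: bool,
        chunk: &[u8],
    ) -> Result<Option<(u8, Vec<u8>)>, AssembleError> {
        let known = self.pending.contains_key(&request_id);
        // Passthrough: an unfragmented frame never touches the map.
        if !more && !known {
            if chunk.len() > self.max_size {
                return Err(AssembleError::TooLarge);
            }
            return Ok(Some((opcode, chunk.to_vec())));
        }
        // Cap the number of unfinished request IDs before starting a new one.
        if !known && self.pending.len() >= self.max_streams {
            return Err(AssembleError::TooManyStreams);
        }
        let max_size = self.max_size;
        let verdict = {
            let entry = self
                .pending
                .entry(request_id)
                .or_insert_with(|| Pending { opcode, data: Vec::new() });
            if entry.opcode != opcode {
                Err(AssembleError::OpcodeMismatch)
            } else if entry.data.len() + chunk.len() > max_size {
                Err(AssembleError::TooLarge)
            } else {
                entry.data.extend_from_slice(chunk);
                Ok(())
            }
        };
        if let Err(e) = verdict {
            self.pending.remove(&request_id); // drop the partial state on any error
            return Err(e);
        }
        if more {
            return Ok(None);
        }
        let done = self.pending.remove(&request_id).expect("entry was inserted above");
        Ok(Some((done.opcode, done.data)))
    }
}
```

Dropping the partial state on error keeps a misbehaving peer from pinning a slot, and the stream cap bounds worst-case buffered memory at roughly `max_streams * max_size`.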
@vieiralucas vieiralucas merged commit bdd9252 into main Apr 1, 2026
8 checks passed
@vieiralucas vieiralucas deleted the feat/20.1-binary-protocol-server-hot-path branch April 1, 2026 12:03