feat: 20.4 cli & cluster inter-node migration to binary protocol#161

Merged: vieiralucas merged 13 commits into main from feat/20.4-cli-cluster-migration on Apr 3, 2026

Conversation

@vieiralucas (Member) commented Mar 31, 2026

Summary

  • Rewrote fila-cli from gRPC to binary protocol (fila-fibp frames over TCP)
  • Migrated cluster inter-node communication (Raft RPCs, leader forwarding, membership management) from gRPC to binary protocol
  • 14 new cluster opcodes (0x40-0x4D), 10 new cluster frame types
  • Protobuf serialization kept for Raft payloads (transported inside binary frames)
  • Updated all e2e test CLI invocations to use binary_addr
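The opcode range and the "protobuf payload inside a binary frame" idea from the list above can be sketched as follows. This is a minimal illustration, not the fila-fibp wire format: the frame layout (1-byte opcode, 4-byte big-endian length, payload) and the names `is_cluster_opcode`, `encode_frame`, and `decode_frame` are assumptions made for the example. Only the range 0x40-0x4D and the count of 14 come from the PR description.

```rust
/// First and last cluster opcodes, as stated in the PR summary (0x40-0x4D).
const CLUSTER_OPCODE_FIRST: u8 = 0x40;
const CLUSTER_OPCODE_LAST: u8 = 0x4D;

/// Number of opcodes in the inclusive range: 0x4D - 0x40 + 1 = 14.
const CLUSTER_OPCODE_COUNT: usize =
    (CLUSTER_OPCODE_LAST - CLUSTER_OPCODE_FIRST) as usize + 1;

/// Returns true when `opcode` falls inside the cluster range.
fn is_cluster_opcode(opcode: u8) -> bool {
    (CLUSTER_OPCODE_FIRST..=CLUSTER_OPCODE_LAST).contains(&opcode)
}

/// ASSUMED layout, for illustration only:
/// [opcode: u8][payload_len: u32 big-endian][payload bytes].
/// The payload is opaque here; in the PR it would be a protobuf-encoded
/// Raft message.
fn encode_frame(opcode: u8, payload: &[u8]) -> Vec<u8> {
    let mut frame = Vec::with_capacity(5 + payload.len());
    frame.push(opcode);
    frame.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    frame.extend_from_slice(payload);
    frame
}

/// Splits a frame back into opcode and payload.
/// Returns None when the header is incomplete or the payload is truncated.
fn decode_frame(frame: &[u8]) -> Option<(u8, &[u8])> {
    if frame.len() < 5 {
        return None;
    }
    let opcode = frame[0];
    let len = u32::from_be_bytes([frame[1], frame[2], frame[3], frame[4]]) as usize;
    let end = 5usize.checked_add(len)?;
    let payload = frame.get(5..end)?;
    Some((opcode, payload))
}
```

Keeping the payload opaque at the framing layer is what lets the Raft messages stay protobuf-encoded while the transport changes underneath them.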

Test plan

  • All 348 fila-core tests pass (including 17 cluster tests)
  • All 27 fila-fibp tests pass
  • Workspace builds clean
  • cargo clippy --workspace -- -D warnings clean
  • E2e tests — CI will verify

🤖 Generated with Claude Code


Summary by cubic

Migrates fila-cli and cluster inter-node traffic from gRPC to fila-fibp over TCP to unify transport and reduce overhead. Completes Linear 20.4; TLS/mTLS preserved; all writes flow through Raft; fila-sdk delivery handling is backpressure-safe.

  • New Features

    • Finalized cluster opcodes; Raft RPCs and leader-forwarded writes carried in binary frames (protobuf payloads).
    • Enqueue/ack/nack and queue create/delete go via Raft; stats/list include leader and replication info.
    • Rewrote fila-cli to fila-fibp with TLS/mTLS and existing flags; updated fila-sdk tests, e2e, benches, and profile-workload to use binary_addr.
  • Bug Fixes

    • Prevented fila-sdk deadlocks and delivery loss under backpressure with an overflow buffer; response frames never starve; consume() cancels inline; server enforces one active subscription per connection.
    • Reused mTLS root store; fixed cluster enqueue error-id; validated ACL permission kinds in CLI.
    • Corrected opcode positions (ConsumeOk 0x13, Delivery 0x14); fixed latency bench to reuse the same stream.

Written for commit 9854a89. Summary will update on new commits.

Benchmark Results (median of 3 runs, no baseline yet)

Commit: 1a770f5
Time: 2026-04-03T18:54:57Z

| Benchmark | Value | Unit |
|---|---|---|
| compaction_active_p99 | 0.174303 | ms |
| compaction_idle_p99 | 0.17737699999999998 | ms |
| compaction_p99_delta | 0.002394000000000035 | ms |
| consumer_concurrency_100_throughput | 3551.3333333333335 | msg/s |
| consumer_concurrency_10_throughput | 596.6666666666666 | msg/s |
| consumer_concurrency_1_throughput | 79.0 | msg/s |
| e2e_latency_p50_light | 40.694884 | ms |
| e2e_latency_p95_light | 40.766093000000005 | ms |
| e2e_latency_p99_light | 40.853371 | ms |
| enqueue_throughput_1kb | 7936.610956948823 | msg/s |
| enqueue_throughput_1kb_mbps | 7.750596637645335 | MB/s |
| fairness_accuracy_max_deviation | 5.0000000000000115 | % deviation |
| fairness_accuracy_tenant-1 | 5.0000000000000115 | % deviation |
| fairness_accuracy_tenant-2 | 1.1000000000000036 | % deviation |
| fairness_accuracy_tenant-3 | 0.10000000000000286 | % deviation |
| fairness_accuracy_tenant-4 | 0.6249999999999936 | % deviation |
| fairness_accuracy_tenant-5 | 0.9399999999999964 | % deviation |
| fairness_overhead_fair_throughput | 77.74173173119429 | msg/s |
| fairness_overhead_fifo_throughput | 79.87201125211335 | msg/s |
| fairness_overhead_pct | 2.6671164122747615 | % |
| key_cardinality_10_throughput | 6560.591585069704 | msg/s |
| key_cardinality_10k_throughput | 613.2647881183649 | msg/s |
| key_cardinality_1k_throughput | 1350.2511156427342 | msg/s |
| lua_on_enqueue_overhead_us | 0.0 | us |
| lua_throughput_with_hook | 137.79503548837582 | msg/s |
| memory_per_message_overhead | 5954.3552 | bytes/msg |
| memory_rss_idle | 149.64453125 | MB |
| memory_rss_loaded_10k | 210.75390625 | MB |

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 27 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="_bmad-output/implementation-artifacts/stories/20-4-cli-cluster-inter-node-migration.md">

<violation number="1" location="_bmad-output/implementation-artifacts/stories/20-4-cli-cluster-inter-node-migration.md:3">
P3: Story status should remain `review` while the PR is open; `ready-for-dev` doesn’t match the repo’s story workflow for open PRs.</violation>
</file>

<file name="crates/fila-cli/src/main.rs">

<violation number="1" location="crates/fila-cli/src/main.rs:375">
P2: The mTLS path rebuilds the root certificate store with bare `.unwrap()` calls, re-reading the CA cert file that was already parsed above. If any step fails (e.g., file permission change, invalid cert), the CLI panics instead of printing a clean error. Build the `ClientConfig` once by deferring the `.with_root_certificates(root_store)` call until you know whether `.with_client_auth_cert()` or `.with_no_client_auth()` is needed, avoiding both the duplicate I/O and the panics.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/fila-cli/src/main.rs Outdated
@@ -0,0 +1,59 @@
# Story 20.4: CLI & Cluster Inter-Node Migration

Status: ready-for-dev
@cubic-dev-ai cubic-dev-ai Bot Mar 31, 2026

P3: Story status should remain review while the PR is open; ready-for-dev doesn’t match the repo’s story workflow for open PRs.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At _bmad-output/implementation-artifacts/stories/20-4-cli-cluster-inter-node-migration.md, line 3:

<comment>Story status should remain `review` while the PR is open; `ready-for-dev` doesn’t match the repo’s story workflow for open PRs.</comment>

<file context>
@@ -0,0 +1,59 @@
+# Story 20.4: CLI & Cluster Inter-Node Migration
+
+Status: ready-for-dev
+
+## Story
</file context>

@vieiralucas force-pushed the feat/20.3-rust-sdk-binary-protocol branch from ef72426 to 6d7d5fe on April 1, 2026 13:30
Base automatically changed from feat/20.3-rust-sdk-binary-protocol to main on April 2, 2026 11:21
@vieiralucas force-pushed the feat/20.4-cli-cluster-migration branch from 2dbd6fc to b9c1f03 on April 2, 2026 11:29
- add cluster write forwarding to enqueue/ack/nack handlers so writes
  go through Raft consensus (matches gRPC service behavior)
- add cluster-aware create/delete queue through meta Raft
- enrich get_stats and list_queues with Raft leader/replication info
- fix opcode shift: ConsumeOk at 0x13, Delivery at 0x14, etc.
- validate ACL permission kinds in CLI (produce/consume/admin only)
- fix lua and visibility_timeout tests to use binary_addr
- all 52 e2e tests pass including 7 cluster tests
@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 7 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/fila-server/src/binary_handlers.rs">

<violation number="1" location="crates/fila-server/src/binary_handlers.rs:146">
P2: Error description text is leaked into the `message_id` field. Clients parsing this response will receive an error string where they expect a UUID (or empty string). The single-node error paths and the wildcard arm in the same function all correctly use `String::new()`.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/fila-server/src/binary_handlers.rs Outdated
Comment thread crates/fila-cli/src/main.rs
Comment thread crates/fila-cli/src/main.rs Outdated
- implement AsyncRead/AsyncWrite on CLI Stream enum (same pattern as
  server and SDK — third time this feedback was given)
- remove bare .unwrap() in mTLS path, reuse root_store instead of
  re-reading CA cert file (cubic P2)
- fix error description leaked into message_id field in cluster
  enqueue error path (cubic P2)
- validate ACL permission kinds in CLI
- use try_send instead of send.await for delivery frames in the
  background reader to avoid blocking when the channel is full,
  which prevented processing of response frames (enqueue, ack)
  and caused deadlocks in benchmarks and drain-then-consume patterns
- server: drain all existing consumers when a new Consume arrives
  on the same connection (one active consumer per connection)
- consume(): cancel existing subscription inline before starting
  a new one (no sleep, proper state cleanup)
- remove state cleanup from ConsumeStream::Drop (raced with next
  consume() call)
- add double_consume_does_not_hang test
- fix latency bench to reuse same stream for drain and measurement
replace try_send (drops messages) with overflow buffer architecture:
- delivery frames: try_send to channel, on Full push to VecDeque overflow
- overflow flushed at start of each reader loop iteration
- response frames (enqueue/ack/nack results) always go to oneshots
  immediately — never blocked by delivery backpressure
- TCP reads always continue so response frames are never starved
- server-side delivery channel provides natural backpressure

this fixes:
- deadlock when delivery channel is full and client tries enqueue/ack
- message loss from try_send dropping deliveries
- benchmark hangs on latency drain and fairness accuracy
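
The overflow-buffer approach described in the commit message above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the fila-sdk code: it uses `std::sync::mpsc::sync_channel` as a stand-in for whatever async channel the SDK uses, and `Delivery`, `DeliveryRouter`, `route`, and `flush_overflow` are hypothetical names. Queuing new deliveries behind already-parked ones (to keep ordering) is also an assumption of this sketch.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{SyncSender, TrySendError};

/// Hypothetical stand-in for a delivery frame.
#[derive(Debug, PartialEq)]
struct Delivery(u64);

/// Routes deliveries to a bounded channel, parking them in an
/// overflow queue instead of blocking or dropping when it is full.
struct DeliveryRouter {
    tx: SyncSender<Delivery>,
    overflow: VecDeque<Delivery>,
}

impl DeliveryRouter {
    fn new(tx: SyncSender<Delivery>) -> Self {
        Self { tx, overflow: VecDeque::new() }
    }

    /// Called at the start of each reader loop iteration: move parked
    /// deliveries into the channel, oldest first, until it fills up again.
    fn flush_overflow(&mut self) {
        while let Some(d) = self.overflow.pop_front() {
            match self.tx.try_send(d) {
                Ok(()) => {}
                Err(TrySendError::Full(d)) => {
                    // Channel is full again: put the item back and stop.
                    self.overflow.push_front(d);
                    break;
                }
                Err(TrySendError::Disconnected(_)) => {
                    // Consumer is gone: nothing left to deliver to.
                    self.overflow.clear();
                    break;
                }
            }
        }
    }

    /// Called for each delivery frame read from the socket. Never blocks.
    fn route(&mut self, d: Delivery) {
        // Assumption: keep ordering by queuing behind parked deliveries.
        if !self.overflow.is_empty() {
            self.overflow.push_back(d);
            return;
        }
        match self.tx.try_send(d) {
            Ok(()) => {}
            Err(TrySendError::Full(d)) => self.overflow.push_back(d),
            Err(TrySendError::Disconnected(_)) => {}
        }
    }
}
```

Because `route` never blocks, the reader loop can keep handing response frames to their waiters even while the consumer is slow. The cost is that the overflow queue itself is unbounded in this sketch.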
@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/fila-sdk/src/client.rs">

<violation number="1" location="crates/fila-sdk/src/client.rs:876">
P1: `DELIVERY_OVERFLOW_HIGH_WATER` is declared but never checked in any code path. The doc comment promises TCP reads are paused when overflow exceeds this threshold, but the step-3 comment says "We always read, even if the overflow is large." The backpressure mechanism described in the doc is not implemented, so the overflow `VecDeque` can grow without bound under sustained consumer slowness.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/fila-sdk/src/client.rs Outdated
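
One way the documented threshold from the P1 finding above could be enforced is a simple check before each socket read. This is only a sketch: the constant's value, the `ReadDecision` enum, and the `read_decision` function are assumptions for illustration, and the thread does not show how (or whether) the final code resolved the finding.

```rust
/// Hypothetical value; the real threshold in fila-sdk is not shown in this thread.
const DELIVERY_OVERFLOW_HIGH_WATER: usize = 1024;

/// What the reader loop should do on its next iteration.
#[derive(Debug, PartialEq)]
enum ReadDecision {
    /// Overflow is below the threshold: keep reading frames from TCP.
    ReadSocket,
    /// Overflow reached the threshold: skip the read and only flush,
    /// letting TCP flow control push back on the server.
    PauseReads,
}

/// Decides whether to read from the socket, given how many deliveries
/// are currently parked in the overflow queue.
fn read_decision(overflow_len: usize) -> ReadDecision {
    if overflow_len >= DELIVERY_OVERFLOW_HIGH_WATER {
        ReadDecision::PauseReads
    } else {
        ReadDecision::ReadSocket
    }
}
```

Pausing reads bounds memory but also delays response frames that arrive behind the backlog, which is the starvation the earlier commit set out to avoid, so a check like this trades one risk for the other.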
@vieiralucas merged commit f026de2 into main on Apr 3, 2026
8 checks passed
@vieiralucas deleted the feat/20.4-cli-cluster-migration branch on April 3, 2026 19:10