TCP networking perf + Aria hot-path batching by kPsarakis · Pull Request #38 · delftdata/styx

kPsarakis · 2026-05-16T14:35:15Z

Summary

A series of perf changes to Styx's networking and per-message hot paths.
Two phases of work:

TCP networking layer — styx-package/styx/common/tcp_networking.py
plus one new helper script.
Aria hot path — chain-call and ack batching, protocol-dispatch
inlining, and a deep-copy fast path the fast_deepcopy Cython helper
was missing.

All changes are protocol-internal — wire format adds two new
MessageType values; existing RunFunRemote / Ack receivers stay
wired for compatibility.

Branch history note: originally tcp-send-batching, exploring
background flush + send buffering. That regressed end-to-end TPC-C —
buffering added fixed latency per phase boundary on Styx's critical
path. The batching machinery was reverted; what remains are the
orthogonal wins discovered along the way, plus a new round of changes
informed by py-spy traces of a live TPC-C run.

Changes — TCP networking

Lock-free `send_message` hot path (also fixes a real race)

The old code held a global get_socket_lock across every send to
protect the self.pools lookup + the round-robin next(pool) pick.
Both are atomic under asyncio's single-threaded execution, so the lock
was pure serialisation with no correctness payoff.
The old create_socket_connection published the new pool to
self.pools before create_socket_connections() returned. A
concurrent caller could pick the pool and crash on next() against an
empty conns list — the global lock was the only thing preventing
this. Now the pool is fully built before publication, protected by a
rare-path _pool_creation_lock using a double-checked pattern.
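
The build-then-publish pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in tcp_networking.py: `PoolRegistry`, `_build_pool`, `get_pool` and the `builds` counter are hypothetical names, and strings stand in for real connections.

```python
import asyncio


class PoolRegistry:
    """Hypothetical stand-in for the part of NetworkingManager that owns pools."""

    def __init__(self, pool_size: int = 4) -> None:
        self.pools: dict[tuple[str, int], list[str]] = {}
        self.pool_size = pool_size
        self._pool_creation_lock = asyncio.Lock()
        self.builds = 0  # how many times a pool was actually built

    async def _build_pool(self, host: str, port: int) -> list[str]:
        # Placeholder for opening real connections. Each await is a yield
        # point, which is where a half-built pool could otherwise be observed.
        self.builds += 1
        conns = []
        for i in range(self.pool_size):
            await asyncio.sleep(0)
            conns.append(f"{host}:{port}#{i}")
        return conns

    async def get_pool(self, host: str, port: int) -> list[str]:
        key = (host, port)
        pool = self.pools.get(key)  # hot path: no lock, no await
        if pool is not None:
            return pool
        async with self._pool_creation_lock:  # rare path
            pool = self.pools.get(key)  # second check: another task may have won
            if pool is None:
                pool = await self._build_pool(host, port)
                self.pools[key] = pool  # publish only once fully built
            return pool


async def demo() -> tuple[int, set[int]]:
    registry = PoolRegistry()
    pools = await asyncio.gather(
        *(registry.get_pool("worker-1", 5000) for _ in range(20))
    )
    return registry.builds, {len(p) for p in pools}
```

In this sketch, twenty concurrent callers trigger a single build, and none of them can observe a partially filled pool because the dict entry is written only after `_build_pool` returns.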

Client-side socket options (was server-only)

asyncio.open_connection() leaves Nagle ON and uses default-small
kernel buffers. The server-side listening sockets already set
TCP_NODELAY + larger buffers, but these don't reliably inherit to
the client side.
StyxSocketClient.create_connection now sets TCP_NODELAY,
SO_SNDBUF, SO_RCVBUF on every client socket it opens.
Buffers bumped to 4 MB on both ends to remove window-scale bottlenecks
at high fan-out.
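
As a rough illustration of the socket options involved (not the actual StyxSocketClient code), the helper below applies them to a plain socket. `tune_socket` is a hypothetical name, and the 4 MB constants simply mirror the defaults listed in this PR. The kernel may clamp the requested buffer sizes to its configured maxima, so the sketch does not check the resulting buffer values.

```python
import socket

SOCKET_SND_BUF = 4 << 20  # 4 MB, mirroring the defaults in this PR
SOCKET_RCV_BUF = 4 << 20


def tune_socket(sock: socket.socket) -> None:
    """Disable Nagle and request larger kernel buffers on one TCP socket."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, SOCKET_SND_BUF)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, SOCKET_RCV_BUF)


# With asyncio streams, the underlying socket is reachable from the writer:
#   reader, writer = await asyncio.open_connection(host, port)
#   tune_socket(writer.get_extra_info("socket"))

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tune_socket(sock)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```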

Larger default pool

SocketPool default size 4 → 16. Under 100+ concurrent transactions,
4 parallel writes per peer became the bottleneck because of the
per-conn write lock.
Env-tunable via SOCKET_POOL_SIZE.

Cached framing headers

encode_message was doing two struct.pack(">B") calls plus a
bytes concat per send. The (msg_type, serializer_id) keyspace is
tiny (~250 pairs max), so cache the packed 2-byte headers in a
module-level dict. Saves ~300 ns/message on the hot path.

Load-aware connection pick in `SocketPool.__next__`

Pure round-robin can land a send on a connection currently draining a
big payload while peers sit idle — head-of-line blocking on the write
lock.
New pick starts from the running round-robin index and returns the
least-loaded connection (in_flight counter on StyxSocketClient,
bumped before lock acquire, decremented on release). Early-exits at
in_flight == 0, so the common idle case behaves identically to
round-robin.
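
One way such a pick could look is sketched below. `Conn` and `LoadAwarePool` are hypothetical stand-ins for StyxSocketClient and SocketPool, and this version exits early at any idle connection found during the scan, which may differ in detail from the real implementation.

```python
class Conn:
    """Stand-in for a pooled connection with an in-flight send counter."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.in_flight = 0


class LoadAwarePool:
    def __init__(self, conns: list[Conn]) -> None:
        self.conns = conns
        self._index = 0

    def __next__(self) -> Conn:
        n = len(self.conns)
        start = self._index
        self._index = (start + 1) % n  # keep the round-robin cursor moving
        best = self.conns[start]
        if best.in_flight == 0:
            return best  # idle case: identical to plain round-robin
        for offset in range(1, n):
            cand = self.conns[(start + offset) % n]
            if cand.in_flight == 0:
                return cand  # early exit on the first idle connection
            if cand.in_flight < best.in_flight:
                best = cand
        return best
```

Only the contended case pays for the linear scan; with an idle pool the method returns after one comparison.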

`scripts/profile_live_worker.py` (new file, no runtime change)

py-spy wrapper for attaching to a running worker during a real
TPC-C run. Existing profile_*.py scripts profile components in
isolation against synthetic workloads; this one finds the actual
wall-clock bottleneck in a live system. Supports --pid, --name,
--all, --top, plus k8s usage notes in the docstring.

Changes — Aria hot path (informed by py-spy on a live TPC-C run)

`RunFunRemoteBatch` — bucket chain calls by peer

__send_async_calls used to issue one send_message(RunFunRemote)
per remote chain call. With TPC-C NewOrder fanning out across stock
partitions, that's many tiny writes to the same handful of peers.
New MessageType.RunFunRemoteBatch (=42): the sender buckets the
pending remote calls by (host, port) and emits one batch per peer,
carrying list[RunFuncPayload]. Receiver decodes once and schedules
each entry exactly like the single-call handler did.
Wire-compat: the legacy RunFunRemote handler is still wired up; no
production code path emits the single-call form any more, but a
rolling cluster sees consistent receiver semantics.
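
The bucketing step can be sketched as below. The payloads are plain strings here rather than RunFuncPayload objects, and `bucket_by_peer` is an illustrative name, not the helper used inside `__send_async_calls`.

```python
from collections import defaultdict

Peer = tuple[str, int]  # (host, port)


def bucket_by_peer(calls: list[tuple[Peer, object]]) -> dict[Peer, list[object]]:
    """Group pending remote calls so each peer receives one batch message."""
    buckets: dict[Peer, list[object]] = defaultdict(list)
    for peer, payload in calls:
        buckets[peer].append(payload)  # preserves per-peer call order
    return dict(buckets)


pending = [
    (("worker-1", 5000), "stock:1"),
    (("worker-2", 5000), "stock:2"),
    (("worker-1", 5000), "stock:3"),
    (("worker-1", 5000), "stock:4"),
    (("worker-2", 5000), "stock:5"),
]
batches = bucket_by_peer(pending)
```

Five pending calls collapse into two sends in this example, one per peer.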

`AckBatch` — coalesce chain acks per peer per tick

__send_ack was async and called send_message(Ack) per leaf txn.
At TPC-C epoch sizes (1000 txns, many leaves), that's a flood of
tiny one-way messages to the same root worker.
__send_ack is now sync; cross-network acks go through a new
NetworkingManager.enqueue_ack that buffers per peer. A flush is
auto-scheduled via call_soon, so all acks enqueued in the current
event-loop tick coalesce into a single MessageType.AckBatch (=43)
per peer.
The _ack_flush_scheduled flag is reset before the awaited send, so
any acks enqueued during the flush schedule a fresh flush — no
dropped acks, no lost wake-ups for the root's
waited_ack_events[t_id].
Safe against the chain-completion barrier: the root awaits
waited_ack_events[t_id] inside _run_epoch_functions_and_chain,
which fires when fractions sum to 1.0; the batched ack updates the
same fraction map and fires the same event.
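
A simplified sketch of the per-tick coalescing follows. `enqueue_ack`, `_flush_acks` and `_ack_flush_scheduled` follow the names in this PR, but the bodies are assumptions; `AckBatcher`, `_start_flush`, `_send` and the `sent` list are hypothetical and stand in for NetworkingManager and the wire.

```python
import asyncio
from collections import defaultdict


class AckBatcher:
    def __init__(self) -> None:
        self._pending = defaultdict(list)  # peer -> buffered acks
        self._ack_flush_scheduled = False
        self._tasks = set()  # keep references so flush tasks are not collected
        self.sent = []  # (peer, batch) records standing in for AckBatch sends

    def enqueue_ack(self, peer, ack) -> None:
        # Synchronous: buffer the ack and schedule at most one flush per tick.
        self._pending[peer].append(ack)
        if not self._ack_flush_scheduled:
            self._ack_flush_scheduled = True
            asyncio.get_running_loop().call_soon(self._start_flush)

    def _start_flush(self) -> None:
        task = asyncio.get_running_loop().create_task(self._flush_acks())
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def _flush_acks(self) -> None:
        # Swap the buffer and reset the flag BEFORE awaiting, so acks that
        # arrive during the send schedule a fresh flush instead of being lost.
        pending, self._pending = self._pending, defaultdict(list)
        self._ack_flush_scheduled = False
        for peer, batch in pending.items():
            await self._send(peer, batch)

    async def _send(self, peer, batch) -> None:
        await asyncio.sleep(0)  # stand-in for the awaited network write
        self.sent.append((peer, batch))


async def demo():
    batcher = AckBatcher()
    root = ("root-worker", 6000)
    for i in range(5):
        batcher.enqueue_ack(root, (i, "1/5", 1))  # same tick: one batch
    await asyncio.sleep(0.01)
    batcher.enqueue_ack(root, (99, "1/1", 1))  # later tick: separate batch
    await asyncio.sleep(0.01)
    return batcher.sent
```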

Inline protocol dispatch in `protocol_queue_worker`

The protocol queue worker pool used to wrap every message in
protocol_task_scheduler.create_task(...), taking an
AIOTaskScheduler semaphore slot for execution.
With PROTOCOL_WORKERS=100 already bounding concurrency, the
per-message Task allocation + semaphore round-trip is pure overhead.
Worker now awaits protocol_tcp_controller(message) directly.
Removed the now-unused protocol_task_scheduler field.
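
The shape of the inlined worker loop might look like the sketch below. The handler and the `demo` harness are hypothetical, and `PROTOCOL_WORKERS` is set to 4 here only to keep the example small (the PR uses 100). Concurrency is bounded by the number of worker tasks alone, with no per-message Task or semaphore.

```python
import asyncio

PROTOCOL_WORKERS = 4  # small value for the demo only


async def protocol_queue_worker(queue: asyncio.Queue, handler) -> None:
    while True:
        message = await queue.get()
        try:
            await handler(message)  # direct await, no Task per message
        finally:
            queue.task_done()


async def demo(n_messages: int = 20):
    queue: asyncio.Queue = asyncio.Queue()
    handled = []
    active = peak = 0

    async def handler(message):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)  # track how many handlers overlap
        await asyncio.sleep(0)
        handled.append(message)
        active -= 1

    workers = [
        asyncio.create_task(protocol_queue_worker(queue, handler))
        for _ in range(PROTOCOL_WORKERS)
    ]
    for i in range(n_messages):
        queue.put_nowait(i)
    await queue.join()
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return sorted(handled), peak
```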

`fast_deepcopy` alias-safety for scalar-tuples (Cython)

Found via py-spy: copy.deepcopy still showed up as 1.5% of CPU even
though InMemoryOperatorState.get() already routes through the
Cython fast_deepcopy. Reason:
front_end_metadata["item_replies"] is a list[tuple[str, int, ...]]
and the existing fast path treated tuples as opaque, falling back to
copy.deepcopy for the whole list.
New helpers _is_scalar_tuple / _is_alias_safe: tuples of scalars
are fully immutable, safe to share without copying. Lists / dicts of
such tuples now take the shallow-copy path.
Micro-bench on the TPC-C front_end_metadata struct:
fast_deepcopy 0.42 µs vs copy.deepcopy 5.77 µs (~14×).
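
A pure-Python approximation of the alias-safety check is sketched below. The real helpers live in Cython; the bodies here, the `_SCALARS` set and the `fast_copy` wrapper are assumptions made for illustration, and the dict branch assumes keys are immutable.

```python
import copy

_SCALARS = (str, bytes, int, float, bool, type(None))


def _is_scalar_tuple(value) -> bool:
    # A tuple whose elements are all scalars is immutable all the way down.
    return type(value) is tuple and all(type(v) in _SCALARS for v in value)


def _is_alias_safe(value) -> bool:
    return type(value) in _SCALARS or _is_scalar_tuple(value)


def fast_copy(value):
    if _is_alias_safe(value):
        return value  # safe to share, nothing to copy
    if type(value) is list and all(_is_alias_safe(v) for v in value):
        return value[:]  # new list, shared immutable elements
    if type(value) is dict and all(_is_alias_safe(v) for v in value.values()):
        return dict(value)  # new dict, shared immutable values
    return copy.deepcopy(value)  # anything else takes the slow path


item_replies = [("item-1", 3), ("item-2", 7)]
copied = fast_copy(item_replies)
nested = [("a", [1])]
nested_copy = fast_copy(nested)
```

The copied list is a distinct object, so appends on one side are not visible on the other, while the tuples themselves are shared. The `nested` case contains a mutable list inside a tuple and therefore still goes through `copy.deepcopy`.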

New `MessageType` values

| value | name | direction | payload |
|---|---|---|---|
| 42 | `RunFunRemoteBatch` | sender → peer | `list[RunFuncPayload]` |
| 43 | `AckBatch` | leaf → root | `list[(ack_id, fraction_str, chain_participants)]` |

New env vars (TCP layer)

| name | default | what |
|---|---|---|
| `SOCKET_POOL_SIZE` | `16` | Connections per `(host, port)` pool |
| `SOCKET_SND_BUF` | `4 << 20` (4 MB) | Per-conn `SO_SNDBUF` |
| `SOCKET_RCV_BUF` | `4 << 20` (4 MB) | Per-conn `SO_RCVBUF` |

Measured impact (TPC-C, input rate 150, n_part=4, epoch=100, 40 s py-spy window)

Across 4 workers, comparing immediately-before vs immediately-after the
Aria hot-path commit (TCP changes already baked into the baseline):

| Function | Before | After | Δ |
|---|---|---|---|
| Total samples | 34,169 | 30,838 | −10% |
| `run_function` | 7,868 | 5,936 | −25% |
| `create_task` | 948 | 317 | −67% |
| `runner` (AIOTaskScheduler) | 2,089 | 1,244 | −40% |
| state `get` | 1,268 | 715 | −44% |
| `deepcopy` + `_deepcopy_list` | 813 | not in top 30 | −100% |
| `_handle_run_fun_remote` | 852 | 0 | replaced by `_handle_run_fun_remote_batch` (1,033) |
| `send_message` | 882 | 776 | −12% |

_handle_ack_batch doesn't surface in the top 30 — the per-ack
send_message cost collapsed rather than reappearing in the handler.

Test plan

Unit suite: pytest tests/unit — 726 passed.
Live TPC-C run via scripts/run_experiment.sh tpcc 150 1 4 0.0 1 60 results_original 10 100 1: 0 missed messages, exactly-once
output verified, mean latency 9.2 ms, p99 19 ms.
Re-run end-to-end migration test (tests/e2e_migration_*) before
merging.
Re-run YCSB-T and Deathstar suites to confirm no regression on
workloads outside TPC-C.

Replace the per-call write()+drain() in NetworkingManager.send_message() with a buffer-per-connection approach: messages are accumulated in StyxSocketClient._send_buffer and flushed together by a background task every BATCH_FLUSH_INTERVAL_MS (default 1ms, env-configurable). This collapses the fan-out of N concurrent messages (e.g. district→items, stocks→order_lines) from N syscalls and N TCP segments into a single write()+drain(), yielding 2-36x throughput improvement on the send path depending on fan-out width and concurrency level (measured in microbench_messaging.py).

- StyxSocketClient: add _send_buffer, buffer_message(), flush()
- SocketPool: add flush_all()
- NetworkingManager: buffer in send_message(), background _flush_loop(), cancel + final flush in close_all_connections()
- BATCH_FLUSH_INTERVAL_MS constant added to base_networking.py
- send_message_request_response() and send_message_rq_rs() are unchanged (they write+drain directly, as required for synchronous request-response)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov · 2026-05-16T14:36:09Z

Codecov Report

❌ Patch coverage is 44.44444% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.92%. Comparing base (b1d735d) to head (6f220aa).

Files with missing lines	Patch %	Lines
styx-package/styx/common/tcp_networking.py	36.20%	74 Missing ⚠️
styx-package/styx/common/stateful_function.py	91.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #38      +/-   ##
==========================================
- Coverage   88.03%   86.92%   -1.11%     
==========================================
  Files          45       45              
  Lines        2590     2661      +71     
==========================================
+ Hits         2280     2313      +33     
- Misses        310      348      +38

Flag	Coverage Δ
coordinator	`93.40% <ø> (ø)`
integration	`9.05% <0.00%> (-0.25%)`	⬇️
styx-package	`82.76% <44.44%> (-1.84%)`	⬇️
worker	`83.69% <ø> (ø)`

Files with missing lines	Coverage Δ
styx-package/styx/common/base_networking.py	`94.81% <100.00%> (+0.07%)`	⬆️
styx-package/styx/common/message_types.py	`100.00% <100.00%> (ø)`
styx-package/styx/common/operator.py	`100.00% <100.00%> (ø)`
styx-package/styx/common/stateful_function.py	`98.36% <91.66%> (-0.78%)`	⬇️
styx-package/styx/common/tcp_networking.py	`67.16% <36.20%> (-8.20%)`	⬇️

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

send_message() now calls socket_conn.buffer_message() instead of socket_conn.send_message(), so update the two affected mock assertions in TestNetworkingManagerSendMessage accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- tcp_networking.py: add contextlib import, replace try/except CancelledError with contextlib.suppress (SIM105)
- microbench_messaging.py: add return type annotations, contextlib.suppress for bare except-pass blocks, move Callable/Awaitable to TYPE_CHECKING block, rename N→n in scenarios/run, drop unused pool params from bench functions, annotate send_fn, prefix unused unpacked vars with _

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1ms fixed buffering added latency on every phase boundary, hurting end-to-end transaction throughput. asyncio.sleep(0) instead yields for one event loop turn — no fixed delay, messages flush as soon as the current synchronous burst completes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…pattern

The pure send-throughput microbench couldn't see the regression. This benchmark models the real critical path: K phases per transaction, each phase blocks waiting for N ACKs before the next phase begins. Compares 4 flush strategies (seq, timer1ms, sleep0, explicit per-txn) across concurrency 1..500.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The lock was unnecessarily serializing every send_message call. asyncio is single-threaded, so dict.get() and SocketPool.__next__() are atomic — no lock needed once the pool exists. The lock was only protecting against duplicate pool creation, and also covering a race where create_socket_connection published an empty pool to self.pools before its connections were established. Fix: build the pool fully before publishing it, then only lock-protect the rare creation path (double-checked locking). The hot path is now lock-free: just dict.get + round-robin pick + buffer append, all synchronous. This lets concurrent send_message calls queue messages without yielding to the event loop between them — which is what the flush-loop batching needs to actually coalesce work across concurrent transactions. Renamed get_socket_lock -> _pool_creation_lock to reflect its new role. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The batching machinery (per-connection send buffers, background flush task, buffer_message/flush methods, BATCH_FLUSH_INTERVAL_MS env var) didn't improve end-to-end transaction throughput in Styx's actual workload, likely because operator code has real yield points between sends that defeat any flush strategy short of explicit worker-batch flushing.

Reverted:
- StyxSocketClient._send_buffer / buffer_message / flush
- SocketPool.flush_all
- NetworkingManager._flush_task / _flush_loop
- BATCH_FLUSH_INTERVAL_MS constant
- send_message now does immediate write+drain again (via socket_conn.send_message)
- Both microbench scripts (no longer needed)

Kept (the one piece that helped without batching):
- Lock-free hot path in send_message / send_message_request_response
- create_socket_connection builds pool fully before publishing it (also fixes a real race: old code published empty pool, only the lock prevented next() crashing on it)
- get_socket_lock renamed to _pool_creation_lock to reflect its new role

Net change vs main: ~15 lines in tcp_networking.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

asyncio.open_connection() leaves Nagle ON and uses small kernel buffers. On every client socket we open, now set:
- TCP_NODELAY=1 (was: server side only, inherited only on some OSes)
- SO_SNDBUF / SO_RCVBUF = 1 MB (matches server side; env-tunable)

Also bump default SocketPool size from 4 to 16 (env: SOCKET_POOL_SIZE). At conc=100+ the per-connection write lock becomes a bottleneck; 4 parallel writes per peer is too few for TPC-C's burst pattern.

New env vars:
- SOCKET_SND_BUF (bytes, default 1<<20)
- SOCKET_RCV_BUF (bytes, default 1<<20)
- SOCKET_POOL_SIZE (int, default 16)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

encode_message did two struct.pack(">B") calls plus a bytes concat on every encoded message. The (msg_type, serializer_id) keyspace is tiny (~250 pairs total), so cache the packed header bytes in a module-level dict. Saves ~300ns per message on the hot path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Pure round-robin can land a send on a connection that's busy draining a big payload while peer connections sit idle: head-of-line blocking on the per-connection write lock. Now scan from the round-robin start and pick the least-loaded conn (in_flight counter on StyxSocketClient, incremented before lock acquire, decremented on release). In the common case of an idle pool, early-exits at the first conn (load==0) and behaves identically to round-robin. Only the contended case pays the O(N) scan, which is what we want. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Existing profile_hotpaths.py / profile_tpcc.py / profile_ycsb.py profile components in isolation against synthetic workloads. This new script attaches py-spy to a *running* worker during a real TPC-C run to find the actual wall-clock bottleneck — which is what we need before guessing at more networking optimizations. Supports --pid, --name (matching by cmdline), --all (parallel multi-PID), --top (live view), and includes k8s instructions in the docstring. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

TCP in-flight window = min(sender.SO_SNDBUF, receiver.SO_RCVBUF), so the previous client-only bump to 1 MB was capped by the server-side 1 MB cap and gave zero effective improvement. Hoist SOCKET_SND_BUF and SOCKET_RCV_BUF (env-tunable) into base_networking so client (StyxSocketClient) and all five server listening sockets (worker_socket, protocol_socket x2, coor_socket, snapshotting_socket) share one source of truth.

Default 4 MB:
- Matches what Linux TCP auto-tuning would settle on for typical datacenter BDP (10-25 Gbps at 0.5-1 ms RTT).
- Setting SO_SNDBUF/RCVBUF explicitly *disables* Linux auto-tuning, so we pick a sensible cap rather than letting the kernel grow dynamically.
- Comfortably covers Styx's snapshot/migration path which streams MB of state at a time. TPC-C's small-message traffic is unaffected either way.

No new env var needed; existing SOCKET_SND_BUF/SOCKET_RCV_BUF env override now takes effect on both client and server.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rough

Four independent perf changes addressing hotspots found via py-spy under TPC-C (input rate 150, n_part=4, epoch=100):

* RunFunRemoteBatch (`__send_async_calls`): bucket outbound chain calls by peer and send one batch per peer per fan-out, instead of one TCP `send_message` per remote call. New `MessageType.RunFunRemoteBatch` (=42); receiver decodes the list and schedules each entry like the single-call path.
* AckBatch (`enqueue_ack` + `_flush_acks` in NetworkingManager): non-awaiting `__send_ack` that buffers acks per peer and flushes on the next event-loop tick. Coalesces the per-leaf-txn ack stream into one `MessageType.AckBatch` (=43) per peer per tick. Reset of the `_ack_flush_scheduled` flag happens before the awaited send so any acks enqueued mid-flush schedule a fresh flush.
* Inlined protocol dispatch in `protocol_queue_worker`: `await protocol_tcp_controller(message)` directly instead of wrapping each message in `protocol_task_scheduler.create_task(...)`. Concurrency is still bounded by `PROTOCOL_WORKERS=100`; we drop the per-message Task allocation plus the `AIOTaskScheduler` semaphore round-trip. Removes the now-unused `protocol_task_scheduler` field.
* Fast-copy alias-safety for scalar-tuples (`_fast_copy.pyx`): treat tuples whose elements are all scalars as alias-safe so list-of-scalar-tuples (e.g. TPC-C `front_end_metadata["item_replies"]`) no longer falls through to `copy.deepcopy`. Micro-bench on that struct: 0.42 us vs 5.77 us for `copy.deepcopy` (~14x).

Measured deltas across 4 workers (py-spy, 40 s sample window, same input):

Total samples 34,169 -> 30,838 (-10%)
run_function 7,868 -> 5,936 (-25%)
create_task 948 -> 317 (-67%)
runner (AIOTaskScheduler) 2,089 -> 1,244 (-40%)
state get (deepcopy path) 1,268 -> 715 (-44%)
deepcopy + _deepcopy_list 813 -> 0 (-100%, fell out of top 30)

`_handle_ack_batch` doesn't surface in the top 30 -- the per-ack `send_message` cost collapsed rather than re-emerging in the handler.

Four surgical changes targeting the per-message overhead surfaced by py-spy on a 10W/500tps TPC-C profile (50,878 samples / 4 workers / 40 s; system was server-bound):

* `__send_async_calls`: hoist the `ack_payload` 5-tuple build out of the per-call loop. Every chain call at the same level shares the same `(ack_host, ack_port, t_id, fraction, participants)` tuple, so there is no reason to allocate it N times.
* `__send_async_calls`: skip `asyncio.gather` when there is exactly one awaitable left after bucketing. The linear NewOrder chain stages (root → warehouse, warehouse → district, district → ...) all hit this case.
* `__prepare_message_transmission`: build the 8- or 9-tuple wire payload in one expression. The old form allocated an 8-tuple and then did `payload += (ack_payload,)` which re-allocates a fresh 9-tuple per remote call.
* `worker_service.protocol_queue_worker`: inline the handler lookup so there's no `protocol_tcp_controller` coroutine frame between the queue and the handler. `protocol_handlers_map` is now public on `AriaProtocol` (was `_protocol_handlers_map`). The legacy `protocol_tcp_controller` method stays for direct callers (tests, snapshotting, recovery).

Measured impact (same 10W/500tps run, same warm-up + window):

Throughput avg: 366 -> 417 tps (+14%)
Latency mean: 14,037 -> 8,052 ms (-43%)
Latency p99: 23,802 -> 13,571 ms (-43%)
Latency max: 26,154 -> 16,192 ms (-38%)
Missed messages: 0 -> 0
Exactly-once: ok -> ok

No protocol or wire-format changes. 726 unit tests pass.

kPsarakis self-assigned this May 16, 2026

kPsarakis and others added 13 commits May 16, 2026 16:40

Add ruff noqa suppression for print statements in microbenchmark

742e69a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ruff format

6910652

fix ruff

4f0bb04

kPsarakis changed the title ~~Add TCP send-side batching to eliminate per-message drain() overhead~~ TCP networking perf: lock-free hot path, client-side NODELAY, load-aware pool May 16, 2026

kPsarakis and others added 2 commits May 17, 2026 11:13

kPsarakis changed the title ~~TCP networking perf: lock-free hot path, client-side NODELAY, load-aware pool~~ TCP networking perf + Aria hot-path batching May 17, 2026

kPsarakis added 2 commits May 17, 2026 20:41

fix ruff

6f220aa

kPsarakis merged commit d646bb8 into main May 18, 2026
7 of 8 checks passed

kPsarakis merged 18 commits into main from tcp-send-batching
