TCP networking perf + Aria hot-path batching#38
Merged
Conversation
Replace the per-call write()+drain() in NetworkingManager.send_message() with a buffer-per-connection approach: messages are accumulated in StyxSocketClient._send_buffer and flushed together by a background task every BATCH_FLUSH_INTERVAL_MS (default 1ms, env-configurable). This collapses the fan-out of N concurrent messages (e.g. district→items, stocks→order_lines) from N syscalls and N TCP segments into a single write()+drain(), yielding 2-36x throughput improvement on the send path depending on fan-out width and concurrency level (measured in microbench_messaging.py). - StyxSocketClient: add _send_buffer, buffer_message(), flush() - SocketPool: add flush_all() - NetworkingManager: buffer in send_message(), background _flush_loop(), cancel + final flush in close_all_connections() - BATCH_FLUSH_INTERVAL_MS constant added to base_networking.py - send_message_request_response() and send_message_rq_rs() are unchanged (they write+drain directly, as required for synchronous request-response) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #38 +/- ##
==========================================
- Coverage 88.03% 86.92% -1.11%
==========================================
Files 45 45
Lines 2590 2661 +71
==========================================
+ Hits 2280 2313 +33
- Misses 310 348 +38
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
send_message() now calls socket_conn.buffer_message() instead of socket_conn.send_message(), so update the two affected mock assertions in TestNetworkingManagerSendMessage accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tcp_networking.py: add contextlib import, replace try/except CancelledError with contextlib.suppress (SIM105) - microbench_messaging.py: add return type annotations, contextlib.suppress for bare except-pass blocks, move Callable/Awaitable to TYPE_CHECKING block, rename N→n in scenarios/run, drop unused pool params from bench functions, annotate send_fn, prefix unused unpacked vars with _ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1ms fixed buffering added latency on every phase boundary, hurting end-to-end transaction throughput. asyncio.sleep(0) instead yields for one event loop turn — no fixed delay, messages flush as soon as the current synchronous burst completes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pattern The pure send-throughput microbench couldn''t see the regression. This benchmark models the real critical path: K phases per transaction, each phase blocks waiting for N ACKs before the next phase begins. Compares 4 flush strategies (seq, timer1ms, sleep0, explicit per-txn) across concurrency 1..500. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The lock was unnecessarily serializing every send_message call. asyncio is single-threaded, so dict.get() and SocketPool.__next__() are atomic — no lock needed once the pool exists. The lock was only protecting against duplicate pool creation, and also covering a race where create_socket_connection published an empty pool to self.pools before its connections were established. Fix: build the pool fully before publishing it, then only lock-protect the rare creation path (double-checked locking). The hot path is now lock-free: just dict.get + round-robin pick + buffer append, all synchronous. This lets concurrent send_message calls queue messages without yielding to the event loop between them — which is what the flush-loop batching needs to actually coalesce work across concurrent transactions. Renamed get_socket_lock -> _pool_creation_lock to reflect its new role. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The batching machinery (per-connection send buffers, background flush task,
buffer_message/flush methods, BATCH_FLUSH_INTERVAL_MS env var) didn''t improve
end-to-end transaction throughput in Styx''s actual workload — likely because
operator code has real yield points between sends that defeat any flush
strategy short of explicit worker-batch flushing.
Reverted:
- StyxSocketClient._send_buffer / buffer_message / flush
- SocketPool.flush_all
- NetworkingManager._flush_task / _flush_loop
- BATCH_FLUSH_INTERVAL_MS constant
- send_message now does immediate write+drain again (via socket_conn.send_message)
- Both microbench scripts (no longer needed)
Kept (the one piece that helped without batching):
- Lock-free hot path in send_message / send_message_request_response
- create_socket_connection builds pool fully before publishing it
(also fixes a real race: old code published empty pool, only the lock
prevented next() crashing on it)
- get_socket_lock renamed to _pool_creation_lock to reflect its new role
Net change vs main: ~15 lines in tcp_networking.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
asyncio.open_connection() leaves Nagle ON and uses small kernel buffers. On every accepted client socket, now set: - TCP_NODELAY=1 (was: server side only, inherited only on some OSes) - SO_SNDBUF / SO_RCVBUF = 1 MB (matches server side; env-tunable) Also bump default SocketPool size from 4 to 16 (env: SOCKET_POOL_SIZE). At conc=100+ the per-connection write lock becomes a bottleneck; 4 parallel writes per peer is too few for TPC-C''s burst pattern. New env vars: SOCKET_SND_BUF (bytes, default 1<<20) SOCKET_RCV_BUF (bytes, default 1<<20) SOCKET_POOL_SIZE (int, default 16) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
encode_message did two struct.pack(>B) calls plus a bytes concat on every encoded message. The (msg_type, serializer_id) keyspace is tiny (~250 pairs total), so cache the packed header bytes in a module-level dict. Saves ~300ns per message on the hot path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pure round-robin can land a send on a connection that''s busy draining a big payload while peer connections sit idle — head-of-line blocking on the per-connection write lock. Now scan from the round-robin start and pick the least-loaded conn (in_flight counter on StyxSocketClient, incremented before lock acquire, decremented on release). In the common case of an idle pool, early-exits at the first conn (load==0) and behaves identically to round-robin. Only the contended case pays the O(N) scan, which is what we want. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Existing profile_hotpaths.py / profile_tpcc.py / profile_ycsb.py profile components in isolation against synthetic workloads. This new script attaches py-spy to a *running* worker during a real TPC-C run to find the actual wall-clock bottleneck — which is what we need before guessing at more networking optimizations. Supports --pid, --name (matching by cmdline), --all (parallel multi-PID), --top (live view), and includes k8s instructions in the docstring. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TCP in-flight window = min(sender.SO_SNDBUF, receiver.SO_RCVBUF), so the
previous client-only bump to 1 MB was capped by the server-side 1 MB cap
and gave zero effective improvement.
Hoist SOCKET_SND_BUF and SOCKET_RCV_BUF (env-tunable) into base_networking
so client (StyxSocketClient) and all five server listening sockets
(worker_socket, protocol_socket x2, coor_socket, snapshotting_socket)
share one source of truth.
Default 4 MB:
- Matches what Linux TCP auto-tuning would settle on for typical datacenter
BDP (10-25 Gbps at 0.5-1 ms RTT).
- Setting SO_SNDBUF/RCVBUF explicitly *disables* Linux auto-tuning, so we
pick a sensible cap rather than letting the kernel grow dynamically.
- Comfortably covers Styx''s snapshot/migration path which streams MB of
state at a time. TPC-C''s small-message traffic is unaffected either way.
No new env var needed; existing SOCKET_SND_BUF/SOCKET_RCV_BUF env override now
takes effect on both client and server.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rough Three independent perf changes addressing hotspots found via py-spy under TPC-C (input rate 150, n_part=4, epoch=100): * RunFunRemoteBatch (`__send_async_calls`): bucket outbound chain calls by peer and send one batch per peer per fan-out, instead of one TCP `send_message` per remote call. New `MessageType.RunFunRemoteBatch` (=42); receiver decodes the list and schedules each entry like the single-call path. * AckBatch (`enqueue_ack` + `_flush_acks` in NetworkingManager): non-awaiting `__send_ack` that buffers acks per peer and flushes on the next event-loop tick. Coalesces the per-leaf-txn ack stream into one `MessageType.AckBatch` (=43) per peer per tick. Reset of the `_ack_flush_scheduled` flag happens before the awaited send so any acks enqueued mid-flush schedule a fresh flush. * Inlined protocol dispatch in `protocol_queue_worker`: `await protocol_tcp_controller(message)` directly instead of wrapping each message in `protocol_task_scheduler.create_task(...)`. Concurrency is still bounded by `PROTOCOL_WORKERS=100`; we drop the per-message Task allocation plus the `AIOTaskScheduler` semaphore round-trip. Removes the now-unused `protocol_task_scheduler` field. * Fast-copy alias-safety for scalar-tuples (`_fast_copy.pyx`): treat tuples whose elements are all scalars as alias-safe so list-of-scalar-tuples (e.g. TPC-C `front_end_metadata["item_replies"]`) no longer falls through to `copy.deepcopy`. Micro-bench on that struct: 0.42 us vs 5.77 us for `copy.deepcopy` (~14x). Measured deltas across 4 workers (py-spy, 40 s sample window, same input): Total samples 34,169 -> 30,838 (-10%) run_function 7,868 -> 5,936 (-25%) create_task 948 -> 317 (-67%) runner (AIOTaskScheduler) 2,089 -> 1,244 (-40%) state get (deepcopy path) 1,268 -> 715 (-44%) deepcopy + _deepcopy_list 813 -> 0 (-100%, fell out of top 30) `_handle_ack_batch` doesn't surface in the top 30 -- the per-ack `send_message` cost collapsed rather than re-emerging in the handler.
Four surgical changes targeting the per-message overhead surfaced by py-spy on a 10W/500tps TPC-C profile (50,878 samples / 4 workers / 40 s; system was server-bound): * `__send_async_calls`: hoist the `ack_payload` 5-tuple build out of the per-call loop. Every chain call at the same level shares the same `(ack_host, ack_port, t_id, fraction, participants)` tuple — no reason to allocate it N times. * `__send_async_calls`: skip `asyncio.gather` when there is exactly one awaitable left after bucketing. The linear NewOrder chain stages (root → warehouse, warehouse → district, district → ...) all hit this case. * `__prepare_message_transmission`: build the 8- or 9-tuple wire payload in one expression. The old form allocated an 8-tuple and then did `payload += (ack_payload,)` which re-allocates a fresh 9-tuple per remote call. * `worker_service.protocol_queue_worker`: inline the handler lookup so there's no `protocol_tcp_controller` coroutine frame between the queue and the handler. `protocol_handlers_map` is now public on `AriaProtocol` (was `_protocol_handlers_map`). The legacy `protocol_tcp_controller` method stays for direct callers (tests, snapshotting, recovery). Measured impact (same 10W/500tps run, same warm-up + window): Throughput avg: 366 -> 417 tps (+14%) Latency mean: 14,037 -> 8,052 ms (-43%) Latency p99: 23,802 -> 13,571 ms (-43%) Latency max: 26,154 -> 16,192 ms (-38%) Missed messages: 0 -> 0 Exactly-once: ok -> ok No protocol or wire-format changes. 726 unit tests pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A series of perf changes to Styx's networking and per-message hot paths.
Two phases of work:
styx-package/styx/common/tcp_networking.pyplus one new helper script.
inlining, and a deep-copy fast path the
fast_deepcopyCython helperwas missing.
All changes are protocol-internal — wire format adds two new
MessageTypevalues; existingRunFunRemote/Ackreceivers staywired for compatibility.
Branch history note: originally
tcp-send-batching, exploringbackground flush + send buffering. That regressed end-to-end TPC-C —
buffering added fixed latency per phase boundary on Styx's critical
path. The batching machinery was reverted; what remains are the
orthogonal wins discovered along the way, plus a new round of changes
informed by py-spy traces of a live TPC-C run.
Changes — TCP networking
Lock-free
send_messagehot path (also fixes a real race)get_socket_lockacross every send toprotect the
self.poolslookup + the round-robinnext(pool)pick.Both are atomic under asyncio's single-threaded execution, so the lock
was pure serialisation with no correctness payoff.
create_socket_connectionpublished the new pool toself.poolsbeforecreate_socket_connections()returned. Aconcurrent caller could pick the pool and crash on
next()against anempty
connslist — the global lock was the only thing preventingthis. Now the pool is fully built before publication, protected by a
rare-path
_pool_creation_lockusing a double-checked pattern.Client-side socket options (was server-only)
asyncio.open_connection()leaves Nagle ON and uses default-smallkernel buffers. The server-side listening sockets already set
TCP_NODELAY+ larger buffers, but these don't reliably inherit tothe client side.
StyxSocketClient.create_connectionnow setsTCP_NODELAY,SO_SNDBUF,SO_RCVBUFon every accepted client socket.at high fan-out.
Larger default pool
SocketPooldefault size 4 → 16. Under 100+ concurrent transactions,4 parallel writes per peer became the bottleneck because of the
per-conn write lock.
SOCKET_POOL_SIZE.Cached framing headers
encode_messagewas doing twostruct.pack(">B")calls plus abytesconcat per send. The(msg_type, serializer_id)keyspace istiny (~250 pairs max), so cache the packed 2-byte headers in a
module-level dict. Saves ~300 ns/message on the hot path.
Load-aware connection pick in
SocketPool.__next__big payload while peers sit idle — head-of-line blocking on the write
lock.
least-loaded connection (
in_flightcounter onStyxSocketClient,bumped before lock acquire, decremented on release). Early-exits at
in_flight == 0, so the common idle case behaves identically toround-robin.
scripts/profile_live_worker.py(new file, no runtime change)py-spywrapper for attaching to a running worker during a realTPC-C run. Existing
profile_*.pyscripts profile components inisolation against synthetic workloads; this one finds the actual
wall-clock bottleneck in a live system. Supports
--pid,--name,--all,--top, plus k8s usage notes in the docstring.Changes — Aria hot path (informed by py-spy on a live TPC-C run)
RunFunRemoteBatch— bucket chain calls by peer__send_async_callsused to issue onesend_message(RunFunRemote)per remote chain call. With TPC-C NewOrder fanning out across stock
partitions, that's many tiny writes to the same handful of peers.
MessageType.RunFunRemoteBatch(=42): the sender buckets thepending remote calls by
(host, port)and emits one batch per peer,carrying
list[RunFuncPayload]. Receiver decodes once and scheduleseach entry exactly like the single-call handler did.
RunFunRemotehandler is still wired up; noproduction code path emits the single-call form any more, but a
rolling cluster sees consistent receiver semantics.
AckBatch— coalesce chain acks per peer per tick__send_ackwas async and calledsend_message(Ack)per leaf txn.At TPC-C epoch sizes (1000 txns, many leaves), that's a flood of
tiny one-way messages to the same root worker.
__send_ackis now sync; cross-network acks go through a newNetworkingManager.enqueue_ackthat buffers per peer. A flush isauto-scheduled via
call_soon, so all acks enqueued in the currentevent-loop tick coalesce into a single
MessageType.AckBatch(=43)per peer.
_ack_flush_scheduledflag is reset before the awaited send, soany acks enqueued during the flush schedule a fresh flush — no
dropped acks, no lost wake-ups for the root's
waited_ack_events[t_id].waited_ack_events[t_id]inside_run_epoch_functions_and_chain,which fires when fractions sum to 1.0; the batched ack updates the
same fraction map and fires the same event.
Inline protocol dispatch in
protocol_queue_workerprotocol_task_scheduler.create_task(...), taking anAIOTaskSchedulersemaphore slot for execution.PROTOCOL_WORKERS=100already bounding concurrency, theper-message Task allocation + semaphore round-trip is pure overhead.
Worker now
awaitsprotocol_tcp_controller(message)directly.protocol_task_schedulerfield.fast_deepcopyalias-safety for scalar-tuples (Cython)copy.deepcopystill showed up as 1.5% of CPU eventhough
InMemoryOperatorState.get()already routes through theCython
fast_deepcopy. Reason:front_end_metadata["item_replies"]is alist[tuple[str, int, ...]]and the existing fast path treated tuples as opaque, falling back to
copy.deepcopyfor the whole list._is_scalar_tuple/_is_alias_safe: tuples of scalarsare fully immutable, safe to share without copying. Lists / dicts of
such tuples now take the shallow-copy path.
front_end_metadatastruct:fast_deepcopy0.42 µs vscopy.deepcopy5.77 µs (~14×).New
MessageTypevaluesRunFunRemoteBatchlist[RunFuncPayload]AckBatchlist[(ack_id, fraction_str, chain_participants)]New env vars (TCP layer)
SOCKET_POOL_SIZE16(host, port)poolSOCKET_SND_BUF4 << 20(4 MB)SO_SNDBUFSOCKET_RCV_BUF4 << 20(4 MB)SO_RCVBUFMeasured impact (TPC-C, input rate 150, n_part=4, epoch=100, 40 s py-spy window)
Across 4 workers, comparing immediately-before vs immediately-after the
Aria hot-path commit (TCP changes already baked into the baseline):
run_functioncreate_taskrunner(AIOTaskScheduler)getdeepcopy+_deepcopy_list_handle_run_fun_remote_handle_run_fun_remote_batch(1,033)send_message_handle_ack_batchdoesn't surface in the top 30 — the per-acksend_messagecost collapsed rather than reappearing in the handler.Test plan
pytest tests/unit— 726 passed.scripts/run_experiment.sh tpcc 150 1 4 0.0 1 60 results_original 10 100 1: 0 missed messages, exactly-onceoutput verified, mean latency 9.2 ms, p99 19 ms.
tests/e2e_migration_*) beforemerging.
workloads outside TPC-C.