TCP networking perf + Aria hot-path batching #38

Merged: kPsarakis merged 18 commits into main from tcp-send-batching on May 18, 2026

kPsarakis commented May 16, 2026

Summary

A series of perf changes to Styx's networking and per-message hot paths.
Two phases of work:

  1. TCP networking layer: styx-package/styx/common/tcp_networking.py,
    plus one new helper script.
  2. Aria hot path — chain-call and ack batching, protocol-dispatch
    inlining, and a deep-copy fast path the fast_deepcopy Cython helper
    was missing.

All changes are protocol-internal — wire format adds two new
MessageType values; existing RunFunRemote / Ack receivers stay
wired for compatibility.

Branch history note: this branch started as tcp-send-batching, an
exploration of background flush + send buffering. That regressed
end-to-end TPC-C because buffering added fixed latency at every phase
boundary on Styx's critical path. The batching machinery was reverted; what remains are the
orthogonal wins discovered along the way, plus a new round of changes
informed by py-spy traces of a live TPC-C run.

Changes — TCP networking

Lock-free send_message hot path (also fixes a real race)

  • The old code held a global get_socket_lock across every send to
    protect the self.pools lookup + the round-robin next(pool) pick.
    Both are synchronous with no await point, so under asyncio's
    single-threaded execution no other task can interleave with them.
    Once the pool exists, the lock was pure serialisation with no
    correctness payoff.
  • The old create_socket_connection published the new pool to
    self.pools before create_socket_connections() returned. A
    concurrent caller could pick the pool and crash on next() against an
    empty conns list — the global lock was the only thing preventing
    this. Now the pool is fully built before publication, protected by a
    rare-path _pool_creation_lock using a double-checked pattern.
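
A minimal sketch of the "build fully, then publish" pattern with a double-checked asyncio lock. The class name PoolRegistry, the _build_pool helper, and the string placeholders standing in for connections are hypothetical; only _pool_creation_lock and self.pools are names taken from this PR.

```python
import asyncio


class PoolRegistry:
    """Hypothetical stand-in for the part of NetworkingManager that owns pools."""

    def __init__(self, pool_size: int = 16) -> None:
        self.pools: dict[tuple[str, int], list[str]] = {}
        self.pool_size = pool_size
        self._pool_creation_lock = asyncio.Lock()
        self.builds = 0  # how many times a pool was actually built

    async def _build_pool(self, host: str, port: int) -> list[str]:
        # Placeholder for opening real connections. The await is where other
        # tasks get to run while the pool is still incomplete.
        self.builds += 1
        await asyncio.sleep(0)
        return [f"{host}:{port}#{i}" for i in range(self.pool_size)]

    async def get_pool(self, host: str, port: int) -> list[str]:
        key = (host, port)
        pool = self.pools.get(key)  # hot path: no lock, no await
        if pool is not None:
            return pool
        async with self._pool_creation_lock:  # rare path
            pool = self.pools.get(key)  # second check, now under the lock
            if pool is None:
                pool = await self._build_pool(host, port)
                self.pools[key] = pool  # publish only once fully built
            return pool


async def demo() -> tuple[int, int]:
    registry = PoolRegistry(pool_size=4)
    pools = await asyncio.gather(
        *(registry.get_pool("peer", 9000) for _ in range(10))
    )
    return registry.builds, min(len(p) for p in pools)
```

Ten concurrent callers trigger one build, and none of them can observe a partially built pool because the dict entry is written only after the build returns.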

Client-side socket options (was server-only)

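  • asyncio.open_connection() uses default-small kernel buffers. (Stock
    asyncio has set TCP_NODELAY on TCP connections by default since
    Python 3.6, so setting it explicitly is a safeguard rather than a
    behaviour change there.) The server-side listening sockets already
    set TCP_NODELAY + larger buffers, but options on a listening socket
    only reach the sockets it accepts; they never apply to the
    connecting client's socket.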
  • StyxSocketClient.create_connection now sets TCP_NODELAY,
    SO_SNDBUF, SO_RCVBUF on every client socket it opens.
  • Buffers bumped to 4 MB on both ends to remove window-scale bottlenecks
    at high fan-out.
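
As a rough illustration of the client-side options, the sketch below applies them to a plain TCP socket. The helper names tune_socket and effective_buffers are hypothetical, and this is not the code in StyxSocketClient. Note that the kernel may adjust the requested sizes: on Linux the value is doubled for bookkeeping and capped by net.core.wmem_max / net.core.rmem_max.

```python
import socket

SOCKET_BUF_BYTES = 4 << 20  # 4 MB, the default described above


def tune_socket(sock: socket.socket, buf_bytes: int = SOCKET_BUF_BYTES) -> None:
    """Explicitly disable Nagle and request larger kernel buffers."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf_bytes)


def effective_buffers(sock: socket.socket) -> tuple[int, int]:
    """Return the (send, receive) sizes the kernel actually granted."""
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    )
```

With asyncio streams, the underlying socket can be reached via writer.get_extra_info("socket") after asyncio.open_connection() returns, and passed to a helper like tune_socket. Comparing effective_buffers() against the requested size is a quick way to spot a kernel cap that makes the 4 MB request ineffective.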

Larger default pool

  • SocketPool default size 4 → 16. Under 100+ concurrent transactions,
    4 parallel writes per peer became the bottleneck because of the
    per-conn write lock.
  • Env-tunable via SOCKET_POOL_SIZE.

Cached framing headers

  • encode_message was doing two struct.pack(">B") calls plus a
    bytes concat per send. The (msg_type, serializer_id) keyspace is
    tiny (~250 pairs max), so cache the packed 2-byte headers in a
    module-level dict. Saves ~300 ns/message on the hot path.

Load-aware connection pick in SocketPool.__next__

  • Pure round-robin can land a send on a connection currently draining a
    big payload while peers sit idle — head-of-line blocking on the write
    lock.
  • New pick starts from the running round-robin index and returns the
    least-loaded connection (in_flight counter on StyxSocketClient,
    bumped before lock acquire, decremented on release). Early-exits at
    in_flight == 0, so the common idle case behaves identically to
    round-robin.
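
One way to read the pick described above, as a sketch: Conn and LoadAwarePool are hypothetical stand-ins for StyxSocketClient and SocketPool, and the in_flight bookkeeping around the write lock is left out.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    name: str
    in_flight: int = 0  # sends queued on or holding this conn's write lock


class LoadAwarePool:
    def __init__(self, conns: list[Conn]) -> None:
        self.conns = conns
        self._index = 0  # running round-robin position

    def __next__(self) -> Conn:
        n = len(self.conns)
        start = self._index
        self._index = (start + 1) % n
        best = self.conns[start]
        if best.in_flight == 0:
            return best  # idle pool: identical to plain round-robin
        for offset in range(1, n):
            cand = self.conns[(start + offset) % n]
            if cand.in_flight == 0:
                return cand  # early exit on the first idle connection
            if cand.in_flight < best.in_flight:
                best = cand
        return best
```

Only the contended case pays for the O(N) scan; with N = 16 that is a handful of attribute reads.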

scripts/profile_live_worker.py (new file, no runtime change)

  • py-spy wrapper for attaching to a running worker during a real
    TPC-C run. Existing profile_*.py scripts profile components in
    isolation against synthetic workloads; this one finds the actual
    wall-clock bottleneck in a live system. Supports --pid, --name,
    --all, --top, plus k8s usage notes in the docstring.

Changes — Aria hot path (informed by py-spy on a live TPC-C run)

RunFunRemoteBatch — bucket chain calls by peer

  • __send_async_calls used to issue one send_message(RunFunRemote)
    per remote chain call. With TPC-C NewOrder fanning out across stock
    partitions, that's many tiny writes to the same handful of peers.
  • New MessageType.RunFunRemoteBatch (=42): the sender buckets the
    pending remote calls by (host, port) and emits one batch per peer,
    carrying list[RunFuncPayload]. Receiver decodes once and schedules
    each entry exactly like the single-call handler did.
  • Wire-compat: the legacy RunFunRemote handler is still wired up. No
    production code path emits the single-call form any more, but during
    a rolling upgrade a not-yet-upgraded worker that still sends
    RunFunRemote is handled as before.
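
The bucketing step can be sketched as follows. The function name bucket_by_peer and the (host, port, payload) input shape are assumptions for illustration; in the PR the payloads are RunFuncPayload entries.

```python
from collections import defaultdict
from typing import Any

Peer = tuple[str, int]


def bucket_by_peer(calls: list[tuple[str, int, Any]]) -> dict[Peer, list[Any]]:
    """Group pending remote calls so each peer receives a single batch."""
    buckets: dict[Peer, list[Any]] = defaultdict(list)
    for host, port, payload in calls:
        buckets[(host, port)].append(payload)
    return dict(buckets)
```

The sender then issues one send per dict entry, so the number of writes scales with the number of distinct peers rather than with the fan-out width.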

AckBatch — coalesce chain acks per peer per tick

  • __send_ack was async and called send_message(Ack) per leaf txn.
    At TPC-C epoch sizes (1000 txns, many leaves), that's a flood of
    tiny one-way messages to the same root worker.
  • __send_ack is now sync; cross-network acks go through a new
    NetworkingManager.enqueue_ack that buffers per peer. A flush is
    auto-scheduled via call_soon, so all acks enqueued in the current
    event-loop tick coalesce into a single MessageType.AckBatch (=43)
    per peer.
  • The _ack_flush_scheduled flag is reset before the awaited send, so
    any acks enqueued during the flush schedule a fresh flush — no
    dropped acks, no lost wake-ups for the root's
    waited_ack_events[t_id].
  • Safe against the chain-completion barrier: the root awaits
    waited_ack_events[t_id] inside _run_epoch_functions_and_chain,
    which fires when fractions sum to 1.0; the batched ack updates the
    same fraction map and fires the same event.
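
A simplified sketch of the per-tick coalescing, assuming a hypothetical AckBatcher class and a _send_batch stub that records batches instead of writing to a socket. The names enqueue_ack, _flush_acks and _ack_flush_scheduled follow the PR; everything else is illustrative.

```python
import asyncio
from collections import defaultdict

Peer = tuple[str, int]
Ack = tuple[int, str]  # (ack_id, fraction_str), simplified for the sketch


class AckBatcher:
    def __init__(self) -> None:
        self._pending: dict[Peer, list[Ack]] = defaultdict(list)
        self._ack_flush_scheduled = False
        self._tasks: set[asyncio.Task] = set()
        self.sent: list[tuple[Peer, list[Ack]]] = []  # stands in for the wire

    def enqueue_ack(self, peer: Peer, ack: Ack) -> None:
        """Synchronous: buffer the ack and schedule at most one flush."""
        self._pending[peer].append(ack)
        if not self._ack_flush_scheduled:
            self._ack_flush_scheduled = True
            asyncio.get_running_loop().call_soon(self._start_flush)

    def _start_flush(self) -> None:
        task = asyncio.get_running_loop().create_task(self._flush_acks())
        self._tasks.add(task)  # keep a reference until the task finishes
        task.add_done_callback(self._tasks.discard)

    async def _flush_acks(self) -> None:
        batches, self._pending = self._pending, defaultdict(list)
        # Reset before awaiting, so acks enqueued mid-flush schedule a new flush.
        self._ack_flush_scheduled = False
        for peer, acks in batches.items():
            await self._send_batch(peer, acks)

    async def _send_batch(self, peer: Peer, acks: list[Ack]) -> None:
        await asyncio.sleep(0)  # placeholder for the real network send
        self.sent.append((peer, acks))


async def demo() -> list[tuple[Peer, list[Ack]]]:
    batcher = AckBatcher()
    root = ("root", 9000)
    for ack_id in range(3):
        batcher.enqueue_ack(root, (ack_id, "1/3"))
    await asyncio.sleep(0.01)  # let the scheduled flush run
    return batcher.sent
```

The three acks enqueued in the same tick leave as one batch. Swapping the pending map out before the first await is what makes the flag reset safe in this sketch: later acks land in the fresh map and get their own flush.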

Inline protocol dispatch in protocol_queue_worker

  • The protocol queue worker pool used to wrap every message in
    protocol_task_scheduler.create_task(...), taking an
    AIOTaskScheduler semaphore slot for execution.
  • With PROTOCOL_WORKERS=100 already bounding concurrency, the
    per-message Task allocation + semaphore round-trip is pure overhead.
    Worker now awaits protocol_tcp_controller(message) directly.
  • Removed the now-unused protocol_task_scheduler field.
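
The inline-dispatch shape, sketched with a toy handler. The handle function and the results list are hypothetical; PROTOCOL_WORKERS is 100 in the PR and is kept small here so the demo stays readable.

```python
import asyncio

PROTOCOL_WORKERS = 4  # the PR uses 100


async def handle(message: int, results: list[int]) -> None:
    await asyncio.sleep(0)  # placeholder for real protocol handling
    results.append(message * 2)


async def protocol_queue_worker(queue: asyncio.Queue, results: list[int]) -> None:
    while True:
        message = await queue.get()
        try:
            # Awaited inline: no per-message Task, no semaphore round-trip.
            await handle(message, results)
        finally:
            queue.task_done()


async def demo() -> list[int]:
    queue: asyncio.Queue = asyncio.Queue()
    results: list[int] = []
    workers = [
        asyncio.create_task(protocol_queue_worker(queue, results))
        for _ in range(PROTOCOL_WORKERS)
    ]
    for message in range(10):
        queue.put_nowait(message)
    await queue.join()  # wait until every message has been handled
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return sorted(results)
```

Concurrency is bounded by the number of worker tasks alone, which is why a second bound in the form of a scheduler semaphore adds cost without adding protection.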

fast_deepcopy alias-safety for scalar-tuples (Cython)

  • Found via py-spy: copy.deepcopy still showed up as 1.5% of CPU even
    though InMemoryOperatorState.get() already routes through the
    Cython fast_deepcopy. Reason:
    front_end_metadata["item_replies"] is a list[tuple[str, int, ...]]
    and the existing fast path treated tuples as opaque, falling back to
    copy.deepcopy for the whole list.
  • New helpers _is_scalar_tuple / _is_alias_safe: tuples of scalars
    are fully immutable, safe to share without copying. Lists / dicts of
    such tuples now take the shallow-copy path.
  • Micro-bench on the TPC-C front_end_metadata struct:
    fast_deepcopy 0.42 µs vs copy.deepcopy 5.77 µs (~14×).

New MessageType values

| value | name | direction | payload |
|-------|------|-----------|---------|
| 42 | RunFunRemoteBatch | sender → peer | list[RunFuncPayload] |
| 43 | AckBatch | leaf → root | list[(ack_id, fraction_str, chain_participants)] |

New env vars (TCP layer)

| name | default | what |
|------|---------|------|
| SOCKET_POOL_SIZE | 16 | Connections per (host, port) pool |
| SOCKET_SND_BUF | 4 << 20 (4 MB) | Per-conn SO_SNDBUF |
| SOCKET_RCV_BUF | 4 << 20 (4 MB) | Per-conn SO_RCVBUF |
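
For reference, reading these settings with their defaults could look like the sketch below. The function read_socket_settings is hypothetical; the PR only states that the constants live in base_networking and are env-tunable.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def read_socket_settings(env: Mapping[str, str] | None = None) -> dict[str, int]:
    """Read the three TCP-layer settings, falling back to the PR's defaults."""
    env = os.environ if env is None else env
    return {
        "SOCKET_POOL_SIZE": int(env.get("SOCKET_POOL_SIZE", 16)),
        "SOCKET_SND_BUF": int(env.get("SOCKET_SND_BUF", 4 << 20)),
        "SOCKET_RCV_BUF": int(env.get("SOCKET_RCV_BUF", 4 << 20)),
    }
```

Values are plain byte counts, so a 1 MB send buffer would be set as SOCKET_SND_BUF=1048576.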

Measured impact (TPC-C, input rate 150, n_part=4, epoch=100, 40 s py-spy window)

Across 4 workers, comparing immediately-before vs immediately-after the
Aria hot-path commit (TCP changes already baked into the baseline):

| Function | Before | After | Δ |
|----------|--------|-------|---|
| Total samples | 34,169 | 30,838 | −10% |
| run_function | 7,868 | 5,936 | −25% |
| create_task | 948 | 317 | −67% |
| runner (AIOTaskScheduler) | 2,089 | 1,244 | −40% |
| state get | 1,268 | 715 | −44% |
| deepcopy + _deepcopy_list | 813 | not in top 30 | −100% |
| _handle_run_fun_remote | 852 | 0 | replaced by _handle_run_fun_remote_batch (1,033) |
| send_message | 882 | 776 | −12% |

_handle_ack_batch doesn't surface in the top 30 — the per-ack
send_message cost collapsed rather than reappearing in the handler.

Test plan

  • Unit suite: pytest tests/unit — 726 passed.
  • Live TPC-C run via scripts/run_experiment.sh tpcc 150 1 4 0.0 1 60 results_original 10 100 1: 0 missed messages, exactly-once
    output verified, mean latency 9.2 ms, p99 19 ms.
  • Re-run end-to-end migration test (tests/e2e_migration_*) before
    merging.
  • Re-run YCSB-T and Deathstar suites to confirm no regression on
    workloads outside TPC-C.

Replace the per-call write()+drain() in NetworkingManager.send_message()
with a buffer-per-connection approach: messages are accumulated in
StyxSocketClient._send_buffer and flushed together by a background task
every BATCH_FLUSH_INTERVAL_MS (default 1ms, env-configurable).

This collapses the fan-out of N concurrent messages (e.g. district→items,
stocks→order_lines) from N syscalls and N TCP segments into a single
write()+drain(), yielding 2-36x throughput improvement on the send path
depending on fan-out width and concurrency level (measured in
microbench_messaging.py).

- StyxSocketClient: add _send_buffer, buffer_message(), flush()
- SocketPool: add flush_all()
- NetworkingManager: buffer in send_message(), background _flush_loop(),
  cancel + final flush in close_all_connections()
- BATCH_FLUSH_INTERVAL_MS constant added to base_networking.py
- send_message_request_response() and send_message_rq_rs() are unchanged
  (they write+drain directly, as required for synchronous request-response)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kPsarakis self-assigned this May 16, 2026
codecov Bot commented May 16, 2026

Codecov Report

❌ Patch coverage is 44.44444% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.92%. Comparing base (b1d735d) to head (6f220aa).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| styx-package/styx/common/tcp_networking.py | 36.20% | 74 Missing ⚠️ |
| styx-package/styx/common/stateful_function.py | 91.66% | 1 Missing ⚠️ |

@@            Coverage Diff             @@
##             main      #38      +/-   ##
==========================================
- Coverage   88.03%   86.92%   -1.11%     
==========================================
  Files          45       45              
  Lines        2590     2661      +71     
==========================================
+ Hits         2280     2313      +33     
- Misses        310      348      +38     

| Flag | Coverage Δ |
|------|------------|
| coordinator | 93.40% <ø> (ø) |
| integration | 9.05% <0.00%> (-0.25%) ⬇️ |
| styx-package | 82.76% <44.44%> (-1.84%) ⬇️ |
| worker | 83.69% <ø> (ø) |

| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| styx-package/styx/common/base_networking.py | 94.81% <100.00%> (+0.07%) ⬆️ |
| styx-package/styx/common/message_types.py | 100.00% <100.00%> (ø) |
| styx-package/styx/common/operator.py | 100.00% <100.00%> (ø) |
| styx-package/styx/common/stateful_function.py | 98.36% <91.66%> (-0.78%) ⬇️ |
| styx-package/styx/common/tcp_networking.py | 67.16% <36.20%> (-8.20%) ⬇️ |

kPsarakis and others added 13 commits May 16, 2026 16:40
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
send_message() now calls socket_conn.buffer_message() instead of
socket_conn.send_message(), so update the two affected mock assertions
in TestNetworkingManagerSendMessage accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tcp_networking.py: add contextlib import, replace try/except CancelledError
  with contextlib.suppress (SIM105)
- microbench_messaging.py: add return type annotations, contextlib.suppress for
  bare except-pass blocks, move Callable/Awaitable to TYPE_CHECKING block,
  rename N→n in scenarios/run, drop unused pool params from bench functions,
  annotate send_fn, prefix unused unpacked vars with _

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1ms fixed buffering added latency on every phase boundary, hurting
end-to-end transaction throughput. asyncio.sleep(0) instead yields
for one event loop turn — no fixed delay, messages flush as soon
as the current synchronous burst completes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pattern

The pure send-throughput microbench couldn't see the regression. This benchmark
models the real critical path: K phases per transaction, each phase blocks
waiting for N ACKs before the next phase begins. Compares 4 flush strategies
(seq, timer1ms, sleep0, explicit per-txn) across concurrency 1..500.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The lock was unnecessarily serializing every send_message call. asyncio is
single-threaded, so dict.get() and SocketPool.__next__() are atomic — no lock
needed once the pool exists. The lock was only protecting against duplicate
pool creation, and also covering a race where create_socket_connection
published an empty pool to self.pools before its connections were established.

Fix: build the pool fully before publishing it, then only lock-protect the
rare creation path (double-checked locking). The hot path is now lock-free:
just dict.get + round-robin pick + buffer append, all synchronous.

This lets concurrent send_message calls queue messages without yielding to
the event loop between them — which is what the flush-loop batching needs to
actually coalesce work across concurrent transactions.

Renamed get_socket_lock -> _pool_creation_lock to reflect its new role.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The batching machinery (per-connection send buffers, background flush task,
buffer_message/flush methods, BATCH_FLUSH_INTERVAL_MS env var) didn't improve
end-to-end transaction throughput in Styx's actual workload — likely because
operator code has real yield points between sends that defeat any flush
strategy short of explicit worker-batch flushing.

Reverted:
  - StyxSocketClient._send_buffer / buffer_message / flush
  - SocketPool.flush_all
  - NetworkingManager._flush_task / _flush_loop
  - BATCH_FLUSH_INTERVAL_MS constant
  - send_message now does immediate write+drain again (via socket_conn.send_message)
  - Both microbench scripts (no longer needed)

Kept (the one piece that helped without batching):
  - Lock-free hot path in send_message / send_message_request_response
  - create_socket_connection builds pool fully before publishing it
    (also fixes a real race: old code published empty pool, only the lock
    prevented next() crashing on it)
  - get_socket_lock renamed to _pool_creation_lock to reflect its new role

Net change vs main: ~15 lines in tcp_networking.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
asyncio.open_connection() leaves Nagle ON and uses small kernel buffers.
On every accepted client socket, now set:
  - TCP_NODELAY=1 (was: server side only, inherited only on some OSes)
  - SO_SNDBUF / SO_RCVBUF = 1 MB (matches server side; env-tunable)

Also bump default SocketPool size from 4 to 16 (env: SOCKET_POOL_SIZE).
At conc=100+ the per-connection write lock becomes a bottleneck; 4
parallel writes per peer is too few for TPC-C's burst pattern.

New env vars:
  SOCKET_SND_BUF   (bytes, default 1<<20)
  SOCKET_RCV_BUF   (bytes, default 1<<20)
  SOCKET_POOL_SIZE (int,   default 16)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
encode_message did two struct.pack(">B") calls plus a bytes concat on every
encoded message. The (msg_type, serializer_id) keyspace is tiny (~250 pairs
total), so cache the packed header bytes in a module-level dict. Saves
~300ns per message on the hot path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pure round-robin can land a send on a connection that's busy draining a
big payload while peer connections sit idle — head-of-line blocking on
the per-connection write lock. Now scan from the round-robin start and
pick the least-loaded conn (in_flight counter on StyxSocketClient,
incremented before lock acquire, decremented on release).

In the common case of an idle pool, early-exits at the first conn (load==0)
and behaves identically to round-robin. Only the contended case pays the
O(N) scan, which is what we want.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Existing profile_hotpaths.py / profile_tpcc.py / profile_ycsb.py profile
components in isolation against synthetic workloads. This new script
attaches py-spy to a *running* worker during a real TPC-C run to find the
actual wall-clock bottleneck — which is what we need before guessing at
more networking optimizations.

Supports --pid, --name (matching by cmdline), --all (parallel multi-PID),
--top (live view), and includes k8s instructions in the docstring.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kPsarakis changed the title from "Add TCP send-side batching to eliminate per-message drain() overhead" to "TCP networking perf: lock-free hot path, client-side NODELAY, load-aware pool" on May 16, 2026
kPsarakis and others added 2 commits May 17, 2026 11:13
TCP in-flight window = min(sender.SO_SNDBUF, receiver.SO_RCVBUF), so the
previous client-only bump to 1 MB was capped by the server-side 1 MB cap
and gave zero effective improvement.

Hoist SOCKET_SND_BUF and SOCKET_RCV_BUF (env-tunable) into base_networking
so client (StyxSocketClient) and all five server listening sockets
(worker_socket, protocol_socket x2, coor_socket, snapshotting_socket)
share one source of truth.

Default 4 MB:
  - Matches what Linux TCP auto-tuning would settle on for typical datacenter
    BDP (10-25 Gbps at 0.5-1 ms RTT).
  - Setting SO_SNDBUF/RCVBUF explicitly *disables* Linux auto-tuning, so we
    pick a sensible cap rather than letting the kernel grow dynamically.
  - Comfortably covers Styx's snapshot/migration path which streams MB of
    state at a time. TPC-C's small-message traffic is unaffected either way.

No new env var needed; existing SOCKET_SND_BUF/SOCKET_RCV_BUF env override now
takes effect on both client and server.
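
The sizing argument can be checked with a quick back-of-the-envelope calculation. The helper names below are hypothetical, and the min() rule is the simplified model used in this commit message, not an exact description of TCP flow control.

```python
def bdp_bytes(gbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to keep the link full."""
    return gbps * 1e9 / 8 * rtt_ms / 1e3


def effective_window(sender_sndbuf: int, receiver_rcvbuf: int) -> int:
    """Simplified model from above: the smaller buffer bounds the window."""
    return min(sender_sndbuf, receiver_rcvbuf)


low = bdp_bytes(10, 0.5)    # 625,000 bytes
high = bdp_bytes(25, 1.0)   # 3,125,000 bytes
default = 4 << 20           # 4,194,304 bytes
```

Under this model the quoted datacenter range needs roughly 0.6 to 3.1 MB in flight, so a 4 MB buffer on both ends leaves headroom, while raising only one side leaves the window at the smaller value.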

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rough

Four independent perf changes addressing hotspots found via py-spy under
TPC-C (input rate 150, n_part=4, epoch=100):

* RunFunRemoteBatch (`__send_async_calls`): bucket outbound chain calls by
  peer and send one batch per peer per fan-out, instead of one TCP
  `send_message` per remote call. New `MessageType.RunFunRemoteBatch` (=42);
  receiver decodes the list and schedules each entry like the single-call
  path.
* AckBatch (`enqueue_ack` + `_flush_acks` in NetworkingManager): non-awaiting
  `__send_ack` that buffers acks per peer and flushes on the next event-loop
  tick. Coalesces the per-leaf-txn ack stream into one
  `MessageType.AckBatch` (=43) per peer per tick. Reset of the
  `_ack_flush_scheduled` flag happens before the awaited send so any acks
  enqueued mid-flush schedule a fresh flush.
* Inlined protocol dispatch in `protocol_queue_worker`: `await
  protocol_tcp_controller(message)` directly instead of wrapping each
  message in `protocol_task_scheduler.create_task(...)`. Concurrency is
  still bounded by `PROTOCOL_WORKERS=100`; we drop the per-message Task
  allocation plus the `AIOTaskScheduler` semaphore round-trip. Removes the
  now-unused `protocol_task_scheduler` field.
* Fast-copy alias-safety for scalar-tuples (`_fast_copy.pyx`): treat tuples
  whose elements are all scalars as alias-safe so list-of-scalar-tuples
  (e.g. TPC-C `front_end_metadata["item_replies"]`) no longer falls
  through to `copy.deepcopy`. Micro-bench on that struct: 0.42 us vs
  5.77 us for `copy.deepcopy` (~14x).

Measured deltas across 4 workers (py-spy, 40 s sample window, same input):

  Total samples           34,169 -> 30,838  (-10%)
  run_function             7,868 ->  5,936  (-25%)
  create_task                948 ->    317  (-67%)
  runner (AIOTaskScheduler) 2,089 ->  1,244  (-40%)
  state get (deepcopy path) 1,268 ->    715  (-44%)
  deepcopy + _deepcopy_list   813 ->      0  (-100%, fell out of top 30)

`_handle_ack_batch` doesn't surface in the top 30 -- the per-ack
`send_message` cost collapsed rather than re-emerging in the handler.
kPsarakis changed the title from "TCP networking perf: lock-free hot path, client-side NODELAY, load-aware pool" to "TCP networking perf + Aria hot-path batching" on May 17, 2026
kPsarakis added 2 commits May 17, 2026 20:41
Four surgical changes targeting the per-message overhead surfaced by
py-spy on a 10W/500tps TPC-C profile (50,878 samples / 4 workers /
40 s; system was server-bound):

* `__send_async_calls`: hoist the `ack_payload` 5-tuple build out of the
  per-call loop. Every chain call at the same level shares the same
  `(ack_host, ack_port, t_id, fraction, participants)` tuple — no
  reason to allocate it N times.

* `__send_async_calls`: skip `asyncio.gather` when there is exactly one
  awaitable left after bucketing. The linear NewOrder chain stages
  (root → warehouse, warehouse → district, district → ...) all hit
  this case.

* `__prepare_message_transmission`: build the 8- or 9-tuple wire
  payload in one expression. The old form allocated an 8-tuple and
  then did `payload += (ack_payload,)` which re-allocates a fresh
  9-tuple per remote call.

* `worker_service.protocol_queue_worker`: inline the handler lookup so
  there's no `protocol_tcp_controller` coroutine frame between the
  queue and the handler. `protocol_handlers_map` is now public on
  `AriaProtocol` (was `_protocol_handlers_map`). The legacy
  `protocol_tcp_controller` method stays for direct callers (tests,
  snapshotting, recovery).

Measured impact (same 10W/500tps run, same warm-up + window):

  Throughput avg:     366 -> 417 tps   (+14%)
  Latency mean:    14,037 -> 8,052 ms  (-43%)
  Latency p99:     23,802 -> 13,571 ms (-43%)
  Latency max:     26,154 -> 16,192 ms (-38%)
  Missed messages:      0 -> 0
  Exactly-once:         ok -> ok

No protocol or wire-format changes. 726 unit tests pass.
kPsarakis merged commit d646bb8 into main on May 18, 2026
7 of 8 checks passed