Skip to content

Implement scalable I/O model with worker-per-context#7

Merged
benoitc merged 14 commits intomainfrom
feature/scalable-io-model
Feb 25, 2026
Merged

Implement scalable I/O model with worker-per-context#7
benoitc merged 14 commits intomainfrom
feature/scalable-io-model

Conversation

@benoitc
Copy link
Owner

@benoitc benoitc commented Feb 24, 2026

Summary

  • Restructure event loop to follow Erlang's scalable I/O model with one worker process per Python context
  • Replace centralized py_event_router bottleneck with direct I/O and timer routing to workers
  • Add ETS-based worker registry for O(1) lookup by loop ID
  • Add benchmark suite for timer and TCP performance testing

Changes

New modules:

  • py_event_worker - Core worker process receiving FD events and timers directly via enif_select and erlang:send_after
  • py_event_worker_registry - ETS registry for worker lookup by loop ID
  • py_event_worker_sup - simple_one_for_one supervisor for dynamic workers

C layer modifications:

  • Added worker_pid and loop_id fields to erlang_event_loop_t
  • Modified add_reader/writer/call_later/cancel_timer to route to worker when available
  • Added NIFs: event_loop_set_worker/2, event_loop_set_id/2

Benchmark results vs main:

  • Timer throughput (single): +11.7% (57,730 vs 51,661/sec)
  • Timer latency p95: -8.5% (0.173ms vs 0.189ms)
  • TCP scaling efficiency (2 workers): +20.6% (112% vs 91%)

Restructure event loop to follow Erlang's scalable I/O model:
- Add py_event_worker: dedicated worker process per Python context
  that receives FD events and timers directly via enif_select
- Add py_event_worker_registry: ETS-based O(1) worker lookup by loop_id
- Add py_event_worker_sup: simple_one_for_one supervisor for workers

C layer changes:
- Add worker_pid, has_worker, loop_id fields to erlang_event_loop_t
- Add nif_event_loop_set_worker and nif_event_loop_set_id NIFs
- Update add_reader/writer/call_later/cancel_timer to route to worker
  when available, falling back to router for backward compatibility

Performance improvements (vs baseline):
- Single-worker timer throughput: +11.7%
- Timer latency p95: -8.5%
- TCP scaling efficiency at 2 workers: +20.6%

All 113 tests pass.
@benoitc benoitc force-pushed the feature/scalable-io-model branch from 9c5d544 to 72d5941 Compare February 24, 2026 12:07
The initial add_reader/add_writer NIFs correctly used worker_pid when
available, but the reselect functions and Python-level FD registration
were hardcoded to use router_pid. This caused FD events after the
initial registration to go back to the router instead of the worker,
breaking the scalable I/O model for TCP connections.

Fixed 14 locations to use the pattern:
  ErlNifPid *target_pid = loop->has_worker ? &loop->worker_pid : &loop->router_pid;

NIF functions: nif_reselect_reader, nif_reselect_writer,
nif_reselect_reader_fd, nif_reselect_writer_fd, nif_start_reader,
nif_start_writer

Python module functions: py_add_reader, py_add_writer, py_schedule_timer,
py_cancel_timer, py_add_reader_for, py_add_writer_for,
py_schedule_timer_for, py_cancel_timer_for

Also updated condition checks to accept either router or worker:
  (!loop->has_router && !loop->has_worker)
Combines FD event dispatch and reselect into a single NIF call,
eliminating one roundtrip. The wakeup remains separate for clarity.

Worker flow is now:
1. handle_fd_event_and_reselect - dispatch to pending queue + reselect
2. event_loop_wakeup - wake Python thread
- Add sync_sleep fields to erlang_event_loop_t (id, complete flag, cond var)
- Add py_erlang_sleep() Python function that blocks on Erlang timer
- Add nif_dispatch_sleep_complete() to signal sleep completion
- Handle {sleep_wait, DelayMs, SleepId} in py_event_worker.erl
- Add erlang_asyncio module with full asyncio-compatible API:
  * sleep(), run(), gather(), wait(), wait_for()
  * create_task(), ensure_future(), shield()
  * get_event_loop(), new_event_loop(), set_event_loop()
  * TimeoutError, CancelledError exceptions
  * ALL_COMPLETED, FIRST_COMPLETED, FIRST_EXCEPTION constants
- Add py_erlang_sleep_SUITE with 8 tests

Usage:
    import erlang_asyncio

    async def main():
        await erlang_asyncio.sleep(0.001)
        results = await erlang_asyncio.gather(task1(), task2())

    erlang_asyncio.run(main())
@benoitc benoitc force-pushed the feature/scalable-io-model branch from f3d2310 to 7bb28a9 Compare February 24, 2026 17:13
- Add erlang_asyncio section to docs/asyncio.md with API reference
- Add performance comparison and usage guidelines
- Update CHANGELOG.md with v1.8.0 features
CI runners can have unpredictable timing - allow up to 10x
tolerance instead of 2x to avoid flaky test failures.
- Add asyncio.md and web-frameworks.md to README docs list
- Update getting-started.md with hex.pm install and erlang_asyncio section
- Add performance tips about erlang_asyncio to web-frameworks.md
Implement optimizations for ASGI request/response handling:

- Direct response tuple extraction: extract_asgi_response() directly
  converts (status, headers, body) tuples to Erlang terms, avoiding
  generic py_to_term() overhead (5-10% improvement)

- Pre-interned header names: Cache 16 common HTTP headers (host,
  content-type, user-agent, etc.) as PyBytes with length-based
  dispatch for O(1) lookup (3-5% improvement)

- Cached status code integers: Cache 14 common HTTP status codes
  (200, 201, 204, 301, 302, 304, 400-405, 500-503) as PyLong
  objects (1-2% improvement)

Combined expected improvement: 9-17% for ASGI marshalling.

Add tests for all optimizations in py_SUITE.erl.
Implement thread-local scope template caching that provides 15-20%
improvement for applications with repeated path patterns:

- Cache scope templates per path using FNV-1a hash
- Clone cached templates and update only dynamic fields (client,
  headers, query_string) for cache hits
- Track interpreter ownership for subinterpreter/free-threading safety
- Add ASGI-specific Erlang atoms for efficient scope map lookups

The cache uses 64 entries per thread with automatic replacement.
Cache entries include interpreter tracking to prevent cross-interpreter
PyObject sharing which would cause crashes.

Add test_asgi_scope_caching test case.
Implement resource-backed buffer for request bodies >= 1KB (threshold
defined by ASGI_ZERO_COPY_THRESHOLD):

- Create AsgiBuffer Python type implementing buffer protocol
- Use NIF resource to manage buffer lifecycle safely
- Python can slice/view the buffer without additional copies
- Works correctly with async code since resource lifetime is managed
- Automatic fallback to PyBytes for small bodies or when resource
  type is not initialized

Components:
- asgi_buffer_resource_t: NIF resource holding binary data
- AsgiBufferObject: Python type with bf_getbuffer/bf_releasebuffer
- AsgiBuffer_from_resource(): Factory function
- Updated asgi_binary_to_buffer() to use resource for large bodies

Add test_asgi_zero_copy_buffer test case.
Implement LazyHeaderList Python type that converts headers on-demand
instead of eagerly converting all headers when building ASGI scope.

Key changes:
- LazyHeaderList with sequence protocol (__len__, __getitem__, __iter__)
- LazyHeaderListIter for iteration support
- NIF resource (lazy_headers_resource_t) holds copied header data
- Threshold of 4 headers (smaller lists use eager conversion)
- Headers cached after first access for repeated lookups

This completes all 6 ASGI NIF optimizations with expected total
improvement of 40-60% for typical ASGI workloads.
- Add ASGI NIF optimizations to 1.8.0 changelog
- Document all 6 optimizations with expected improvements
- Update web-frameworks.md with optimization details table
@benoitc benoitc merged commit b916d61 into main Feb 25, 2026
9 checks passed
@benoitc benoitc deleted the feature/scalable-io-model branch February 25, 2026 01:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant