Regression (8.15 -> 8.17): CURLMOPT_MAX_HOST_CONNECTIONS causes PENDING handle timeouts under sustained HTTPS load #21396

@juanbelonepic

I did this

I'm uploading ~2.4M small analytics events from a standalone tool, issued as roughly 800 concurrent HTTPS POST requests to a single host (TLS 1.3, HTTP/1.1). The connection pool is constrained with:

  • CURLMOPT_MAX_HOST_CONNECTIONS = 16
  • CURLMOPT_MAX_TOTAL_CONNECTIONS = 256
  • CURLOPT_CONNECTTIMEOUT = 30s
  • No CURLOPT_TIMEOUT
  • Multi interface, repeated curl_multi_perform, one easy handle per request, added in a tight loop.

The easy handles use a CURLSH-shared DNS/connection pool on the multi handle; this same configuration has been in use for years.

I expected the following

All requests complete successfully (this is what happens with curl 8.15.0: zero errors).

What goes wrong

About 49 out of ~800 requests fail with CURLE_OPERATION_TIMEDOUT after 30s. The info cache of every failing request is dominated by repeated "No more connections allowed to host" messages (~20-40 of them in a row), followed finally by a late TCP connect attempt that never completes within the remaining budget:

libcurl info message cache 0  (No more connections allowed to host)
libcurl info message cache 1  (No more connections allowed to host)
...
libcurl info message cache 18 (Hostname ___ was found in DNS cache)
libcurl info message cache 19 (  Trying IP:443...)
libcurl info message cache 20 (SSL reusing session with ALPN 'http/1.1')
libcurl info message cache 21 (ALPN: curl offers http/1.1)
libcurl info message cache 22 (TLSv1.3 (OUT), TLS handshake, Client hello (1):)
libcurl info message cache 23 (Connection timed out after 30000 milliseconds)
libcurl info message cache 24 (closing connection #815)

Each failing request spends the bulk of its 30s in MSTATE_PENDING waiting for a slot, gets woken late, and runs out of its connect-timeout budget mid-TLS handshake.
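
To make the budget problem concrete, here is a rough back-of-the-envelope model. It is not libcurl's actual scheduling: it assumes a strict FIFO pending queue, a uniform per-transfer slot hold time, and that time spent waiting in PENDING counts against the connect timeout. All function names and the hold_ms / handshake_ms parameters are hypothetical.

```c
#include <stdio.h>

/* Simplified model (assumption): handles are served in FIFO "waves"
 * of `slots` transfers, and every transfer holds its slot for hold_ms.
 * A handle at position `index` (0-based) waits for all earlier waves. */
static long pending_wait_ms(long index, long slots, long hold_ms)
{
  return (index / slots) * hold_ms;
}

/* Returns 1 if the pending wait plus the connect/TLS handshake time
 * exceeds the connect timeout, i.e. the handle would time out. */
static int exceeds_budget(long index, long slots, long hold_ms,
                          long handshake_ms, long timeout_ms)
{
  return pending_wait_ms(index, slots, hold_ms) + handshake_ms > timeout_ms;
}

/* Counts how many of `total` queued handles would time out in this model. */
static long count_timeouts(long total, long slots, long hold_ms,
                           long handshake_ms, long timeout_ms)
{
  long n = 0;
  for(long i = 0; i < total; i++)
    n += exceeds_budget(i, slots, hold_ms, handshake_ms, timeout_ms);
  return n;
}
```

With 800 handles and 16 slots there are 50 waves, so in this model the last wave waits for 49 slot hold times before it can even start connecting. That leaves very little of a 30s connect budget once the hold time approaches 600 ms.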

The same workload (same network, same host, same binary except the linked libcurl) produces 0 errors with libcurl 8.15.0.

Total connection IDs also differ: 8.17 reports connection numbers up to #815 (so the 16 per-host connections aren't being reused effectively; connections are closed and reopened). 8.15 creates a similar number of total connections, but never lets a PENDING handle hit the connect timeout.

Bisect / suspicion

Comparing the source:

  • Curl_cpool_check_limits and the dest-limit switch case in create_conn_helper/url.c are effectively unchanged between 8.15 and 8.17.
  • What changed in 8.16 (per release notes & source diff):
    • multi: process pending, one by one [90] - process_pending_handles now wakes only a single pending handle per call (present through master).
    • multi: replace remaining EXPIRE_RUN_NOW [67] -> move_pending_to_connect() now calls Curl_multi_mark_dirty(data) instead of Curl_expire(data, 0, EXPIRE_RUN_NOW). Previously the PENDING handle was re-queued via the splay tree for immediate retry; now it is only marked dirty and depends on a future process_pending_handles() call to be woken.
    • In 8.17 the happy-eyeballing code was extracted to cf-ip-happy.c and the connection filter lifecycle was reworked. This also coincides with "vtls: properly handle SSL shutdown timeout [433]" and schannel shutdown tweaks; I haven't isolated whether the connection-filter rework keeps CONN_INUSE true longer.
  • The end result is that under sustained saturation (all 16 live connections busy serving slow remote POSTs), some PENDING handles never get woken before their connect timeout expires.

Things I tried locally in 8.17 that did not fully fix it:

  • Revert process_pending_handles to wake all pending handles (8.15 behavior).
  • Reset TIMER_STARTSINGLE + re-schedule EXPIRE_CONNECTTIMEOUT inside move_pending_to_connect() so PENDING wait time doesn't count against the connect timeout.
  • Schedule a 1s EXPIRE_CONNECTTIMEOUT retry when entering PENDING so the splay tree polls for a slot.
  • Cap concurrent cf_ip_attempt entries in cf-ip-happy.c at 2 (to match 8.15's two-baller model).
  • Move process_pending_handles(data->multi) in multi_done() to after Curl_cpool_do_locked(... multi_done_locked ...) so the slot is actually free when pending handles wake.
  • None of those reliably eliminate the failures.

The only thing that reliably gives 0 errors is making Curl_cpool_check_limits return CPOOL_LIMIT_OK when the dest bundle is full (i.e. disabling MAX_HOST_CONNECTIONS). That implies the regression isn't about wake/timer semantics but about the real connection lifecycle under 8.17 - something keeps connections counting against dest_limit longer than 8.15 did.

How to reproduce (minimal)

  1. Multi interface.
  2. CURLMOPT_MAX_HOST_CONNECTIONS = 16, CURLMOPT_MAX_TOTAL_CONNECTIONS = 256.
  3. CURLOPT_CONNECTTIMEOUT = 30, no CURLOPT_TIMEOUT.
  4. Queue ~800 simultaneous HTTPS POSTs to the same remote host (small payload, e.g. 100 KB of JSON, against any real TLS server).
  5. Repeatedly call curl_multi_perform until all requests complete.

Happy to put together a standalone C reproducer if that would help - the failure is fully deterministic on my setup with this pattern.

curl/libcurl version

libcurl/8.17.0 (OpenSSL 1.1.1t, zlib 1.3, nghttp2 1.64.0, HTTP/2) Windows x64, static lib (tried builds with debug both enabled and disabled)

operating system

Windows 11 / Windows Server 2022 (reproduced on both dev machines).
Schannel is not used here; libcurl is built with OpenSSL.
