I did this
I'm uploading ~2.4M small analytics events from a standalone tool as roughly 800 concurrent HTTPS POST requests to a single host (TLS 1.3, HTTP/1.1). The pool is constrained with:
- CURLMOPT_MAX_HOST_CONNECTIONS = 16
- CURLMOPT_MAX_TOTAL_CONNECTIONS = 256
- CURLOPT_CONNECTTIMEOUT = 30s
- No CURLOPT_TIMEOUT
- Multi interface, repeated curl_multi_perform, one easy handle per request, added in a tight loop.
The easy handles are created via a CURLSH-shared DNS/connection pool on the multi handle, same config that has been in use for years.
I expected the following
All requests complete successfully (this is what happens with curl 8.15.0: zero errors).
What goes wrong
About 49 out of ~800 requests fail with CURLE_OPERATION_TIMEDOUT after 30s. The info cache of every failing request is dominated by repeated "No more connections allowed to host" messages (~20-40 of them in a row), followed finally by a late TCP connect attempt that never completes within the remaining budget:
libcurl info message cache 0 (No more connections allowed to host)
libcurl info message cache 1 (No more connections allowed to host)
...
libcurl info message cache 18 (Hostname ___ was found in DNS cache)
libcurl info message cache 19 ( Trying IP:443...)
libcurl info message cache 20 (SSL reusing session with ALPN 'http/1.1')
libcurl info message cache 21 (ALPN: curl offers http/1.1)
libcurl info message cache 22 (TLSv1.3 (OUT), TLS handshake, Client hello (1):)
libcurl info message cache 23 (Connection timed out after 30000 milliseconds)
libcurl info message cache 24 (closing connection #815)
Each failing request spends the bulk of its 30s in MSTATE_PENDING waiting for a slot, gets woken late, and runs out of its connect-timeout budget mid-TLS handshake.
The same workload (same network, same host, same binary except the linked libcurl) produces 0 errors with libcurl 8.15.0.
Total connection IDs also differ: 8.17 reports connection numbers up to #815 (so MaxHost=16 isn't being effectively reused -> connections leave and come back). 8.15 creates a similar number of total connections, but never lets a PENDING handle hit the connect timeout.
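To make the budget arithmetic behind the PENDING observation concrete, here is a rough back-of-the-envelope model. It is an idealized sketch, not libcurl's actual scheduling: it assumes requests get a slot strictly FIFO in waves of `slots`, that every transfer holds its slot for a fixed `hold_ms`, and that a fresh TCP+TLS setup needs `setup_ms`. All function and parameter names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Idealized model, NOT libcurl's scheduler: requests are served FIFO in
 * waves of `slots`, and each transfer holds its slot for `hold_ms`.
 * Returns how long request `index` (0-based) sits pending. */
long pending_wait_ms(size_t index, size_t slots, long hold_ms)
{
  assert(slots > 0);
  return (long)(index / slots) * hold_ms;
}

/* Connect budget left once the request finally gets a slot, assuming the
 * pending wait counts against the connect timeout. A value <= 0 means the
 * timeout already expired while the request was still pending. */
long remaining_budget_ms(size_t index, size_t slots, long hold_ms,
                         long connect_timeout_ms)
{
  return connect_timeout_ms - pending_wait_ms(index, slots, hold_ms);
}

/* Count requests whose remaining budget is smaller than the time a fresh
 * TCP+TLS setup is assumed to need (`setup_ms`). */
size_t count_timeouts(size_t total, size_t slots, long hold_ms,
                      long connect_timeout_ms, long setup_ms)
{
  size_t i;
  size_t failed = 0;
  for(i = 0; i < total; i++) {
    if(remaining_budget_ms(i, slots, hold_ms, connect_timeout_ms) < setup_ms)
      failed++;
  }
  return failed;
}
```

With 800 requests and 16 slots this model has 50 waves, so the last wave waits 49 × `hold_ms`. Any average hold time above roughly 612 ms uses up a 30 s budget before the tail even starts connecting. This only shows how sensitive the tail of the queue is to slot turnaround; it does not by itself explain why 8.15 and 8.17 behave differently.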
Bisect / suspicion
Comparing the source:
- Curl_cpool_check_limits and the dest-limit switch case in create_conn_helper/url.c are effectively unchanged between 8.15 and 8.17.
- What changed in 8.16 (per release notes & source diff):
  - multi: process pending, one by one [90] - process_pending_handles now wakes only a single pending handle per call (present through master).
  - multi: replace remaining EXPIRE_RUN_NOW [67] -> move_pending_to_connect() now calls Curl_multi_mark_dirty(data) instead of Curl_expire(data, 0, EXPIRE_RUN_NOW). Previously the PENDING handle was re-queued via the splay tree for immediate retry; now it is only marked dirty and depends on a future process_pending_handles() call to be woken.
- In 8.17 the happy-eyeballing code was extracted to cf-ip-happy.c and the connection filter lifecycle was reworked. This also coincides with "vtls: properly handle SSL shutdown timeout [433]" and schannel shutdown tweaks; I haven't isolated whether the connection-filter rework keeps CONN_INUSE true longer.
- The end result is that under sustained saturation (all 16 live connections busy serving slow remote POSTs), some PENDING handles never get woken before their connect timeout expires.
Things I tried locally in 8.17 that did not fully fix it:
- Revert process_pending_handles to wake all pending handles (8.15 behavior).
- Reset TIMER_STARTSINGLE + re-schedule EXPIRE_CONNECTTIMEOUT inside move_pending_to_connect() so PENDING wait time doesn't count against the connect timeout.
- Schedule a 1s EXPIRE_CONNECTTIMEOUT retry when entering PENDING so the splay tree polls for a slot.
- Cap concurrent cf_ip_attempt entries in cf-ip-happy.c at 2 (to match 8.15's two-baller model).
- Move process_pending_handles(data->multi) in multi_done() to after Curl_cpool_do_locked(... multi_done_locked ...) so the slot is actually free when pending handles wake.
- None of those reliably eliminate the failures.
The only thing that reliably gives 0 errors is making Curl_cpool_check_limits return CPOOL_LIMIT_OK when the dest bundle is full (i.e. disabling MAX_HOST_CONNECTIONS). That implies the regression isn't about wake/timer semantics but about the real connection lifecycle under 8.17 - something keeps connections counting against dest_limit longer than 8.15 did.
How to reproduce (minimal)
- Multi interface.
- CURLMOPT_MAX_HOST_CONNECTIONS = 16, CURLMOPT_MAX_TOTAL_CONNECTIONS = 256.
- CURLOPT_CONNECTTIMEOUT = 30, no CURLOPT_TIMEOUT.
- Queue ~800 simultaneous HTTPS POSTs to the same remote host (small payload, e.g. 100 KB of JSON, against any real TLS server).
- Repeatedly call curl_multi_perform until all requests complete.
Happy to put together a standalone C reproducer if that would help - the failure is fully deterministic on my setup with this pattern.
curl/libcurl version
libcurl/8.17.0 (OpenSSL 1.1.1t, zlib 1.3, nghttp2 1.64.0, HTTP/2), Windows x64, static lib (built and tried with both debug enabled and disabled)
operating system
Windows 11 / Windows Server 2022 (reproduced on both dev machines).
Schannel is not used for HTTPS here; libcurl is built with OpenSSL.