[python] Make HTTP timeout/retry/keep-alive configurable via CatalogOptions #7732

Merged
JingsongLi merged 1 commit into apache:master from TheR1sing3un:py-rest-http-config
May 9, 2026

Conversation

@TheR1sing3un
Member

Purpose

The REST HttpClient currently hardcodes its retry count and ignores its own session timeout. Two practical issues fall out of this:

  1. Timeout is silently dropped. requests.Session.timeout is not consulted by the library — only session.request(timeout=...) is. The previous self.session.timeout = (180, 180) had no effect, so requests could hang indefinitely on a slow server.
  2. Retry counts are not tunable. Hardcoded max_retries=3 mixed connect / read retries; users could not disable connect retries (which often shouldn't retry) or boost read retries against flaky upstreams.
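
The first issue can be sketched as follows. This is a minimal illustration, not the PR's code: `TimeoutSession` is a hypothetical wrapper around any session-like object (for example a `requests.Session`) that passes the timeout as a per-call keyword argument, which is the only place the library reads it.

```python
from typing import Any, Tuple


class TimeoutSession:
    """Hypothetical wrapper: forwards a default timeout on every request."""

    def __init__(self, session: Any, timeout: Tuple[float, float] = (180.0, 180.0)):
        self._session = session
        self._timeout = timeout  # (connect seconds, read seconds)

    def request(self, method: str, url: str, **kwargs: Any) -> Any:
        # Assigning session.timeout = (...) has no effect in requests;
        # the value only applies when passed as the timeout= keyword here.
        kwargs.setdefault("timeout", self._timeout)
        return self._session.request(method, url, **kwargs)
```

With the real library this would be used as `TimeoutSession(requests.Session()).request("GET", url)`. A caller can still pass an explicit `timeout=` to override the default for a single call.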

This PR introduces five CatalogOptions so REST behaviour can be tuned via standard catalog options:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| http.connect-timeout | int (sec) | 180 | TCP connect timeout |
| http.read-timeout | int (sec) | 180 | Response read timeout |
| http.max-connect-retries | int | 3 | Retries for connect errors |
| http.max-read-retries | int | 3 | Retries for read / status errors (429/502/503/504) |
| http.keep-alive | bool | true | When false, sends Connection: close |

HttpClient now accepts an optional Options argument, applies the timeout via session.request(timeout=...), separates connect / read retry counters in ExponentialRetry (with total=None so each type governs independently), and sets Connection: close when keep-alive is disabled. RESTApi forwards its options through. token_loader.py is updated to the new ExponentialRetry(connect_retries, read_retries) signature.
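
A rough sketch of how these option keys could map onto client settings is shown below. The option keys and defaults come from the table above; `HttpSettings` and its methods are hypothetical names for illustration, and reusing the read-retry count for status retries is an assumption based on the table's description. The dictionary from `retry_kwargs()` uses keyword names accepted by `urllib3.util.retry.Retry`.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional, Tuple


@dataclass(frozen=True)
class HttpSettings:
    """Hypothetical holder for the five proposed HTTP options."""

    connect_timeout: int = 180
    read_timeout: int = 180
    max_connect_retries: int = 3
    max_read_retries: int = 3
    keep_alive: bool = True

    @classmethod
    def from_options(cls, options: Optional[Mapping[str, Any]] = None) -> "HttpSettings":
        opts = options or {}
        keep_alive = str(opts.get("http.keep-alive", "true")).strip().lower() == "true"
        return cls(
            connect_timeout=int(opts.get("http.connect-timeout", 180)),
            read_timeout=int(opts.get("http.read-timeout", 180)),
            max_connect_retries=int(opts.get("http.max-connect-retries", 3)),
            max_read_retries=int(opts.get("http.max-read-retries", 3)),
            keep_alive=keep_alive,
        )

    @property
    def timeout(self) -> Tuple[int, int]:
        # (connect, read) tuple, the shape session.request(timeout=...) accepts.
        return (self.connect_timeout, self.read_timeout)

    def headers(self) -> Dict[str, str]:
        # Only send an explicit header when keep-alive is disabled.
        return {} if self.keep_alive else {"Connection": "close"}

    def retry_kwargs(self) -> Dict[str, Any]:
        # total=None lets the connect and read counters govern independently.
        return {
            "total": None,
            "connect": self.max_connect_retries,
            "read": self.max_read_retries,
            "status": self.max_read_retries,  # assumption: status shares the read budget
            "status_forcelist": [429, 502, 503, 504],
        }
```

For example, `HttpSettings.from_options({"http.max-connect-retries": "0"})` would disable connect retries while leaving the other defaults in place.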

Linked issue

N/A — discovered while tuning REST clients against high-latency catalog servers and verifying that session.timeout is dead code.

Tests

  • pypaimon/tests/rest/test_exponential_retry_strategy.py — refreshed for the new signature; covers total=None, separated connect/read counters, zero-retries, and an end-to-end retry-on-connect-error case.
  • pypaimon/tests/rest/client_test.py — new HttpClientHttpOptionsTest:
    • Defaults applied when options=None
    • http.connect-timeout / http.read-timeout reach client._timeout
    • http.keep-alive=false sets Connection: close
    • http.max-connect-retries / http.max-read-retries reach the mounted adapter's Retry

Local: pytest pypaimon/tests/rest/client_test.py pypaimon/tests/rest/test_exponential_retry_strategy.py → 10 passed; flake8 --config=dev/cfg.ini clean.

API and format

HttpClient.__init__ adds an optional options kwarg (backward compatible — existing HttpClient(uri) callers still work and use the same default behaviour as before, except the timeout is now actually honoured).

ExponentialRetry.__init__ now takes connect_retries / read_retries instead of a single max_retries. The only internal caller (token_loader.py) is updated; no public API for ExponentialRetry.

No file format change.

Documentation

Option keys are self-described via with_description(...) in CatalogOptions. No additional doc change required.

Generative AI disclosure

Drafted with assistance from an AI coding tool; all logic reviewed by the author and validated by the tests above.

@JingsongLi
Contributor

Why can't it be consistent with Java implementation?

@TheR1sing3un
Member Author

Why can't it be consistent with Java implementation?

Thanks for the review @JingsongLi.

I do want to keep both the bug fix and the new options in this PR. The reason isn't just convenience: without these as catalog options, the client's actual behaviour is not under our control. requests falls back to the OS / kernel for things we can't choose at the application level: TCP connect retries (net.ipv4.tcp_syn_retries), keep-alive probe cadence, socket-level timeouts, etc. Today on the Python side self.session.timeout = (180, 180) is dead code (Session.timeout isn't honoured), so:

  • connect/read timeouts are effectively unbounded — a slow or hung server can wedge the client indefinitely;
  • retry behaviour is whatever urllib3/the kernel decide, not what we promise.

Exposing the five keys (http.connect-timeout / http.read-timeout / http.max-connect-retries / http.max-read-retries / http.keep-alive) is what lets us pin those down deterministically per-catalog. The defaults already match Java's hardcoded values (180s / 180s / 5 retries / keep-alive on), so out-of-the-box behaviour stays equivalent; only operators who today have no knob gain one.

On the Java consistency point: I take that seriously. My plan is, once this PR lands, to follow up with a dedicated PR that introduces the same option keys on the Java side (extending RESTCatalogOptions and wiring them through HttpClientUtils), so the two SDKs converge on the same configuration surface. That follow-up needs its own Java-side validation (HttpClient5 test coverage etc.) which I'd rather not bundle into a Python PR.

If that two-step plan works for you, I'll keep this PR as-is. Otherwise happy to discuss alternatives.

@JingsongLi
Contributor

@TheR1sing3un If Java is fine, can we just align Java's default behavior? There's no need to expose so many configurations.

@TheR1sing3un
Member Author

@TheR1sing3un If Java is fine, can we just align Java's default behavior? There's no need to expose so many configurations.

At first I thought the same. But the core issue is that if these settings cannot be adjusted, the kernel's network settings apply instead, and those are not uniform across machines and clusters. For example, in our internal cluster the connect retry count is only one, so the client fails immediately on any minor network fluctuation, which in turn fails the entire Ray job.

@TheR1sing3un
Member Author

@TheR1sing3un If Java is fine, can we just align Java's default behavior? There's no need to expose so many configurations.

By the way, when these parameters are not configured, the behavior should be aligned.

@JingsongLi
Contributor

@TheR1sing3un If Java is fine, can we just align Java's default behavior? There's no need to expose so many configurations.

At first I thought the same. But the core issue is that if these settings cannot be adjusted, the kernel's network settings apply instead, and those are not uniform across machines and clusters. For example, in our internal cluster the connect retry count is only one, so the client fails immediately on any minor network fluctuation, which in turn fails the entire Ray job.

What I want to confirm is, is there a problem with the Java SDK?

@TheR1sing3un
Member Author

What I want to confirm is, is there a problem with the Java SDK?

Checked the Java side too — same problem actually.

RESTCatalogOptions (paimon-api/.../rest/RESTCatalogOptions.java) exposes URI / token / DLF / user-agent only — no timeout / retry / keep-alive options. All HTTP behaviour is hardcoded in HttpClientUtils.createBuilder() (paimon-api/.../rest/HttpClientUtils.java:54-66): 3min timeouts, 5 retries, 100 connections, single retry counter (no connect/read split).

And DEFAULT_HTTP_CLIENT there is a public static final singleton with no plumbing from options, so a Java user hitting the cluster-network case I mentioned would have the same workaround need we have.

Happy to mirror the same option set on the Java side as a follow-up PR.

@JingsongLi
Contributor

@TheR1sing3un You may have misunderstood my meaning. Even if Java cannot configure these parameters, its default parameters run well, so there is no problem.

@TheR1sing3un
Member Author

@TheR1sing3un You may have misunderstood my meaning. Even if Java cannot configure these parameters, its default parameters run well, so there is no problem.

Sorry, I misunderstood you all along. I agree with you. Let me align the default behavior on the Python side directly and, for now, not introduce configuration-based tuning. What do you think?

TheR1sing3un force-pushed the py-rest-http-config branch from 2434d26 to 93da407 on May 9, 2026 04:21
@TheR1sing3un
Member Author

Sorry, I misunderstood you all along. I agree with you. Let me align the default behavior on the Python side directly and, for now, not introduce configuration-based tuning. What do you think?

Done. Aligned with the Java version.

``HttpClient`` set ``self.session.timeout = (180, 180)`` and never
applied it: the requests library does not consult ``Session.timeout``
on outgoing calls, only ``Session.request(timeout=...)`` does. The
client could therefore hang indefinitely on a slow upstream.

Pass the timeout through ``session.request(timeout=...)`` on every
call so it actually fires. Also bump the hardcoded retry budget from
3 to 5 and keep connect failures non-retriable -- defaults that match
the existing reference implementation and work well across the
cluster shapes we see in practice.

No new ``CatalogOptions`` are introduced. Callers who need different
shapes can override the class-level constants on ``HttpClient``.

Tests:
* ``HttpClientTimeoutTest`` patches ``Session.request`` and asserts it
  is called with ``timeout=client._timeout``. This pins the bug fix.
* ``test_exponential_retry_strategy`` uses the single-counter API and
  asserts connect failures bail fast (``connect=0``).
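
The timeout test and the class-level override described in this commit message might look roughly like the sketch below. Everything here is an assumption for illustration: `SketchHttpClient`, `FakeSession`, the constant names, and the `/ping` path are hypothetical and are not taken from the pypaimon source.

```python
import unittest
from unittest import mock


class FakeSession:
    """Stand-in for requests.Session so the sketch needs no network."""

    def request(self, method, url, **kwargs):
        raise AssertionError("expected to be patched in tests")


class SketchHttpClient:
    # Hypothetical class-level constants; the real names may differ.
    CONNECT_TIMEOUT = 180
    READ_TIMEOUT = 180

    def __init__(self, uri, session=None):
        self.uri = uri
        self.session = session or FakeSession()
        self._timeout = (self.CONNECT_TIMEOUT, self.READ_TIMEOUT)

    def get(self, path):
        # The timeout is passed on every call so that it actually applies.
        return self.session.request("GET", self.uri + path, timeout=self._timeout)


class ShortTimeoutClient(SketchHttpClient):
    # Overriding the constants in a subclass changes the defaults.
    CONNECT_TIMEOUT = 5
    READ_TIMEOUT = 30


class TimeoutIsForwardedTest(unittest.TestCase):
    def test_timeout_reaches_session_request(self):
        client = ShortTimeoutClient("http://catalog.invalid")
        with mock.patch.object(FakeSession, "request", return_value="ok") as req:
            self.assertEqual(client.get("/ping"), "ok")
        req.assert_called_once_with(
            "GET", "http://catalog.invalid/ping", timeout=client._timeout
        )
```

Patching the session's `request` method keeps the test offline while still pinning the behaviour that matters: the timeout reaches the call site.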
TheR1sing3un force-pushed the py-rest-http-config branch from 93da407 to 3e616f8 on May 9, 2026 05:52
@JingsongLi
Contributor

@TheR1sing3un Cool! Thanks~

@JingsongLi
Contributor

+1

JingsongLi merged commit 21ec57a into apache:master on May 9, 2026
6 checks passed

2 participants