[python] Make HTTP timeout/retry/keep-alive configurable via CatalogOptions by TheR1sing3un · Pull Request #7732 · apache/paimon

TheR1sing3un · 2026-04-29T10:30:43Z

Purpose

The REST HttpClient currently hardcodes its retry count and ignores its own session timeout. Two practical issues fall out of this:

Timeout is silently dropped. requests.Session.timeout is not consulted by the library — only session.request(timeout=...) is. The previous self.session.timeout = (180, 180) had no effect, so requests could hang indefinitely on a slow server.
Retry counts are not tunable. Hardcoded max_retries=3 mixed connect / read retries; users could not disable connect retries (which often shouldn't retry) or boost read retries against flaky upstreams.
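A minimal sketch of the first issue, assuming only the public `requests` transport-adapter interface. `RecordingAdapter` and the `catalog.invalid` URL are hypothetical names used for illustration; the adapter performs no network I/O and only records the `timeout` value that reaches the transport layer.

```python
import requests
from requests.adapters import BaseAdapter


class RecordingAdapter(BaseAdapter):
    """Hypothetical adapter: records the timeout it receives, sends nothing."""

    def __init__(self):
        super().__init__()
        self.seen_timeouts = []

    def send(self, request, stream=False, timeout=None, verify=True,
             cert=None, proxies=None):
        self.seen_timeouts.append(timeout)
        # Build an empty 200 response so Session.send can finish normally.
        response = requests.Response()
        response.status_code = 200
        response.url = request.url
        response.request = request
        return response

    def close(self):
        pass


session = requests.Session()
adapter = RecordingAdapter()
session.mount("http://", adapter)

# Plain attribute assignment; requests never reads Session.timeout.
session.timeout = (180, 180)
session.get("http://catalog.invalid/v1/config")

# A per-call timeout is what actually reaches the adapter.
session.get("http://catalog.invalid/v1/config", timeout=(180, 180))

print(adapter.seen_timeouts)  # [None, (180, 180)]
```

The first call reaches the adapter with `timeout=None`, which for the real `HTTPAdapter` means no timeout at all.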

This PR introduces five CatalogOptions so REST behaviour can be tuned via standard catalog options:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `http.connect-timeout` | int (sec) | 180 | TCP connect timeout |
| `http.read-timeout` | int (sec) | 180 | Response read timeout |
| `http.max-connect-retries` | int | 3 | Retries for connect errors |
| `http.max-read-retries` | int | 3 | Retries for read / status errors (429/502/503/504) |
| `http.keep-alive` | bool | true | When false, sends `Connection: close` |
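As a rough illustration of how the five keys might be resolved against their defaults, here is a standard-library-only sketch. `HttpSettings` and `parse_http_settings` are hypothetical names, not part of pypaimon; the real PR reads these values through `CatalogOptions`, whose API is not shown here.

```python
from dataclasses import dataclass

# Defaults copied from the option table in this PR description.
DEFAULTS = {
    "http.connect-timeout": "180",
    "http.read-timeout": "180",
    "http.max-connect-retries": "3",
    "http.max-read-retries": "3",
    "http.keep-alive": "true",
}


@dataclass(frozen=True)
class HttpSettings:
    """Hypothetical container for the resolved HTTP settings."""
    connect_timeout: int
    read_timeout: int
    max_connect_retries: int
    max_read_retries: int
    keep_alive: bool


def parse_http_settings(options=None):
    """Merge user-supplied options over the defaults and coerce the types."""
    supplied = {key: str(value) for key, value in (options or {}).items()}
    merged = {**DEFAULTS, **supplied}
    return HttpSettings(
        connect_timeout=int(merged["http.connect-timeout"]),
        read_timeout=int(merged["http.read-timeout"]),
        max_connect_retries=int(merged["http.max-connect-retries"]),
        max_read_retries=int(merged["http.max-read-retries"]),
        keep_alive=merged["http.keep-alive"].strip().lower() == "true",
    )


settings = parse_http_settings({"http.max-connect-retries": 0,
                                "http.keep-alive": "false"})
print(settings)
```

Unset keys fall back to the table defaults, so passing no options at all yields 180 s timeouts, 3 retries of each kind, and keep-alive enabled.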

HttpClient now accepts an optional Options argument, applies the timeout via session.request(timeout=...), separates connect / read retry counters in ExponentialRetry (with total=None so each type governs independently), and sets Connection: close when keep-alive is disabled. RESTApi forwards its options through. token_loader.py is updated to the new ExponentialRetry(connect_retries, read_retries) signature.
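The wiring described above could look roughly like the following sketch, which uses `urllib3`'s `Retry` directly rather than the PR's `ExponentialRetry` wrapper. `build_session` is a hypothetical helper; reusing the read budget for status retries and the `backoff_factor` value are assumptions, not details taken from the PR.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(connect_retries=3, read_retries=3, keep_alive=True):
    """Hypothetical helper: separate connect/read retry budgets on one session."""
    retry = Retry(
        total=None,                  # no shared cap; each counter governs itself
        connect=connect_retries,
        read=read_retries,
        status=read_retries,         # assumption: status retries share the read budget
        status_forcelist=[429, 502, 503, 504],
        backoff_factor=1,            # assumption: illustrative backoff only
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    if not keep_alive:
        # Ask the server to close the connection after each response.
        session.headers["Connection"] = "close"
    return session


session = build_session(connect_retries=0, read_retries=5, keep_alive=False)
mounted_retry = session.get_adapter("https://catalog.invalid/").max_retries
print(mounted_retry.connect, mounted_retry.read, session.headers["Connection"])
```

Timeouts are deliberately absent from this helper: as the description notes, they have to be passed on each `session.request(timeout=...)` call.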

Linked issue

N/A — discovered while tuning REST clients against high-latency catalog servers and verifying that session.timeout is dead code.

Tests

pypaimon/tests/rest/test_exponential_retry_strategy.py — refreshed for the new signature; covers total=None, separated connect/read counters, zero-retries, and an end-to-end retry-on-connect-error case.
pypaimon/tests/rest/client_test.py — new HttpClientHttpOptionsTest:
- Defaults applied when options=None
- http.connect-timeout / http.read-timeout reach client._timeout
- http.keep-alive=false sets Connection: close
- http.max-connect-retries / http.max-read-retries reach the mounted adapter's Retry
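To illustrate what "separated connect/read counters" with `total=None` means in practice, the following sketch drives `urllib3`'s `Retry.increment` by hand. It is not the PR's test code; the error messages and the `/v1/config` path are made up for the example.

```python
from urllib3.exceptions import ConnectTimeoutError, MaxRetryError, ProtocolError
from urllib3.util.retry import Retry

# Zero connect retries, two read retries, no shared total.
retry = Retry(total=None, connect=0, read=2)

# A read-type failure spends only the read budget and returns a new Retry.
after_read = retry.increment(
    method="GET", url="/v1/config", error=ProtocolError("connection reset")
)
print(after_read.connect, after_read.read)

# A connect-type failure with connect=0 exhausts the budget immediately.
try:
    retry.increment(
        method="GET", url="/v1/config",
        error=ConnectTimeoutError("connect timed out"),
    )
    connect_gave_up = False
except MaxRetryError:
    connect_gave_up = True
print(connect_gave_up)
```

`increment` never mutates the original object, so `retry` still holds the initial budgets after both calls.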

Local: pytest pypaimon/tests/rest/client_test.py pypaimon/tests/rest/test_exponential_retry_strategy.py → 10 passed; flake8 --config=dev/cfg.ini clean.

API and format

HttpClient.__init__ adds an optional options kwarg (backward compatible — existing HttpClient(uri) callers still work and use the same default behaviour as before, except the timeout is now actually honoured).

ExponentialRetry.__init__ now takes connect_retries / read_retries instead of a single max_retries. The only internal caller (token_loader.py) is updated; no public API for ExponentialRetry.

No file format change.

Documentation

Option keys are self-described via with_description(...) in CatalogOptions. No additional doc change required.

Generative AI disclosure

Drafted with assistance from an AI coding tool; all logic reviewed by the author and validated by the tests above.

JingsongLi · 2026-04-29T14:53:39Z

Why can't it be consistent with Java implementation?

TheR1sing3un · 2026-04-29T15:14:08Z

> Why can't it be consistent with Java implementation?

Thanks for the review @JingsongLi.

I do want to keep both the bug fix and the new options in this PR. The reason isn't just convenience: without these as catalog options the client's actual behaviour is not under our control. requests falls back to the OS / kernel for things we can't choose at the application level: TCP connect retries (net.ipv4.tcp_syn_retries), keep-alive probe cadence, socket-level timeouts, etc. Today on the Python side self.session.timeout = (180, 180) is dead code (Session.timeout isn't honoured), so:

connect/read timeouts are effectively unbounded — a slow or hung server can wedge the client indefinitely;
retry behaviour is whatever urllib3/the kernel decide, not what we promise.

Exposing the five keys (http.connect-timeout / http.read-timeout / http.max-connect-retries / http.max-read-retries / http.keep-alive) is what lets us pin those down deterministically per-catalog. The defaults already match Java's hardcoded values (180s / 180s / 5 retries / keep-alive on), so out-of-the-box behaviour stays equivalent; only operators who today have no knob gain one.

On the Java consistency point: I take that seriously. My plan is, once this PR lands, to follow up with a dedicated PR that introduces the same option keys on the Java side (extending RESTCatalogOptions and wiring them through HttpClientUtils), so the two SDKs converge on the same configuration surface. That follow-up needs its own Java-side validation (HttpClient5 test coverage etc.) which I'd rather not bundle into a Python PR.

If that two-step plan works for you, I'll keep this PR as-is. Otherwise happy to discuss alternatives.

JingsongLi · 2026-05-08T03:22:10Z

@TheR1sing3un If Java is fine, can we just align Java's default behavior? There's no need to expose so many configurations.

TheR1sing3un · 2026-05-08T03:26:25Z

> @TheR1sing3un If Java is fine, can we just align Java's default behavior? There's no need to expose so many configurations.

At first, I thought the same way. But the core issue is that if these settings cannot be adjusted, the client falls back to the kernel's network configuration, and that configuration is not uniform across machines and clusters. For example, in our internal cluster a connect is retried only once, so the client fails outright on a minor network fluctuation, which in turn fails the entire Ray job.

TheR1sing3un · 2026-05-08T03:55:10Z

> @TheR1sing3un If Java is fine, can we just align Java's default behavior? There's no need to expose so many configurations.

By the way, when these parameters are not configured, the behavior should be aligned with Java's.

JingsongLi · 2026-05-08T14:38:44Z

> > @TheR1sing3un If Java is fine, can we just align Java's default behavior? There's no need to expose so many configurations.
>
> At first, I thought the same way. But the core issue is that if these settings cannot be adjusted, the client falls back to the kernel's network configuration, and that configuration is not uniform across machines and clusters. For example, in our internal cluster a connect is retried only once, so the client fails outright on a minor network fluctuation, which in turn fails the entire Ray job.

What I want to confirm is, is there a problem with the Java SDK?

TheR1sing3un · 2026-05-08T15:00:37Z

> What I want to confirm is, is there a problem with the Java SDK?

Checked the Java side too — same problem actually.

RESTCatalogOptions (paimon-api/.../rest/RESTCatalogOptions.java) exposes URI / token / DLF / user-agent only — no timeout / retry / keep-alive options. All HTTP behaviour is hardcoded in HttpClientUtils.createBuilder() (paimon-api/.../rest/HttpClientUtils.java:54-66): 3min timeouts, 5 retries, 100 connections, single retry counter (no connect/read split).

And DEFAULT_HTTP_CLIENT there is a public static final singleton with no plumbing from options, so a Java user hitting the cluster-network case I mentioned would have the same workaround need we have.

Happy to mirror the same option set on the Java side as a follow-up PR.

JingsongLi · 2026-05-09T01:11:01Z

@TheR1sing3un You may have misunderstood my meaning. Even if Java cannot configure these parameters, its default parameters run well, so there is no problem.

TheR1sing3un · 2026-05-09T02:19:33Z

> @TheR1sing3un You may have misunderstood my meaning. Even if Java cannot configure these parameters, its default parameters run well, so there is no problem.

Sorry, I had been misunderstanding your point all along. I agree with you. Let me simply align the default behavior on the Python side and, for now, not introduce the configuration options. What do you think?

TheR1sing3un · 2026-05-09T05:50:05Z

> Sorry, I had been misunderstanding your point all along. I agree with you. Let me simply align the default behavior on the Python side and, for now, not introduce the configuration options. What do you think?

Done. Aligned with the Java version.

``HttpClient`` set ``self.session.timeout = (180, 180)`` and never applied it: the requests library does not consult ``Session.timeout`` on outgoing calls, only ``Session.request(timeout=...)`` does. The client could therefore hang indefinitely on a slow upstream.

Pass the timeout through ``session.request(timeout=...)`` on every call so it actually fires. Also bump the hardcoded retry budget from 3 to 5 and keep connect failures non-retriable -- defaults that match the existing reference implementation and work well across the cluster shapes we see in practice.

No new ``CatalogOptions`` are introduced. Callers who need different shapes can override the class-level constants on ``HttpClient``.

Tests:

* ``HttpClientTimeoutTest`` patches ``Session.request`` and asserts it is called with ``timeout=client._timeout``. This pins the bug fix.
* ``test_exponential_retry_strategy`` uses the single-counter API and asserts connect failures bail fast (``connect=0``).
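The shape of the fix and of the pinning test can be sketched as follows. `TimeoutHttpClient` is a hypothetical stand-in for `HttpClient` that shows only the timeout plumbing, and `check_timeout_is_forwarded` is an illustrative check in the spirit of ``HttpClientTimeoutTest``, not the merged test itself.

```python
from unittest import mock

import requests


class TimeoutHttpClient:
    """Hypothetical stand-in for HttpClient; only timeout forwarding is shown."""

    def __init__(self, uri, timeout=(180, 180)):
        self.uri = uri.rstrip("/")
        self._timeout = timeout  # (connect, read) in seconds
        self.session = requests.Session()

    def get(self, path):
        # Forward the timeout on every call; Session.timeout would be ignored.
        return self.session.request(
            "GET", self.uri + path, timeout=self._timeout
        )


def check_timeout_is_forwarded():
    client = TimeoutHttpClient("http://catalog.invalid")
    # Patch Session.request so no network traffic happens.
    with mock.patch.object(requests.Session, "request") as fake_request:
        client.get("/v1/config")
    fake_request.assert_called_once_with(
        "GET", "http://catalog.invalid/v1/config", timeout=client._timeout
    )
    return fake_request.call_args.kwargs["timeout"]


print(check_timeout_is_forwarded())  # (180, 180)
```

Patching at the `Session.request` boundary keeps the check independent of the retry adapter, so it fails only if the timeout stops being forwarded.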

JingsongLi · 2026-05-09T07:04:07Z

@TheR1sing3un Cool! Thanks~

JingsongLi · 2026-05-09T07:04:13Z

+1

TheR1sing3un force-pushed the py-rest-http-config branch from 2434d26 to 93da407 Compare May 9, 2026 04:21

TheR1sing3un force-pushed the py-rest-http-config branch from 93da407 to 3e616f8 Compare May 9, 2026 05:52

JingsongLi merged commit 21ec57a into apache:master May 9, 2026
6 checks passed
