Skip to content

[bugfix] ODPS reader: pin pyodps==0.12.5.1 (retry connection resets) + log session/table context on errors#537

Merged
tiankongdeguiji merged 8 commits into
alibaba:masterfrom
tiankongdeguiji:bugfix/odps-mc-connection-error-log
Jun 22, 2026
Merged

[bugfix] ODPS reader: pin pyodps==0.12.5.1 (retry connection resets) + log session/table context on errors#537
tiankongdeguiji merged 8 commits into
alibaba:masterfrom
tiankongdeguiji:bugfix/odps-mc-connection-error-log

Conversation

@tiankongdeguiji

@tiankongdeguiji tiankongdeguiji commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Problem

MaxCompute (ODPS) training jobs intermittently crash with a transient network drop during a storage-api read:

requests.exceptions.ConnectionError:
  ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

The exception escapes the DataLoader worker and tears down the whole distributed job, leaving only a raw stack trace — no session id, table, or row range to tell which read failed.

Root cause

pyodps stopped retrying connection resets in 0.12.6. Up to 0.12.5.1, the storage-api HTTP adapter mounted a urllib3 Retry(total=retry_times, allowed_methods={GET,HEAD,DELETE}, status_forcelist=[502,503,504]). A connection reset surfaces inside urllib3 as a ProtocolError, which Retry classifies as a retryable read error for idempotent methods — so it was retried up to retry_times times. 0.12.6 (and 0.13.0) dropped that adapter-level retry and moved retry into a Python filter that only retries ODPSError with status 502/503/504; a requests.ConnectionError is neither, so it is re-raised on the first attempt. Upgrading past 0.12.5.1 silently removed connection-reset resilience.

On the tzrec side, tzrec/datasets/odps_dataset.py also had two gaps: _read_rows_arrow_with_retry only caught ODPSError (and retried silently), and _get_session_record_count had no error handling at all — so an escaped reset died without any context.

Change

  • Pin pyodps==0.12.5.1 (requirements/runtime.txt) to restore the transport-level urllib3 retry of connection resets.
  • Log storage-api context before re-raising (diagnostics only — no new tzrec-side retry): _storage_debug_info formats session id / row range / table / quota / endpoint, and both call sites log it plus the exception repr (which carries the ODPS RequestId when present). The existing ODPSError retry path is made non-silent — warning per attempt, error on exhaustion.
  • Log client.table.full_table_name, not the Table objectstr(Table) dumps the entire column schema (hundreds of lines for wide feature tables) and touches lazily-loaded fields, which could fire a metadata reload inside the except block exactly when the connection is already broken.

Test

Ungated OdpsStorageErrorLogTest (no credentials/network) drives a stub client that raises ConnectionError and asserts both functions re-raise and log the session id / table:

python -m tzrec.datasets.odps_dataset_test OdpsStorageErrorLogTest

🤖 Generated with Claude Code

tiankongdeguiji and others added 3 commits June 2, 2026 11:29
…tion errors

_read_rows_arrow_with_retry only caught ODPSError, so a transient
requests.exceptions.ConnectionError ("Connection reset by peer", errno 104)
from the MaxCompute storage API escaped uncaught and killed the DataLoader
worker with no diagnostic context. _get_session_record_count had no error
handling at all.

Add a broad `except` to both call sites that logs the session_id, row range,
and table/quota/endpoint (plus the exception repr, which carries the ODPS
RequestId when present) before re-raising — diagnostics only, no new retry.
Also make the existing ODPSError retry non-silent: warn on each attempt and
error on exhaustion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…repr

_storage_debug_info put the raw pyodps Table object into the error string.
str()/repr() of a Table dumps the entire column schema (hundreds of lines
for wide feature tables) and accesses lazily-loaded fields, so formatting it
inside the except block can fire a metadata reload exactly when the storage
API is already failing.

Use client.table.full_table_name (project.`table`) -- a compact, I/O-free
identifier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pyodps 0.12.6 replaced the urllib3 transport-level Retry on the storage-api
HTTP adapter with a Python-level filter that only retries ODPSError with
status 502/503/504. A transient connection reset ("Connection reset by
peer") is a requests.ConnectionError, not an ODPSError, so it is no longer
retried and kills the DataLoader worker. 0.13.0 has the same behavior. Pin
to 0.12.5.1, the last release whose urllib3 Retry retries connection resets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@tiankongdeguiji tiankongdeguiji force-pushed the bugfix/odps-mc-connection-error-log branch from ffb7800 to 30237b0 Compare June 18, 2026 10:02
@tiankongdeguiji tiankongdeguiji changed the title [bugfix] ODPS reader: log session/table context on storage-api connection errors [bugfix] ODPS reader: pin pyodps==0.12.5.1 (retry connection resets) + log session/table context on errors Jun 18, 2026
tiankongdeguiji and others added 2 commits June 18, 2026 18:04
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@tiankongdeguiji tiankongdeguiji added the claude-review Let Claude Review label Jun 18, 2026
@github-actions github-actions Bot removed the claude-review Let Claude Review label Jun 18, 2026
Comment thread tzrec/datasets/odps_dataset.py Outdated
Comment on lines +213 to +215
"table": client.table.full_table_name,
"quota": client._quota_name,
"endpoint": client._rest_endpoint,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug_info is built eagerly on every call, before the try (lines 227 & 263) — so this runs on the success path too, not just on errors. It dereferences private pyodps attrs (_quota_name, _rest_endpoint). If a future pyodps bump renames one — and this PR exists precisely because of pyodps version sensitivity — _storage_debug_info itself raises AttributeError and breaks every read with an error unrelated to the actual failure, instead of producing the diagnostics it's meant to provide.

Consider making it defensive (fall back to a placeholder per field) and/or building the string lazily only on the error branch. OdpsReader already holds the quota/endpoint as public state, which avoids the private-attr coupling for the read path.

self._raise()


class OdpsStorageErrorLogTest(unittest.TestCase):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new tests raise requests.exceptions.ConnectionError, which is not an ODPSError, so both tests only exercise the generic except Exception branch (odps_dataset.py:251-253). The ODPSError retry loop — per-attempt logger.warning, the time.sleep backoff, and the "after N retries" logger.error on exhaustion (lines 237-250) — has no coverage. A refactor that broke the retry loop entirely would still pass both tests.

Consider adding a stub whose read_rows_arrow raises ODPSError N times then returns a sentinel reader (patch time.sleep to keep it fast): assert success-after-retry (warning count, no error) and exhaustion-after-3-retries (error text contains the retry count + session id).

@github-actions

Copy link
Copy Markdown

Code review summary

Focused, well-scoped bugfix with an excellent PR description — the root-cause analysis (pyodps 0.12.6 dropping the urllib3-level connection-reset retry) and the deliberate full_table_name-over-str(Table) choice are clearly reasoned and correct. The new test is appropriately ungated (no creds/network), and the diagnostics genuinely improve on-call debuggability.

Two items worth addressing (inline):

  • _storage_debug_info is fragile + eager — built on every call before the try (success path included) and reads private pyodps attrs (_quota_name/_rest_endpoint). A future pyodps rename would make the diagnostic helper itself crash every read. Suggest defensive per-field fallback and/or lazy construction on the error branch.
  • Retry path untestedConnectionError isn't an ODPSError, so the tests only hit the except Exception branch; the ODPSError retry loop (warning-per-attempt, backoff, exhaustion error) has no coverage.

Minor / optional (not blocking):

  • The except Exception branch re-raises connection errors with no tzrec-side retry — intentional per the PR (transport-layer retry is restored via the pin), but a one-line comment stating that would help future readers.
  • {e!r} is logged at several sites; safe today (endpoint is a host, not a signed URL; creds are never passed in), but logging exception repr is a recurring leak vector — str(e) or a small redaction pass would future-proof it.

Performance and documentation review: clean — .full_table_name correctly avoids the lazy metadata reload, debug_info is per-interval/session (not per-batch), and the docstring/comment are accurate and consistent with the module's conventions.

tiankongdeguiji and others added 3 commits June 18, 2026 19:09
… path

_storage_debug_info was built before the try, so it ran on every (including
successful) read and dereferenced pyodps client internals on the happy path.
Build it only inside the except branches now -- via a small closure in
_read_rows_arrow_with_retry (referenced in three branches) and inline in
_get_session_record_count -- so the success path does no diagnostic work.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…with_retry

The existing tests raise ConnectionError, which only exercises the generic
except branch. Add a stub that raises ODPSError N times then returns a
sentinel (time.sleep patched) to cover: retry-then-success (2 warnings, no
error) and exhaustion after 3 retries (3 warnings + one error carrying the
retry count and session id).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the private-attr reads in _storage_debug_info defensive so a missing
pyodps internal yields None instead of crashing the error-path diagnostic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@tiankongdeguiji tiankongdeguiji added the claude-review Let Claude Review label Jun 18, 2026
@github-actions github-actions Bot removed the claude-review Let Claude Review label Jun 18, 2026
def _storage_debug_info(client: StorageApiArrowClient, **req_fields: Any) -> str:
"""Format MaxCompute storage-api context for error logging."""
parts = {
"table": client.table.full_table_name,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor / defensive: this helper runs only on the error path, but client.table.full_table_name is the one attribute accessed without a guard, while quota/endpoint below use getattr(..., None). If client.table ever raises (e.g. a partially-constructed client, or an unexpected AttributeError), the diagnostic log this PR exists to produce won't be emitted, and the original ConnectionError/ODPSError gets buried behind a chained secondary exception.

(I checked pyodps: full_table_name resolves via _get_schema_name() + already-set project.name/name, so it does not trigger a network reload() — so this is purely robustness, not the I/O risk the PR description hints at.)

Consider making the whole helper best-effort so logging can never raise:

try:
    table = client.table.full_table_name
except Exception:
    table = None

self.assertIn("sess-read-123", logged)
self.assertIn("test_project.test_table", logged)

def test_get_read_session_logs_session_and_reraises(self):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coverage gap: _get_session_record_count's happy path is untested. This test only exercises the exception branch; the SessionStatus.INIT wait loop (time.sleep(1) → retry) and the final return scan_resp.record_count have no coverage, so a regression in the status comparison or the return value wouldn't be caught.

Consider a stub whose get_read_session returns INIT once then a ready status with a record_count, with time.sleep mocked, asserting the returned count and that it polled once. (Minor, also: none of the tests assert quota=/endpoint= actually appear in the logged string — worth one assertIn so a future pyodps rename of those private attrs doesn't silently log None while tests stay green.)

@github-actions

Copy link
Copy Markdown

Code review summary

Solid, well-scoped bugfix. The root-cause analysis (pyodps dropping urllib3-level connection-reset retry in 0.12.6) is convincing, the ==0.12.5.1 pin is the right call given the regression, and the diagnostic logging adds genuine value for debugging transient MaxCompute failures. The choice of full_table_name over str(Table) to avoid a lazy schema reload is correct, and the new tests cover the three meaningful paths (re-raise non-ODPSError, retry-then-succeed, retry-exhaustion). Nice diagnostics design (session_id/row_index per line aids disambiguation across distributed workers).

Reviewed across code quality, performance, tests, docs, and security. Nothing blocking. Two minor items posted inline:

  • _storage_debug_info robustnessclient.table.full_table_name is the one unguarded access in an error-only helper; making it best-effort (like the getattr guards on quota/endpoint) ensures the diagnostic log never raises a secondary exception. Defensive only — verified it doesn't trigger network I/O.
  • Test coverage — the happy path of _get_session_record_count (the INIT wait loop + record_count return) is untested; and no test asserts quota=/endpoint= actually appear in the logged string.

Cleared with no concerns: performance (all new logic is on per-session/error paths, the per-batch reader.read() hot path is untouched), security (logged fields — table/quota/endpoint/RequestId — carry no credentials; helpers are stateless and worker-safe), and docs (no conflicting pyodps version references; version bump and docstrings follow convention).

@tiankongdeguiji tiankongdeguiji merged commit e2bc17d into alibaba:master Jun 22, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants