Skip to content

fix(spec): don't panic on overwrite-truncate when previous totals exceed i32::MAX or are non-numeric#2511

Open
SreeramGarlapati wants to merge 1 commit into
apache:mainfrom
SreeramGarlapati:fix/snapshot-summary-overwrite-panic
Open

fix(spec): don't panic on overwrite-truncate when previous totals exceed i32::MAX or are non-numeric#2511
SreeramGarlapati wants to merge 1 commit into
apache:mainfrom
SreeramGarlapati:fix/snapshot-summary-overwrite-panic

Conversation

@SreeramGarlapati
Copy link
Copy Markdown

Which issue does this PR close?

What changes are included in this PR?

update_snapshot_summaries on the overwrite-with-truncate path was doing truncate_table_summary(...).map_err(...).unwrap(). The unwrap() turned a recoverable parse error into a process panic, and there are two ways to trip it from normal API usage.

The first is a sneaky one: get_prop parsed the previous snapshot's totals as i32, even though every other counter path in the same file already uses u64. So once a table accumulated more than ~2.15B data files / records / bytes (i.e. crossed i32::MAX), the next overwrite-with-truncate would parse-fail, hit the unwrap, and crash the writer. Quietly tableable too — nothing in the public API hints that i32 is the secret ceiling.

The second is corruption / cross-implementation interop: any non-numeric value in a previous total-* property (a foreign implementation, a manual edit, a half-corrupt metadata file) produces a parse error and the same panic.

The fix is small. Widen get_prop to u64 so it matches the rest of the file and lifts the artificial ceiling. Replace the .unwrap() with ?, keeping the existing \"Failed to truncate table summary.\" wrapper as caller context so the error chain still tells you what was being attempted when the inner parse blew up. get_prop's error message now includes the offending property name and value, which is what you actually need when triaging a malformed metadata file.

Scope-wise this only touches the overwrite-truncate path. update_totals has a structurally similar unwrap() antipattern that is a separate (lower-severity) issue and not included here.

Are these changes tested?

Yes — two unit tests in snapshot_summary.rs, both red-state proven against the unfixed code:

The first asserts that a previous summary with total-data-files = i32::MAX + 1 (and total-records similarly) survives the truncation and lands in deleted-data-files / deleted-records intact. Against the unfixed code this panics at snapshot_summary.rs:356:18 with \"number too large to fit in target type\" — that's the i32-ceiling bug demonstrated directly.

The second feeds \"not_a_number\" as a previous total and asserts the function returns an Err whose message carries the \"truncate table summary\" wrapper. Against the unfixed code this panics at the same site with \"invalid digit found in string\" — the malformed-input bug.

I ran each test against the unfixed code, observed the panic, applied the fix, and watched them go green. Existing test_truncate_table_summary and test_update_snapshot_summaries_append continue to pass, and cargo fmt --check / cargo clippy --all-features --tests -- -D warnings are clean.

…e, widen totals to u64

`update_snapshot_summaries` on the overwrite-with-truncate path called
`truncate_table_summary(...).map_err(...).unwrap()`, turning a recoverable
parse error into a process-level panic. Two ways to hit it from normal API
usage:

  1. `get_prop` parsed previous totals as `i32` (max ~2.15B). The rest of
     this file already parses these counters as `u64`, so `get_prop` was
     the odd one out and any table that legitimately accumulated more than
     `i32::MAX` data files / records / file-size bytes panicked the writer
     on the next overwrite-with-truncate.
  2. Any non-numeric previous-total value (corruption, manual edits, a
     foreign implementation) produced a parse `Err` and the same panic.

Switch `get_prop` to `u64` to match the rest of the file and give a more
specific error message including the offending property name and value.
Replace the `.unwrap()` on the truncation result with `?`, preserving the
existing "Failed to truncate table summary." wrapper as caller context.

Two unit tests cover the two failure modes, each pinned via the `expect`/
`expect_err` API so a future regression to the panic path fails the test
loudly rather than silently re-introducing the crash. Verified that both
tests panic at `snapshot_summary.rs:356:18` against the unfixed code
(red-state proof) and pass cleanly against the fix.

Closes apache#2510.
@SreeramGarlapati
Copy link
Copy Markdown
Author

Hi @blackmwk — when you have a moment, appreciate a look. Short fix in spec/snapshot_summary.rs, scoped to the overwrite-truncate panic; context in #2510.

Copy link
Copy Markdown
Contributor

@blackmwk blackmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @SreeramGarlapati for this pr, just one minor point!

@@ -409,16 +407,16 @@ pub(crate) fn update_snapshot_summaries(
}

#[allow(dead_code)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove this atrribute?

Copy link
Copy Markdown
Author

@SreeramGarlapati SreeramGarlapati May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch - am keeping this PR's scope tight to the panic fix.
hence opened #2514 as a follow-up that drops all four stale #[allow(dead_code)] attributes in this file (this one plus the ones on update_snapshot_summaries, truncate_table_summary, and update_totals — all stale for the same reason).

verified cargo check and cargo clippy --all-features --tests -- -D warnings stay clean after the removals.

@blackmwk - given the follow up addresses the comment - could u complete merging the current PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(spec): update_snapshot_summaries panics on overwrite when previous totals exceed i32::MAX or are non-numeric

2 participants