Skip to content

fix(server): warn and skip non-numeric filenames in consumer offset directories instead of panicking#3135

Open
atharvalade wants to merge 5 commits into
apache:masterfrom
atharvalade:fix/panic-on-non-numeric-consumer-offset-filenames
Open

fix(server): warn and skip non-numeric filenames in consumer offset directories instead of panicking#3135
atharvalade wants to merge 5 commits into
apache:masterfrom
atharvalade:fix/panic-on-non-numeric-consumer-offset-filenames

Conversation

@atharvalade
Copy link
Copy Markdown
Contributor

@atharvalade atharvalade commented Apr 15, 2026

Which issue does this PR close?

Closes #3132

Rationale

An operator accidentally placing a file (e.g. .DS_Store, editor swap file, backup) in the consumer offsets directory crashes the server on startup due to a panic! on non-numeric filenames.

What changed?

Both load_consumer_offsets and load_consumer_group_offsets in core/server/src/streaming/partitions/storage.rs panicked when encountering files with non-numeric names during startup offset loading.

The fix replaces both panic! calls with warn! log messages and continue, so the server gracefully skips unexpected files. Added 9 integration tests in core/integration/tests/storage/consumer_offsets.rs covering valid files, non-numeric files (.DS_Store, backup.bak), empty directories, subdirectories, and nonexistent paths for both functions.

Local Execution

  • Passed
  • Pre-commit hooks ran

AI Usage

  1. Opus 4.6
  2. Minimal AI used
  3. All tests verified locally with cargo test, cargo clippy, cargo fmt --check, and CI scripts
  4. Yes, all code can be explained

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.69%. Comparing base (611fca0) to head (5ef2977).
⚠️ Report is 39 commits behind head on master.

Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3135       +/-   ##
=============================================
- Coverage     74.10%   52.69%   -21.41%     
  Complexity      943      943               
=============================================
  Files          1159     1157        -2     
  Lines        102033    90321    -11712     
  Branches      79083    67389    -11694     
=============================================
- Hits          75607    47596    -28011     
- Misses        23765    40157    +16392     
+ Partials       2661     2568       -93     
Components Coverage Δ
Rust Core 46.47% <100.00%> (-28.86%) ⬇️
Java SDK 60.14% <ø> (ø)
C# SDK 69.07% <ø> (-0.31%) ⬇️
Python SDK 81.43% <ø> (ø)
Node SDK 91.53% <ø> (ø)
Go SDK 39.43% <ø> (ø)
Files with missing lines Coverage Δ
core/server/src/streaming/partitions/storage.rs 71.50% <100.00%> (+4.30%) ⬆️

... and 277 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs.

If you need a review, please ensure CI is green and the PR is rebased on the latest master. Don't hesitate to ping the maintainers - either @core on Discord or by mentioning them directly here on the PR.

Thank you for your contribution!

@github-actions github-actions Bot added the stale Inactive issue or pull request label Apr 23, 2026
@hubcio
Copy link
Copy Markdown
Contributor

hubcio commented Apr 23, 2026

can you make it a integration test instead of unit? code in core/server will be replaced when we introduce clustering, thus test will be lost.

@github-actions github-actions Bot removed the stale Inactive issue or pull request label Apr 24, 2026
@atharvalade atharvalade force-pushed the fix/panic-on-non-numeric-consumer-offset-filenames branch from 4fc2476 to 39db2e4 Compare April 24, 2026 15:30
@atharvalade
Copy link
Copy Markdown
Contributor Author

can you make it a integration test instead of unit? code in core/server will be replaced when we introduce clustering, thus test will be lost.

Good call, moved the tests from core/server unit tests to core/integration/tests/storage/consumer_offsets.rs. They follow the same pattern as tests/state/ (direct server:: imports, no full harness needed). This way they survive the server ti server-ng migration and can be re-run against the new binary.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 7, 2026

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs.

If you need a review, please ensure CI is green and the PR is rebased on the latest master. Don't hesitate to ping the maintainers - either @core on Discord or by mentioning them directly here on the PR.

Thank you for your contribution!

@github-actions github-actions Bot added stale Inactive issue or pull request and removed stale Inactive issue or pull request labels May 7, 2026
@hubcio
Copy link
Copy Markdown
Contributor

hubcio commented May 14, 2026

/ready

@github-actions github-actions Bot added the S-waiting-on-review PR is waiting on a reviewer label May 14, 2026
Copy link
Copy Markdown
Contributor

@hubcio hubcio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

core idea is right - replacing the panic! on non-numeric filenames with warn + skip is the correct direction. but as posted this is a partial fix: into_string().unwrap() and dir_entry.unwrap() in both functions keep the exact startup-panic-on-stray-file behavior #3132 targets, and the regression tests only feed inputs the patch already handles (.DS_Store, backup.bak - all valid utf-8), so they confirm the patch runs but don't prove the bug class is gone. inline comments below.

/author

@@ -181,9 +181,16 @@ pub fn load_consumer_offsets(path: &str) -> Result<Vec<ConsumerOffset>, IggyErro
}

let name = dir_entry.file_name().into_string().unwrap();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fix is incomplete - into_string().unwrap() on this line still panics on a non-utf-8 filename, which is perfectly legal on linux filesystems. the pr's whole rationale is an operator dropping a stray file crashing startup, but a stray file with a non-utf-8 name crashes here before ever reaching the parse you patched. handle the Err the same way - to_string_lossy() or warn + continue.

same loop has two more startup-panic paths the pr leaves untouched: dir_entry.unwrap() a few lines up panics on a per-entry io error, and metadata.is_err() does break instead of continue - on a metadata error it bails the whole loop and silently drops every remaining offset file, including valid ones.

let consumer_id = name.parse::<u32>().unwrap_or_else(|_| {
panic!("Invalid consumer ID file with name: '{}'.", name);
});
let consumer_id = match name.parse::<u32>() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

behavior tradeoff worth calling out in the pr description: a genuinely corrupt or renamed offset file that should be numeric but isn't is now silently skipped, so that consumer loses its committed offset and re-consumes on restart. the warn! gives observability so this is probably acceptable, but it's a crash -> silent-data-skip swap, not a pure robustness win.

@@ -244,12 +251,16 @@ pub fn load_consumer_group_offsets(

let name = dir_entry.file_name().into_string().unwrap();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same non-utf-8 gap as in load_consumer_offsets - into_string().unwrap() here still panics on a non-utf-8 filename before the parse you patched is reached. and same as the other loop, dir_entry.unwrap() and the metadata.is_err() -> break (drops all remaining files on a metadata error) are left as startup-panic / data-loss paths.

assert_eq!(offsets.len(), 1);
assert_eq!(offsets[0].consumer_id, 1);
assert_eq!(offsets[0].offset.load(Ordering::Relaxed), 42);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing test case: a numeric-named file with fewer than 8 bytes of content. load_consumer_offsets does read_exact(&mut [0u8; 8]) on each file, so a truncated or partially-written offset file makes the whole function return Err and aborts the entire load - same "one bad file kills startup" class this pr targets, and more likely in practice than a non-numeric name (partial write, disk full). worth adding alongside the non-numeric test here.

}

#[test]
fn load_consumer_offsets_skips_directories() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test doesn't cover anything this pr changes. the directory skip it asserts happens via the metadata.unwrap().is_dir() check in load_consumer_offsets, which is untouched by this pr - and 123 is a numeric name, so it'd only reach the patched code if that is_dir check were removed. fine to keep as a regression test, but it's pinning pre-existing behavior, not this change.


#[test]
fn load_consumer_offsets_nonexistent_dir() {
let result = load_consumer_offsets("/tmp/nonexistent_iggy_test_dir_12345");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also at line 138 in load_consumer_group_offsets_nonexistent_dir. both tests hardcode /tmp/nonexistent_iggy_test_dir_12345 - if that path ever exists (leftover from a crashed run, or a parallel run) the test breaks. create a tempfile::tempdir(), grab its path, drop it, then use that path - guaranteed nonexistent and unique.

@hubcio
Copy link
Copy Markdown
Contributor

hubcio commented May 14, 2026

/author

@github-actions github-actions Bot added S-waiting-on-author PR is waiting on author response and removed S-waiting-on-review PR is waiting on a reviewer labels May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-author PR is waiting on author response

Projects

None yet

Development

Successfully merging this pull request may close these issues.

server: panic on non-numeric filenames in consumer offset directories

2 participants