fix(server): warn and skip non-numeric filenames in consumer offset directories instead of panicking#3135
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #3135 +/- ##
=============================================
- Coverage 74.10% 52.69% -21.41%
Complexity 943 943
=============================================
Files 1159 1157 -2
Lines 102033 90321 -11712
Branches 79083 67389 -11694
=============================================
- Hits 75607 47596 -28011
- Misses 23765 40157 +16392
+ Partials 2661 2568 -93
🚀 New features to boost your workflow:
|
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need a review, please ensure CI is green and the PR is rebased on the latest master. Don't hesitate to ping the maintainers - either @core on Discord or by mentioning them directly here on the PR. Thank you for your contribution! |
|
can you make it a integration test instead of unit? code in core/server will be replaced when we introduce clustering, thus test will be lost. |
…irectories instead of panicking
4fc2476 to
39db2e4
Compare
Good call, moved the tests from |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need a review, please ensure CI is green and the PR is rebased on the latest master. Don't hesitate to ping the maintainers - either @core on Discord or by mentioning them directly here on the PR. Thank you for your contribution! |
|
/ready |
hubcio
left a comment
There was a problem hiding this comment.
core idea is right - replacing the panic! on non-numeric filenames with warn + skip is the correct direction. but as posted this is a partial fix: into_string().unwrap() and dir_entry.unwrap() in both functions keep the exact startup-panic-on-stray-file behavior #3132 targets, and the regression tests only feed inputs the patch already handles (.DS_Store, backup.bak - all valid utf-8), so they confirm the patch runs but don't prove the bug class is gone. inline comments below.
/author
| @@ -181,9 +181,16 @@ pub fn load_consumer_offsets(path: &str) -> Result<Vec<ConsumerOffset>, IggyErro | |||
| } | |||
|
|
|||
| let name = dir_entry.file_name().into_string().unwrap(); | |||
There was a problem hiding this comment.
this fix is incomplete - into_string().unwrap() on this line still panics on a non-utf-8 filename, which is perfectly legal on linux filesystems. the pr's whole rationale is an operator dropping a stray file crashing startup, but a stray file with a non-utf-8 name crashes here before ever reaching the parse you patched. handle the Err the same way - to_string_lossy() or warn + continue.
same loop has two more startup-panic paths the pr leaves untouched: dir_entry.unwrap() a few lines up panics on a per-entry io error, and metadata.is_err() does break instead of continue - on a metadata error it bails the whole loop and silently drops every remaining offset file, including valid ones.
| let consumer_id = name.parse::<u32>().unwrap_or_else(|_| { | ||
| panic!("Invalid consumer ID file with name: '{}'.", name); | ||
| }); | ||
| let consumer_id = match name.parse::<u32>() { |
There was a problem hiding this comment.
behavior tradeoff worth calling out in the pr description: a genuinely corrupt or renamed offset file that should be numeric but isn't is now silently skipped, so that consumer loses its committed offset and re-consumes on restart. the warn! gives observability so this is probably acceptable, but it's a crash -> silent-data-skip swap, not a pure robustness win.
| @@ -244,12 +251,16 @@ pub fn load_consumer_group_offsets( | |||
|
|
|||
| let name = dir_entry.file_name().into_string().unwrap(); | |||
There was a problem hiding this comment.
same non-utf-8 gap as in load_consumer_offsets - into_string().unwrap() here still panics on a non-utf-8 filename before the parse you patched is reached. and same as the other loop, dir_entry.unwrap() and the metadata.is_err() -> break (drops all remaining files on a metadata error) are left as startup-panic / data-loss paths.
| assert_eq!(offsets.len(), 1); | ||
| assert_eq!(offsets[0].consumer_id, 1); | ||
| assert_eq!(offsets[0].offset.load(Ordering::Relaxed), 42); | ||
| } |
There was a problem hiding this comment.
missing test case: a numeric-named file with fewer than 8 bytes of content. load_consumer_offsets does read_exact(&mut [0u8; 8]) on each file, so a truncated or partially-written offset file makes the whole function return Err and aborts the entire load - same "one bad file kills startup" class this pr targets, and more likely in practice than a non-numeric name (partial write, disk full). worth adding alongside the non-numeric test here.
| } | ||
|
|
||
| #[test] | ||
| fn load_consumer_offsets_skips_directories() { |
There was a problem hiding this comment.
this test doesn't cover anything this pr changes. the directory skip it asserts happens via the metadata.unwrap().is_dir() check in load_consumer_offsets, which is untouched by this pr - and 123 is a numeric name, so it'd only reach the patched code if that is_dir check were removed. fine to keep as a regression test, but it's pinning pre-existing behavior, not this change.
|
|
||
| #[test] | ||
| fn load_consumer_offsets_nonexistent_dir() { | ||
| let result = load_consumer_offsets("/tmp/nonexistent_iggy_test_dir_12345"); |
There was a problem hiding this comment.
also at line 138 in load_consumer_group_offsets_nonexistent_dir. both tests hardcode /tmp/nonexistent_iggy_test_dir_12345 - if that path ever exists (leftover from a crashed run, or a parallel run) the test breaks. create a tempfile::tempdir(), grab its path, drop it, then use that path - guaranteed nonexistent and unique.
|
/author |
Which issue does this PR close?
Closes #3132
Rationale
An operator accidentally placing a file (e.g.
.DS_Store, editor swap file, backup) in the consumer offsets directory crashes the server on startup due to apanic!on non-numeric filenames.What changed?
Both
load_consumer_offsetsandload_consumer_group_offsetsincore/server/src/streaming/partitions/storage.rspanicked when encountering files with non-numeric names during startup offset loading.The fix replaces both
panic!calls withwarn!log messages andcontinue, so the server gracefully skips unexpected files. Added 9 integration tests incore/integration/tests/storage/consumer_offsets.rscovering valid files, non-numeric files (.DS_Store,backup.bak), empty directories, subdirectories, and nonexistent paths for both functions.Local Execution
AI Usage
cargo test,cargo clippy,cargo fmt --check, and CI scripts