[fix](job) fix streaming job fails with "No new files found" on second scheduling #61249
JNSimba merged 6 commits into apache:master from fix_s3_no_newfiles
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:

run buildall

/review
Code Review Summary
Goal & Correctness
The PR fixes a real bug where currentMaxFile was unconditionally set to the last raw S3 key in a listing page, even when that key did not match the glob pattern. This caused hasMoreDataToConsume() to return true incorrectly, triggering a second scheduling that found no matching files and failed with "No new files found".
The fix correctly tracks lastMatchedKey (the last glob-matched key) and, when the batch limit is reached, scans remaining objects in the current S3 page to find the next glob-matching key as currentMaxFile. If no next match is found, it falls back to lastMatchedKey instead of a raw non-matching S3 key.
Critical Checkpoint Conclusions
- Does the code accomplish the goal? Yes: the core logic change correctly prevents non-matching sibling keys from being recorded as `currentMaxFile`. The fix addresses both scenarios described in the PR body (reachLimit=false and reachLimit=true). However, the regression test is incomplete (see below).
- Is the modification minimal and focused? Yes: only the `globListInternal` method in `S3ObjStorage.java` is modified, with a targeted change to how `currentMaxFile` is tracked.
- Concurrency? Not applicable: `globListInternal` is a synchronous method with no shared state.
- Lifecycle management? Not applicable.
- Configuration items? No new config added.
- Incompatible changes? None.
- Parallel code paths? `AzureObjStorage` has a similar `globList` implementation but does NOT support `globListWithLimit` (no limit/pagination), so the bug is S3-specific and the fix is correctly scoped.
- Special conditional checks? The `if (currentMaxFile.isEmpty())` guard at the end of each page iteration is clear in intent but has a minor side effect: in the non-limit multi-page path, `currentMaxFile` freezes at the first page's last matched key rather than the overall last matched key. This is harmless because the non-limit caller (`globList()`) discards `currentMaxFile` entirely; it only uses `getStatus()`.
- Test coverage? The regression test correctly reproduces the bug scenario using `example_[0-0].csv` (it matches only `example_0.csv` while S3 also returns `example_1.csv`). However, the `.out` file for `qt_select` is missing; see the inline comment.
- Observability? Existing debug logging is sufficient.
- Performance? The post-limit scan adds negligible overhead (scanning remaining objects on a single already-fetched page). No extra S3 API calls.
- Other observations: the post-limit scan at lines 658-665 only checks the raw S3 key against the glob matcher, not parent paths (unlike the normal matching loop, which walks parents). This is acceptable for the streaming job use case (flat file patterns), but it should be noted for future reference if `globListWithLimit` is ever used with directory-level glob patterns.
regression-test/suites/job_p0/streaming_job/test_streaming_job_no_new_files_with_sibling.groovy
run buildall

run buildall
TPC-H: Total hot run time: 28104 ms
TPC-DS: Total hot run time: 153238 ms
FE UT Coverage Report: Increment line coverage
run feut

run buildall

PR approved by at least one committer and no changes requested.

PR approved by anyone and no changes requested.
FE UT Coverage Report: Increment line coverage
TPC-H: Total hot run time: 27766 ms |
TPC-DS: Total hot run time: 153115 ms |
…into fix_s3_no_newfiles
run buildall
TPC-H: Total hot run time: 27725 ms |
TPC-DS: Total hot run time: 153538 ms |
FE UT Coverage Report: Increment line coverage

PR approved by at least one committer and no changes requested.
What problem does this PR solve?
When a streaming job processes S3 files, the second scheduling fails with:
No new files found in path: ...
Root cause: In S3ObjStorage.globListInternal, currentMaxFile was unconditionally set to the last raw S3
object key returned in the response page, without checking whether it matched the glob pattern.
This affects two scenarios:
Scenario 1 — reachLimit=false (all matched files consumed in one listing):
The S3 page still contains non-matching keys after the last matched file (e.g.
test_csv_comma_header.csv.lz4 sitting after test_csv_comma_header.csv). currentMaxFile gets set to the
.lz4 key, so hasMoreDataToConsume() returns true. The next scheduling calls startAfter("...csv"), S3
returns only .lz4 which doesn't match the glob → rfiles empty → exception.
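The Scenario 1 condition can be reproduced in isolation with Java's built-in glob `PathMatcher` (the `data/` prefix below is illustrative, not the actual test layout):

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.List;

public class SiblingKeyDemo {
    // True iff the key matches the glob, mirroring the per-key check in the listing loop.
    static boolean matchesGlob(String glob, String key) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(key));
    }

    public static void main(String[] args) {
        String glob = "data/test_csv_comma_header.csv";
        List<String> pageKeys = List.of(
                "data/test_csv_comma_header.csv",
                "data/test_csv_comma_header.csv.lz4");

        // Buggy behavior: currentMaxFile was taken from the last raw key on
        // the page, even though that key never matched the glob.
        String buggyCurrentMax = pageKeys.get(pageKeys.size() - 1);
        System.out.println(buggyCurrentMax + " matches glob: "
                + matchesGlob(glob, buggyCurrentMax));
    }
}
```

Because the trailing `.lz4` sibling never matches, recording it as `currentMaxFile` is what made `hasMoreDataToConsume()` report data that a subsequent listing could not produce.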
Scenario 2 — reachLimit=true (batch limit hit mid-page):
After the limit is hit, the remaining page objects are not inspected. The original code set currentMaxFile
to the last raw key in the entire page (which may be a non-matching sibling), causing the same failure on
the next scheduling attempt.
Fix
Track lastMatchedKey (the last S3 key that actually matched the glob) during the listing loop.
When reachLimit=true, instead of breaking out of the for loop immediately, continue scanning the remaining
objects already fetched in the current page to find the first next glob-matching key as currentMaxFile.
No extra S3 API call is needed.
When no next matching key is found in the remaining page objects, fall back to lastMatchedKey instead of
the raw last S3 page key.
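A minimal sketch of the fixed per-page loop, under assumed simplified names (`pageKeys`, `limit`, `currentMaxFileForPage`); the real `globListInternal` also walks parent paths and handles multi-page pagination:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class GlobListFixSketch {
    /** Returns the currentMaxFile the fixed loop would record for one listing page. */
    static String currentMaxFileForPage(List<String> pageKeys, String glob, int limit) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> collected = new ArrayList<>();
        String lastMatchedKey = "";
        String currentMaxFile = "";
        boolean reachLimit = false;
        for (String key : pageKeys) {
            boolean matches = matcher.matches(Paths.get(key));
            if (!reachLimit) {
                if (matches) {
                    collected.add(key);
                    lastMatchedKey = key;   // track the last glob-matched key
                    reachLimit = collected.size() >= limit;
                }
            } else if (matches) {
                // Limit already hit: keep scanning the fetched page and record
                // the first NEXT glob-matching key; no extra S3 call needed.
                currentMaxFile = key;
                break;
            }
        }
        if (currentMaxFile.isEmpty()) {
            // No further match on this page: fall back to the last matched key,
            // never to a raw non-matching sibling such as "*.csv.lz4".
            currentMaxFile = lastMatchedKey;
        }
        return currentMaxFile;
    }

    public static void main(String[] args) {
        List<String> page = List.of("d/example_0.csv", "d/example_1.csv", "d/example_1.csv.lz4");
        // Limit hit after example_0.csv; the next match on the page is example_1.csv.
        System.out.println(currentMaxFileForPage(page, "d/example_*.csv", 1));
        // Limit not hit; the trailing .lz4 sibling is ignored, last matched key wins.
        System.out.println(currentMaxFileForPage(page, "d/example_*.csv", 10));
    }
}
```

Both paths end on a key that actually matched the glob, so the next scheduling's `startAfter(...)` listing either finds real new files or correctly reports nothing to consume.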
Regression Test
Added test_streaming_job_no_new_files_with_sibling. The pattern example_[0-0].csv only matches
example_0.csv; since getLongestPrefix strips at [, the S3 listing prefix becomes
regression/load/data/example_ and returns both example_0.csv and example_1.csv — example_1.csv acts as the
non-matching sibling. The test verifies that after the first successful task no failed tasks appear.
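The prefix-stripping behavior the test relies on can be sketched as follows; `longestLiteralPrefix` is a hypothetical stand-in for Doris's `getLongestPrefix`, illustrating only the "cut at the first glob metacharacter" idea described above:

```java
public class PrefixSketch {
    // Cut the glob pattern at the first metacharacter to obtain the literal
    // prefix used for the S3 listing request.
    static String longestLiteralPrefix(String globPattern) {
        for (int i = 0; i < globPattern.length(); i++) {
            char c = globPattern.charAt(i);
            if (c == '*' || c == '?' || c == '[' || c == '{') {
                return globPattern.substring(0, i);
            }
        }
        return globPattern;
    }

    public static void main(String[] args) {
        // The pattern is cut at '[', so the listing prefix covers both
        // example_0.csv and example_1.csv even though only example_0.csv
        // matches the glob — exactly the sibling setup the test needs.
        System.out.println(longestLiteralPrefix("regression/load/data/example_[0-0].csv"));
    }
}
```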
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)