
[fix](job) fix streaming job fails with "No new files found" on second scheduling #61249

Merged
JNSimba merged 6 commits into apache:master from JNSimba:fix_s3_no_newfiles
Mar 13, 2026

Conversation

@JNSimba
Member

@JNSimba JNSimba commented Mar 12, 2026

What problem does this PR solve?

When a streaming job processes S3 files, the second scheduling fails with:

No new files found in path: ...

Root cause: In S3ObjStorage.globListInternal, currentMaxFile was unconditionally set to the last raw S3
object key returned in the response page, without checking whether it matched the glob pattern.

This affects two scenarios:

Scenario 1 — reachLimit=false (all matched files consumed in one listing):
The S3 page still contains non-matching keys after the last matched file (e.g.
test_csv_comma_header.csv.lz4 sitting after test_csv_comma_header.csv). currentMaxFile is set to the
.lz4 key, so hasMoreDataToConsume() returns true. The next scheduling calls startAfter("...csv"), and S3
returns only the .lz4 key, which does not match the glob, so rfiles is empty and the exception is thrown.

Scenario 2 — reachLimit=true (batch limit hit mid-page):
After the limit is hit, the remaining page objects are not inspected. The original code set currentMaxFile
to the last raw key in the entire page (which may be a non-matching sibling), causing the same failure on
the next scheduling attempt.
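
To make the failure concrete, here is a minimal in-memory sketch of the pre-fix behavior. It is not the S3ObjStorage code: the class PreFixCursorSketch, the helpers rawLastKey and listAfter, and the endsWith(".csv") predicate standing in for the glob are all hypothetical. It also assumes that hasMoreDataToConsume() effectively treats a cursor beyond the last consumed file as "more data".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical model of the pre-fix behavior; not the actual S3ObjStorage code. */
final class PreFixCursorSketch {
    /** Pre-fix cursor: the last raw key of the page, whether or not it matches the glob. */
    static String rawLastKey(List<String> pageKeys) {
        return pageKeys.isEmpty() ? "" : pageKeys.get(pageKeys.size() - 1);
    }

    /** Models the next listing: keys strictly after startAfter, filtered by the glob. */
    static List<String> listAfter(List<String> sortedKeys, String startAfter,
                                  Predicate<String> globMatches) {
        List<String> rfiles = new ArrayList<>();
        for (String key : sortedKeys) {
            if (key.compareTo(startAfter) > 0 && globMatches.test(key)) {
                rfiles.add(key);
            }
        }
        return rfiles;
    }

    public static void main(String[] args) {
        List<String> bucket = Arrays.asList(
                "test_csv_comma_header.csv", "test_csv_comma_header.csv.lz4");
        Predicate<String> glob = key -> key.endsWith(".csv"); // assumed stand-in for the real glob
        String lastConsumed = "test_csv_comma_header.csv";

        // The cursor lands on the non-matching .lz4 sibling, so it looks like more data exists.
        String cursor = rawLastKey(bucket);
        System.out.println("cursor differs from last consumed file: " + !cursor.equals(lastConsumed));

        // The second scheduling lists after the consumed file and finds nothing that matches.
        System.out.println("files found on second scheduling: " + listAfter(bucket, lastConsumed, glob));
    }
}
```

In this model the cursor suggests unread data while the follow-up listing returns an empty list, which is the combination that produces the "No new files found" error.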

Fix

Track lastMatchedKey (the last S3 key that actually matched the glob) during the listing loop.

When reachLimit=true, instead of breaking out of the for loop immediately, continue scanning the remaining
objects already fetched in the current page, and use the next glob-matching key found there as
currentMaxFile. No extra S3 API call is needed.

When no next matching key is found in the remaining page objects, fall back to lastMatchedKey instead of
the raw last S3 page key.
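
The fix can be sketched for a single already-fetched page roughly as follows. This is an illustration under stated assumptions, not the merged code: the class GlobCursorSketch, the method scanPage and the PageResult holder are hypothetical, and the glob matcher is passed in as a plain predicate. Only the names lastMatchedKey, currentMaxFile and reachLimit come from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of the fixed cursor tracking; not the actual globListInternal code. */
final class GlobCursorSketch {
    /** Matched files of one page plus the cursor recorded for the next scheduling. */
    static final class PageResult {
        final List<String> matched;
        final String currentMaxFile;

        PageResult(List<String> matched, String currentMaxFile) {
            this.matched = matched;
            this.currentMaxFile = currentMaxFile;
        }
    }

    static PageResult scanPage(List<String> pageKeys, Predicate<String> globMatches, int limit) {
        List<String> matched = new ArrayList<>();
        String lastMatchedKey = "";
        String currentMaxFile = "";
        boolean reachLimit = false;
        for (String key : pageKeys) {
            if (!globMatches.test(key)) {
                continue; // non-matching siblings never become the cursor
            }
            if (reachLimit) {
                // Limit already hit: the next matching key on this page becomes the cursor.
                currentMaxFile = key;
                break;
            }
            matched.add(key);
            lastMatchedKey = key;
            if (matched.size() >= limit) {
                reachLimit = true; // keep scanning the rest of the page, no extra request
            }
        }
        if (currentMaxFile.isEmpty()) {
            currentMaxFile = lastMatchedKey; // fall back to the last key that matched
        }
        return new PageResult(matched, currentMaxFile);
    }
}
```

For a page of f1.csv, f1.csv.lz4, f2.csv with a limit of 1 and a predicate that accepts only .csv keys, this sketch returns f1.csv as the batch and f2.csv as the cursor. With only f1.csv and f1.csv.lz4 on the page, the cursor falls back to f1.csv.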

Regression Test

Added test_streaming_job_no_new_files_with_sibling. The pattern example_[0-0].csv only matches
example_0.csv. Because getLongestPrefix strips at [, the S3 listing prefix becomes
regression/load/data/example_ and the listing returns both example_0.csv and example_1.csv, with
example_1.csv acting as the non-matching sibling. The test verifies that no failed tasks appear after the
first successful task.
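
The pattern trick used by the test can be checked in isolation. The sketch below uses the JDK's java.nio glob matcher as a stand-in for the matcher Doris uses, and longestPrefix is a hypothetical re-implementation that cuts the pattern at the first glob metacharacter. The real getLongestPrefix may differ in detail.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

/** Hypothetical illustration of the regression-test pattern; not Doris code. */
final class SiblingPatternSketch {
    /** Assumed behavior: the listing prefix is everything before the first glob metacharacter. */
    static String longestPrefix(String glob) {
        for (int i = 0; i < glob.length(); i++) {
            if ("*?[{\\".indexOf(glob.charAt(i)) >= 0) {
                return glob.substring(0, i);
            }
        }
        return glob;
    }

    /** Matches a key against a glob using the JDK matcher (stand-in for the Doris matcher). */
    static boolean matches(String glob, String key) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(key));
    }

    public static void main(String[] args) {
        String glob = "regression/load/data/example_[0-0].csv";
        System.out.println("listing prefix: " + longestPrefix(glob));
        System.out.println("example_0.csv matches: "
                + matches(glob, "regression/load/data/example_0.csv"));
        System.out.println("example_1.csv matches: "
                + matches(glob, "regression/load/data/example_1.csv"));
    }
}
```

Both keys share the listing prefix, so a prefix listing returns both, but only example_0.csv passes the glob. That leaves example_1.csv as the trailing non-matching key the bug depended on.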

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@JNSimba JNSimba requested a review from CalvinKirs as a code owner March 12, 2026 06:37
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba
Member Author

JNSimba commented Mar 12, 2026

run buildall

@JNSimba
Member Author

JNSimba commented Mar 12, 2026

/review

Contributor

@github-actions github-actions bot left a comment


Code Review Summary

Goal & Correctness

The PR fixes a real bug where currentMaxFile was unconditionally set to the last raw S3 key in a listing page, even when that key did not match the glob pattern. This caused hasMoreDataToConsume() to return true incorrectly, triggering a second scheduling that found no matching files and failed with "No new files found".

The fix correctly tracks lastMatchedKey (the last glob-matched key) and, when the batch limit is reached, scans remaining objects in the current S3 page to find the next glob-matching key as currentMaxFile. If no next match is found, it falls back to lastMatchedKey instead of a raw non-matching S3 key.

Critical Checkpoint Conclusions

  1. Does the code accomplish the goal? Yes — the core logic change correctly prevents non-matching sibling keys from being recorded as currentMaxFile. The fix addresses both scenarios described in the PR body (reachLimit=false and reachLimit=true). However, the regression test is incomplete (see below).

  2. Is the modification minimal and focused? Yes — only the globListInternal method in S3ObjStorage.java is modified, with a targeted change to how currentMaxFile is tracked.

  3. Concurrency? Not applicable — globListInternal is a synchronous method with no shared state.

  4. Lifecycle management? Not applicable.

  5. Configuration items? No new config added.

  6. Incompatible changes? None.

  7. Parallel code paths? The AzureObjStorage has a similar globList implementation but does NOT support globListWithLimit (no limit/pagination), so the bug is S3-specific and the fix is correctly scoped.

  8. Special conditional checks? The if (currentMaxFile.isEmpty()) guard at the end of each page iteration is clear in intent but has a minor side effect: in the non-limit multi-page path, currentMaxFile freezes at the first page's last matched key rather than the overall last matched key. This is harmless because the non-limit caller (globList()) discards currentMaxFile entirely — it only uses getStatus().

  9. Test coverage? The regression test correctly reproduces the bug scenario using example_[0-0].csv (matches only example_0.csv while S3 also returns example_1.csv). However, the .out file for qt_select is missing — see inline comment.

  10. Observability? Existing debug logging is sufficient.

  11. Performance? The post-limit scan adds negligible overhead (scanning remaining objects on a single already-fetched page). No extra S3 API calls.

  12. Other observations: The post-limit scan at lines 658-665 only checks the raw S3 key against the glob matcher, not parent paths (unlike the normal matching loop which does parent-walk). This is acceptable for the streaming job use case (flat file patterns), but should be noted for future reference if globListWithLimit is ever used with directory-level glob patterns.
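
The difference noted in item 12 can be illustrated with a small sketch. It assumes that "parent-walk" means testing a key and each of its parent paths against the glob. That reading, the class ParentWalkSketch and both helper methods are hypothetical, and the JDK glob matcher stands in for the one Doris uses.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

/** Hypothetical contrast between raw-key matching and a parent-walk; not Doris code. */
final class ParentWalkSketch {
    /** Raw-key check only, as the post-limit scan is described to do. */
    static boolean matchesRawKey(String glob, String key) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(key));
    }

    /** Assumed parent-walk: accept the key if it or any ancestor path matches the glob. */
    static boolean matchesKeyOrParent(String glob, String key) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        for (Path p = Paths.get(key); p != null; p = p.getParent()) {
            if (matcher.matches(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String glob = "data/*"; // directory-level pattern
        String key = "data/dir1/file.csv";
        System.out.println("raw key match: " + matchesRawKey(glob, key));
        System.out.println("parent-walk match: " + matchesKeyOrParent(glob, key));
    }
}
```

With a flat file pattern the two checks agree, which is why the gap does not affect the streaming job case. They diverge only for directory-level patterns such as the one in main.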

Contributor

@sollhui sollhui left a comment


LGTM

@JNSimba
Member Author

JNSimba commented Mar 12, 2026

run buildall

@JNSimba
Member Author

JNSimba commented Mar 12, 2026

run buildall

@doris-robot

TPC-H: Total hot run time: 28104 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit bbb78e508df22d57e74d231ea32f8d7e6b1d097d, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17645	4493	4359	4359
q2	q3	10736	836	530	530
q4	4701	353	253	253
q5	7880	1197	1036	1036
q6	233	173	152	152
q7	833	856	687	687
q8	10623	1520	1431	1431
q9	6724	4827	4794	4794
q10	6297	1947	1760	1760
q11	464	270	235	235
q12	738	588	477	477
q13	18045	2973	2193	2193
q14	235	226	223	223
q15	945	797	807	797
q16	765	726	700	700
q17	715	884	420	420
q18	6051	5433	5191	5191
q19	1269	1000	632	632
q20	519	502	387	387
q21	4753	1990	1581	1581
q22	421	349	266	266
Total cold run time: 100592 ms
Total hot run time: 28104 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4867	4730	4690	4690
q2	q3	3942	4419	3835	3835
q4	920	1203	803	803
q5	4137	4475	4415	4415
q6	188	174	148	148
q7	1811	1654	1547	1547
q8	2523	2776	2632	2632
q9	7614	7438	7505	7438
q10	3799	3971	3607	3607
q11	521	441	410	410
q12	484	592	448	448
q13	2914	3553	2347	2347
q14	287	303	273	273
q15	883	844	832	832
q16	740	785	762	762
q17	1182	1497	1392	1392
q18	7413	6819	6666	6666
q19	920	903	904	903
q20	2130	2220	2029	2029
q21	4012	3499	3393	3393
q22	486	443	416	416
Total cold run time: 51773 ms
Total hot run time: 48986 ms

@doris-robot

TPC-DS: Total hot run time: 153238 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit bbb78e508df22d57e74d231ea32f8d7e6b1d097d, data reload: false

query5	4356	639	509	509
query6	331	236	207	207
query7	4233	479	279	279
query8	350	264	242	242
query9	8690	2744	2732	2732
query10	546	387	351	351
query11	7417	5808	5572	5572
query12	194	129	132	129
query13	1267	467	349	349
query14	5554	3860	3553	3553
query14_1	2872	2814	2823	2814
query15	208	202	182	182
query16	999	489	502	489
query17	1115	718	635	635
query18	2451	462	354	354
query19	217	212	185	185
query20	138	135	130	130
query21	236	144	125	125
query22	4924	5078	4892	4892
query23	17003	15967	15931	15931
query23_1	15834	15683	15770	15683
query24	8529	1705	1260	1260
query24_1	1283	1304	1281	1281
query25	565	496	433	433
query26	1767	281	171	171
query27	3219	536	301	301
query28	4853	1883	1848	1848
query29	813	562	472	472
query30	300	255	215	215
query31	1369	1301	1218	1218
query32	81	77	76	76
query33	502	324	279	279
query34	938	919	570	570
query35	632	691	595	595
query36	1079	1143	943	943
query37	137	101	84	84
query38	2948	2926	2880	2880
query39	907	900	824	824
query39_1	824	820	843	820
query40	232	187	135	135
query41	65	58	61	58
query42	306	298	307	298
query43	240	250	216	216
query44	
query45	201	190	184	184
query46	899	987	607	607
query47	2147	2151	2055	2055
query48	315	327	231	231
query49	634	479	381	381
query50	672	292	209	209
query51	4140	4143	4041	4041
query52	288	290	288	288
query53	288	345	280	280
query54	306	270	266	266
query55	96	88	89	88
query56	319	318	303	303
query57	1384	1349	1287	1287
query58	283	280	276	276
query59	1331	1483	1292	1292
query60	341	333	324	324
query61	147	149	160	149
query62	645	606	550	550
query63	307	278	279	278
query64	5002	1265	985	985
query65	
query66	1476	463	359	359
query67	16431	16431	16347	16347
query68	
query69	402	317	278	278
query70	941	949	874	874
query71	345	306	303	303
query72	2808	2687	2464	2464
query73	543	551	315	315
query74	10009	10012	9784	9784
query75	2855	2779	2471	2471
query76	2271	1035	668	668
query77	364	395	315	315
query78	11133	11345	10652	10652
query79	2951	784	593	593
query80	1714	621	539	539
query81	600	279	241	241
query82	976	148	121	121
query83	340	262	241	241
query84	256	115	104	104
query85	920	564	514	514
query86	456	310	299	299
query87	3154	3101	3025	3025
query88	3599	2681	2670	2670
query89	421	373	347	347
query90	2031	180	183	180
query91	182	215	145	145
query92	77	75	74	74
query93	1391	843	497	497
query94	646	325	289	289
query95	578	398	315	315
query96	636	511	229	229
query97	2482	2471	2384	2384
query98	236	222	223	222
query99	996	1004	910	910
Total cold run time: 238581 ms
Total hot run time: 153238 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 33.33% (4/12) 🎉
Increment coverage report
Complete coverage report

@JNSimba
Member Author

JNSimba commented Mar 12, 2026

run feut

@JNSimba
Member Author

JNSimba commented Mar 12, 2026

run buildall

liaoxin01 previously approved these changes Mar 12, 2026
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 12, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 33.33% (4/12) 🎉
Increment coverage report
Complete coverage report

@doris-robot

TPC-H: Total hot run time: 27766 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 97eb14b3c16f0b0a78a13cc2cefb310d7d30b8fe, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17612	4493	4298	4298
q2	q3	10641	809	527	527
q4	4684	361	259	259
q5	7558	1225	1038	1038
q6	175	176	145	145
q7	789	838	666	666
q8	9310	1497	1351	1351
q9	5033	4735	4775	4735
q10	6244	1887	1677	1677
q11	490	271	244	244
q12	720	571	473	473
q13	18052	2985	2195	2195
q14	232	231	224	224
q15	921	798	784	784
q16	736	729	704	704
q17	702	841	418	418
q18	6044	5513	5250	5250
q19	1139	996	623	623
q20	493	494	384	384
q21	4409	2169	1517	1517
q22	405	332	254	254
Total cold run time: 96389 ms
Total hot run time: 27766 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4674	4479	4484	4479
q2	q3	3919	4406	3866	3866
q4	891	1185	773	773
q5	4078	4358	4340	4340
q6	218	177	143	143
q7	1783	1609	1534	1534
q8	2458	2729	2664	2664
q9	7542	7398	7305	7305
q10	3731	3992	3656	3656
q11	527	439	410	410
q12	517	633	490	490
q13	2710	3157	2346	2346
q14	283	305	277	277
q15	856	788	778	778
q16	758	787	712	712
q17	1149	1454	1359	1359
q18	7151	6834	6716	6716
q19	924	894	920	894
q20	2040	2174	1991	1991
q21	3990	3444	3507	3444
q22	447	438	384	384
Total cold run time: 50646 ms
Total hot run time: 48561 ms

@doris-robot

TPC-DS: Total hot run time: 153115 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 97eb14b3c16f0b0a78a13cc2cefb310d7d30b8fe, data reload: false

query5	4323	633	519	519
query6	336	220	202	202
query7	4212	483	273	273
query8	353	242	233	233
query9	8720	2752	2747	2747
query10	525	400	350	350
query11	7420	5898	5579	5579
query12	187	121	124	121
query13	1251	440	332	332
query14	6104	3821	3577	3577
query14_1	2788	2786	2822	2786
query15	201	191	173	173
query16	1015	464	462	462
query17	1086	721	586	586
query18	2470	434	337	337
query19	212	200	192	192
query20	138	130	129	129
query21	221	139	123	123
query22	4840	5110	4760	4760
query23	16042	15580	15351	15351
query23_1	15454	16268	15961	15961
query24	7301	1713	1277	1277
query24_1	1335	1329	1292	1292
query25	681	584	470	470
query26	1267	286	158	158
query27	3007	501	320	320
query28	4803	1908	1934	1908
query29	875	588	500	500
query30	314	247	217	217
query31	1390	1301	1226	1226
query32	82	80	76	76
query33	528	343	309	309
query34	937	925	590	590
query35	649	682	619	619
query36	1124	1139	1004	1004
query37	142	95	89	89
query38	2944	2959	2849	2849
query39	900	870	844	844
query39_1	818	841	829	829
query40	230	161	141	141
query41	68	65	64	64
query42	311	298	299	298
query43	260	254	218	218
query44	
query45	202	196	189	189
query46	875	983	609	609
query47	2107	2154	2066	2066
query48	305	323	245	245
query49	649	503	395	395
query50	682	288	214	214
query51	4231	4110	4053	4053
query52	290	294	284	284
query53	290	340	290	290
query54	315	285	278	278
query55	96	99	90	90
query56	344	338	321	321
query57	1360	1333	1313	1313
query58	302	290	287	287
query59	1300	1472	1293	1293
query60	353	326	317	317
query61	147	143	160	143
query62	635	589	536	536
query63	302	280	277	277
query64	5129	1264	989	989
query65	
query66	1444	451	375	375
query67	16530	16401	16305	16305
query68	
query69	408	309	292	292
query70	1016	958	962	958
query71	341	304	309	304
query72	2881	2892	2450	2450
query73	543	548	334	334
query74	9993	9881	9765	9765
query75	2863	2736	2461	2461
query76	2379	1029	671	671
query77	385	382	315	315
query78	11217	11355	10640	10640
query79	2614	833	602	602
query80	1754	632	542	542
query81	565	281	259	259
query82	1005	148	116	116
query83	339	263	242	242
query84	248	116	100	100
query85	893	496	441	441
query86	427	341	305	305
query87	3222	3108	3029	3029
query88	3570	2677	2662	2662
query89	414	369	346	346
query90	1986	175	172	172
query91	173	160	136	136
query92	76	79	69	69
query93	1206	859	505	505
query94	641	321	288	288
query95	594	396	311	311
query96	649	515	229	229
query97	2455	2506	2423	2423
query98	247	226	217	217
query99	1017	993	836	836
Total cold run time: 236650 ms
Total hot run time: 153115 ms

@JNSimba
Member Author

JNSimba commented Mar 13, 2026

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Mar 13, 2026
@doris-robot

TPC-H: Total hot run time: 27725 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6faa4dab32cc5705396089c110733f1d2cffe270, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17619	4499	4331	4331
q2	q3	10648	788	513	513
q4	4672	360	248	248
q5	7548	1189	1035	1035
q6	174	176	145	145
q7	806	834	665	665
q8	9562	1466	1305	1305
q9	5264	4753	4737	4737
q10	6323	1910	1662	1662
q11	467	266	238	238
q12	749	572	469	469
q13	18066	2959	2173	2173
q14	227	225	219	219
q15	928	807	801	801
q16	759	714	693	693
q17	701	864	409	409
q18	5954	5314	5299	5299
q19	1214	988	611	611
q20	499	487	387	387
q21	4518	2198	1512	1512
q22	373	323	273	273
Total cold run time: 97071 ms
Total hot run time: 27725 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4785	4678	4568	4568
q2	q3	3924	4470	3887	3887
q4	881	1197	773	773
q5	4055	4383	4311	4311
q6	190	181	141	141
q7	1765	1637	1513	1513
q8	2486	2683	2571	2571
q9	7736	7368	7343	7343
q10	3717	3968	3640	3640
q11	501	426	405	405
q12	481	591	454	454
q13	2659	3395	2345	2345
q14	283	308	273	273
q15	893	825	839	825
q16	735	796	704	704
q17	1435	1422	1381	1381
q18	7163	6887	6582	6582
q19	916	847	861	847
q20	2059	2171	1997	1997
q21	3905	3533	3329	3329
q22	497	450	405	405
Total cold run time: 51066 ms
Total hot run time: 48294 ms

@doris-robot

TPC-DS: Total hot run time: 153538 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 6faa4dab32cc5705396089c110733f1d2cffe270, data reload: false

query5	4328	637	502	502
query6	335	231	210	210
query7	4222	460	265	265
query8	355	261	249	249
query9	8729	2726	2746	2726
query10	538	378	358	358
query11	7449	5889	5616	5616
query12	185	130	125	125
query13	1274	440	347	347
query14	5728	3888	3578	3578
query14_1	2816	2847	2815	2815
query15	206	195	177	177
query16	1007	491	448	448
query17	1104	720	625	625
query18	2456	453	347	347
query19	214	215	187	187
query20	138	127	131	127
query21	258	143	129	129
query22	4840	4898	4926	4898
query23	16764	16116	15719	15719
query23_1	15933	15968	15811	15811
query24	8151	1709	1297	1297
query24_1	1335	1342	1342	1342
query25	593	487	447	447
query26	945	278	156	156
query27	2806	486	289	289
query28	4452	1854	1860	1854
query29	868	558	474	474
query30	308	247	221	221
query31	1329	1308	1208	1208
query32	84	71	75	71
query33	553	333	280	280
query34	905	911	561	561
query35	646	681	602	602
query36	1080	1113	994	994
query37	131	93	82	82
query38	2991	2972	2976	2972
query39	896	866	844	844
query39_1	830	977	827	827
query40	233	154	136	136
query41	63	62	59	59
query42	299	289	304	289
query43	243	241	223	223
query44	
query45	194	186	185	185
query46	881	972	612	612
query47	2152	2130	2065	2065
query48	316	313	221	221
query49	626	455	373	373
query50	672	283	220	220
query51	4177	4043	4029	4029
query52	287	287	281	281
query53	287	341	278	278
query54	293	277	287	277
query55	94	87	85	85
query56	314	325	332	325
query57	1356	1330	1285	1285
query58	296	283	277	277
query59	1318	1475	1322	1322
query60	344	366	318	318
query61	160	144	148	144
query62	621	598	537	537
query63	301	286	270	270
query64	4733	1318	1009	1009
query65	
query66	1415	467	352	352
query67	16239	16552	16294	16294
query68	
query69	391	305	278	278
query70	957	980	940	940
query71	336	307	295	295
query72	2776	2722	2520	2520
query73	537	545	327	327
query74	9953	9910	9824	9824
query75	2857	2803	2477	2477
query76	2286	1046	701	701
query77	371	381	305	305
query78	11166	11354	10647	10647
query79	1737	781	594	594
query80	1314	640	580	580
query81	556	288	247	247
query82	1015	159	122	122
query83	333	271	248	248
query84	260	118	100	100
query85	972	554	481	481
query86	421	313	304	304
query87	3147	3103	3016	3016
query88	3509	2651	2623	2623
query89	416	366	346	346
query90	2005	180	177	177
query91	164	156	136	136
query92	79	74	66	66
query93	995	842	494	494
query94	625	347	296	296
query95	578	399	318	318
query96	658	520	231	231
query97	2435	2522	2411	2411
query98	248	220	224	220
query99	1013	992	880	880
Total cold run time: 234914 ms
Total hot run time: 153538 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 33.33% (4/12) 🎉
Increment coverage report
Complete coverage report

@JNSimba JNSimba merged commit fbe4993 into apache:master Mar 13, 2026
28 of 29 checks passed
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 13, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

github-actions bot pushed a commit that referenced this pull request Mar 13, 2026
…d scheduling (#61249)


Labels

approved (Indicates a PR has been approved by one committer.), dev/4.0.x, dev/4.1.x, reviewed


6 participants