[fix](job) fix streaming job stuck when S3 auth error is silently ignored in fetchRemoteMeta#61284

Merged
JNSimba merged 2 commits into apache:master from JNSimba:fetch_remote_error
Mar 13, 2026

JNSimba (Member) commented Mar 12, 2026

What problem does this PR solve?

Problem

When S3 credentials become invalid (e.g. 403 auth error), the streaming job neither pauses nor reports an error. It hangs
indefinitely without making progress, even when new files are added.

Root cause:

S3ObjStorage.globListInternal() catches all exceptions and returns a GlobListResult with a non-ok Status instead of
rethrowing. S3SourceOffsetProvider.fetchRemoteMeta() called globListWithLimit() but never checked the returned status.
Since objects was empty, the maxEndFile was never updated, hasMoreDataToConsume() kept returning false, and the scheduler
retried every 500ms forever without triggering a PAUSE.

The same status check was also missing in getNextOffset(), which would produce a misleading "No new files found" error
instead of the actual S3 error message.
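
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the Doris code: `ListResult`, `listFiles`, `fetchRemoteMetaIgnoringStatus`, and `hasMoreData` are hypothetical stand-ins for `GlobListResult`, `globListWithLimit()`, `fetchRemoteMeta()`, and `hasMoreDataToConsume()`. It illustrates how a status-carrying result that nobody inspects makes a failed listing indistinguishable from an empty one.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a status-carrying list result (not the Doris GlobListResult API).
final class ListResult {
    final boolean ok;
    final String errMsg;
    final List<String> files;

    ListResult(boolean ok, String errMsg, List<String> files) {
        this.ok = ok;
        this.errMsg = errMsg;
        this.files = files;
    }
}

final class SilentFailureDemo {
    private String maxEndFile = "a.csv";     // last file known on the remote side
    private String currentOffset = "a.csv";  // last file already consumed

    // Simulates a storage layer that catches every exception and reports it via the status.
    static ListResult listFiles(boolean authBroken) {
        if (authBroken) {
            return new ListResult(false, "403 Forbidden", Collections.emptyList());
        }
        return new ListResult(true, "", List.of("b.csv"));
    }

    // Buggy refresh: the status is never inspected, so an error looks like "no new files".
    void fetchRemoteMetaIgnoringStatus(boolean authBroken) {
        ListResult result = listFiles(authBroken);
        if (!result.files.isEmpty()) {
            maxEndFile = result.files.get(result.files.size() - 1);
        }
    }

    boolean hasMoreData() {
        return maxEndFile.compareTo(currentOffset) > 0;
    }

    public static void main(String[] args) {
        SilentFailureDemo demo = new SilentFailureDemo();
        demo.fetchRemoteMetaIgnoringStatus(true);
        // The listing failed, yet nothing signals an error to the caller.
        System.out.println("hasMoreData after failed listing: " + demo.hasMoreData());
    }
}
```

In this sketch a scheduler that only looks at `hasMoreData()` has no way to tell a broken credential from a quiet bucket, which is the stuck state the PR describes.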

Fix

  • In fetchRemoteMeta(): check globListResult status after globListWithLimit(); throw Exception with the real error message
    if not ok, so the upper-level StreamingInsertJob.fetchMeta() catch block can catch it, set GET_REMOTE_DATA_ERROR, and PAUSE
    the job for auto-resume.
  • In getNextOffset(): same status check, throw RuntimeException with accurate error message.
  • Add a debug point S3SourceOffsetProvider.fetchRemoteMeta.error to simulate a failed GlobListResult for testing.
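
The shape of the fix can be sketched as follows. This is an illustration under assumptions, not the merged patch: `Listing`, `JobState`, and the field names are hypothetical, and the `"null result"` detail text is invented for the example. Only the idea comes from the description above: check the status, throw with the real error, and let the caller's catch block pause the job.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a status-carrying list result.
final class Listing {
    final boolean ok;
    final String errMsg;
    final List<String> files;

    Listing(boolean ok, String errMsg, List<String> files) {
        this.ok = ok;
        this.errMsg = errMsg;
        this.files = files;
    }
}

final class PauseOnErrorDemo {
    enum JobState { RUNNING, PAUSED }

    JobState state = JobState.RUNNING;
    String errorMsg = "";

    // Fixed refresh: a null or non-ok result becomes an exception carrying the real error.
    static List<String> fetchRemoteMeta(Listing result) throws Exception {
        if (result == null || !result.ok) {
            String detail = (result == null) ? "null result" : result.errMsg; // assumed wording
            throw new Exception("Failed to list S3 files: " + detail);
        }
        return result.files;
    }

    // Caller side: the catch block records the message and pauses instead of retrying blindly.
    void fetchMeta(Listing result) {
        try {
            fetchRemoteMeta(result);
        } catch (Exception e) {
            errorMsg = e.getMessage();
            state = JobState.PAUSED;
        }
    }

    public static void main(String[] args) {
        PauseOnErrorDemo job = new PauseOnErrorDemo();
        job.fetchMeta(new Listing(false, "403 Forbidden", Collections.emptyList()));
        System.out.println(job.state + ": " + job.errorMsg);
    }
}
```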

Test

Added regression test test_streaming_insert_job_fetch_meta_error: enables the debug point to inject a failed
GlobListResult, creates a streaming job, waits for it to reach PAUSED status, and asserts the ErrorMsg contains "Failed to
list S3 files".
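
The "wait for PAUSED, then assert on the message" step of such a test boils down to polling with a timeout. The actual regression test is a Groovy suite; the Java helper below is only a sketch of that polling idea, and `AwaitDemo`, `await`, and the simulated status source are hypothetical names introduced for illustration.

```java
import java.util.function.BooleanSupplier;

final class AwaitDemo {
    // Polls until the condition holds or the timeout elapses; returns whether it held.
    static boolean await(BooleanSupplier condition, long timeoutMs, long intervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return false;
            }
        }
    }

    public static void main(String[] args) {
        int[] polls = {0};
        // Hypothetical status source: reports PAUSED from the third poll onward.
        boolean paused = await(() -> ++polls[0] >= 3, 2_000, 10);
        String errorMsg = "Failed to list S3 files: injected error"; // hypothetical ErrorMsg value
        System.out.println("paused=" + paused + ", polls=" + polls[0]
                + ", messageMatches=" + errorMsg.contains("Failed to list S3 files"));
    }
}
```

A bounded wait matters here because the bug under test is a hang: without a timeout, a regression would stall the test instead of failing it.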

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@JNSimba JNSimba requested a review from Copilot March 12, 2026 12:11

JNSimba (Member, Author) commented Mar 12, 2026

run buildall

Copilot AI left a comment

Pull request overview

Fixes a streaming insert job hang when S3 listing fails (e.g., auth/403) by surfacing the error and pausing the job, plus adds a regression test to prevent silent stalls.

Changes:

  • Validate GlobListResult status in getNextOffset() and fetchRemoteMeta() and throw with the real S3 error when listing fails.
  • Add a debug point to simulate failed S3 listing for testing.
  • Add a regression test asserting the job transitions to PAUSED and exposes the correct error message.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Files reviewed:

  • regression-test/suites/job_p0/streaming_job/test_streaming_insert_job_fetch_meta_error.groovy: Adds regression coverage for PAUSE-on-fetch-meta error behavior via debug point injection.
  • fe/fe-core/src/main/java/org/apache/doris/job/offset/s3/S3SourceOffsetProvider.java: Propagates S3 listing errors by checking GlobListResult status and adds a debug point for failure injection.

JNSimba (Member, Author) commented Mar 12, 2026

/review

github-actions bot added the approved label (indicates a PR has been approved by one committer) Mar 12, 2026

github-actions (Contributor) commented

PR approved by at least one committer and no changes requested.

github-actions (Contributor) commented

PR approved by anyone and no changes requested.

github-actions bot (Contributor) left a comment

Code Review Summary

PR Goal

Fix streaming job hanging when S3 auth errors are silently ignored in fetchRemoteMeta(). The root cause is that S3ObjStorage.globListInternal() catches all exceptions and returns a GlobListResult with a non-ok Status, but S3SourceOffsetProvider.fetchRemoteMeta() never checked the returned status. This fix adds status checks in both fetchRemoteMeta() and getNextOffset() and adds a regression test.

Critical Checkpoint Conclusions

  1. Goal accomplishment: The fix correctly addresses the root cause. The fetchRemoteMeta() fix is sound — the thrown Exception is properly caught by StreamingInsertJob.fetchMeta() which sets GET_REMOTE_DATA_ERROR and pauses the job for auto-resume. The regression test proves the fix works.

  2. Modification size/focus: The change is small (18 additions, 3 deletions in the Java file) and focused on the single issue. Good.

  3. Concurrency: No new concurrency concerns introduced. The maxEndFile and currentOffset fields were already used without synchronization by the existing code; this PR doesn't change that pattern.

  4. Lifecycle/static init: No changes. N/A.

  5. Configuration items: None added. N/A.

  6. Incompatible changes: None. N/A.

  7. Parallel code paths: Both getNextOffset() and fetchRemoteMeta() are patched. Issue found in getNextOffset() — see inline comment about double-wrapping of RuntimeException.

  8. Special conditional checks: The globListResult == null check is defensive but reasonable since the interface contract of globListWithLimit doesn't guarantee non-null. The !globListResult.getStatus().ok() check is the core fix.

  9. Test coverage: A regression test is added that uses a debug point to inject the failure and verifies the job transitions to PAUSED with the correct error message. The test follows existing conventions in the streaming job test suite. No negative test for getNextOffset() is included, though the fetchRemoteMeta() path is the primary one that caused the hang.

  10. Observability: The existing log.warn in fetchMeta() and getNextOffset() catch blocks provide adequate logging. The error message includes the original S3 error detail.

  11. Transaction/persistence: N/A.

  12. FE-BE variable passing: N/A.

  13. Performance: No performance concerns. The status check is trivial.

  14. Other issues: One bug found — see inline comment.

Verdict

One bug found in getNextOffset(): the new RuntimeException thrown inside the try block gets caught by the outer catch (Exception e) and double-wrapped, losing the clear error message. See inline comment for details and suggested fix.

import org.apache.doris.common.util.DebugPointUtil;
import org.apache.doris.datasource.property.storage.StorageProperties;
import org.apache.doris.fs.FileSystemFactory;
import org.apache.doris.fs.GlobListResult;

Bug: Double-wrapped RuntimeException loses the clear error message.

This throw new RuntimeException("Failed to list S3 files: " + errMsg) is inside the try (RemoteFileSystem ...) block. It will be immediately caught by the catch (Exception e) at line 106 (original numbering), which re-wraps it as throw new RuntimeException(e). The result is a double-wrapped exception where the outer message is just the toString() of the inner RuntimeException, losing the clean "Failed to list S3 files" message that the caller (AbstractStreamingTask.execute()) uses for errMsg.

Suggested fix: either (a) catch only non-RuntimeException in the catch block by re-throwing RuntimeException before wrapping, or (b) move the status check outside/after the try-with-resources, or (c) add a specific catch clause. For example:

} catch (RuntimeException e) {
    log.warn("list path exception, path={}", filePath, e);
    throw e;
} catch (Exception e) {
    log.warn("list path exception, path={}", filePath, e);
    throw new RuntimeException(e);
}

Note: The same issue applies to the existing throw new RuntimeException("No new files found in path: " + filePath) at line 104 (pre-existing), but since this PR introduces the new throw, it would be good to fix the catch block as well.
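
The effect of the double wrap is easy to check in isolation. The snippet below is a standalone sketch rather than the `getNextOffset()` code; `WrapDemo` and its method names are hypothetical, and the "403 Forbidden" detail is made up for the example. It relies only on standard `Throwable` behavior: `new RuntimeException(cause)` takes its message from `cause.toString()`, which prefixes the class name.

```java
final class WrapDemo {
    static final String CLEAN = "Failed to list S3 files: 403 Forbidden";

    // Problematic shape: the RuntimeException thrown in the try block is caught and re-wrapped.
    static String messageWhenRewrapped() {
        try {
            try {
                throw new RuntimeException(CLEAN);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        } catch (RuntimeException outer) {
            return outer.getMessage();
        }
    }

    // Suggested shape: RuntimeException is rethrown as-is, other exceptions are still wrapped.
    static String messageWhenRethrown() {
        try {
            try {
                throw new RuntimeException(CLEAN);
            } catch (RuntimeException e) {
                throw e;
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        } catch (RuntimeException outer) {
            return outer.getMessage();
        }
    }

    public static void main(String[] args) {
        // prints "java.lang.RuntimeException: Failed to list S3 files: 403 Forbidden"
        System.out.println(messageWhenRewrapped());
        // prints "Failed to list S3 files: 403 Forbidden"
        System.out.println(messageWhenRethrown());
    }
}
```

The original text is still reachable through `getCause()` in the re-wrapped case, but a caller that reads only `getMessage()` sees the class-name prefix instead of the clean message.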

hello-stephen (Contributor) commented

FE UT Coverage Report

Increment line coverage 0.00% (0/11) 🎉
Increment coverage report
Complete coverage report

doris-robot commented

TPC-H: Total hot run time: 27850 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6bd4864c2a5c349cd516568919e66b4807128aca, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17670	4431	4317	4317
q2	q3	10644	837	531	531
q4	4678	369	256	256
q5	7604	1198	1021	1021
q6	180	175	145	145
q7	826	877	675	675
q8	10223	1520	1344	1344
q9	6049	4825	4849	4825
q10	6316	1931	1642	1642
q11	441	268	253	253
q12	700	567	460	460
q13	18058	2958	2189	2189
q14	236	230	212	212
q15	945	799	821	799
q16	766	734	702	702
q17	718	871	417	417
q18	5990	5492	5342	5342
q19	1111	996	592	592
q20	497	505	399	399
q21	4696	1953	1431	1431
q22	360	337	298	298
Total cold run time: 98708 ms
Total hot run time: 27850 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4604	4664	4573	4573
q2	q3	3916	4366	3845	3845
q4	893	1234	774	774
q5	4120	4401	4340	4340
q6	190	178	134	134
q7	1771	1632	1608	1608
q8	2496	2836	2600	2600
q9	7606	7475	7521	7475
q10	3819	4013	3606	3606
q11	511	435	409	409
q12	482	595	454	454
q13	2765	3444	2404	2404
q14	330	317	272	272
q15	869	813	799	799
q16	707	793	738	738
q17	1139	1409	1327	1327
q18	7191	6884	6800	6800
q19	978	870	893	870
q20	2133	2181	2056	2056
q21	3939	3461	3317	3317
q22	465	433	370	370
Total cold run time: 50924 ms
Total hot run time: 48771 ms

doris-robot commented

TPC-DS: Total hot run time: 154485 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 6bd4864c2a5c349cd516568919e66b4807128aca, data reload: false

query5	4333	655	509	509
query6	333	240	206	206
query7	4227	469	274	274
query8	335	250	231	231
query9	8724	2745	2775	2745
query10	482	378	341	341
query11	7348	5867	5646	5646
query12	192	127	124	124
query13	1265	455	366	366
query14	5752	3907	3656	3656
query14_1	2852	2856	2822	2822
query15	208	198	177	177
query16	983	477	470	470
query17	854	705	599	599
query18	2426	448	341	341
query19	206	200	173	173
query20	134	132	129	129
query21	229	142	124	124
query22	4819	5023	4974	4974
query23	16716	16004	15919	15919
query23_1	15980	16287	15906	15906
query24	8467	1685	1328	1328
query24_1	1329	1295	1312	1295
query25	571	535	524	524
query26	1785	280	161	161
query27	2764	494	297	297
query28	4525	1889	1867	1867
query29	851	569	471	471
query30	320	253	212	212
query31	1357	1280	1213	1213
query32	80	77	76	76
query33	535	327	282	282
query34	944	919	573	573
query35	633	688	596	596
query36	1101	1150	967	967
query37	142	106	86	86
query38	2919	2901	2877	2877
query39	897	875	861	861
query39_1	802	833	811	811
query40	230	162	139	139
query41	63	61	58	58
query42	303	304	310	304
query43	248	255	221	221
query44	
query45	208	208	197	197
query46	902	995	617	617
query47	2129	2150	2044	2044
query48	321	344	243	243
query49	633	476	376	376
query50	718	280	216	216
query51	4152	4150	4059	4059
query52	286	289	284	284
query53	294	331	286	286
query54	299	279	269	269
query55	93	90	92	90
query56	315	330	307	307
query57	1371	1362	1310	1310
query58	294	294	284	284
query59	1366	1498	1298	1298
query60	353	361	334	334
query61	154	149	147	147
query62	635	591	552	552
query63	307	282	283	282
query64	5091	1307	992	992
query65	
query66	1470	500	376	376
query67	16484	16684	16422	16422
query68	
query69	399	323	293	293
query70	1024	1011	960	960
query71	357	318	314	314
query72	3033	2895	2568	2568
query73	543	561	342	342
query74	10004	9983	9858	9858
query75	2883	2815	2500	2500
query76	2298	1056	697	697
query77	379	406	325	325
query78	11140	11245	10715	10715
query79	3040	817	590	590
query80	1752	629	559	559
query81	595	288	247	247
query82	1020	154	120	120
query83	333	276	257	257
query84	303	127	97	97
query85	925	478	430	430
query86	500	322	297	297
query87	3161	3126	3033	3033
query88	3629	2670	2789	2670
query89	433	371	358	358
query90	2111	184	182	182
query91	171	166	142	142
query92	84	76	69	69
query93	1790	855	489	489
query94	645	333	283	283
query95	592	335	313	313
query96	666	557	234	234
query97	2465	2514	2418	2418
query98	232	219	220	219
query99	1012	988	876	876
Total cold run time: 239533 ms
Total hot run time: 154485 ms

@JNSimba JNSimba merged commit a69cf85 into apache:master Mar 13, 2026
29 of 32 checks passed
github-actions bot pushed a commit that referenced this pull request Mar 13, 2026
…ored in fetchRemoteMeta (#61284)

yiguolei pushed a commit that referenced this pull request Mar 13, 2026
…silently ignored in fetchRemoteMeta #61284 (#61296)

Cherry-picked from #61284

Co-authored-by: wudi <wudi@selectdb.com>
Labels

approved (indicates a PR has been approved by one committer), dev/4.0.5-merged, reviewed
