Search before asking
Fluss version
0.9.0 (latest release)
Please describe the bug 🐞
This job is a long-running backfill task. After enabling checkpointing, it had been running normally for about a week. However, after a restart, the job failed to start successfully. The error indicates that one of the subtasks was unable to subscribe to a bucket. The detailed error is described below.
Failed Flink app restored from the checkpoint
JobManager
2026-04-15 00:14:44,708 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: dwd_qixiao_feature_snapshot_cover_5min_v7[1] -> Calc[2] -> TableToDataSteam (188/240) (488ed58bf3787820d1f4604276a8c07d_cbc357ccb763df2852fee8c4fc7d55f2_187_0) switched from INITIALIZING to RUNNING.
2026-04-15 00:14:44,721 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: dwd_qixiao_feature_snapshot_cover_5min_v7[1] -> Calc[2] -> TableToDataSteam (6/240) (488ed58bf3787820d1f4604276a8c07d_cbc357ccb763df2852fee8c4fc7d55f2_5_0) switched from RUNNING to FAILED on container_e83_1769589104014_23426144_01_000016 @ node77-73-228-bddx.qiyi.hadoop (dataPort=37209).
java.lang.RuntimeException: One or more fetchers have encountered exception
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager.checkErrors(SplitFetcherManager.java:263) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.SourceReaderBase.getNextFetch(SourceReaderBase.java:185) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:147) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:419) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:562) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:858) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:807) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:971) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:950) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:764) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) ~[flink-dist-1.18.0.jar:1.18.0]
at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: java.lang.RuntimeException: SplitFetcher thread 0 received unexpected exception while polling the records
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:168) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:117) ~[flink-connector-files-1.18.0.jar:1.18.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
... 1 more
Caused by: org.apache.fluss.exception.FetchException: The fetching offset 1546148 is out of range: org.apache.fluss.exception.LogOffsetOutOfRangeException: Received request for offset 1546148 for table bucket TableBucket{tableId=53, partitionId=525, bucket=50}, but we only have log segments from offset 2199904 up to 2799328.
at org.apache.fluss.client.table.scanner.log.LogFetchCollector.handleInitializeErrors(LogFetchCollector.java:268) ~[?:?]
at org.apache.fluss.client.table.scanner.log.LogFetchCollector.initialize(LogFetchCollector.java:203) ~[?:?]
at org.apache.fluss.client.table.scanner.log.LogFetchCollector.collectFetch(LogFetchCollector.java:101) ~[?:?]
at org.apache.fluss.client.table.scanner.log.LogFetcher.collectFetch(LogFetcher.java:165) ~[?:?]
at org.apache.fluss.client.table.scanner.log.LogScannerImpl.pollForFetches(LogScannerImpl.java:251) ~[?:?]
at org.apache.fluss.client.table.scanner.log.LogScannerImpl.poll(LogScannerImpl.java:144) ~[?:?]
at org.apache.fluss.flink.source.reader.FlinkSourceSplitReader.fetch(FlinkSourceSplitReader.java:174) ~[?:?]
at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:165) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:117) ~[flink-connector-files-1.18.0.jar:1.18.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
... 1 more
2026-04-15 00:14:44,769 INFO org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Removing registered reader after failure for subtask 5 (#0) of source Source: dwd_qixiao_feature_snapshot_cover_5min_v7[1].
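For context, the out-of-range condition reported in the trace above amounts to a simple range check: a fetch offset must fall inside the currently retained log segment range. The following is a simplified sketch in Python (the function and exception names are my own, not Fluss's actual implementation):

```python
# Simplified sketch (hypothetical names, not Fluss's actual code) of the
# range check behind LogOffsetOutOfRangeException: a fetch offset must lie
# inside the retained segment range [log_start_offset, log_end_offset].
class LogOffsetOutOfRangeError(Exception):
    pass

def check_fetch_offset(fetch_offset, log_start_offset, log_end_offset):
    if fetch_offset < log_start_offset or fetch_offset > log_end_offset:
        raise LogOffsetOutOfRangeError(
            f"Received request for offset {fetch_offset}, but we only have "
            f"log segments from offset {log_start_offset} up to {log_end_offset}"
        )

# Values from the trace: the checkpointed offset 1546148 is *below* the
# retained range [2199904, 2799328], so the fetch is rejected. The older
# segments were presumably removed (e.g. by log retention/TTL) between the
# checkpoint and the restore.
```

Note that the restored offset is below the log start offset, not beyond the end, which points at segment deletion on the server side rather than the subtask running ahead.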
Tracing back to the first failed subtask, its logs show the following.
Failed TaskManager
2026-04-15 00:14:42,649 ERROR org.apache.fluss.client.table.scanner.log.LogFetcher [] - Failed to fetch log from node 2 for bucket TableBucket{tableId=53, partitionId=525, bucket=50}
org.apache.fluss.exception.LogOffsetOutOfRangeException: The log offset is out of range.
2026-04-15 00:14:42,653 ERROR org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager [] - Received uncaught exception.
java.lang.RuntimeException: SplitFetcher thread 0 received unexpected exception while polling the records
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:168) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:117) [flink-connector-files-1.18.0.jar:1.18.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: org.apache.fluss.exception.FetchException: The fetching offset 1546148 is out of range: org.apache.fluss.exception.LogOffsetOutOfRangeException: Received request for offset 1546148 for table bucket TableBucket{tableId=53, partitionId=525, bucket=50}, but we only have log segments from offset 2199904 up to 2799328.
at org.apache.fluss.client.table.scanner.log.LogFetchCollector.handleInitializeErrors(LogFetchCollector.java:268) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.fluss.client.table.scanner.log.LogFetchCollector.initialize(LogFetchCollector.java:203) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.fluss.client.table.scanner.log.LogFetchCollector.collectFetch(LogFetchCollector.java:101) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.fluss.client.table.scanner.log.LogFetcher.collectFetch(LogFetcher.java:165) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.fluss.client.table.scanner.log.LogScannerImpl.pollForFetches(LogScannerImpl.java:251) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.fluss.client.table.scanner.log.LogScannerImpl.poll(LogScannerImpl.java:144) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.fluss.flink.source.reader.FlinkSourceSplitReader.fetch(FlinkSourceSplitReader.java:174) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:165) ~[flink-connector-files-1.18.0.jar:1.18.0]
... 6 more
The failed subtask's assigned splits are as follows:
[LogSplit{tableBucket=TableBucket{tableId=53, partitionId=573, bucket=2}, partitionName='2026-04-14', startingOffset=5284909, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=541, bucket=34}, partitionName='2026-04-10', startingOffset=4901521, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=525, bucket=50}, partitionName='2026-04-07', startingOffset=1546148, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=565, bucket=10}, partitionName='2026-04-13', startingOffset=4973472, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=533, bucket=42}, partitionName='2026-04-09', startingOffset=4694976, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=549, bucket=26}, partitionName='2026-04-11', startingOffset=5815237, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=526, bucket=19}, partitionName='2026-04-08', startingOffset=4835470, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=557, bucket=18}, partitionName='2026-04-12', startingOffset=5765776, stoppingOffset=-9223372036854775808}]
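For readability, here is a small helper (my own, purely for this report) that parses the LogSplit strings above. One detail worth noting: stoppingOffset=-9223372036854775808 is Java's Long.MIN_VALUE, which appears to act as a sentinel for "no stopping offset" (an unbounded log split):

```python
import re

# Helper (hypothetical, for this report only) to parse the LogSplit strings.
# Note: stoppingOffset=-9223372036854775808 is Java's Long.MIN_VALUE, which
# appears to be the sentinel for "no stopping offset".
SPLIT_RE = re.compile(
    r"tableId=(\d+), partitionId=(\d+), bucket=(\d+)}, "
    r"partitionName='([\d-]+)', startingOffset=(\d+), stoppingOffset=(-?\d+)"
)

def parse_split(text):
    m = SPLIT_RE.search(text)
    table_id, partition_id, bucket, name, start, stop = m.groups()
    return {
        "tableId": int(table_id),
        "partitionId": int(partition_id),
        "bucket": int(bucket),
        "partitionName": name,
        "startingOffset": int(start),
        "stoppingOffset": int(stop),
    }

# The split that fails on restore:
split = parse_split(
    "LogSplit{tableBucket=TableBucket{tableId=53, partitionId=525, bucket=50}, "
    "partitionName='2026-04-07', startingOffset=1546148, "
    "stoppingOffset=-9223372036854775808}"
)
```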
To check the consuming progress of the above splits, I compared the checkpointed subtask offsets against the cluster-side end offsets retrieved from the Fluss cluster's Prometheus dashboard, using PromQL queries such as fluss_tabletserver_table_bucket_log_endOffset{table="xxxx", bucket="50", partition="2026_04_07"}:
| partition | partition-id | bucket | Subtask offset | Cluster-side bucket offset |
| --- | --- | --- | --- | --- |
| 2026_04_07 | 525 | 50 | 1546148 | 2799328 |
| 2026_04_08 | 526 | 19 | 4835470 | 4835470 |
| 2026_04_09 | 533 | 42 | 4694976 | 4694976 |
| 2026_04_10 | 541 | 34 | 4901521 | 4901521 |
| 2026_04_11 | 549 | 26 | 5815237 | 5815237 |
| 2026_04_12 | 557 | 18 | 5765776 | 5765776 |
| 2026_04_13 | 565 | 10 | 4973472 | 4973472 |
| 2026_04_14 | | | | |
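The comparison in the table can be sketched as a small script (offsets copied from this report; the dictionary layout is my own):

```python
# Flag buckets whose checkpointed subtask offset lags behind the
# cluster-side log end offset. All numbers are taken from this report;
# the 2026_04_14 row is omitted because its offsets were not captured.
subtask_offsets = {
    ("2026_04_07", 525, 50): 1546148,
    ("2026_04_08", 526, 19): 4835470,
    ("2026_04_09", 533, 42): 4694976,
    ("2026_04_10", 541, 34): 4901521,
    ("2026_04_11", 549, 26): 5815237,
    ("2026_04_12", 557, 18): 5765776,
    ("2026_04_13", 565, 10): 4973472,
}
cluster_end_offsets = {
    ("2026_04_07", 525, 50): 2799328,
    ("2026_04_08", 526, 19): 4835470,
    ("2026_04_09", 533, 42): 4694976,
    ("2026_04_10", 541, 34): 4901521,
    ("2026_04_11", 549, 26): 5815237,
    ("2026_04_12", 557, 18): 5765776,
    ("2026_04_13", 565, 10): 4973472,
}

# Map each lagging bucket to (subtask offset, cluster end offset).
lagging = {
    key: (sub, cluster_end_offsets[key])
    for key, sub in subtask_offsets.items()
    if cluster_end_offsets[key] > sub
}
```

Running this leaves only the 2026_04_07 / bucket 50 entry in `lagging`: every other bucket had caught up to the cluster-side end offset.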
It appears that consumption of 2026_04_07 (partitionId=525, bucket=50) was stuck, which is unexpected: the Flink job had been running for about a week and had already caught up to the latest offset on every other bucket. Yet consumption of this one bucket silently stalled, and the job did not fail fast, which is concerning.
Unfortunately, I cannot determine the root cause of the consumption hang because the logs from 2026_04_07 are no longer available.
Solution
No response
Are you willing to submit a PR?