Flink log table consumer threw an out-of-range offset exception #3084

@zuston

Description

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

0.9.0 (latest release)

Please describe the bug 🐞

This job is a long-running backfill task. After checkpointing was enabled, it ran normally for about a week. However, after a restart, the job failed to start: the error indicates that one of the subtasks was unable to subscribe to a bucket. The detailed error is described below.

Failed Flink app restored from the checkpoint

JobManager

2026-04-15 00:14:44,708 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: dwd_qixiao_feature_snapshot_cover_5min_v7[1] -> Calc[2] -> TableToDataSteam (188/240) (488ed58bf3787820d1f4604276a8c07d_cbc357ccb763df2852fee8c4fc7d55f2_187_0) switched from INITIALIZING to RUNNING.
2026-04-15 00:14:44,721 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: dwd_qixiao_feature_snapshot_cover_5min_v7[1] -> Calc[2] -> TableToDataSteam (6/240) (488ed58bf3787820d1f4604276a8c07d_cbc357ccb763df2852fee8c4fc7d55f2_5_0) switched from RUNNING to FAILED on container_e83_1769589104014_23426144_01_000016 @ node77-73-228-bddx.qiyi.hadoop (dataPort=37209).
java.lang.RuntimeException: One or more fetchers have encountered exception
        at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager.checkErrors(SplitFetcherManager.java:263) ~[flink-connector-files-1.18.0.jar:1.18.0]
        at org.apache.flink.connector.base.source.reader.SourceReaderBase.getNextFetch(SourceReaderBase.java:185) ~[flink-connector-files-1.18.0.jar:1.18.0]
        at org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:147) ~[flink-connector-files-1.18.0.jar:1.18.0]
        at org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:419) ~[flink-dist-1.18.0.jar:1.18.0]
        at org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68) ~[flink-dist-1.18.0.jar:1.18.0]
        at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) ~[flink-dist-1.18.0.jar:1.18.0]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:562) ~[flink-dist-1.18.0.jar:1.18.0]
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) ~[flink-dist-1.18.0.jar:1.18.0]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:858) ~[flink-dist-1.18.0.jar:1.18.0]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:807) ~[flink-dist-1.18.0.jar:1.18.0]
        at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:971) ~[flink-dist-1.18.0.jar:1.18.0]
        at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:950) ~[flink-dist-1.18.0.jar:1.18.0]
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:764) ~[flink-dist-1.18.0.jar:1.18.0]
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) ~[flink-dist-1.18.0.jar:1.18.0]
        at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: java.lang.RuntimeException: SplitFetcher thread 0 received unexpected exception while polling the records
        at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:168) ~[flink-connector-files-1.18.0.jar:1.18.0]
        at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:117) ~[flink-connector-files-1.18.0.jar:1.18.0]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        ... 1 more
Caused by: org.apache.fluss.exception.FetchException: The fetching offset 1546148 is out of range: org.apache.fluss.exception.LogOffsetOutOfRangeException: Received request for offset 1546148 for table bucket TableBucket{tableId=53, partitionId=525, bucket=50}, but we only have log segments from offset 2199904 up to 2799328.
        at org.apache.fluss.client.table.scanner.log.LogFetchCollector.handleInitializeErrors(LogFetchCollector.java:268) ~[?:?]
        at org.apache.fluss.client.table.scanner.log.LogFetchCollector.initialize(LogFetchCollector.java:203) ~[?:?]
        at org.apache.fluss.client.table.scanner.log.LogFetchCollector.collectFetch(LogFetchCollector.java:101) ~[?:?]
        at org.apache.fluss.client.table.scanner.log.LogFetcher.collectFetch(LogFetcher.java:165) ~[?:?]
        at org.apache.fluss.client.table.scanner.log.LogScannerImpl.pollForFetches(LogScannerImpl.java:251) ~[?:?]
        at org.apache.fluss.client.table.scanner.log.LogScannerImpl.poll(LogScannerImpl.java:144) ~[?:?]
        at org.apache.fluss.flink.source.reader.FlinkSourceSplitReader.fetch(FlinkSourceSplitReader.java:174) ~[?:?]
        at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58) ~[flink-connector-files-1.18.0.jar:1.18.0]
        at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:165) ~[flink-connector-files-1.18.0.jar:1.18.0]
        at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:117) ~[flink-connector-files-1.18.0.jar:1.18.0]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        ... 1 more
2026-04-15 00:14:44,769 INFO  org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Removing registered reader after failure for subtask 5 (#0) of source Source: dwd_qixiao_feature_snapshot_cover_5min_v7[1].

Tracing back to the first failed subtask, its TaskManager logs are shown below:

Failed taskmanager

2026-04-15 00:14:42,649 ERROR org.apache.fluss.client.table.scanner.log.LogFetcher         [] - Failed to fetch log from node 2 for bucket TableBucket{tableId=53, partitionId=525, bucket=50}
org.apache.fluss.exception.LogOffsetOutOfRangeException: The log offset is out of range.
2026-04-15 00:14:42,653 ERROR org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager [] - Received uncaught exception.
java.lang.RuntimeException: SplitFetcher thread 0 received unexpected exception while polling the records
        at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:168) ~[flink-connector-files-1.18.0.jar:1.18.0]
        at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:117) [flink-connector-files-1.18.0.jar:1.18.0]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: org.apache.fluss.exception.FetchException: The fetching offset 1546148 is out of range: org.apache.fluss.exception.LogOffsetOutOfRangeException: Received request for offset 1546148 for table bucket TableBucket{tableId=53, partitionId=525, bucket=50}, but we only have log segments from offset 2199904 up to 2799328.
        at org.apache.fluss.client.table.scanner.log.LogFetchCollector.handleInitializeErrors(LogFetchCollector.java:268) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
        at org.apache.fluss.client.table.scanner.log.LogFetchCollector.initialize(LogFetchCollector.java:203) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
        at org.apache.fluss.client.table.scanner.log.LogFetchCollector.collectFetch(LogFetchCollector.java:101) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
        at org.apache.fluss.client.table.scanner.log.LogFetcher.collectFetch(LogFetcher.java:165) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
        at org.apache.fluss.client.table.scanner.log.LogScannerImpl.pollForFetches(LogScannerImpl.java:251) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
        at org.apache.fluss.client.table.scanner.log.LogScannerImpl.poll(LogScannerImpl.java:144) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
        at org.apache.fluss.flink.source.reader.FlinkSourceSplitReader.fetch(FlinkSourceSplitReader.java:174) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
        at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58) ~[flink-connector-files-1.18.0.jar:1.18.0]
        at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:165) ~[flink-connector-files-1.18.0.jar:1.18.0]
        ... 6 more

The splits assigned to the failed subtask are as follows:

[LogSplit{tableBucket=TableBucket{tableId=53, partitionId=573, bucket=2}, partitionName='2026-04-14', startingOffset=5284909, stoppingOffset=-9223372036854775808},
 LogSplit{tableBucket=TableBucket{tableId=53, partitionId=541, bucket=34}, partitionName='2026-04-10', startingOffset=4901521, stoppingOffset=-9223372036854775808},
 LogSplit{tableBucket=TableBucket{tableId=53, partitionId=525, bucket=50}, partitionName='2026-04-07', startingOffset=1546148, stoppingOffset=-9223372036854775808},
 LogSplit{tableBucket=TableBucket{tableId=53, partitionId=565, bucket=10}, partitionName='2026-04-13', startingOffset=4973472, stoppingOffset=-9223372036854775808},
 LogSplit{tableBucket=TableBucket{tableId=53, partitionId=533, bucket=42}, partitionName='2026-04-09', startingOffset=4694976, stoppingOffset=-9223372036854775808},
 LogSplit{tableBucket=TableBucket{tableId=53, partitionId=549, bucket=26}, partitionName='2026-04-11', startingOffset=5815237, stoppingOffset=-9223372036854775808},
 LogSplit{tableBucket=TableBucket{tableId=53, partitionId=526, bucket=19}, partitionName='2026-04-08', startingOffset=4835470, stoppingOffset=-9223372036854775808},
 LogSplit{tableBucket=TableBucket{tableId=53, partitionId=557, bucket=18}, partitionName='2026-04-12', startingOffset=5765776, stoppingOffset=-9223372036854775808}]

I then listed the consuming progress of the above splits. The cluster-side offsets were retrieved from the Fluss cluster's Prometheus dashboard with a PromQL query such as `fluss_tabletserver_table_bucket_log_endOffset{table="xxxx", bucket="50", partition="2026_04_07"}`:

| partition | partition-id | bucket | subtask offset | cluster-side bucket offset |
| --- | --- | --- | --- | --- |
| 2026_04_07 | 525 | 50 | 1546148 | 2799328 |
| 2026_04_08 | 526 | 19 | 4835470 | 4835470 |
| 2026_04_09 | 533 | 42 | 4694976 | 4694976 |
| 2026_04_10 | 541 | 34 | 4901521 | 4901521 |
| 2026_04_11 | 549 | 26 | 5815237 | 5815237 |
| 2026_04_12 | 557 | 18 | 5765776 | 5765776 |
| 2026_04_13 | 565 | 10 | 4973472 | 4973472 |
| 2026_04_14 | | | | |
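For reference, the same PromQL query can be issued programmatically against the Prometheus HTTP API (`/api/v1/query`). This is a minimal sketch; the Prometheus base URL is an assumption and must be replaced with your cluster's endpoint:

```python
from urllib.parse import urlencode

# Hypothetical Prometheus endpoint -- replace with your cluster's address.
PROM_URL = "http://prometheus:9090"

def end_offset_query_url(table: str, bucket: int, partition: str) -> str:
    """Build a Prometheus HTTP API URL querying the per-bucket log end offset."""
    promql = (
        "fluss_tabletserver_table_bucket_log_endOffset"
        f'{{table="{table}", bucket="{bucket}", partition="{partition}"}}'
    )
    return f"{PROM_URL}/api/v1/query?" + urlencode({"query": promql})

# URL for the stuck bucket from the table above.
print(end_offset_query_url("xxxx", 50, "2026_04_07"))
```

Fetching that URL (e.g. with `curl`) returns the metric value in the standard Prometheus JSON response.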

It appears that consumption of 2026_04_07 (partition=525, bucket=50) is stuck, which is unexpected. The Flink job had been running for about a week and had already caught up to the latest offset, yet consumption of this one bucket silently stalled and the job did not fail fast, which is concerning.

Unfortunately, I cannot determine the root cause of the consumption hang due to missing logs from 2026_04_07.
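The out-of-range condition itself is easy to state from the error message: the restored fetch offset (1546148) falls below the earliest offset the server still retains (2199904), presumably because older log segments were deleted (e.g. by retention/TTL) while the bucket's consumption was stuck. A minimal sketch of that check, with hypothetical names that are not the Fluss API:

```python
def check_fetch_offset(fetch_offset: int, log_start: int, log_end: int) -> str:
    """Classify a restored fetch offset against the server's retained log range."""
    if fetch_offset < log_start:
        # Segments before log_start no longer exist on the server, so the
        # checkpointed offset can never be served again.
        return "out-of-range: below earliest retained offset"
    if fetch_offset > log_end:
        return "out-of-range: beyond log end offset"
    return "ok"

# Values from the failed bucket TableBucket{tableId=53, partitionId=525, bucket=50}:
print(check_fetch_offset(1546148, 2199904, 2799328))
# -> out-of-range: below earliest retained offset
```

Under this condition a restore from the checkpointed offset can never succeed, so the question is whether the client should fail fast earlier (when the bucket first stalls) rather than only at restore time.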

Solution

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
