Search before asking
Fluss version
0.9.0 (latest release)
Please describe the bug 🐞
This job is a long-running backfill task. After enabling checkpointing, it had been running normally for about a week. However, after a restart, the job failed to start successfully. The error indicates that one of the subtasks was unable to subscribe to a bucket. The detailed error is described below.
Failed Flink app restored from the checkpoint
JobManager
2026-04-15 00:14:44,708 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: dwd_qixiao_feature_snapshot_cover_5min_v7[1] -> Calc[2] -> TableToDataSteam (188/240) (488ed58bf3787820d1f4604276a8c07d_cbc357ccb763df2852fee8c4fc7d55f2_187_0) switched from INITIALIZING to RUNNING.
2026-04-15 00:14:44,721 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: dwd_qixiao_feature_snapshot_cover_5min_v7[1] -> Calc[2] -> TableToDataSteam (6/240) (488ed58bf3787820d1f4604276a8c07d_cbc357ccb763df2852fee8c4fc7d55f2_5_0) switched from RUNNING to FAILED on container_e83_1769589104014_23426144_01_000016 @ node77-73-228-bddx.qiyi.hadoop (dataPort=37209).
java.lang.RuntimeException: One or more fetchers have encountered exception
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager.checkErrors(SplitFetcherManager.java:263) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.SourceReaderBase.getNextFetch(SourceReaderBase.java:185) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:147) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:419) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:562) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:858) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:807) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:971) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:950) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:764) ~[flink-dist-1.18.0.jar:1.18.0]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) ~[flink-dist-1.18.0.jar:1.18.0]
at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: java.lang.RuntimeException: SplitFetcher thread 0 received unexpected exception while polling the records
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:168) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:117) ~[flink-connector-files-1.18.0.jar:1.18.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
... 1 more
Caused by: org.apache.fluss.exception.FetchException: The fetching offset 1546148 is out of range: org.apache.fluss.exception.LogOffsetOutOfRangeException: Received request for offset 1546148 for table bucket TableBucket{tableId=53, partitionId=525, bucket=50}, but we only have log segments from offset 2199904 up to 2799328.
at org.apache.fluss.client.table.scanner.log.LogFetchCollector.handleInitializeErrors(LogFetchCollector.java:268) ~[?:?]
at org.apache.fluss.client.table.scanner.log.LogFetchCollector.initialize(LogFetchCollector.java:203) ~[?:?]
at org.apache.fluss.client.table.scanner.log.LogFetchCollector.collectFetch(LogFetchCollector.java:101) ~[?:?]
at org.apache.fluss.client.table.scanner.log.LogFetcher.collectFetch(LogFetcher.java:165) ~[?:?]
at org.apache.fluss.client.table.scanner.log.LogScannerImpl.pollForFetches(LogScannerImpl.java:251) ~[?:?]
at org.apache.fluss.client.table.scanner.log.LogScannerImpl.poll(LogScannerImpl.java:144) ~[?:?]
at org.apache.fluss.flink.source.reader.FlinkSourceSplitReader.fetch(FlinkSourceSplitReader.java:174) ~[?:?]
at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:165) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:117) ~[flink-connector-files-1.18.0.jar:1.18.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
... 1 more
2026-04-15 00:14:44,769 INFO org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Removing registered reader after failure for subtask 5 (#0) of source Source: dwd_qixiao_feature_snapshot_cover_5min_v7[1].
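For context, the out-of-range condition reported in the trace above amounts to a simple range check: a fetch offset must fall inside the currently retained log segment range. The following is a simplified sketch in Python (the function and exception names are my own, not Fluss's actual implementation):

```python
# Simplified sketch (hypothetical names, not Fluss's actual code) of the
# range check behind LogOffsetOutOfRangeException: a fetch offset must lie
# inside the retained segment range [log_start_offset, log_end_offset].
class LogOffsetOutOfRangeError(Exception):
    pass

def check_fetch_offset(fetch_offset, log_start_offset, log_end_offset):
    if fetch_offset < log_start_offset or fetch_offset > log_end_offset:
        raise LogOffsetOutOfRangeError(
            f"Received request for offset {fetch_offset}, but we only have "
            f"log segments from offset {log_start_offset} up to {log_end_offset}"
        )

# Values from the trace: the checkpointed offset 1546148 is *below* the
# retained range [2199904, 2799328], so the fetch is rejected. The older
# segments were presumably removed (e.g. by log retention/TTL) between the
# checkpoint and the restore.
```

Note that the restored offset is below the log start offset, not beyond the end, which points at segment deletion on the server side rather than the subtask running ahead.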
Tracing back to the first failed subtask, its logs show the following.
Failed TaskManager
2026-04-15 00:14:42,649 ERROR org.apache.fluss.client.table.scanner.log.LogFetcher [] - Failed to fetch log from node 2 for bucket TableBucket{tableId=53, partitionId=525, bucket=50}
org.apache.fluss.exception.LogOffsetOutOfRangeException: The log offset is out of range.
2026-04-15 00:14:42,653 ERROR org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager [] - Received uncaught exception.
java.lang.RuntimeException: SplitFetcher thread 0 received unexpected exception while polling the records
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:168) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.run(SplitFetcher.java:117) [flink-connector-files-1.18.0.jar:1.18.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: org.apache.fluss.exception.FetchException: The fetching offset 1546148 is out of range: org.apache.fluss.exception.LogOffsetOutOfRangeException: Received request for offset 1546148 for table bucket TableBucket{tableId=53, partitionId=525, bucket=50}, but we only have log segments from offset 2199904 up to 2799328.
at org.apache.fluss.client.table.scanner.log.LogFetchCollector.handleInitializeErrors(LogFetchCollector.java:268) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.fluss.client.table.scanner.log.LogFetchCollector.initialize(LogFetchCollector.java:203) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.fluss.client.table.scanner.log.LogFetchCollector.collectFetch(LogFetchCollector.java:101) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.fluss.client.table.scanner.log.LogFetcher.collectFetch(LogFetcher.java:165) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.fluss.client.table.scanner.log.LogScannerImpl.pollForFetches(LogScannerImpl.java:251) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.fluss.client.table.scanner.log.LogScannerImpl.poll(LogScannerImpl.java:144) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.fluss.flink.source.reader.FlinkSourceSplitReader.fetch(FlinkSourceSplitReader.java:174) ~[blob_p-9d65ea063382535f6612c858630d275bd68df51f-fd80e1bde6ebe28e6bdabfb655051fb3:?]
at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:58) ~[flink-connector-files-1.18.0.jar:1.18.0]
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:165) ~[flink-connector-files-1.18.0.jar:1.18.0]
... 6 more
The failed subtask's assigned splits are as follows:
[LogSplit{tableBucket=TableBucket{tableId=53, partitionId=573, bucket=2}, partitionName='2026-04-14', startingOffset=5284909, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=541, bucket=34}, partitionName='2026-04-10', startingOffset=4901521, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=525, bucket=50}, partitionName='2026-04-07', startingOffset=1546148, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=565, bucket=10}, partitionName='2026-04-13', startingOffset=4973472, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=533, bucket=42}, partitionName='2026-04-09', startingOffset=4694976, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=549, bucket=26}, partitionName='2026-04-11', startingOffset=5815237, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=526, bucket=19}, partitionName='2026-04-08', startingOffset=4835470, stoppingOffset=-9223372036854775808},
LogSplit{tableBucket=TableBucket{tableId=53, partitionId=557, bucket=18}, partitionName='2026-04-12', startingOffset=5765776, stoppingOffset=-9223372036854775808}]
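For readability, here is a small helper (my own, purely for this report) that parses the LogSplit strings above. One detail worth noting: stoppingOffset=-9223372036854775808 is Java's Long.MIN_VALUE, which appears to act as a sentinel for "no stopping offset" (an unbounded log split):

```python
import re

# Helper (hypothetical, for this report only) to parse the LogSplit strings.
# Note: stoppingOffset=-9223372036854775808 is Java's Long.MIN_VALUE, which
# appears to be the sentinel for "no stopping offset".
SPLIT_RE = re.compile(
    r"tableId=(\d+), partitionId=(\d+), bucket=(\d+)}, "
    r"partitionName='([\d-]+)', startingOffset=(\d+), stoppingOffset=(-?\d+)"
)

def parse_split(text):
    m = SPLIT_RE.search(text)
    table_id, partition_id, bucket, name, start, stop = m.groups()
    return {
        "tableId": int(table_id),
        "partitionId": int(partition_id),
        "bucket": int(bucket),
        "partitionName": name,
        "startingOffset": int(start),
        "stoppingOffset": int(stop),
    }

# The split that fails on restore:
split = parse_split(
    "LogSplit{tableBucket=TableBucket{tableId=53, partitionId=525, bucket=50}, "
    "partitionName='2026-04-07', startingOffset=1546148, "
    "stoppingOffset=-9223372036854775808}"
)
```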
To check the consuming progress of the above splits, I compared the checkpointed subtask offsets against the cluster-side end offsets retrieved from the Fluss cluster's Prometheus dashboard, using PromQL queries such as fluss_tabletserver_table_bucket_log_endOffset{table="xxxx", bucket="50", partition="2026_04_07"}:
| partition | partition-id | bucket | Subtask offset | Cluster-side bucket offset |
| --- | --- | --- | --- | --- |
| 2026_04_07 | 525 | 50 | 1546148 | 2799328 |
| 2026_04_08 | 526 | 19 | 4835470 | 4835470 |
| 2026_04_09 | 533 | 42 | 4694976 | 4694976 |
| 2026_04_10 | 541 | 34 | 4901521 | 4901521 |
| 2026_04_11 | 549 | 26 | 5815237 | 5815237 |
| 2026_04_12 | 557 | 18 | 5765776 | 5765776 |
| 2026_04_13 | 565 | 10 | 4973472 | 4973472 |
| 2026_04_14 | | | | |
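The comparison in the table can be sketched as a small script (offsets copied from this report; the dictionary layout is my own):

```python
# Flag buckets whose checkpointed subtask offset lags behind the
# cluster-side log end offset. All numbers are taken from this report;
# the 2026_04_14 row is omitted because its offsets were not captured.
subtask_offsets = {
    ("2026_04_07", 525, 50): 1546148,
    ("2026_04_08", 526, 19): 4835470,
    ("2026_04_09", 533, 42): 4694976,
    ("2026_04_10", 541, 34): 4901521,
    ("2026_04_11", 549, 26): 5815237,
    ("2026_04_12", 557, 18): 5765776,
    ("2026_04_13", 565, 10): 4973472,
}
cluster_end_offsets = {
    ("2026_04_07", 525, 50): 2799328,
    ("2026_04_08", 526, 19): 4835470,
    ("2026_04_09", 533, 42): 4694976,
    ("2026_04_10", 541, 34): 4901521,
    ("2026_04_11", 549, 26): 5815237,
    ("2026_04_12", 557, 18): 5765776,
    ("2026_04_13", 565, 10): 4973472,
}

# Map each lagging bucket to (subtask offset, cluster end offset).
lagging = {
    key: (sub, cluster_end_offsets[key])
    for key, sub in subtask_offsets.items()
    if cluster_end_offsets[key] > sub
}
```

Running this leaves only the 2026_04_07 / bucket 50 entry in `lagging`: every other bucket had caught up to the cluster-side end offset.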
It appears that consumption of 2026_04_07 (partitionId=525, bucket=50) was stuck, which is unexpected: the Flink job had been running for about a week and had already caught up to the latest offset on every other bucket. Yet consumption of this one bucket silently stalled, and the job did not fail fast, which is concerning.
Unfortunately, I cannot determine the root cause of the consumption hang because the logs from 2026_04_07 are no longer available.
Solution
No response
Are you willing to submit a PR?