Add start and end offset for each source topic partition in iceberg snapshot as a property by kumarpritam863 · Pull Request #15206 · apache/iceberg

kumarpritam863 · 2026-02-01T11:01:21Z

Summary

This PR enhances the Kafka Connect sink to track offset ranges (start and end offsets) for each topic partition, providing better traceability of which source data was included in each Iceberg snapshot.

Motivation

Previously, the connector only tracked a single offset value per partition, which represented the next offset to consume. This made it difficult to determine the exact range of data that was committed in a particular snapshot. By tracking both start and end offsets, we can now:

Precisely identify the range of Kafka messages included in each Iceberg snapshot
Improve debugging and data lineage capabilities
Enable better auditing and compliance tracking
Facilitate data recovery and reprocessing scenarios
Allow Flink jobs running in hybrid mode (iceberg -> kafka) to switch over at the exact topic partition offsets

Changes

Core Data Models

TopicPartitionOffset class:

Changed from single offset field to startOffset and endOffset fields
Updated Iceberg schema to include both offset fields
Field IDs: START_OFFSET = 10_702, END_OFFSET = 10_703, TIMESTAMP = 10_704
Added getter methods: startOffset() and endOffset()

Offset class:

Modified to track both startOffset and endOffset instead of a single offset
Added a backward-compatible offset() method that returns endOffset
Updated constructor signature: Offset(Long startOffset, Long endOffset, OffsetDateTime timestamp)
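
As an illustration of the shape described above, here is a minimal sketch of a value class that carries both offsets and keeps a single-offset accessor for older callers. The name OffsetSketch and everything beyond the three constructor arguments and the offset() accessor are assumptions for this example, not the PR's actual code.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

// Hypothetical stand-in for the connector's Offset class; not the PR's code.
final class OffsetSketch {
  private final Long startOffset; // first offset consumed in this commit cycle
  private final Long endOffset; // next offset to consume (last consumed + 1)
  private final OffsetDateTime timestamp;

  OffsetSketch(Long startOffset, Long endOffset, OffsetDateTime timestamp) {
    this.startOffset = startOffset;
    this.endOffset = endOffset;
    this.timestamp = timestamp;
  }

  Long startOffset() {
    return startOffset;
  }

  Long endOffset() {
    return endOffset;
  }

  OffsetDateTime timestamp() {
    return timestamp;
  }

  // Backward-compatible accessor: callers that only knew a single offset
  // still receive the "next offset to consume" value.
  Long offset() {
    return endOffset;
  }

  public static void main(String[] args) {
    OffsetDateTime ts = OffsetDateTime.of(2024, 1, 15, 10, 30, 0, 0, ZoneOffset.UTC);
    OffsetSketch o = new OffsetSketch(100L, 250L, ts);
    System.out.println(o.startOffset() + " .. " + o.offset());
  }
}
```

Returning endOffset from offset() keeps the old meaning (the next offset to consume), so code that commits consumer offsets would not need to change under this assumption.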

Offset Tracking

SinkWriter class:

Enhanced save() method to track offset ranges per partition
For the first record in a partition: startOffset = currentOffset, endOffset = currentOffset + 1
For subsequent records: preserves original startOffset, updates endOffset = currentOffset + 1
Ensures accurate range tracking across multiple records in the same commit cycle
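
The update rule above can be sketched with Map.compute(), which the review summary mentions as the mechanism used to preserve startOffset. The class OffsetRangeTracker, its string key, and the long[] pair are hypothetical simplifications chosen to keep the example self-contained; the real SinkWriter works with the connector's own types.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tracker illustrating the range-update rule; not the PR's SinkWriter.
final class OffsetRangeTracker {
  // Key is "topic-partition" (a simplification); value is {startOffset, endOffset}.
  private final Map<String, long[]> ranges = new HashMap<>();

  void record(String topic, int partition, long currentOffset) {
    ranges.compute(
        topic + "-" + partition,
        (key, existing) ->
            existing == null
                // First record seen for this partition in the commit cycle.
                ? new long[] {currentOffset, currentOffset + 1}
                // Later records keep the original start and advance the end.
                : new long[] {existing[0], currentOffset + 1});
  }

  // Returns {startOffset, endOffset}, or null if nothing was recorded.
  long[] range(String topic, int partition) {
    return ranges.get(topic + "-" + partition);
  }

  public static void main(String[] args) {
    OffsetRangeTracker tracker = new OffsetRangeTracker();
    for (long offset = 100; offset < 250; offset++) {
      tracker.record("events", 0, offset);
    }
    long[] r = tracker.range("events", 0);
    System.out.println(r[0] + " .. " + r[1]); // prints "100 .. 250"
  }
}
```

Because endOffset is always the last consumed offset plus one, the tracked range is half-open: it covers startOffset up to, but not including, endOffset.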

Worker class:

Updated to pass both startOffset and endOffset when creating TopicPartitionOffset objects

Snapshot Metadata

CommitState class:

Added topicPartitionOffsets() method to extract all topic partition offsets from the ready buffer
Returns complete offset information for the current commit

Coordinator class:

Added new snapshot property: kafka.connect.topic-partition-offsets
Implemented topicPartitionOffsetsToJson() to serialize offset ranges to JSON
JSON format includes: topic, partition, startOffset, endOffset, and timestamp for each partition
Updated commitToTable() to store topic partition offsets in both append and delta operations
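
A rough sketch of the serialization step follows. It builds the JSON by hand to stay dependency-free; the PR's topicPartitionOffsetsToJson() may well use a JSON library instead. The names OffsetJsonSketch and Range are hypothetical, and the code assumes topic names need no JSON escaping (Kafka restricts them to ASCII alphanumerics, '.', '_' and '-') and that the timestamp is already an ISO-8601 string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical serializer; not the PR's Coordinator code.
final class OffsetJsonSketch {
  static final class Range {
    final String topic;
    final int partition;
    final long startOffset;
    final long endOffset;
    final String timestamp; // assumed to be ISO-8601 text already

    Range(String topic, int partition, long startOffset, long endOffset, String timestamp) {
      this.topic = topic;
      this.partition = partition;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
      this.timestamp = timestamp;
    }
  }

  // Serializes the ranges as a compact JSON array, one object per partition.
  static String toJson(List<Range> ranges) {
    StringBuilder sb = new StringBuilder("[");
    for (int i = 0; i < ranges.size(); i++) {
      Range r = ranges.get(i);
      if (i > 0) {
        sb.append(',');
      }
      sb.append("{\"topic\":\"").append(r.topic).append("\",")
          .append("\"partition\":").append(r.partition).append(',')
          .append("\"startOffset\":").append(r.startOffset).append(',')
          .append("\"endOffset\":").append(r.endOffset).append(',')
          .append("\"timestamp\":\"").append(r.timestamp).append("\"}");
    }
    return sb.append(']').toString();
  }

  public static void main(String[] args) {
    List<Range> ranges = new ArrayList<>();
    ranges.add(new Range("events", 0, 100L, 250L, "2024-01-15T10:30:00Z"));
    ranges.add(new Range("events", 1, 50L, 175L, "2024-01-15T10:30:05Z"));
    System.out.println(toJson(ranges));
  }
}
```

Iceberg's SnapshotUpdate.set(String, String) accepts only string values, so the serialized array would presumably be stored as a single string under the kafka.connect.topic-partition-offsets key.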

Example Snapshot Metadata

After this change, Iceberg snapshots will include metadata like:

{
  "kafka.connect.topic-partition-offsets": [
    {
      "topic": "events",
      "partition": 0,
      "startOffset": 100,
      "endOffset": 250,
      "timestamp": "2024-01-15T10:30:00Z"
    },
    {
      "topic": "events",
      "partition": 1,
      "startOffset": 50,
      "endOffset": 175,
      "timestamp": "2024-01-15T10:30:05Z"
    }
  ]
}
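
To show how a reader of this metadata might interpret the ranges, here is a small sketch that treats each entry as the half-open interval from startOffset (inclusive) to endOffset (exclusive), which follows from endOffset being the last consumed offset plus one. OffsetRangeMath and its methods are hypothetical helpers, not part of the PR.

```java
// Hypothetical helpers for interpreting an offset range; not part of the PR.
final class OffsetRangeMath {
  // Number of offsets covered by the half-open range [startOffset, endOffset).
  static long span(long startOffset, long endOffset) {
    if (endOffset < startOffset) {
      throw new IllegalArgumentException("endOffset must be >= startOffset");
    }
    return endOffset - startOffset;
  }

  // True if the given Kafka offset falls inside the snapshot's range.
  static boolean contains(long startOffset, long endOffset, long offset) {
    return offset >= startOffset && offset < endOffset;
  }

  public static void main(String[] args) {
    System.out.println(span(100, 250)); // prints 150 (partition 0 in the example)
    System.out.println(span(50, 175)); // prints 125 (partition 1 in the example)
    System.out.println(contains(100, 250, 250)); // prints false: endOffset is exclusive
  }
}
```

The span is an upper bound on the number of records rather than an exact count, since Kafka offsets within a partition are not guaranteed to be contiguous (for example on compacted topics or when transaction markers are present).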

Copilot

Pull request overview

This PR enhances the Kafka Connect sink to track offset ranges (start and end offsets) for each topic partition in Iceberg snapshots, improving data lineage and traceability capabilities.

Changes:

Modified core data models (Offset and TopicPartitionOffset) to track both start and end offsets instead of a single offset value
Updated offset tracking logic in SinkWriter to maintain offset ranges across multiple records in the same commit cycle
Added snapshot metadata property kafka.connect.topic-partition-offsets containing JSON-serialized offset ranges for all partitions

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.

File	Description
kafka-connect/kafka-connect/src/main/java/org/apache/iceberg/connect/data/Offset.java	Changed from single offset to startOffset/endOffset fields with backward compatibility method
kafka-connect/kafka-connect/src/main/java/org/apache/iceberg/connect/data/SinkWriter.java	Enhanced to track offset ranges per partition using compute() to preserve startOffset
kafka-connect/kafka-connect/src/main/java/org/apache/iceberg/connect/channel/Worker.java	Updated to pass both startOffset and endOffset when creating TopicPartitionOffset objects
kafka-connect/kafka-connect/src/main/java/org/apache/iceberg/connect/channel/CommitState.java	Added topicPartitionOffsets() method to extract offset information for current commit
kafka-connect/kafka-connect/src/main/java/org/apache/iceberg/connect/channel/Coordinator.java	Added JSON serialization of offset ranges and storage in snapshot metadata properties
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TopicPartitionOffset.java	Updated schema to include start_offset and end_offset fields with adjusted field IDs
kafka-connect/kafka-connect/src/test/java/org/apache/iceberg/connect/data/TestSinkWriter.java	Updated test assertions to verify both startOffset and endOffset
kafka-connect/kafka-connect/src/test/java/org/apache/iceberg/connect/channel/TestWorker.java	Updated test to use new Offset constructor with startOffset and endOffset
kafka-connect/kafka-connect/src/test/java/org/apache/iceberg/connect/channel/TestCoordinator.java	Updated test to use new TopicPartitionOffset constructor with offset range
kafka-connect/kafka-connect-events/src/test/java/org/apache/iceberg/connect/events/TestEventSerialization.java	Updated test data to include both startOffset and endOffset values

Pritam Kumar Mishra and others added 12 commits August 9, 2025 11:27

added metadat and data path in case of dynamic routing

c0a2665

spotless

67619ec

Revert "spotless"

6b15ae4

This reverts commit 67619ec.

Revert "added metadat and data path in case of dynamic routing"

8398e4c

This reverts commit c0a2665.

Merge branch 'apache:main' into main

fbf52a9

Merge branch 'apache:main' into main

c92ec66

Merge branch 'apache:main' into main

9392a6d

Merge branch 'apache:main' into main

ecd8b55

Merge branch 'apache:main' into main

5e76e04

Merge branch 'apache:main' into main

a1ec7e6

Merge branch 'apache:main' into main

4eaf70b

added start and end offset to table snapshot as property for debuggin…

80853fe

…g and transition purpose

github-actions bot added the KAFKACONNECT label Feb 1, 2026

manuzhang requested a review from Copilot February 1, 2026 13:28

Copilot AI reviewed Feb 1, 2026

kumarpritam863 wants to merge 12 commits into apache:main from kumarpritam863:add_start_end_source_offset_snapshot_prop
