Fix Kafka metadata for sparse partition recovery #18687
Pull request overview
This PR fixes Kafka stream partition metadata computation for realtime validation/recovery when Pinot’s current LLC consumption status list is sparse (missing partition ids). The Kafka 3.0 and 4.0 stream metadata providers now enumerate actual Kafka partition ids, preserving existing offsets where present and fetching offsets from Kafka for partitions missing from Pinot metadata.
Changes:
- Update Kafka 3.0/4.0 `computePartitionGroupMetadata()` to build partition metadata from actual Kafka partition ids (not `currentStatuses.size()`).
- Preserve existing end offsets for known partitions while fetching stream offsets for missing partitions.
- Add regression tests for sparse current statuses (Kafka 3.0 and Kafka 4.0), plus partition-level integration coverage.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pinot-plugins/pinot-stream-ingestion/pinot-kafka-4.0/src/main/java/org/apache/pinot/plugin/stream/kafka40/KafkaStreamMetadataProvider.java | Fix metadata computation to enumerate real Kafka partition ids and recover missing partitions correctly. |
| pinot-plugins/pinot-stream-ingestion/pinot-kafka-3.0/src/main/java/org/apache/pinot/plugin/stream/kafka30/KafkaStreamMetadataProvider.java | Same fix as Kafka 4.0 provider for sparse-status recovery. |
| pinot-plugins/pinot-stream-ingestion/pinot-kafka-4.0/src/test/java/org/apache/pinot/plugin/stream/kafka40/KafkaStreamMetadataProviderTest.java | New unit test covering sparse current-status recovery and offset selection. |
| pinot-plugins/pinot-stream-ingestion/pinot-kafka-3.0/src/test/java/org/apache/pinot/plugin/stream/kafka30/KafkaStreamMetadataProviderTest.java | New unit test mirroring Kafka 4.0 sparse-status recovery coverage. |
| pinot-plugins/pinot-stream-ingestion/pinot-kafka-4.0/src/test/java/org/apache/pinot/plugin/stream/kafka40/KafkaPartitionLevelConsumerTest.java | Add embedded/provider regression ensuring partition ids come from Kafka, not list size. |
| pinot-plugins/pinot-stream-ingestion/pinot-kafka-3.0/src/test/java/org/apache/pinot/plugin/stream/kafka30/KafkaPartitionLevelConsumerTest.java | Same regression coverage for Kafka 3.0 provider. |
| pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/realtime/PinotLLCRealtimeSegmentManager.java | Update comment clarifying the forceGetOffsetFromStream behavior for offset fetching. |
f92c924 to a7ea04e
Review notes
Traced the change end-to-end (Kafka providers →
Why the fix is correct
The empty-subset branch used to delegate to the SPI default, which enumerates new partitions with
The new code keys current statuses by
One precision note on impact: the manifestation is silent non-recovery, not an NPE. The offset-repair path is guarded by
Parity / safety verified
Minor follow-ups (non-blocking)
Tests
Follow-up status:
All review threads are resolved/outdated after the latest push.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #18687 +/- ##
============================================
+ Coverage 64.48% 64.50% +0.01%
Complexity 1291 1291
============================================
Files 3371 3372 +1
Lines 208552 208604 +52
Branches 32570 32577 +7
============================================
+ Hits 134483 134550 +67
+ Misses 63273 63254 -19
- Partials 10796 10800 +4
> Pinot only has statuses for 0,1,3,4,5,6,7, the old SPI default metadata computation returned [0,1,3,4,5,6,7,7]. Partition 2 never reached the validation manager repair loop
Is this based on a real issue? The logic in RVM applies to Kafka when new partitions are added upstream, e.g. the partition count going from 8 -> 16. In Kafka, an intermittently missing partition is unlikely.
Yes, this is based on the real issue we saw, but the missing partition is not missing from Kafka. Kafka still has the partitions. The sparse state is in Pinot LLC metadata: for some existing Kafka partition ids, Pinot no longer has consuming/online segments or latest LLC metadata in ZK.

In that state, RVM should be able to recreate consuming segments because the existing validation flow sets up partition groups that exist in stream metadata but are missing in Pinot metadata. The bug is that Kafka's no-subset metadata path used the SPI default, which derives new ids from

This PR does not assume Kafka partitions disappear intermittently. It asks Kafka for the actual live partition ids and gives RVM correct metadata for partitions that still exist in Kafka but are missing from Pinot metadata.
a7ea04e to 56d16e5
Added a focused Kafka provider regression for the normal topic expansion case in
The new test covers Kafka reporting partitions
Summary
User Manual
No table config change is required for Kafka realtime tables.
After upgrading the controller and Kafka stream plugin code, wait for the scheduled `RealtimeSegmentValidationManager` run or trigger realtime validation through the existing operational path. If Kafka still has the missing partitions, Pinot can recreate missing consuming segments for partitions that exist in Kafka but no longer have consuming/online segments or latest LLC ZK metadata.

Recovered partitions start from the offset selected by validation for new partition repair, typically the smallest currently available Kafka offset. Data older than Kafka retention cannot be recovered by this repair.
Sample Table Config
Existing Kafka realtime configs continue to work. No new config key is required.
{
  "tableName": "asset",
  "tableType": "REALTIME",
  "ingestionConfig": {
    "streamIngestionConfig": {
      "streamConfigMaps": [
        {
          "streamType": "kafka",
          "stream.kafka.topic.name": "asset",
          "stream.kafka.broker.list": "broker-1:9092,broker-2:9092",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder"
        }
      ]
    }
  }
}

How RealtimeSegmentValidationManager Gets Fixed
`RealtimeSegmentValidationManager` already has the repair flow needed to recreate missing LLC consuming segments. During realtime validation, it asks `PinotLLCRealtimeSegmentManager` to compute stream partition group metadata and then runs the existing "set up new partitions if not exist" path.

The bug was that Kafka's no-partition-subset metadata provider did not return the missing partition ids when Pinot's current LLC status list was sparse. For example, if Kafka has partitions `0..7` but Pinot only has statuses for `0,1,3,4,5,6,7`, the old SPI default metadata computation returned `[0,1,3,4,5,6,7,7]`. Partition `2` never reached the validation manager repair loop, so `setupNewPartitionGroup()` was never called for it.

This PR fixes the Kafka 3.0 and Kafka 4.0 metadata providers so validation receives metadata keyed by actual Kafka partition ids:
- Current statuses are keyed by `getStreamPartitionGroupId()` instead of by list position.
- The existing `endOffset` is preserved for partitions Pinot still knows about.

After this change, the same sparse example produces `[0,1,2,3,4,5,6,7]`. On the next scheduled or manually triggered `RealtimeSegmentValidationManager` run, the existing repair loop sees the missing ids and recreates consuming segments for them, assuming those partition ids still exist in Kafka.

Operationally, this means the validation manager is fixed by giving it correct Kafka partition metadata. No new validation-manager config or manual ZK metadata surgery is required. Recreated partitions can only start from offsets still available under Kafka retention, so historical data older than retention remains unrecoverable.
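
The merge-and-repair flow described above can be sketched roughly as follows. This is a simplified illustration, not the provider code: `SparseRecoverySketch`, `mergeByPartitionId`, and `idsNeedingSetup` are hypothetical names, the offsets are made-up values, and the Kafka offset lookup is stubbed with a map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch; none of these names come from the Pinot code base.
class SparseRecoverySketch {

  // Build partition id -> start offset, iterating over the ids Kafka actually reports.
  static SortedMap<Integer, Long> mergeByPartitionId(Set<Integer> kafkaPartitionIds,
      Map<Integer, Long> pinotEndOffsets, Map<Integer, Long> streamOffsets) {
    SortedMap<Integer, Long> merged = new TreeMap<>();
    for (Integer id : kafkaPartitionIds) {
      Long known = pinotEndOffsets.get(id);
      // Keep the offset Pinot already tracks; otherwise use the (stubbed) stream offset.
      merged.put(id, known != null ? known : streamOffsets.getOrDefault(id, 0L));
    }
    return merged;
  }

  // Ids present in Kafka but absent from Pinot metadata: candidates for the repair loop.
  static List<Integer> idsNeedingSetup(Set<Integer> kafkaPartitionIds, Set<Integer> pinotIds) {
    List<Integer> missing = new ArrayList<>();
    for (Integer id : new TreeSet<>(kafkaPartitionIds)) {
      if (!pinotIds.contains(id)) {
        missing.add(id);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    Set<Integer> kafkaIds = new TreeSet<>(List.of(0, 1, 2, 3, 4, 5, 6, 7));
    Map<Integer, Long> pinotEndOffsets = new HashMap<>();
    for (int id : new int[] {0, 1, 3, 4, 5, 6, 7}) {
      pinotEndOffsets.put(id, 100L); // illustrative end offset
    }
    Map<Integer, Long> streamOffsets = Map.of(2, 42L); // illustrative offset fetched from Kafka

    System.out.println(mergeByPartitionId(kafkaIds, pinotEndOffsets, streamOffsets).keySet());
    System.out.println(idsNeedingSetup(kafkaIds, pinotEndOffsets.keySet()));
  }
}
```

With the values above, `main` prints `[0, 1, 2, 3, 4, 5, 6, 7]` followed by `[2]`: partition `2` appears in the metadata and is the only id left for the repair loop to set up.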
Why This Fixes The Issue
Before this change, the Kafka no-partition-subset path delegated to the SPI default implementation. That implementation first copied current Pinot statuses and then added "new" partition ids using:
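
A minimal sketch of that behavior, assuming the loop for "new" ids starts at the size of the current-status list (the overview refers to `currentStatuses.size()`). `LegacyDefaultSketch` and `legacyPartitionIds` are hypothetical names, partition metadata is reduced to bare ints, and this is not the SPI source:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the old behavior; not the actual SPI implementation.
class LegacyDefaultSketch {

  // Copy the ids Pinot already tracks, then number "new" partitions
  // starting at the size of the status list (assumed starting point).
  static List<Integer> legacyPartitionIds(List<Integer> currentStatusIds, int kafkaPartitionCount) {
    List<Integer> result = new ArrayList<>(currentStatusIds);
    for (int i = currentStatusIds.size(); i < kafkaPartitionCount; i++) {
      result.add(i);
    }
    return result;
  }

  public static void main(String[] args) {
    // Sparse statuses: seven entries, so the loop starts at 7 and appends 7 again.
    System.out.println(legacyPartitionIds(List.of(0, 1, 3, 4, 5, 6, 7), 8));
    // Dense statuses during topic expansion behave as intended.
    System.out.println(legacyPartitionIds(List.of(0, 1, 2, 3), 8));
  }
}
```

The first call prints `[0, 1, 3, 4, 5, 6, 7, 7]` and the second prints `[0, 1, 2, 3, 4, 5, 6, 7]`, which is why the position-based numbering only misbehaves when the status list has gaps.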
For sparse statuses, for example Kafka partitions `0..7` with Pinot statuses for `0,1,3,4,5,6,7`, this produced `[0,1,3,4,5,6,7,7]`: partition `2` stayed missing and partition `7` was duplicated in metadata.

The Kafka providers now fetch actual Kafka partition ids and key current statuses by stream partition id, so the metadata list becomes `[0,1,2,3,4,5,6,7]` and realtime validation can call `setupNewPartitionGroup()` for the missing id. The same path also preserves the normal topic expansion behavior: if Kafka expands from `0..3` to `0..7`, statuses for `0..3` reuse their existing end offsets and new partitions `4..7` fetch start offsets from Kafka.

Test Plan
- `./mvnw spotless:apply checkstyle:check license:format license:check -pl pinot-controller,pinot-plugins/pinot-stream-ingestion/pinot-kafka-3.0,pinot-plugins/pinot-stream-ingestion/pinot-kafka-4.0`
- `./mvnw checkstyle:check -pl pinot-controller`
- `./mvnw -pl pinot-plugins/pinot-stream-ingestion/pinot-kafka-3.0,pinot-plugins/pinot-stream-ingestion/pinot-kafka-4.0 -am -Dtest=KafkaStreamMetadataProviderTest -Dsurefire.failIfNoSpecifiedTests=false test`
- `./mvnw -pl pinot-plugins/pinot-stream-ingestion/pinot-kafka-3.0 -Dtest=KafkaPartitionLevelConsumerTest#testComputePartitionGroupMetadataUsesKafkaPartitionIds -Dsurefire.failIfNoSpecifiedTests=false test`

Known local limitation: running the Kafka 4.0 embedded `KafkaPartitionLevelConsumerTest` in this workspace is blocked by missing Docker/Testcontainers support (`/var/run/docker.sock` not found). The non-Docker Kafka 4.0 provider regression passes.