Fix replica task failures with metadata inconsistency while running concurrent append replace #16614

Merged
merged 11 commits on Jun 24, 2024

Conversation

@kfaraz (Contributor) commented Jun 16, 2024

Description

A streaming ingestion with multiple replicas may sometimes run into the following error
when publishing segments to an interval. This can happen only when a concurrent
REPLACE task has recently committed higher-version segments to that same interval.

java.util.concurrent.ExecutionException: org.apache.druid.java.util.common.ISE:
Failed to publish segments because of 
[java.lang.RuntimeException: Inconsistency between stored metadata state[KafkaDataSourceMetadata{xxx}]
and target state[KafkaDataSourceMetadata{yyy}]. Try resetting the supervisor.]

This error does not cause any data loss, but it adds operational overhead since it leads to unnecessary task failures.

The situation typically plays out as follows:

  • Streaming supervisor launches 2 replicas to append to an interval I
  • Both replicas allocate and start appending to segment S(v0, p1), i.e. partition 1 of version v0.
  • Concurrent REPLACE task commits a new version v1 in interval I, say segments S(v1, p0), S(v1, p1)
  • Version v0 is now completely overshadowed by version v1
  • First replica publishes segment S(v0, p1), which is upgraded by the overlord to S(v1, p2)
  • Second replica tries to publish segment S(v0, p1) but fails because higher offsets have already been committed
  • Second replica then checks if its segments have already been published by someone else
  • In the absence of a concurrent replace, this check would succeed, allowing the second replica to move ahead with the ingestion.
  • But since the concurrent replace has already overshadowed version v0, the second replica cannot find its segments in the set of "used and visible" segments, and so it fails (sketched below).
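
For illustration only, the pre-fix verification step can be sketched roughly as below (the class and method names are simplified stand-ins, not the actual Druid APIs): because the lookup consults only used and visible segments, a segment overshadowed by the concurrent replace is never found.

import java.util.Set;

class PublishVerificationSketch
{
  // Pre-fix behaviour (simplified): after a failed publish, check whether another
  // replica has already committed the same segments. Only used-and-visible segments
  // are consulted, so S(v0, p1) is missed once v1 overshadows v0, and the task fails.
  static boolean alreadyPublishedByAnotherReplica(
      Set<String> segmentIdsToPublish,
      Set<String> usedAndVisibleSegmentIds
  )
  {
    // Every segment must be found for the replica to treat the publish as a success.
    return usedAndVisibleSegmentIds.containsAll(segmentIdsToPublish);
  }
}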

Fix

While looking for already published segments, the second replica should search not only used and visible segments, but also overshadowed and unused ones.

Changes

  • Add a new task action RetrieveSegmentsByIdAction
  • Use the new task action to retrieve segments by ID, irrespective of their used or visible status
  • During rolling upgrades, this task action would fail while the Overlord is still on an older version
  • If the new action fails, fall back to fetching only used segments as before (see the sketch after this list)
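
A minimal sketch of this lookup-with-fallback, assuming hypothetical helper names (retrieveSegmentsById, retrieveUsedSegmentsById) rather than the exact methods added in this PR:

import java.util.Set;

class PublishedSegmentLookupSketch
{
  Set<String> findPublishedSegmentIds(Set<String> segmentIds)
  {
    try {
      // Preferred path: the new RetrieveSegmentsByIdAction fetches segments purely
      // by ID, regardless of whether they are used, unused, visible or overshadowed.
      return retrieveSegmentsById(segmentIds);
    }
    catch (Exception e) {
      // During a rolling upgrade, an older Overlord does not recognise the new
      // action, so fall back to the old behaviour of fetching only used segments.
      return retrieveUsedSegmentsById(segmentIds);
    }
  }

  // Hypothetical stand-in: would post the new segmentListById task action.
  private Set<String> retrieveSegmentsById(Set<String> segmentIds)
  {
    return Set.of();
  }

  // Hypothetical stand-in: would post the pre-existing used-segments retrieval.
  private Set<String> retrieveUsedSegmentsById(Set<String> segmentIds)
  {
    return Set.of();
  }
}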

Testing

Setup

  • Local Druid cluster with 3 MiddleManagers (MMs), 3 task slots each
  • Kafka streaming ingestion with 3 task replicas
  • Concurrent compaction enabled with skipOffsetFromLatest = PT0S

Observation
All the replicas finish successfully even when they are not able to publish segments due to a concurrent replace task.

[Screenshot: all replica tasks completed successfully, 2024-06-21]

Replica 1

  • Published 4 segments
  • Upgraded 1 segment kafka_super_1976-04-16T00:00:00.000Z_1976-04-16T01:00:00.000Z_2024-06-21T10:22:46.040Z_1 which overshadowed the published segment kafka_super_1976-04-16T00:00:00.000Z_1976-04-16T01:00:00.000Z_1970-01-01T00:00:00.000Z_1
  • Logs
    2024-06-21T10:23:08,891 INFO [[index_kafka_kafka_super_a6ce1c843bb750d_kbipcfpk]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Push complete...
    2024-06-21T10:23:08,915 INFO [[index_kafka_kafka_super_a6ce1c843bb750d_kbipcfpk]-publish] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Published [4] segments with commit metadata[{nextPartitions=SeekableStreamStartSequenceNumbers{stream='abc', partitionSequenceNumberMap={KafkaTopicPartition{partition=0, topic='null', multiTopicPartition=false}=177}, exclusivePartitions=[]}, publishPartitions=SeekableStreamEndSequenceNumbers{stream='abc', partitionSequenceNumberMap={KafkaTopicPartition{partition=0, topic='null', multiTopicPartition=false}=177}}}].
    2024-06-21T10:23:08,916 INFO [[index_kafka_kafka_super_a6ce1c843bb750d_kbipcfpk]-publish] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Published segments: [kafka_super_1976-04-16T00:00:00.000Z_1976-04-16T01:00:00.000Z_1970-01-01T00:00:00.000Z_1, kafka_super_1976-04-17T00:00:00.000Z_1976-04-17T01:00:00.000Z_1970-01-01T00:00:00.000Z, kafka_super_1976-04-14T00:00:00.000Z_1976-04-14T01:00:00.000Z_1970-01-01T00:00:00.000Z_3, kafka_super_1976-04-15T00:00:00.000Z_1976-04-15T01:00:00.000Z_2024-06-21T10:22:26.059Z_2]
    2024-06-21T10:23:08,916 INFO [[index_kafka_kafka_super_a6ce1c843bb750d_kbipcfpk]-publish] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Published [1] upgraded segments.
    2024-06-21T10:23:08,916 INFO [[index_kafka_kafka_super_a6ce1c843bb750d_kbipcfpk]-publish] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Upgraded segments: [kafka_super_1976-04-16T00:00:00.000Z_1976-04-16T01:00:00.000Z_2024-06-21T10:22:46.040Z_1]

Replica 2

  • Failed to publish the segments
  • But it still found all of its segments already present in the metadata store, including the overshadowed one kafka_super_1976-04-16T00:00:00.000Z_1976-04-16T01:00:00.000Z_1970-01-01T00:00:00.000Z_1.
  • Logs
    2024-06-21T10:23:08,893 INFO [[index_kafka_kafka_super_a6ce1c843bb750d_copcbpef]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Push complete...
    2024-06-21T10:23:08,927 INFO [[index_kafka_kafka_super_a6ce1c843bb750d_copcbpef]-publish] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Could not publish [4] segments, but they have already been published by another task.
    2024-06-21T10:23:08,928 INFO [[index_kafka_kafka_super_a6ce1c843bb750d_copcbpef]-publish] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Could not publish segments: [kafka_super_1976-04-16T00:00:00.000Z_1976-04-16T01:00:00.000Z_1970-01-01T00:00:00.000Z_1, kafka_super_1976-04-17T00:00:00.000Z_1976-04-17T01:00:00.000Z_1970-01-01T00:00:00.000Z, kafka_super_1976-04-14T00:00:00.000Z_1976-04-14T01:00:00.000Z_1970-01-01T00:00:00.000Z_3, kafka_super_1976-04-15T00:00:00.000Z_1976-04-15T01:00:00.000Z_2024-06-21T10:22:26.059Z_2]

Benchmarking

Setup

  • Local Druid cluster with a single datasource containing ~1M used and ~1.2M unused segments
mysql> select used, count(*) from druid_segments group by 1;
+------+----------+
| used | count(*) |
+------+----------+
|    0 |  1215000 |
|    1 |  1000001 |
+------+----------+
2 rows in set (0.33 sec)
  • Using a bash script, posted the new task action segmentListById to fetch 500 segments (250 used, 250 unused)
  • Enabled the logging emitter and noted the task/action/run/time metric

Observation

2024-06-22T07:39:00,349 INFO [qtp2062755811-141] org.apache.druid.java.util.emitter.core.LoggingEmitter - [metrics] {"feed":"metrics","taskType":"noop","metric":"task/action/run/time","service":"druid/coordinator","groupId":"noop_2024-06-22T07:39:00.312Z_02192b28-d721-4654-9f33-2aab4470d8d4","host":"localhost:8081","taskActionType":"segmentListById","version":"31.0.0-SNAPSHOT","value":26,"dataSource":"none","taskId":"noop_2024-06-22T07:39:00.312Z_02192b28-d721-4654-9f33-2aab4470d8d4","timestamp":"2024-06-22T07:39:00.349Z"}
2024-06-22T07:39:15,245 INFO [qtp2062755811-146] org.apache.druid.java.util.emitter.core.LoggingEmitter - [metrics] {"feed":"metrics","taskType":"noop","metric":"task/action/run/time","service":"druid/coordinator","groupId":"noop_2024-06-22T07:39:15.229Z_5d31b801-d874-4888-ae9a-f82666997969","host":"localhost:8081","taskActionType":"segmentListById","version":"31.0.0-SNAPSHOT","value":16,"dataSource":"none","taskId":"noop_2024-06-22T07:39:15.229Z_5d31b801-d874-4888-ae9a-f82666997969","timestamp":"2024-06-22T07:39:15.245Z"}
2024-06-22T07:39:17,037 INFO [qtp2062755811-144] org.apache.druid.java.util.emitter.core.LoggingEmitter - [metrics] {"feed":"metrics","taskType":"noop","metric":"task/action/run/time","service":"druid/coordinator","groupId":"noop_2024-06-22T07:39:17.019Z_4c12a954-c96a-4cc6-886c-e42d9261a4a9","host":"localhost:8081","taskActionType":"segmentListById","version":"31.0.0-SNAPSHOT","value":17,"dataSource":"none","taskId":"noop_2024-06-22T07:39:17.019Z_4c12a954-c96a-4cc6-886c-e42d9261a4a9","timestamp":"2024-06-22T07:39:17.037Z"}

Run times for the three invocations were 26 ms, 16 ms, and 17 ms.

@AmatyaAvadhanula (Contributor) left a comment:

Have a few questions about the approach

@kfaraz (Contributor, Author) commented Jun 20, 2024

@AmatyaAvadhanula , I have added a new task action to fetch segments by ID. I have also reverted all the refactoring changes so that the PR is easier to review. The refactoring changes can be made later.

@AmatyaAvadhanula (Contributor) commented:

The overall approach looks good to me. However, could you please add benchmarks for task/action/run/time while fetching 500 segments (including both used and unused) from the metadata store for the new action with 1M+ used segments and 2-3M+ unused segments?

@AmatyaAvadhanula (Contributor) commented:

Could you please also add details about any cluster testing that has been done with this patch?

@kfaraz (Contributor, Author) commented Jun 21, 2024

Could you please also add details about any cluster testing that has been done with this patch?

@AmatyaAvadhanula , I have added cluster testing details in the PR description, hope this suffices.
I will update the details of the benchmarking soon.

Update: Added benchmarking details too.

@AmatyaAvadhanula (Contributor) left a comment:

LGTM!
Have minor suggestions about naming

@@ -38,9 +38,8 @@
@JsonSubTypes.Type(name = "segmentTransactionalInsert", value = SegmentTransactionalInsertAction.class),
@JsonSubTypes.Type(name = "segmentTransactionalAppend", value = SegmentTransactionalAppendAction.class),
@JsonSubTypes.Type(name = "segmentTransactionalReplace", value = SegmentTransactionalReplaceAction.class),
// Type name doesn't correspond to the name of the class for backward compatibility.
@JsonSubTypes.Type(name = "segmentListById", value = RetrieveSegmentsByIdAction.class),
@AmatyaAvadhanula (Contributor) commented:

Nit: Please rename type name to "retrieveSegmentsById" since there is no need to maintain backward compatibility for this action

@kfaraz (Contributor, Author) replied:

I named it segmentListById to be consistent with segmentListUsed and segmentListUnused. If you feel strongly about this, I can include the rename in my follow-up PR.

@AmatyaAvadhanula (Contributor) commented Jun 24, 2024:

The names are supposed to be consistent with the task action's class name.
The comment that was moved indicates that certain actions have a type name different from the class name only for backward compatibility.

@kfaraz (Contributor, Author) replied:

The names are supposed to be consistent with the task action's class name.

Sure, this is preferable, but not a requirement.

In this case, it made more sense to me to adhere to the nomenclature that we are now going to support forever, i.e. segmentListXXX.

But I agree that it is better to stick to the convention used by all the other task actions rather than the 2 bad ones. Thanks for calling this out!

@@ -274,7 +275,7 @@ Stream<SegmentsOfInterval> getAllSegmentsOfInterval()
{
this.appenderator = Preconditions.checkNotNull(appenderator, "appenderator");
this.segmentAllocator = Preconditions.checkNotNull(segmentAllocator, "segmentAllocator");
-    this.usedSegmentChecker = Preconditions.checkNotNull(usedSegmentChecker, "usedSegmentChecker");
+    this.usedSegmentChecker = Preconditions.checkNotNull(usedSegmentChecker, "segmentRetriever");
@AmatyaAvadhanula (Contributor) commented:

Please rename this variable to segmentRetriever

@kfaraz (Contributor, Author) replied:

Will do this in a follow-up PR that renames the other things too. Don't want to trigger CI for this.

kfaraz merged commit 0fe6a2a into apache:master on Jun 24, 2024
87 checks passed
kfaraz deleted the fix_metadata_inconsistency branch on June 24, 2024 04:26
@zargor commented Jun 28, 2024

We hit this issue. I guess we can expect this fix in the next release of Druid...

kfaraz added this to the 31.0.0 milestone on Oct 4, 2024