
KAFKA-15661: KIP-951: Server side changes #14444

Merged
merged 13 commits into apache:trunk on Nov 10, 2023

Conversation

Contributor

@chb2ab chb2ab commented Sep 25, 2023

These are the server-side changes to populate the fields in KIP-951. On NOT_LEADER_OR_FOLLOWER errors in both FETCH and PRODUCE, the new leader ID and epoch are retrieved from the local cache through ReplicaManager and included in the response, falling back to the metadata cache if they are unavailable there. The endpoint for the new leader is retrieved from the metadata cache. The new fields are all optional (tagged) and an IBP bump is required.
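
For orientation, here is a minimal, self-contained Scala sketch of the lookup order described above: ReplicaManager-style local state is consulted first, the metadata cache is the fallback, and the endpoint for the new leader always comes from the metadata cache. All names below are illustrative stand-ins, not the actual KafkaApis/ReplicaManager code.

object LeaderFallbackSketch {
  final case class Node(id: Int, host: String, port: Int)
  final case class LeaderAndEpoch(leaderId: Int, leaderEpoch: Int)

  // Hypothetical stand-ins for the two lookups the broker consults.
  def localLeader(partition: String): Either[String, LeaderAndEpoch] =
    Left("NOT_LEADER_OR_FOLLOWER") // the local replica no longer leads this partition
  def metadataCacheLeader(partition: String): Option[LeaderAndEpoch] =
    Some(LeaderAndEpoch(leaderId = 2, leaderEpoch = 7)) // the metadata cache already knows the new leader
  def aliveBrokerNode(brokerId: Int): Option[Node] =
    Some(Node(brokerId, "broker-2.example", 9092))

  // Local state first, metadata cache as the fallback, sentinel (-1, -1) when neither knows the partition.
  def newLeaderInfo(partition: String): (LeaderAndEpoch, Option[Node]) = {
    val leader = localLeader(partition) match {
      case Right(l) => l
      case Left(_)  => metadataCacheLeader(partition).getOrElse(LeaderAndEpoch(-1, -1))
    }
    // The endpoint for the new leader always comes from the metadata cache.
    (leader, aliveBrokerNode(leader.leaderId))
  }

  def main(args: Array[String]): Unit =
    println(newLeaderInfo("topic-0")) // (LeaderAndEpoch(2,7),Some(Node(2,broker-2.example,9092)))
}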

https://cwiki.apache.org/confluence/display/KAFKA/KIP-951%3A+Leader+discovery+optimisations+for+the+client

https://issues.apache.org/jira/browse/KAFKA-15661

Protocol changes: #14627

Testing

Benchmarking described here https://cwiki.apache.org/confluence/display/KAFKA/KIP-951%3A+Leader+discovery+optimisations+for+the+client#KIP951:Leaderdiscoveryoptimisationsfortheclient-BenchmarkResults
./gradlew core:test --tests kafka.server.KafkaApisTest

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

.setErrorCode(error.code())
.setSessionId(sessionId)
.setResponses(topicResponseList);
nodeEndpoints.forEach(endpoint -> data.nodeEndpoints().add(


I don't think we need to populate the nodeEndpoints field for every Fetch/Produce response, which has a certain overhead. To my understanding, the info is needed only in the error case.

Contributor Author

That should be getting handled in KafkaApis, where we only add to nodeEndpoints if there is an error; otherwise it should be an empty list.


it makes sense, thanks


Nit: It would be good to add a comment here, something like
KafkaApis will manage the response, returning nodeEndpoints information only in case of an error.

Contributor Author

added

Contributor

Could we also create the nodeEndpoints together so we keep the final data object relatively clean? (I.e., since we generate the response data structure above, we could also generate the endpoints data structure above.)

Contributor Author

reordered to make things cleaner

Collaborator

@kirktrue kirktrue left a comment

Thanks for the PR @chb2ab!

Looks good overall. Since the KIP is awaiting a final vote, should this PR wait until that goes through before merging?

List<FetchResponseData.FetchableTopicResponse> topicResponseList = new ArrayList<>();
FetchResponseData data = new FetchResponseData();
Collaborator

nit: can we move the object creation closer to where it's updated and returned at the bottom of the method?

Contributor Author

sgtm

* @param throttleTimeMs Time in milliseconds the response was throttled
* @param nodeEndpoints List of node endpoints
*/
@Deprecated
Collaborator

I'm confused: why is the new constructor marked as @Deprecated? If that's intentional, can you add a comment about what should be used instead? Thanks.

Contributor Author

I didn't look too deep into this before, but this was deprecated in https://issues.apache.org/jira/browse/KAFKA-9628 and the follow up is described here https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/KafkaApis.scala#L605-L608. I'm not sure how big of a change that would be, and it is a critical part of the code, so I suspect it would be better to separate it into a different PR. I'm also not sure how noticeable the performance benefit will be. For now I think adding to this comment should be enough.

Contributor

I think all we would need to do is create the produceResponseData object. I don't think it would be too much work. I'm a little wary of adding more deprecated constructors.

Contributor

Taking a closer look, it looks like there was an effort to build the response directly and not pass in data structures (new maps) that will just be converted via toData.

Contributor Author

@chb2ab chb2ab Oct 6, 2023

I started implementing this, but I do think it's getting out of scope of this PR. These are my initial changes; to finish, I think we want to remove PartitionResponse completely and replace it with PartitionProduceResponse, otherwise we're just moving around the conversion. Having a deprecated constructor isn't ideal, but I think we should remove it with KAFKA-10730, not in this PR. @jolshan what do you think?

Contributor

KAFKA-10730 is a pretty dormant JIRA. I do agree that there is some level of conversion. I wonder if folks have a strong opinion about this conversion still.

Looking into this further, I see the change would need to be made to appendRecords and the ProducePartitionStatus. It doesn't look too crazy, but also understandable this is not the scope for this PR.

I wonder if KAFKA-9628 was premature in deprecating the constructor. I guess our options are leaving it deprecated and adding a deprecated method, or removing the deprecation until KAFKA-10730 is completed. (I almost just want to fix it so this doesn't happen in the future 😂 )

Contributor Author

I'll try to find time to finish the changes for KAFKA-10730, I think refactoring the tests would take some time but overall I agree it doesn't seem too big.

I'm ok with removing the deprecation, but I suspect the incentive to do the refactoring will be lost, so leaving it for now.

Contributor

@chb2ab It makes sense to do KAFKA-10730. But I agree with @jolshan that it is outside the scope of this PR and should be done independently.

How about merging the PR with the ctor as @Deprecated, and then doing a follow-up PR for KAFKA-10730 (tackling the new ctor as well)?

@@ -210,6 +238,12 @@ public String toString() {
b.append(logStartOffset);
b.append(", recordErrors: ");
b.append(recordErrors);
b.append(", currentLeader: ");
if (currentLeader != null) {
Collaborator

I think that the following lines could be simply String.valueOf(currentLeader), right?

Collaborator

In fact, the errorMessage bit could be redone that way too.

Collaborator

In fact, I think that the StringBuilder code checks for nulls in its append() method.

Contributor Author

Looking at the Javadocs, I think you're right; this could be replaced by b.append(currentLeader). I'm not sure why errorMessage was written this way; it looks like it was changed explicitly in this commit, but I don't see a reason for it, so I could probably change that as well.
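
For reference, a quick standalone check of the point above: java.lang.StringBuilder.append(Object) routes through String.valueOf, so a null reference is appended as the literal text "null" and the explicit null check can be dropped.

object AppendNullCheck {
  def main(args: Array[String]): Unit = {
    val currentLeader: AnyRef = null
    val b = new java.lang.StringBuilder()
    // append(Object) calls String.valueOf internally, so null becomes "null" rather than throwing.
    b.append(", currentLeader: ").append(currentLeader)
    println(b) // prints: , currentLeader: null
  }
}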

@@ -53,7 +53,9 @@
//
// Version 15 adds the ReplicaState which includes new field ReplicaEpoch and the ReplicaId. Also,
// deprecate the old ReplicaId field and set its default value to -1. (KIP-903)
"validVersions": "0-15",
//
// Version 16 is the same as version 15.
Collaborator

At the risk of proving my ignorance, why do we bump the version number if nothing has changed? Is it so that the request has the same version number as the response (which is bumped)?

Contributor Author

Yes, from my reading the version of the response is based on the version of the request, so we need to bump both.

Contributor

@chb2ab that's correct.

Contributor

Request and response version should always be the same :)

@@ -33,7 +33,7 @@
// Starting in Version 8, response has RecordErrors and ErrorMessage. See KIP-467.
//
// Version 9 enables flexible versions.
"validVersions": "0-9",
"validVersions": "0-10",
Collaborator

Does it make sense to add a comment about the version bump?

Contributor Author

yes, forgot to include that.

case Left(x) =>
debug(s"Unable to retrieve local leaderId and Epoch with error $x, falling back to metadata cache")
val partitionInfo = metadataCache.getPartitionInfo(tp.topic, tp.partition)
partitionInfo.foreach { info =>
Collaborator

This foreach loop is overwriting the leaderId and leaderEpoch each time. Is that intentional? Is there a benefit to looping vs. just grabbing the last entry in the collection?

Contributor Author

Looking at other uses of partitionInfo, I think this is a style choice. getPartitionInfo returns an Option holding at most one partitionInfo, so the foreach only ever accesses one entry; I think this is just a more succinct way of accessing it.

Contributor

getPartitionInfo returns an option. If it exists, foreach will access it. If it doesn't foreach does nothing. This is a common pattern in scala. Are we considering the case when the partition is not present?
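
A small standalone illustration of the pattern being discussed: foreach on a Scala Option runs its body at most once, and only when a value is present.

object OptionForeachDemo {
  def main(args: Array[String]): Unit = {
    var leaderId = -1
    val present: Option[Int] = Some(5)
    val absent: Option[Int] = None

    present.foreach(id => leaderId = id) // body runs once: leaderId becomes 5
    absent.foreach(id => leaderId = id)  // body never runs: leaderId stays 5

    println(leaderId) // 5
  }
}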

Contributor

@jolshan jolshan Oct 9, 2023

We also don't need to set vars here. We could have a statement where we return a tuple or even just the partitionInfo.

Contributor

val partitionInfo = partitionInfoOrError match {
  case Right(partitionInfo) =>
    partitionInfo
  case Left(error) =>
    debug(s"Unable to retrieve local leaderId and Epoch with error $error, falling back to metadata cache")
    metadataCache.getPartitionInfo(tp.topic, tp.partition) match {
      case Some(partitionInfo) => partitionInfo
      case None => ??? // handle the case where we don't have the partition
    }
}

Contributor Author

I think the confusion was that foreach implies a list of elements, but there can only be 1 here. I like the use of tuple and match/case here though, I will update to that

Contributor

Yeah, using foreach for scala options is a common pattern.

@divijvaidya divijvaidya added the kip Requires or implements a KIP label Sep 28, 2023
Contributor Author

chb2ab commented Sep 28, 2023

Thanks for the PR @chb2ab!

Looks good overall. Since the KIP is awaiting a final vote, should this PR wait until that goes through before merging?

Thank you Kirk, yeah I am going to wait for the KIP to be approved before making more changes

leaderEpoch = x.getLeaderEpoch
case Left(x) =>
debug(s"Unable to retrieve local leaderId and Epoch with error $x, falling back to metadata cache")
val partitionInfo = metadataCache.getPartitionInfo(tp.topic, tp.partition)


any chance partitionInfo can be null?

Contributor Author

@chb2ab chb2ab Oct 2, 2023

I don't think so, getPartitionInfo returns an Option, the equivalent of null would be an empty option. We don't seem to null check this value elsewhere either.


I see

@@ -32,7 +32,9 @@
// records that cause the whole batch to be dropped. See KIP-467 for details.
//
// Version 9 enables flexible versions.
"validVersions": "0-9",
//
// Version 10 adds 'CurrentLeader' and 'NodeEndpoints' as tagged fields


Why do we need to bump the version if we are just adding tagged fields?

Contributor Author

@chb2ab chb2ab Oct 5, 2023

I'm not sure if this is absolutely necessary, I was going based off the KIP, but I do think there could be an issue with leaving the version the same. If a client is still using the old protocol definition and the server returns a message based on the new definition but with the same version number, wouldn't the client deserialize it incorrectly?


New tagged fields would be unknown to older clients, so they would ignore them. It would not affect their ability to deserialize.

Contributor Author

ok, I'm fine with keeping the version the same, but we should update the KIP. @msn-tldr do you see any issues with this?

Contributor Author

I didn't realize this was brought up in previous discussions, it looks like we decided to bump up the version # to make it clearer which clients have implemented the feature, in an email from @dajac

Personally, I would rather prefer to bump both versions and to add the tagged fields. This would allow us to better reason about what the client is supposed to do when we see the version on the server side. Otherwise, we will never know if the client uses this or not.

@hachikuji does this sound good?


Contributor

@hachikuji Yeah, your point is totally valid. I was pushing for this with the java client (and potentially librdkafka) in mind. I think that it will make request analysis easier as you said.

Contributor

@jolshan I think that it is hard to really define a policy for this. It mainly depends on whether there is a justification to require an epoch bump or not. In this case, I believe that there is one but this may not always be true.

Contributor

I'm also wondering if there is a flaw with this approach. This would mean we need to bump MV after all for inter-broker requests, right? #14444 (comment)

Contributor Author

I bumped the MV for fetch requests in 3.7, I think it's fine since it isn't released yet, lmk if you see any issues.

@@ -338,7 +338,7 @@ public boolean isControllerRegistrationSupported() {

public short fetchRequestVersion() {
if (this.isAtLeast(IBP_3_5_IV1)) {
return 15;
return 16;
Contributor

Hmm. Is this correct? In the upgrade scenario we will send request version 16 to brokers that may not have that version yet. I know we just ignore tagged fields, but I'm not sure I recall if we can handle version bumps.

If this is always just the latest version, should it be hardcoded?

Contributor Author

I think you're right that this could cause issues during upgrade, I think using the latest version should be safe.

Contributor

I'm just trying to figure out if we would see unsupported version errors during upgrades. I think we might.

Contributor

Clusters with IBP 3.5 were guaranteed to support version 15, since that was the version at which we defined the IBP. I don't think we can just change the version, because some clusters will have IBP 3.5 but not version 16 fetch requests. We could avoid this with the tagged fields, but since we are bumping the version, we run into a problem.

Contributor Author

Ah I hadn't thought this through. I think this would need an IBP version bump to avoid errors during upgrades?

Contributor Author

@chb2ab chb2ab Oct 12, 2023

I'll remove this version bump, I hadn't realized it was only for replication fetches.

Contributor Author

@chb2ab chb2ab Oct 12, 2023

I remember why I added this now: we have a test that checks we're using the latest version for various APIs. I guess we would need to remove that for FETCH.

@Test
def shouldSendLatestRequestVersionsByDefault(): Unit = {
val props = TestUtils.createBrokerConfig(1, "localhost:1234")
val config = KafkaConfig.fromProps(props)
val replicaManager: ReplicaManager = mock(classOf[ReplicaManager])
when(replicaManager.brokerTopicStats).thenReturn(mock(classOf[BrokerTopicStats]))
assertEquals(ApiKeys.FETCH.latestVersion, config.interBrokerProtocolVersion.fetchRequestVersion())
assertEquals(ApiKeys.OFFSET_FOR_LEADER_EPOCH.latestVersion, config.interBrokerProtocolVersion.offsetForLeaderEpochRequestVersion)
assertEquals(ApiKeys.LIST_OFFSETS.latestVersion, config.interBrokerProtocolVersion.listOffsetRequestVersion)
}

Contributor

I chatted with David J about this offline. Since we are using MV/IBP bumps for fetches, a simple thing to do would be to pick up the newest MV for this release and include the fetch bump here.

The alternative is setting up the fetch path to use ApiVersions to ensure the correct version. But that might be out of scope for this change.

With either of these approaches we can keep the latest version for the replication fetches which would make things a little clearer.

Contributor Author

@chb2ab chb2ab Oct 16, 2023

Since we are using MV/IBP bumps for fetches, a simple thing to do would be to pick up the newest MV for this release and include the fetch bump here

ok, so my understanding is instead of bumping IBP to IBP_3_7_IV1 we would wait for the release of IBP_3_8_IV0 to bump the fetch version here to 16.

The alternative is setting up the fetch path to use ApiVersions to ensure the correct version. But that might be out of scope for this change.

yeah, I would need to look more into this, I'm not familiar enough to know how it might look.

Contributor Author

Since we are using MV/IBP bumps for fetches, a simple thing to do would be to pick up the newest MV for this release and include the fetch bump here

ok, so my understanding is instead of bumping IBP to IBP_3_7_IV1 we would wait for the release of IBP_3_8_IV0 to bump the fetch version here to 16.

I should have made this a question. I confused myself with the release versions, I think we can use IBP_3_7_IV0 to bump the fetch version to 16 since it hasn't been released yet, let me know if it makes sense to you.
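
The concern in this sub-thread is version gating during rolling upgrades: the inter-broker fetch version has to be derived from the IBP/MV so that version 16 is only sent once every broker in the cluster is guaranteed to understand it. Here is a small self-contained sketch of that gating pattern, using an illustrative enum rather than the real MetadataVersion code (the 15/16 and IBP_3_7_IV0 values come from this thread).

object FetchVersionGatingSketch {
  // Illustrative IBP/MV levels, oldest to newest (not the real MetadataVersion enum).
  object Ibp extends Enumeration { val IBP_3_5_IV1, IBP_3_6_IV0, IBP_3_7_IV0 = Value }

  // Choose the inter-broker fetch version from the configured IBP, never from "latest":
  // brokers still on an older IBP keep sending version 15 during the roll, and only an
  // IBP at or above 3.7 opts in to version 16 (the KIP-951 bump).
  def fetchRequestVersion(ibp: Ibp.Value): Int =
    if (ibp >= Ibp.IBP_3_7_IV0) 16 else 15

  def main(args: Array[String]): Unit = {
    println(fetchRequestVersion(Ibp.IBP_3_5_IV1)) // 15
    println(fetchRequestVersion(Ibp.IBP_3_7_IV0)) // 16
  }
}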

Contributor

@splett2 splett2 left a comment

left a few comments

case None => (-1, -1)
}
}
val leaderNode: Node = metadataCache.getAliveBrokerNode(leaderId, config.interBrokerListenerName).getOrElse({
Contributor

We shouldn't be passing through the inter-broker listener name; we should be using the listener used by the original request, to be consistent with the metadata request.

Is it simpler if we just consult the metadata cache? In KRaft mode, the metadata cache is the source of truth for partition leadership and is updated before the partition state gets updated.

Contributor Author

will update the listener name.

I think the motivation for checking replica manager first is it may be faster than metadata cache.

Contributor

my point in the previous comment is that will never be the case with KRaft.

Contributor Author

Ok, I think replica manager is the more commonly used path, which may have caching benefits, but I'm not sure; does that make sense?

Contributor

Hmmm -- I'm not sure I understand "more commonly used path".
ReplicaManager will have the partition if the broker hosts it. The metadatacache is meant to be a cache of all the partitions, so I don't think it loses out on "caching benefits"

The benefit of the replica manager is that it also contains the log of the partition. If the metadata cache is sufficient (which it seems to be) we should probably just use that.

Contributor Author

@chb2ab chb2ab Oct 16, 2023

I looked more into this and I see replica manager looks up the partition from a Pool object while metadata cache looks it up in the current image and creates a new UpdateMetadataPartitionState to return. I think we can avoid an allocation using the replica manager, also since the fetch/produce paths should have recently tried to read through replica manager I think it's more likely to give an in-memory cache hit than the metadata path. It still seems better to me to try from replica manager first, what do you all think?

Contributor

The extra object allocation is not a big issue, since the new-leader lookup is not done in the common case, only in error cases.

Populating the new leader state from the Partition also doesn't work in cases where the partition gets deleted from the leader, for instance during reassignments. So populating from the metadata cache is both more likely to have up-to-date information (in KRaft mode, which we should assume to be the default) and handles NotLeader in more cases.

Contributor Author

@chb2ab chb2ab Oct 17, 2023

The metadata cache having more up to date information makes sense to me, but I don't follow the deletion case, would reading from the replica manager not return NOT_LEADER_OR_FOLLOWER there? It seems like we should still fallback to the metadata cache in that case

Contributor

@jolshan jolshan Oct 17, 2023

We would but we would actually look it up in the metadata cache twice in that path :)

Contributor Author

ok, so my understanding is in the case of partition reassignments it would be better to go directly to metadata cache, but when moving leadership within the replica set it is better to go to replica manager first. I think we should prioritize moving leadership within the replica set here since it seems more common, what do you all think?

if (request.header.apiVersion >= 10) {
status.currentLeader = {
status.error match {
case Errors.NOT_LEADER_OR_FOLLOWER | Errors.FENCED_LEADER_EPOCH =>
Contributor

produce requests should never receive FENCED_LEADER_EPOCH.

also, shouldn't this go in the above if block?

Contributor Author

will move this into the if block.

why can't produce receive FENCED_LEADER_EPOCH?

Contributor

I think the error is only returned on fetch requests

Contributor Author

ok, removed FENCED_LEADER_EPOCH from the produce path

Contributor

produce requests do not include a leader epoch => they can never get fenced leader epoch.

Contributor Author

chb2ab commented Oct 18, 2023

I reran all the failing tests locally and they passed, I'm not sure if there's anything else that needs to be done but they seem like flaky tests.

}

public PartitionResponse(Errors error, long baseOffset, long lastOffset, long logAppendTime, long logStartOffset, List<RecordError> recordErrors, String errorMessage) {
this(error, baseOffset, lastOffset, logAppendTime, logStartOffset, recordErrors, errorMessage, new ProduceResponseData.LeaderIdAndEpoch());
Contributor

can we remove the ProduceResponseData prefixes here?

Contributor

Or even better, can we just leave empty and use a default? Or does that bloat the constructors more?

Contributor Author

I think I addressed this, the currentLeader parameter was unused so I removed it and set it to be a new LeaderIdAndEpoch by default. I also removed the prefix.

@@ -98,13 +116,20 @@ private static ProduceResponseData toData(Map<TopicPartition, PartitionResponse>
.setLogAppendTimeMs(response.logAppendTime)
.setErrorMessage(response.errorMessage)
.setErrorCode(response.error.code())
.setCurrentLeader(response.currentLeader != null ? response.currentLeader : new LeaderIdAndEpoch())
Contributor

do we need to set anything here if the response is null?

Contributor

Or alternatively pass in the default and not have to do a check here.

Contributor Author

based on the change I made in handleProduceRequest I don't think currentLeader can be null anymore, I removed this check

]}
]},
{ "name": "ThrottleTimeMs", "type": "int32", "versions": "1+", "ignorable": true, "default": "0",
"about": "The duration in milliseconds for which the request was throttled due to a quota violation, or zero if the request did not violate any quota." }
"about": "The duration in milliseconds for which the request was throttled due to a quota violation, or zero if the request did not violate any quota." },
{ "name": "NodeEndpoints", "type": "[]NodeEndpoint", "versions": "10+", "taggedVersions": "10+", "tag": 0,
Contributor

Should we be using the same tag as the CurrentLeader field?

Contributor Author

same response as in FetchResponse.json

@@ -102,6 +104,15 @@
"about": "The preferred read replica for the consumer to use on its next fetch request"},
{ "name": "Records", "type": "records", "versions": "0+", "nullableVersions": "0+", "about": "The record data."}
]}
]},
Contributor

Should we be using the same tag here as diverging epoch?

Contributor Author

it looks like tags are scoped to the list level so this isn't really the same tag. They also need to be contiguous within their scope so this gives an error if I try to tag NodeEndpoints to something other than 0.

.setLeaderId(leaderNode.leaderId)
.setLeaderEpoch(leaderNode.leaderEpoch)
case _ =>
null
Contributor

could this just be the default leaderIdAndEpoch?

Contributor Author

yeah, looking again the currentLeader should already be set to the default, I removed the allocation

@@ -125,7 +125,6 @@ class ReplicaFetcherThreadTest {
val replicaManager: ReplicaManager = mock(classOf[ReplicaManager])
when(replicaManager.brokerTopicStats).thenReturn(mock(classOf[BrokerTopicStats]))

assertEquals(ApiKeys.FETCH.latestVersion, config.interBrokerProtocolVersion.fetchRequestVersion())
Contributor

Hmmm -- not sure if we want to remove this.

Contributor

If we do plan to address the MV on a followup, we should definitely call it out and file a JIRA that is a blocker for 3.7

Contributor Author

@chb2ab chb2ab Oct 20, 2023

it looks like IBP_3_7_IV0 was added already, I was confused. I bumped up the fetch version so removing this isn't necessary anymore

Contributor

jolshan commented Oct 19, 2023

@chb2ab is there a JIRA for this work? If not, can we create one and format the title as the jira title?

@chb2ab chb2ab changed the title KIP-951: Server side and protocol changes for KIP-951 KAFKA-15661: KIP-951: Server side and protocol changes Oct 20, 2023
Contributor Author

chb2ab commented Oct 20, 2023

@chb2ab is there a JIRA for this work? If not, can we create one and format the title as the jira title?

done, lmk if anything's missing, it's my first AK JIRA. https://issues.apache.org/jira/browse/KAFKA-15661

Thank you for your reviews everyone btw.

jolshan pushed a commit that referenced this pull request Nov 1, 2023
Separating out the protocol changes from #14444 in an effort to more quickly unblock the client side PR.

This is the protocol changes to populate the fields in KIP-951. On NOT_LEADER_OR_FOLLOWER errors in both FETCH and PRODUCE the new leader ID and epoch are included in the response. The endpoint for the new leader is retrieved from the metadata cache. The new fields are all optional (tagged) and an IBP bump is required.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-951%3A+Leader+discovery+optimisations+for+the+client

Reviewers: Justine Olshan <jolshan@confluent.io>, Mayank Shekhar Narula <mayanks.narula@gmail.com>
Contributor Author

chb2ab commented Nov 9, 2023

Thank you. I think the build is failing for something unrelated, could we try restarting it? This was the error

> Task :raft:checkstyleTest FAILED

Failed to execute org.gradle.cache.internal.AsyncCacheAccessDecoratedCache$$Lambda$344/799627478@4ae4fd80.

org.gradle.api.UncheckedIOException: Could not add entry ':raft:checkstyleTest' to cache executionHistory.bin (/home/jenkins/jenkins-agent/712657a4/workspace/Kafka_kafka-pr_PR-14444/.gradle/8.3/executionHistory/executionHistory.bin).

Contributor

jolshan commented Nov 9, 2023

These build failures are out of control. I have to rebuild again.
I do not think it is these changes since I'm having the same issues on the 5 prs I'm trying to review right now :(

Contributor

jolshan commented Nov 10, 2023

I've seen all these failures on other builds (both 3.6 and trunk) today and yesterday.
I've commented on the applicable JIRAs.

@jolshan jolshan merged commit f38b0d8 into apache:trunk Nov 10, 2023
1 check failed
dajac added a commit to dajac/kafka that referenced this pull request Nov 11, 2023
jolshan pushed a commit that referenced this pull request Nov 12, 2023
This reverts commit f38b0d8.

Trying to find the root cause of org.apache.kafka.tiered.storage.integration.ReassignReplicaShrinkTest failing in CI.

Reviewers: Justine Olshan <jolshan@confluent.io>
chb2ab added a commit to chb2ab/kafka that referenced this pull request Nov 15, 2023
jolshan pushed a commit that referenced this pull request Nov 16, 2023
…14738)" (#14747)

This KIP-951 commit was reverted to investigate the org.apache.kafka.tiered.storage.integration.ReassignReplicaShrinkTest test failure (#14738).

A fix for that was merged in #14757, hence unreverting this change.

This reverts commit a98bd7d.

Reviewers: Justine Olshan <jolshan@confluent.io>, Mayank Shekhar Narula <mayanks.narula@gmail.com>
rreddy-22 pushed a commit to rreddy-22/kafka-rreddy that referenced this pull request Jan 2, 2024
This is the server side changes to populate the fields in KIP-951. On NOT_LEADER_OR_FOLLOWER errors in both FETCH and PRODUCE the new leader ID and epoch are retrieved from the local cache through ReplicaManager and included in the response, falling back to the metadata cache if they are unavailable there. The endpoint for the new leader is retrieved from the metadata cache. The new fields are all optional (tagged) and an IBP bump was required.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-951%3A+Leader+discovery+optimisations+for+the+client

https://issues.apache.org/jira/browse/KAFKA-15661

Protocol changes: apache#14627

Testing
Benchmarking described here https://cwiki.apache.org/confluence/display/KAFKA/KIP-951%3A+Leader+discovery+optimisations+for+the+client#KIP951:Leaderdiscoveryoptimisationsfortheclient-BenchmarkResults
./gradlew core:test --tests kafka.server.KafkaApisTest

Reviewers: Justine Olshan <jolshan@confluent.io>, David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>, Fred Zheng <zhengyd2014@gmail.com>, Mayank Shekhar Narula <mayanks.narula@gmail.com>,  Yang Yang <yayang@uber.com>, David Mao <dmao@confluent.io>, Kirk True <ktrue@confluent.io>
rreddy-22 pushed a commit to rreddy-22/kafka-rreddy that referenced this pull request Jan 2, 2024
rreddy-22 pushed a commit to rreddy-22/kafka-rreddy that referenced this pull request Jan 2, 2024
…)" (apache#14738)" (apache#14747)

This KIP-951 commit was reverted to investigate the org.apache.kafka.tiered.storage.integration.ReassignReplicaShrinkTest test failure (apache#14738).

A fix for that was merged in apache#14757, hence unreverting this change.

This reverts commit a98bd7d.

Reviewers: Justine Olshan <jolshan@confluent.io>, Mayank Shekhar Narula <mayanks.narula@gmail.com>
jolshan pushed a commit that referenced this pull request Jan 26, 2024
I was using the ZERO_UUID topicId instead of the actual topicId in the testFetchResponseContainsNewLeaderOnNotLeaderOrFollower introduced in #14444, updating as the actual topicId is more correct.

Reviewers: Justine Olshan <jolshan@confluent.io>
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
AnatolyPopov pushed a commit to aiven/kafka that referenced this pull request Feb 16, 2024
AnatolyPopov pushed a commit to aiven/kafka that referenced this pull request Feb 16, 2024
AnatolyPopov pushed a commit to aiven/kafka that referenced this pull request Feb 16, 2024
AnatolyPopov pushed a commit to aiven/kafka that referenced this pull request Feb 16, 2024
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
Phuc-Hong-Tran pushed a commit to Phuc-Hong-Tran/kafka that referenced this pull request Jun 6, 2024