
KAFKA-6501: Dynamic broker config tests updates and metrics fix #4539

Merged

Conversation

@rajinisivaram (Contributor) commented Feb 7, 2018

  1. Handle listener-not-found in MetadataCache since this can occur while listeners are being updated. To avoid breaking clients, this is handled in the same way as broker-not-available so that clients retry (see the sketch right after this list).
  2. Set retries=1000 for listener reconfiguration tests to avoid transient failures when the metadata cache has not been updated.
  3. Remove the IdlePercent metric when a Processor is deleted, and add a test.
  4. Reduce the log segment size used during reconfiguration to avoid a timeout while waiting for log rolling.
  5. Test markPartitionsForTruncation after fetcher thread resize.
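A minimal, REPL-pasteable Scala sketch of the idea in item 1 (the types and names here are illustrative stand-ins, not the actual MetadataCache code): a broker that is alive but does not yet expose the requested listener is reported the same way as an unknown broker, so the caller maps both cases to the same retriable condition and the client retries.

object ListenerLookupSketch {
  // Illustrative stand-in for a broker endpoint.
  final case class Node(host: String, port: Int)

  // brokerId -> (listenerName -> endpoint), mimicking the cache's alive-nodes map.
  val aliveNodes: Map[Int, Map[String, Node]] =
    Map(0 -> Map("PLAINTEXT" -> Node("broker0", 9092)))

  // None covers both "broker unknown" and "listener not registered yet",
  // so both cases surface to clients as the same retriable condition.
  def getAliveEndpoint(brokerId: Int, listenerName: String): Option[Node] =
    aliveNodes.get(brokerId).flatMap(_.get(listenerName))

  def main(args: Array[String]): Unit = {
    println(getAliveEndpoint(0, "PLAINTEXT")) // Some(Node(broker0,9092))
    println(getAliveEndpoint(0, "EXTERNAL"))  // None -> treated like broker-not-available
  }
}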

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)
@rajinisivaram (Contributor, Author) commented Feb 7, 2018

@hachikuji Can you review, please? Thank you!

Contributor left a comment

Thanks for the patch. Left one question and a couple minor comments.

core/src/main/scala/kafka/server/MetadataCache.scala Outdated
val node = nodeMap.get(listenerName)
warn(s"Broker endpoint not found for broker $brokerId listenerName $listenerName")
node
}.getOrElse(None)

@hachikuji (Contributor) commented Feb 8, 2018

nit: I think you don't need this if you use flatMap.
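For illustration, a REPL-pasteable Scala equivalence of the nit (not the Kafka source itself): map followed by getOrElse(None) on a nested Option collapses to a single flatMap.

val aliveNodes: Map[Int, Map[String, String]] =
  Map(0 -> Map("PLAINTEXT" -> "broker0:9092"))

// Before: map produces Option[Option[String]], flattened by hand with getOrElse(None)
val before: Option[String] =
  aliveNodes.get(0).map(_.get("PLAINTEXT")).getOrElse(None)

// After: flatMap does the flattening in one step
val after: Option[String] =
  aliveNodes.get(0).flatMap(_.get("PLAINTEXT"))

assert(before == after) // Some(broker0:9092) in both cases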

core/src/test/scala/integration/kafka/server/DynamicBrokerReconfigurationTest.scala Outdated
val numProcessors = servers.head.config.numNetworkThreads * 2 // 2 listeners

val kafkaMetrics = servers.head.metrics.metrics().keySet.asScala
.filter(_.tags.containsKey("networkProcessor"))

@hachikuji (Contributor) commented Feb 8, 2018

There is also a response queue size metric which uses the "processor" tag. Maybe we can add a check to ensure its deletion as well?
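A hedged sketch of the kind of assertion being suggested (the case class and helper below are illustrative, not the actual test code): collect every metric tagged with either "networkProcessor" or "processor" and verify that none of them still refers to a processor that has been removed.

final case class MetricName(name: String, tags: Map[String, String])

// Returns metrics still tagged with a processor id that should no longer exist.
def leakedProcessorMetrics(metrics: Set[MetricName], removedIds: Set[String]): Set[MetricName] =
  metrics.filter { m =>
    Seq("networkProcessor", "processor").exists(tag => m.tags.get(tag).exists(removedIds.contains))
  }

// In the real test one would assert that this set is empty after reconfiguring
// the number of network threads downwards.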

core/src/main/scala/kafka/server/MetadataCache.scala Outdated
aliveNodes.get(brokerId).map { nodeMap =>
nodeMap.getOrElse(listenerName,
throw new BrokerEndPointNotAvailableException(s"Broker `$brokerId` does not have listener with name `$listenerName`"))

@hachikuji (Contributor) commented Feb 8, 2018

To check my understanding, previously when we raised this exception, the Metadata request would have failed with an unknown server error (since this exception has no error code) which would have probably been raised to the user. Is that right? Now we will return LEADER_NOT_AVAILABLE instead and the client will retry.

I am wondering in this case if we really should have a separate error code to indicate that there is no listener provided so that we can at least log a warning in the client. It seems more likely that this is the result of a misconfiguration than a delayed config update.

@ijuma (Contributor) commented Feb 8, 2018

Yes, we should be careful here. Otherwise we may create a hard to diagnose problem for a common case (misconfigured listener).

@rajinisivaram (Author, Contributor) commented Feb 8, 2018

@hachikuji @ijuma Thanks for the reviews. At the moment, as Jason said, the user sees an unknown server error which is not retried. Neither client nor broker logs anything to show what went wrong. I did initially consider adding a new error code for this, but the problem is that old clients wouldn't recognize it and, I assumed, wouldn't retry as a result (maybe I am wrong). So LEADER_NOT_AVAILABLE seemed like a reasonable error to send to the client. Since the problem occurs only if some brokers have a listener and others don't, I was thinking it was sufficient to have a log entry in the broker logs.

@ijuma (Contributor) commented Feb 8, 2018

@rajinisivaram if we bumped the version of the relevant protocols in the 1.1 cycle, we could conditionally return a new error code and fall back to LEADER_NOT_AVAILABLE otherwise. If we didn't bump them, then it's less clear whether it's worth doing just for this case.
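As an illustration only (the error name and the version cutoff below are assumptions, not the 1.1 protocol), the version-gated fallback being discussed would look roughly like this REPL-pasteable sketch:

sealed trait MetadataError
case object LeaderNotAvailable extends MetadataError // retriable, understood by old clients
case object ListenerNotFound   extends MetadataError // hypothetical new error code

def errorForMissingListener(requestVersion: Int, firstVersionWithListenerError: Int): MetadataError =
  if (requestVersion >= firstVersionWithListenerError) ListenerNotFound // newer clients get the precise error
  else LeaderNotAvailable                                               // older clients fall back to a retriable one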

@rajinisivaram (Author, Contributor) commented Feb 8, 2018

We don't seem to have bumped up the version of MetadataRequest or FindCoordinatorRequest in 1.1 (not sure if there are other requests which use this code).

@hachikuji (Contributor) commented Feb 8, 2018

Hmm.. That's a fair point. Something else to consider is whether we should log the message when we receive the UpdateMetadata request from the controller rather than the Metadata request from the clients.
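A rough, REPL-pasteable sketch of that idea (structure and names are illustrative): warn once while applying the controller's UpdateMetadata request if the brokers' listener sets disagree, instead of warning on every client Metadata request.

// aliveNodes: brokerId -> (listenerName -> endpoint), as maintained by the cache
def warnIfListenersInconsistent(aliveNodes: Map[Int, Map[String, String]]): Unit = {
  val listenerSets = aliveNodes.values.map(_.keySet).toSet
  if (listenerSets.size > 1)
    println(s"Listeners are not identical across brokers: $aliveNodes") // warn(...) in the real broker
}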

@rajinisivaram (Author, Contributor) commented Feb 8, 2018

@hachikuji Yes, that makes sense, updated. Do we still want to change protocol version and add a new error code for 1.1?

@ijuma (Contributor) commented Feb 8, 2018

@hachikuji @rajinisivaram How about we add it to the KIP but implement it in the next version? I believe we have KIPs in progress that suggest changing MetadataRequest and we could piggyback on one of them.

@hachikuji (Contributor) commented Feb 8, 2018

That sounds good to me. I think this is still an improvement over existing behavior.

@rajinisivaram (Author, Contributor) commented Feb 8, 2018

@hachikuji @ijuma Sounds good. I will update the KIP and create a JIRA for the next version. I think all the other comments on this one have been addressed. Let me know if anything else needs to be done for 1.1. Thanks.

core/src/main/scala/kafka/network/SocketServer.scala Outdated
@@ -740,6 +740,7 @@ private[kafka] class Processor(val id: Int,
close(channel.id)
}
selector.close()
removeMetric("IdlePercent", Map("networkProcessor" -> id.toString))

@ijuma (Contributor) commented Feb 8, 2018

Can we add a constant for the metric name?

@rajinisivaram (Author, Contributor) commented Feb 8, 2018

We don't seem to use constants for other metric names; wouldn't it look odd to have one just for this?

@ijuma (Contributor) commented Feb 8, 2018

I don't see any other metrics in this method. Generally, if the same magic value is used in 2 places, we should definitely use a constant. We don't follow this rule consistently, which is a shame.
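Sketched out (the object name is illustrative, and the registration and removal calls are shown only as comments), the suggestion is simply to hoist the repeated string so the gauge registration and its cleanup cannot drift apart:

object ProcessorMetricNames {
  val IdlePercent = "IdlePercent"
}

// registration site (illustrative):
//   newGauge(ProcessorMetricNames.IdlePercent, gauge, Map("networkProcessor" -> id.toString))
// cleanup site (illustrative):
//   removeMetric(ProcessorMetricNames.IdlePercent, Map("networkProcessor" -> id.toString))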

core/src/main/scala/kafka/server/MetadataCache.scala Outdated
throw new BrokerEndPointNotAvailableException(s"Broker `$brokerId` does not have listener with name `$listenerName`"))
}
val node = nodeMap.get(listenerName)
warn(s"Broker endpoint not found for broker $brokerId listenerName $listenerName")

@ijuma (Contributor) commented Feb 8, 2018

Hmm, shouldn't this be logged if node is None?
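In plain, REPL-pasteable Scala (illustrative, not the patched file), the point is just to make the warning conditional on the lookup miss:

def getEndpoint(nodeMap: Map[String, String], brokerId: Int, listenerName: String): Option[String] = {
  val node = nodeMap.get(listenerName)
  if (node.isEmpty)
    println(s"Broker endpoint not found for broker $brokerId listenerName $listenerName") // warn(...) in the real code
  node
}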

@rajinisivaram force-pushed the rajinisivaram:KAFKA-6501-dynamic-broker-tests branch Feb 8, 2018
@hachikuji (Contributor) commented Feb 8, 2018

@rajinisivaram Note that the test failure appears related.

@rajinisivaram (Contributor, Author) commented Feb 8, 2018

@hachikuji Yes, thank you, I will fix the test failure and rebase.

@hachikuji referenced this pull request Feb 8, 2018
@rajinisivaram force-pushed the rajinisivaram:KAFKA-6501-dynamic-broker-tests branch to d005c79 Feb 8, 2018
Contributor left a comment

LGTM, and thanks for fixing the metric that I broke.

@hachikuji merged commit 15bc405 into apache:trunk Feb 9, 2018
2 of 3 checks passed

  • JDK 8 and Scala 2.12: FAILURE (8573 tests run, 17 skipped, 0 failed)
  • JDK 7 and Scala 2.11: SUCCESS (8573 tests run, 17 skipped, 0 failed)
  • JDK 9 and Scala 2.12: SUCCESS (8573 tests run, 17 skipped, 0 failed)
hachikuji added a commit that referenced this pull request Feb 9, 2018
1. Handle listener-not-found in MetadataCache since this can occur when listeners are being updated. To avoid breaking clients, this is handled in the same way as broker-not-available so that clients may retry.
2. Set retries=1000 for listener reconfiguration tests to avoid transient failures when metadata cache has not been updated 
3. Remove IdlePercent metric when Processor is deleted, add test
4. Reduce log segment size used during reconfiguration to avoid timeout while waiting for log rolling
5. Test markPartitionsForTruncation after fetcher thread resize
6. Move per-processor ResponseQueueSize metric back to RequestChannel.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
nimosunbit added a commit to sunbit-dev/kafka that referenced this pull request Nov 6, 2018