KAFKA-8855; Collect and Expose Client's Name and Version in the Brokers (KIP-511 Part 2) #7398
Conversation
Force-pushed from fb03f4f to 34b831b
```scala
var connection = connectionRegistry.get(connectionId)
if (connection == null) {
  connection = connectionRegistry.register(connectionId, header.clientId(), channel.clientSoftwareName(),
    channel.clientSoftwareVersion(), listenerName, securityProtocol, channel.socketAddress(), channel.principal())
}
```
👀 This is the critical part of the PR.
Force-pushed from 34b831b to 44aab56
Force-pushed from 732bc9c to a9a0439
retest this please
Failed test: JDK 8 and Scala 2.11
It does not seem related to this PR.
clients/src/main/java/org/apache/kafka/common/network/KafkaChannel.java
clients/src/main/java/org/apache/kafka/common/requests/RequestContext.java
```java
 * should rarely happen in practice because the metadata is only updated by the ApiVersionsRequest
 * and there are normally no concurrent requests within the connection.
 */
public ConnectionMetadata get(String connectionId) {
```
This is wrong. Unsynchronized access to a map could cause more than just "stale or inconsistent" data: it could cause null pointer exceptions or other issues. We cannot access this without synchronization.
```java
/**
 * Maintains metadata about each active connection and exposes various metrics about the connections
 * and the clients.
```
I don't think it's necessary to keep a central registry of all this information for all connections. We really just need the metrics, most of which can just be simple counters. If we need more information about a connection, we can look at the request context of that connection. But it doesn't have to be stored here.
Let me explain why I went down this path.
- As you said, the request context can be used to get the information about the connection. The issue is that this information must come from somewhere. It could come directly from the KafkaChannel if SASL is used and an ApiVersionsRequest with the information is received during the SASL initialisation. Unfortunately, some clients, including the Java one, do not provide the information during the SASL initialisation but in a second ApiVersionsRequest, which is handled in the KafkaApis layer this time. This means that a place is required to store and update the name and the version.
- KIP-511 proposed adding a new Gauge<List<Map<String, String>>> metric, kafka.server:type=ClientMetrics,name=Connections, which lists all the active connections with their metadata (client id, software name, software version, etc.). This requires a list of all the active connections to be maintained somewhere to back the gauge. One could argue that this could be replaced by counters for each combination; I preferred the gauge to limit the number of metrics.
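To make the gauge idea concrete, here is a minimal, hedged sketch of a registry whose snapshot has the Gauge<List<Map<String, String>>> shape proposed by KIP-511. All class and method names here are hypothetical simplifications for illustration, not the actual Kafka broker code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch only: a registry whose snapshot is shaped like the
// Gauge<List<Map<String, String>>> value proposed by KIP-511.
final class ClientConnectionsGauge {
    // connectionId -> {clientId, clientSoftwareName, clientSoftwareVersion}
    private final ConcurrentMap<String, Map<String, String>> connections = new ConcurrentHashMap<>();

    void register(String connectionId, String clientId, String name, String version) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("clientId", clientId);
        meta.put("clientSoftwareName", name);
        meta.put("clientSoftwareVersion", version);
        connections.put(connectionId, meta);
    }

    void unregister(String connectionId) {
        connections.remove(connectionId);
    }

    // What the gauge would return: a point-in-time copy of every live connection.
    List<Map<String, String>> value() {
        return new ArrayList<>(connections.values());
    }
}
```

The snapshot is a copy, so JMX readers never observe the live map while processors mutate it.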
I have considered the following alternative to the registry but did not choose it. You may prefer it.
- As the Selector already maintains a list of the active KafkaChannels (equivalent to connections), we could add all the required metadata as attributes in the KafkaChannel and use the Selector as the source of truth. This would require adding the clientId to the channel, for instance.
- To avoid having to pass the KafkaChannel up to the KafkaApis layer to update the name and the version when the ApiVersionsRequest is received, we could partially or completely move the handling of the ApiVersionsRequest into the processor to update the KafkaChannel. The latter would impact the throttling, the metrics and the request logs.
- As many Selectors are run (one per processor, if I'm not mistaken), we could have counters per Selector to limit the locking required to update them. For the gauge which lists all the connections, we could also have one per Selector, but that is less convenient, so for this one I would propose having a way to list all the KafkaChannels across all the Selectors.
The major advantage of this approach is that the "per selector/processor" counters do not require any locking.
The downsides:
- Moving the handling of the ApiVersionsRequest into the processor has a wide impact on the throttling, the metrics of the ApiVersionsRequest and the request log, which is based on the requests sent to the KafkaApis layer.
- It makes exposing the metadata via other APIs a little harder. For instance, one could think of adding an Admin API to list all the active connections of a broker (we have already received such a request from colleagues).
I was not comfortable with moving the handling of the ApiVersionsRequest into the processor when I looked at the options, so I went with the current approach.
What do you think?
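The per-Selector counter alternative described above can be sketched with a lock-free accumulator per processor, summed on demand for a broker-wide value. This is a hedged illustration under assumed names, not code from the PR.

```java
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

// Hedged sketch of the per-selector idea: each processor/Selector keeps its
// own counter, so the connection open/close hot path never takes a lock,
// and a metrics reader sums across selectors on demand. Names are illustrative.
final class PerSelectorConnectionCounters {
    private final LongAdder activeConnections = new LongAdder();

    void onConnectionOpened() { activeConnections.increment(); }
    void onConnectionClosed() { activeConnections.decrement(); }
    long value() { return activeConnections.sum(); }

    // What a broker-wide metric would do: aggregate over all selectors.
    static long total(List<PerSelectorConnectionCounters> selectors) {
        long sum = 0;
        for (PerSelectorConnectionCounters c : selectors) sum += c.value();
        return sum;
    }
}
```

LongAdder is preferable to AtomicLong here because many processor threads increment independently and reads are rare.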
Thanks for the explanation, and thanks for fixing the synchronization. As an optimization, I think we should use something like ConcurrentMap here rather than synchronized blocks, so that we can minimize the amount of time threads spend waiting. This will be a bit more tricky to use, but more scalable.
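The ConcurrentMap suggestion above can be sketched as follows: ConcurrentHashMap.computeIfAbsent makes the get-or-register step atomic without a coarse synchronized block. ConnectionMetadata here is a simplified, hypothetical stand-in for the class in the PR, not the actual Kafka code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hedged sketch: replacing synchronized get/register with an atomic
// computeIfAbsent on a ConcurrentHashMap.
final class LockFreeConnectionRegistry {

    static final class ConnectionMetadata {
        final String clientId;
        final String softwareName;
        final String softwareVersion;

        ConnectionMetadata(String clientId, String softwareName, String softwareVersion) {
            this.clientId = clientId;
            this.softwareName = softwareName;
            this.softwareVersion = softwareVersion;
        }
    }

    private final ConcurrentMap<String, ConnectionMetadata> connections = new ConcurrentHashMap<>();

    // Atomic get-or-register: the mapping function runs at most once per
    // missing key, so concurrent callers cannot double-register a connection.
    ConnectionMetadata getOrRegister(String connectionId, String clientId,
                                     String softwareName, String softwareVersion) {
        return connections.computeIfAbsent(connectionId,
            id -> new ConnectionMetadata(clientId, softwareName, softwareVersion));
    }

    void remove(String connectionId) {
        connections.remove(connectionId);
    }
}
```

Compared with the check-then-act pattern in the diff (get, then register if null), computeIfAbsent closes the race window between the two calls.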
I guess when I thought about KIP-511, I thought of it in terms of metrics and perhaps occasional samples of connections (similar to how we do request sampling by logging a few selected requests). I don't see why we'd ever need a full snapshot of all existing connections. The snapshot would probably be out of date by the time it had been returned, since new connections are closed and opened all the time.
Reading the KIP more carefully, I see that KIP-511 does specify a metric which essentially requires each connection to register itself. Considering we don't even have a way to visualize or graph this metric, I'm not sure this belongs in JMX. I have to think about this more...
I've noticed that, as a client, the client must pessimistically assume that the broker it is talking to can only handle a maximum ApiVersionsRequest version of 2. Otherwise, with flexible versions, if the client assumes the latest version, the request will be written in the compact format, and an old broker will not understand the request and will close the connection. Compact requests can only be written if the client knows the broker is at least 2.4.0. This seems to break the spirit of ApiVersions, which historically could be handled even if the broker did not know the version. When requesting with flexible ApiVersions v3 and a short client name / version:
My question is: should the client behavior going forward be to
If yes, I think the KIP should be updated to describe this new required behavior. If no, then I do not think ApiVersions request v3 should be flexible, as well as versions going forward. If not that, then I am not sure about this flexible dilemma.
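The downgrade behavior under discussion could be sketched as below: the client first tries its latest ApiVersionsRequest version and, if the broker rejects it, retries with v0, whose wire format every broker can parse. The types and constants here are simplified stand-ins for illustration, not the real Kafka client API.

```java
// Hedged sketch of a client-side ApiVersions version fallback.
final class ApiVersionsNegotiator {
    static final short LATEST_VERSION = 3;   // flexible, compact encoding
    static final short FALLBACK_VERSION = 0; // understood by every broker

    interface Broker {
        // Returns true if the broker accepted the request at this version.
        boolean handleApiVersionsRequest(short version);
    }

    // Returns the version the handshake finally succeeded with, or -1.
    static short negotiate(Broker broker) {
        if (broker.handleApiVersionsRequest(LATEST_VERSION)) {
            return LATEST_VERSION;
        }
        // Old broker: retry once with the lowest, universally parsable version.
        if (broker.handleApiVersionsRequest(FALLBACK_VERSION)) {
            return FALLBACK_VERSION;
        }
        return -1;
    }
}
```

The cost of this scheme is one extra round trip against old brokers; the benefit is that new clients are not pinned to v2 forever.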
Ah, that's yet another reasonable fix! That makes sense. Sorry I missed that; I followed the links in KAFKA-8855!
I have opened a new PR which uses a different approach: #7749. Closing this one.
This PR implements the second part of KIP-511:
Committer Checklist (excluded from commit message)