[SPARK-28367][SS] Use new KafkaConsumer.poll API in Kafka connector #25135
Test build #107595 has finished for PR 25135 at commit
cc @HeartSaVioR
The change mostly looks good. Some minor comments.
Btw, just wondering: if you managed to test the situation where the metadata update is delayed (and finally times out), what happens? Does the query fail, or are the relevant tasks reattempted?
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
...l/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceProviderSuite.scala
external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
I presume you're interested in what the new behavior is. If an exception arrives
Test build #107681 has finished for PR 25135 at commit
Ah OK there's already a retry loop. Thanks for the info.
LGTM
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
LGTM again.
Test build #107731 has finished for PR 25135 at commit
retest this please
Test build #107732 has finished for PR 25135 at commit
Created a quick fix #25171
retest this please
Test build #107735 has finished for PR 25135 at commit
Checking the issue...
Test build #107739 has finished for PR 25135 at commit
A little explanation of the last change:
So the final judgement is that the actual implementation works, but as a side track I'm going to analyze the
@gaborgsomogyi yeah, I think it's better to ask the Kafka guys to clarify how to use this new poll API. It seems it cannot be used directly without retrying.
@zsxwing the main approach was created in agreement with the Kafka guys. The only thing that bothers me is why the following approaches produce different results (maybe an edge case is hit with 0?!).
Fails consistently with no flakiness (worth mentioning that it's only a couple of tests, so maybe the tests are implemented incorrectly):
Maybe it's only me, but I'm not sure I understand why there's no alternative being proposed for widely used patterns. If we can't use
If that were true, several users would be protesting with blocker bugs on the Kafka side, but I haven't seen such a situation (polling without assignment is not possible). I've seen several heavy users who are doing the polling (and no other magic) with
Anyway, I'm going to take two weeks of vacation, but after that I'm going to sit down with the Kafka guys and take a deeper look at this (again, only a couple of Spark tests are failing consistently, and it may be that we're not using the API properly).
Here's part of the test code Kafka has been using with the new poll.
Kafka still has some parts of its test code relying on the deprecated API. Sometimes Kafka calls
In many cases of calling
I guess in our case we need to either leverage
Btw, I'm seeing KIP-288, which proposed a new public API
First of all thanks @HeartSaVioR for the deep look, it helped! @zsxwing I've had another round with the Kafka guys on this and here are the conclusions:
Here is the client log:
To resolve this situation I've considered mainly 2 things:
The second approach is overkill from my point of view. Such configuration can be added later if we see the need.
Test build #108719 has finished for PR 25135 at commit
Test build #108720 has finished for PR 25135 at commit
The latest changes look good to me.
Maybe we should document that it might be safer to set CONSUMER_POLL_TIMEOUT
to a higher value so that it can account for the metadata update as well. Not sure where the best place for such guidance is; maybe the release notes? Let's hear the committers' voices.
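For reference, CONSUMER_POLL_TIMEOUT corresponds to the Kafka source option `kafkaConsumer.pollTimeoutMs`. A sketch of what raising it would look like on the read side; the broker address, topic, and timeout value are illustrative placeholders, not from this PR:

```java
// Sketch only: raise the Kafka source poll timeout so that metadata updates
// have more headroom before the call gives up. All values are placeholders.
Dataset<Row> df = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   // placeholder address
    .option("subscribe", "topic1")                      // placeholder topic
    .option("kafkaConsumer.pollTimeoutMs", "300000")    // illustrative value
    .load();
```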
private def getPartitions(): ju.Set[TopicPartition] = {
  consumer.poll(jt.Duration.ZERO)
  var partitions = consumer.assignment()
  val startTimeMs = System.currentTimeMillis()
For this kind of logic it's better to use System.nanoTime()
which is monotonic. Also you can do a little less computation this way:
val deadline = System.nanoTime() + someTimeout
while (... && System.nanoTime() < deadline) {
}
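The suggested deadline pattern can be sketched as a self-contained, runnable example; `pollForAssignment` is a hypothetical stub standing in for the consumer poll plus assignment lookup, not the real Kafka client:

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;

public class DeadlineLoop {
    static int calls = 0;

    // Stub standing in for consumer.poll(...) followed by consumer.assignment():
    // returns a non-empty set on the third call, simulating metadata that
    // arrives after a short delay. (Hypothetical; not the real Kafka client.)
    static Set<Integer> pollForAssignment() {
        calls++;
        Set<Integer> partitions = new HashSet<>();
        if (calls >= 3) {
            partitions.add(0);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Compute the deadline once from the monotonic clock; unlike
        // System.currentTimeMillis(), System.nanoTime() is unaffected by
        // wall-clock adjustments (NTP corrections, manual changes).
        long deadline = System.nanoTime() + Duration.ofSeconds(2).toNanos();

        Set<Integer> partitions = pollForAssignment();
        while (partitions.isEmpty() && System.nanoTime() < deadline) {
            partitions = pollForAssignment();
        }
        System.out.println(partitions); // [0] since the assignment arrived before the deadline
    }
}
```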
Good point. Since @zsxwing suggested the new API usage, I would wait here and check the Kafka side.
val startTimeMs = System.currentTimeMillis()
while (partitions.isEmpty && System.currentTimeMillis() - startTimeMs < pollTimeoutMs) {
  // Poll to get the latest assigned partitions
  consumer.poll(jt.Duration.ofMillis(100))
So using this new API will pull data to the driver, right? The previous poll(0)
is basically a hack to avoid fetching data to the driver. Maybe we should ask the Kafka community to add a new API to pull metadata only.
Good point. While the Kafka doc says the behavior of such a hack has been nondeterministic and Kafka never supported it officially, we rely on that behavior anyway.
I've initiated a thread to ask about viable alternatives to poll(0)
and the possibility of adding a public API to update metadata only.
https://lists.apache.org/thread.html/017cf631ef981ab1b494b1249be5c11d7edfe5f4867770a18188ebdc@%3Cdev.kafka.apache.org%3E
I'm aware of that, but since the doc says:
there is no guarantee that poll(0) won't return records the first time it's called
I've considered the poll(0)
usage a design decision, with no problem if a small amount of data comes. Since you say it is not guaranteed but has been working like that all the time, the situation is different.
@HeartSaVioR thanks for initiating the discussion and let's see where it goes.
Yeah, you're right, that's true as well. One slight difference between the two is that we are providing a small timeout (not 0), and we don't know how much of the remaining timeout would be spent polling records (instead of polling metadata). It would be unlikely to be exactly 0, as the call would time out if it went below 0.
To give a small summary of the discussion in the Kafka dev channel, the proposal that tends toward agreement is the following:
Worth mentioning that such filtering happens on the client side in
@gaborgsomogyi thanks for coordinating with the Kafka community. I think we can close this PR for now and revisit it when they have a release with the new API.
@zsxwing yep, not much to do for now. Thanks for your help finding this out.
@HeartSaVioR gave me the pointer to this PR.
Sorry for the confusion. The KIP-396 proposal was accepted in time, but the code seems to have been merged after the code freeze of Kafka 2.4. (Though it's just 4 days later... I thought it was merged in time.)
Thank you for the updates, @HeartSaVioR.
@gaborgsomogyi just for my understanding: since upstream Spark now uses the Kafka 2.5 client, is there no obstacle to using the new KafkaConsumer.poll API in the Kafka connector? Thanks for this improvement!
@Tagar In short, it's OK to use the new API in executors (we may need to set a longer timeout then), but it's still not ideal to use it in the driver. It requires Spark to implement the subscription modes by itself with the AdminClient API, which is not impossible but brings complexity to match the same behavior we expect from the Kafka client.
That said, it can be unblocked if we are seriously concerned about the infinite timeout (i.e., we see the behavior fairly often) and consider it critical enough to tolerate the additional complexity brought to the driver side. I'm even OK with going with a partial fix: address the executor usage first and leave an issue open for the driver. This would greatly reduce the chance of an infinite timeout, since only the Kafka consumer in the driver could then hang. @zsxwing @gaborgsomogyi WDYT?
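For context, the driver-side alternative mentioned above, resolving partitions via the AdminClient instead of a consumer, could start from something roughly like this. This is only a sketch: it assumes a plain topic list, needs a reachable broker and the kafka-clients dependency, and deliberately omits pattern subscription and error handling, which are the hard parts.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartition;

// Sketch only: resolve topic partitions via the admin client, never touching
// a consumer, so no records can ever be fetched on the driver.
final class PartitionLookup {
    static Set<TopicPartition> fetchPartitions(String bootstrapServers, List<String> topics)
            throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        try (AdminClient admin = AdminClient.create(props)) {
            Set<TopicPartition> result = new HashSet<>();
            for (TopicDescription desc : admin.describeTopics(topics).all().get().values()) {
                desc.partitions().forEach(
                    p -> result.add(new TopicPartition(desc.name(), p.partition())));
            }
            return result;
        }
    }
}
```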
@HeartSaVioR I think it makes sense to split up the problem, so I've created sub-tasks, and the executor-side PR is on the way. The driver side requires further attention and discussion with the Kafka folks. Re-implementing the subscription in Spark is overkill even if we want to avoid the infinite wait. We need another way...
@gaborgsomogyi and @HeartSaVioR thanks. Saw the first PR for SPARK-32033; that's great!
What changes were proposed in this pull request?
Spark uses an old, deprecated API named
KafkaConsumer.poll(long)
which never returns and stays in a livelock if the metadata is not updated (for instance, when the broker disappears at consumer creation). Please see the Kafka documentation and the standalone test application for further details. In this PR I've applied the new
KafkaConsumer.poll(Duration)
API.

How was this patch tested?
Existing unit tests + started Kafka to Kafka query on cluster.
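The behavioral difference described above can be illustrated with a hypothetical stand-in. `StubConsumer` is invented for this sketch and is not the real KafkaConsumer: the real poll(long) blocks inside the client on a metadata update it cannot bound, which the stub merely simulates with a flag.

```java
import java.time.Duration;

// Hypothetical stand-in illustrating the semantic difference this PR relies
// on: with the deprecated poll(long), the timeout did not bound metadata
// fetching, so a vanished broker could block the call forever; with
// poll(Duration), the whole call (metadata update included) is bounded and
// returns empty once the duration elapses.
class StubConsumer {
    private final boolean metadataAvailable;

    StubConsumer(boolean metadataAvailable) {
        this.metadataAvailable = metadataAvailable;
    }

    // New-style poll: gives up when the deadline passes instead of looping
    // forever on a metadata update that never completes.
    int poll(Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!metadataAvailable && System.nanoTime() < deadline) {
            // spin, standing in for network I/O
        }
        return metadataAvailable ? 1 : 0; // number of records fetched
    }
}

public class PollDemo {
    public static void main(String[] args) {
        // Broker "gone": the old API would livelock here; the new one returns.
        StubConsumer noBroker = new StubConsumer(false);
        System.out.println(noBroker.poll(Duration.ofMillis(50))); // 0

        StubConsumer healthy = new StubConsumer(true);
        System.out.println(healthy.poll(Duration.ofMillis(50))); // 1
    }
}
```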