
[SPARK-28367][SS] Use new KafkaConsumer.poll API in Kafka connector #25135

Closed
wants to merge 6 commits

Conversation

gaborgsomogyi
Contributor

@gaborgsomogyi gaborgsomogyi commented Jul 12, 2019

What changes were proposed in this pull request?

Spark uses an old and deprecated API, KafkaConsumer.poll(long), which may never return and can end up in a live lock when metadata is not updated (for instance, when the broker disappears at consumer creation). Please see the Kafka documentation and the standalone test application for further details.

In this PR I've switched to the new KafkaConsumer.poll(Duration) API.
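The difference can be sketched with a toy stand-in (FakeConsumer below is hypothetical, not the real Kafka client): poll(long) could block indefinitely while waiting for metadata, whereas poll(Duration) returns (possibly empty) once the timeout elapses, so the caller keeps control and can retry a bounded number of times:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.List;

// Hedged sketch: FakeConsumer stands in for KafkaConsumer to show why a
// bounded poll(Duration) loop cannot live-lock the way poll(long) could.
public class PollSketch {
    static class FakeConsumer {
        private long waitedMs = 0;
        private final long metadataReadyAtMs;
        FakeConsumer(long metadataReadyAtMs) { this.metadataReadyAtMs = metadataReadyAtMs; }
        List<String> poll(Duration timeout) {
            waitedMs += timeout.toMillis();                       // simulate bounded waiting
            return waitedMs >= metadataReadyAtMs
                ? List.of("record-0") : Collections.emptyList();  // empty until metadata is in
        }
    }

    // Retry with a bounded number of short polls instead of one endless poll(long).
    static int pollsUntilRecords(long metadataReadyAtMs, long pollMs, int maxPolls) {
        FakeConsumer consumer = new FakeConsumer(metadataReadyAtMs);
        for (int polls = 1; polls <= maxPolls; polls++) {
            if (!consumer.poll(Duration.ofMillis(pollMs)).isEmpty()) return polls;
        }
        return -1;  // gave up; the caller stays in control and decides what to do
    }

    public static void main(String[] args) {
        System.out.println(pollsUntilRecords(300, 100, 10));  // 3
    }
}
```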

How was this patch tested?

Existing unit tests + a Kafka-to-Kafka query started on a cluster.

@SparkQA

SparkQA commented Jul 12, 2019

Test build #107595 has finished for PR 25135 at commit ad1863a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

cc @HeartSaVioR

Contributor

@HeartSaVioR HeartSaVioR left a comment


The change mostly looks good. Some minor comments.

Btw, just wondering: if you managed to test a situation where the metadata update is delayed (and finally times out), what would happen? Does the query fail, or are the relevant tasks reattempted?

@gaborgsomogyi
Contributor Author

Btw, just wondering: if you managed to test a situation where the metadata update is delayed (and finally times out), what would happen? Does the query fail, or are the relevant tasks reattempted?

I presume you're interested in the new behavior. If an exception arrives, withRetriesWithoutInterrupt catches all NonFatal exceptions and takes care of the reattempt.
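The retry idea behind withRetriesWithoutInterrupt can be sketched roughly like this (the real Spark helper also handles interrupts and uses Scala's NonFatal check; this simplified stand-in just catches runtime exceptions and reattempts the body):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch of "catch non-fatal failures and reattempt":
// a poll that fails on missing metadata leads to a retry, not a query failure.
public class RetrySketch {
    static <T> T withRetries(int maxAttempts, Supplier<T> body) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return body.get();
            } catch (RuntimeException nonFatal) {  // stand-in for the NonFatal check
                last = nonFatal;                   // remember the failure and retry
            }
        }
        throw last;  // all attempts failed (assumes maxAttempts >= 1): propagate
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        String result = withRetries(3, () -> {
            if (calls.incrementAndGet() < 3) throw new IllegalStateException("metadata not ready");
            return "ok after " + calls.get() + " attempts";
        });
        System.out.println(result);  // ok after 3 attempts
    }
}
```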

@SparkQA

SparkQA commented Jul 15, 2019

Test build #107681 has finished for PR 25135 at commit f92c758.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

Ah OK there's already a retry loop. Thanks for the info.

Contributor

@HeartSaVioR HeartSaVioR left a comment


LGTM

@HeartSaVioR
Contributor

LGTM again.

@SparkQA

SparkQA commented Jul 16, 2019

Test build #107731 has finished for PR 25135 at commit 2968c2a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA

SparkQA commented Jul 16, 2019

Test build #107732 has finished for PR 25135 at commit 2968c2a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

Created a quick fix #25171

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA

SparkQA commented Jul 16, 2019

Test build #107735 has finished for PR 25135 at commit 2968c2a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

Checking the issue...

@SparkQA

SparkQA commented Jul 16, 2019

Test build #107739 has finished for PR 25135 at commit 2ed24d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

A little explanation of the last change:

  • While I was debugging the issue I realized the requirement failed message is not very informative, so I added a meaningful one.
  • When consumer.poll(Duration.ZERO) was used with Thread.sleep(100), under some circumstances even 2 minutes was not enough to get the assignment. To test this I put the standalone application into a 1k-execution loop. The end result is that with a non-zero timeout (including Duration.ofMillis(1)) everything works fine.

So the final judgement is that the current implementation works, but as a side track I'm going to analyze the consumer.poll(Duration.ZERO) behavior with the Kafka guys.
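One toy model of that observation (FakePoller below is hypothetical, only an assumption for illustration): suppose the assignment machinery makes progress only while poll() is actually waiting. Then a poll(Duration.ZERO) + Thread.sleep loop never finishes, while even a 1 ms timeout eventually does:

```java
import java.time.Duration;

// Sketch under an assumed model: progress happens only inside poll's wait.
public class ZeroTimeoutSketch {
    static class FakePoller {
        private long budgetMs = 0;
        private final long neededMs;
        FakePoller(long neededMs) { this.neededMs = neededMs; }
        void poll(Duration timeout) { budgetMs += timeout.toMillis(); }
        boolean assigned() { return budgetMs >= neededMs; }
    }

    // Returns true if the assignment shows up within maxIterations polls.
    static boolean tryAssign(Duration pollTimeout, int maxIterations) {
        FakePoller poller = new FakePoller(5);   // needs 5 ms of in-poll waiting
        for (int i = 0; i < maxIterations && !poller.assigned(); i++) {
            poller.poll(pollTimeout);            // a Thread.sleep between polls
        }                                        // would add no in-poll budget
        return poller.assigned();
    }

    public static void main(String[] args) {
        System.out.println(tryAssign(Duration.ZERO, 1000));        // false
        System.out.println(tryAssign(Duration.ofMillis(1), 1000)); // true
    }
}
```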

@zsxwing
Member

zsxwing commented Jul 17, 2019

@gaborgsomogyi yeah, I think it's better to ask the Kafka guys to clarify how to use this new poll API. It seems it cannot be used directly without retrying.

@gaborgsomogyi
Contributor Author

@zsxwing the main approach was created in agreement with the Kafka guys. The only thing that bothers me is why the following approaches produce different results (maybe an edge case is hit with 0?!).
Works:

...
    consumer.poll(jt.Duration.ZERO)
    var partitions = consumer.assignment()
    while (partitions.isEmpty && System.currentTimeMillis() - startTimeMs < pollTimeoutMs) {
      consumer.poll(jt.Duration.ofMillis(100))
      partitions = consumer.assignment()
    }
...

Fails consistently with no flakiness (worth mentioning that only a couple of tests do, so maybe the tests are implemented incorrectly):

...
    var partitions = Set.empty[TopicPartition].asJava
    while (partitions.isEmpty && System.currentTimeMillis() - startTimeMs < pollTimeoutMs) {
      consumer.poll(jt.Duration.ZERO)
      partitions = consumer.assignment()
      if (partitions.isEmpty) {
        Thread.sleep(100)
      }
    }
...

@HeartSaVioR
Contributor

It feels to me like consumer.poll(jt.Duration.ZERO) just doesn't work.

Maybe it's only me, but I'm not sure I understand why no alternative has been proposed for such widely used patterns. If we can't use consumer.poll(jt.Duration.ZERO) to avoid an unnecessary wait when there's no record to pull, I feel Kafka should have a separate public API for explicitly saying "I don't need records, but I'm interested in other things like metadata".

@gaborgsomogyi
Contributor Author

gaborgsomogyi commented Jul 18, 2019

It feels to me like consumer.poll(jt.Duration.ZERO) just doesn't work.

If that were true, several users would be protesting with blocker bugs on the Kafka side, but I haven't seen such a situation (polling without assignment is not possible). I've seen several heavy users who do their polling (and no other magic) with consumer.poll(Duration.ZERO). The use-case I've seen is something like this:

while (!interrupted) {
  val data = consumer.poll(Duration.ZERO)
  process(data)
}

Anyway, I'm going on a 2-week vacation, but after that I'm going to sit down with the Kafka guys and take a deeper look at this (again, only a couple of Spark tests are failing consistently, and it may turn out we're not using the API properly).

@HeartSaVioR
Contributor

HeartSaVioR commented Jul 23, 2019

Here's part of the test code Kafka uses with the new poll.

https://github.com/apache/kafka/blob/f98e176746d663fadedbcd3c18312a7f476a20c8/core/src/test/scala/integration/kafka/api/PlaintextConsumerTest.scala#L1748-L1752

  private def awaitAssignment(consumer: Consumer[_, _], expectedAssignment: Set[TopicPartition]): Unit = {
    TestUtils.pollUntilTrue(consumer, () => consumer.assignment() == expectedAssignment.asJava,
      s"Timed out while awaiting expected assignment $expectedAssignment. " +
        s"The current assignment is ${consumer.assignment()}")
  }

https://github.com/apache/kafka/blob/f98e176746d663fadedbcd3c18312a7f476a20c8/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L767-L775

  def pollUntilTrue(consumer: Consumer[_, _],
                    action: () => Boolean,
                    msg: => String,
                    waitTimeMs: Long = JTestUtils.DEFAULT_MAX_WAIT_MS): Unit = {
    waitUntilTrue(() => {
      consumer.poll(Duration.ofMillis(50))
      action()
    }, msg = msg, pause = 0L, waitTimeMs = waitTimeMs)
  }

Kafka still has some test code relying on the deprecated poll(0) (so poll(Duration) and poll(long) are used side by side). There might be no technical reason to do so, but they're still relying on the old flavor, which might indicate a real need for poll(0)-style usage.

Sometimes Kafka calls updateAssignmentMetadataIfNeeded directly, which handles the metadata update done in poll() with a max-long timer, effectively blocking. The method is meant for testing: it's defined as package private.

consumer.updateAssignmentMetadataIfNeeded(time.timer(Long.MAX_VALUE));

In many test-code cases that call poll(Duration.ZERO), updateAssignmentMetadataIfNeeded is called beforehand. In the other cases the verification code just seems to confirm that calling poll does nothing or returns already-fetched records.

I guess in our case we need to either leverage updateAssignmentMetadataIfNeeded to deal with metadata only (it may require some hacks, and they clarified it's for testing, so this is not the one for us), or poll with a small timeout (50 ms) and tolerate the case where no record is available to pull (which is incorporated into the latency regardless of the availability of metadata).
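The second option can be sketched like this (FakeAssignmentConsumer is a hypothetical stand-in for KafkaConsumer and its assignment() method): keep polling in small slices, as Kafka's own pollUntilTrue does with 50 ms, until the assignment shows up or an overall deadline passes:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Set;

// Sketch of "poll with a small timeout until assigned or deadline".
public class AssignmentWaitSketch {
    static class FakeAssignmentConsumer {
        private long elapsedMs = 0;
        private final long assignmentReadyAtMs;
        FakeAssignmentConsumer(long readyAtMs) { this.assignmentReadyAtMs = readyAtMs; }
        void poll(Duration timeout) { elapsedMs += timeout.toMillis(); }  // simulated wait
        Set<String> assignment() {
            return elapsedMs >= assignmentReadyAtMs
                ? Set.of("topic-0") : Collections.emptySet();
        }
    }

    // Poll in 50 ms slices up to deadlineMs; may return empty on timeout.
    static Set<String> awaitAssignment(FakeAssignmentConsumer consumer, long deadlineMs) {
        long waitedMs = 0;
        Set<String> partitions = consumer.assignment();
        while (partitions.isEmpty() && waitedMs < deadlineMs) {
            consumer.poll(Duration.ofMillis(50));
            waitedMs += 50;
            partitions = consumer.assignment();
        }
        return partitions;  // the caller must tolerate an empty result
    }

    public static void main(String[] args) {
        System.out.println(awaitAssignment(new FakeAssignmentConsumer(200), 1000));
    }
}
```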

Btw, I'm seeing KIP-288, which proposed a new public API waitForAssignment similar to updateAssignmentMetadataIfNeeded, but it was discarded since KIP-266 superseded KIP-288, and KIP-266 didn't add it in the end. Not sure whether it was declined or just missed.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-288%3A+%5BDISCARDED%5D+Consumer.poll%28%29+timeout+semantic+change+and+new+waitForAssignment+method

@gaborgsomogyi
Contributor Author

First of all, thanks @HeartSaVioR for the deep look; it helped!

@zsxwing I've had another round with the Kafka guys on this, and here are the conclusions:

  • Calling poll(Duration... in a loop is the suggested solution.
  • The second approach failed for the following reason: originally the partition assignment was synchronous in the old API (and could hang infinitely), but in the new implementation a proper timeout is applied. In the mentioned tests "kafkaConsumer.pollTimeoutMs" -> "1000" is set. Since the partition assignment takes around 1000 ms on my machine (and seemingly on Jenkins as well), it was mainly race luck whether a test passed or not. The first approach landed more on the success side, the second more on the failing side.

Here is the client log:

19/08/05 06:55:27.650 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE NetworkClient: [Consumer clientId=consumer-522, groupId=spark-kafka-source-42d70d54-e5ce-453a-a84c-40cd8126a3ff--1337036482-driver-2] Sending JOIN_GROUP {group_id=spark-kafka-source-42d70d54-e5ce-45$
19/08/05 06:55:28.758 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE KafkaConsumer: [Consumer clientId=consumer-522, groupId=spark-kafka-source-42d70d54-e5ce-453a-a84c-40cd8126a3ff--1337036482-driver-2] Closing the Kafka consumer
19/08/05 06:55:28.758 kafka-coordinator-heartbeat-thread | spark-kafka-source-42d70d54-e5ce-453a-a84c-40cd8126a3ff--1337036482-driver-2 DEBUG AbstractCoordinator: [Consumer clientId=consumer-522, groupId=spark-kafka-source-42d70d54-e5ce-453a-a84c-40cd8126a3ff--1337036482-driver-2] Heartbeat thread has closed
19/08/05 06:55:28.758 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE NetworkClient: [Consumer clientId=consumer-522, groupId=spark-kafka-source-42d70d54-e5ce-453a-a84c-40cd8126a3ff--1337036482-driver-2] Completed receive from node 0 for METADATA with correlation id 6$
19/08/05 06:55:28.759 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metadata: [Consumer clientId=consumer-522, groupId=spark-kafka-source-42d70d54-e5ce-453a-a84c-40cd8126a3ff--1337036482-driver-2] Determining if we should replace existing epoch 0 with new epoch 0
19/08/05 06:55:28.759 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] DEBUG Metadata: [Consumer clientId=consumer-522, groupId=spark-kafka-source-42d70d54-e5ce-453a-a84c-40cd8126a3ff--1337036482-driver-2] Updating last seen epoch from 0 to 0 for partition failOnDataLoss-0-0
19/08/05 06:55:28.759 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] DEBUG Metadata: [Consumer clientId=consumer-522, groupId=spark-kafka-source-42d70d54-e5ce-453a-a84c-40cd8126a3ff--1337036482-driver-2] Updated cluster metadata updateVersion 3 to MetadataCache{cluster=Clu$
19/08/05 06:55:28.760 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE NetworkClient: [Consumer clientId=consumer-522, groupId=spark-kafka-source-42d70d54-e5ce-453a-a84c-40cd8126a3ff--1337036482-driver-2] Completed receive from node 2147483647 for JOIN_GROUP with corre$
19/08/05 06:55:28.760 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=connection-count, group=consumer-metrics, description=The current number of active connections., tags={client-id=consumer-522}]
19/08/05 06:55:28.760 data-plane-kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-1 DEBUG Selector: [SocketServer brokerId=0] Connection with /127.0.0.1 disconnected
java.io.EOFException
        at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:96)
        at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:424)
        at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:385)
        at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:651)
        at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:572)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:483)
        at kafka.network.Processor.poll(SocketServer.scala:863)
        at kafka.network.Processor.run(SocketServer.scala:762)
        at java.lang.Thread.run(Thread.java:748)
19/08/05 06:55:28.760 data-plane-kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-0 DEBUG Selector: [SocketServer brokerId=0] Connection with /127.0.0.1 disconnected
java.io.EOFException
        ... (same stack trace as above; the disconnect repeats on network threads 0 and 2, followed by a long run of TRACE/DEBUG lines removing consumer-522 metrics and sensors, trimmed here)
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=response-total, group=consumer-node-metrics, description=The total number of responses received, tags={client-id=consumer-522, node-id=node--1}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=response-rate, group=consumer-node-metrics, description=The number of responses received per second, tags={client-id=consumer-522, node-id=node--1}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] DEBUG Metrics: Removed sensor with name node--1.bytes-received
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-latency-avg, group=consumer-node-metrics, description=, tags={client-id=consumer-522, node-id=node--1}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-latency-max, group=consumer-node-metrics, description=, tags={client-id=consumer-522, node-id=node--1}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] DEBUG Metrics: Removed sensor with name node--1.latency
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=outgoing-byte-total, group=consumer-node-metrics, description=The total number of outgoing bytes, tags={client-id=consumer-522, node-id=node-0}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=outgoing-byte-rate, group=consumer-node-metrics, description=The number of outgoing bytes per second, tags={client-id=consumer-522, node-id=node-0}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-total, group=consumer-node-metrics, description=The total number of requests sent, tags={client-id=consumer-522, node-id=node-0}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-rate, group=consumer-node-metrics, description=The number of requests sent per second, tags={client-id=consumer-522, node-id=node-0}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-size-avg, group=consumer-node-metrics, description=The average size of requests sent., tags={client-id=consumer-522, node-id=node-0}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-size-max, group=consumer-node-metrics, description=The maximum size of any request sent., tags={client-id=consumer-522, node-id=node-0}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] DEBUG Metrics: Removed sensor with name node-0.bytes-sent
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=incoming-byte-total, group=consumer-node-metrics, description=The total number of incoming bytes, tags={client-id=consumer-522, node-id=node-0}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=incoming-byte-rate, group=consumer-node-metrics, description=The number of incoming bytes per second, tags={client-id=consumer-522, node-id=node-0}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=response-total, group=consumer-node-metrics, description=The total number of responses received, tags={client-id=consumer-522, node-id=node-0}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=response-rate, group=consumer-node-metrics, description=The number of responses received per second, tags={client-id=consumer-522, node-id=node-0}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] DEBUG Metrics: Removed sensor with name node-0.bytes-received
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-latency-avg, group=consumer-node-metrics, description=, tags={client-id=consumer-522, node-id=node-0}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-latency-max, group=consumer-node-metrics, description=, tags={client-id=consumer-522, node-id=node-0}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] DEBUG Metrics: Removed sensor with name node-0.latency
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=outgoing-byte-total, group=consumer-node-metrics, description=The total number of outgoing bytes, tags={client-id=consumer-522, node-id=node-2147483647$
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=outgoing-byte-rate, group=consumer-node-metrics, description=The number of outgoing bytes per second, tags={client-id=consumer-522, node-id=node-214748$
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-total, group=consumer-node-metrics, description=The total number of requests sent, tags={client-id=consumer-522, node-id=node-2147483647}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-rate, group=consumer-node-metrics, description=The number of requests sent per second, tags={client-id=consumer-522, node-id=node-2147483647}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-size-avg, group=consumer-node-metrics, description=The average size of requests sent., tags={client-id=consumer-522, node-id=node-2147483647}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-size-max, group=consumer-node-metrics, description=The maximum size of any request sent., tags={client-id=consumer-522, node-id=node-2147483647$
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] DEBUG Metrics: Removed sensor with name node-2147483647.bytes-sent
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=incoming-byte-total, group=consumer-node-metrics, description=The total number of incoming bytes, tags={client-id=consumer-522, node-id=node-2147483647$
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=incoming-byte-rate, group=consumer-node-metrics, description=The number of incoming bytes per second, tags={client-id=consumer-522, node-id=node-214748$
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=response-total, group=consumer-node-metrics, description=The total number of responses received, tags={client-id=consumer-522, node-id=node-2147483647}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=response-rate, group=consumer-node-metrics, description=The number of responses received per second, tags={client-id=consumer-522, node-id=node-2147483$
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] DEBUG Metrics: Removed sensor with name node-2147483647.bytes-received
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-latency-avg, group=consumer-node-metrics, description=, tags={client-id=consumer-522, node-id=node-2147483647}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=request-latency-max, group=consumer-node-metrics, description=, tags={client-id=consumer-522, node-id=node-2147483647}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] DEBUG Metrics: Removed sensor with name node-2147483647.latency
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=version, group=app-info, description=Metric indicating version, tags={client-id=consumer-522}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=commit-id, group=app-info, description=Metric indicating commit-id, tags={client-id=consumer-522}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] TRACE Metrics: Removed metric named MetricName [name=start-time-ms, group=app-info, description=Metric indicating start-time-ms, tags={client-id=consumer-522}]
19/08/05 06:55:28.762 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] DEBUG KafkaConsumer: [Consumer clientId=consumer-522, groupId=spark-kafka-source-42d70d54-e5ce-453a-a84c-40cd8126a3ff--1337036482-driver-2] Kafka consumer has been closed
19/08/05 06:55:28.769 stream execution thread for DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] ERROR MicroBatchExecution: Query DontFailOnDataLoss [id = 9396b4b5-bc30-4e1e-8c6a-1d3a9d890ccf, runId = 5b1ee39c-c174-4c42-b0d5-1c4b59061ab3] terminated with error
java.lang.IllegalArgumentException: requirement failed
        at scala.Predef$.require(Predef.scala:268)
        at org.apache.spark.sql.kafka010.KafkaOffsetReader.getPartitions(KafkaOffsetReader.scala:431)
        at org.apache.spark.sql.kafka010.KafkaOffsetReader.$anonfun$fetchSpecificOffsets$2(KafkaOffsetReader.scala:172)
        at org.apache.spark.sql.kafka010.KafkaOffsetReader.$anonfun$withRetriesWithoutInterrupt$1(KafkaOffsetReader.scala:383)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
        at org.apache.spark.sql.kafka010.KafkaOffsetReader.withRetriesWithoutInterrupt(KafkaOffsetReader.scala:382)
        at org.apache.spark.sql.kafka010.KafkaOffsetReader.$anonfun$fetchSpecificOffsets$1(KafkaOffsetReader.scala:171)
        at org.apache.spark.sql.kafka010.KafkaOffsetReader.runUninterruptibly(KafkaOffsetReader.scala:350)
        at org.apache.spark.sql.kafka010.KafkaOffsetReader.fetchSpecificOffsets(KafkaOffsetReader.scala:171)
        at org.apache.spark.sql.kafka010.KafkaSource.$anonfun$initialPartitionOffsets$1(KafkaSource.scala:136)
        at scala.Option.getOrElse(Option.scala:138)
        at org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets$lzycompute(KafkaSource.scala:132)
        at org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets(KafkaSource.scala:100)
        at org.apache.spark.sql.kafka010.KafkaSource.getOffset(KafkaSource.scala:151)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$3(MicroBatchExecution.scala:361)
        at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:328)
        at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:326)
        at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)

@gaborgsomogyi
Copy link
Contributor Author

To resolve this situation I mainly considered two things:

  • Increase kafkaConsumer.pollTimeoutMs
  • Introduce an additional timeout for partition assignment

The second approach seems like overkill from my point of view; such a configuration can be added later if we see the need.
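For reference, the first option uses the already existing kafkaConsumer.pollTimeoutMs source option. A minimal sketch of setting it when starting a query; the broker address, topic name, and timeout value below are placeholders, not recommendations:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RaisePollTimeout {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("kafka-poll-timeout")
            .getOrCreate();

        // Raise the consumer poll timeout so a slow metadata update has more
        // headroom before the source gives up; 300000 ms is only an example.
        Dataset<Row> df = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092") // placeholder
            .option("subscribe", "some-topic")                // placeholder
            .option("kafkaConsumer.pollTimeoutMs", "300000")
            .load();

        df.printSchema();
    }
}
```

This is a configuration fragment and needs a running Spark session plus the spark-sql-kafka dependency to execute.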

@SparkQA
Copy link

SparkQA commented Aug 6, 2019

Test build #108719 has finished for PR 25135 at commit 69c7ca5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 6, 2019

Test build #108720 has finished for PR 25135 at commit 82f1520.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment


The latest changes look good to me.

Maybe we should advise setting CONSUMER_POLL_TIMEOUT to a higher value so that it can account for metadata updates as well. I'm not sure where the best place for that guidance is; maybe the release notes? Let's hear the committers' opinions.

private def getPartitions(): ju.Set[TopicPartition] = {
  consumer.poll(jt.Duration.ZERO)
  var partitions = consumer.assignment()
  val startTimeMs = System.currentTimeMillis()
Copy link
Contributor


For this kind of logic it's better to use System.nanoTime(), which is monotonic. You can also do a little less computation this way:

val deadline = System.nanoTime() + someTimeout
while (... && System.nanoTime() < deadline) {

}

Copy link
Contributor Author


Good point. Since @zsxwing suggested using the new API, I'll wait here and check the Kafka side.
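The monotonic-deadline suggestion above can be sketched as a self-contained loop; the `ready` supplier and the short sleep stand in for the real partition-assignment check and the consumer.poll(Duration.ofMillis(100)) call (both stand-ins are assumptions of this sketch, not the actual connector code):

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public class DeadlineLoop {
    // Waits until `ready` becomes true or the timeout elapses, computing the
    // deadline once from the monotonic clock instead of re-deriving the
    // elapsed time on every iteration.
    static boolean waitUntil(BooleanSupplier ready, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!ready.getAsBoolean() && System.nanoTime() < deadline) {
            try {
                Thread.sleep(10); // stand-in for consumer.poll(...)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ready.getAsBoolean();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Becomes true roughly 50 ms after start.
        BooleanSupplier ready =
            () -> System.nanoTime() - start > Duration.ofMillis(50).toNanos();
        System.out.println(waitUntil(ready, Duration.ofMillis(500)));       // true
        System.out.println(waitUntil(() -> false, Duration.ofMillis(100))); // false
    }
}
```

Because System.nanoTime() is monotonic, the loop is immune to wall-clock adjustments that could stretch or shrink a System.currentTimeMillis()-based timeout.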

val startTimeMs = System.currentTimeMillis()
while (partitions.isEmpty && System.currentTimeMillis() - startTimeMs < pollTimeoutMs) {
  // Poll to get the latest assigned partitions
  consumer.poll(jt.Duration.ofMillis(100))
Copy link
Member

@zsxwing zsxwing Aug 6, 2019


So using this new API will pull data to the driver, right? The previous poll(0) was basically a hack to avoid fetching data to the driver. Maybe we should ask the Kafka community to add a new API that pulls only metadata.

Copy link
Contributor


Good point. While the Kafka doc says the behavior of such a hack has been nondeterministic and Kafka has never supported it officially, we rely on that behavior anyway.

I've started a thread to ask about viable alternatives to poll(0) and the possibility of adding a public API that updates only metadata:
https://lists.apache.org/thread.html/017cf631ef981ab1b494b1249be5c11d7edfe5f4867770a18188ebdc@%3Cdev.kafka.apache.org%3E

Copy link
Contributor Author


I'm aware of that, but since the doc says:

there is no guarantee that poll(0) won't return records the first time it's called

I treated the poll(0) usage as a design decision, and it's no problem if a small amount of data comes back. Since you say this was never guaranteed but has been working that way all along, the situation is different.

@HeartSaVioR thanks for initiating the discussion; let's see where it goes.

Copy link
Contributor


Yeah, you're right, that's true as well. One slight difference between the two is that we provide a small timeout (not 0), and we don't know how much of the remaining timeout gets spent polling records instead of polling metadata. It is unlikely to be exactly 0, since the call would time out if it went below 0.

@gaborgsomogyi
Copy link
Contributor Author

To summarize the discussion on the Kafka dev channel, the proposal that is converging toward agreement is the following:

  • Call AdminClient.listTopics to get the candidate topics for a subscribe pattern (for assign and subscribe such a step is not needed). Here there are 2 possibilities:
  1. Do the filtering in Spark code
  2. Kafka adds regex support to the mentioned API

Worth mentioning that such filtering already happens on the client side in the Consumer, so this would not cause any unwanted extra memory consumption.

  • KIP-396 would add an AdminClient.listOffsets API from which offsets can be obtained without polling data.
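As a rough, hypothetical illustration of option 1 (this is not Spark's actual code), the client-side filtering could look like the following, where `allTopics` stands in for the names returned by AdminClient.listTopics():

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TopicFilter {
    // Client-side regex filtering of topic names, mirroring what the Consumer
    // already does internally for subscribe patterns.
    static List<String> matching(List<String> allTopics, String subscribePattern) {
        Pattern p = Pattern.compile(subscribePattern);
        return allTopics.stream()
                .filter(t -> p.matcher(t).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> topics = List.of("events-1", "events-2", "audit", "events-dlq");
        System.out.println(matching(topics, "events-.*")); // [events-1, events-2, events-dlq]
    }
}
```

Since the full topic list is already materialized on the client today, moving this filtering into Spark would not change the memory profile.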

@zsxwing
Copy link
Member

zsxwing commented Aug 13, 2019

@gaborgsomogyi thanks for coordinating with the Kafka community. I think we can close this PR for now and revisit it when they have a release with the new API.

@gaborgsomogyi
Copy link
Contributor Author

@zsxwing yep, not much to do for now. Thanks for your help finding this out.
@HeartSaVioR thanks for initiating + pushing the discussion forward.
I'm going to open a new PR when the API is available...

@dongjoon-hyun
Copy link
Member

@HeartSaVioR gave me the pointer to this PR.
Since Apache Kafka 2.4.0 has been released, it would be great to have this in 3.0. Thanks.

@HeartSaVioR
Copy link
Contributor

Sorry for the confusion. The KIP-396 proposal was accepted in time, but the code seems to have been merged after the Kafka 2.4 code freeze. (Though it was just 4 days later... I thought it had been merged in time.)
We may need to wait for Kafka 2.5, or leave out the poll(0) changes, go ahead, and revisit poll(0) with Kafka 2.5.

@dongjoon-hyun
Copy link
Member

Thank you for updates, @HeartSaVioR .

@Tagar
Copy link

Tagar commented Jun 10, 2020

@gaborgsomogyi just to check my understanding: since upstream Spark now uses the Kafka 2.5 client, is there any obstacle to using the new KafkaConsumer.poll API in the Kafka connector? Thanks for this improvement!

@HeartSaVioR
Copy link
Contributor

@Tagar
Please refer to the comment in the JIRA issue:
https://issues.apache.org/jira/browse/SPARK-28367?focusedCommentId=17088514&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17088514

In short, it's OK to use the new API in executors (we may need to set a longer timeout there), but it's still not ideal to use it in the driver. That would require Spark to implement the subscription modes itself with the AdminClient API, which is not impossible but brings in complexity to match the behavior we expect from the Kafka client.

@HeartSaVioR
Copy link
Contributor

That said, it can be unblocked if we are seriously concerned about the infinite timeout (i.e. we see the behavior fairly often) and consider it critical enough to tolerate the additional complexity brought to the driver side.

I'm even OK with a partial fix: address the executor usage first and leave an issue open for the driver. This would greatly reduce the chance of an infinite timeout, since only the Kafka consumer in the driver could still hang.

@zsxwing @gaborgsomogyi WDYT?

@gaborgsomogyi
Copy link
Contributor Author

@HeartSaVioR I think it makes sense to split up the problem, so I've created sub-tasks and the executor-side PR is on the way. The driver side requires further attention and discussion with the Kafka community. Re-implementing the subscription logic in Spark is overkill even if we want to avoid the infinite wait. We need another way...

@Tagar
Copy link

Tagar commented Jun 19, 2020

@gaborgsomogyi and @HeartSaVioR thanks. I saw the first PR for SPARK-32033; that's great!
