[SPARK-19185] [DSTREAMS] Avoid concurrent use of cached consumers in CachedKafkaConsumer #20997

Closed · wants to merge 7 commits into base: master · 5 participants
gaborgsomogyi (Contributor) commented Apr 6, 2018

What changes were proposed in this pull request?

CachedKafkaConsumer in the project streaming-kafka-0-10 is designed to maintain a pool of KafkaConsumers that can be reused. However, it was built on the assumption that only one thread will read the same Kafka TopicPartition at a time. This assumption does not always hold, and when it is violated the shared consumer can inadvertently throw ConcurrentModificationException.

A better design is to make the consumer pool smart enough to avoid concurrent use of a cached consumer: if there is another request for the same TopicPartition as a currently in-use consumer, the pool automatically returns a fresh consumer.

  • There are effectively two kinds of consumer that may be generated
    • Cached consumer - this should be returned to the pool at task end
    • Non-cached consumer - this should be closed at task end
  • A trait called KafkaDataConsumer is introduced to hide this difference from the users of the consumer so that the client code does not have to reason about whether to stop and release. They simply call val consumer = KafkaDataConsumer.acquire and then consumer.release.
  • If there is a request for a consumer that is in use, a new consumer is generated.
  • If a request comes from a task reattempt, the already existing cached consumer is invalidated and a new consumer is generated. This can fix potential issues if the source of the reattempt is a malfunctioning consumer.
  • In addition, I renamed the CachedKafkaConsumer class to KafkaDataConsumer, because the old name is a misnomer: what it returns may or may not be cached.
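The acquire/release contract described above can be sketched as a minimal, self-contained model. This is not the actual Spark code: the pool internals, the InternalConsumer stub, and the TopicPartition case class here are simplified assumptions; only the KafkaDataConsumer / acquire / release naming follows the PR.

```scala
// Simplified model of the acquire/release contract: clients never reason
// about cached vs. non-cached, they only call acquire and release.
case class TopicPartition(topic: String, partition: Int)

class InternalConsumer(val topicPartition: TopicPartition) {
  var inUse = true
  var closed = false
  def close(): Unit = closed = true
}

sealed trait KafkaDataConsumer {
  def internalConsumer: InternalConsumer
  // Whether the underlying consumer goes back to the pool or is closed
  // is hidden behind this trait.
  def release(): Unit = KafkaDataConsumer.release(this)
}

object KafkaDataConsumer {
  private case class Cached(internalConsumer: InternalConsumer) extends KafkaDataConsumer
  private case class NonCached(internalConsumer: InternalConsumer) extends KafkaDataConsumer

  private val cache = scala.collection.mutable.Map[TopicPartition, InternalConsumer]()

  def acquire(tp: TopicPartition): KafkaDataConsumer = synchronized {
    cache.get(tp) match {
      case Some(c) if !c.inUse =>
        c.inUse = true
        Cached(c)                           // reuse the idle cached consumer
      case Some(_) =>
        NonCached(new InternalConsumer(tp)) // cached one is busy: fresh consumer
      case None =>
        val c = new InternalConsumer(tp)
        cache(tp) = c
        Cached(c)                           // first request for this partition
    }
  }

  private def release(consumer: KafkaDataConsumer): Unit = synchronized {
    consumer match {
      case Cached(c)    => c.inUse = false  // returned to the pool at task end
      case NonCached(c) => c.close()        // not pooled: closed at task end
    }
  }
}
```

Client code then reduces to `val consumer = KafkaDataConsumer.acquire(tp)` followed by `consumer.release()`, regardless of which kind of consumer was handed out.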

How was this patch tested?

A new stress test that verifies it is safe to concurrently get consumers for the same TopicPartition from the consumer pool.
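The shape of such a stress test can be sketched as follows. This is illustrative only, not the actual suite: the Pool stand-in, thread and iteration counts are arbitrary assumptions; the property checked (no resource ever held by two threads at once) mirrors what the stress test verifies.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

// Stand-in resource with a flag we can use to detect concurrent ownership.
class Resource { val inUse = new AtomicBoolean(false) }

// Stand-in pool: hand out an idle resource if any, else a fresh one.
object Pool {
  private var idle = List.fill(2)(new Resource)
  def acquire(): Resource = synchronized {
    idle match {
      case head :: tail => idle = tail; head // reuse an idle resource
      case Nil          => new Resource      // pool busy: fresh resource
    }
  }
  def release(r: Resource): Unit = synchronized { idle = r :: idle }
}

object StressTest {
  def run(threads: Int, iterations: Int): Boolean = {
    val threadPool = Executors.newFixedThreadPool(threads)
    val violated = new AtomicBoolean(false)
    (1 to threads).foreach { _ =>
      threadPool.submit(new Runnable {
        def run(): Unit = (1 to iterations).foreach { _ =>
          val r = Pool.acquire()
          // If the same resource were handed to two threads, this CAS fails.
          if (!r.inUse.compareAndSet(false, true)) violated.set(true)
          r.inUse.set(false)
          Pool.release(r)
        }
      })
    }
    threadPool.shutdown()
    threadPool.awaitTermination(1, TimeUnit.MINUTES)
    !violated.get
  }
}
```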

SparkQA commented Apr 6, 2018

Test build #88990 has finished for PR 20997 at commit 0fe456b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
gaborgsomogyi (Contributor) commented Apr 6, 2018

@vanzin

I think the caching code could use some cleanup. In particular, it seems odd to not reuse consumers across attempts, if you're already tracking whether the consumer is in use.

If there's a reason for that, there needs to be a comment in the code with a little more info.

/**
* A wrapper around Kafka's KafkaConsumer.
* This is not for direct use outside this file.

vanzin (Contributor) commented Apr 9, 2018

This generally means the class should be private not private[blah].

gaborgsomogyi (Contributor) commented Apr 11, 2018

Changed.

koeninger (Contributor) commented Apr 10, 2018

In general, 2 things about this make me uncomfortable:

  • It's basically a cut-and-paste of the SQL equivalent PR, #20767, but it is different from both that PR and the existing DStream code.

  • I don't see an upper bound on the number of consumers per key, nor a way of reaping idle consumers. If the SQL equivalent code is likely to be modified to use pooling of some kind, seems better to make a consistent decision.

threadPool.shutdown()
}
}
}

koeninger (Contributor) commented Apr 10, 2018

If this PR is intended to fix a problem with silent reading of incorrect data, can you add a test reproducing that?

gaborgsomogyi (Contributor) commented Apr 12, 2018

That's a bad cut-and-paste issue. This PR intends to solve ConcurrentModificationException.

gaborgsomogyi (Contributor) commented Apr 12, 2018

Removed from the PR description.

gaborgsomogyi (Contributor) commented Apr 12, 2018

@koeninger

> I don't see an upper bound on the number of consumers per key, nor a way of reaping idle consumers. If the SQL equivalent code is likely to be modified to use pooling of some kind, seems better to make a consistent decision.

When do you think the decision will happen?

SparkQA commented Apr 12, 2018

Test build #89274 has finished for PR 20997 at commit d776289.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Apr 12, 2018

Test build #89275 has finished for PR 20997 at commit 250ad92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Apr 13, 2018

Test build #89344 has finished for PR 20997 at commit 215339d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Apr 13, 2018

Test build #89359 has finished for PR 20997 at commit 7aa3257.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class KafkaDataConsumerSuite extends SparkFunSuite with BeforeAndAfterAll
koeninger (Contributor) commented Apr 17, 2018
gaborgsomogyi (Contributor) commented Apr 20, 2018

I've taken a look at the pool options and I have the feeling it requires more time to come up with a proper solution. Switching back to the one-cached-consumer approach provided by the SQL code...

gaborgsomogyi (Contributor) commented Apr 21, 2018

In the meantime I found a small glitch in the SQL part. Namely, if a reattempt happens, this line

removes the consumer from the cache, which ends up in this log message:

13:27:07.556 INFO org.apache.spark.sql.kafka010.KafkaDataConsumer: Released a supposedly cached consumer that was not found in the cache

I've solved this here by removing only the closed consumer. The one marked for close will be removed in release.
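That fix can be sketched as a rough, simplified model: on a task reattempt an in-use cached consumer is marked for close rather than evicted immediately, so release() still finds it in the cache and can evict and close it there. Names such as markedForClose, invalidate, and ConsumerCache are illustrative assumptions, not the actual Spark code.

```scala
case class CacheKey(topic: String, partition: Int)

class Consumer(val key: CacheKey) {
  var inUse = false
  var markedForClose = false
  var closed = false
  def close(): Unit = closed = true
}

object ConsumerCache {
  val cache = scala.collection.mutable.Map[CacheKey, Consumer]()

  // Called on a task reattempt: don't evict an in-use consumer, just mark it.
  def invalidate(key: CacheKey): Unit = synchronized {
    cache.get(key).foreach { c =>
      if (c.inUse) c.markedForClose = true   // evicted later, in release()
      else { cache.remove(key); c.close() }  // idle: safe to close right away
    }
  }

  def release(c: Consumer): Unit = synchronized {
    c.inUse = false
    if (c.markedForClose) {
      cache.remove(c.key)  // the consumer is still present in the cache here,
      c.close()            // so no "not found in the cache" log message
    }
  }
}
```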

SparkQA commented Apr 21, 2018

Test build #89676 has finished for PR 20997 at commit 2c45388.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@vanzin

A few minor things but in general I'll defer to Cody, who knows a lot more about this than I do.

// likely running on a beefy machine that can handle a large number of simultaneously
// active consumers.
if (entry.getValue.inUse == false && this.size > maxCapacity) {

vanzin (Contributor) commented Apr 25, 2018

nit: !entry.getValue.inUse

gaborgsomogyi (Contributor) commented May 2, 2018

Fixed.
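The capacity check quoted earlier in this thread matches java.util.LinkedHashMap's removeEldestEntry eviction hook. A minimal sketch of that pattern follows; maxCapacity, CachedEntry, and the String key type are illustrative stand-ins, not the actual Spark code.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

class CachedEntry { var inUse = false }

// Access-ordered LinkedHashMap: removeEldestEntry is consulted on every put,
// and evicts the least-recently-used entry only when it is idle and the
// cache has grown past maxCapacity.
class Cache(maxCapacity: Int)
    extends JLinkedHashMap[String, CachedEntry](16, 0.75f, true) {
  override def removeEldestEntry(entry: JMap.Entry[String, CachedEntry]): Boolean = {
    // An in-use eldest entry is kept even if the cache is temporarily
    // over capacity.
    !entry.getValue.inUse && this.size > maxCapacity
  }
}
```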

}
}
private def release(internalConsumer: InternalKafkaConsumer[_, _]): Unit = synchronized {

vanzin (Contributor) commented Apr 25, 2018

After reading this code and also the acquire method, is there a useful difference between the CachedKafkaDataConsumer and NonCachedKafkaDataConsumer types?

It seems like the code doesn't really care about those types, but just about whether the consumer is in the cache?

koeninger (Contributor) commented Apr 25, 2018

I think that's a good observation. But I'm not sure it's worth deviating from the same design being used in the SQL code.

SparkQA commented May 2, 2018

Test build #90044 has finished for PR 20997 at commit 6cd67c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
gaborgsomogyi (Contributor) commented May 22, 2018

Do I need to do any further changes?

vanzin (Contributor) commented May 22, 2018

I'm fine with it. Unless Cody beats me to it or has more comments, I'll push this after the long weekend.

vanzin (Contributor) commented May 22, 2018

retest this please

koeninger (Contributor) commented May 22, 2018

I'm fine as well.

SparkQA commented May 22, 2018

Test build #90989 has finished for PR 20997 at commit 6cd67c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
vanzin (Contributor) commented May 22, 2018

That being the case, merging to master.
