KAFKA-6431: Shard purgatory to mitigate lock contention #5338

Merged: 7 commits into apache:trunk on Jan 4, 2019

Conversation

@ying-zheng (Contributor) commented on Jul 6, 2018

Shard the purgatory and use a ReentrantLock instead of a ReentrantReadWriteLock.

This fix has been deployed in Uber's production environment for several months.
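For context, the change replaces the single watcher map guarded by one purgatory-wide ReentrantReadWriteLock with an array of independently locked shards selected by the operation key's hash. Below is a minimal sketch of that pattern; the names mirror the diffs reviewed further down, but the hashing and the `withKeyLock` helper are illustrative assumptions, not the merged code.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.ReentrantLock

// Illustrative sketch of the sharded watcher-list pattern, not the actual Kafka class.
class ShardedPurgatorySketch[K, W](shardCount: Int = 512) {

  private class WatcherList {
    // Each shard owns its own watcher map and a plain ReentrantLock.
    val watchersByKey = new ConcurrentHashMap[K, W]()
    val watchersLock = new ReentrantLock()
  }

  private val watcherLists = Array.fill(shardCount)(new WatcherList)

  // Pick the shard for a key; mask the sign bit so negative hash codes map to a valid index.
  private def watcherList(key: K): WatcherList =
    watcherLists((key.hashCode & 0x7fffffff) % watcherLists.length)

  // Callers only contend with other operations whose keys hash to the same shard.
  def withKeyLock[T](key: K)(body: ConcurrentHashMap[K, W] => T): T = {
    val wl = watcherList(key)
    wl.watchersLock.lock()
    try body(wl.watchersByKey)
    finally wl.watchersLock.unlock()
  }
}
```

With many shards, two produce requests for different partitions almost always land on different locks, which is where the contention relief comes from.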

@ying-zheng changed the title from Kafka 6431 to [KAFKA-6431] Shard purgatory to mitigate lock contention on Jul 6, 2018
@ying-zheng changed the title from [KAFKA-6431] Shard purgatory to mitigate lock contention to KAFKA-6431: Shard purgatory to mitigate lock contention on Jul 6, 2018
@harshach left a comment

Thanks for the patch @ying-zheng. Overall it looks good to me; I left some minor nits.
The build failure is not related to this patch; all unit tests pass locally with the patch applied.
@guozhangwang you might want to take a look as well.

}

/* 512 shards */
private val watcherLists = Array.fill[WatcherList](512)(new WatcherList)
Any specific reason for 512 shards? Should we make this configurable? If not, let's declare it as a constant with a comment.

@ying-zheng (Contributor, Author)

I don't want to make the configuration overly complicated; 512 is just a large enough number. I have some explanation in the ticket. I will make it a constant.

private class WatcherList {
val watchersForKey = new Pool[Any, Watchers](Some((key: Any) => new Watchers(key)))

val removeWatchersLock = new ReentrantLock()
Can we call this watchersLock, since we are using it for reads as well?
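For illustration, a sketch of why the broader name fits, assuming the shard's single lock guards both the read path and the removal path. ConcurrentHashMap stands in for Kafka's Pool, and the method bodies are hypothetical, not the merged code.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.{Lock, ReentrantLock}
import scala.jdk.CollectionConverters._

// ConcurrentHashMap stands in for Kafka's Pool; method bodies are illustrative only.
class WatcherListSketch[K, W] {
  val watchersByKey = new ConcurrentHashMap[K, W]()

  // Named watchersLock rather than removeWatchersLock: it guards reads as well as removals.
  val watchersLock = new ReentrantLock()

  private def inLock[T](lock: Lock)(body: => T): T = {
    lock.lock()
    try body finally lock.unlock()
  }

  // Read path: snapshot the current watchers under the same lock.
  def allWatchers: List[W] =
    inLock(watchersLock) { watchersByKey.values.asScala.toList }

  // Removal path: drop a key once its watcher set has drained (isEmpty is caller-supplied).
  def removeKeyIfEmpty(key: K, isEmpty: W => Boolean): Unit = inLock(watchersLock) {
    val watchers = watchersByKey.get(key)
    if (watchers != null && isEmpty(watchers)) watchersByKey.remove(key, watchers)
  }
}
```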


/* a list of operation watching keys */
private val watchersForKey = new Pool[Any, Watchers](Some((key: Any) => new Watchers(key)))
//private val watchersForKey = new Pool[Any, Watchers](Some((key: Any) => new Watchers(key)))
Let's remove the commented-out lines.

@@ -282,7 +303,13 @@ final class DelayedOperationPurgatory[T <: DelayedOperation](purgatoryName: Stri
* on multiple lists, and some of its watched entries may still be in the watch lists
* even when it has been completed, this number may be larger than the number of real operations watched
*/
def watched: Int = allWatchers.map(_.countWatched).sum
def watched() = {
Can you add the return type?

@guozhangwang (Contributor)

cc @junrao @rajinisivaram to take a look.

@ijuma (Contributor) commented on Jul 9, 2018

cc @cmccabe

@ijuma (Contributor) commented on Jul 9, 2018

Thanks for the PR @ying-zheng. Can you also share the improvement you have seen after deploying this change to Uber's production environment?

@ying-zheng (Contributor, Author) commented on Jul 9, 2018

@ijuma in my simulation test, this change reduced the P50 produce latency (acks=all) from 4ms to 3ms.

You can find more details about the simulation test in https://www.slideshare.net/YingZheng35/improving-kafka-atleastonce-performance-at-uber

As we deployed several performance improvements to the production environment together, it's hard to isolate the exact improvement from this change alone. With these optimizations (mostly presented in the slides above), the produce latency (P99 and P50) of our acks=all cluster dropped by about 90%, which is more significant than the improvement we saw in the simulation test.

@harshach

Thanks for the changes @ying-zheng. I am +1 on the patch. I'll wait for other reviewers to comment before merging it in.

@harshach

@junrao @rajinisivaram ping for a review. Thanks.

@rajinisivaram (Contributor) left a comment

@ying-zheng Thanks for the PR. Left a few minor comments. Could we also add some unit tests to verify the cases where the operations are in the same shard as well as different shards?


private val removeWatchersLock = new ReentrantReadWriteLock()
private val Shards = 512 // Shard the watcher list to reduce lock contention

Perhaps define this in an object rather than as an instance variable?
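A sketch of the suggested placement; the object and constant names mirror the diff, but the surrounding scaffolding here is a placeholder, not the merged code.

```scala
// Names mirror the diff, but this scaffolding is a sketch, not the merged code.
object DelayedOperationPurgatory {
  // Shard count for the watcher lists; a constant shared by all purgatory instances.
  val Shards: Int = 512
}

class DelayedOperationPurgatorySketch {
  import DelayedOperationPurgatory.Shards

  private class WatcherList          // placeholder for the real shard class
  private val watcherLists = Array.fill(Shards)(new WatcherList)
}
```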

private val removeWatchersLock = new ReentrantReadWriteLock()
private val Shards = 512 // Shard the watcher list to reduce lock contention
private val watcherLists = Array.fill[WatcherList](Shards)(new WatcherList)
private def getWatcherList(key: Any): WatcherList = {

getWatcherList -> watcherList since we don't normally add get prefix?

def watched: Int = allWatchers.map(_.countWatched).sum
def watched(): Int = {
var sum = 0
for (wl <- watcherLists) {

Could we use foldLeft instead of a for loop?
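For reference, a sketch of the foldLeft shape being suggested; the stub types stand in for the real Watchers and WatcherList so the fragment compiles on its own.

```scala
// Stub types so the fragment compiles on its own; names follow the diff.
class Watchers { def countWatched: Int = 0 }
class WatcherList { def allWatchers: Iterable[Watchers] = Nil }

class WatchedCountSketch(watcherLists: Array[WatcherList]) {
  // foldLeft accumulates the per-shard counts without a mutable var or a for loop.
  def watched: Int =
    watcherLists.foldLeft(0)((sum, wl) => sum + wl.allWatchers.map(_.countWatched).sum)
}
```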

@@ -424,7 +444,10 @@ final class DelayedOperationPurgatory[T <: DelayedOperation](purgatoryName: Stri
// a little overestimated total number of operations.
estimatedTotalOperations.getAndSet(delayed)
debug("Begin purging watch lists")
val purged = allWatchers.map(_.purgeCompleted()).sum
var purged = 0
for (wl <- watcherLists) {

Could we use foldLeft instead of a for loop?

@cmccabe (Contributor) commented on Jul 17, 2018

I tested this and didn't see any performance difference. However, it may be that my test case was too small-scale to see a benefit. If there is a benefit, I would expect to see it when there is a lot of contention due to many simultaneous requests.

@ying-zheng (Contributor, Author)

@rajinisivaram thank you for the comments. I have updated the code diff

@ying-zheng (Contributor, Author) commented on Jul 18, 2018

@cmccabe
In my test, each broker leads about 900 topic partitions, and there are about 300K produce requests per second.

@cmccabe (Contributor) commented on Jul 18, 2018

Thanks. Out of curiosity, how many partitions (and partitions per broker) did you use?

@ying-zheng (Contributor, Author)

@cmccabe About 2,800 topic-partitions per broker with 3 replicas; each broker leads ~900 topic-partitions. Besides the 2 followers, there are 4 consumers consuming each topic.

@ying-zheng closed this on Jul 19, 2018
@ying-zheng reopened this on Jul 19, 2018
@tedyu (Contributor) left a comment

Looks good overall.

/* a list of operation watching keys */
private val watchersForKey = new Pool[Any, Watchers](Some((key: Any) => new Watchers(key)))
private class WatcherList {
val watchersForKey = new Pool[Any, Watchers](Some((key: Any) => new Watchers(key)))

watchersForKey -> watchersByKey

@ying-zheng (Contributor, Author)

done

@harshach

@rajinisivaram can you review the latest changes? If it looks good, I would like to merge into trunk and the 1.1 branch.

@ijuma (Contributor) commented on Jul 21, 2018

@harshach, a meta-comment: I noticed that there are a number of PRs where you mentioned merging to trunk and 1.1. There is also a 2.0 branch; if you backport to 1.1, it's essential that the change is also cherry-picked to 2.0.

An additional point is that we tend to backport only low-risk and/or important fixes to older branches, to make sure we don't introduce regressions in older releases. Please take this into account when deciding what to backport.

@harshach

@ijuma understood. These patches are performance improvements, and some are critical fixes for large clusters. Users who deploy large clusters usually don't upgrade to the latest versions easily, hence the desire to merge critical fixes into the 1.x line.

@ijuma (Contributor) commented on Jul 23, 2018

The numbers provided in this PR seemed very incremental and were for several changes (not just this one), so the case for backporting does not seem strong.

@harshach

@ijuma fair enough. Will merge it into trunk.
@rajinisivaram another ping for your approval. Thanks.

@rajinisivaram (Contributor)

@ying-zheng Thanks for the updates. The implementation looks good. We have a micro-benchmark for the purgatory: test.TestPurgatoryPerformance. Do you think it would be useful to update this test to run with multiple threads, to cover the scenario that this PR helps with? That would help us easily test for regressions in the future. Thank you!

@ying-zheng (Contributor, Author)

@rajinisivaram I think we should be able to see the performance difference with 32 threads and >300K QPS. However, I don't know how the micro-benchmark works. How do you know if there is a performance regression? Run the test with different Kafka versions on the same host?

@rajinisivaram (Contributor)

@ying-zheng Yes, if you could run the benchmark on the same host with and without this PR, that would be great. Thank you!

@harshach

@rajinisivaram can we get this PR into the upcoming releases? Unfortunately, running the benchmark is taking time. Is that a blocker to getting this merged in?

@rajinisivaram (Contributor)

@harshach Perhaps we can merge this just to trunk? @ijuma What do you think?

@harshach

@rajinisivaram Let me know if you are +1 to merge into trunk.

@junrao (Contributor) left a comment

@ying-zheng: Thanks for the patch. LGTM. Just a few minor comments below.

@@ -424,7 +441,7 @@ final class DelayedOperationPurgatory[T <: DelayedOperation](purgatoryName: Stri
// a little overestimated total number of operations.
estimatedTotalOperations.getAndSet(delayed)
debug("Begin purging watch lists")
val purged = allWatchers.map(_.purgeCompleted()).sum
var purged = watcherLists.foldLeft(0) { _ + _.allWatchers.map(_.purgeCompleted()).sum }

Does purged need to be var?

Also, could we use case inside foldLeft() so that we can use named params instead of _?
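A sketch of what the two suggestions combined might look like, with purged as a val and a case pattern naming the accumulator and the shard; the enclosing method and stub types are placeholders, not the merged code.

```scala
// Stub types so the fragment compiles on its own; names and the method are placeholders.
class Watchers { def purgeCompleted(): Int = 0 }                 // entries purged in one watcher
class WatcherList { def allWatchers: Iterable[Watchers] = Nil }  // watchers in one shard

class PurgeSketch(watcherLists: Array[WatcherList]) {
  def purgeCompletedOperations(): Int = {
    // val, not var: foldLeft already yields the final sum, and the case pattern
    // gives the accumulator and the shard readable names instead of underscores.
    val purged = watcherLists.foldLeft(0) { case (sum, watcherList) =>
      sum + watcherList.allWatchers.map(_.purgeCompleted()).sum
    }
    purged
  }
}
```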

fixed

* note that the returned watchers may be removed from the list by other threads
*/
def allWatchers = {
inLock(watchersLock) { watchersByKey.values }

This is an existing issue, but I am wondering if we really need the lock here. What we return from the lock is a view of the backing map, which can change after the lock is released. Since ConcurrentHashMap already supports weakly consistent iterators, it seems that we don't need the lock.
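A sketch of the lock-free read being suggested, assuming the backing map is a ConcurrentHashMap (which Kafka's Pool wraps): its values view is weakly consistent, so iterating it without the lock is safe and simply may or may not reflect concurrent updates.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._

class Watchers { def countWatched: Int = 0 }   // stub

class LockFreeReadSketch {
  private val watchersByKey = new ConcurrentHashMap[Any, Watchers]()

  // No lock needed for the read: ConcurrentHashMap's values view is weakly consistent,
  // so iterating it tolerates concurrent puts and removes without throwing.
  def allWatchers: Iterable[Watchers] = watchersByKey.values.asScala
}
```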

Removed the lock
Thank you for the comments!

Happy new year!

@@ -282,7 +299,9 @@ final class DelayedOperationPurgatory[T <: DelayedOperation](purgatoryName: Stri
* on multiple lists, and some of its watched entries may still be in the watch lists
* even when it has been completed, this number may be larger than the number of real operations watched
*/
def watched: Int = allWatchers.map(_.countWatched).sum
def watched(): Int = {

Do we need to add () to watched? It doesn't seem to have any side effects.
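For reference, a tiny sketch of the Scala convention being pointed at, unrelated to the purgatory internals: parameterless declarations for side-effect-free accessors, parentheses kept for methods with side effects.

```scala
// A convention sketch only; nothing here is purgatory code.
class CounterSketch {
  private var count = 0

  // Pure accessor: declared without parentheses because it has no side effects.
  def watched: Int = count

  // Parentheses are kept, by convention, for methods that do have side effects.
  def increment(): Unit = count += 1
}
```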

@harshach commented on Jan 4, 2019

@junrao can you please re-check? @yingzuber addressed your comments.

@harshach merged commit 459a4dd into apache:trunk on Jan 4, 2019
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
* Shard purgatory to reduce lock contention

* put constant into Object, use foldLeft instead of for loop

* watchersForKey -> watchersByKey

* Incorporate Jun's comments: use named arguments instead of _, and remove an unnecessary lock

Reviewers: Sriharsha Chintalapani <sriharsha@apache.org>, Jun Rao <junrao@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>