
MINOR: reduce allocations in log start and recovery checkpoints #8467

Merged
7 commits merged into apache:trunk on Apr 25, 2020

Conversation

@lbradstreet (Contributor) commented Apr 12, 2020

For brokers with replica counts > 4000, allocations from logsByDir become substantial. logsByDir is called often by LogManager.checkpointLogRecoveryOffsets and LogManager.checkpointLogStartOffsets. The approach used is similar to the one from the checkpointHighwatermarks change in #6741.

Are there better ways to structure our data structures so that we do not have to build logsByDir on demand for each checkpoint iteration? This micro-optimization helps as is, but it would be better if we could avoid the rebuild entirely.
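For context, here is a toy illustration (not the exact pre-PR code) of the allocation-heavy style this change moves away from: a groupBy-based grouping allocates a (TopicPartition, Log) tuple per entry plus intermediate collections every time it runs, whereas the patch builds the nested map in a single pass over mutable per-directory maps (see the hunks quoted in the review below).

import org.apache.kafka.common.TopicPartition
import kafka.log.Log

// Illustrative only: group a map of TopicPartition -> Log by parent log directory.
// Every checkpoint pass that does this pays for one tuple per entry plus the
// intermediate collections that groupBy materialises.
def byDirGroupBy(logs: collection.Map[TopicPartition, Log]): collection.Map[String, collection.Map[TopicPartition, Log]] =
  logs.groupBy { case (_, log) => log.parentDir }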

JMH benchmark results:

Before:
Benchmark                                                                      (numPartitions)  (numTopics)   Mode  Cnt        Score        Error   Units
CheckpointBench.measureCheckpointLogStartOffsets                                             3          100  thrpt   15        2.233 ±      0.013  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate                              3          100  thrpt   15      477.097 ±     49.731  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm                         3          100  thrpt   15   246083.007 ±     33.052    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space                     3          100  thrpt   15      475.683 ±     55.569  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space.norm                3          100  thrpt   15   245474.040 ±  14968.328    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen                        3          100  thrpt   15        0.001 ±      0.001  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen.norm                   3          100  thrpt   15        0.341 ±      0.268    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.count                                   3          100  thrpt   15      129.000               counts
CheckpointBench.measureCheckpointLogStartOffsets:·gc.time                                    3          100  thrpt   15       52.000                   ms
CheckpointBench.measureCheckpointLogStartOffsets                                             3         1000  thrpt   15        0.572 ±      0.004  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate                              3         1000  thrpt   15     1360.240 ±    150.539  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm                         3         1000  thrpt   15  2750221.257 ±    891.024    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space                     3         1000  thrpt   15     1362.908 ±    148.799  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space.norm                3         1000  thrpt   15  2756395.092 ±  44671.843    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen                        3         1000  thrpt   15        0.017 ±      0.008  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen.norm                   3         1000  thrpt   15       33.611 ±     14.401    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.count                                   3         1000  thrpt   15      273.000               counts
CheckpointBench.measureCheckpointLogStartOffsets:·gc.time                                    3         1000  thrpt   15      186.000                   ms
CheckpointBench.measureCheckpointLogStartOffsets                                             3         2000  thrpt   15        0.266 ±      0.002  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate                              3         2000  thrpt   15     1342.557 ±    171.260  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm                         3         2000  thrpt   15  5877881.729 ±   3695.086    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space                     3         2000  thrpt   15     1343.965 ±    186.069  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space.norm                3         2000  thrpt   15  5877788.561 ± 168540.343    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen                        3         2000  thrpt   15        0.081 ±      0.043  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen.norm                   3         2000  thrpt   15      351.277 ±    167.006    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.count                                   3         2000  thrpt   15      253.000               counts
CheckpointBench.measureCheckpointLogStartOffsets:·gc.time                                    3         2000  thrpt   15      231.000                   ms
JMH benchmarks done

After:
CheckpointBench.measureCheckpointLogStartOffsets                                             3          100  thrpt   15        2.809 ±     0.129  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate                              3          100  thrpt   15      211.248 ±    25.953  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm                         3          100  thrpt   15    86533.838 ±  3763.989    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space                     3          100  thrpt   15      211.512 ±    38.669  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space.norm                3          100  thrpt   15    86228.552 ±  9590.781    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen                        3          100  thrpt   15       ≈ 10⁻³              MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen.norm                   3          100  thrpt   15        0.140 ±     0.111    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.count                                   3          100  thrpt   15       57.000              counts
CheckpointBench.measureCheckpointLogStartOffsets:·gc.time                                    3          100  thrpt   15       25.000                  ms
CheckpointBench.measureCheckpointLogStartOffsets                                             3         1000  thrpt   15        1.046 ±     0.030  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate                              3         1000  thrpt   15      524.597 ±    74.793  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm                         3         1000  thrpt   15   582898.889 ± 37552.262    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space                     3         1000  thrpt   15      519.675 ±    89.754  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space.norm                3         1000  thrpt   15   576371.150 ± 55972.955    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen                        3         1000  thrpt   15        0.009 ±     0.005  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen.norm                   3         1000  thrpt   15        9.920 ±     5.375    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.count                                   3         1000  thrpt   15      111.000              counts
CheckpointBench.measureCheckpointLogStartOffsets:·gc.time                                    3         1000  thrpt   15       56.000                  ms
CheckpointBench.measureCheckpointLogStartOffsets                                             3         2000  thrpt   15        0.617 ±     0.007  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate                              3         2000  thrpt   15      573.061 ±    95.931  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm                         3         2000  thrpt   15  1092098.004 ± 75140.633    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space                     3         2000  thrpt   15      572.448 ±    97.960  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space.norm                3         2000  thrpt   15  1091290.460 ± 85946.164    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen                        3         2000  thrpt   15        0.010 ±     0.012  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen.norm                   3         2000  thrpt   15       19.990 ±    24.407    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.count                                   3         2000  thrpt   15      109.000              counts
CheckpointBench.measureCheckpointLogStartOffsets:·gc.time                                    3         2000  thrpt   15       67.000                  ms
JMH benchmarks done

For the 2000-topic, 3-partition case, we see a reduction in normalized allocations from 5877881 B/op to 1284190.774 B/op, a reduction of 78%.

Some allocation profiles from a mid-sized broker follow. I have seen worse, but these add up to around 3.8% of allocations on a broker whose GC overhead was around 30% of CPU time. You could argue that this is relatively small, but it seems worthwhile for a low-risk change.
[allocation profile screenshots]

def logsByDir: Map[String, Map[TopicPartition, Log]] = {
  val byDir = new mutable.HashMap[String, mutable.HashMap[TopicPartition, Log]]()
  def addToDir(tp: TopicPartition, log: Log): Unit = {
    byDir.getOrElseUpdate(log.parentDir, new mutable.HashMap[TopicPartition, Log]()).put(tp, log)
Contributor:
In the other PR we used AnyRefMap. Any reason why we are not doing that here?

Contributor:
We should also add a comment about why we are writing non-idiomatic code here, to avoid it being reverted later.

Contributor Author:
Good point. I will try it and retest and add a comment to prevent regression.

Contributor Author:
I have addressed this and it's ready for another review pass.

@ijuma (Contributor) commented Apr 17, 2020

ok to test

@lbradstreet (Contributor Author):
Updated benchmarks with AnyRefMap change:

Benchmark                                         (numPartitions)  (numTopics)   Mode  Cnt  Score   Error   Units
CheckpointBench.measureCheckpointLogStartOffsets                3          100  thrpt   15  2.673 ± 0.027  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets                3         1000  thrpt   15  1.066 ± 0.012  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets                3         2000  thrpt   15  0.609 ± 0.007  ops/ms

Looks marginally better. I'm re-running with -prof gc so I can include the final results in the commit.

@ijuma (Contributor) commented Apr 19, 2020

ok to test

@ijuma (Contributor) commented Apr 23, 2020

retest this please

    byDir.getOrElseUpdate(log.parentDir, new mutable.AnyRefMap[TopicPartition, Log]()).put(tp, log)
  }
  currentLogs.foreach { case (tp, log) => addToDir(tp, log) }
  futureLogs.foreach { case (tp, log) => addToDir(tp, log) }
@ijuma (Contributor), Apr 23, 2020:
scala.collection.Map has a def foreachEntry[U](f: (K, V) => U): Unit method that avoids allocating tuples (it's a function with two parameters). If we want to avoid allocations, we should introduce a similar method to Pool and use it here.
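For reference, a small usage example of the standard-library method being referenced here (available in Scala 2.13; Pool would need its own equivalent since it wraps a ConcurrentHashMap):

val m = Map("a" -> 1, "b" -> 2)
// foreachEntry takes a two-argument function, so no (K, V) tuple is created per element.
m.foreachEntry((k, v) => println(s"$k -> $v"))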

Contributor Author:
I tried out your suggestion and, interestingly, saw more allocations. Note the deliberately non-idiomatic use of a var for item, in case the usual tuple destructuring added an extra allocation.

--- a/core/src/main/scala/kafka/log/LogManager.scala
+++ b/core/src/main/scala/kafka/log/LogManager.scala
@@ -1011,8 +1011,8 @@ class LogManager(logDirs: Seq[File],
     def addToDir(tp: TopicPartition, log: Log): Unit = {
       byDir.getOrElseUpdate(log.parentDir, new mutable.AnyRefMap[TopicPartition, Log]()).put(tp, log)
     }
-    currentLogs.foreach { case (tp, log) => addToDir(tp, log) }
-    futureLogs.foreach { case (tp, log) => addToDir(tp, log) }
+    currentLogs.foreachEntry(addToDir)
+    futureLogs.foreachEntry(addToDir)
     byDir
   }
 
diff --git a/core/src/main/scala/kafka/utils/Pool.scala b/core/src/main/scala/kafka/utils/Pool.scala
index 964de7eae2..450f90611b 100644
--- a/core/src/main/scala/kafka/utils/Pool.scala
+++ b/core/src/main/scala/kafka/utils/Pool.scala
@@ -74,6 +74,15 @@ class Pool[K,V](valueFactory: Option[K => V] = None) extends Iterable[(K, V)] {
   def values: Iterable[V] = pool.values.asScala
 
   def clear(): Unit = { pool.clear() }
+
+  def foreachEntry(f: (K, V) => Unit): Unit = {
+    val iter = iterator
+    var item: (K, V) = null
+    while(iter.hasNext) {
+      item = iter.next()
+      f(item._1, item._2)
+    }
+  }

CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm 3 2000 thrpt 15 1428889.965 ± 75131.113 B/op
vs this PR:
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm 3 2000 thrpt 15 1284326.850 ± 75148.430 B/op

Contributor Author:
^ I updated the text above to say "this PR" rather than trunk.

Contributor:
I think the issue is that you're not using the underlying map iterator, right?

Contributor:
You have to use the iterator of the underlying ConcurrentHashMap[K, V], which gives you Map.Entry instead of a tuple. Then you avoid the allocation.
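A minimal sketch of that suggestion, assuming pool is the backing java.util.concurrent.ConcurrentHashMap[K, V] field inside Pool (illustrative, not the code that was ultimately merged):

// Iterate the backing map's entrySet directly: each element is a
// java.util.Map.Entry[K, V], so no Scala (K, V) tuple is allocated per entry.
def foreachEntry(f: (K, V) => Unit): Unit = {
  val iter = pool.entrySet.iterator
  while (iter.hasNext) {
    val entry = iter.next()
    f(entry.getKey, entry.getValue)
  }
}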

Contributor Author:
Do you know what I'm doing wrong when I supply a lambda to the Java .forEach? I know it works for Consumers, but in my brief attempt at improving this I couldn't make it work with the BiConsumer here and had to create a new one.

I will update the PR body with the re-run JMH results.

Contributor:
What error are you getting?

Contributor Author:
AFAICT, it's easy enough to do something like this:

pool.forEach((k,v) => f(k,v))

But then we appear to be back to allocating tuples.

It doesn't appear easy to do something like:

pool.forEach(kv => f(kv.getKey, kv.getValue))

since it's not able to create a BiConsumer for you.

Contributor:
Why do you say that the following would allocate tuples?

def foreachEntry(f: (K, V) => Unit): Unit = {
  pool.forEach((k, v) => f(k, v))
}

Did you measure it? Looking at the code, I don't see any tuples.

Contributor Author:
@ijuma I had mistakenly thought it was taking a tuple there; maybe that was just the way Scala converted the lambda to a BiConsumer. The results had also regressed back to close to what I was seeing with the iterator version.

When I ran it, I saw:

With the lambda version:

1140074.937 ± 75151.914    B/op

def foreachEntry(f: (K, V) => Unit): Unit = {
  pool.forEach((k, v) => f(k, v))
}

With the explicit BiConsumer version (included in an above comment):

1044120.722 ±  1676.137    B/op

pool.forEach(new BiConsumer[K, V] {
  override def accept(t: K, u: V): Unit = f(t, u)
})

Strangely, I have just re-run the BiConsumer version and it returned 1188097.413 ± 1639.537 B/op. I'm not sure why it has regressed from what I saw on a previous run. I am OK with using the version with a lambda if you are; I'm not sure I will have time to investigate further, and it's still a good improvement compared to the version that used foreach.

@lbradstreet (Contributor Author):
I have updated the benchmark results in the OP with the new forEach code. It is strictly better than the old foreach code.
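For reference, the end state discussed in this thread looks roughly like the following sketch, reconstructed from the hunks quoted above (it may not match the merged code verbatim): Pool gains a tuple-free foreachEntry based on the Java forEach, and logsByDir uses it together with per-directory AnyRefMaps.

// In kafka.utils.Pool, where pool is the backing ConcurrentHashMap[K, V]:
// the two-argument Scala lambda is SAM-converted to a java.util.function.BiConsumer,
// so no (K, V) tuple is allocated per entry.
def foreachEntry(f: (K, V) => Unit): Unit = {
  pool.forEach((k, v) => f(k, v))
}

// In kafka.log.LogManager: build the per-directory map in a single pass,
// reusing one mutable AnyRefMap per log directory instead of materialising
// intermediate collections on every checkpoint.
def logsByDir: Map[String, Map[TopicPartition, Log]] = {
  val byDir = new mutable.AnyRefMap[String, mutable.AnyRefMap[TopicPartition, Log]]()
  def addToDir(tp: TopicPartition, log: Log): Unit = {
    byDir.getOrElseUpdate(log.parentDir, new mutable.AnyRefMap[TopicPartition, Log]()).put(tp, log)
  }
  currentLogs.foreachEntry(addToDir)
  futureLogs.foreachEntry(addToDir)
  byDir
}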

@ijuma (Contributor) left a review:
LGTM, thanks for the PR!

@ijuma (Contributor) commented Apr 24, 2020

ok to test

@ijuma (Contributor) commented Apr 24, 2020

ok to test

@ijuma (Contributor) commented Apr 24, 2020

retest this please

@ijuma (Contributor) commented Apr 24, 2020

ok to test

@ijuma ijuma merged commit cfc34ca into apache:trunk Apr 25, 2020
ijuma added a commit to ijuma/kafka that referenced this pull request Apr 29, 2020
…t-for-generated-requests

* apache-github/trunk: (366 commits)
  MINOR: Improve producer test BufferPoolTest#testCloseNotifyWaiters. (apache#7982)
  MINOR: document how to escape json parameters to ducktape tests (apache#8546)
  KAFKA-9885; Evict last members of a group when the maximum allowed is reached (apache#8525)
  KAFKA-9866: Avoid election for topics where preferred leader is not in ISR (apache#8524)
  KAFKA-9839; Broker should accept control requests with newer broker epoch (apache#8509)
  KAKFA-9612: Add an option to kafka-configs.sh to add configs from a prop file (KIP-574)
  MINOR: Partition is under reassignment when adding and removing (apache#8364)
  MINOR: reduce allocations in log start and recovery checkpoints (apache#8467)
  MINOR: Remove unused foreign-key join class (apache#8547)
  HOTFIX: Fix broker bounce system tests (apache#8532)
  KAFKA-9704: Fix the issue z/OS won't let us resize file when mmap. (apache#8224)
  KAFKA-8639: Replace AddPartitionsToTxn with Automated Protocol  (apache#8326)
  MINOR: equals() should compare all fields for generated classes (apache#8539)
  KAFKA-9844; Fix race condition which allows more than maximum number of members(apache#8454)
  KAFKA-9823: Remember the sent generation for the coordinator request (apache#8445)
  KAFKA-9883: Add better error message when REST API forwards a request and leader is not known (apache#8536)
  KAFKA-9907: Switch default build to Scala 2.13 (apache#8537)
  MINOR: Some html fixes in Streams DSL documentation (apache#8503)
  MINOR: Enable fatal warnings with scala 2.13 (apache#8429)
  KAFKA-9852: Change the max duration that calls to the buffer pool can block from 2000ms to 10ms to reduce overall test runtime (apache#8464)
  ...