
MINOR: reduce allocations in log start and recovery checkpoints #8467

Merged
7 commits merged into apache:trunk on Apr 25, 2020

Conversation

@lbradstreet (Contributor) commented Apr 12, 2020

For brokers with replica counts > 4000, allocations from logsByDir become substantial. logsByDir is called often by LogManager.checkpointLogRecoveryOffsets and LogManager.checkpointLogStartOffsets. The approach used is similar to the one from the checkpointHighwatermarks change in #6741.

Are there better ways to structure our data structures so that we do not have to build logsByDir on demand for each checkpoint iteration? This micro-optimization helps as is, but it would be better if we could avoid the rebuild entirely.
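For context, here is a toy illustration (not the exact pre-PR code) of the allocation-heavy style this change moves away from: a groupBy-based grouping allocates a (TopicPartition, Log) tuple per entry plus intermediate collections every time it runs, whereas the patch builds the nested map in a single pass over mutable per-directory maps (see the hunks quoted in the review below).

import org.apache.kafka.common.TopicPartition
import kafka.log.Log

// Illustrative only: group a map of TopicPartition -> Log by parent log directory.
// Every checkpoint pass that does this pays for one tuple per entry plus the
// intermediate collections that groupBy materialises.
def byDirGroupBy(logs: collection.Map[TopicPartition, Log]): collection.Map[String, collection.Map[TopicPartition, Log]] =
  logs.groupBy { case (_, log) => log.parentDir }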

JMH benchmark results:

Before:
Benchmark                                                                      (numPartitions)  (numTopics)   Mode  Cnt        Score        Error   Units
CheckpointBench.measureCheckpointLogStartOffsets                                             3          100  thrpt   15        2.233 ±      0.013  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate                              3          100  thrpt   15      477.097 ±     49.731  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm                         3          100  thrpt   15   246083.007 ±     33.052    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space                     3          100  thrpt   15      475.683 ±     55.569  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space.norm                3          100  thrpt   15   245474.040 ±  14968.328    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen                        3          100  thrpt   15        0.001 ±      0.001  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen.norm                   3          100  thrpt   15        0.341 ±      0.268    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.count                                   3          100  thrpt   15      129.000               counts
CheckpointBench.measureCheckpointLogStartOffsets:·gc.time                                    3          100  thrpt   15       52.000                   ms
CheckpointBench.measureCheckpointLogStartOffsets                                             3         1000  thrpt   15        0.572 ±      0.004  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate                              3         1000  thrpt   15     1360.240 ±    150.539  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm                         3         1000  thrpt   15  2750221.257 ±    891.024    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space                     3         1000  thrpt   15     1362.908 ±    148.799  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space.norm                3         1000  thrpt   15  2756395.092 ±  44671.843    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen                        3         1000  thrpt   15        0.017 ±      0.008  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen.norm                   3         1000  thrpt   15       33.611 ±     14.401    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.count                                   3         1000  thrpt   15      273.000               counts
CheckpointBench.measureCheckpointLogStartOffsets:·gc.time                                    3         1000  thrpt   15      186.000                   ms
CheckpointBench.measureCheckpointLogStartOffsets                                             3         2000  thrpt   15        0.266 ±      0.002  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate                              3         2000  thrpt   15     1342.557 ±    171.260  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm                         3         2000  thrpt   15  5877881.729 ±   3695.086    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space                     3         2000  thrpt   15     1343.965 ±    186.069  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space.norm                3         2000  thrpt   15  5877788.561 ± 168540.343    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen                        3         2000  thrpt   15        0.081 ±      0.043  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen.norm                   3         2000  thrpt   15      351.277 ±    167.006    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.count                                   3         2000  thrpt   15      253.000               counts
CheckpointBench.measureCheckpointLogStartOffsets:·gc.time                                    3         2000  thrpt   15      231.000                   ms
JMH benchmarks done

After:
CheckpointBench.measureCheckpointLogStartOffsets                                             3          100  thrpt   15        2.809 ±     0.129  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate                              3          100  thrpt   15      211.248 ±    25.953  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm                         3          100  thrpt   15    86533.838 ±  3763.989    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space                     3          100  thrpt   15      211.512 ±    38.669  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space.norm                3          100  thrpt   15    86228.552 ±  9590.781    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen                        3          100  thrpt   15       ≈ 10⁻³              MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen.norm                   3          100  thrpt   15        0.140 ±     0.111    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.count                                   3          100  thrpt   15       57.000              counts
CheckpointBench.measureCheckpointLogStartOffsets:·gc.time                                    3          100  thrpt   15       25.000                  ms
CheckpointBench.measureCheckpointLogStartOffsets                                             3         1000  thrpt   15        1.046 ±     0.030  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate                              3         1000  thrpt   15      524.597 ±    74.793  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm                         3         1000  thrpt   15   582898.889 ± 37552.262    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space                     3         1000  thrpt   15      519.675 ±    89.754  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space.norm                3         1000  thrpt   15   576371.150 ± 55972.955    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen                        3         1000  thrpt   15        0.009 ±     0.005  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen.norm                   3         1000  thrpt   15        9.920 ±     5.375    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.count                                   3         1000  thrpt   15      111.000              counts
CheckpointBench.measureCheckpointLogStartOffsets:·gc.time                                    3         1000  thrpt   15       56.000                  ms
CheckpointBench.measureCheckpointLogStartOffsets                                             3         2000  thrpt   15        0.617 ±     0.007  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate                              3         2000  thrpt   15      573.061 ±    95.931  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm                         3         2000  thrpt   15  1092098.004 ± 75140.633    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space                     3         2000  thrpt   15      572.448 ±    97.960  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Eden_Space.norm                3         2000  thrpt   15  1091290.460 ± 85946.164    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen                        3         2000  thrpt   15        0.010 ±     0.012  MB/sec
CheckpointBench.measureCheckpointLogStartOffsets:·gc.churn.G1_Old_Gen.norm                   3         2000  thrpt   15       19.990 ±    24.407    B/op
CheckpointBench.measureCheckpointLogStartOffsets:·gc.count                                   3         2000  thrpt   15      109.000              counts
CheckpointBench.measureCheckpointLogStartOffsets:·gc.time                                    3         2000  thrpt   15       67.000                  ms
JMH benchmarks done

For the 2000-topic, 3-partition case, we see a reduction in normalized allocations from 5877881 B/op to 1284190.774 B/op, a reduction of 78%.

Some allocation profiles from a mid-sized broker follow. I have seen worse, but these add up to around 3.8% of allocations on a broker whose GC overhead was around 30% of CPU time. You could argue that this is relatively small, but it seems worthwhile for a low-risk change.
[allocation profile screenshots]

def logsByDir: Map[String, Map[TopicPartition, Log]] = {
  val byDir = new mutable.HashMap[String, mutable.HashMap[TopicPartition, Log]]()
  def addToDir(tp: TopicPartition, log: Log): Unit = {
    byDir.getOrElseUpdate(log.parentDir, new mutable.HashMap[TopicPartition, Log]()).put(tp, log)
Contributor:
In the other PR we used AnyRefMap. Any reason why we are not doing that here?

Contributor:
We should also add a comment about why we are writing non-idiomatic code here, to avoid it being reverted later.

Contributor Author:
Good point. I will try it and retest and add a comment to prevent regression.

Contributor Author:
I have addressed this and it's ready for another review pass.

@ijuma (Contributor) commented Apr 17, 2020

ok to test

@lbradstreet (Contributor Author):
Updated benchmarks with AnyRefMap change:

Benchmark                                         (numPartitions)  (numTopics)   Mode  Cnt  Score   Error   Units
CheckpointBench.measureCheckpointLogStartOffsets                3          100  thrpt   15  2.673 ± 0.027  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets                3         1000  thrpt   15  1.066 ± 0.012  ops/ms
CheckpointBench.measureCheckpointLogStartOffsets                3         2000  thrpt   15  0.609 ± 0.007  ops/ms

Looks marginally better. I'm re-running with -prof gc so I can include the final results in the commit.

@ijuma (Contributor) commented Apr 19, 2020

ok to test

@ijuma (Contributor) commented Apr 23, 2020

retest this please

    byDir.getOrElseUpdate(log.parentDir, new mutable.AnyRefMap[TopicPartition, Log]()).put(tp, log)
  }
  currentLogs.foreach { case (tp, log) => addToDir(tp, log) }
  futureLogs.foreach { case (tp, log) => addToDir(tp, log) }
@ijuma (Contributor), Apr 23, 2020:
scala.collection.Map has a def foreachEntry[U](f: (K, V) => U): Unit method that avoids allocating tuples (it's a function with two parameters). If we want to avoid allocations, we should introduce a similar method to Pool and use it here.
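For reference, a small usage example of the standard-library method being referenced here (available in Scala 2.13; Pool would need its own equivalent since it wraps a ConcurrentHashMap):

val m = Map("a" -> 1, "b" -> 2)
// foreachEntry takes a two-argument function, so no (K, V) tuple is created per element.
m.foreachEntry((k, v) => println(s"$k -> $v"))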

Contributor Author:
I tried out your suggestion and, interestingly, saw more allocations. Note the deliberately non-idiomatic use of a var for item, in case the usual tuple destructuring added an extra allocation.

--- a/core/src/main/scala/kafka/log/LogManager.scala
+++ b/core/src/main/scala/kafka/log/LogManager.scala
@@ -1011,8 +1011,8 @@ class LogManager(logDirs: Seq[File],
     def addToDir(tp: TopicPartition, log: Log): Unit = {
       byDir.getOrElseUpdate(log.parentDir, new mutable.AnyRefMap[TopicPartition, Log]()).put(tp, log)
     }
-    currentLogs.foreach { case (tp, log) => addToDir(tp, log) }
-    futureLogs.foreach { case (tp, log) => addToDir(tp, log) }
+    currentLogs.foreachEntry(addToDir)
+    futureLogs.foreachEntry(addToDir)
     byDir
   }
 
diff --git a/core/src/main/scala/kafka/utils/Pool.scala b/core/src/main/scala/kafka/utils/Pool.scala
index 964de7eae2..450f90611b 100644
--- a/core/src/main/scala/kafka/utils/Pool.scala
+++ b/core/src/main/scala/kafka/utils/Pool.scala
@@ -74,6 +74,15 @@ class Pool[K,V](valueFactory: Option[K => V] = None) extends Iterable[(K, V)] {
   def values: Iterable[V] = pool.values.asScala
 
   def clear(): Unit = { pool.clear() }
+
+  def foreachEntry(f: (K, V) => Unit): Unit = {
+    val iter = iterator
+    var item: (K, V) = null
+    while(iter.hasNext) {
+      item = iter.next()
+      f(item._1, item._2)
+    }
+  }

CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm 3 2000 thrpt 15 1428889.965 ± 75131.113 B/op
vs this PR:
CheckpointBench.measureCheckpointLogStartOffsets:·gc.alloc.rate.norm 3 2000 thrpt 15 1284326.850 ± 75148.430 B/op

Contributor Author:
^ I updated the text above to say "this PR" rather than trunk.

Contributor:
I think the issue is that you're not using the underlying map iterator, right?

Contributor:
You have to use the iterator of the underlying ConcurrentHashMap[K, V], which gives you Map.Entry instead of a tuple. Then you avoid the allocation.
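A minimal sketch of that suggestion, assuming pool is the backing java.util.concurrent.ConcurrentHashMap[K, V] field inside Pool (illustrative, not the code that was ultimately merged):

// Iterate the backing map's entrySet directly: each element is a
// java.util.Map.Entry[K, V], so no Scala (K, V) tuple is allocated per entry.
def foreachEntry(f: (K, V) => Unit): Unit = {
  val iter = pool.entrySet.iterator
  while (iter.hasNext) {
    val entry = iter.next()
    f(entry.getKey, entry.getValue)
  }
}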

Contributor Author:
Do you know what I'm doing wrong when I supply a lambda to the Java .forEach? I know it works for Consumers, but in my brief attempt at improving this I couldn't make it work with the BiConsumer here and had to create a new one.

I will update the PR body with the re-run JMH results.

Contributor:
What error are you getting?

Contributor Author:
AFAICT, it's easy enough to do something like this:

pool.forEach((k,v) => f(k,v))

But then we appear to be back to allocating tuples.

It doesn't appear easy to do something like:

pool.forEach(kv => f(kv.getKey, kv.getValue))

since it's not able to create a BiConsumer for you.

Contributor:
Why do you say that the following would allocate tuples?

def foreachEntry(f: (K, V) => Unit): Unit = {
  pool.forEach((k, v) => f(k, v))
}

Did you measure it? Looking at the code, I don't see any tuples.

Contributor Author:
@ijuma I had mistakenly thought it was taking a tuple there; maybe that was just the way Scala converted the lambda to a BiConsumer. The results had also regressed back to close to what I was seeing with the iterator version.

When I ran it, I saw:

With the lambda version:

1140074.937 ± 75151.914    B/op

def foreachEntry(f: (K, V) => Unit): Unit = {
  pool.forEach((k, v) => f(k, v))
}

With the explicit BiConsumer version (included in an above comment):

1044120.722 ±  1676.137    B/op

pool.forEach(new BiConsumer[K, V] {
  override def accept(t: K, u: V): Unit = f(t, u)
})

Strangely, I have just re-run the BiConsumer version and it returned 1188097.413 ± 1639.537 B/op. I'm not sure why it has regressed from what I saw on a previous run. I am OK with using the version with a lambda if you are; I'm not sure I will have time to investigate further, and it's still a good improvement compared to the version that used foreach.

@lbradstreet (Contributor Author):
I have updated the benchmark results in the OP with the new forEach code. It is strictly better than the old foreach code.
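For reference, the end state discussed in this thread looks roughly like the following sketch, reconstructed from the hunks quoted above (it may not match the merged code verbatim): Pool gains a tuple-free foreachEntry based on the Java forEach, and logsByDir uses it together with per-directory AnyRefMaps.

// In kafka.utils.Pool, where pool is the backing ConcurrentHashMap[K, V]:
// the two-argument Scala lambda is SAM-converted to a java.util.function.BiConsumer,
// so no (K, V) tuple is allocated per entry.
def foreachEntry(f: (K, V) => Unit): Unit = {
  pool.forEach((k, v) => f(k, v))
}

// In kafka.log.LogManager: build the per-directory map in a single pass,
// reusing one mutable AnyRefMap per log directory instead of materialising
// intermediate collections on every checkpoint.
def logsByDir: Map[String, Map[TopicPartition, Log]] = {
  val byDir = new mutable.AnyRefMap[String, mutable.AnyRefMap[TopicPartition, Log]]()
  def addToDir(tp: TopicPartition, log: Log): Unit = {
    byDir.getOrElseUpdate(log.parentDir, new mutable.AnyRefMap[TopicPartition, Log]()).put(tp, log)
  }
  currentLogs.foreachEntry(addToDir)
  futureLogs.foreachEntry(addToDir)
  byDir
}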

@ijuma (Contributor) left a review:
LGTM, thanks for the PR!

@ijuma (Contributor) commented Apr 24, 2020

ok to test

@ijuma (Contributor) commented Apr 24, 2020

ok to test

@ijuma (Contributor) commented Apr 24, 2020

retest this please

@ijuma (Contributor) commented Apr 24, 2020

ok to test

@ijuma ijuma merged commit cfc34ca into apache:trunk Apr 25, 2020
ijuma added a commit to ijuma/kafka that referenced this pull request Apr 29, 2020
…t-for-generated-requests

* apache-github/trunk: (366 commits)
  MINOR: Improve producer test BufferPoolTest#testCloseNotifyWaiters. (apache#7982)
  MINOR: document how to escape json parameters to ducktape tests (apache#8546)
  KAFKA-9885; Evict last members of a group when the maximum allowed is reached (apache#8525)
  KAFKA-9866: Avoid election for topics where preferred leader is not in ISR (apache#8524)
  KAFKA-9839; Broker should accept control requests with newer broker epoch (apache#8509)
  KAKFA-9612: Add an option to kafka-configs.sh to add configs from a prop file (KIP-574)
  MINOR: Partition is under reassignment when adding and removing (apache#8364)
  MINOR: reduce allocations in log start and recovery checkpoints (apache#8467)
  MINOR: Remove unused foreign-key join class (apache#8547)
  HOTFIX: Fix broker bounce system tests (apache#8532)
  KAFKA-9704: Fix the issue z/OS won't let us resize file when mmap. (apache#8224)
  KAFKA-8639: Replace AddPartitionsToTxn with Automated Protocol  (apache#8326)
  MINOR: equals() should compare all fields for generated classes (apache#8539)
  KAFKA-9844; Fix race condition which allows more than maximum number of members(apache#8454)
  KAFKA-9823: Remember the sent generation for the coordinator request (apache#8445)
  KAFKA-9883: Add better error message when REST API forwards a request and leader is not known (apache#8536)
  KAFKA-9907: Switch default build to Scala 2.13 (apache#8537)
  MINOR: Some html fixes in Streams DSL documentation (apache#8503)
  MINOR: Enable fatal warnings with scala 2.13 (apache#8429)
  KAFKA-9852: Change the max duration that calls to the buffer pool can block from 2000ms to 10ms to reduce overall test runtime (apache#8464)
  ...