
[improve][common] Optimize TopicName.get() to reduce lock contention on cache lookup#25367

Open
liangyepianzhou wants to merge 1 commit into apache:master from liangyepianzhou:optimize/TopicName-get

Conversation

@liangyepianzhou
Contributor

Motivation

TopicName.get() previously used ConcurrentHashMap.computeIfAbsent() to populate the topic-name cache. Although computeIfAbsent is atomic, it holds the internal bin-lock for the entire duration of the mapping function, which includes the non-trivial TopicName construction (string splitting, validation, etc.).

Under high-concurrency workloads where many threads simultaneously encounter the same uncached topic name, this causes unnecessary lock contention and can degrade throughput.

Modifications

Replace computeIfAbsent with an explicit two-step pattern:

  1. Fast path: call cache.get(topic) first — a single volatile read with no locking — and return immediately on a cache hit (steady-state case).
  2. Slow path (cache miss): construct TopicName outside the lock, then use cache.putIfAbsent() to insert. If two threads race on the same key, one wins the putIfAbsent and the other's instance is discarded; this is safe because TopicName is immutable.

Add a Javadoc comment on get() explaining the rationale.
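
The two-step pattern described above can be sketched as follows. This is an illustration based on the PR description, not the actual diff; the class and field names (and the `String` value standing in for `TopicName`) are assumptions.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the described two-step lookup; names are assumed from the
// PR description, not copied from the actual change.
public class TopicNameCacheSketch {
    private static final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

    /**
     * Fast path: a plain get() (a single volatile read, no locking) for the
     * steady-state cache hit. Slow path: construct the value outside any
     * bin-lock, then race via putIfAbsent(); safe because the cached value
     * is immutable, so a losing thread's instance can be discarded.
     */
    public static String get(String topic) {
        String cached = cache.get(topic);
        if (cached != null) {
            return cached;
        }
        String created = parse(topic);                    // constructed outside the lock
        String existing = cache.putIfAbsent(topic, created);
        return existing != null ? existing : created;
    }

    private static String parse(String topic) {
        // Stand-in for the non-trivial TopicName construction (splitting, validation, ...).
        return topic.trim();
    }
}
```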

Verifying this change

  • Make sure that the change passes the CI checks.

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@liangyepianzhou
Contributor Author

liangyepianzhou commented Mar 20, 2026

(flame graph image)

Background

While load-testing Pulsar with a single topic containing 1,000,000 partitions, we observed that TopicName.get() was consuming a disproportionate amount of CPU, showing up prominently in flame graphs.

Root Cause Analysis

Two compounding issues were identified:

  1. topicNameCacheMaxCapacity too small — With 1M partitions, the default cache capacity is inevitably exceeded, triggering repeated clearIfReachedMaxCapacity calls and causing cache stampedes where all partitions miss simultaneously.

  2. new TopicName() executed inside the lock — The current implementation uses computeIfAbsent(key, TopicName::new), which holds the ConcurrentHashMap bin-lock for the entire duration of object construction. Under concurrent cache misses, threads serialize on the same lock, degrading throughput significantly.

Fix

Replace computeIfAbsent with a lock-free get + out-of-lock construction + putIfAbsent pattern:

// Before
return cache.computeIfAbsent(topic, TopicName::new);

// After
TopicName cached = cache.get(topic);
if (cached != null) return cached;
TopicName created = new TopicName(topic);          // constructed outside lock
TopicName existing = cache.putIfAbsent(topic, created);
return existing != null ? existing : created;

Under concurrent misses on the same key, each thread constructs its own instance independently; putIfAbsent elects the winner and the losers are simply GC'd — eliminating bin-lock contention entirely.

Benchmark Results (JMH · SingleShotTime · 8 threads · 1M partitions · full cold cache)

| Benchmark | Avg (ms) | Median (ms) | p99 (ms) | Min (ms) |
|---|---|---|---|---|
| miss_A (computeIfAbsent) | 427.1 | 395.3 | 644.5 | 316.5 |
| miss_B (get + putIfAbsent) | 287.1 | 278.9 | 368.0 | 236.1 |
| Improvement | ~1.49× | ~1.42× | ~1.75× | ~1.34× |

The tail latency improvement (1.75× at p99) is especially significant: computeIfAbsent causes severe jitter under lock contention (316–644 ms range), while get+putIfAbsent stays stable (236–368 ms).
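
The contention scenario can be reproduced with a simplified, self-contained harness (this is not the JMH benchmark behind the numbers above; the topic-name format and the `expensiveParse` stand-in for `TopicName` construction are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Simplified stand-in for the JMH benchmark: N threads race on a cold cache.
// expensiveParse() simulates the non-trivial TopicName construction.
public class CachePatternDemo {
    static String expensiveParse(String topic) {
        // Stand-in for the validation/splitting done in the TopicName constructor.
        String[] parts = topic.split("/");
        return String.join("/", parts);
    }

    static long run(boolean useComputeIfAbsent, int threads, int keys) {
        ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    return;
                }
                for (int i = 0; i < keys; i++) {
                    String topic = "persistent://tenant/ns/topic-partition-" + i;
                    if (useComputeIfAbsent) {
                        // Holds the bin-lock while expensiveParse runs.
                        cache.computeIfAbsent(topic, CachePatternDemo::expensiveParse);
                    } else {
                        String cached = cache.get(topic);   // lock-free fast path
                        if (cached == null) {
                            // Constructed outside any lock; losing copies are GC'd.
                            cache.putIfAbsent(topic, expensiveParse(topic));
                        }
                    }
                }
            });
            workers[t].start();
        }
        long t0 = System.nanoTime();
        start.countDown();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        long a = run(true, 8, 100_000);
        long b = run(false, 8, 100_000);
        System.out.printf("computeIfAbsent: %d ms, get+putIfAbsent: %d ms%n",
                a / 1_000_000, b / 1_000_000);
    }
}
```

Relative timings will vary by machine and JVM; the harness only demonstrates the access pattern, not the reported numbers.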


@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Mar 20, 2026
Contributor

@merlimat merlimat left a comment


Change looks good.

Can you try comparing with just calling cache.put() instead of putIfAbsent()?

It would be good if you could add the JMH code to the microbenchmarks module.

@lhotari
Member

lhotari commented Mar 20, 2026

Related PR from the past: #24457 (the change was rejected, so I closed it), with a lot of interesting comments.
One of the comments came from the Caffeine author: #24457 (comment).
Besides performance, another detail to consider is the duplication of java.lang.String instances in memory.

Member

@lhotari lhotari left a comment


This doesn't address one of the key problems which is that the cache is cleared if the size exceeds the max size setting (topicNameCacheMaxCapacity):

public static void clearIfReachedMaxCapacity(int maxCapacity) {
    if (maxCapacity < 0) {
        // Unlimited cache.
        return;
    }
    if (cache.size() > maxCapacity) {
        cache.clear();
    }
}

We'd be better off by switching to use Caffeine again. The Caffeine author commented in #24457 (comment) that the bottleneck in the earlier version has been addressed. Since we have upgraded to Java 17 for the client, we can use the newer Caffeine version.
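
The difference between the clear-everything behavior above and bounded eviction can be illustrated with a minimal stdlib sketch. Caffeine's `maximumSize()`/`maximumWeight()` provide this with a concurrent, higher-quality eviction policy; this `LinkedHashMap`-based class (name and capacity are illustrative) only shows the idea of evicting one entry at a time instead of dropping the whole cache.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative bounded LRU cache: evicts one entry at a time instead of
// clearing everything, which avoids the "all partitions miss at once" stampede.
public class BoundedLruCache<K, V> {
    private final Map<K, V> map;

    public BoundedLruCache(int maxCapacity) {
        // accessOrder=true gives LRU ordering; removeEldestEntry bounds the size.
        LinkedHashMap<K, V> lru = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxCapacity;
            }
        };
        // A global lock is acceptable for a sketch; Caffeine avoids it.
        this.map = Collections.synchronizedMap(lru);
    }

    public V get(K key) { return map.get(key); }
    public void put(K key, V value) { map.put(key, value); }
    public int size() { return map.size(); }
}
```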

@lhotari
Member

lhotari commented Mar 20, 2026

The flamegraph in #25367 (comment) looks like it was from an old version of Pulsar that uses Guava Cache (before #23052 <3.0.6, <3.3.1).
@liangyepianzhou What version of Pulsar are you load testing?

@lhotari
Member

lhotari commented Mar 20, 2026

The tail latency improvement (1.75× at p99) is especially significant: computeIfAbsent causes severe jitter under lock contention (316–644 ms range), while get+putIfAbsent stays stable (236–368 ms).

The benchmark seems to measure misses of 1M cache entries. It's not a very realistic scenario that there would be that many misses at once.

I think it would be more useful to have a solution that ensures that the memory use of the cache is bounded.
That is something that I have added quite recently to AbstractMetadataStore children cache:

long childrenCacheMaxSizeBytes = getChildrenCacheMaxSizeBytes();
Caffeine<Object, Object> childrenCacheBuilder = Caffeine.newBuilder()
        .recordStats()
        .refreshAfterWrite(CACHE_REFRESH_TIME_MILLIS, TimeUnit.MILLISECONDS)
        .expireAfterWrite(CACHE_REFRESH_TIME_MILLIS * 2, TimeUnit.MILLISECONDS);
if (childrenCacheMaxSizeBytes > 0) {
    childrenCacheBuilder.maximumWeight(childrenCacheMaxSizeBytes)
            .weigher((String key, List<String> children) -> {
                // calculate the total byte size of the key and entries in the children list
                // to get some estimation of the heap memory required for the entry.
                // add 16 bytes overhead for the Java object header and 16 bytes for java.lang.String fields.
                int totalSize = ByteBufUtil.utf8Bytes(key) + 32;
                for (String child : children) {
                    totalSize += ByteBufUtil.utf8Bytes(child) + 32;
                }
                return totalSize;
            });
}

in #24868
The max size is limited to 20% of the max heap size.

For TopicName cache there could be a bytesize limit too. The StringInterner solution would be useful at least for namespace and tenant Strings to ensure that those don't cause heap duplication if the namespace or tenant caches expire or overflow.
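
The core of an interning approach can be sketched in a few lines (a rough illustration only; Pulsar's actual StringInterner may differ, and a production version would bound the pool or use weak references rather than growing without limit):

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative string interner for tenant/namespace components: equal strings
// parsed from different topic names collapse into one canonical instance, so
// cache expiry or overflow doesn't leave duplicate copies on the heap.
public class SimpleStringInterner {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    public String intern(String s) {
        // putIfAbsent returns the existing canonical instance, or null if s won.
        String canonical = pool.putIfAbsent(s, s);
        return canonical != null ? canonical : s;
    }
}
```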

@lhotari
Member

lhotari commented Mar 20, 2026

The benchmark doesn't use actual code that is part of the Pulsar code base.
I believe that this is a more representative way: https://github.com/lhotari/pulsar/blob/lh-fix-topicname-memory-leak/microbench/src/main/java/org/apache/pulsar/common/naming/TopicNameBenchmark.java
That was done in #25367 which wasn't merged.

@liangyepianzhou
Contributor Author

The flamegraph in #25367 (comment) looks like it was from an old version of Pulsar that uses Guava Cache (before #23052 <3.0.6, <3.3.1). @liangyepianzhou What version of Pulsar are you load testing?

The flamegraph comes from 3.0.5. I changed the version to 3.0.13 and set topicNameCacheMaxCapacity = -1. CPU usage dropped from 90% to 9%, but the percentage of TopicName.get() in the flame graph is still very high.

The benchmark seems to measure the miss of 1M cache entries. That's not a very realistic scenario that there would be such amount of misses at once.

In my scenario, one topic has 1 million partitions, so there are 1 million cache misses during the initial startup. Topic loading is currently very slow, and I suspect this may be one of the reasons.
