
[improve][common] Optimize TopicName.get() to reduce lock contention on cache lookup#25367

Open
liangyepianzhou wants to merge 1 commit into apache:master from liangyepianzhou:optimize/TopicName-get

Conversation

@liangyepianzhou
Contributor

Motivation

TopicName.get() previously used ConcurrentHashMap.computeIfAbsent() to populate the topic-name cache. Although computeIfAbsent is atomic, it holds the internal bin-lock for the entire duration of the mapping function, which includes the non-trivial TopicName construction (string splitting, validation, etc.).

Under high-concurrency workloads where many threads simultaneously encounter the same uncached topic name, this causes unnecessary lock contention and can degrade throughput.

Modifications

Replace computeIfAbsent with an explicit two-step pattern:

  1. Fast path: call cache.get(topic) first — a single volatile read with no locking — and return immediately on a cache hit (steady-state case).
  2. Slow path (cache miss): construct TopicName outside the lock, then use cache.putIfAbsent() to insert. If two threads race on the same key, one wins the putIfAbsent and the other's instance is discarded; this is safe because TopicName is immutable.

Add a Javadoc comment on get() explaining the rationale.
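
The two-step pattern described above can be sketched as follows. This is an illustration based on the PR description, not the actual diff; the class and field names (and the `String` value standing in for `TopicName`) are assumptions.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the described two-step lookup; names are assumed from the
// PR description, not copied from the actual change.
public class TopicNameCacheSketch {
    private static final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

    /**
     * Fast path: a plain get() (a single volatile read, no locking) for the
     * steady-state cache hit. Slow path: construct the value outside any
     * bin-lock, then race via putIfAbsent(); safe because the cached value
     * is immutable, so a losing thread's instance can be discarded.
     */
    public static String get(String topic) {
        String cached = cache.get(topic);
        if (cached != null) {
            return cached;
        }
        String created = parse(topic);                    // constructed outside the lock
        String existing = cache.putIfAbsent(topic, created);
        return existing != null ? existing : created;
    }

    private static String parse(String topic) {
        // Stand-in for the non-trivial TopicName construction (splitting, validation, ...).
        return topic.trim();
    }
}
```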

Verifying this change

  • Make sure that the change passes the CI checks.

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@liangyepianzhou
Contributor Author

liangyepianzhou commented Mar 20, 2026

(flame graph image)

Background

While load-testing Pulsar with a single topic containing 1,000,000 partitions, we observed that TopicName.get() was consuming a disproportionate amount of CPU, showing up prominently in flame graphs.

Root Cause Analysis

Two compounding issues were identified:

  1. topicNameCacheMaxCapacity too small — With 1M partitions, the default cache capacity is inevitably exceeded, triggering repeated clearIfReachedMaxCapacity calls and causing cache stampedes where all partitions miss simultaneously.

  2. new TopicName() executed inside the lock — The current implementation uses computeIfAbsent(key, TopicName::new), which holds the ConcurrentHashMap bin-lock for the entire duration of object construction. Under concurrent cache misses, threads serialize on the same lock, degrading throughput significantly.

Fix

Replace computeIfAbsent with a lock-free get + out-of-lock construction + putIfAbsent pattern:

// Before
return cache.computeIfAbsent(topic, TopicName::new);

// After
TopicName cached = cache.get(topic);
if (cached != null) return cached;
TopicName created = new TopicName(topic);          // constructed outside lock
TopicName existing = cache.putIfAbsent(topic, created);
return existing != null ? existing : created;

Under concurrent misses on the same key, each thread constructs its own instance independently; putIfAbsent elects the winner and the losers are simply GC'd — eliminating bin-lock contention entirely.

Benchmark Results (JMH · SingleShotTime · 8 threads · 1M partitions · full cold cache)

| Benchmark | Avg (ms) | Median (ms) | p99 (ms) | Min (ms) |
|---|---|---|---|---|
| miss_A (computeIfAbsent) | 427.1 | 395.3 | 644.5 | 316.5 |
| miss_B (get + putIfAbsent) | 287.1 | 278.9 | 368.0 | 236.1 |
| Improvement | ~1.49× | ~1.42× | ~1.75× | ~1.34× |

The tail latency improvement (1.75× at p99) is especially significant: computeIfAbsent causes severe jitter under lock contention (316–644 ms range), while get+putIfAbsent stays stable (236–368 ms).
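
The contention scenario can be reproduced with a simplified, self-contained harness (this is not the JMH benchmark behind the numbers above; the topic-name format and the `expensiveParse` stand-in for `TopicName` construction are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Simplified stand-in for the JMH benchmark: N threads race on a cold cache.
// expensiveParse() simulates the non-trivial TopicName construction.
public class CachePatternDemo {
    static String expensiveParse(String topic) {
        // Stand-in for the validation/splitting done in the TopicName constructor.
        String[] parts = topic.split("/");
        return String.join("/", parts);
    }

    static long run(boolean useComputeIfAbsent, int threads, int keys) {
        ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    return;
                }
                for (int i = 0; i < keys; i++) {
                    String topic = "persistent://tenant/ns/topic-partition-" + i;
                    if (useComputeIfAbsent) {
                        // Holds the bin-lock while expensiveParse runs.
                        cache.computeIfAbsent(topic, CachePatternDemo::expensiveParse);
                    } else {
                        String cached = cache.get(topic);   // lock-free fast path
                        if (cached == null) {
                            // Constructed outside any lock; losing copies are GC'd.
                            cache.putIfAbsent(topic, expensiveParse(topic));
                        }
                    }
                }
            });
            workers[t].start();
        }
        long t0 = System.nanoTime();
        start.countDown();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        long a = run(true, 8, 100_000);
        long b = run(false, 8, 100_000);
        System.out.printf("computeIfAbsent: %d ms, get+putIfAbsent: %d ms%n",
                a / 1_000_000, b / 1_000_000);
    }
}
```

Relative timings will vary by machine and JVM; the harness only demonstrates the access pattern, not the reported numbers.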


@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Mar 20, 2026
Contributor

@merlimat merlimat left a comment


Change looks good.

Can you try comparing with just calling cache.put() instead of putIfAbsent()?

It would be good if you could add the JMH code to the microbenchmarks module.

@lhotari
Member

lhotari commented Mar 20, 2026

Related PR from the past: #24457 (the change was rejected, so I closed it), with a lot of interesting comments.
One of the comments came from the Caffeine author: #24457 (comment).
Besides performance, another detail to consider is the duplication of java.lang.String instances in memory.

Member

@lhotari lhotari left a comment


This doesn't address one of the key problems which is that the cache is cleared if the size exceeds the max size setting (topicNameCacheMaxCapacity):

public static void clearIfReachedMaxCapacity(int maxCapacity) {
    if (maxCapacity < 0) {
        // Unlimited cache.
        return;
    }
    if (cache.size() > maxCapacity) {
        cache.clear();
    }
}

We'd be better off by switching to use Caffeine again. The Caffeine author commented in #24457 (comment) that the bottleneck in the earlier version has been addressed. Since we have upgraded to Java 17 for the client, we can use the newer Caffeine version.
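
The difference between the clear-everything behavior above and bounded eviction can be illustrated with a minimal stdlib sketch. Caffeine's `maximumSize()`/`maximumWeight()` provide this with a concurrent, higher-quality eviction policy; this `LinkedHashMap`-based class (name and capacity are illustrative) only shows the idea of evicting one entry at a time instead of dropping the whole cache.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative bounded LRU cache: evicts one entry at a time instead of
// clearing everything, which avoids the "all partitions miss at once" stampede.
public class BoundedLruCache<K, V> {
    private final Map<K, V> map;

    public BoundedLruCache(int maxCapacity) {
        // accessOrder=true gives LRU ordering; removeEldestEntry bounds the size.
        LinkedHashMap<K, V> lru = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxCapacity;
            }
        };
        // A global lock is acceptable for a sketch; Caffeine avoids it.
        this.map = Collections.synchronizedMap(lru);
    }

    public V get(K key) { return map.get(key); }
    public void put(K key, V value) { map.put(key, value); }
    public int size() { return map.size(); }
}
```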

@lhotari
Member

lhotari commented Mar 20, 2026

The flamegraph in #25367 (comment) looks like it was from an old version of Pulsar that uses Guava Cache (before #23052 <3.0.6, <3.3.1).
@liangyepianzhou What version of Pulsar are you load testing?

@lhotari
Member

lhotari commented Mar 20, 2026

The tail latency improvement (1.75× at p99) is especially significant: computeIfAbsent causes severe jitter under lock contention (316–644 ms range), while get+putIfAbsent stays stable (236–368 ms).

The benchmark seems to measure misses of 1M cache entries. It's not a very realistic scenario that there would be that many misses at once.

I think it would be more useful to have a solution that ensures that the memory use of the cache is bounded.
That is something that I have added quite recently to AbstractMetadataStore children cache:

long childrenCacheMaxSizeBytes = getChildrenCacheMaxSizeBytes();
Caffeine<Object, Object> childrenCacheBuilder = Caffeine.newBuilder()
        .recordStats()
        .refreshAfterWrite(CACHE_REFRESH_TIME_MILLIS, TimeUnit.MILLISECONDS)
        .expireAfterWrite(CACHE_REFRESH_TIME_MILLIS * 2, TimeUnit.MILLISECONDS);
if (childrenCacheMaxSizeBytes > 0) {
    childrenCacheBuilder.maximumWeight(childrenCacheMaxSizeBytes)
            .weigher((String key, List<String> children) -> {
                // calculate the total byte size of the key and entries in the children list
                // to get some estimation of the heap memory required for the entry.
                // add 16 bytes overhead for the Java object header and 16 bytes for java.lang.String fields.
                int totalSize = ByteBufUtil.utf8Bytes(key) + 32;
                for (String child : children) {
                    totalSize += ByteBufUtil.utf8Bytes(child) + 32;
                }
                return totalSize;
            });
}

in #24868
The max size is limited to 20% of the max heap size.

For TopicName cache there could be a bytesize limit too. The StringInterner solution would be useful at least for namespace and tenant Strings to ensure that those don't cause heap duplication if the namespace or tenant caches expire or overflow.
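
The core of an interning approach can be sketched in a few lines (a rough illustration only; Pulsar's actual StringInterner may differ, and a production version would bound the pool or use weak references rather than growing without limit):

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative string interner for tenant/namespace components: equal strings
// parsed from different topic names collapse into one canonical instance, so
// cache expiry or overflow doesn't leave duplicate copies on the heap.
public class SimpleStringInterner {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    public String intern(String s) {
        // putIfAbsent returns the existing canonical instance, or null if s won.
        String canonical = pool.putIfAbsent(s, s);
        return canonical != null ? canonical : s;
    }
}
```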

@lhotari
Member

lhotari commented Mar 20, 2026

The benchmark doesn't use actual code that is part of the Pulsar code base.
I believe that this is a more representative way: https://github.com/lhotari/pulsar/blob/lh-fix-topicname-memory-leak/microbench/src/main/java/org/apache/pulsar/common/naming/TopicNameBenchmark.java
That was done in #25367 which wasn't merged.

@liangyepianzhou
Contributor Author

The flamegraph in #25367 (comment) looks like it was from an old version of Pulsar that uses Guava Cache (before #23052 <3.0.6, <3.3.1). @liangyepianzhou What version of Pulsar are you load testing?

The flamegraph comes from 3.0.5. I changed the version to 3.0.13 and set topicNameCacheMaxCapacity = -1. CPU usage dropped from 90% to 9%, but the percentage of TopicName.get() in the flame graph is still very high.

The benchmark seems to measure the miss of 1M cache entries. That's not a very realistic scenario that there would be such amount of misses at once.

In my scenario, one topic has 1 million partitions, so there are 1 million cache misses during the initial startup. Topic loading is currently very slow, and I suspect this may be one of the reasons.
