Reduce Heap Usage of OnHeapStringDictionary #12078

vvivekiyer · 2023-12-01T00:32:32Z

Our OnHeapStringDictionary implementation can result in a lot of wasted heap usage if there are enough duplicates in a column.

Below is the JXray analysis of a heap dump from one use case at LinkedIn, where the OnHeapStringDictionary uses about 13 GB of heap.

String interning, as described in https://www.baeldung.com/string/intern, solves this problem. However, for certain high-cardinality columns (even ones with enough duplicates), interning can be counterproductive. We can address that with a fixed-size interner, as described in the following article: https://dzone.com/articles/duplicate-strings-how-to-get-rid-of-them-and-save.
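For readers unfamiliar with the idea, here is a minimal sketch of a fixed-size interner. It is not the PoC code from this issue: the class name `FixedSizeInterner`, the power-of-two capacity requirement, and the hash-spreading step are all assumptions made for illustration. Each value hashes to one slot in a bounded array; a hit returns the cached instance, and a miss overwrites the slot, so memory stays bounded even for high-cardinality columns.

```java
// Hypothetical sketch of a fixed-size interner; names and details are assumptions.
final class FixedSizeInterner<T> {
  private final Object[] cache; // bounded: never grows beyond the requested capacity
  private final int mask;

  FixedSizeInterner(int capacity) {
    // Assumption for simplicity: capacity is a power of two so we can mask instead of mod.
    if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
      throw new IllegalArgumentException("capacity must be a positive power of two");
    }
    cache = new Object[capacity];
    mask = capacity - 1;
  }

  @SuppressWarnings("unchecked")
  T intern(T value) {
    int h = value.hashCode();
    int slot = (h ^ (h >>> 16)) & mask; // spread high bits into the slot index
    Object cached = cache[slot];
    if (value.equals(cached)) {
      return (T) cached; // duplicate: hand back the instance already held
    }
    cache[slot] = value; // miss or collision: evict whatever was in the slot
    return value;
  }

  public static void main(String[] args) {
    FixedSizeInterner<String> interner = new FixedSizeInterner<>(1024);
    String a = interner.intern(new String("pinot"));
    String b = interner.intern(new String("pinot"));
    System.out.println(a == b); // true: the second call returns the cached instance
  }
}
```

This sketch is not synchronized, and it relies on `equals`/`hashCode`, so interning `byte[]` values would need a variant based on `Arrays.equals` and `Arrays.hashCode`.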

I built a PoC of this change for one of our use cases and observed large savings in heap usage. Below is the heap dump analysis with my PoC change. Note that I used a size of 32M for the fixed-size interner.

I'm planning to expose a new tableIndexConfig called onHeapDictionaryConfig that will allow us to enable interning and control the size of the fixed-size interner.


vvivekiyer · 2023-12-01T00:36:51Z

cc: @Jackie-Jiang @siddharthteotia

kishoreg · 2023-12-01T00:45:04Z

Why do we need to store strings? We should probably use byte array right and avoid creating string in the first place?

jasperjiaguo · 2023-12-01T00:55:29Z

> Our OnHeapStringDictionary implementation can result in a lot of wasted heap usage if there are enough duplicates in a column.

Meaning the dictionaries of the same column across different segments would potentially have a lot of duplicate values and waste space?

vvivekiyer · 2023-12-01T01:42:50Z

> Why do we need to store strings? We should probably use byte array right and avoid creating string in the first place?

@kishoreg The above problem exists with duplicate bytes as well. My PoC code does both String interning and byte interning. The JXray analysis is below.

Before

After

mayankshriv · 2023-12-01T19:13:52Z

Just to add some context, the reason why this was added in the first place was the fact that for certain workloads, byte -> String de-serialization was becoming the bottleneck. And it couldn't be avoided because sorted ordering of byte[] != sorted ordering of Strings.
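One way the two orderings can diverge (an illustration, not necessarily the exact case hit in Pinot): `String.compareTo` compares UTF-16 code units, while a `byte[]` comparison works on the UTF-8 encoding, either as signed or unsigned bytes. The sketch below uses only JDK calls (`Arrays.compare` and `Arrays.compareUnsigned`, available since Java 9); the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper showing that String order and UTF-8 byte order can disagree.
final class ByteVsStringOrder {
  // Sign of the natural String ordering (UTF-16 code unit comparison).
  static int compareAsStrings(String a, String b) {
    return Integer.signum(a.compareTo(b));
  }

  // Sign of the ordering when UTF-8 bytes are compared as signed Java bytes.
  static int compareAsUtf8Signed(String a, String b) {
    return Integer.signum(Arrays.compare(utf8(a), utf8(b)));
  }

  // Sign of the ordering when UTF-8 bytes are compared as unsigned values.
  static int compareAsUtf8Unsigned(String a, String b) {
    return Integer.signum(Arrays.compareUnsigned(utf8(a), utf8(b)));
  }

  private static byte[] utf8(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // ASCII vs. non-ASCII: signed byte comparison flips the order.
    String ascii = "a";
    String accented = "\u00E9"; // UTF-8: C3 A9, negative as signed bytes
    System.out.println(compareAsStrings(ascii, accented));    // -1
    System.out.println(compareAsUtf8Signed(ascii, accented)); // 1

    // BMP character vs. supplementary character: even unsigned comparison flips.
    String bmp = "\uFF5E";             // UTF-8: EF BD 9E
    String surrogate = "\uD83D\uDE00"; // UTF-8: F0 9F 98 80
    System.out.println(compareAsStrings(bmp, surrogate));      // 1
    System.out.println(compareAsUtf8Unsigned(bmp, surrogate)); // -1
  }
}
```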

richardstartin · 2023-12-01T21:17:23Z

I would recommend against using String.intern, see an authoritative source here, which recommends manual interning over use of String.intern.

Depending on which GC (G1, Shenandoah, Z) you’re running with you may be able to get the GC to deduplicate the backing data with -XX:+UseStringDeduplication. Assuming that data in dictionaries should get quite old, you can tune it with -XX:StringDeduplicationAgeThreshold=n where n is by default 3 collections. You can check it’s working properly with -XX:+PrintStringDeduplicationStatistics. This solution has the benefit of not making code changes with data-dependent efficacy and ramifications. (On the other hand, this may not be effective at all if the dictionaries don’t live long enough…)

vvivekiyer · 2023-12-01T23:50:35Z

@richardstartin
We did try the GC optimizations with -XX:+UseStringDeduplication (and others) but noticed elevated CPU usage affecting our query latencies.

I want to clarify that we are not using Java's native String.intern() here but rather manual interning. As I shared above, the implementation is based on the article here. The PoC code is available here - vvivekiyer@dc4538b.

Do you see any potential issues mentioned in the article based on the code above? I'll also take a closer look at the article you shared.

richardstartin · 2023-12-02T09:50:12Z

Sorry for what may have seemed like a drive-by comment. I was trying to support the idea of doing custom interning (if any interning is done at all) rather than suggest the prototype used native interning. I would also expect there to be some treatment of GC string deduplication as a potential alternative solution, if only to share lessons learned and what was tried (e.g. which GC, which JDK version, was the age threshold modified?) before proceeding to implementation.

I don’t see any issues with what’s being proposed, except users will need a way to switch it on/off, change the size, observe it and so on.

vvivekiyer · 2023-12-02T20:26:29Z

@richardstartin thanks for the pointers. I do plan to have configs to enable/disable interning along with knobs to control the size.

gortiz · 2024-01-09T09:30:30Z

> Why do we need to store strings? We should probably use byte array right and avoid creating string in the first place?

I think that is something we need to explore in the longer term. We would reduce the GC usage by a lot if we do that.

> Just to add some context, the reason why this was added in the first place was the fact that for certain workloads, byte -> String de-serialization was becoming the bottleneck.

Sure, that is something we need to take into account, and we need to be careful with the implementation. This is especially problematic when strings are not normalized. What I did in the past was to use a Str class that has two attributes: a ByteBuffer and a String. When the Str is built from IO buffers, the ByteBuffer is set to the slice and the String is set to null. When a materialization is needed (for example, the IO buffer will be released or we need to compare the strings), a materialize() method is called. That method initializes the String, and from that moment on the String is always used. By doing so we can skip the String creation (and therefore heap allocation) in almost all cases where the Str is not used as an aggregation key.
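A rough sketch of what such a Str class could look like follows. It is a reconstruction from the description above, not the original code: the UTF-8 assumption, the `isMaterialized()` helper, and dropping the buffer reference after decoding are all guesses made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical reconstruction of the lazily materialized Str described above.
final class Str {
  private ByteBuffer bytes; // slice of the IO buffer; null once materialized
  private String string;    // null until materialize() is called

  Str(ByteBuffer slice) {
    this.bytes = slice;
  }

  boolean isMaterialized() {
    return string != null;
  }

  // Decodes the bytes once (assumed UTF-8) and uses the String from then on.
  String materialize() {
    if (string == null) {
      // duplicate() keeps the caller's buffer position untouched while decoding
      string = StandardCharsets.UTF_8.decode(bytes.duplicate()).toString();
      bytes = null; // the IO buffer may now be released by its owner
    }
    return string;
  }

  @Override
  public String toString() {
    return materialize();
  }

  public static void main(String[] args) {
    byte[] io = "hello world".getBytes(StandardCharsets.UTF_8);
    Str s = new Str(ByteBuffer.wrap(io, 0, 5).slice());
    System.out.println(s.isMaterialized()); // false: no String allocated yet
    System.out.println(s.materialize());    // hello
  }
}
```

The sketch is not thread-safe; a shared Str would need synchronization or a volatile field around materialize().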

> We did try the GC optimizations with -XX:+UseStringDeduplication (and others) but noticed elevated CPU usage affecting our query latencies.

I may be wrong, but dictionaries are bound to the query lifetime, right? I mean, we create the dictionary when the segment is being queried and do not re-use it in following queries. If that is the case String Deduplication won't be useful at all because it is only used on Strings in the old generation.

> I would recommend against using String.intern, see an authoritative source here, which recommends manual interning over use of String.intern.

I'm with Richard here. My experience with String.intern is bad. It is just better to use our own structure to intern Strings. Something as simple as a Guava Cache is usually better than String.intern.

gortiz · 2024-01-10T09:06:45Z

BTW, this issue is focused on the memory impact of the dictionary. But there is another theoretical improvement here. The solution proposed in #12223 has the side effect that two equal string literals that belong to the same column in different segments will probably be resolved to the same Java String object.

When working with ClickBench, I've seen that we waste a lot of time evaluating equals between actually equal (but not same) String objects when these Strings are used as aggregation keys. With this PR it is possible to find that these two equal String values that were read from different segments are actually the same String Java object, which means that the equals may be evaluated in constant time instead of linear (comparing all bytes).

We should verify the impact in reality of this theoretical reasoning, but in case it actually shows an increase in performance, we could apply the same technique in the brokers when data is being read (interning strings sent by different servers). Although, as said in my previous message, I think the largest improvement would be to use a Str class that actually doesn't allocate in heap if it is not needed.

Jackie-Jiang added enhancement performance labels Dec 5, 2023

vvivekiyer mentioned this issue Jan 5, 2024

Reduce Heap Usage of OnHeapStringDictionary #12223

Merged

vvivekiyer self-assigned this Jan 26, 2024
