Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce Heap Usage of OnHeapStringDictionary #12078

Open
vvivekiyer opened this issue Dec 1, 2023 · 11 comments
Open

Reduce Heap Usage of OnHeapStringDictionary #12078

vvivekiyer opened this issue Dec 1, 2023 · 11 comments

Comments

@vvivekiyer
Copy link
Contributor

vvivekiyer commented Dec 1, 2023

Our OnHeapStringDictionary implementation can result in a lot of wasted heap usage if there are enough duplicates in a column.

Below is JXray analysis of the heapdump for one usecase in Linkedin where the OnHeapStringDictionary uses about 13GB of heap
image

String Interning described in https://www.baeldung.com/string/intern solves this problem. However, there could be certain high-cardinality columns (even with enough duplicates) where interning can be counter productive. So we can solve this with a fixed size interner as described in the following article https://dzone.com/articles/duplicate-strings-how-to-get-rid-of-them-and-save.

I attempted to PoC this change on one of our usecases and observed that the we saw huge savings in heap usage. Below is the heapdump analysis with my PoC change. Note that I used a size of 32M for the fixed size interner.
image

I'm planning to expose a new tableIndexConfig called onHeapDictionaryConfig that will allow us to enable interning and control the size of the Fixed Size interner.

@vvivekiyer
Copy link
Contributor Author

cc: @Jackie-Jiang @siddharthteotia

@kishoreg
Copy link
Member

kishoreg commented Dec 1, 2023

Why do we need to store strings? We should probably use byte array right and avoid creating string in the first place?

@jasperjiaguo
Copy link
Contributor

Our OnHeapStringDictionary implementation can result in a lot of wasted heap usage if there are enough duplicates in a column.

Meaning the dictionaries of the same column across different segments would potentially have a lot of duplicate values and waste space?

@vvivekiyer
Copy link
Contributor Author

vvivekiyer commented Dec 1, 2023

Why do we need to store strings? We should probably use byte array right and avoid creating string in the first place?

@kishoreg The above problem exists with duplicate bytes as well. My PoC code does both String internining and Byte interning. The Jxray analysis is below

Before
image

After
image

@mayankshriv
Copy link
Contributor

Just to add some context, the reason why this was added in the first place was the fact that for certain workloads, byte -> String de-serialization was becoming the bottleneck. And it couldn't be avoided because sorted ordering of byte[] != sorted ordering of Strings.

@richardstartin
Copy link
Member

richardstartin commented Dec 1, 2023

I would recommend against using String.intern, see an authoritative source here, which recommends manual interning over use of String.intern.

Depending on which GC (G1, Shenandoah, Z) you’re running with you may be able to get the GC to deduplicate the backing data with -XX:+UseStringDeduplication. Assuming that data in dictionaries should get quite old, you can tune it with -XX:StringDeduplicationAgeThreshold=n where n is by default 3 collections. You can check it’s working properly with -XX:+PrintStringDeduplicationStatistics. This solution has the benefit of not making code changes with data-dependent efficacy and ramifications. (On the other hand, this may not be effective at all if the dictionaries don’t live long enough…)

@vvivekiyer
Copy link
Contributor Author

vvivekiyer commented Dec 1, 2023

@richardstartin
We did try the GC optimizations with -XX:+UseStringDeduplication (and others) but noticed elevated CPU usage affecting our query latencies.

I want to clarify that we are not using Java's native string.intern() here but rather using manual interning. As I shared above, the implementation is based on article here. The Poc code is available here - vvivekiyer@dc4538b .

Do you see any potential issues mentioned in the article based on the code above? I'll also take a closer look at the article you shared.

@richardstartin
Copy link
Member

Sorry for what may have seemed like a drive by comment. I was trying to support the idea of doing custom interning (if any interning is done at all) rather than suggest the prototype used native interning. I would also expect there to be some treatment of GC string deduplication as a potential alternative solution, if only to share lessons learned, what was tried (e.g. which GC, which JDK version, was the age threshold modified?) before proceeding to implementation.

I don’t see any issues with what’s being proposed, except users will need a way to switch it on/off, change the size, observe it and so on.

@vvivekiyer
Copy link
Contributor Author

@richardstartin thanks for the pointers. I do plan to have configs to enable/disable interning along with knobs to control the size.

@gortiz
Copy link
Contributor

gortiz commented Jan 9, 2024

Why do we need to store strings? We should probably use byte array right and avoid creating string in the first place?

I think that is something we need to explore in the longer term. We would reduce the GC usage by a lot if we do that.

Just to add some context, the reason why this was added in the first place was the fact that for certain workloads, byte -> String de-serialization was becoming the bottleneck.

Sure, that is something we need to take into account and be careful with the implementation. This is specially problematic when strings are not normalized. What I did in the past was to use a Str class that has two attributes: a ByteBuffer and String. When the Str is build from IO buffers, the bytebuffer is set to the slice and the String is set to null. When a materialization is needed (for example, the io buffer will be released or we need to compare the strings), a materialize() method is called. That method initializes the String and after that moment the String is always used. By doing so we can skip the String creation (and therefore heap allocation) in almost all cases where the Str is not used as aggregation key.

We did try the GC optimizations with -XX:+UseStringDeduplication (and others) but noticed elevated CPU usage affecting our query latencies.

I may be wrong, but dictionaries are bound to the query lifetime, right? I mean, we create the dictionary when the segment is being queried and do not re-use it in following queries. If that is the case String Deduplication won't be useful at all because it is only used on Strings in the old generation.

I would recommend against using String.intern, see an authoritative source here, which recommends manual interning over use of String.intern.

I'm with Richard here. My experience with String.intern is bad. It is just better to use our own structure to intern Strings. Something as simple as a Guava Cache is usually better than String.intern.

@gortiz
Copy link
Contributor

gortiz commented Jan 10, 2024

BTW, this issue is focused on the memory impact of the dictionary. But there is another theoretical improvement here. The solution proposed in #12223 has the side effect that two equal string literals that belong to the same column in different segments will probably be resolved to the same Java String object.

When working with ClickBench, I've seen that we waste a lot of time evaluating equals between actually equal (but not same) String objects when these Strings are used as aggregation keys. With this PR it is possible to find that these two equal String values that were read from different segments are actually the same String Java object, which means that the equals may be evaluated in constant time instead of linear (comparing all bytes).

We should verify the impact in reality of this theoretical reasoning, but in case it actually shows an increase in performance, we could apply the same technique in the brokers when data is being read (interning strings sent by different servers). Although, as said in my previous message, I think the largest improvement would be to use a Str class that actually doesn't allocate in heap if it is not needed.

@vvivekiyer vvivekiyer self-assigned this Jan 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants