New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce Heap Usage of OnHeapStringDictionary #12078
Comments
Why do we need to store strings? We should probably use byte array right and avoid creating string in the first place? |
Meaning the dictionaries of the same column across different segments would potentially have a lot of duplicate values and waste space? |
@kishoreg The above problem exists with duplicate bytes as well. My PoC code does both String internining and Byte interning. The Jxray analysis is below |
Just to add some context, the reason why this was added in the first place was the fact that for certain workloads, byte -> String de-serialization was becoming the bottleneck. And it couldn't be avoided because sorted ordering of byte[] != sorted ordering of Strings. |
I would recommend against using String.intern, see an authoritative source here, which recommends manual interning over use of String.intern. Depending on which GC (G1, Shenandoah, Z) you’re running with you may be able to get the GC to deduplicate the backing data with -XX:+UseStringDeduplication. Assuming that data in dictionaries should get quite old, you can tune it with -XX:StringDeduplicationAgeThreshold=n where n is by default 3 collections. You can check it’s working properly with -XX:+PrintStringDeduplicationStatistics. This solution has the benefit of not making code changes with data-dependent efficacy and ramifications. (On the other hand, this may not be effective at all if the dictionaries don’t live long enough…) |
@richardstartin I want to clarify that we are not using Java's native Do you see any potential issues mentioned in the article based on the code above? I'll also take a closer look at the article you shared. |
Sorry for what may have seemed like a drive by comment. I was trying to support the idea of doing custom interning (if any interning is done at all) rather than suggest the prototype used native interning. I would also expect there to be some treatment of GC string deduplication as a potential alternative solution, if only to share lessons learned, what was tried (e.g. which GC, which JDK version, was the age threshold modified?) before proceeding to implementation. I don’t see any issues with what’s being proposed, except users will need a way to switch it on/off, change the size, observe it and so on. |
@richardstartin thanks for the pointers. I do plan to have configs to enable/disable interning along with knobs to control the size. |
I think that is something we need to explore in the longer term. We would reduce the GC usage by a lot if we do that.
Sure, that is something we need to take into account and be careful with the implementation. This is specially problematic when strings are not normalized. What I did in the past was to use a Str class that has two attributes: a ByteBuffer and String. When the Str is build from IO buffers, the bytebuffer is set to the slice and the String is set to null. When a materialization is needed (for example, the io buffer will be released or we need to compare the strings), a
I may be wrong, but dictionaries are bound to the query lifetime, right? I mean, we create the dictionary when the segment is being queried and do not re-use it in following queries. If that is the case String Deduplication won't be useful at all because it is only used on Strings in the old generation.
I'm with Richard here. My experience with String.intern is bad. It is just better to use our own structure to intern Strings. Something as simple as a Guava Cache is usually better than String.intern. |
BTW, this issue is focused on the memory impact of the dictionary. But there is another theoretical improvement here. The solution proposed in #12223 has the side effect that two equal string literals that belong to the same column in different segments will probably be resolved to the same Java String object. When working with ClickBench, I've seen that we waste a lot of time evaluating equals between actually equal (but not same) String objects when these Strings are used as aggregation keys. With this PR it is possible to find that these two equal String values that were read from different segments are actually the same String Java object, which means that the equals may be evaluated in constant time instead of linear (comparing all bytes). We should verify the impact in reality of this theoretical reasoning, but in case it actually shows an increase in performance, we could apply the same technique in the brokers when data is being read (interning strings sent by different servers). Although, as said in my previous message, I think the largest improvement would be to use a Str class that actually doesn't allocate in heap if it is not needed. |
Our
OnHeapStringDictionary
implementation can result in a lot of wasted heap usage if there are enough duplicates in a column.Below is JXray analysis of the heapdump for one usecase in Linkedin where the
OnHeapStringDictionary
uses about 13GB of heapString Interning described in https://www.baeldung.com/string/intern solves this problem. However, there could be certain high-cardinality columns (even with enough duplicates) where interning can be counter productive. So we can solve this with a fixed size interner as described in the following article https://dzone.com/articles/duplicate-strings-how-to-get-rid-of-them-and-save.
I attempted to PoC this change on one of our usecases and observed that the we saw huge savings in heap usage. Below is the heapdump analysis with my PoC change. Note that I used a size of 32M for the fixed size interner.
I'm planning to expose a new tableIndexConfig called
onHeapDictionaryConfig
that will allow us to enable interning and control the size of the Fixed Size interner.The text was updated successfully, but these errors were encountered: