-
Notifications
You must be signed in to change notification settings - Fork 465
PARQUET-1261 - Remove string interning #92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
gszadovszky
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this.
LGTM (not a committer)
I think, it worth some perf tests or at least starting a discussion on the dev list about this topic.
|
FYI, this was done to save memory since we refer to columns using their name in the metadata. Which can become quite big when loading a lot of files. If interning is causing problems we should replace it by a different mechanisms to serve the same purpose of deduping strings. |
|
A way to dedupe strings that does not use String.intern() is to use a weak reference set. Basically, use a WeakHashMap, potentially wrapped via Collections.newSetFromMap and Collections.synchronizedSet. If you need concurrency without synchronization, you can't easily use a ConcurrentHashMap, though you can write a class that extends WeakReference for keys and overrides the equals and hashCode method... this is slow because you have to wrap every access in a new WeakReference (which WeakHashMap avoids) If you have Guava on the classpath: https://google.github.io/guava/releases/19.0/api/docs/com/google/common/collect/Interners.html Is thread-safe and based on a concurrent map. I would avoid having Guava on the classpath here because of version conflicts with user code, though perhaps shading a copy of the classes you need is fine. Sadly, you can't create something like this on your own with just Guava's MapMaker because they use a package protected method to allow using key equivalence instead of reference equality on weak keys. The simplest thing is using the JDK's WeakHashMap, and dealing with its thread safety issues. Now for my last point. The original claim that OOMs were caused by String.intern() is bogus if it is JRE 7+. Read this: http://java-performance.info/string-intern-in-java-6-7-8/ Strings that are no longer referenced are GC'd, and the claim that it is not done 'frequently' is false. The code referenced on the mailing list was to when the JVM's class unloading code removes interned strings that were referenced statically by the class and has nothing to do with user-mode calls to If OOM conditions are happening, it is not caused by calls to intern(). And switching to a WeakHashMap will only increase the heap required to manage it. Increasing the StringTable size might help somewhat on the performance side (the benchmark at shiplev.net clearly shows how the performance tanks when the distinct string count exceeds the table side). So in summary:
|
|
thanks a lot for the details Scott. |
|
I've dug a bit more into jvm source code and it's slightly more complicated/not exactly as Scott is saying. String#intern does indeed end up on the StringTable in the jvm and there's no distinction between interened strings and what compiler/jvm interns. The problem though is that handling of that space is gc specific. The article that Scott links is totally accurate for default jvm settings and the links I posted were for CMS garbage collector. From my reading of the code it looks like interning is really only an issue under CMS (since it's very reluctant to retrieve space from it) while ParallelGC and G1 will consider it every time it does gc. Additionally interning or not you can get benefit of it by using I am doing some benchmarking but it seems that switching from string interning has potential to those using CMS gc and shouldn't make significant difference on newer jvm. Will update the pr once I am done benchmarking |
|
Since this is only ever issue with CMS and no other garbage collectors I don't think this pr matters that much. With advent of shenandoah and zgc it doesn't make sense to make this change. |
As explained on the issue - it's questionable whether this brings any benefits and can cause harm