PARQUET-1261 - Remove string interning #92

robert3005 · 2018-03-29T12:15:23Z

As explained on the issue, it's questionable whether string interning brings any benefit, and it can cause harm.

gszadovszky

Thanks for working on this.
LGTM (not a committer)
I think it's worth running some perf tests, or at least starting a discussion on the dev list about this topic.

julienledem · 2018-04-04T00:07:58Z

FYI, this was done to save memory, since we refer to columns by name in the metadata, which can become quite big when loading a lot of files. If interning is causing problems, we should replace it with a different mechanism that serves the same purpose of deduping strings.
Ideally we would change the metadata to refer to columns by their index instead, but that's a breaking change.
I replied on the mailing list as well.
Thank you

scottcarey · 2018-04-04T10:01:00Z

A way to dedupe strings that does not use String.intern() is to use a weak reference set.

Basically, use a WeakHashMap, potentially wrapped via Collections.newSetFromMap and Collections.synchronizedSet. If you need concurrency without synchronization, you can't easily use a ConcurrentHashMap, though you can write a class that extends WeakReference for keys and overrides the equals and hashCode methods. This is slow because you have to wrap every access in a new WeakReference (which WeakHashMap avoids).
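To make the ConcurrentHashMap variant above concrete, here is a rough sketch of what such a class could look like. The names ConcurrentWeakInterner and Key are made up for illustration and are not part of parquet-format or the JDK; note the extra Key allocation on every call, which is the cost mentioned above.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: a lock-free string deduper with weakly held entries. */
final class ConcurrentWeakInterner {

  /** Weak key that compares by the referent's value instead of by identity. */
  private static final class Key extends WeakReference<String> {
    private final int hash; // cached so the key stays findable after the referent is cleared

    Key(String s, ReferenceQueue<String> queue) {
      super(s, queue);
      this.hash = s.hashCode();
    }

    @Override public int hashCode() { return hash; }

    @Override public boolean equals(Object o) {
      if (this == o) return true; // lets a cleared key still be removed by identity
      if (!(o instanceof Key)) return false;
      String mine = get();
      return mine != null && mine.equals(((Key) o).get());
    }
  }

  private final ConcurrentHashMap<Key, Key> map = new ConcurrentHashMap<>();
  private final ReferenceQueue<String> queue = new ReferenceQueue<>();

  /** Returns a canonical instance equal to s; null stays null. */
  String intern(String s) {
    if (s == null) return null;
    // Drop entries whose strings have already been garbage collected.
    Reference<? extends String> stale;
    while ((stale = queue.poll()) != null) {
      map.remove(stale);
    }
    Key key = new Key(s, queue); // one wrapper allocation per call
    while (true) {
      Key existing = map.putIfAbsent(key, key);
      if (existing == null) return s;         // s is now the canonical copy
      String canonical = existing.get();
      if (canonical != null) return canonical;
      map.remove(existing, existing);         // cleared in the meantime, retry
    }
  }

  public static void main(String[] args) {
    ConcurrentWeakInterner interner = new ConcurrentWeakInterner();
    String a = new String("column_name");
    String b = new String("column_name");
    System.out.println(interner.intern(a) == interner.intern(b)); // prints "true"
  }
}
```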

If you have Guava on the classpath, Interners (https://google.github.io/guava/releases/19.0/api/docs/com/google/common/collect/Interners.html) is thread-safe and based on a concurrent map. I would avoid having Guava on the classpath here because of version conflicts with user code, though perhaps shading a copy of the classes you need is fine.

Sadly, you can't create something like this on your own with just Guava's MapMaker, because it relies on a package-private method to allow key equivalence instead of reference equality on weak keys.

The simplest thing is using the JDK's WeakHashMap, and dealing with its thread safety issues.
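A minimal sketch of the WeakHashMap approach, handling thread safety with plain synchronization. The class name WeakStringInterner is hypothetical and not part of parquet-format. The values are wrapped in WeakReference because a strongly held value would keep its own key reachable, and the entry would then never be collected.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical helper: dedupes equal strings without calling String.intern(). */
final class WeakStringInterner {
  // Keys are weak (WeakHashMap); values are weak too, see the note above.
  private final Map<String, WeakReference<String>> pool = new WeakHashMap<>();

  /** Returns a canonical instance equal to s; null stays null. */
  synchronized String intern(String s) {
    if (s == null) return null;
    WeakReference<String> ref = pool.get(s);
    String canonical = (ref == null) ? null : ref.get();
    if (canonical == null) {
      // First time this value is seen (or the old copy was collected).
      pool.put(s, new WeakReference<>(s));
      canonical = s;
    }
    return canonical;
  }

  public static void main(String[] args) {
    WeakStringInterner interner = new WeakStringInterner();
    String a = new String("column_name");
    String b = new String("column_name");
    System.out.println(a == b);                                   // prints "false"
    System.out.println(interner.intern(a) == interner.intern(b)); // prints "true"
  }
}
```

Compared with String.intern(), this keeps the pool on the ordinary heap under the caller's control, at the cost of one map entry and one WeakReference per distinct string.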

Now for my last point. The original claim that OOMs were caused by String.intern() is bogus if it is JRE 7+.

Read this: http://java-performance.info/string-intern-in-java-6-7-8/

Strings that are no longer referenced are GC'd, and the claim that this is not done 'frequently' is false. The code referenced on the mailing list deals with the JVM's class unloading code removing interned strings that were referenced statically by a class; it has nothing to do with user-mode calls to String.intern().

If OOM conditions are happening, they are not caused by calls to intern(), and switching to a WeakHashMap will only increase the heap required to manage the strings. Increasing the StringTable size might help somewhat on the performance side (the benchmark at shipilev.net clearly shows how performance tanks when the distinct string count exceeds the table size).

So in summary:

The original concern seems incorrect, unless this is running on JRE 6 with too small a perm gen configured. Otherwise, this should not lead to OOM, since the Strings are strongly referenced anyway. However, String.intern can lead to some performance issues with under-sized string tables.
Doing the interning by hand with Guava's Interner or a WeakHashMap might be a win performance-wise, but will likely use a bit more heap.

julienledem · 2018-04-04T17:15:24Z

Thanks a lot for the details, Scott.
I'd like to add that the number of distinct strings here is not that big, since they are the names of the fields in schemas (individual field names, not fully qualified paths). They are referred to many times, though, especially when inspecting footers from many files. If that's a problem, we can switch to a different deduping mechanism. It seems the overhead of a separate map would still be reasonable.

robert3005 · 2018-04-04T17:48:09Z

I've dug a bit more into the JVM source code and it's slightly more complicated, and not exactly as Scott says. String#intern does indeed end up in the StringTable in the JVM, and there's no distinction between strings interned by user code and what the compiler/JVM interns. The problem, though, is that handling of that space is GC-specific. The article that Scott links is totally accurate for default JVM settings, while the links I posted were for the CMS garbage collector. From my reading of the code, it looks like interning is really only an issue under CMS (since it's very reluctant to reclaim space from the table), while ParallelGC and G1 will consider it on every GC. Additionally, interning or not, you can get the same benefit by using UseStringDeduplication under G1 (the default collector from Java 9 onwards).
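Since the behavior above depends on the collector, it can help to confirm which one a given JVM is actually running. A small sketch using the standard java.lang.management API follows; the class name GcInfo is made up here, and the exact collector names reported vary by JVM vendor and version, so treat the output as informational only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports the active collectors and the JVM flags in use. */
final class GcInfo {

  /** Names of the garbage collectors registered in this JVM. */
  static List<String> collectorNames() {
    List<String> names = new ArrayList<>();
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      names.add(gc.getName());
    }
    return names;
  }

  /** Flags passed on the command line, e.g. -XX:+UseStringDeduplication if it was set. */
  static List<String> jvmFlags() {
    return ManagementFactory.getRuntimeMXBean().getInputArguments();
  }

  public static void main(String[] args) {
    System.out.println("Collectors: " + collectorNames());
    System.out.println("JVM flags:  " + jvmFlags());
  }
}
```

On HotSpot, the collector is typically selected with flags such as -XX:+UseConcMarkSweepGC or -XX:+UseG1GC, and string deduplication under G1 is enabled with -XX:+UseStringDeduplication.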

I am doing some benchmarking, but it seems that switching away from string interning has the potential to help those using the CMS GC and shouldn't make a significant difference on newer JVMs. I will update the PR once I am done benchmarking.

robert3005 · 2018-09-13T13:14:19Z

Since this is only ever an issue with CMS and no other garbage collector, I don't think this PR matters that much. With the advent of Shenandoah and ZGC, it doesn't make sense to make this change.

remove interning from format utils

a4ba193

robert3005 mentioned this pull request Mar 29, 2018

Memory pressure in ParquetFileSplitter palantir/spark#259

Closed

gszadovszky reviewed Apr 3, 2018

robert3005 closed this Sep 13, 2018

robert3005 deleted the rk/remove-interning branch September 13, 2018 13:14
