Skip to content

Conversation

@robert3005
Copy link

As explained on the issue - it's questionable whether this brings any benefits and can cause harm

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this.
LGTM (not a committer)
I think, it worth some perf tests or at least starting a discussion on the dev list about this topic.

@julienledem
Copy link
Member

FYI, this was done to save memory since we refer to columns using their name in the metadata. Which can become quite big when loading a lot of files. If interning is causing problems we should replace it by a different mechanisms to serve the same purpose of deduping strings.
Ideally we would change the metadata to refer to column by their index instead but that's a breaking change.
I replied on the mailing list as well.
Thank you

@scottcarey
Copy link

A way to dedupe strings that does not use String.intern() is to use a weak reference set.

Basically, use a WeakHashMap, potentially wrapped via Collections.newSetFromMap and Collections.synchronizedSet. If you need concurrency without synchronization, you can't easily use a ConcurrentHashMap, though you can write a class that extends WeakReference for keys and overrides the equals and hashCode method... this is slow because you have to wrap every access in a new WeakReference (which WeakHashMap avoids)

If you have Guava on the classpath: https://google.github.io/guava/releases/19.0/api/docs/com/google/common/collect/Interners.html Is thread-safe and based on a concurrent map. I would avoid having Guava on the classpath here because of version conflicts with user code, though perhaps shading a copy of the classes you need is fine.

Sadly, you can't create something like this on your own with just Guava's MapMaker because they use a package protected method to allow using key equivalence instead of reference equality on weak keys.

The simplest thing is using the JDK's WeakHashMap, and dealing with its thread safety issues.

Now for my last point. The original claim that OOMs were caused by String.intern() is bogus if it is JRE 7+.

Read this: http://java-performance.info/string-intern-in-java-6-7-8/

Strings that are no longer referenced are GC'd, and the claim that it is not done 'frequently' is false. The code referenced on the mailing list was to when the JVM's class unloading code removes interned strings that were referenced statically by the class and has nothing to do with user-mode calls to String.intern().

If OOM conditions are happening, it is not caused by calls to intern(). And switching to a WeakHashMap will only increase the heap required to manage it. Increasing the StringTable size might help somewhat on the performance side (the benchmark at shiplev.net clearly shows how the performance tanks when the distinct string count exceeds the table side).

So in summary:

  1. The original concern seems incorrect, unless it is running on JRE 6 and has too-small of a perm gen configured. Otherwise, this should not lead to OOM since the Strings are strongly referenced anyway. However, String.intern can lead to some performance issues with under-sized string tables.
  2. Doing the interning by hand with Guava's Interner or a WeakHashMap might be a win performance wise, but will likely use a bit more heap.

@julienledem
Copy link
Member

thanks a lot for the details Scott.
I'd like to add that the number of distinct strings here is not that big, since it is the name of the fields in schemas (individual field names not fully qualified paths). They are referred many times though especially when inspecting footers from many files. If that's a problem we can switch to a different deduping mechanism. It seems the overhead of a separate map would still be reasonable.

@robert3005
Copy link
Author

I've dug a bit more into jvm source code and it's slightly more complicated/not exactly as Scott is saying. String#intern does indeed end up on the StringTable in the jvm and there's no distinction between interened strings and what compiler/jvm interns. The problem though is that handling of that space is gc specific. The article that Scott links is totally accurate for default jvm settings and the links I posted were for CMS garbage collector. From my reading of the code it looks like interning is really only an issue under CMS (since it's very reluctant to retrieve space from it) while ParallelGC and G1 will consider it every time it does gc. Additionally interning or not you can get benefit of it by using UseStringDeduplication under G1 (default from java 9 onwards).

I am doing some benchmarking but it seems that switching from string interning has potential to those using CMS gc and shouldn't make significant difference on newer jvm. Will update the pr once I am done benchmarking

@robert3005
Copy link
Author

Since this is only ever issue with CMS and no other garbage collectors I don't think this pr matters that much. With advent of shenandoah and zgc it doesn't make sense to make this change.

@robert3005 robert3005 closed this Sep 13, 2018
@robert3005 robert3005 deleted the rk/remove-interning branch September 13, 2018 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants