Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply deduplication to certain loaded and created Strings #143

Closed
wants to merge 1 commit into from

Conversation

Projects
None yet
6 participants
@philwebb
Copy link

commented Mar 6, 2019

Apply deduplication to certain loaded and created Strings by using
a static Map. This applies the same technique as described in
https://shipilev.net/jvm/anatomy-quarks/10-string-intern/ to
reduce the number of String instances that ultimately contain
the same contents.

See Fix https://bz.apache.org/bugzilla/show_bug.cgi?id=63236

@rmaucher

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2019

Can you give a metric of actual memory gains made using this change ? Adding complexity for very little benefit doesn't seem like a good move overall.

@philwebb

This comment has been minimized.

Copy link
Author

commented Mar 6, 2019

Sure. I discovered these by using the "Inspections" feature from YourKit on an embedded Tomcat application.

Without the change YourKit reports 2192 duplicate Strings, with the change it's 1126 (so a saving of 1066). It's hard to get the exact memory saving but scanning the UI for the top hitters we have:

String Duplicates Waste
"java.lang.String" 2192 149112
"ACTION" 542 38952
"boolean" 191 10640
"void" 204 9744
"Introspected parameter param0" 80 8216
"int" 155 7392
"[Ljava.lang.String;" 42 3280

So it looks like at least 227336 bytes can be saved.

One other thing to consider is that I think these Strings are retained for the life of the application. They can't be GC'd because references remain until shutdown.

@ChristopherSchultz

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2019

I like the idea, but I think the implementation could be better. For instance:

  1. There is no need to have two collections; just use one and cache everything.
  2. The cache should be "weak" so that any value no longer used can be GC'd
@rmaucher

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2019

Ok, I'm not sure it is a very significant saving overall (probably not), but it's still more than I expected.

Yes, +1for Chris, it needs to be something like TomcatString.getString I suppose.

@philwebb

This comment has been minimized.

Copy link
Author

commented Mar 6, 2019

The other option would be to use String.intern for these limited cases. Either way, I'm happy to rework the PR if you have a specific direction you'd like me to take it?

@ChristopherSchultz

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2019

Good point: String.intern is already implemented, and it's already "weak". Less code, same result = better.

Fix https://bz.apache.org/bugzilla/show_bug.cgi?id=63236
Call String.intern() for certain loaded and created Strings. On
a typical run this saves ~1050 duplicate Strings and at least
200K.

@philwebb philwebb force-pushed the philwebb:bz-63236 branch from b7f1a02 to f3bc24d Mar 7, 2019

@philwebb

This comment has been minimized.

Copy link
Author

commented Mar 7, 2019

I've pushed an update that uses String.intern for those two places. With this change 2192 duplicates are reduced to 300 and I'm not seeing any difference in startup time.

@rmannibucau

This comment has been minimized.

Copy link

commented Mar 7, 2019

Is generating a java class at build time instead of reading the xml possible? Then a preregistration in a registry would enable to bypass a lot of xml parsing and reflection which should give a real boost.

@markt-asf

This comment has been minimized.

Copy link
Contributor

commented Mar 7, 2019

I've been running YourKit with allocation tracking enabled and tracking each of the duplicated strings to the root cause. I started with the biggest duplicates and stopped somewhere around 700 bytes / per duplicated String. I ended up with a slightly different patch that covers the same duplicates and a few more. Start-up time appears to remain unaffected. Total saving ~245k.

I'm going to close this PR but I'll credit you in the commit message.

@markt-asf markt-asf closed this Mar 7, 2019

@philwebb

This comment has been minimized.

Copy link
Author

commented Mar 7, 2019

@rmannibucau I've opened another issue for that https://bz.apache.org/bugzilla/show_bug.cgi?id=63237. If I get a second I'll try some experiments.

@philwebb

This comment has been minimized.

Copy link
Author

commented Mar 7, 2019

@markt-asf Thanks Mark!

@@ -62,7 +59,7 @@ public void begin(String namespace, String theName, Attributes attributes)
if ("".equals(name)) {
name = attributes.getQName(i);
}
String value = attributes.getValue(i);
String value = attributes.getValue(i).intern();

This comment has been minimized.

Copy link
@paplorinc

paplorinc Mar 27, 2019

FYI: https://shipilev.net/jvm/anatomy-quarks/10-string-intern

In almost every project we were taking care of, removing String.intern() from the hotpaths, or optionally replacing it with a handrolled deduplicator, was the very profitable performance optimization. Do not use String.intern() without thinking very hard about it, okay?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.