-
Notifications
You must be signed in to change notification settings - Fork 477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-3173. Provide better default JVM options #710
Conversation
I have one concern, in case there is any -XX option configured, then we skip to add GC options. This may be a problem in case when an extended JVM attribute is added for debug of performance purposes but one which is unrelated to GC, as in this case the GC behaviour and used algorithm may be altered unwillingly. It would be better to restrict the regex to something which is more close to GC stuff, like something that matches to GC case insensitive after a -XX but before a space... though I am unsure whether there is an ultimate solution for this, and maybe it is better to just document the suggested parameters for different jre versions, but we certainly should add some documentation on the potential behavioural change in case of JVM tuning somewhere. Also we talk about DataNodes where the problem happened, but the change seems to affect all daemonizable process types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @elek for this improvement, especially for including test for the bash function.
|
||
# | ||
# Can be executed with bats (https://github.com/bats-core/bats-core) | ||
# bats gc_opts.bats (FROM THE CURRENT DIRECTORY) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my first experiment, will see how does it work and eventually we can make it part of the main tests.
The only problem what I have found until now that the test script is copied to /tmp by bats, I couldn't use any smart, relative path (like COMPOSE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
). Which means bats should be executed always from the same dir where the test is.
But I can live with it.
if [[ "${HADOOP_SUBCMD_SUPPORTDAEMONIZATION}" == true ]]; then | ||
hadoop_debug "Appending default GC parameter to the HADOOP_OPTS" | ||
if [[ ! "$HADOOP_OPTS" =~ "-XX" ]] ; then | ||
HADOOP_OPTS="${HADOOP_OPTS} -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UseConcMarkSweepGC was deprecated in JDK8.
-XX:+UseConcMarkSweepGC
Enables the use of the CMS garbage collector for the old generation. Oracle recommends that you use the CMS garbage collector when application latency requirements cannot be met by the throughput (-XX:+UseParallelGC) garbage collector. The G1 garbage collector (-XX:+UseG1GC) is another alternative.By default, this option is disabled and the collector is chosen automatically based on the configuration of the machine and type of the JVM. When this option is enabled, the -XX:+UseParNewGC option is automatically set and you should not disable it, because the following combination of options has been deprecated in JDK 8: -XX:+UseConcMarkSweepGC -XX:-UseParNewGC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but as I learned from @arp7 we have no better option, right now. As I know G1 doesn't provide the same performance (yet).
The idea here was provide some very basic default to avoid bad experience for the users who use Ozone out-of-the-box. And advanced users can set any smart default.
Or do you suggest G1 by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm not sure which is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which is better heavily depends on the workload, but as the main workload are RPC calls with mostly short lived objects in our server processes, CMS is better due to the easier newGen eviction and low number of tenuring/surviving objects. At least this is the experience from HDFS and our workload on the DN and OM/SCM is very similar to HDFS server roles and consist mainly RPC calls while handling large metadata structures. G1 can be tuned for this kind of workload, but it is not the best performing one for this due to the segmented heap structure.
Thanks the feedback. Yes, there is no ultimate solution. I considered to do some smarter pattern (like the one what you suggested), but that one is more dangerous: It's hard to detect all the GC related parameters (including G1, CMS, etc.). I am not sure if we have one generic prefix for all of them. And these settings just for the first time user. Vendors with custom distribution provides adjusted values. Real production clusters might have adjusted values. The goal here is to provide a reasonable default for first time users (who download Ozone to try it out) but enable full customization (without surprises) for the advanced users.
I thinks it's a generic problem that JVM has a very conservative default. I can reproduce the problem with Datanode but as it's related to the retry cache in ratis, OM is affected immediately. And I think all the settings are useful for any server side component. |
Thank you for addressing my concern, sounds reasonable, and I can accept the current solution. I agree that most of the users will have their own settings in a production environment and hopefully if someone starts to tune the JVM then he/she will know what to do. |
I think it's a very good idea to print out the flags when we touch them, but not sure what is your suggestion exactly:
What is your preference? I am thinking about printing out all the JVM options always (similar to the classpath) + a warning that we defined the default GC parameters (2nd option)
Is it possible that somebody adds any secret information to the
As a main rule we don't set anything when any of the |
if [[ ! "$HADOOP_OPTS" =~ "-XX" ]] ; then | ||
hadoop_debug "Appending default GC parameter to the HADOOP_OPTS" | ||
HADOOP_OPTS="${HADOOP_OPTS} -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled" | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we print a warning or message if any -XX options are set, that we are not overriding the GC parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
My suggestion was 3 exactly, as defaults I think would be expected to be there (especially if they are documented), and something the user wants to know about is when we skip adding them as I see. By now I also like the 2nd approach, but I am not sure if we have anything sensitive that a user might add to the HADOOP_OPTS, and if we are unsure, it is better to do not print the options themselves at all. However we still can print out a message for both cases (when defaults are added and when defaults are not added by the script) instead. |
I feel myself uncomfortable to print out warning ("JVM paraters are NOT added") when the app is used in the normal way (properly configured, including -XX parameters). I am modifying the patch to always print out something if the jvm parameters are set by our script instead of the user (not just on the debug level, but on stderr). This is the 1st option (simple but safe). |
|
Sorry for the late response, I was pretty much overwhelmed with other stuff, I am fine with the current solution, I still think we may revisit this if there are real operational misunderstandings around it when the default options are omitted, but for now let's leave it this way. |
What changes were proposed in this pull request?
The GC pressure on Datanode is high because of the retry cache. I found crashes due to the long GC pauses. I started to use the following JVM parameters:
Which provide stable output.
It would be great to detect the current version and add these parameters, if required.
But there are two problems:
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-3173
How was this patch tested?
HADOOP_OPTS
flag to thedocker-config
file.Unit tests: