
override heap / jvm params for tests in gradle build [LUCENE-9160] #10200

Closed
asfimport opened this issue Jan 22, 2020 · 24 comments

asfimport commented Jan 22, 2020

Currently the gradle.properties that is generated lets you control the heap and flags for the gradle build jvms.

But there is no way to control these flags for the actual forked JVMs running the unit tests. For example, minHeap is hardcoded at 256m and maxHeap at 512m.

I would like to change minHeap to 512m as well, for a fixed heap, and to set some other JVM flags, such as -XX:+UseParallelGC, so that my tests are not slow for silly reasons :)

I think this is something Jenkins CI would need as well.
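For context, the gap the issue describes can be sketched in gradle.properties terms. The first property below is standard Gradle; the per-test-JVM equivalent did not exist at the time of filing:

```properties
# Controls the Gradle build/daemon JVMs only (standard Gradle property):
org.gradle.jvmargs=-Xmx2g

# There was no equivalent knob for the forked test JVMs when this issue
# was opened; their heap was hardcoded (256m min / 512m max).
```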


Migrated from LUCENE-9160 by Robert Muir (@rmuir), resolved Jan 22 2020
Attachments: LUCENE-9160.patch (versions: 2)

Robert Muir (@rmuir) (migrated from JIRA)

Here's a patch that works for me. It allows specifying these parameters similar to how you can with ant:

tests.heapsize=512m
tests.minheapsize=512m
args=-XX:+AlwaysPreTouch -XX:+UseTransparentHugePages -XX:+UseParallelGC
tests.workDir=/tmp/lucene_gradle

I tried to make the parameters match the ant build as much as possible, to reduce confusion, but I'm not stuck on the naming, just want to make it possible :)

Robert Muir (@rmuir) (migrated from JIRA)

FWIW, adding -XX:TieredStopAtLevel=1 to my args made an even bigger difference: it cut overall test time in half (18 minutes -> 9 minutes). We waste all resources testing the C2 compiler...

Michael McCandless (@mikemccand) (migrated from JIRA)

fwiw adding -XX:TieredStopAtLevel=1 to my args made an even bigger difference

I tested this option, on 72 core box, using JDK 11.

In lucene/core I ran ant test -Dtests.jvms=36 for baseline, twice:

BUILD SUCCESSFUL
Total time: 1 minute 18 seconds

BUILD SUCCESSFUL
Total time: 1 minute 13 seconds 

And then ran again with this option (to tell hotspot to not try so hard?), ant test -Dtests.jvms=36 -XX:TieredStopAtLevel=1:

BUILD SUCCESSFUL
Total time: 24 seconds
BUILD SUCCESSFUL
Total time: 42 seconds 

Net/net this is a crazy crazy speedup for our tests!!!

Uwe Schindler (@uschindler) (migrated from JIRA)

But we should not hardcode the JVM opts, as we would like to test all combinations (also C2 optimizations) on Jenkins.

So we can add sane defaults, but -Dargs should always override those settings.
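A minimal sketch of what Uwe describes, as a Gradle build-script fragment (the property and flag names here are illustrative, not the committed patch): apply a sane default, but let an explicit -Dargs always win:

```groovy
// Sketch only: names are illustrative. Default to C1-only compilation,
// but an explicit -Dargs overrides the default entirely.
def testJvmArgs = System.getProperty('args') ?: '-XX:TieredStopAtLevel=1'

tasks.withType(Test) {
    jvmArgs testJvmArgs.split(/\s+/)
}
```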

Uwe Schindler (@uschindler) (migrated from JIRA)

Basically -XX:TieredStopAtLevel=1 is very similar to -client in older JDKs. So for short-running processes this is optimal. Of course, it's a bad idea for benchmarks or server environments.

Robert Muir (@rmuir) (migrated from JIRA)

Yes, I'd like to just set args=-XX:TieredStopAtLevel=1 as a default. That's the only default I want. The other stuff I do here has tradeoffs, but this one is a no-brainer by default.

Robert Muir (@rmuir) (migrated from JIRA)

Updated patch: it sets the default, but you can override it of course. I changed the name to tests.jvmargs to be consistent with org.gradle.jvmargs, which is used for the build VMs.

I also updated the help page. I think it's ready.
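Putting the pieces together, a local override might then look like this in gradle.properties (the values are just the examples from earlier in this thread):

```properties
tests.heapsize=512m
tests.minheapsize=512m
tests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC
tests.workDir=/tmp/lucene_gradle
```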

Uwe Schindler (@uschindler) (migrated from JIRA)

OK, +1

ASF subversion and git services (migrated from JIRA)

Commit 9dae566 in lucene-solr's branch refs/heads/master from Robert Muir
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9dae566

LUCENE-9160: add params/docs to override jvm params in gradle build, default C2 off in tests.

Adds some build parameters to tune how tests run. There is an example
shown by "gradle helpLocalSettings"

Default C2 off in tests as it is wasteful locally and causes slowdown of
tests runs. You can override this by setting tests.jvmargs for gradle,
or args for ant.

Some crazy lucene stress tests may need to be toned down after the
change, as they may have been doing too many iterations by default...
but this is not a new problem.

Dawid Weiss (@dweiss) (migrated from JIRA)

It's correct that running tests effectively stresses the compiler. I'd do it slightly differently so that the options are self-documenting, but that is something I can follow up on later. LGTM overall. Great speedup for the common case.

ASF subversion and git services (migrated from JIRA)

Commit 9dae566 in lucene-solr's branch refs/heads/gradle-master from Robert Muir
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9dae566

LUCENE-9160: add params/docs to override jvm params in gradle build, default C2 off in tests.

Adds some build parameters to tune how tests run. There is an example
shown by "gradle helpLocalSettings"

Default C2 off in tests as it is wasteful locally and causes slowdown of
tests runs. You can override this by setting tests.jvmargs for gradle,
or args for ant.

Some crazy lucene stress tests may need to be toned down after the
change, as they may have been doing too many iterations by default...
but this is not a new problem.

David Smiley (@dsmiley) (migrated from JIRA)

While this change might improve Lucene tests (I didn't check yet), I'm finding this to be a large degradation in Solr tests. A machine I use to run tests normally takes around 38 minutes but is now taking 52 minutes. It's a corporate VM that supposedly has 16 CPUs; "ant test" uses 4 runners. I passed "-Dargs=" to undo the args change and I'm back to normal test run times.

Robert Muir (@rmuir) (migrated from JIRA)

The Solr tests are generally sleep()'ing and hence leave the CPU with plenty of spare cycles to run background compilation, so there is no downside to the overcompilation of tests, only the benefits. For Solr I can't recommend anything; the tests are really hopeless: I'd just use as many runners as possible.

Robert Muir (@rmuir) (migrated from JIRA)

Also, if you have a machine with 16 CPUs and you are running with just 4 runners, that is a misconfigured system: you are leaving 75% of your machine idle. So it shouldn't be any surprise that background compilation (even some insane amount of it) causes you no problems: 75% of your resources are wasted.

Set the jvms to 16 if you want to do a comparison.

Robert Muir (@rmuir) (migrated from JIRA)

Similar comparison: if you have a 16-CPU machine and only use 4 runners, I can speed up your tests by spawning 12 background threads from the build: 12 threads that spend 80% of their time mining cryptocurrency and only 20% of their time running tests.

You'd see a nice speedup, even though overall it's wasting all of your resources. And if you set test JVMs to 16 you'd see that these background threads only caused contention and slowed you down, because they are wasting your CPU overall. This is what the C2 compiler does in our tests :)

David Smiley (@dsmiley) (migrated from JIRA)

Yep; I hear you, and thanks for your amusing comparative explanation :).  I recently acquired use of this VM and hadn't tuned the build.  I tried tests.jvms=4,8,10,12,16 and ultimately found 10 yielded the best times on this VM – 17:24m.

Robert Muir (@rmuir) (migrated from JIRA)

are you sure you really have 16?

python -c "import psutil; print(psutil.cpu_count(logical=False))"

Robert Muir (@rmuir) (migrated from JIRA)

Keep in mind that with a VM, the admin may not have taken the time to expose resources "correctly" as far as hyperthreads and so on. I can pass -smp 48,cores=12,threads=4 to a KVM guest from my little 2-core machine and that is what the VM will see.

David Smiley (@dsmiley) (migrated from JIRA)

I don't have the psutil module and I'm not versed in python but I ran lscpu:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             16
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 61
Model name:            Intel Core Processor (Broadwell)
Stepping:              2
CPU MHz:               2397.222
BogoMIPS:              4794.44
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              16384K
NUMA node0 CPU(s):     0-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat

(and 48GB of RAM)

Robert Muir (@rmuir) (migrated from JIRA)

My guess is there are really only 8. Impossible to tell inside a VM :) I will open an issue; the gradle build divides the number of CPUs by 2, then artificially caps the result at 4, and I think we should change that. Divide by 2 is fine, but machines have more cores these days. 8 would have been a better default here.

Dawid Weiss (@dweiss) (migrated from JIRA)

I never had a chance to experiment on those super-beefy machines but I'm sure we can alter the defaults.

      // Approximate a common-sense default for running gradle with parallel
      // workers: half the count of available cpus but not more than 12.
      def cpus = Runtime.runtime.availableProcessors()
      def maxWorkers = (int) Math.max(1d, Math.min(cpus * 0.5d, 12))
      def testsJvms = (int) Math.max(1d, Math.min(cpus * 0.5d, 4)) 

My machines quickly saturate I/O and memory bandwidth at higher test parallelism, especially for Solr. The above is just an off-the-top-of-my-head default. It can certainly be improved.
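As a rough illustration of that heuristic, here is a plain-Java mirror of the Groovy snippet above (the class and method names are hypothetical), showing what the defaults work out to for a few CPU counts:

```java
public class TestJvmDefaults {
    // Mirrors the Groovy heuristic above: half the available CPUs,
    // capped at 12 gradle workers and 4 forked test JVMs.
    static int maxWorkers(int cpus) {
        return (int) Math.max(1d, Math.min(cpus * 0.5d, 12));
    }

    static int testsJvms(int cpus) {
        return (int) Math.max(1d, Math.min(cpus * 0.5d, 4));
    }

    public static void main(String[] args) {
        for (int cpus : new int[] {2, 8, 16, 72}) {
            System.out.println(cpus + " cpus -> workers=" + maxWorkers(cpus)
                    + ", testJvms=" + testsJvms(cpus));
        }
    }
}
```

So a 16-CPU box gets 8 gradle workers but still only 4 test JVMs; that hard cap of 4 is exactly what is suggested raising above.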

asfimport commented Jan 23, 2020

Robert Muir (@rmuir) (migrated from JIRA)

Dawid, I opened #10205 to discuss further.

Also keep in mind this JIRA ticket alters the defaults in ways that impact this.
For example, when running Lucene tests with 3 VMs I see a load average around 4.0 instead of the 15.0-16.0 before this very patch was committed!

That's because I don't have 3 CICompiler threads per JVM doing a lot of useless C2 recompilation. So it makes things more efficient, and I think we should raise the hard cap of 4 JVMs to 8 or 12.

ASF subversion and git services (migrated from JIRA)

Commit 16f240e in lucene-solr's branch refs/heads/branch_8x from Robert Muir
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=16f240e

LUCENE-9160: add params/docs to override jvm params in gradle build, default C2 off in tests.

Adds some build parameters to tune how tests run. There is an example
shown by "gradle helpLocalSettings"

Default C2 off in tests as it is wasteful locally and causes slowdown of
tests runs. You can override this by setting tests.jvmargs for gradle,
or args for ant.

Some crazy lucene stress tests may need to be toned down after the
change, as they may have been doing too many iterations by default...
but this is not a new problem.

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release
