Inconsistent run times of sanity.openjdk on xLinux #1165

Closed
sxa opened this issue Feb 21, 2020 · 11 comments

@sxa
Member

sxa commented Feb 21, 2020

While looking at the status of some of the pipelines last night it became clear that we have quite considerable differences in the run times of some of the sanity.openjdk jobs. We should look at whether this is a machine-specific issue, and how to optimise the pipelines if there is an underlying reason.

Data from https://ci.adoptopenjdk.net/view/Build%20and%20Test%20Pipeline%20Calendar/job/Test_openjdk11_hs_sanity.openjdk_x86-64_linux/buildTimeTrend:

| Build | Duration | Agent |
| --- | --- | --- |
| 153 | 9 hr 3 min | test-godaddy-ubuntu1604-x64-1 |
| 152 | 2 hr 17 min | test-packet-ubuntu1604-x64-1 |
| 151 | 4 hr 30 min | test-scaleway-ubuntu1604-x64-1 |
| 150 | 1 hr 28 min | test-godaddy-centos7-x64-1 |
| 149 | 2 hr 38 min | test-softlayer-ubuntu1604-x64-1 |
| 148 | 2 hr 4 min | test-godaddy-ubuntu1604-x64-3 |
| 147 | 9 hr 0 min | test-godaddy-ubuntu1604-x64-1 |
| 146 | 2 hr 6 min | test-godaddy-debian8-x64-2 |
| 145 | 2 hr 45 min | test-godaddy-debian8-x64-3 |
| 144 | 2 hr 24 min | test-packet-ubuntu1604-x64-1 |
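A minimal sketch of pulling the same trend data and grouping it by agent, which makes machine-specific slowness easier to spot. It assumes the standard Jenkins JSON API and that these Test_* builds expose `duration` and `builtOn` fields (true for freestyle builds; other job types may not report `builtOn`):

```python
import json
import urllib.request
from collections import defaultdict

JOB = ("https://ci.adoptopenjdk.net/job/"
       "Test_openjdk11_hs_sanity.openjdk_x86-64_linux")

# Ask Jenkins only for the fields we need: build number, duration (ms), agent.
url = JOB + "/api/json?tree=builds[number,duration,builtOn]"
with urllib.request.urlopen(url) as resp:
    builds = json.load(resp)["builds"]

by_agent = defaultdict(list)
for b in builds:
    if b.get("duration"):  # skip in-progress builds (duration 0)
        by_agent[b.get("builtOn") or "?"].append(b["duration"] / 3.6e6)  # hours

for agent, hours in sorted(by_agent.items()):
    print(f"{agent:40} runs={len(hours):3} "
          f"avg={sum(hours) / len(hours):.1f}h max={max(hours):.1f}h")
```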
@Haroon-Khel
Contributor

Haroon-Khel commented Feb 25, 2020

test-godaddy-ubuntu1604-x64-1, on which the job takes 9 hrs, does so because certain tests under net, nio and rmi fail. These failures usually involve test cases which return 'Connection timed out' errors.

| Test | Duration | Status | Skip | Todo |
| --- | --- | --- | --- | --- |
| jdk_io_0 | 1 min 39 sec | OK | No | No |
| jdk_lang_0 | 14 min | OK | No | No |
| jdk_math_0 | 1 min 55 sec | OK | No | No |
| jdk_net_0 | 1 hr 40 min | NOT OK | No | No |
| jdk_nio_0 | 2 hr 47 min | NOT OK | No | No |
| jdk_security1_0 | 2 min 38 sec | OK | No | No |
| jdk_util_0 | 12 min | OK | No | No |
| jdk_rmi_0 | 3 hr 55 min | NOT OK | No | No |
| jdk_native_sanity_0 | 13 sec | OK | No | No |

This was from build 147. Build 153, which also ran on test-godaddy-ubuntu1604-x64-1, has similar results.
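Those 'Connection timed out' failures are what stretch a ~2 hour run towards 9 hours: each affected test case blocks for the full TCP timeout before failing. A hypothetical sketch of the kind of connectivity probe one could run on a suspect machine (hosts, ports and the timeout are illustrative only, not what the tests actually use):

```python
import socket
import time

def probe(host: str, port: int, timeout: float = 10.0) -> None:
    """Attempt a TCP connect and report how long it took to succeed or fail."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print(f"{host}:{port} reachable in {time.monotonic() - start:.2f}s")
    except OSError as exc:
        print(f"{host}:{port} failed after {time.monotonic() - start:.2f}s: {exc}")

# Illustrative targets; the jdk_net/nio/rmi tests mostly talk to the loopback
# interface and the machine's own hostname, so misconfiguration there is the
# usual suspect.
for target in [("localhost", 80), (socket.gethostname(), 22)]:
    probe(*target)
```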

@Haroon-Khel
Contributor

Haroon-Khel commented Mar 2, 2020

Build 151, which ran on test-scaleway-ubuntu1604-x64-1, was a bit odd. Only two tests failed, java/net/Inet6Address/B6206527.java.B6206527 and java/net/ipv6tests/B6521014.java.B6521014, which took 0.4 and 0.18 seconds respectively, yet the build took 4.5 hours.

@Haroon-Khel
Contributor

Haroon-Khel commented Mar 2, 2020

The machines test-scaleway-ubuntu1604-x64-1 and test-godaddy-ubuntu1604-x64-1 seem to be the only machines which give job times outside of the norm. The typical job time seems to be 1.5 to 2.5 hours, so anything outside of this range should be considered odd.
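Treating that 1.5-2.5 hour band as a rough heuristic (an observation from this issue, not a formal threshold), flagging the odd runs from the trend data is trivial:

```python
def is_odd(duration_hours: float, low: float = 1.5, high: float = 2.5) -> bool:
    """Flag a sanity.openjdk run whose duration falls outside the usual band."""
    return not (low <= duration_hours <= high)

# e.g. builds 153 (9h 3m) and 151 (4h 30m) are flagged, build 148 (2h 4m) is not
assert is_odd(9.05) and is_odd(4.5) and not is_odd(2.07)
```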

@karianna karianna moved this from TODO to In Progress in infrastructure Mar 15, 2020
@sxa
Member Author

sxa commented Apr 2, 2020

In recent runs test-godaddy-centos7-x64-1 was running the suite more slowly than the other machines, although test-scaleway-ubuntu1604-x64-1 has not been unduly slow (3h 4m, although that run had failures). We should keep an eye on this on a weekly basis to ensure there are no significant issues.

@Haroon-Khel
Contributor

Haroon-Khel commented Jun 23, 2020

Just had a quick look through the run times. All are around 1 hr - 1 hr 15 min or under, except for builds 277 and 275, both of which ran on test-scaleway-ubuntu1604-x64-1. These builds also had many com/sun/jdi test failures.

@sxa
Member Author

sxa commented Jun 23, 2020

@smlambert @adam-thorpe are you aware of those failures happening on one of our machines? While I'm somewhat tempted to just decommission this machine at some point, if it's exposing a problem it would be useful to track it.

@smlambert
Contributor

smlambert commented Jun 23, 2020

Searching jdi in the openjdk-tests repo comes up with a list of issues (though mainly for the jdi tests that are .sh scripts and not the tests linked to above).

I guess no one is triaging the sanity.openjdk suite for hotspot runs at the moment (as in trying to figure out root cause), just reporting failures in the build repo (an example where some of these test failures were reported is adoptium/temurin-build#1634 (comment)). It is somewhat telling if they only fail on certain machines; that should give a triager a place to start in terms of finding root cause.

@smlambert
Contributor

smlambert commented Jun 23, 2020

Looking more closely at the jdi failures, they look to be caused by `ERROR: transport error 202: bind failed: Address already in use`, as in some previously started process is still using the socket, so these tests are unable to set up and use the socket because it's already in use.
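A minimal sketch of what that error means at the socket level: if an earlier process is still bound to the port the debugger transport wants, a fresh bind on the same port fails in exactly the way reported above (the port number below is illustrative, not the one JDWP picks):

```python
import errno
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Try to bind the port; EADDRINUSE means another process still holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            raise

print(port_is_free(8000))  # illustrative port only
```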

Related: adoptium/TKG#45 will eventually list what processes are still present on machines (and, if possible, what resources they still have a hold on: sockets, file handles, etc.).

I wonder if it's possible to get more fixes versus more reports via openjdk-build issue 1634?

@sxa
Member Author

sxa commented Jun 24, 2020

No sign of processes being left on the machine (although if they were, sxaProcessCheck would have cleared them up by now), so I'm running https://ci.adoptopenjdk.net/job/Grinder/3441 and https://ci.adoptopenjdk.net/job/Grinder/3443 (3441 failed to copy artifacts as the upstream build job had been cleaned) and will look at the machine afterwards.

@sxa
Member Author

sxa commented Nov 18, 2020

Seems to be running consistently in under an hour now, but I'm running https://ci.adoptopenjdk.net/view/Build%20and%20Test%20Pipeline%20Calendar/job/Test_openjdk11_hs_sanity.openjdk_x86-64_linux/431/ on test-godaddy-ubuntu1604-x64-1 as a final check before closing this.

@sxa
Member Author

sxa commented Nov 18, 2020

This may have been down to leftover processes on the machine. We've done a lot of work recently to resolve such situations, including a run of SXA-platybookCheck with the new kill -KILL option, which cleared up three jobs from extended.system runs at the start of November (so they wouldn't have been around when the initial analysis for this issue was done), but it may have been a cause.
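For illustration only, a hypothetical sketch of that kind of cleanup: find stale test processes by command-line pattern and send them SIGKILL (the equivalent of `kill -KILL`). The pattern, filtering and dry-run default here are assumptions; the real cleanup job applies much more careful checks before killing anything:

```python
import os
import signal
import subprocess

def kill_stale(pattern: str = "jtreg", dry_run: bool = True) -> None:
    """SIGKILL processes whose command line matches `pattern`.

    Illustrative only -- a real cleanup would also filter on process age,
    owning user, and so on before killing anything.
    """
    found = subprocess.run(["pgrep", "-f", pattern],
                           capture_output=True, text=True)
    for line in found.stdout.split():
        pid = int(line)
        if pid == os.getpid():
            continue                      # never kill ourselves
        print(("would kill" if dry_run else "killing"), pid)
        if not dry_run:
            os.kill(pid, signal.SIGKILL)  # same signal as `kill -KILL <pid>`

kill_stale("jtreg", dry_run=True)
```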

The above job completed in 47 minutes, so the original issue is definitely resolved one way or another.

@sxa sxa closed this as completed Nov 18, 2020
infrastructure automation moved this from In Progress to Done Nov 18, 2020
@karianna karianna added this to the November 2020 milestone Nov 19, 2020