
JDK22: Riscv: Test jobs randomly "Terminated" on specific riscv machines #3669

Closed
adamfarley opened this issue Jul 9, 2024 · 4 comments

adamfarley (Contributor) commented:

Summary
The JDK22 RISC-V test jobs seem to fail near-constantly with make "Terminated" messages.

Details
Most JDK22 RISC-V test jobs seem to fail near-constantly, but only since mid-afternoon on July 3rd, and only on specific machines.

The machine split may or may not be significant: the jobs that pass run only on one subset of the RISC-V machines, while the jobs that fail run only on a separate subset. There doesn't appear to be any difference in the machines' tags, so the split may just be random chance.

Machines where failures are seen:

Machines where passes are seen:

Example of a job where one testlist always seems to pass, and the other always seems to fail:
https://ci.adoptium.net/job/Test_openjdk22_hs_extended.perf_riscv64_linux/27/

Error message
Many look like this:

17:01:15  make[1]: *** [compile.mk:45: compile] Terminated
17:01:15  make: *** [makefile:87: compile] Terminated
17:01:15  Terminated
17:01:15  143

But there are variations, like this one:

16:51:40  make[2]: *** [settings.mk:356: testList-..] Terminated
16:51:40  make[1]: *** [makefile:70: _testList] Terminated
16:51:40  make: *** [parallelList.mk:8: testList_0] Terminated
16:51:40  Terminated

I'm lumping these together due to the similarity in JDK version, platform, and the "Terminated" output, but they may prove to be separate issues once the first problem is resolved.
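For reference: make prints "Terminated" when a child process is killed by SIGTERM, and POSIX shells report death-by-signal as 128 plus the signal number, so the bare "143" above is 128 + 15 (SIGTERM); in other words, something external is killing the test processes. A minimal demonstration in a POSIX shell (my illustration, not output from the failing jobs):

    sleep 60 &       # stand-in for a long-running test process
    kill -TERM $!    # something external sends it SIGTERM
    wait $!          # collect the exit status of the killed process
    echo $?          # prints 143, i.e. 128 + 15 (SIGTERM)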

URLs

adamfarley (Contributor, Author) commented:

@sxa - Tagging because Stewart may already be aware of this.

smlambert (Contributor) commented:

Seen also in dry-run triage for JDK22

[Screenshot: dry-run triage results, 2024-07-14]

example from https://ci.adoptium.net/job/Test_openjdk22_hs_sanity.openjdk_riscv64_linux_testList_3/16/console

00:22:27  TEST SETUP:
00:22:27  Nothing to be done for setup.
00:22:27  
00:22:27  TESTING:
00:22:27  Directory "/home/jenkins/workspace/Test_openjdk22_hs_sanity.openjdk_riscv64_linux_testList_3/aqa-tests/TKG/../TKG/output_17208440732658/jdk_math_1/work" not found: creating
00:22:27  Directory "/home/jenkins/workspace/Test_openjdk22_hs_sanity.openjdk_riscv64_linux_testList_3/aqa-tests/TKG/../TKG/output_17208440732658/jdk_math_1/report" not found: creating
00:24:35  XML output with verification to /home/jenkins/workspace/Test_openjdk22_hs_sanity.openjdk_riscv64_linux_testList_3/aqa-tests/TKG/output_17208440732658/jdk_math_1/work
00:31:45  make[2]: *** [settings.mk:356: testList-..] Terminated
00:31:45  make[1]: *** [makefile:70: _testList] Terminated
00:31:45  make: *** [parallelList.mk:17: testList_3] Terminated
00:31:45  Terminated
00:31:45  make[3]: *** [/home/jenkins/workspace/Test_openjdk22_hs_sanity.openjdk_riscv64_linux_testList_3/aqa-tests/TKG/../TKG/settings.mk:356: testList-openjdk] Terminated
00:31:45  make[4]: *** [autoGen.mk:67: jdk_math_1] Terminated

sxa (Member) commented Jul 15, 2024:

test-rise-ubuntu2404-riscv64-5 through 7 were all created around the time these failures were first observed (see #3598 (comment)). They are running the same kernel as the earlier-numbered 2404 machines and have the same amount of swap, so there should be no difference in behavior.
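(For anyone re-checking that comparison, the kernel and swap figures can be read with standard Linux tools on each machine:

    uname -r    # kernel release
    free -h     # total RAM and swap, human-readable
)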

Having said that, based on the information in this issue I'm going to remove ci.role.test from those four machines for now, and add a reference to this issue in the machine descriptions, so that investigation can continue while we avoid any problems during the July release cycle.

sxa self-assigned this Jul 15, 2024
sxa transferred this issue from adoptium/aqa-tests Jul 15, 2024
sxa (Member) commented Jul 15, 2024:

Had a (perhaps obvious) brainwave ... The issue is that some of the agent definitions are pointing at duplicate machines, which is why we're getting terminations: two test jobs running on a single machine will try to clean up each other's processes. The config in Jenkins is correct, but I suspect that when the definition was duplicated, the new agent started up connected to the machine that was the source of the copy and never went through a disconnect/reconnect cycle. -7 was connected to -4's IP.
Sorted (and likewise for -5), so this should no longer occur. Please reopen if for some reason it's seen again.
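To illustrate the suspected failure mode, here is a hypothetical sketch of a per-job cleanup step; the real teardown logic lives in the test pipeline, and these exact commands are an assumption, not the actual code:

    # Hypothetical cleanup (assumed; not the real TKG/Jenkins teardown).
    # Each job kills stray test processes owned by the jenkins user, so on
    # a machine accidentally shared by two jobs, each job also SIGTERMs the
    # other job's processes; their make trees then report "Terminated" and
    # exit with status 143 (128 + SIGTERM), as in the logs above.
    pkill -TERM -u jenkins java || true
    pkill -TERM -u jenkins make || true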

sxa closed this as completed Jul 15, 2024
sxa added this to the 2024-07 (July) milestone Jul 15, 2024