Many network tests failing on AIX in JDK20 (with UnknownHostException: Hostname and service name not provided) #3178

Closed
smlambert opened this issue Jul 17, 2023 · 21 comments

@smlambert
Contributor

smlambert commented Jul 17, 2023

sun/security/krb5/auto/NoAddresses.java appears to fail across all available AIX machines:

java.net.UnknownHostException: adopt06: adopt06: Hostname and service name not provided or found
	at java.base/java.net.InetAddress.getLocalHost(InetAddress.java:1791)
	at NoAddresses.main(NoAddresses.java:51)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:125)
	at java.base/java.lang.Thread.run(Thread.java:1623)
Caused by: java.net.UnknownHostException: adopt06: Hostname and service name not provided or found
	at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Inet6AddressImpl.java:52)
	at java.base/java.net.InetAddress$PlatformResolver.lookupByName(InetAddress.java:1061)
	at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1683)
	at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:1004)
	at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1673)
	at java.base/java.net.InetAddress.getLocalHost(InetAddress.java:1786)
	... 5 more

JavaTest Message: Test threw exception: java.net.UnknownHostException: adopt06: adopt06: Hostname and service name not provided or found
JavaTest Message: shutting down test

STATUS:Failed.`main' threw exception: java.net.UnknownHostException: adopt06: adopt06: Hostname and service name not provided or found
    

Test Info
Test Name: jdk_security4_1
Test Duration: 13 min 8 sec
Machine: test-osuosl-aix72-ppc64-6
TRSS link for the test output: https://trss.adoptium.net/output/test?id=64b3f9a817052c671580a5fe

Build Info
Build Name: Test_openjdk20_hs_sanity.openjdk_ppc64_aix
Jenkins Build start time: Jul 15 2023, 09:08 pm
Jenkins Build URL: https://ci.adoptium.net/job/Test_openjdk20_hs_sanity.openjdk_ppc64_aix/110/
TRSS link for the build: https://trss.adoptium.net/allTestsInfo?buildId=64b3f8a617052c671580a04d

Java Version
openjdk version "20.0.1-beta" 2023-04-18
OpenJDK Runtime Environment Temurin-20.0.1+9-202307152344 (build 20.0.1-beta+9-202307152344)
OpenJDK 64-Bit Server VM Temurin-20.0.1+9-202307152344 (build 20.0.1-beta+9-202307152344, mixed mode)

This test has failed 17 times since Jun 10 2023, 08:34 pm
Java Version when the issue first seen
openjdk version "20.0.1" 2023-04-18
OpenJDK Runtime Environment Temurin-20.0.1+9 (build 20.0.1+9)
OpenJDK 64-Bit Server VM Temurin-20.0.1+9 (build 20.0.1+9, mixed mode)
Jenkins Build URL: https://ci.adoptium.net/job/Test_openjdk20_hs_sanity.openjdk_ppc64_aix/94/

The test failed on machine test-osuosl-aix72-ppc64-6 2 times
The test failed on machine test-osuosl-aix72-ppc64-3 3 times
The test failed on machine test-osuosl-aix72-ppc64-5 4 times
The test failed on machine test-osuosl-aix72-ppc64-4 2 times
The test failed on machine test-osuosl-aix72-ppc64-1 1 times
The test failed on machine test-osuosl-aix72-ppc64-2 4 times
The test failed on machine build-osuosl-aix72-ppc64-2 1 times

Rerun in Grinder


Other network-related targets are also failing on AIX with the same issue, for example:
jdk_nio from https://ci.adoptium.net/job/Test_openjdk20_hs_extended.openjdk_ppc64_aix_testList_2/6/

@smlambert smlambert changed the title NoAddresses.java in jdk_security4_1 FAILED in Test_openjdk20_hs_sanity.openjdk_ppc64_aix with UnknownHostException Many network tests failing on AIX in JDK20 (jdk_security4_1 FAILED in with UnknownHostException) Jul 17, 2023
@smlambert
Contributor Author

Transferring this issue to infrastructure repository.

It continues to be an issue seen on certain machines, including test-osuosl-aix72-ppc64-5, as seen in https://ci.adoptium.net/job/Test_openjdk21_hs_extended.openjdk_ppc64_aix_testList_2/7/

04:05:27  STDERR:
04:05:27  java.net.UnknownHostException: adopt05: adopt05: Hostname and service name not provided or found
04:05:27  	at java.base/java.net.InetAddress.getLocalHost(InetAddress.java:1936)
04:05:27  	at jdk.test.lib.Utils.getHostname(Utils.java:450)
04:05:27  	at JstatdTest.getDestination(JstatdTest.java:111)
04:05:27  	at JstatdTest.runJps(JstatdTest.java:132)
04:05:27  	at JstatdTest.runToolsAndVerify(JstatdTest.java:209)
04:05:27  	at JstatdTest.runTest(JstatdTest.java:346)
04:05:27  	at JstatdTest.doTest(JstatdTest.java:314)
04:05:27  	at TestJstatdPortAndServer.main(TestJstatdPortAndServer.java:40)
04:05:27  	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
04:05:27  	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
04:05:27  	at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:333)
04:05:27  	at java.base/java.lang.Thread.run(Thread.java:1583)

@smlambert smlambert changed the title Many network tests failing on AIX in JDK20 (jdk_security4_1 FAILED in with UnknownHostException) Many network tests failing on AIX in JDK20 (with UnknownHostException: Hostname and service name not provided) Sep 5, 2023
@smlambert smlambert transferred this issue from adoptium/aqa-tests Sep 5, 2023
@smlambert
Contributor Author

Related: #3030

@aixtools
Contributor

aixtools commented Sep 5, 2023

All hosts - or some? As most systems are a clone of adopt10.

I can look for differences, but comparing a known working server with a known failing server is the preferred starting point.

@smlambert
Contributor Author

I believe the issue is observed on test-osuosl-aix72-ppc64-4, test-osuosl-aix72-ppc64-5, test-osuosl-aix72-ppc64-6.

Though I have not exhaustively looked at all hosts, anecdotally -1, -2, -3 look like they do not have this issue.

@smlambert
Contributor Author

To reproduce this issue, use this Rerun in Grinder link and set LABEL to be the hostname to run on, for example set LABEL to test-osuosl-aix72-ppc64-6.

@adamfarley
Contributor

It looks like -3 and -4 were removed from the ansible inventory file back in May, so perhaps these machines can be ignored if they are pending deletion.

I mention the pending deletion of these machines here.

@aixtools
Contributor

aixtools commented Sep 6, 2023

Here are the current records:

BUILD:
      - osuosl:
          aix72-ppc64-1: {ip: 140.211.9.166, description: p8-java1-adopt10.osuosl.org, 7200-02-04-1914}
          aix72-ppc64-2: {ip: 140.211.9.12, description: p8-aix1-adopt02.osuosl.org, 7200-02-04-1914}
TEST:
      - osuosl:
          aix72-ppc64-1: {ip: 140.211.9.28, description: p8-aix1-adopt03.osuosl.org, 7200-04-02-2028}
          aix72-ppc64-2: {ip: 140.211.9.36, description: p8-aix1-adopt04.osuosl.org, 7200-02-05-1938}
          aix72-ppc64-3: {ip: 140.211.9.168, description: p8-java1-adopt07.osuosl.org, 7200-02-04-1914}
          aix72-ppc64-4: {ip: 140.211.9.169, description: p8-java1-adopt08.osuosl.org, 7200-02-04-1914}
          aix72-ppc64-5: {ip: 140.211.9.99, description: p9-aix1-adopt05.osuosl.org, 7200-02-04-1914}
          aix72-ppc64-6: {ip: 140.211.9.100, description: p9-aix1-adopt06.osuosl.org, 7200-02-04-1914}
          aix73-ppc64-1: {ip: 140.211.9.10, description: p8-aix1-adopt01.osuosl.org, 7300-01-02-2320}

All systems are built and active.
-1 and -2 were built long ago (before the new build-aix72-1); the systems -[3,4,5,6] are all cloned from the build-aix72-1 system; the aix73 system was built from DVD.

After the cloning, the playbook was, afaik, rerun over all the aix72 systems, and obviously the aix73 system was built/configured using the playbook.

Any differences are because someone has made changes manually.

No control and/or change history on manual changes.
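
The stack traces in this issue name hosts as adoptXX, while Jenkins uses names such as test-osuosl-aix72-ppc64-6. The description field in the records above ties the two together. A minimal Java sketch of that mapping, assuming the short hostname is always the adoptNN token embedded in the description (the class and method names are hypothetical and not part of any Adoptium tooling):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AdoptName {
    // Assumption: the short hostname is the "adoptNN" token in the description.
    private static final Pattern ADOPT = Pattern.compile("adopt\\d+");

    /** Extracts the adoptNN short name from an inventory description, if present. */
    static Optional<String> fromDescription(String description) {
        Matcher m = ADOPT.matcher(description);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Node name / description pairs copied from the TEST records above.
        String[][] records = {
            {"test-osuosl-aix72-ppc64-3", "p8-java1-adopt07.osuosl.org"},
            {"test-osuosl-aix72-ppc64-4", "p8-java1-adopt08.osuosl.org"},
            {"test-osuosl-aix72-ppc64-5", "p9-aix1-adopt05.osuosl.org"},
            {"test-osuosl-aix72-ppc64-6", "p9-aix1-adopt06.osuosl.org"},
        };
        for (String[] r : records) {
            System.out.println(r[0] + " -> " + fromDescription(r[1]).orElse("?"));
        }
    }
}
```

This is only a reading aid for matching failures to machines; it does not query the inventory itself.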

@aixtools
Contributor

aixtools commented Sep 6, 2023

> It looks like -3 and -4 were removed from the ansible inventory file back in May, so perhaps these machines can be ignored if they are pending deletion.
>
> I mention the pending deletion of these machines here.

There are no aix71 systems remaining - those are the systems that were removed in May.

@adamfarley
Contributor

> There are no aix71 systems remaining - those are the systems that were removed in May.

Oh, ok. Odd that the removed machines had "aix72" in their names.

@adamfarley
Contributor

adamfarley commented Sep 6, 2023

Ok, got my facts straight now.

test-osuosl-aix72-ppc64-3 and -4 were removed from the inventory file in May, but they were later replaced by other machines that now use the same names as the ones that were removed.

So the current test-osuosl-aix72-ppc64-3 and -4 are not pending deletion from Jenkins.

@aixtools
Contributor

aixtools commented Sep 6, 2023 via email

@adamfarley
Contributor

Update: This issue still appears to occur. Example.

java.net.UnknownHostException: adopt07: adopt07: Hostname and service name not provided or found

Seen on test-osuosl-aix72-ppc64-3.

@sxa
Member

sxa commented Nov 2, 2023

NOTE:

  • I'm running Grinders 7992 through 7999 with some known problematic suites on different AIX 7.2 machines. For a summary of how each one went, run curl -s https://ci.adoptium.net/job/Grinder/7992/consoleText | egrep 'Running test |TEST RESULT: ' against each one. Some of them are showing the problems described above. We should aim to get all machines to a state where they can pass a run of the AQA_test_Pipeline job, but I suspect that if we can eliminate the issues shown by the tests in those Grinders (TARGET of testList, TESTLIST=jdk_jdi_jdk8_0,jdk_jdi_jdk8_1,hotspot_jre_0,hotspot_jre_1,jdk_security3_0,hotspot_jre_0,hotspot_jre_1) we'll be most of the way there.
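
The same filtering can be sketched in Java for anyone who wants to post-process saved console logs. This is a minimal sketch that keeps the lines the egrep pattern above would match. The class name and the sample excerpt are hypothetical, and real Grinder console output will differ in detail:

```java
import java.util.List;
import java.util.stream.Collectors;

final class GrinderSummary {
    /** Keeps only lines containing "Running test " or "TEST RESULT: ". */
    static List<String> summarize(String consoleText) {
        return consoleText.lines()
                .filter(l -> l.contains("Running test ") || l.contains("TEST RESULT: "))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical console excerpt, used only to exercise the filter.
        String sample = String.join("\n",
                "Running test jdk_lang_0 ...",
                "some unrelated log line",
                "TEST RESULT: Passed.",
                "Running test jdk_jdi_jdk8_0 ...",
                "TEST RESULT: Failed.");
        summarize(sample).forEach(System.out::println);
    }
}
```

The console text itself would still need to be fetched separately, for example with the curl command quoted above.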

A number of the errors seem likely to be caused by the hostname not being resolvable, i.e. ping $(hostname) doesn't work. I suspect we should look at switching the hostname on each machine to match its Jenkins name, as the machines have entries for those in /etc/hosts.
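
The failing tests all go through InetAddress.getLocalHost(), so the ping $(hostname) check has a direct Java counterpart. A minimal sketch, with hypothetical class and interface names, that reports whether the local hostname resolves:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class LocalHostCheck {
    /** Hook so the lookup can be swapped out; defaults to InetAddress::getLocalHost. */
    interface Lookup {
        InetAddress get() throws UnknownHostException;
    }

    /** Returns a one-line report instead of letting the exception propagate. */
    static String check(Lookup lookup) {
        try {
            InetAddress addr = lookup.get();
            return "OK " + addr.getHostName() + " -> " + addr.getHostAddress();
        } catch (UnknownHostException e) {
            return "UNRESOLVABLE " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The same call that fails at the top of the stack traces in this issue.
        System.out.println(check(InetAddress::getLocalHost));
    }
}
```

On an affected machine this would be expected to print an UNRESOLVABLE line carrying the same message as the stack traces, though that has not been verified here.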

@sxa sxa added this to the Backlog milestone Jan 3, 2024
@smlambert
Contributor Author

From Deep History:
Screenshot 2024-01-08 at 2 00 53 PM

Machines with the Hostname and service name not provided issue:
test-osuosl-aix72-ppc64-3
test-osuosl-aix72-ppc64-4
test-osuosl-aix72-ppc64-5

Machines that do not have that issue:
test-osuosl-aix72-ppc64-1
test-osuosl-aix72-ppc64-2

@sxa
Member

sxa commented Jan 11, 2024

Looking at one of them: adopt06 has this entry in /etc/hosts but has a hostname of adopt06, which is not resolvable. (I would expect that the hostname was updated to match the inventory, but not the actual machine hostname, which is why we get such failures.) TL;DR: ping $(hostname) doesn't work. I'll run tests with the jdk_lang_0,jdk_jdi_jdk8_0,jdk_jdi_jdk8_1 targets to confirm:

@sxa
Member

sxa commented Jan 12, 2024

Noting that the /etc/hosts will get replaced by the regular refreshes from AWX.

I've kicked off the following after reinstating the originally deployed line in /etc/hosts with the adoptXX name while the one that ansible updates (external IP and hostname) is left as-is. This should ensure that the adoptXX line does not disappear. In the general case we have three options and we should decide which is the most appropriate:

  1. What I've done as a temporary fix - re-add the adoptXX to /etc/hosts with the local network IP address - I've done this on test machines 03 through 05
  2. Use a hostname that is in the osuosl DNS so it can be resolved that way e.g. p8-aix1-adopt04 (test machine -2) is set up this way and the hostname gets resolved by the DNS at 140.211.166.130. -1 is the same.
  3. Change the hostname on the machine to match our inventory so that it is resolved via the line that ansible is adding to /etc/hosts. I've now done this on -6 so we can verify the outcome.
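
Options 1 and 3 both depend on /etc/hosts containing a line for whatever name the hostname command reports. A rough Java sketch of that check, assuming the usual hosts layout of an address followed by one or more names. The class name and sample contents are hypothetical, and 10.0.0.6 is a placeholder rather than the real local network IP:

```java
import java.util.Arrays;

final class HostsFileCheck {
    /** Returns true if any non-comment line of the hosts content lists the name. */
    static boolean hasEntry(String hostsContent, String hostname) {
        return hostsContent.lines()
                .map(line -> line.replaceFirst("#.*$", "").trim()) // drop comments
                .filter(line -> !line.isEmpty())
                .map(line -> line.split("\\s+"))
                // Field 0 is the address; the remaining fields are names and aliases.
                .anyMatch(f -> Arrays.asList(f).subList(1, f.length).contains(hostname));
    }

    public static void main(String[] args) {
        // Hypothetical hosts content: the inventory-style line is present,
        // while the adoptXX line is commented out (i.e. effectively missing).
        String hosts = String.join("\n",
                "127.0.0.1 loopback localhost",
                "140.211.9.100 test-osuosl-aix72-ppc64-6",
                "# 10.0.0.6 adopt06");
        System.out.println("adopt06 listed: " + hasEntry(hosts, "adopt06"));
        System.out.println("inventory name listed: "
                + hasEntry(hosts, "test-osuosl-aix72-ppc64-6"));
    }
}
```

This only inspects the hosts content and says nothing about DNS, so it does not cover option 2.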

New test jobs (If Full sanity=N it means I'm just running the same three targets as earlier):

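| Grinder | Full sanity? | host | config | result |
|---------|--------------|--------|--------|------------|
| 8457 | N | test-6 | 3 | PASS |
| 8458 | N | test-5 | 1 | PASS |
| 8459 | Y | test-4 | 1 | Failed [1] |
| 8460 | N | test-3 | 1 | Failed [1] |
| 8461 | Y | test-2 | 2 | PASS |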

[1] - Network tests were all good, so the problem in this issue is resolved; however, these two runs had a failure in java/lang/Thread:

Execution failed: `main' threw exception: java.lang.OutOfMemoryError: unable to create new native thread

@sxa
Member

sxa commented Jan 12, 2024

Noting that the AWX deploy is overwriting my new line, so for now I've added hosts_file to the list of roles to skip in the regular playbook deployment on AIX.

@sxa
Member

sxa commented Jan 14, 2024

Running 100 iterations of the failing thread test (java/lang/ThreadLocal/TestThreadId.java.TestThreadId) on two machines:

| Machine | result |
|---------|------------|
| test-3 | All passed |
| test-5 | All passed |

And ten instances of jdk_lang_0 on:

| Machine | result |
|---------|----------------------|
| test-4 | Mix of pass and fail |

This suggests it's a load issue of some sort when running the whole suite, although the tests are using concurrency:1. This probably needs to be a separate issue, as the original problem described in this issue is now resolved (although it needs an improved playbook fix, since I've manually patched /etc/hosts).

@sxa
Member

sxa commented Jan 15, 2024

Intermittent testThreadId failure is covered under adoptium/aqa-tests#2189

@sxa
Member

sxa commented Jan 15, 2024

As a follow-up to the proposals above, noting that for other operating systems:

On this basis it is likely that making a similar change to the UNIX playbook on the AIX machines is the preferred option here. However, given the proximity to the January release cycle, I suggest we pause anything more for now (although comments/discussion on the options are still welcome). We could also use /etc/motd to remind people on login what the adoptXX name is (similar to what we do for the RISC-V machines at the PLCTlab).

@sxa
Member

sxa commented Jan 15, 2024

Closing as 3344 has been split out to cover a permanent solution going forward, so we don't need to block the hosts_file role.

@sxa sxa closed this as completed Jan 15, 2024
@sxa sxa self-assigned this Jan 15, 2024
@sxa sxa modified the milestones: Backlog, 2024-01 (January) Jan 15, 2024