
ThunderX machines cannot reliably git clone due to random gnuTls recv errors #1897

Closed
andrew-m-leonard opened this issue Feb 5, 2021 · 14 comments

Comments

@andrew-m-leonard
Contributor

error: RPC failed; curl 56 GnuTLS recv error (-24): Decryption has failed.
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

ThunderX nodes to be disabled: #1809 (comment)

@sxa
Member

sxa commented Feb 5, 2021

Problem references:

I'm going to close each of those, as both have now been mitigated: the ThunderX machines appear to be the cause of both and are no longer in use. The openjdk-build issue has a load of history, but it was mostly things that sadly didn't resolve the problem, so switching to this issue as a clean slate, with the history in there, seems like a reasonable course of action. We can continue any investigation towards a resolution here (although it is likely the machines will be decommissioned in the next few months anyway).

My plan, which I hadn't created an issue for, was to try an OS upgrade on one of the ThunderX systems - Ubuntu 18.04 for example - but given that the problem has been seen on both CentOS and Ubuntu systems, I'm not convinced that will make a difference if it's hardware related. The only other option may be to try a TLS implementation built with a different compiler version, in case we're hitting some sort of compiler bug that only affects these systems.

@sxa
Member

sxa commented Feb 5, 2021

I'm going to experiment with test-packet-ubuntu1604-armv8-2. Let's see if the problem shows up on there.

No issues with those jobs, although for a period last night I was consistently failing to complete a checkout of openjdk-tests on the machine.

After running multiple Grinders, https://ci.adoptopenjdk.net/job/Grinder/6507/console failed (as did the following two; the previous jobs completed ok). Whatever this issue is, it's seemingly only happening at certain times (it's 17:27 as I write this, so the failure was in the last 5 minutes or so).


Receiving objects:  15% (1783/11886)   
Receiving objects:  15% (1819/11886), 6.45 MiB | 6.44 MiB/s   
error: RPC failed; curl 56 SSL read: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac, errno 0
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
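
For reference, the failing clone can be repeated by hand with libcurl tracing switched on so the TLS errors are visible as they happen (a minimal sketch; the repository and destination path are just examples):

# Enable git's pack and libcurl/TLS tracing, then retry the clone that fails.
export GIT_TRACE_PACKET=1
export GIT_CURL_VERBOSE=1
git clone https://github.com/AdoptOpenJDK/openjdk-tests.git /tmp/openjdk-tests-repro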

@sxa
Member

sxa commented Feb 8, 2021

Now failing with this - git may need to be rebuilt (or we could use the system one):

Caused by: hudson.plugins.git.GitException: Command "git fetch --tags --progress -- https://github.com/AdoptOpenJDK/openjdk-tests.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: /usr/local/libexec/git-core/git-remote-https: /usr/lib/aarch64-linux-gnu/libcurl.so.4: version `CURL_OPENSSL_3' not found (required by /usr/local/libexec/git-core/git-remote-https)

	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2450)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:2051)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$500(CliGitAPIImpl.java:84)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:573)
	at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$GitCommandMasterToSlaveCallable.call(RemoteGitImpl.java:161)
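
A quick way to confirm that kind of mismatch (a sketch, assuming the paths from the error above) is to compare the version node the locally built git expects against what the installed libcurl actually exports:

# Version nodes the locally built git-remote-https requires from libcurl
objdump -p /usr/local/libexec/git-core/git-remote-https | grep -A3 'required from libcurl'
# Version nodes the distro libcurl provides (CURL_OPENSSL_3 is missing here, per the error)
objdump -T /usr/lib/aarch64-linux-gnu/libcurl.so.4 | grep -o 'CURL_[A-Z_0-9.]*' | sort -u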

@karianna karianna added this to TODO in infrastructure via automation Feb 8, 2021
@karianna karianna added this to the February 2021 milestone Feb 8, 2021
@karianna karianna moved this from TODO to Done in infrastructure Feb 8, 2021
@karianna karianna moved this from Done to In Progress in infrastructure Feb 8, 2021
@sxa
Member

sxa commented Feb 8, 2021

Upgrading to Ubuntu 18 has not resolved this, even after going back to the system git - I ran multiple tests and two of them failed.

During the cloning of openjdk-tests I got this on one run:


Receiving objects:  12% (1427/11886)   
Receiving objects:  13% (1546/11886)   
Receiving objects:  14% (1665/11886)   
Receiving objects:  15% (1783/11886)   
Receiving objects:  15% (1856/11886), 4.54 MiB | 9.07 MiB/s   
Receiving objects:  16% (1902/11886), 13.55 MiB | 13.55 MiB/s   
error: RPC failed; curl 56 GnuTLS recv error (-12): A TLS fatal alert has been received.
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

and on another run:

Receiving objects:  14% (1665/11886)   
Receiving objects:  15% (1783/11886)   
error: RPC failed; curl 56 GnuTLS recv error (-24): Decryption has failed.
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed


	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2450)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:2051)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$500(CliGitAPIImpl.java:84)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:573)
	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$2.execute(CliGitAPIImpl.java:802)
	at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$GitCommandMasterToSlaveCallable.call(RemoteGitImpl.java:161)
	at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$GitCommandMasterToSlaveCallable.call(RemoteGitImpl.java:154)
	at hudson.remoting.UserRequest.perform(UserRequest.java:211)
	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
	at hudson.remoting.Request$2.run(Request.java:375)
	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:73)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to test-packet-ubuntu1604-armv8-2
		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1800)
		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
		at hudson.remoting.Channel.call(Channel.java:1001)
		at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:146)
		at sun.reflect.GeneratedMethodAccessor669.invoke(Unknown Source)
		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
		at java.lang.reflect.Method.invoke(Method.java:498)
		at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:132)
		at com.sun.proxy.$Proxy408.execute(Unknown Source)
		at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1224)
		at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1302)
		at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:125)
		at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:93)
		at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:80)
		at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
		at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$$Lambda$541/000000000000000000.run(Unknown Source)
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
		at java.lang.Thread.run(Thread.java:823)
[Pipeline] }

@sxa sxa removed their assignment Feb 8, 2021
@karianna
Contributor

karianna commented Feb 8, 2021

Wrong TLS version or net split?

@sxa
Member

sxa commented Feb 8, 2021

Wrong TLS version or net split?

It's happening quite frequently and only on this hardware type. The same job run about 10 times failed twice, so it's unlikely they've negotiated the wrong TLS version. We've also had it with checksum failures (see the build issue referenced above, which appears to be in the same area - failures in crypto code), which wouldn't be affected by a netsplit.
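
For what it's worth, the protocol a given host negotiates with GitHub can be checked directly (a sketch, not something from the original investigation):

# Show the TLS protocol version and cipher negotiated from this machine to github.com
openssl s_client -connect github.com:443 -servername github.com </dev/null 2>/dev/null | grep -E 'Protocol|Cipher'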

@sxa
Member

sxa commented Feb 8, 2021

Random musing ... would it react differently if OpenSSL was built without hardware crypto support (e.g. ./config no-asm no-engine no-threads, or even a debug build with -d)?
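
For reference, a build along those lines would look roughly like this (a sketch; the version, download URL and install prefix are illustrative rather than the exact ones used on the machines):

# Build OpenSSL with assembler and engine acceleration disabled so the crypto
# code paths run as plain C on the ThunderX cores.
curl -LO https://www.openssl.org/source/openssl-1.1.1i.tar.gz
tar xzf openssl-1.1.1i.tar.gz && cd openssl-1.1.1i
./config --prefix=/usr/local/openssl-noasm no-asm no-engine no-threads
make -j"$(nproc)" && sudo make install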

@sxa
Member

sxa commented Feb 8, 2021

I've now got five loops running in parallel:

  • The system version of OpenSSL on Ubuntu 18.04 (some version of 1.1.1 dated 11 Sep 2018)
  • Normal accelerated OpenSSL 1.1.1i
  • Non-accelerated version of 1.1.1i (no-engine no-hw)
  • No-assembler version of 1.1.1i (no-engine no-hw no-threads no-asm)
  • Debug version

Initially I just had the first three, and they all showed the problem at one point (all at around the same ten-second period).
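
Each loop is essentially a repeated clone pointed at a different OpenSSL build, something along these lines (a sketch; the per-build library path is a placeholder):

# One clone loop: LD_LIBRARY_PATH selects which libcrypto/libssl build gets
# picked up, and every failure is logged with a timestamp.
LIBDIR=$1   # e.g. /usr/local/openssl-noasm/lib (placeholder)
while true; do
  rm -rf /tmp/clone-test
  if ! LD_LIBRARY_PATH="$LIBDIR" git clone -q https://github.com/AdoptOpenJDK/openjdk-tests.git /tmp/clone-test; then
    echo "$(date -u +%FT%TZ) clone failed with $LIBDIR" >> ~/clone-failures.log
  fi
  sleep 10
done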

@lumpfish

lumpfish commented Feb 9, 2021

Maybe related: adoptium/aqa-systemtest#402

@sxa
Member

sxa commented Feb 9, 2021

Two observations:

  • It doesn't seem to show up much while those parallel loops are running on their own - I only seem to see issues when there is another job running on the machine at the same time (e.g. a Grinder), but it doesn't always fail when there's another job running
  • It doesn't seem to fail (or at least hasn't during my testing) when a no-asm version of OpenSSL is being used, so one option would be to update OpenSSL to a build with no-asm - in addition to the above sets I've also tried one that only has no-asm and not the other options

On this basis I'm going to set up one of the ThunderX machines with docker containers and attempt to replace the system OpenSSL with one built with no-asm, then re-enable it for testing. It's likely that libcrypto.so is all that strictly needs to be replaced.
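
Roughly what that replacement looks like (a sketch with illustrative paths; the exact library names and locations differ between the distros involved):

# Point the system libcrypto at the no-asm build, keeping the original
# around so the change can be reverted if anything else breaks.
NOASM=/usr/local/openssl-noasm/lib/libcrypto.so.1.1   # placeholder path
sudo mv /usr/lib/aarch64-linux-gnu/libcrypto.so.1.1 /usr/lib/aarch64-linux-gnu/libcrypto.so.1.1.orig
sudo ln -s "$NOASM" /usr/lib/aarch64-linux-gnu/libcrypto.so.1.1
sudo ldconfig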

@sxa
Member

sxa commented Feb 10, 2021

OK, re-enabled the following after replacing libcrypto with a symlink to mine (Ubuntu 16.04 excluded as it uses OpenSSL 1.0 and I haven't built a copy of that):

https://ci.adoptopenjdk.net/computer/test-docker-fedora33-armv8-1 (/usr/lib64/libcrypto.so.1.1.1i) is not active since the krb5-libs package does not work with the updated libcrypto.

The following have been marked temporarily offline to force all jobs to run on the above machines or the new alibaba ones:

Current enabled set of machines can be seen at https://ci.adoptopenjdk.net/label/ci.role.test&&sw.os.linux&&hw.arch.aarch64/

aarch64 pipelines have now been initiated for JDK11 and JDK16 to test

@sxa
Member

sxa commented Feb 11, 2021

OK, that experiment didn't quite go as well as I expected.

First some good news: I'd replaced /usr/lib64/libcrypto.so.10 and /usr/lib64/libssl.so.10 with links to no-asm versions on build-packet-centos75-armv8-1, and while it gives warnings it appears to work.

On test-docker-ubuntu2004-armv8-1

On test-docker-ubuntu1804-armv8-1 and test-docker-ubuntu1604-armv8-1, /usr/lib/git-core/git-remote-https is not linked against libcrypto.so directly, only libk5crypto.so.3 and libhcrypto.so.4, which I didn't have replacements for, so we got multiple failures on those machines. NOTE: on both systems wget is NOT linked against those two, so it likely works reliably, but git does not. I've rebuilt openssl, libcurl and git into /usr/local and symlinked all of their libraries from /usr/local/lib into /usr/local/lib/aarch64-linux-gnu so that they are picked up by the default linker. (Oddly that works for libcrypto based on the ldd of git-remote-https, although not when I try to set up another 18.04 machine the same way...)
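
To see which crypto libraries a given machine's git is actually pulling in (a sketch of the kind of check described above):

# List the TLS/crypto libraries git's HTTPS helper is linked against; on the
# Ubuntu boxes this shows libk5crypto/libhcrypto rather than libcrypto directly.
ldd "$(git --exec-path)/git-remote-https" | grep -Ei 'crypto|ssl|curl|gnutls'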

@sxa sxa modified the milestones: February 2021, Icebox / On Hold Feb 17, 2021
@sxa
Member

sxa commented Apr 7, 2021

These machines will be decommissioned in favour of the Ampere ones being set up as part of #2078

@sxa sxa closed this as completed Apr 7, 2021
infrastructure automation moved this from In Progress to Done Apr 7, 2021
@karianna karianna modified the milestones: Icebox / On Hold, April 2021 Apr 7, 2021