gpuCI broken #11312

Open

fjetter opened this issue Aug 14, 2024 · 7 comments
Labels: needs triage (Needs a response from a contributor)

fjetter (Member) commented Aug 14, 2024

This has already been raised on #11242, but I always have difficulty finding that draft PR, and from what I can tell the failures are not related to a version update.

gpuCI has been pretty consistently failing for a while now.

Logs show something like (from #11310 // https://gpuci.gpuopenanalytics.com/job/dask/job/dask/job/prb/job/dask-prb/6185/console)

```
15:05:47 GitHub pull request #11310 of commit 355b76fc0632708894cfc1c17ce55b80cef8bbbb, no merge conflicts.
15:05:47 Running as SYSTEM
15:05:47 Setting status of 355b76fc0632708894cfc1c17ce55b80cef8bbbb to PENDING with url https://gpuci.gpuopenanalytics.com/job/dask/job/dask/job/prb/job/dask-prb/6185/ and message: 'Running'
15:05:47 Using context: gpuCI/dask/pr-builder
15:10:13 FATAL: java.io.IOException: Unexpected EOF
15:10:13 java.io.IOException: Unexpected EOF
15:10:13 	at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:101)
15:10:13 	at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
15:10:13 	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
15:10:13 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
15:10:13 Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to EC2 (aws-b) - runner-m5d2xl (i-00c57ce783f1c62db)
15:10:13 		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1787)
15:10:13 		at hudson.remoting.Request.call(Request.java:199)
15:10:13 		at hudson.remoting.Channel.call(Channel.java:1002)
15:10:13 		at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1121)
15:10:13 		at hudson.Launcher$ProcStarter.start(Launcher.java:506)
15:10:13 		at hudson.Launcher$ProcStarter.join(Launcher.java:517)
15:10:13 		at com.gpuopenanalytics.jenkins.remotedocker.AbstractDockerLauncher.parseVersion(AbstractDockerLauncher.java:193)
15:10:13 		at com.gpuopenanalytics.jenkins.remotedocker.AbstractDockerLauncher.<init>(AbstractDockerLauncher.java:54)
15:10:13 		at com.gpuopenanalytics.jenkins.remotedocker.DockerLauncher.<init>(DockerLauncher.java:54)
15:10:13 		at com.gpuopenanalytics.jenkins.remotedocker.RemoteDockerBuildWrapper.decorateLauncher(RemoteDockerBuildWrapper.java:164)
15:10:13 		at hudson.model.AbstractBuild$AbstractBuildExecution.createLauncher(AbstractBuild.java:613)
15:10:13 		at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:485)
15:10:13 		at hudson.model.Run.execute(Run.java:1894)
15:10:13 		at hudson.matrix.MatrixBuild.run(MatrixBuild.java:323)
15:10:13 		at hudson.model.ResourceController.execute(ResourceController.java:101)
15:10:13 		at hudson.model.Executor.run(Executor.java:442)
15:10:13 Caused: hudson.remoting.RequestAbortedException
15:10:13 	at hudson.remoting.Request.abort(Request.java:346)
15:10:13 	at hudson.remoting.Channel.terminate(Channel.java:1083)
15:10:13 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:90)
15:10:13 Setting status of 355b76fc0632708894cfc1c17ce55b80cef8bbbb to FAILURE with url https://gpuci.gpuopenanalytics.com/job/dask/job/dask/job/prb/job/dask-prb/6185/ and message: 'Build failure
15:10:13  '
15:10:13 Using context: gpuCI/dask/pr-builder
15:10:13 Finished: FAILURE
```

cc @dask/gpu

github-actions bot added the needs triage label on Aug 14, 2024
rjzamora self-assigned this on Aug 14, 2024
rjzamora (Member) commented

Thanks for raising an issue @fjetter - I'll definitely work on getting gpuCI back in a passing state today.

I'm not sure what is causing the build failure, but I do know some recent dask/array work has definitely broken cupy support.
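For context, the regressions gpuCI is meant to catch are usually of this flavor: a dask.array change that silently stops preserving the cupy backend. A minimal sketch of the kind of round trip it exercises (illustrative only, not the actual failing test):

```python
import cupy
import dask.array as da

# Build a cupy-backed dask array; x._meta records the chunk type (cupy.ndarray).
x = da.from_array(cupy.arange(100), chunks=10)
assert isinstance(x._meta, cupy.ndarray)

# Elementwise ops and reductions should keep the data on the GPU,
# so the computed result is still a cupy array rather than a numpy one.
result = (x + 1).sum().compute()
assert isinstance(result, cupy.ndarray)
```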

rjzamora (Member) commented

I'm really sorry there has been so much unwanted gpuCI noise lately. It looks like gpuCI is now "fixed" in the sense that the pytests should all pass. However, the java.io.IOException described at the top of this issue does still happen intermittently for some reason.

We have not figured out how to fix this intermittent failure yet. However, if you do happen to see this failure in the wild, members of the dask org can re-run the gpuCI check (and only that check) by commenting Rerun tests (e.g. #11294 (comment)).

cc @fjetter @phofl @jrbourbeau @hendrikmakait (just to make sure you know about Rerun tests)

fjetter (Member, Author) commented Aug 30, 2024

@dask/gpu gpuCI appears to be broken again. One example #11354 but there are other failures and it looks quite intermittent. Looking at Jenkins this almost feels like a gpuCI internal problem.

rjzamora (Member) commented

> Looking at Jenkins this almost feels like a gpuCI internal problem.

Right, the failures are intermittent, and the check can always be re-run with a Rerun tests comment (it typically takes a few minutes for gpuCI to turn green after you make the comment).

Our ops team is currently working on a replacement for our Jenkins infrastructure. I'm sorry again for the noise.

fjetter (Member, Author) commented Aug 30, 2024

How long will it take to replace the Jenkins infra? I currently feel gpuCI is not delivering a lot of value and is just noise. Would you mind if we disabled this until it is reliable again?

rjzamora (Member) commented

> How long will it take to replace the Jenkins infra? I currently feel gpuCI is not delivering a lot of value and is just noise. Would you mind if we disabled this until it is reliable again?

We are discussing this internally to figure out the best way to proceed, but I do have a strong preference to keep gpuCI turned on for now if you/others are willing.

Our team obviously finds gpuCI valuable, but I do understand why you would see things a different way. When gpuCI was actually broken a few weeks ago (not just flaky the way it is now), changes were merged into main that broke cupy support. In theory, gpuCI is a convenient way for contributors/maintainers to know right away if a new change is likely to break GPU compatibility.

The alternative is of course that we (RAPIDS) run our own nightly tests against main, and raise an issue when something breaks. In some cases, the fix will be simple. In others, the change could be a nightmare to roll back or fix. What would be an ideal developer experience on your end? I'm hoping we can work toward something that makes everyone "happy enough".
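For concreteness, the nightly-test route would boil down to running a small GPU smoke-test suite against dask main on a schedule, something along the lines of the following (a hypothetical test, not one that exists today):

```python
import pytest

cupy = pytest.importorskip("cupy")  # skip cleanly on machines without a GPU stack
import dask.array as da


def test_cupy_backed_matmul_reduction():
    # Round-trip a cupy-backed dask array through a blockwise matmul and a
    # reduction, and check that the result never falls back to numpy.
    x = da.from_array(cupy.random.random((1000, 1000)), chunks=(250, 250))
    total = (x @ x.T).sum().compute()
    assert isinstance(total, cupy.ndarray)
```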

jakirkham (Member) commented

Roughly a year ago, we proposed moving Dask to a GitHub Actions-based system for GPU CI in dask/community#348.

We didn't hear much from other maintainers there (admittedly, there could have been offline discussion I'm unaware of).

Perhaps it is worth reading that issue and sharing your thoughts on that approach? 🙂
