gpuCI broken #11312

Open

fjetter opened this issue Aug 14, 2024 · 7 comments
Labels: needs triage (Needs a response from a contributor)

fjetter (Member) commented Aug 14, 2024

This has already been raised on #11242, but I always have difficulty finding that draft PR, and from what I can tell the failures are not related to a version update.

gpuCI has been pretty consistently failing for a while now.

Logs show something like (from #11310 // https://gpuci.gpuopenanalytics.com/job/dask/job/dask/job/prb/job/dask-prb/6185/console)

```
15:05:47 GitHub pull request #11310 of commit 355b76fc0632708894cfc1c17ce55b80cef8bbbb, no merge conflicts.
15:05:47 Running as SYSTEM
15:05:47 Setting status of 355b76fc0632708894cfc1c17ce55b80cef8bbbb to PENDING with url https://gpuci.gpuopenanalytics.com/job/dask/job/dask/job/prb/job/dask-prb/6185/ and message: 'Running'
15:05:47 Using context: gpuCI/dask/pr-builder
15:10:13 FATAL: java.io.IOException: Unexpected EOF
15:10:13 java.io.IOException: Unexpected EOF
15:10:13 	at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:101)
15:10:13 	at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
15:10:13 	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
15:10:13 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
15:10:13 Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to EC2 (aws-b) - runner-m5d2xl (i-00c57ce783f1c62db)
15:10:13 		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1787)
15:10:13 		at hudson.remoting.Request.call(Request.java:199)
15:10:13 		at hudson.remoting.Channel.call(Channel.java:1002)
15:10:13 		at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1121)
15:10:13 		at hudson.Launcher$ProcStarter.start(Launcher.java:506)
15:10:13 		at hudson.Launcher$ProcStarter.join(Launcher.java:517)
15:10:13 		at com.gpuopenanalytics.jenkins.remotedocker.AbstractDockerLauncher.parseVersion(AbstractDockerLauncher.java:193)
15:10:13 		at com.gpuopenanalytics.jenkins.remotedocker.AbstractDockerLauncher.<init>(AbstractDockerLauncher.java:54)
15:10:13 		at com.gpuopenanalytics.jenkins.remotedocker.DockerLauncher.<init>(DockerLauncher.java:54)
15:10:13 		at com.gpuopenanalytics.jenkins.remotedocker.RemoteDockerBuildWrapper.decorateLauncher(RemoteDockerBuildWrapper.java:164)
15:10:13 		at hudson.model.AbstractBuild$AbstractBuildExecution.createLauncher(AbstractBuild.java:613)
15:10:13 		at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:485)
15:10:13 		at hudson.model.Run.execute(Run.java:1894)
15:10:13 		at hudson.matrix.MatrixBuild.run(MatrixBuild.java:323)
15:10:13 		at hudson.model.ResourceController.execute(ResourceController.java:101)
15:10:13 		at hudson.model.Executor.run(Executor.java:442)
15:10:13 Caused: hudson.remoting.RequestAbortedException
15:10:13 	at hudson.remoting.Request.abort(Request.java:346)
15:10:13 	at hudson.remoting.Channel.terminate(Channel.java:1083)
15:10:13 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:90)
15:10:13 Setting status of 355b76fc0632708894cfc1c17ce55b80cef8bbbb to FAILURE with url https://gpuci.gpuopenanalytics.com/job/dask/job/dask/job/prb/job/dask-prb/6185/ and message: 'Build failure
15:10:13  '
15:10:13 Using context: gpuCI/dask/pr-builder
15:10:13 Finished: FAILURE
```

cc @dask/gpu

github-actions bot added the needs triage label on Aug 14, 2024
rjzamora self-assigned this on Aug 14, 2024
rjzamora (Member) commented

Thanks for raising an issue @fjetter - I'll definitely work on getting gpuCI back in a passing state today.

I'm not sure what is causing the build failure, but I do know some recent dask/array work has definitely broken cupy support.
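For context, the regressions gpuCI is meant to catch are usually of this flavor: a dask.array change that silently stops preserving the cupy backend. A minimal sketch of the kind of round trip it exercises (illustrative only, not the actual failing test):

```python
import cupy
import dask.array as da

# Build a cupy-backed dask array; x._meta records the chunk type (cupy.ndarray).
x = da.from_array(cupy.arange(100), chunks=10)
assert isinstance(x._meta, cupy.ndarray)

# Elementwise ops and reductions should keep the data on the GPU,
# so the computed result is still a cupy array rather than a numpy one.
result = (x + 1).sum().compute()
assert isinstance(result, cupy.ndarray)
```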

rjzamora (Member) commented

I'm really sorry there has been so much unwanted gpuCI noise lately. It looks like gpuCI is now "fixed" in the sense that the pytests should all pass. However, the java.io.IOException described at the top of this issue does still happen intermittently for some reason.

We have not figured out how to fix this intermittent failure yet. However, if you do happen to see this failure in the wild, members of the dask org can re-run the gpuCI check (and only that check) by commenting Rerun tests (e.g. #11294 (comment)).

cc @fjetter @phofl @jrbourbeau @hendrikmakait (just to make sure you know about Rerun tests)

fjetter (Member, Author) commented Aug 30, 2024

@dask/gpu gpuCI appears to be broken again. One example #11354 but there are other failures and it looks quite intermittent. Looking at Jenkins this almost feels like a gpuCI internal problem.

rjzamora (Member) commented

> Looking at Jenkins this almost feels like a gpuCI internal problem.

Right, the failures are intermittent, and the check can always be re-run with a Rerun tests comment (it typically takes a few minutes for gpuCI to turn green after you make the comment).

Our ops team is currently working on a replacement for our Jenkins infrastructure. I'm sorry again for the noise.

fjetter (Member, Author) commented Aug 30, 2024

How long will it take to replace the Jenkins infra? I currently feel gpuCI is not delivering a lot of value and is just noise. Would you mind if we disabled this until it is reliable again?

rjzamora (Member) commented

> How long will it take to replace the Jenkins infra? I currently feel gpuCI is not delivering a lot of value and is just noise. Would you mind if we disabled this until it is reliable again?

We are discussing this internally to figure out the best way to proceed, but I do have a strong preference to keep gpuCI turned on for now if you/others are willing.

Our team obviously finds gpuCI valuable, but I do understand why you would see things a different way. When gpuCI was actually broken a few weeks ago (not just flaky the way it is now), changes were merged into main that broke cupy support. In theory, gpuCI is a convenient way for contributors/maintainers to know right away if a new change is likely to break GPU compatibility.

The alternative is of course that we (RAPIDS) run our own nightly tests against main, and raise an issue when something breaks. In some cases, the fix will be simple. In others, the change could be a nightmare to roll back or fix. What would be an ideal developer experience on your end? I'm hoping we can work toward something that makes everyone "happy enough".
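For concreteness, the nightly-test route would boil down to running a small GPU smoke-test suite against dask main on a schedule, something along the lines of the following (a hypothetical test, not one that exists today):

```python
import pytest

cupy = pytest.importorskip("cupy")  # skip cleanly on machines without a GPU stack
import dask.array as da


def test_cupy_backed_matmul_reduction():
    # Round-trip a cupy-backed dask array through a blockwise matmul and a
    # reduction, and check that the result never falls back to numpy.
    x = da.from_array(cupy.random.random((1000, 1000)), chunks=(250, 250))
    total = (x @ x.T).sum().compute()
    assert isinstance(total, cupy.ndarray)
```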

jakirkham (Member) commented

Roughly a year ago, we proposed moving Dask to a GitHub Actions-based system for GPU CI in dask/community#348.

We didn't hear much from other maintainers there (admittedly, there could have been offline discussion I'm unaware of).

Perhaps it is worth reading that issue and sharing your thoughts on that approach? 🙂
