
Add experimental GPU support #5605

Merged
26 commits merged Oct 27, 2023

Conversation

matiasinsaurralde
Contributor

  • Implements basic GPU access by relying on the nvidia-container-runtime-hook.
  • Allows specifying the GPU ID to be exposed to the container on machines that host multiple GPUs.
  • A new WithGPU method is implemented to allow specifying GPU ID (or just all if you want to expose all GPUs):
...
	ctr := c.Container().From(cudaImage)
	contents, err := ctr.
		// WithGPU(dagger.ContainerWithGPUOpts{Devices: "GPU-5d8950fe-17a6-2fa7-9baa-afa83bba0e2b"}).
		WithGPU(dagger.ContainerWithGPUOpts{Devices: "all"}).
		WithExec([]string{"nvidia-smi", "-L"}).
		Stdout(ctx)
...
  • There's an integration test that covers both single-GPU and multi-GPU environments here.
  • Requires the host to add a shell script at /usr/bin/nvidia_helper.sh with the following contents (I still don't know the best way to ship this, and the commands look too verbose and would probably be unreadable if we inlined them directly with WithExec; open to suggestions!): nvidia_helper.sh contents.
  • For now the dev engine container was switched completely to Ubuntu, and we should be able to improve this part. For example, if the user requires GPU access, use Ubuntu as the base image; if not, keep using Alpine. The reasons were mentioned here. TL;DR: Nvidia doesn't ship official container runtime tooling for Alpine, as you can find here, and it's generally better and safer to use a standard base image rather than hacking around Alpine to make it work.

Ticket is #4675

@TomChv TomChv requested review from jlongtine, vito and sipsma and removed request for vito August 10, 2023 13:48
Member

@TomChv TomChv left a comment

Amazing PR! I left a few questions and comments and look forward to a deeper review when the PR is ready :D

Review threads (outdated, resolved) on: cmd/shim/main.go (×2), core/schema/container.graphqls, internal/mage/util/engine.go
@TomChv
Member

TomChv commented Aug 10, 2023

Requires the host to add a shell script in /usr/bin/nvidia_helper.sh with the following contents (still don't know the best way to ship this and the commands look too verbose and probably unreadable if we inline them directly with WithExec, open to suggestions!): nvidia_helper.sh contents.

We could embed it in the Dagger binary and insert it into the container when WithGPU is called.
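A rough sketch of that idea, assuming go:embed and the Go SDK's container file API of the time (the helper and function names are illustrative, not the actual engine code):

package main

import (
	_ "embed"

	"dagger.io/dagger"
)

// The helper script would be compiled into the binary at build time.
//
//go:embed nvidia_helper.sh
var nvidiaSetupScript string

// withGPUSetup writes the embedded helper into the container and runs it,
// so the host no longer needs to provide /usr/bin/nvidia_helper.sh itself.
func withGPUSetup(ctr *dagger.Container) *dagger.Container {
	return ctr.
		WithNewFile("/usr/bin/nvidia_helper.sh", dagger.ContainerWithNewFileOpts{
			Contents:    nvidiaSetupScript,
			Permissions: 0o755,
		}).
		WithExec([]string{"/usr/bin/nvidia_helper.sh"})
}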

@matiasinsaurralde matiasinsaurralde force-pushed the gpu-access-2 branch 2 times, most recently from 7832a1b to d132697 Compare August 15, 2023 15:38
@matiasinsaurralde
Contributor Author

Updates:

  • By default no GPU access is attempted.
  • An environment variable called _EXPERIMENTAL_DAGGER_GPU_SUPPORT needs to be set to enable GPU support.
  • If _EXPERIMENTAL_DAGGER_GPU_SUPPORT is set dev engine starts with Ubuntu and installs all Nvidia requirements.
  • If _EXPERIMENTAL_DAGGER_GPU_SUPPORT is not set, dev engine starts with Alpine with all regular dependencies.
  • If _EXPERIMENTAL_DAGGER_GPU_SUPPORT is not set and WithGPU is used before WithExec, WithExec throws an error:
paperspace@psal6i8au:~/go/src/github.com/dagger/dagger$ go test -v -count 1 -timeout 1000s -run=TestGPUAccess ./core/integration
=== RUN   TestGPUAccess
    gpu_test.go:43: 
        	Error Trace:	/home/paperspace/go/src/github.com/dagger/dagger/core/integration/gpu_test.go:43
        	Error:      	Received unexpected error:
        	            	input:1: container.from.withGPU.withExec GPU support is not enabled, set _EXPERIMENTAL_DAGGER_GPU_SUPPORT
        	            	
        	            	Please visit https://dagger.io/help#go for troubleshooting guidance.
        	Test:       	TestGPUAccess
--- FAIL: TestGPUAccess (1.57s)
FAIL
FAIL	github.com/dagger/dagger/core/integration	1.593s
FAIL
  • If _EXPERIMENTAL_DAGGER_GPU_SUPPORT is set and WithGPU is used and there are available GPUs, selected GPUs should be exposed to the container as requested.

@shykes
Contributor

shykes commented Aug 15, 2023

Replying here to a discord message by @matiasinsaurralde :

  • We introduce an environment variable to enable experimental GPU support. If this is not set we follow the regular Dagger behavior (Alpine is used, etc.).
  • If the environment variable is set we use an Ubuntu base image and setup all Nvidia dependencies on it.
  • I've also spent time testing different scenarios like what happens when you run in a host that doesn't have any Nvidia GPUs, etc.
  • Shim is also conditioned to the experimental GPU support flag, e.g. won't attempt to inject the Nvidia container runtime hook if this is not set.
  • Basic flow to try it out should be: (i) add export _EXPERIMENTAL_DAGGER_GPU_SUPPORT=1 to hack/dev. (ii) run ./hack/dev bash, (iii) Run the tests (or replace them with anything you would like to try): go test -v -count 1 -timeout 1000s -run=TestGPUAccess ./core/integration

Is the idea to move everything to Ubuntu when the experimental gate is removed? I worry about maintaining two different images for a long period of time.

Member

@TomChv TomChv left a comment

Nice improvement, I really like it!

I left a comment related to actual usage outside Dagger's own dogfooding :)

Careful btw, the DCO is currently failing :)

@@ -58,6 +61,17 @@ insecure-entitlements = ["security.insecure"]
{{ end -}}
`

// nvidiaSetupHelper provides the required steps to setup nvidia-container-toolkit:
const nvidiaSetupHelper = `
Member

What if I want to use it on my computer for my own pipeline? I won't have access to this file since it's internal to mage.
Is there another way to handle that? Or should I prepare a container the same way you do it here, but on my own?

Contributor Author

@TomChv Good point; thinking this could be moved to the shim, so the file gets initialized there if GPU access is enabled?

Member

I think that could work yeah, give it a try

Contributor Author

@matiasinsaurralde matiasinsaurralde Aug 17, 2023

Have been exploring this a bit; would it be better to set up the Nvidia runtime at this level? https://github.com/matiasinsaurralde/dagger/blob/gpu-access-2/core/container.go#L1034

So WithGPU is called, the Nvidia runtime is set up, and we still pass the parameters to the shim (we'll always need this to signal GPU visibility to the prestart hook).
I don't see an alternative that doesn't involve installing the Nvidia runtime every time we create and start a container. We previously tried mounting the Nvidia runtime files from the host into the container (so that no installation step happens), but that turns out to be tricky if the container and the host aren't running similar environments.

On the other hand, I believe that if WithGPU introduces an additional step to run the helper script and install the Nvidia runtime, it could play well with caching: subsequent runs wouldn't install the runtime again. Does that make sense?

Let me know if I misunderstood the scenario you described, still thinking about this.

Member

Have been exploring this a bit, would it be better to setup the Nvidia runtime at this level? https://github.com/matiasinsaurralde/dagger/blob/gpu-access-2/core/container.go#L1034

This would be better, but you don't know which image is used by the container. Since the Nvidia runtime can also be set up on Ubuntu (as far as I understand), this step will mostly fail unless the base image is correct.
That would become tricky because some parts of the setup would be up to the user and others would be on Dagger's side; I think it would create a lack of flexibility.

But with Zenith we might be able to solve this issue thanks to a special GPU environment (think of it as an extension) that could be loaded by the user. That would make it much easier!

Contributor Author

@matiasinsaurralde matiasinsaurralde Aug 17, 2023

@TomChv I didn't know about Zenith, just reading about it.

CUDA only supports four image types for now (Ubuntu, UBI, Rocky Linux and CentOS), and installation steps for the runtime hook are limited too: one set of instructions works for Ubuntu and the other for CentOS/RHEL: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#

Wondering if we could try to be smart here and perform some small probe to determine which one the container image is running, e.g. a CentOS/RHEL image will contain the dnf binary but Ubuntu won't, etc. (see the sketch below).

As an alternative, WithGPU could introduce a configuration parameter that takes the distro flavor.
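A minimal sketch of that probe, assuming the public Go SDK and a shell-based check (the function name and the rhel/debian labels are illustrative):

package main

import (
	"context"
	"strings"

	"dagger.io/dagger"
)

// detectDistroFamily runs a cheap probe inside the container: if dnf is on
// PATH we assume the CentOS/RHEL/UBI install path, otherwise Debian/Ubuntu.
func detectDistroFamily(ctx context.Context, ctr *dagger.Container) (string, error) {
	out, err := ctr.
		WithExec([]string{"sh", "-c", "command -v dnf >/dev/null 2>&1 && echo rhel || echo debian"}).
		Stdout(ctx)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(out), nil
}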

Contributor

What if I want to use in on my computer for my own pipeline? I'll not have access to this file since it's internal to mage.

I'm confused - wouldn't you be able to just export _EXPERIMENTAL_DAGGER_GPU_SUPPORT=1 and then ./hack/dev?

Contributor Author

@matiasinsaurralde matiasinsaurralde Aug 21, 2023

@TomChv have been revisiting this. I think there's some confusion about the environment where the hook runs:

  • If we change the engine's base image to Ubuntu, we only need to install the Nvidia Container Runtime at that level (the engine container).
  • The shim should be able to find the path to nvidia-container-runtime-hook in the engine's container; that's all (see the sketch below).
  • We don't need to install Nvidia tooling in the Dagger-created containers; we assume the image specified by the user already contains the Nvidia dependencies. It could be a CUDA image directly, like nvidia/cuda:11.7.1-base-ubuntu20.04 or nvidia/cuda:11.7.1-base-centos7, or a custom image created by the user that's based on any of these original CUDA images. I will do some additional testing around this topic today.
  • We don't really need different behavior for distro flavors, as we have control over which image to use for the engine's container.
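A sketch of that shim-side lookup (assuming a plain PATH lookup in the engine container is enough; the function name is illustrative, not the shim's actual code):

package main

import (
	"fmt"
	"os/exec"
)

// findNvidiaHook resolves nvidia-container-runtime-hook from the engine
// container's PATH; the hook only needs to be installed at the engine level.
func findNvidiaHook() (string, error) {
	path, err := exec.LookPath("nvidia-container-runtime-hook")
	if err != nil {
		return "", fmt.Errorf("nvidia-container-runtime-hook not found in engine image: %w", err)
	}
	return path, nil
}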

@matiasinsaurralde
Contributor Author

Replying here to a discord message by @matiasinsaurralde :

  • We introduce an environment variable to enable experimental GPU support. If this is not set we follow the regular Dagger behavior (Alpine is used, etc.).
  • If the environment variable is set we use an Ubuntu base image and setup all Nvidia dependencies on it.
  • I've also spent time testing different scenarios like what happens when you run in a host that doesn't have any Nvidia GPUs, etc.
  • Shim is also conditioned to the experimental GPU support flag, e.g. won't attempt to inject the Nvidia container runtime hook if this is not set.
  • Basic flow to try it out should be: (i) add export _EXPERIMENTAL_DAGGER_GPU_SUPPORT=1 to hack/dev. (ii) run ./hack/dev bash, (iii) Run the tests (or replace them with anything you would like to try): go test -v -count 1 -timeout 1000s -run=TestGPUAccess ./core/integration

Is the idea to move everything to ubuntu when the experimental gate is removed ? I worry about maintaining two different images for a long period of time.

@shykes I think that moving to Ubuntu after the experimental phase makes sense. After integrating these changes we could also spend some more time trying out Wolfi; my initial experiments weren't successful: #4675 (comment)

Ubuntu is probably the better choice overall, as it's an officially Nvidia-supported distro.

@TomChv
Member

TomChv commented Aug 17, 2023

Probably Ubuntu is generally better as it's an official Nvidia supported distro.

I agree with that; the only disadvantage is that it will make our pipeline a bit slower because Ubuntu is heavier than Alpine.

Comment on lines 570 to 574
"""
Sets GPU access parameters for the given container, currently works for Nvidia only.
"""
withGPU(
devices: String
Contributor

More descriptive docs would help here; I wouldn't know what I'm supposed to set devices to. Also some other basic stuff, like whether it's valid to call it multiple times (to configure multiple devices), etc.

If the answer is "it's complicated" that's alright, but then we can just have a brief description here and maybe point to our official docs once those exist :-)

withGPU(
devices: String
): Container!

Contributor

In line with our other APIs, we should also have fields like gpu (to read which GPU is configured, if any) and withoutGPU to remove the setting.

@@ -1025,9 +1027,18 @@ func (container *Container) WithPipeline(ctx context.Context, name, description
return container, nil
}

func (container *Container) WithExec(ctx context.Context, gw bkgw.Client, progSock *Socket, defaultPlatform specs.Platform, opts ContainerExecOpts) (*Container, error) { //nolint:gocyclo
type ContainerGPUOpts struct {
Devices string
Contributor

If this is a list of devices, can we make it a []string here?
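For illustration, the option struct with that change might look roughly like this (a sketch only, not the final code):

type ContainerGPUOpts struct {
	// Devices lists GPU indexes or UUIDs to expose, e.g. "0" or
	// "GPU-5d8950fe-17a6-2fa7-9baa-afa83bba0e2b".
	Devices []string
}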

ctr := c.Container().From(cudaImage)
contents, err := ctr.
// WithGPU(dagger.ContainerWithGPUOpts{Devices: "GPU-5d8950fe-17a6-2fa7-9baa-afa83bba0e2b"}).
WithGPU(dagger.ContainerWithGPUOpts{Devices: "all"}).
Contributor

I think I'd personally prefer it if all wasn't a special string and instead we had a separate API call like WithAllGPUs (or similar; it could instead be a bool option to WithGPU perhaps, though I like that less).

Just to cut back on the need for users to remember and type one-off strings like that correctly.

runArgs = append(runArgs, []string{"--gpus", "all"}...)
}
runArgs = append(runArgs, []string{
"--rm",
Contributor

nit: actually just delete --rm I think; not sure why it was commented out, but it's useful not to remove the dev engine if it dies, because you can still look at the logs if it crashed.

Contributor Author

Fixed a few weeks ago

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install nvidia-container-toolkit -y
Contributor

Curious what the new image size is for the Ubuntu image. Totally okay with the tradeoff here for the moment, just wondering what the final number actually ends up being.

contents, err := ctr.
// WithGPU(dagger.ContainerWithGPUOpts{Devices: "GPU-5d8950fe-17a6-2fa7-9baa-afa83bba0e2b"}).
WithGPU(dagger.ContainerWithGPUOpts{Devices: "all"}).
WithExec([]string{"nvidia-smi", "-L"}).
Contributor

This just lists the GPUs to make sure they are visible, right? If so, that's great for a basic test, but have you verified that programs which actually utilize the GPU work?

I'm guessing there are probably some Python ML libraries that could be run fairly easily in a WithExec; it would be good to have that test too.

Contributor Author

@matiasinsaurralde matiasinsaurralde Sep 20, 2023

@sipsma I covered this a few weeks ago with a test called TestGPUAccessWithPython. It runs some PyTorch computation using the GPU inside the Dagger container.
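For reference, a minimal sketch of such a check through WithExec (the image tag and method names are assumptions based on this PR's API discussion, not the actual TestGPUAccessWithPython code):

package main

import (
	"context"

	"dagger.io/dagger"
)

// cudaAvailable runs a tiny PyTorch probe inside a GPU-enabled container and
// returns its stdout ("True" when a CUDA device is visible).
func cudaAvailable(ctx context.Context, c *dagger.Client) (string, error) {
	return c.Container().
		From("pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime").
		WithAllGPUs().
		WithExec([]string{"python", "-c", "import torch; print(torch.cuda.is_available())"}).
		Stdout(ctx)
}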

@gerhard
Member

gerhard commented Sep 11, 2023

Anything that I can do to help move this along @matiasinsaurralde?

@matiasinsaurralde
Contributor Author

I've updated this PR to incorporate b40b4a6; this should unblock anyone who wants to test this: if service containers are disabled and GPU access is enabled, the PR works in its current form.

However, as @vito pointed out on Discord, we should aim to always support this feature, since service containers will be enabled by default (see #5557). I'm still rewriting and testing the CNI setup for Ubuntu: https://github.com/dagger/dagger/blob/main/internal/mage/util/engine.go#L242

I've also updated the base container image to Ubuntu 22.04 (it was previously 20.04) due to incompatibilities with dnsmasq CLI flags: 0c4abace76e44267aa562d711a7e05dbbdd4e553
Ubuntu is only used when GPU access is enabled, though.

@matiasinsaurralde matiasinsaurralde marked this pull request as ready for review September 20, 2023 04:36
@matiasinsaurralde
Contributor Author

matiasinsaurralde commented Sep 20, 2023

@shykes / @gerhard / @samalba
A summary of latest changes:

  • Fixed service containers when using Ubuntu (i.e. when GPU access is enabled). CNI plugin builds were failing because the compilation host was Alpine (probably related to the same musl issue we initially spotted while working on this feature).
  • Simplified WithGPU so that it takes a list of devices directly:
ctr := c.Container().From(cudaImage)
ctr.WithGPU([]string{"0", "1"}).
// Or:
ctr.WithGPU([]string{"GPU-5d8950fe-17a6-2fa7-9baa-afa83bba0e2b"}).
  • Added WithAllGPUs which ends up passing the all keyword to the Nvidia Container Toolkit:
ctr := c.Container().From(cudaImage)
ctr.WithAllGPUs()

I need to look into SDK lint issues, as I manually tweaked the Go SDK code for the past tests, and also try this with the other SDKs.
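Put together, end-to-end usage of the simplified API might look roughly like this (a hedged sketch based on the snippets above, assuming cudaImage is defined elsewhere; not the final SDK code):

// listGPUs exposes the requested devices (or all of them when none are given)
// and returns the output of nvidia-smi -L from inside the container.
func listGPUs(ctx context.Context, c *dagger.Client, devices []string) (string, error) {
	ctr := c.Container().From(cudaImage)
	if len(devices) == 0 {
		ctr = ctr.WithAllGPUs()
	} else {
		ctr = ctr.WithGPU(devices)
	}
	return ctr.
		WithExec([]string{"nvidia-smi", "-L"}).
		Stdout(ctx)
}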

@matiasinsaurralde matiasinsaurralde mentioned this pull request Sep 20, 2023
@github-actions
Contributor

github-actions bot commented Oct 5, 2023

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 7 days.

matiasinsaurralde and others added 14 commits October 27, 2023 18:11
Signed-off-by: Matias Insaurralde <matias@insaurral.de>
Signed-off-by: Matias Insaurralde <matias@insaurral.de>
Signed-off-by: Matias Insaurralde <matias@insaurral.de>
Signed-off-by: Matias Insaurralde <matias@insaurral.de>
Signed-off-by: Matias Insaurralde <matias@insaurral.de>
Signed-off-by: Matias Insaurralde <matias@insaurral.de>
…nal Dagger image with GPU support

Signed-off-by: Matias Insaurralde <matias@insaurral.de>
Significant (non merge conflict resolution) changes:
* Prefixed APIs w/ "experimental"
* Append `--gpus=all` when gpus are enabled in docker-image://
  connhelper logic
* Only publish amd64 image
* Add gpu image variant to engine:testpublish
* Run nvidia setup as commands rather than including extra script in
  image

Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Matias Insaurralde <matias@insaurral.de>
…nd GPU support is enabled

Signed-off-by: Matias Insaurralde <matias@insaurral.de>
… an arch that's not supported

Signed-off-by: Matias Insaurralde <matias@insaurral.de>
…ontaine build

Signed-off-by: Matias Insaurralde <matias@insaurral.de>
Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
… the appropriate flag is used - for local testing

Signed-off-by: Matias Insaurralde <matias@insaurral.de>
@gerhard
Member

gerhard commented Oct 27, 2023

My issues seem to be related to the pinned nvidia-driver package (515), which is too old for this.

I am unable to upgrade this package on this host:
[screenshot]

I will restart this with a different image. Yesterday's base Ubuntu 20.04 seems to have worked fine with https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation

Signed-off-by: Matias Insaurralde <matias@insaurral.de>
@matiasinsaurralde
Contributor Author

I've added a quick sample in examples/sdk/go/gpu. After running ./hack/dev bash, it should be possible to build and run it (fc77da6):

$ cd examples/sdk/go/gpu
$ go build
$ ./gpu

Expected output:

Creating new Engine session... OK!
Establishing connection to Engine... 1: connect
1: > in init
1: starting engine 
1: starting engine [0.10s]
1: starting session 
1: [0.14s] OK!
1: starting session [0.05s]
1: connect DONE
OK!

6: resolve image config for docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
6: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
6: resolve image config for docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: resolve docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04@sha256:745cc6cbd3e36d20441a4fee04b7fab8d2785584cf0d2cf667408f5f773ec9e8 
11: resolve docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04@sha256:745cc6cbd3e36d20441a4fee04b7fab8d2785584cf0d2cf667408f5f773ec9e8 [0.02s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: sha256:e5c76c058a4460c199c66189c15efca36f4a512a35fe57273a979ccca58a0176 6.716KiB / 6.716KiB [0.11s]
11: sha256:ba47ed5447555d2c58ee4005528f925fc538e093f9fe584e7f0b1e5198ea6b94 0B / 183B 
11: sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 0B / 45.66MiB 
11: sha256:ccfccf24900f621770df167b6811a582af777df108eee444c85ab551d39e9b77 0B / 7.575MiB 
11: sha256:ba47ed5447555d2c58ee4005528f925fc538e093f9fe584e7f0b1e5198ea6b94 183B / 183B [0.20s]
11: sha256:56e0351b98767487b3c411034be95479ed1710bb6be860db6df0be3a98653027 0B / 26.23MiB 
11: sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 3MiB / 45.66MiB 
11: sha256:ccfccf24900f621770df167b6811a582af777df108eee444c85ab551d39e9b77 7.575MiB / 7.575MiB [0.39s]
11: sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 23.76MiB / 45.66MiB 
11: sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 40MiB / 45.66MiB 
11: sha256:56e0351b98767487b3c411034be95479ed1710bb6be860db6df0be3a98653027 16MiB / 26.23MiB 
11: sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 45.66MiB / 45.66MiB 
11: sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 45.66MiB / 45.66MiB [0.78s]
11: sha256:56e0351b98767487b3c411034be95479ed1710bb6be860db6df0be3a98653027 26.23MiB / 26.23MiB [0.75s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: extracting sha256:56e0351b98767487b3c411034be95479ed1710bb6be860db6df0be3a98653027 

11: extracting sha256:56e0351b98767487b3c411034be95479ed1710bb6be860db6df0be3a98653027 [1.85s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: extracting sha256:ccfccf24900f621770df167b6811a582af777df108eee444c85ab551d39e9b77 
11: extracting sha256:ccfccf24900f621770df167b6811a582af777df108eee444c85ab551d39e9b77 [0.51s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: extracting sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 
11: extracting sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 [1.63s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: extracting sha256:ba47ed5447555d2c58ee4005528f925fc538e093f9fe584e7f0b1e5198ea6b94 
11: extracting sha256:ba47ed5447555d2c58ee4005528f925fc538e093f9fe584e7f0b1e5198ea6b94 [0.01s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: extracting sha256:e5c76c058a4460c199c66189c15efca36f4a512a35fe57273a979ccca58a0176 
11: extracting sha256:e5c76c058a4460c199c66189c15efca36f4a512a35fe57273a979ccca58a0176 [0.01s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

9: exec nvidia-smi -L
9: [0.29s] GPU 0: Quadro P4000 (UUID: GPU-ca2c7679-d68c-5af1-f517-f991d89438e4)
9: exec nvidia-smi -L DONE
available GPUs GPU 0: Quadro P4000 (UUID: GPU-ca2c7679-d68c-5af1-f517-f991d89438e4)

By the way, I will need to re-test with multiple GPUs after all the latest refactoring; I'll do it over the weekend.
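For reference, a minimal sketch of what such an example program could look like (assumed from the output above and the API discussed in this PR; the actual examples/sdk/go/gpu code may differ):

package main

import (
	"context"
	"fmt"
	"os"

	"dagger.io/dagger"
)

func main() {
	ctx := context.Background()

	// Connect to the (GPU-enabled) dev engine, streaming progress to stderr.
	client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// List the GPUs visible inside a CUDA base image.
	out, err := client.Container().
		From("nvidia/cuda:11.7.1-base-ubuntu20.04").
		WithAllGPUs().
		WithExec([]string{"nvidia-smi", "-L"}).
		Stdout(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println("available GPUs", out)
}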

If it's not relative, it's unlikely to work on other machines.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
@gerhard
Member

gerhard commented Oct 27, 2023

I am picking this one up now. Third time lucky 🤞

Started with an Ubuntu 20.04 server image this time, with a P4000 card.

Capturing the commands that I ran as soon as I logged in:

sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y build-essential tmux
# consider tmux-ing it...

### DOCKER
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
# Add the repository to Apt sources:
echo \
  "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker
docker run hello-world

### NVIDIA
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && \
    sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo apt-get install -y nvidia-driver-535
nvidia-smi

sudo nvidia-ctk runtime configure --runtime=docker

### LOAD NEW DRIVERS
sudo reboot
nvidia-smi

### GOLANG
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
(echo; echo 'eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"') >> /home/paperspace/.bashrc
eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
brew install gcc golang

### THIS PR
git clone https://github.com/matiasinsaurralde/dagger.git
cd dagger
git checkout gpu-access-2

And now to check that this works:

_EXPERIMENTAL_DAGGER_GPU_SUPPORT=1 ./hack/dev bash
export DAGGER_GPU_TESTS_ENABLED=1

go test -v -count 1 -timeout 1000s -run=TestGPUAccess ./core/integration
# ...
=== RUN   TestGPUAccess
=== RUN   TestGPUAccess/nvidia/cuda:11.7.1-base-ubuntu20.04
=== RUN   TestGPUAccess/nvidia/cuda:11.7.1-base-ubuntu20.04/use_specific_GPU
    gpu_test.go:132: this test requires at least 2 GPUs to run
=== RUN   TestGPUAccess/nvidia/cuda:11.7.1-base-ubi8
=== RUN   TestGPUAccess/nvidia/cuda:11.7.1-base-ubi8/use_specific_GPU
    gpu_test.go:132: this test requires at least 2 GPUs to run
=== RUN   TestGPUAccess/nvidia/cuda:11.7.1-base-centos7
=== RUN   TestGPUAccess/nvidia/cuda:11.7.1-base-centos7/use_specific_GPU
    gpu_test.go:132: this test requires at least 2 GPUs to run
--- FAIL: TestGPUAccess (26.99s)
    --- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-ubuntu20.04 (7.54s)
        --- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-ubuntu20.04/use_specific_GPU (0.00s)
    --- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-ubi8 (8.74s)
        --- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-ubi8/use_specific_GPU (0.00s)
    --- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-centos7 (10.54s)
        --- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-centos7/use_specific_GPU (0.00s)
=== RUN   TestGPUAccessWithPython
=== RUN   TestGPUAccessWithPython/pytorch_CUDA_availibility_check
=== RUN   TestGPUAccessWithPython/pytorch_tensors_sample
--- PASS: TestGPUAccessWithPython (136.94s)
    --- PASS: TestGPUAccessWithPython/pytorch_CUDA_availibility_check (133.12s)
    --- PASS: TestGPUAccessWithPython/pytorch_tensors_sample (3.68s)
FAIL
FAIL    github.com/dagger/dagger/core/integration       163.948s
FAIL

Which of the following instances did you provision in Paperspace @matiasinsaurralde for the tests with 2 GPUs?
[screenshot]

Check that the Go SDK GPU example works:

cd examples/sdk/go/gpu
go run main.go
Creating new Engine session... OK!
Establishing connection to Engine... 1: connect
1: > in init
1: starting engine
1: starting engine [0.08s]
1: starting session
1: [0.11s] OK!
1: starting session [0.03s]
1: connect DONE
OK!

6: resolve image config for docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
6: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
6: resolve image config for docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: resolve docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04@sha256:745cc6cbd3e36d20441a4fee04b7fab8d2785584cf0d2cf667408f5f773ec9e8
11: resolve docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04@sha256:745cc6cbd3e36d20441a4fee04b7fab8d2785584cf0d2cf667408f5f773ec9e8 [0.01s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

9: exec nvidia-smi -L CACHED
9: exec nvidia-smi -L CACHED
available GPUs GPU 0: Quadro P4000 (UUID: GPU-14985f0a-d0d7-2168-0baa-4a077ac0f6c1)

Member

@gerhard gerhard left a comment

This now works as advertised 🙌

Thank you to all who reviewed this PR & helped move it along; it's been a long time coming!

Thank you for sticking with it @matiasinsaurralde & seeing it through 💪

Next steps (a.k.a. follow-up PRs):

  • Add docs (as already discussed in other comments)
    • Paperspace install instructions in my last comment might come in handy
    • ✨ Zenith module? ✨
  • Add instructions for multi-GPU tests (see my last comment)
  • Ensure that creating the release works - cc @sipsma
  • Test that the released CLI & Engine image work as advertised - cc @sipsma
  • Create a Zenith module that showcases this with an LLM - cc @lukemarsden @samalba

As soon as the checks go green, this will get merged 🚀

@gerhard gerhard merged commit 8c90760 into dagger:main Oct 27, 2023
44 checks passed
schlapzz pushed a commit to schlapzz/dagger that referenced this pull request Nov 24, 2023
* shim: incorporate GPU access hooks and pass GPU visibility parameters

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* core: extend container to implement WithGPU

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* engine: extend dockerImageProvider to pass GPU support flag

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* mage: extend dev engine container with GPU support flag

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* mage: use Ubuntu when the dev engine is initialized with GPU support

Also embed helper script for setting up Nvidia Container Toolkit on the Ubuntu base image.

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* sdk: update Go SDK

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* core: add GPU access tests

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* hack: temp change to disable service containers while enabling GPU support

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* mage: bump Ubuntu version when GPU access is enabled

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* hack: always enable service containers and GPU access

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* mage: update cniPlugins to be compatible with Ubuntu

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* core: change logic around WithGPU and implement WithAllGPUs

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* core: fix EnabledGPUs usage

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* core: refactor GPU integration test with new calls

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* core: only run GPU tests when DAGGER_GPU_TESTS_ENABLED is set

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* hack: remove experimental flags from dev script

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* mage: update mage flows to support building and publishing an additional Dagger image with GPU support

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* api+publish fixups

Significant (non merge conflict resolution) changes:
* Prefixed APIs w/ "experimental"
* Append `--gpus=all` when gpus are enabled in docker-image://
  connhelper logic
* Only publish amd64 image
* Add gpu image variant to engine:testpublish
* Run nvidia setup as commands rather than including extra script in
  image

Signed-off-by: Erik Sipsma <erik@dagger.io>

* schema: fix context usage in GPU methods

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* engine: Ensure "gpu" suffix is used when pulling the engine's image and GPU support is enabled

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* mage: panic if there's an attempt to build the GPU enabled image with an arch that's not supported

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* mage: restore "no-cache" flag usage when running apk for dev engine containe build

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* Add changelog fragment

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* mage: fix engine's dev step so that it loads a GPU enabled image when the appropriate flag is used - for local testing

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* examples: add simple GPU example

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

* Use latest available dagger Go package & fix replace

If it's not relative, it's unlikely to work on other machines.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

---------

Signed-off-by: Matias Insaurralde <matias@insaurral.de>
Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Co-authored-by: Erik Sipsma <erik@dagger.io>
Co-authored-by: Gerhard Lazu <gerhard@dagger.io>
Signed-off-by: Christian Schlatter <schlatter@puzzle.ch>
@gerhard gerhard mentioned this pull request Nov 28, 2023