Add experimental GPU support #5605
Conversation
Amazing PR! I left a few questions and comments and look forward to a deeper review when the PR is ready :D
We could embed it in the Dagger binary and insert it into the container on call to
Updates:
Replying here to a Discord message by @matiasinsaurralde:
Is the idea to move everything to Ubuntu when the experimental gate is removed? I worry about maintaining two different images for a long period of time.
Nice improvement, I really like it!
I left a comment related to actual usage outside of Dagger's own dogfooding :)
Careful btw, the DCO is currently failing :)
internal/mage/util/engine.go
@@ -58,6 +61,17 @@ insecure-entitlements = ["security.insecure"]
{{ end -}}
`

// nvidiaSetupHelper provides the required steps to setup nvidia-container-toolkit:
const nvidiaSetupHelper = `
What if I want to use it on my computer for my own pipeline? I won't have access to this file since it's internal to mage.
Is there another way to handle that? Or should I prepare a container on my own, the same way you do it here?
@TomChv Good point; thinking this could be moved to the shim, with the file getting initialized there if GPU access is enabled?
I think that could work yeah, give it a try
Have been exploring this a bit; would it be better to set up the Nvidia runtime at this level? https://github.com/matiasinsaurralde/dagger/blob/gpu-access-2/core/container.go#L1034
So when WithGPU is called, the Nvidia runtime is set up and we still pass the parameters to the shim (we'll always need this to signal GPU visibility to the prestart hook).
I don't see an alternative that doesn't involve installing the Nvidia runtime every time we create and start a container. We previously tried mounting the Nvidia runtime files from the host into the container (so that no installation step happens), but that turns out to be tricky if the container and the host aren't running similar environments.
On the other hand, I believe that if WithGPU introduces an additional step to run the helper script and install the Nvidia runtime, it could play well with caching: subsequent runs wouldn't install the runtime again. Makes sense?
Let me know if I misunderstood the scenario you described, still thinking about this.
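To make the caching point concrete, here is a minimal, illustrative sketch (not the PR's actual wiring, which lives in core/container.go): the toolkit setup runs as an ordinary exec step, so BuildKit caches that layer and subsequent runs of the same pipeline skip the installation. The repo channel and the $distribution value mirror the nvidiaSetupHelper commands from this PR and may need adjusting.

```go
package main

import (
	"context"
	"fmt"
	"os"

	"dagger.io/dagger"
)

// Illustrative sketch only: install nvidia-container-toolkit as a regular
// exec step. The resulting layer is cached by BuildKit, so re-running the
// pipeline does not reinstall the runtime.
const nvidiaSetup = `set -e
. /etc/os-release
distribution="$ID$VERSION_ID"
apt-get update && apt-get install -y curl gnupg
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L "https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list" \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit`

func main() {
	ctx := context.Background()
	c, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
	if err != nil {
		panic(err)
	}
	defer c.Close()

	// Second exec only checks that the hook binary landed; the install layer
	// above is what gets cached between runs.
	out, err := c.Container().
		From("ubuntu:22.04").
		WithExec([]string{"sh", "-c", nvidiaSetup}).
		WithExec([]string{"which", "nvidia-container-runtime-hook"}).
		Stdout(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```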
Have been exploring this a bit; would it be better to set up the Nvidia runtime at this level? https://github.com/matiasinsaurralde/dagger/blob/gpu-access-2/core/container.go#L1034
This would be better, but you do not know which image is used by the container. Since the nvidia runtime setup is distro-specific (as far as I understand, it works on Ubuntu), this step will mostly fail unless the base image is the correct one.
That would become tricky because some part of the setup would be up to the user and some other part would be on Dagger's side; I think it would create a lack of flexibility.
But with Zenith we might be able to solve this issue thanks to a special GPU environment (think of it as an extension) that could be loaded by the user. This would make it much easier!
@TomChv I didn't know about Zenith, just reading about it.
CUDA only supports four image types for now (Ubuntu, UBI, RockyLinux and CentOS), and the installation steps for the runtime hook are limited too: one set of instructions works for Ubuntu and the other for CentOS/RHEL: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#
Wondering if we could try to be smart here and perform a small probe to determine which of the two the container image is running, e.g. a CentOS/RHEL image will contain the dnf binary but Ubuntu won't, etc. (see the sketch below).
As an alternative, WithGPU could introduce a configuration parameter that takes the distro flavor.
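A minimal sketch of that probe idea (a hypothetical helper, assuming the Go SDK and a base image where sh is available):

```go
package main

import (
	"context"
	"fmt"
	"strings"

	"dagger.io/dagger"
)

// probeDistroFlavor is a hypothetical helper illustrating the probe idea:
// guess whether an image is RHEL-like or Debian-like by checking which
// package manager binary is present.
func probeDistroFlavor(ctx context.Context, ctr *dagger.Container) (string, error) {
	out, err := ctr.
		WithExec([]string{"sh", "-c", "command -v dnf || command -v yum || command -v apt-get"}).
		Stdout(ctx)
	if err != nil {
		return "", err
	}
	switch {
	case strings.Contains(out, "dnf"), strings.Contains(out, "yum"):
		return "rhel-like", nil
	case strings.Contains(out, "apt-get"):
		return "debian-like", nil
	}
	return "unknown", nil
}

func main() {
	ctx := context.Background()
	c, err := dagger.Connect(ctx)
	if err != nil {
		panic(err)
	}
	defer c.Close()

	flavor, err := probeDistroFlavor(ctx, c.Container().From("nvidia/cuda:11.7.1-base-centos7"))
	if err != nil {
		panic(err)
	}
	fmt.Println(flavor) // expected: rhel-like
}
```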
What if I want to use it on my computer for my own pipeline? I won't have access to this file since it's internal to mage.
I'm confused - wouldn't you be able to just export _EXPERIMENTAL_DAGGER_GPU_SUPPORT=1 and then run ./hack/dev?
@TomChv have been revisiting this. I think there's some confusion about the environment where the hook runs:
- If we change the engine's base image to Ubuntu, we only need to install the Nvidia Container Runtime at that level (the engine container).
- The shim should be able to find the path to nvidia-container-runtime-hook in the engine's container, that's all.
- We don't need to install Nvidia tooling in the Dagger-created containers; we assume the image specified by the user already contains the Nvidia dependencies. It could be a CUDA image directly (like nvidia/cuda:11.7.1-base-ubuntu20.04 or nvidia/cuda:11.7.1-base-centos7) or a custom image created by the user that's based on any of these original CUDA images. I will do some additional testing around this topic today (see the sketch after this list).
- We don't really need different behavior per distro flavor, as we control which image is used for the engine's container.
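A minimal sketch of the envisaged usage, using the API as it appears in this PR's integration test (the user-supplied CUDA image already ships the Nvidia userspace bits; Dagger only wires up GPU visibility):

```go
package main

import (
	"context"
	"fmt"
	"os"

	"dagger.io/dagger"
)

func main() {
	ctx := context.Background()
	c, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
	if err != nil {
		panic(err)
	}
	defer c.Close()

	// The user-provided CUDA image already contains the Nvidia userspace
	// dependencies; Dagger only has to expose the GPUs to it.
	out, err := c.Container().
		From("nvidia/cuda:11.7.1-base-ubuntu20.04").
		WithGPU(dagger.ContainerWithGPUOpts{Devices: "all"}). // API as it stands at this point in the PR
		WithExec([]string{"nvidia-smi", "-L"}).
		Stdout(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```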
@shykes I think that moving to Ubuntu once the experimental gate is removed makes sense. After integrating these changes we could also spend some more time trying out Wolfi; my initial experiments weren't successful: #4675 (comment) Ubuntu is probably better in general, as it's an officially Nvidia-supported distro.
I agree with that; the only disadvantage is that it will make our pipeline a bit slower, because Ubuntu is heavier than Alpine.
core/schema/container.graphqls
""" | ||
Sets GPU access parameters for the given container, currently works for Nvidia only. | ||
""" | ||
withGPU( | ||
devices: String |
More descriptive docs will help here; I wouldn't know what I'm supposed to set devices to. Also some other basic stuff, like whether it's valid to call it multiple times (to configure multiple devices), etc.
If the answer is "it's complicated" that's alright, but then we can just have a brief description here and maybe point to our official docs once those exist :-)
core/schema/container.graphqls
withGPU(
  devices: String
): Container!
In line with our other APIs, we should also have fields like gpu (to read which GPU is configured, if any) and withoutGPU (to remove the setting).
core/container.go
@@ -1025,9 +1027,18 @@ func (container *Container) WithPipeline(ctx context.Context, name, description
	return container, nil
}

func (container *Container) WithExec(ctx context.Context, gw bkgw.Client, progSock *Socket, defaultPlatform specs.Platform, opts ContainerExecOpts) (*Container, error) { //nolint:gocyclo
type ContainerGPUOpts struct {
	Devices string
If this is a list of devices, can we make it a []string here?
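A sketch of the suggested shape (illustrative only; the type name comes from the diff above):

```go
package core

// ContainerGPUOpts as suggested in review: a list of device IDs/UUIDs
// instead of a single comma-separated string.
type ContainerGPUOpts struct {
	Devices []string
}
```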
core/integration/gpu_test.go
ctr := c.Container().From(cudaImage)
contents, err := ctr.
	// WithGPU(dagger.ContainerWithGPUOpts{Devices: "GPU-5d8950fe-17a6-2fa7-9baa-afa83bba0e2b"}).
	WithGPU(dagger.ContainerWithGPUOpts{Devices: "all"}).
I think I'd personally prefer if all wasn't a special string and we instead had a separate API call like WithAllGPUs (or similar; it could instead be a bool option to WithGPU, though I like that less). Just to cut back on the need for users to remember and type one-off strings like that correctly.
internal/mage/engine.go
runArgs = append(runArgs, []string{"--gpus", "all"}...)
}
runArgs = append(runArgs, []string{
	"--rm",
nit: actually just delete --rm I think; not sure why it was commented out, but it's useful to not remove the dev engine if it dies, because you can still look at the logs if it crashed.
Fixed a few weeks ago
internal/mage/util/engine.go
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install nvidia-container-toolkit -y
Curious what the new image size is for the Ubuntu image. Totally okay with the tradeoff here for the moment, just wondering what the final number actually ends up being.
core/integration/gpu_test.go
contents, err := ctr.
	// WithGPU(dagger.ContainerWithGPUOpts{Devices: "GPU-5d8950fe-17a6-2fa7-9baa-afa83bba0e2b"}).
	WithGPU(dagger.ContainerWithGPUOpts{Devices: "all"}).
	WithExec([]string{"nvidia-smi", "-L"}).
This just lists the GPUs to make sure they are visible, right? If so, that's great for a basic test, but have you verified that programs that actually utilize the GPU work?
I'm guessing there are some Python ML libraries that could be run fairly easily in a WithExec; it would be good to have that test too.
@sipsma I covered this a few weeks ago with a test called TestGPUAccessWithPython. It runs some pytorch computation using the GPU inside the Dagger container.
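For reference, a rough sketch of what that test does (not the exact code from the PR; the base image and assertion here are assumptions):

```go
package core

import (
	"context"
	"strings"
	"testing"

	"dagger.io/dagger"
)

// Sketch of the TestGPUAccessWithPython idea: run a small pytorch check inside
// a GPU-enabled container and assert that CUDA is visible.
func TestGPUAccessWithPythonSketch(t *testing.T) {
	ctx := context.Background()
	c, err := dagger.Connect(ctx)
	if err != nil {
		t.Fatal(err)
	}
	defer c.Close()

	out, err := c.Container().
		From("pytorch/pytorch:latest"). // assumed: an image with pytorch preinstalled
		WithGPU(dagger.ContainerWithGPUOpts{Devices: "all"}).
		WithExec([]string{"python", "-c", "import torch; print(torch.cuda.is_available())"}).
		Stdout(ctx)
	if err != nil {
		t.Fatal(err)
	}
	if strings.TrimSpace(out) != "True" {
		t.Fatalf("expected CUDA to be available, got %q", out)
	}
}
```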
Anything that I can do to help move this along @matiasinsaurralde?
I've updated this PR to incorporate b40b4a6, which should unblock anyone who wants to test this: if service containers are disabled and GPU access is enabled, the PR will work in its current form. However, as @vito pointed out on Discord, we should aim to always support this feature, since service containers will be enabled by default (see #5557). I'm still rewriting and testing the CNI setup for Ubuntu: https://github.com/dagger/dagger/blob/main/internal/mage/util/engine.go#L242 I've also updated the base container image to Ubuntu 22.04 (it was previously 20.04) due to incompatibilities with
@shykes / @gerhard / @samalba
ctr := c.Container().From(cudaImage)
ctr.WithGPU([]string{"0", "1"}).
// Or:
ctr.WithGPU([]string{"GPU-5d8950fe-17a6-2fa7-9baa-afa83bba0e2b"}).
ctr := c.Container().From(cudaImage)
ctr.WithAllGPUs()
Need to look into SDK lint issues, as I manually tweaked the SDK Go code for the past tests. I also need to try this with other SDKs.
This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 7 days.
…nal Dagger image with GPU support Signed-off-by: Matias Insaurralde <matias@insaurral.de>
Significant (non merge conflict resolution) changes: * Prefixed APIs w/ "experimental" * Append `--gpus=all` when gpus are enabled in docker-image:// connhelper logic * Only publish amd64 image * Add gpu image variant to engine:testpublish * Run nvidia setup as commands rather than including extra script in image Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Matias Insaurralde <matias@insaurral.de>
…nd GPU support is enabled Signed-off-by: Matias Insaurralde <matias@insaurral.de>
… an arch that's not supported Signed-off-by: Matias Insaurralde <matias@insaurral.de>
…ontaine build Signed-off-by: Matias Insaurralde <matias@insaurral.de>
Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
… the appropriate flag is used - for local testing Signed-off-by: Matias Insaurralde <matias@insaurral.de>
My issues seem to be related to the pinned nvidia-driver package; I am unable to upgrade this package on this host. I will restart this with a different image. Yesterday's base
Have added a quick sample in
Expected output:
By the way, I will need to re-test with multiple GPUs after all the latest refactoring; will do it over the weekend.
If it's not relative, it's unlikely to work on other machines. Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
I am picking this one up now. Third time lucky 🤞 Started with an Ubuntu 20.04 server image this time, with a P4000 card. Capturing the commands that I ran as soon as I logged in:
sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y build-essential tmux
# consider tmux-ing it...
### DOCKER
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
# Add the repository to Apt sources:
echo \
"deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
"$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker
docker run hello-world
### NVIDIA
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
&& \
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo apt-get install -y nvidia-driver-535
nvidia-smi
sudo nvidia-ctk runtime configure --runtime=docker
### LOAD NEW DRIVERS
sudo reboot
nvidia-smi
### GOLANG
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
(echo; echo 'eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"') >> /home/paperspace/.bashrc
eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
brew install gcc golang
### THIS PR
git clone https://github.com/matiasinsaurralde/dagger.git
cd dagger
git checkout gpu-access-2
And now to check that this works:
_EXPERIMENTAL_DAGGER_GPU_SUPPORT=1 ./hack/dev bash
export DAGGER_GPU_TESTS_ENABLED=1
go test -v -count 1 -timeout 1000s -run=TestGPUAccess ./core/integration
# ...
=== RUN TestGPUAccess
=== RUN TestGPUAccess/nvidia/cuda:11.7.1-base-ubuntu20.04
=== RUN TestGPUAccess/nvidia/cuda:11.7.1-base-ubuntu20.04/use_specific_GPU
gpu_test.go:132: this test requires at least 2 GPUs to run
=== RUN TestGPUAccess/nvidia/cuda:11.7.1-base-ubi8
=== RUN TestGPUAccess/nvidia/cuda:11.7.1-base-ubi8/use_specific_GPU
gpu_test.go:132: this test requires at least 2 GPUs to run
=== RUN TestGPUAccess/nvidia/cuda:11.7.1-base-centos7
=== RUN TestGPUAccess/nvidia/cuda:11.7.1-base-centos7/use_specific_GPU
gpu_test.go:132: this test requires at least 2 GPUs to run
--- FAIL: TestGPUAccess (26.99s)
--- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-ubuntu20.04 (7.54s)
--- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-ubuntu20.04/use_specific_GPU (0.00s)
--- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-ubi8 (8.74s)
--- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-ubi8/use_specific_GPU (0.00s)
--- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-centos7 (10.54s)
--- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-centos7/use_specific_GPU (0.00s)
=== RUN TestGPUAccessWithPython
=== RUN TestGPUAccessWithPython/pytorch_CUDA_availibility_check
=== RUN TestGPUAccessWithPython/pytorch_tensors_sample
--- PASS: TestGPUAccessWithPython (136.94s)
--- PASS: TestGPUAccessWithPython/pytorch_CUDA_availibility_check (133.12s)
--- PASS: TestGPUAccessWithPython/pytorch_tensors_sample (3.68s)
FAIL
FAIL github.com/dagger/dagger/core/integration 163.948s
FAIL Which of the following instances did you provision in Paperspace @matiasinsaurralde for the tests with 2 GPUs? Check that the Go SDK GPU example works: cd examples/sdk/go/gpu
go run main.go
Creating new Engine session... OK!
Establishing connection to Engine... 1: connect
1: > in init
1: starting engine
1: starting engine [0.08s]
1: starting session
1: [0.11s] OK!
1: starting session [0.03s]
1: connect DONE
OK!
6: resolve image config for docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
6: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
6: resolve image config for docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: resolve docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04@sha256:745cc6cbd3e36d20441a4fee04b7fab8d2785584cf0d2cf667408f5f773ec9e8
11: resolve docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04@sha256:745cc6cbd3e36d20441a4fee04b7fab8d2785584cf0d2cf667408f5f773ec9e8 [0.01s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE
9: exec nvidia-smi -L CACHED
9: exec nvidia-smi -L CACHED
available GPUs GPU 0: Quadro P4000 (UUID: GPU-14985f0a-d0d7-2168-0baa-4a077ac0f6c1)
This now works as advertised 🙌
Thank you to all who reviewed this PR & helped move it along - it's been a long time coming!
Thank you for sticking with it @matiasinsaurralde & seeing it through 💪
Next steps (a.k.a. follow-up PRs):
- Add docs (as already discussed in other comments)
- Paperspace install instructions in my last comment might come in handy
- ✨ Zenith module? ✨
- Add instructions for multi-GPU tests (see my last comment)
- Ensure that creating the release works - cc @sipsma
- Test that the released CLI & Engine image work as advertised - cc @sipsma
- Create a Zenith module that showcases this with an LLM - cc @lukemarsden @samalba
As soon as the checks go green, this will get merged 🚀
* shim: incorporate GPU access hooks and pass GPU visibility parameters Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* core: extend container to implement WithGPU Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* engine: extend dockerImageProvider to pass GPU support flag Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: extend dev engine container with GPU support flag Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: use Ubuntu when the dev engine is initialized with GPU support. Also embed helper script for setting up Nvidia Container Toolkit on the Ubuntu base image. Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* sdk: update Go SDK Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* core: add GPU access tests Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* hack: temp change to disable service containers while enabling GPU support Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: bump Ubuntu version when GPU access is enabled Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* hack: always enable service containers and GPU access Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: update cniPlugins to be compatible with Ubuntu Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* core: change logic around WithGPU and implement WithAllGPUs Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* core: fix EnabledGPUs usage Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* core: refactor GPU integration test with new calls Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* core: only run GPU tests when DAGGER_GPU_TESTS_ENABLED is set Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* hack: remove experimental flags from dev script Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: update mage flows to support building and publishing an additional Dagger image with GPU support Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* api+publish fixups. Significant (non merge conflict resolution) changes:
  * Prefixed APIs w/ "experimental"
  * Append `--gpus=all` when gpus are enabled in docker-image:// connhelper logic
  * Only publish amd64 image
  * Add gpu image variant to engine:testpublish
  * Run nvidia setup as commands rather than including extra script in image
  Signed-off-by: Erik Sipsma <erik@dagger.io>
* schema: fix context usage in GPU methods Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* engine: Ensure "gpu" suffix is used when pulling the engine's image and GPU support is enabled Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: panic if there's an attempt to build the GPU enabled image with an arch that's not supported Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: restore "no-cache" flag usage when running apk for dev engine containe build Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* Add changelog fragment Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
* mage: fix engine's dev step so that it loads a GPU enabled image when the appropriate flag is used - for local testing Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* examples: add simple GPU example Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* Use latest available dagger Go package & fix replace. If it's not relative, it's unlikely to work on other machines. Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
---------
Signed-off-by: Matias Insaurralde <matias@insaurral.de>
Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Co-authored-by: Erik Sipsma <erik@dagger.io>
Co-authored-by: Gerhard Lazu <gerhard@dagger.io>
Signed-off-by: Christian Schlatter <schlatter@puzzle.ch>
nvidia-container-runtime-hook
The WithGPU method is implemented to allow specifying a GPU ID (or just all if you want to expose all GPUs).
/usr/bin/nvidia_helper.sh, with the following contents (still don't know the best way to ship this, and the commands look too verbose and probably unreadable if we inline them directly with WithExec, open to suggestions!): nvidia_helper.sh contents.
Ticket is #4675