Inject Nvidia GPUs using volume-mounts to isolate them to assigned pods #3718
Conversation
@arnaldo2792 Would really appreciate your review or someone else's from the team who has worked on this :)
On another thought, since Bottlerocket is not EKS exclusive, would it make sense to have these as settings so users can configure them according to their needs?
Thanks for the great catch @chiragjn! (yet another, nice!). As I mentioned elsewhere, we are still thinking about what changes we want to ask for in this PR since, as it is, it could break the ECS variant. Once again, thanks for the contribution, and I'll reply back soon with the suggestions!
FWIW, I was able to build a custom AMI with these changes for the 1.28-nvidia variant fairly easily. Maybe, until the contributing team decides on how to expose these in the Settings API; AFAIK the device plugin is K8s only, so the config can be safely changed. Not relevant to this discussion directly:
re: suggested implementation `{{#if K8s}}`: Unfortunately, we don't provide a handlebars helper to evaluate whether the current host is k8s or ECS. The solution that aligns best with out-of-tree builds (see #2669) is to have two different sub-packages for `nvidia-container-toolkit`.
Thanks for the heads up! I read the threads and I'll contact the author of this comment since we should align with what they are planning to do to fix this problem.
Just curious if providing a handlebars helper for the variant name being built might help, and what kind of effort that might take. I'll be happy to give it a shot (with some guidance) if that is a good approach :)
@@ -1,3 +1,6 @@
accept-nvidia-visible-devices-as-volume-mounts = true
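For context, a minimal sketch of how these toggles sit in /etc/nvidia-container-runtime/config.toml, using the option names documented for the NVIDIA container toolkit. Only the first line appears in this hunk; the second is implied by the PR description ("NVIDIA_VISIBLE_DEVICES as env var will still be considered for privileged pods"), and the surrounding Bottlerocket-specific keys are omitted here:

```toml
# Sketch only: the two relevant top-level toggles; other sections of the file omitted.

# Discover the GPUs assigned by the device plugin from volume mounts under
# /var/run/nvidia-container-devices instead of trusting the container's env var.
accept-nvidia-visible-devices-as-volume-mounts = true

# Ignore NVIDIA_VISIBLE_DEVICES coming from unprivileged containers, so an image
# that bakes in NVIDIA_VISIBLE_DEVICES=all cannot grab every GPU on the node.
accept-nvidia-visible-devices-envvar-when-unprivileged = false
```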
Hey @chiragjn, we had an internal discussion about the changes in this PR. I'll provide diffs of examples of how to accomplish what we need and explanations of why they are needed to try to ease the learning curve.
The first thing is to have two configuration files:
- `nvidia-container-toolkit-config-ecs.toml`: this should be the file as it is today
- `nvidia-container-toolkit-config-k8s.toml`: this is the new file with the configurations as you have them here.
Once you have the two files, you need to update the `nvidia-container-toolkit.spec` file to add both sources as follows:
Source0: https://%{goimport}/archive/v%{gover}/nvidia-container-toolkit-%{gover}.tar.gz
-Source1: nvidia-container-toolkit-config.toml
+Source1: nvidia-container-toolkit-config-k8s.toml
Source2: nvidia-container-toolkit-tmpfiles.conf
Source3: nvidia-oci-hooks-json
Source4: nvidia-gpu-devices.rules
+Source5: nvidia-container-toolkit-config-ecs.toml
And install them like this under the `%install` section in the spec file:
-install -m 0644 %{S:1} %{buildroot}%{_cross_factorydir}/etc/nvidia-container-runtime/config.toml
+install -m 0644 %{S:1} %{buildroot}%{_cross_factorydir}/etc/nvidia-container-runtime/
+install -m 0644 %{S:5} %{buildroot}%{_cross_factorydir}/etc/nvidia-container-runtime/
Having two files lets us avoid conditionally generating one or the other based on the variant information. But we still need a way to include either one; we will accomplish this by creating two sub-packages of `nvidia-container-toolkit`. You can do this by adding something similar to the following lines right after the last `%description` section in the spec:
%description
%{summary}.
+%package ecs
+Summary: Files specific for the ECS variants
+Requires: %{name}
+
+%description ecs
+%{summary}.
+
+%package k8s
+Summary: Files specific for the Kubernetes variants
+Requires: %{name}
+
+%description k8s
+%{summary}.
+
%prep
This will create the two subpackages: `nvidia-container-toolkit-ecs` and `nvidia-container-toolkit-k8s`. Notice the `Requires: %{name}` snippet in the diff; this will guarantee that `nvidia-container-toolkit` is installed alongside `nvidia-container-toolkit-<subpackage>`. After this, you need to include the correct file per package in the `%files` section:
%{_cross_templatedir}/nvidia-oci-hooks-json
-%{_cross_factorydir}/etc/nvidia-container-runtime/config.toml
%{_cross_tmpfilesdir}/nvidia-container-toolkit.conf
%{_cross_udevrulesdir}/90-nvidia-gpu-devices.rules
+
+%files ecs
+%{_cross_factorydir}/etc/nvidia-container-runtime/nvidia-container-toolkit-config-ecs.toml
+
+%files k8s
+%{_cross_factorydir}/etc/nvidia-container-runtime/nvidia-container-toolkit-config-k8s.toml
The last change in the spec file is to create the actual configuration file that will be used by `nvidia-container-runtime-hook`. In Bottlerocket, we use the "factory" feature of `tmpfilesd` to create certain files at `/etc`. For `/etc/nvidia-container-runtime/config.toml`, the source of the factory is the file at `%{_cross_factorydir}/etc/nvidia-container-runtime/config.toml`. Thus, to provide the file for the factory you will create a symlink at this location that points to the correct configuration file per variant. You can do this in a `%post` install script for each sub-package as follows:
+%post ecs -p <lua>
+posix.link("%{_cross_factorydir}/etc/nvidia-container-runtime/nvidia-container-toolkit-config-ecs.toml", "%{_cross_factorydir}/etc/nvidia-container-runtime/config.toml")
+
+%post k8s -p <lua>
+posix.link("%{_cross_factorydir}/etc/nvidia-container-runtime/nvidia-container-toolkit-config-k8s.toml", "%{_cross_factorydir}/etc/nvidia-container-runtime/config.toml")
+
The `%post` scripts should be placed in between the `%install` and `%files` sections.
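For readers unfamiliar with the factory mechanism: a systemd-tmpfiles `C` entry whose source argument is omitted copies the file from the matching path under /usr/share/factory. A rough sketch of what such an entry looks like; this is illustrative, and the actual contents of nvidia-container-toolkit-tmpfiles.conf in the tree may differ:

```
# tmpfiles.d sketch (assumed, not the literal Bottlerocket file).
# 'C' copies the file into place at boot if it does not exist yet; with no
# source argument, systemd-tmpfiles copies it from /usr/share/factory/<path>.
C /etc/nvidia-container-runtime/config.toml - - - -
```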
The last thing to glue the changes together is to update each `*-nvidia` variant to include the variant-specific `nvidia-container-toolkit` sub-package. You can do this by updating the `variants/*-nvidia/Cargo.toml` file as follows:
included-packages = [
"ecs-agent",
# NVIDIA support
"ecs-gpu-init",
- "nvidia-container-toolkit",
+ "nvidia-container-toolkit-ecs",
"kmod-6.1-nvidia-tesla-535",
]
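The Kubernetes variants get the analogous change. A sketch only; the exact package list (including the kmod package name) differs per variant:

```
 included-packages = [
     # NVIDIA support
-    "nvidia-container-toolkit",
+    "nvidia-container-toolkit-k8s",
 ]
```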
And that's it! If all this is too overwhelming, or you don't have the bandwidth to work on it, please let me know and I can take over your changes and drive the PR to completion 👍.
Thank you for the extremely precise diff, you practically did all the work 😅
Anyway, I have made the changes, was able to build an AMI and test it out, and can confirm they work as expected.
I also confirmed the changes using an admin container:
bash-5.1# pwd
/x86_64-bottlerocket-linux-gnu/sys-root/usr/share/factory/etc/nvidia-container-runtime
bash-5.1# ls -li
total 8
2325 -rw-r--r--. 2 root root 237 Jan 30 18:53 config.toml
2325 -rw-r--r--. 2 root root 237 Jan 30 18:53 nvidia-container-toolkit-config-k8s.toml
Just two questions:
- Do the sub-packages need to appear in Cargo.lock too? I tried `cargo generate-lockfile` but nothing changed.
- Should `config.toml` be a hard link or a soft link? If I read the docs correctly, `posix.link` by default creates a hard link.
I forgot to answer your questions, sorry!
- No, they don't; `included-packages` is a field we use in the build system.
- A hard link should be OK.
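For completeness, a small sketch of the difference in an rpm `%post` Lua scriptlet. The hard-link call mirrors the diff above; the symlink alternative is only illustrative, and the availability of `posix.symlink` in rpm's embedded Lua is an assumption worth verifying:

```lua
-- Hard link: config.toml and the variant-specific file share the same inode,
-- which matches the ls -li output above (both entries show inode 2325).
posix.link("%{_cross_factorydir}/etc/nvidia-container-runtime/nvidia-container-toolkit-config-k8s.toml",
           "%{_cross_factorydir}/etc/nvidia-container-runtime/config.toml")

-- If a symbolic link were preferred instead (assumption: posix.symlink is
-- available in the rpm build environment):
-- posix.symlink("nvidia-container-toolkit-config-k8s.toml",
--               "%{_cross_factorydir}/etc/nvidia-container-runtime/config.toml")
```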
Changes look great 🎉! Just one request: could you please squash your commits? We don't squash them when we merge the PR 😅. And FYI, it seems like the key you used to sign the last commits wasn't uploaded to GitHub, so the commits show as Unverified.
On my end, I'm just missing testing the Kubernetes variants. I'll do that first thing tomorrow. I already validated the ECS variants and things look good.
Force-pushed from 8d1adb6 to 6150a72.
No worries, I have squashed and rebased :)
@@ -30,6 +30,7 @@ included-packages = [
  # NVIDIA support
  "ecs-gpu-init",
  "nvidia-container-toolkit",
+ "nvidia-container-toolkit-ecs",
nit: I prefer to list only the most specific set of leaf packages here - you could drop "nvidia-container-toolkit" everywhere, since the `-k8s` or `-ecs` subpackage will pull that in by way of dependencies.
Done :)
Thanks for the review!
@chiragjn - thanks for the contribution! It looks pretty much ready to me, with a couple of minor nits that you could address if you have the cycles.
I confirmed the PR fixes the problem, and I can't get all the devices. Tested on the following variants:
- k8s-1.23
- k8s-1.24
- k8s-1.25
- k8s-1.26
- k8s-1.27
- k8s-1.28
- k8s-1.29
Inject Nvidia GPUs using volume-mounts to isolate them to assigned pods

Create separate container toolkit config for ECS and K8s
Apply suggestions from code review
Drop `nvidia-container-toolkit` because it is now a transitive dependency

Motivation
---
When using nodes with multiple gpus (e.g. g4dn.12xlarge), the default way of using nvidia-container-toolkit and nvidia-device-plugin leads to a problem where gpus are not exclusively isolated to the pods they are assigned to. This is because nvidia-device-plugin by default looks up NVIDIA_VISIBLE_DEVICES to decide which gpu devices to pass on to nvidia-container-toolkit / nvidia-container-cli to inject in the pod. Most nvidia cuda base images have env NVIDIA_VISIBLE_DEVICES=all baked into them, which means a pod with such an image will get access to all gpu cards instead of exclusively getting the number requested in resources.limits, yet the kubelet will only report the number in resources.limits as allocated.

E.g. on a 4 GPU node, Pod 1 requests (with NVIDIA_VISIBLE_DEVICES=all in image)

```
resources:
  limits:
    nvidia.com/gpu: 1
```

Pod 2 requests (with NVIDIA_VISIBLE_DEVICES=all in image)

```
resources:
  limits:
    nvidia.com/gpu: 2
```

In this scenario, both pods get access to all 4 cards and the node will report

```
nvidia.com/gpu Allocated: 3 Free: 1
```

This isn't good because the deployed non-privileged pods are unaware of each other and expect exclusive access to requested cards.
Force-pushed from 428fe9f to 911775f.
🚀
Issue number: NA
Motivation
When using nodes with multiple gpus (e.g. g4dn.12xlarge), the default way of using nvidia-container-toolkit and nvidia-device-plugin leads to a problem where gpus are not exclusively isolated to the pods they are assigned to.
This is because nvidia-device-plugin by default looks up `NVIDIA_VISIBLE_DEVICES` to decide which gpu devices to pass on to nvidia-container-toolkit / nvidia-container-cli to inject in the pod. Most nvidia cuda base images have env `NVIDIA_VISIBLE_DEVICES=all` baked into them, which means a pod with such an image will get access to all gpu cards instead of exclusively getting the number requested in `resources.limits`, yet the kubelet will only report the number in `resources.limits` as allocated.
E.g. on a 4 GPU node:
- Pod 1 requests `nvidia.com/gpu: 1` in `resources.limits` (with `NVIDIA_VISIBLE_DEVICES=all` in the image)
- Pod 2 requests `nvidia.com/gpu: 2` in `resources.limits` (with `NVIDIA_VISIBLE_DEVICES=all` in the image)

In this scenario, both pods get access to all 4 cards and the node will report `nvidia.com/gpu` Allocated: 3, Free: 1.
This isn't good because the deployed non-privileged pods are unaware of each other and expect exclusive access to requested cards.
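For reference, the two requests as they appear in the pod specs (reproduced from the squashed commit message above):

```yaml
# Pod 1 (image bakes in NVIDIA_VISIBLE_DEVICES=all)
resources:
  limits:
    nvidia.com/gpu: 1
```

```yaml
# Pod 2 (image bakes in NVIDIA_VISIBLE_DEVICES=all)
resources:
  limits:
    nvidia.com/gpu: 2
```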
References:
Description of changes:
We follow the guidelines in the above docs.
- `--device-list-strategy volume-mounts` to pass allocated devices as volume mounts instead of bypassing and relying on the value of `NVIDIA_VISIBLE_DEVICES` (see the device-plugin sketch after the testing notes below)
- `NVIDIA_VISIBLE_DEVICES` as an env var will still be considered for privileged pods.

Testing done:
I would need help/advice testing this out on actual Bottlerocket nodes.
EDIT: I built a custom AMI for EKS 1.28 following the docs and was able to confirm these changes work as expected.
My employer has been running this config on AL2 nodes without any issues, ensuring correct bin packing per node. I am attaching tests we did for different scenarios.
Tests on AL2 (test matrix for different `NVIDIA_VISIBLE_DEVICES` values attached)
As you can see, in non-privileged mode `NVIDIA_VISIBLE_DEVICES` will be ignored entirely.
Some of the info in the Google Docs linked above is outdated, so it doesn't align with the above results.
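Relating to the `--device-list-strategy volume-mounts` point in the description of changes: for anyone running the upstream nvidia-device-plugin themselves, the flag goes on the plugin's DaemonSet roughly like this. This is a sketch against the upstream plugin; the container name, image tag, and mounts are illustrative and not Bottlerocket's actual wiring:

```yaml
# Sketch: nvidia-device-plugin DaemonSet container spec (names/versions illustrative).
# --device-list-strategy=volume-mounts makes the plugin pass allocated GPUs to the
# runtime as volume mounts under /var/run/nvidia-container-devices, pairing with
# accept-nvidia-visible-devices-as-volume-mounts = true in the toolkit config.
containers:
  - name: nvidia-device-plugin-ctr
    image: nvcr.io/nvidia/k8s-device-plugin:v0.14.3   # illustrative tag
    args:
      - "--device-list-strategy=volume-mounts"
    volumeMounts:
      - name: device-plugin
        mountPath: /var/lib/kubelet/device-plugins
```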
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.