Support environments where devices come and go frequently #15
The way that I see this problem is that device reconfiguration typically happens out of band of the container lifecycle. Am I missing something where the vendor wouldn't be able to reconfigure the GPU?
The problem with that model is that it makes every tool responsible for keeping the spec current. It seems much cleaner to me to have a CDI-specific hook that knows how to query the state of the devices at the moment this information is required, instead of updating each and every tool to be CDI-aware.
I was actually thinking the same thing here. Another example may be the networking "sub-functions" we are working on, where you can create network resources on the fly over the PCI function.
It's unclear which method you're saying you are in favor of.

Having (1) does not mean that you can never have a static file -- it just means that the runtime would first try to run a vendor hook and fall back to reading the static file, instead of always just reading the static file.
@klueska While I have no particular objection to the runtime creating the CDI spec on the fly, I do have some thoughts on the subject:
The workflow for MIG is:
MIG was designed to be highly dynamic, and it's highly plausible that the set of MIG devices available on the machine can (and will) change across two different container runs. To complicate things slightly further, it should also be possible to reconfigure the available set of MIG devices from inside a container itself. We support this today in the
Yes, the idea is that the hook would query the current state of all NVIDIA devices on the system and generate a full CDI spec from this. Individual tools that operate on NVIDIA devices to change their state would not be responsible for this. At least in the NVIDIA use case, there is a single library called NVML, which the hook can use to query this information easily to generate the spec.
I would argue in the MIG case that not decoupling them increases the chance of a misconfiguration. Sometimes MIG devices will be configured via
How will container workloads react if the MIG configuration changes under them, e.g. half the compute units and memory disappear (I would have thought changes like this would require draining the node)? Or is the point to configure only the part of the GPU that hasn't been allocated yet, and somehow make sure that the k8s allocation side and the MIG configuration are in sync, so that the already allocated part of the GPU that is in use is not modified?
What's the use-case for a container manipulating the MIG configuration instead of "the system" doing it?
And if not enough of the GPU is unallocated for the requested new GPU partitioning scheme, reduce the k8s-visible GPU capacity as workloads finish [1], until eventually there's enough unallocated GPU capacity to apply the requested new partitioning scheme?

[1] may take a really long time depending on whether the workloads are jobs or services, unless workloads are forcibly evicted from the node.

PS. As there has been some discussion about eventually generalizing this and the storage interface, I wonder how similar functionality would be implemented for storage (for storage types where no media formatting is needed after "re-partitioning")...
This is unsupported. If that happens, all bets are off on what will happen inside the container. As a sysadmin, you should ensure this never happens by draining all GPU jobs before doing a MIG reconfiguration.
No. The point is just to make sure that when a container is started, the CDI spec that is queried accurately reflects the state of MIG on the underlying GPUs. Because the set of MIG devices can change throughout the lifetime of a node (likely not very frequently, but it will happen sometimes), it would be nice if the current state could be queried at the time a container is started, instead of making sure that a static CDI file is always kept in sync with the external tooling that changes the MIG config.
Just ease of administering MIG changes. It's nice to be able to deploy a pod in K8s to change the MIG devices available on a node, rather than ssh into the machine or run ansible scripts directly on the host. Please see https://github.com/NVIDIA/mig-parted/tree/master/deployments/systemd for how we do this on our internal k8s setup at NVIDIA.
I don't quite understand the question, but in general it is unsupported to change any MIG setting while there are GPU workloads running. You should only apply MIG changes after stopping the GPU device plugin (and any other pods which may be reading the MIG state, such as gpu-feature-discovery) and draining all GPU jobs.

Again, the main point of this proposal is just to have the option of generating the CDI spec dynamically instead of reading it from a static file. Its intention is to make sure that you always have the latest device info instead of stale information (in cases where the device info can change throughout the lifetime of a node). At a minimum, some vendor-specific code is going to need to run at bootup to generate the static CDI spec for that vendor's devices. The proposal here is just to allow this code to run
Thanks, now it makes more sense!
Out of curiosity, do you have some support for automatically draining and keeping (only) GPU workloads off the node while the admin deploys such a configuration pod/job? And would such a facility also make sense for CDI?
For automating the starting / stopping of services around a MIG reconfiguration we have:

However, knowing when it's OK to run that reconfiguration (i.e. after all GPU jobs have been drained) is something really hard to do generically. We have something like this internally, but it's very custom to our environment. In our internal K8s deployment, we don't let users submit pods directly to Kubernetes. Instead, we have our own custom

To reconfigure MIG, we then:
You could also imagine just shutting down the k8s-device-plugin (instead of cordoning off the entire node), but we don't have mixed workloads of CPU / GPU jobs running, so cordoning off the whole node is essentially equivalent (and a little easier).
@zvonkok pointed out that this may not be limited to devices that come and go frequently. In the case of NVIDIA GPUs, the driver libraries that are injected into a container include the driver version suffix. This means that upgrading the driver (which may not require a node reboot) will require the CDI spec to be regenerated to refer to the new driver version.
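For illustration, here is a hypothetical fragment of a generated spec whose container edits embed the driver version in library paths (the paths and version number are invented for this example). After a driver upgrade, the `470.57.02` suffix no longer matches what is installed on the host, so the spec must be regenerated even though no device came or went:

```json
{
  "kind": "nvidia.com/gpu",
  "containerEdits": {
    "mounts": [
      {
        "hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.470.57.02",
        "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.470.57.02"
      }
    ]
  }
}
```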
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity. |
NVIDIA recently released a new feature in its GPUs called MIG (short for Multi-Instance GPU).
This feature allows one to partition a GPU into a set of "MIG devices", each of which appears to the software consuming them as a mini-GPU with a fixed partition of memory and a fixed partition of compute resources.
From a user's perspective, referring to one of these MIG devices is similar to referring to a full GPU (i.e. a unique UUID can be used to specify them). However, unlike full GPUs, the creation / deletion of these MIG devices is highly dynamic.
To support these types of devices, it would be great if the runtime had the option to call an executable that generates the CDI spec for a vendor on the fly, rather than simply reading a static file on disk that represents that vendor's CDI spec.
NOTE: This doesn't necessarily require any changes to the spec itself, but rather a change in the way it is consumed.
/cc @RenaudWasTaken @mrunalp @bart0sh @kad @adrianchiris