
Support environments where devices come and go frequently #15

Closed
klueska opened this issue Aug 20, 2020 · 14 comments


klueska commented Aug 20, 2020

NVIDIA recently released a new feature in its GPUs called MIG (short for Multi-Instance GPU).

This feature allows one to partition a GPU into a set of "MIG devices", each of which appears to the software consuming them as a mini-GPU with a fixed partition of memory and a fixed partition of compute resources.

From a user's perspective, referring to one of these MIG devices is similar to referring to a full GPU (i.e. a unique UUID can be used to specify them). However, unlike full GPUs, the creation / deletion of these MIG devices is highly dynamic.

To support these types of devices, it would be great if an option existed for the runtime to call an executable to generate a vendor's CDI spec on the fly, rather than simply reading a static file on disk that represents the CDI spec for that vendor.

NOTE: This doesn't necessarily require any changes to the spec itself, but rather a change in the way it is consumed.
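
For reference, here is a minimal sketch of what a static CDI spec exposing a single MIG device might look like (the field names follow the CDI spec format as later published, and the device name and device-node paths are purely illustrative):

    {
      "cdiVersion": "0.5.0",
      "kind": "nvidia.com/gpu",
      "devices": [
        {
          "name": "mig-1g.5gb-0",
          "containerEdits": {
            "deviceNodes": [
              { "path": "/dev/nvidia0" },
              { "path": "/dev/nvidia-caps/nvidia-cap21" }
            ]
          }
        }
      ]
    }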

/cc @RenaudWasTaken @mrunalp @bart0sh @kad @adrianchiris


RenaudWasTaken commented Sep 1, 2020

The way that I see this problem is that device reconfiguration typically happens out of band of the container lifecycle.
Typically a user calls a vendor tool (in your case nvidia-smi) to reconfigure the GPU.
At this point, I wonder if it is reasonable to expect the vendor tool to just generate the CDI specification at the same time as creating the new devices (i.e. as part of the nvidia-smi call) rather than adding a new hook in the runtimes.

Am I missing something where the vendor wouldn't be able to reconfigure the GPU?


klueska commented Sep 1, 2020

The problem with that model is that nvidia-smi is not the only tool that can be used to configure these MIG devices on a GPU. Every such tool would need to be aware of the fact that it may need to interface with CDI and thus generate this file (and make sure that it's not stepping on any other tool's toes).

It seems much cleaner to me to have a CDI-specific hook that knows how to query the state of the devices at the moment this information is required, instead of updating each and every tool to be CDI aware.


adrianchiris commented Sep 1, 2020

The way that I see this problem is that device reconfiguration typically happens out of band of the container lifecycle.
Typically a user calls a vendor tool (in your case nvidia-smi) to reconfigure the GPU.
At this point, I wonder if it is reasonable to expect the vendor tool to just generate the CDI specification at the same time as creating the new devices (i.e. as part of the nvidia-smi call) rather than adding a new hook in the runtimes.

Am I missing something where the vendor wouldn't be able to reconfigure the GPU?

Was actually thinking the same thing here.

Another example may be the networking "sub-functions" we are working on, where you can create a network resource on the fly over the PCI function.
This can be created together with its CDI spec out of band, or, if k8s is used, by a device plugin.


klueska commented Sep 1, 2020

@adrianchiris

It's unclear which approach you are in favor of:

  1. Having a vendor-specific hook to generate the CDI spec on the fly (over stdout) each time a container runtime asks for it
  2. Updating all tools that create dynamic resources to be "CDI aware" so they can regenerate CDI files every time something is reconfigured.

Having (1) does not mean that you can never have a static file -- it just means that the logic in the runtime would look like:

if (vendor.CDIFile.IsExecutable()) {
    spec = vendor.CDIFile.Execute()
} else {
    spec = vendor.CDIFile.Read()
}

Instead of always just having:

    spec = vendor.CDIFile.Read()
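
For illustration, here is a rough Go sketch of what that runtime-side logic could look like (the package and helper names are made up for this sketch and not taken from any actual CDI implementation): if a vendor's CDI file is executable, run it and take its stdout as the spec; otherwise read the file as a static spec.

    package cdispec

    import (
        "os"
        "os/exec"
    )

    // loadVendorSpec is a hypothetical helper: it returns the raw CDI spec
    // bytes for a vendor's CDI file. If the file has any execute bit set,
    // the file is run and its stdout is captured as the spec; otherwise the
    // file contents are read directly from disk.
    func loadVendorSpec(path string) ([]byte, error) {
        info, err := os.Stat(path)
        if err != nil {
            return nil, err
        }
        if info.Mode().Perm()&0o111 != 0 {
            // Executable: generate the spec on the fly over stdout.
            return exec.Command(path).Output()
        }
        // Regular file: read the static spec as-is.
        return os.ReadFile(path)
    }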

adrianchiris commented

@klueska
In my previous comment, when I wrote "on the fly", I meant when a resource is requested instead of it existing on server boot.
At this point, I am leaning towards the static approach (having the device and its CDI spec created prior to container runtime execution).

While I have no particular objection to the runtime creating the CDI spec on the fly, I do have some thoughts on the subject:

  • Is it a problem that justifies (another) hook in the runtime? Can you provide some more info on what you are facing with MIG devices?
  • Are all devices (of the same vendor/type, of course) created by the various tools equal? Is it a "one CDI-generating hook to rule them all"?
  • By allowing this decoupling, are we increasing the chance of misconfiguration?


klueska commented Sep 7, 2020

Can you provide some more info on what you are facing with MIG devices?

The workflow for MIG is:

  1. Machine boots up
  2. User puts the GPU into a special mode called MIG mode
  3. User dynamically partitions the GPU into a set of MIG devices of different sizes
  4. User runs some workloads on the MIG partitions inside containers
  5. User dynamically partitions the GPU into a different set of MIG devices of different sizes
  6. User runs some workloads on these new MIG partitions inside containers

MIG was designed to be highly dynamic, and it's highly plausible that the set of MIG devices available on the machine can (and will) change across two different container runs.

To complicate things slightly further, it should also be possible to reconfigure the available set of MIG devices from inside a container itself. We support this today in the nvidia-docker2 stack, and it would be unfortunate to lose this functionality once CDI is available (or have to continue to work around it instead of having direct support for it).

Are all devices (of the same vendor/type, of course) created by the various tools equal? Is it a "one CDI-generating hook to rule them all"?

Yes, the idea is that the hook would query the current state of all NVIDIA devices on the system and generate a full CDI spec from this. Individual tools that operate on NVIDIA devices to change their state would not be responsible for this. At least in the NVIDIA use case, there is a single library called NVML, which the hook can use to query this information easily to generate the spec.
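
As a rough sketch of the kind of enumeration such a hook could do (this uses the go-nvml bindings and is not the actual hook; it just prints UUIDs rather than emitting a full spec):

    package main

    import (
        "fmt"
        "log"

        "github.com/NVIDIA/go-nvml/pkg/nvml"
    )

    func main() {
        if ret := nvml.Init(); ret != nvml.SUCCESS {
            log.Fatalf("failed to initialize NVML: %v", nvml.ErrorString(ret))
        }
        defer nvml.Shutdown()

        count, ret := nvml.DeviceGetCount()
        if ret != nvml.SUCCESS {
            log.Fatalf("failed to get device count: %v", nvml.ErrorString(ret))
        }
        for i := 0; i < count; i++ {
            device, ret := nvml.DeviceGetHandleByIndex(i)
            if ret != nvml.SUCCESS {
                continue
            }
            uuid, _ := device.GetUUID()
            fmt.Printf("GPU %d: %s\n", i, uuid)

            // Walk any MIG devices currently configured on this GPU.
            migCount, ret := device.GetMaxMigDeviceCount()
            if ret != nvml.SUCCESS {
                continue
            }
            for j := 0; j < migCount; j++ {
                mig, ret := device.GetMigDeviceHandleByIndex(j)
                if ret != nvml.SUCCESS {
                    continue // this MIG slot is not populated
                }
                migUUID, _ := mig.GetUUID()
                fmt.Printf("  MIG device %d: %s\n", j, migUUID)
            }
        }
    }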

By allowing this decoupling, are we increasing the chance of misconfiguration?

I would argue that in the MIG case, not decoupling them increases the chance of a misconfiguration. Sometimes MIG devices will be configured via nvidia-smi, sometimes they will be configured by other NVIDIA tools using NVML under the hood, and sometimes by third-party tools that call into NVML themselves. Unless all of these tools also know that they are supposed to (and how to) regenerate the CDI spec (for all NVIDIA devices, not just the MIG devices), the CDI file will quickly get out of sync with what is actually reflected in the hardware.

eero-t commented Feb 26, 2021

MIG was designed to be highly dynamic, and it's highly plausible that the set of MIG devices available on the machine can (and will) change across two different container runs.

How will container workloads react if the MIG configuration changes under them, e.g. half of the compute units and memory disappears? (I would have thought changes like this would require draining the node.)

Or is the point of this to configure only the part of the GPU that hasn't been allocated yet, and somehow make sure that the k8s allocation side and the MIG configuration are in sync, so that the part of the GPU already in use is not modified?

To complicate things slightly further, it should also be possible to reconfigure the available set of MIG devices from inside a container itself. We support this today in the nvidia-docker2 stack, and it would be unfortunate to lose this functionality once CDI is available (or have to continue to work around it instead of having direct support for it).

What's the use-case for container manipulating the MIG configuration instead of "the system" doing it?

eero-t commented Feb 26, 2021

Or is the point of this to configure only the part of the GPU that hasn't been allocated yet, and somehow make sure that the k8s allocation side and the MIG configuration are in sync, so that the part of the GPU already in use is not modified?

And if not enough of the GPU is unallocated for the requested new GPU partitioning scheme, reduce the k8s-visible GPU capacity as workloads finish [1], until eventually there's enough unallocated GPU capacity to apply the requested new partitioning scheme?

[1] may take a really long time depending on whether workloads are jobs or services, unless workloads are forcibly evicted from the node.

PS. As there has been some discussion about eventually generalizing this and the storage interface, I wonder how similar functionality would be implemented for storage (for storage types where no media formatting is needed after "re-partitioning")...


klueska commented Feb 26, 2021

How will container workloads react if the MIG configuration changes under them, e.g. half of the compute units and memory disappears? (I would have thought changes like this would require draining the node.)

This is unsupported. If that happens, all bets are off on what will happen inside the container. As a sysadmin, you should ensure this never happens by draining all GPU jobs before doing a MIG reconfiguration.

Or is the point of this to configure only the part of the GPU that hasn't been allocated yet, and somehow make sure that the k8s allocation side and the MIG configuration are in sync, so that the part of the GPU already in use is not modified?

No. The point is just to make sure that when a container is started, the CDI spec that is queried accurately reflects the state of MIG on the underlying GPUs. Because the set of MIG devices can change throughout the lifetime of a node (it likely won't change very frequently, but it will happen sometimes), it would be nice if the current state could be queried at the time a container is started, instead of making sure that a static CDI file is always kept in sync with external tooling that changes the MIG config.

What's the use-case for container manipulating the MIG configuration instead of "the system" doing it?

Just ease of administering the MIG changes. It's nice to be able to deploy a pod in K8s to change the MIG devices available on a node, rather than ssh into the machine or run ansible scripts directly on the host. Please see https://github.com/NVIDIA/mig-parted/tree/master/deployments/systemd for how we do this on our internal k8s setup at NVIDIA.

And if not enough of the GPU is unallocated for the requested new GPU partitioning scheme, reduce the k8s-visible GPU capacity as workloads finish [1], until eventually there's enough unallocated GPU capacity to apply the requested new partitioning scheme?

I don't quite understand the question, but in general it is unsupported to change any MIG setting as long as there are GPU workloads running. You should only apply any MIG changes after stopping the GPU device plugin (and any other pods which may be reading the MIG state, such as gpu-feature-discovery) and draining all GPU jobs.

Again, the main point of this proposal is just to have the option of having the CDI spec generated dynamically instead of read from a static file. Its intention is just to make sure that you always have the latest device info instead of stale information (in cases where the device info can change throughout the lifetime of a node). At a minimum, some vendor-specific code is going to need to run at bootup to generate the static CDI spec for that vendor's devices. The proposal here is just to allow this code to run just-in-time as the CDI spec is being looked up, instead of relying on it being generated only once at bootup (or at device reconfiguration, which can happen from many different sources).

eero-t commented Feb 26, 2021

As a sysadmin, you should ensure this never happens by draining all GPU jobs before doing a MIG reconfiguration.

Thanks, now it makes more sense!

It's nice to be able to deploy a pod in K8s to change the MIG devices available on a node, rather than ssh into the machine or run ansible scripts directly on the host.

Out of curiosity, do you have some support for automatically draining & keeping (only) GPU workloads out of the node while the admin deploys such a configuration pod/job? And would such a facility also make sense for CDI?


klueska commented Feb 26, 2021

For automating starting / stopping of services around a MIG reconfiguration we have:
https://github.com/NVIDIA/mig-parted/tree/master/deployments/systemd

However, knowing when it's OK to run that reconfiguration (i.e. after all GPU jobs have been drained) is something really hard to do generically. We have something like this internally, but it's very custom to our environment.

In our internal K8s deployment, we don't let users submit pods directly to Kubernetes. Instead, we have our own custom job spec that users use to submit jobs to the system. We then generate the pod spec from this and make sure to put any jobs that ask for GPUs in a specific namespace.

To reconfigure MIG, we then:

  1. Cordon the node so no more jobs land on it
  2. Monitor the node and wait for all jobs in the GPU-job namespace to finish
  3. Do the MIG reconfiguration
  4. Uncordon the node

You could also imagine just shutting down the k8s-device-plugin (instead of cordoning off the entire node), but we don't have mixed workloads of CPU / GPU jobs running, so cordoning off the whole node is essentially equivalent (and a little easier).
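
As a rough illustration of steps 1 and 2 above (this is not our actual tooling; the node name, namespace, and kubeconfig path below are placeholders), the cordon-and-wait part could be done with client-go along these lines:

    package main

    import (
        "context"
        "fmt"
        "log"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Placeholder node name, namespace, and kubeconfig path.
        nodeName := "gpu-node-0"
        gpuNamespace := "gpu-jobs"

        config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(config)
        if err != nil {
            log.Fatal(err)
        }
        ctx := context.Background()

        // Step 1: cordon the node so no new jobs land on it.
        patch := []byte(`{"spec":{"unschedulable":true}}`)
        if _, err := client.CoreV1().Nodes().Patch(ctx, nodeName,
            types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
            log.Fatal(err)
        }

        // Step 2: count GPU pods still on the node; reconfigure MIG once
        // this reaches zero (polling/waiting is omitted from this sketch).
        pods, err := client.CoreV1().Pods(gpuNamespace).List(ctx, metav1.ListOptions{
            FieldSelector: "spec.nodeName=" + nodeName,
        })
        if err != nil {
            log.Fatal(err)
        }
        remaining := 0
        for _, pod := range pods.Items {
            if pod.Status.Phase == corev1.PodRunning || pod.Status.Phase == corev1.PodPending {
                remaining++
            }
        }
        fmt.Printf("%d GPU pods still on %s\n", remaining, nodeName)
    }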


elezar commented Jul 7, 2022

@zvonkok pointed out that this may not be limited to devices that come and go frequently. In the case of NVIDIA GPUs, the driver files that are injected into a container include the driver version as a suffix. This means that upgrading the driver (which may not require a node reboot) will require the CDI spec to be regenerated to refer to the new driver version.
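
To illustrate (the library path and driver version below are made up for this example), a generated NVIDIA CDI spec typically bind-mounts driver libraries whose file names carry the driver version, so a driver upgrade changes the paths the spec has to reference:

    {
      "containerEdits": {
        "mounts": [
          {
            "hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.535.54.03",
            "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.535.54.03"
          }
        ]
      }
    }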


This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.


This issue was automatically closed due to inactivity.

github-actions bot closed this as not planned Jun 29, 2024