
Switch to GPU Operator #29

Closed
j0hnL opened this issue Mar 11, 2020 · 11 comments · Fixed by #158
Labels
enhancement New feature or request

Comments

j0hnL (Collaborator) commented Mar 11, 2020

Is your feature request related to a problem? Please describe.
We have a branch that uses nvidia-device-plugin, but not the GPU Operator.

Describe the solution you'd like
A new branch that uses the GPU Operator instead of nvidia-device-plugin.

Describe alternatives you've considered

Additional context

@j0hnL j0hnL added the enhancement New feature or request label Mar 11, 2020
@lwilson lwilson added this to the v0.1 milestone Mar 11, 2020
lwilson (Collaborator) commented Mar 12, 2020

Do we need a separate branch, or can we just update the main development line? The Operator is a replacement for the device-plugin, correct?

@lwilson lwilson removed this from the v0.1 milestone Mar 23, 2020
j0hnL (Collaborator, Author) commented May 8, 2020

As I understand it, the GPU Operator performs the following tasks:

  • detect whether a GPU exists
  • install the NVIDIA device driver on the host system
  • deploy nvidia-device-plugin
  • label nodes as NVIDIA GPU nodes
  • deploy Prometheus and Grafana for GPU monitoring

Omnia is currently missing the last two steps, labeling and monitoring, and we also use a newer upstream driver from ELRepo. @lwilson, do you think it's better to just add the labeling and monitoring ourselves rather than go backwards by using the Operator?
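For the labeling gap specifically, here is a minimal sketch of what doing it by hand would look like. The node name is an example, and the label key mirrors the nvidia.com/gpu.present label applied by gpu-feature-discovery, but treat both as assumptions for illustration only:

```bash
# Mark a GPU node so workloads can target it with a nodeSelector.
# Node name and label key are examples, not fixed Omnia conventions.
kubectl label node compute-node-01 nvidia.com/gpu.present=true

# Confirm which nodes carry the label.
kubectl get nodes -l nvidia.com/gpu.present=true
```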

lwilson (Collaborator) commented May 8, 2020

I think we should do this the other way around, @j0hnL. Instead of replicating the work NVIDIA is doing in gpu-operator, we should just leverage the operator.

@lwilson lwilson changed the title provide branch with GPU Operator Installed Switch to GPU Operator May 11, 2020
lwilson (Collaborator) commented May 11, 2020

Looks like in order to support the GPU Operator we need to add support for CRI-O and for CoreOS.

We want to do both, but perhaps we should start with CRI-O on CentOS, and then look into CoreOS.

j0hnL (Collaborator, Author) commented May 19, 2020

Since we are not currently going to switch to CRI-O, should we modify this issue, or close it and open a new one? We are currently doing everything except labeling and Grafana. It is still worthwhile to put some effort into automatic labeling and monitoring.

lwilson (Collaborator) commented Nov 10, 2020

@j0hnL looks like the NVIDIA GPU Operator now supports FOSS Kubernetes on CentOS 8: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-gpu-operator

I think it's time to switch from manually installing the GPU drivers to using the operator to handle that task.

FYI, this will reduce the number of different accelerators we have to automatically detect/enable (see #108). We may put that feature on pause while we re-evaluate the operator landscape.
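For reference, the install flow on that page boils down to a couple of Helm commands; a minimal sketch, with an arbitrary release name:

```bash
# Add NVIDIA's Helm repository and deploy the GPU Operator.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# The operator then handles driver install, nvidia-device-plugin,
# node labeling (GFD), and DCGM-based monitoring on GPU nodes.
helm install --wait gpu-operator nvidia/gpu-operator
```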

j0hnL (Collaborator, Author) commented Dec 8, 2020

Here's the updated strategy for handling MIG. It uses Helm to deploy the services supporting GPUs in the k8s cluster:

  • nvdp/nvidia-device-plugin 0.7.0
  • nvgfd/gpu-feature-discovery 0.2.0

https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html
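A minimal sketch of those two Helm deployments at the versions listed above. The repo URLs are the ones NVIDIA publishes for these charts; the release names are arbitrary:

```bash
# Add the chart repositories for the device plugin and GPU feature discovery.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
helm repo update

# Start with whole-GPU scheduling (migStrategy=none).
helm install nvdp nvdp/nvidia-device-plugin --version=0.7.0 \
  --set migStrategy=none
helm install nvgfd nvgfd/gpu-feature-discovery --version=0.2.0 \
  --set migStrategy=none
```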

lwilson (Collaborator) commented Dec 8, 2020

@j0hnL it seems like the MIG functionality is fairly static (i.e., it must be set up with nvidia-smi and is not dynamically detected by the nvidia-device-plugin). What should we use as a recommended setup?
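For context on the static part, a rough sketch of the host-side nvidia-smi setup; the GPU index and profile IDs are examples and depend on the card:

```bash
# Enable MIG mode on GPU 0; takes effect after a GPU reset or reboot.
sudo nvidia-smi -i 0 -mig 1

# Example: carve an A100 into seven 1g.5gb GPU instances (profile 19);
# -C also creates the matching compute instances.
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

# List the resulting GPU instances.
sudo nvidia-smi mig -lgi
```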

j0hnL (Collaborator, Author) commented Dec 8, 2020

@lwilson The first default is to use the entire GPU, i.e. migStrategy=none. We should also consider how to set GPUs up in the other modes. Changing the MIG strategy requires a reset of kubectl, so it would result in a node dropping out and rejoining, or a fresh start of the cluster.

lwilson (Collaborator) commented Dec 8, 2020

@j0hnL definitely agree the default should be none. It appears we can split GPUs in different ways on different nodes (and potentially different ways on different units within the same node). It could become very complicated. How would we represent that in ansible/YAML?

j0hnL (Collaborator, Author) commented Dec 8, 2020

@lwilson There are three basic options: none, single, and mixed. Defaulting to none is simple; we could provide a switch for single and create example deployments for mixed. Mixed will certainly get complex, and I'm open to any ideas for how to represent mixed modes.
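To make the switch idea concrete, a rough sketch of a single knob passed through to both charts; the variable name is hypothetical, and the per-profile resource names are what the mixed strategy exposes:

```bash
# Hypothetical cluster-wide switch: none | single | mixed.
MIG_STRATEGY=single

# Pass the same value to both charts so the device plugin and
# gpu-feature-discovery agree on how MIG devices are advertised.
helm upgrade --install nvdp nvdp/nvidia-device-plugin --version=0.7.0 \
  --set migStrategy="${MIG_STRATEGY}"
helm upgrade --install nvgfd nvgfd/gpu-feature-discovery --version=0.2.0 \
  --set migStrategy="${MIG_STRATEGY}"

# none/single keep advertising nvidia.com/gpu; mixed exposes per-profile
# resources such as nvidia.com/mig-1g.5gb that pods must request explicitly.
```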

@j0hnL j0hnL linked a pull request Dec 8, 2020 that will close this issue
@j0hnL j0hnL closed this as completed in #158 Dec 8, 2020
dweineha pushed a commit to dweineha/omnia that referenced this issue Aug 30, 2024
dweineha pushed a commit to dweineha/omnia that referenced this issue Sep 4, 2024
ghandoura pushed a commit to ghandoura/omnia that referenced this issue Sep 30, 2024