
Switch to GPU Operator #29

Closed
j0hnL opened this issue Mar 11, 2020 · 11 comments · Fixed by #158
Labels
enhancement New feature or request

Comments

j0hnL (Collaborator) commented Mar 11, 2020

Is your feature request related to a problem? Please describe.
We have a branch that uses nvidia-device-plugin, but not the GPU Operator.

Describe the solution you'd like
A new branch that uses the GPU Operator instead of nvidia-device-plugin.

Describe alternatives you've considered

Additional context

@j0hnL j0hnL added the enhancement New feature or request label Mar 11, 2020
@lwilson lwilson added this to the v0.1 milestone Mar 11, 2020
lwilson (Collaborator) commented Mar 12, 2020

Do we need a separate branch, or can we just update the main development line? The Operator is a replacement for the device-plugin, correct?

@lwilson lwilson removed this from the v0.1 milestone Mar 23, 2020
j0hnL (Collaborator, Author) commented May 8, 2020

As I understand it, the GPU Operator performs the following tasks:

  • detect whether a GPU exists
  • install the NVIDIA device driver on the host system
  • deploy nvidia-device-plugin
  • label nodes as NVIDIA GPU nodes
  • deploy Prometheus and Grafana for GPU monitoring

Omnia is currently missing the last two steps, labeling and monitoring, and we also use a newer upstream driver from ELRepo. @lwilson, do you think it's better to just add the labeling and monitoring ourselves rather than go backwards by using the Operator?
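For the labeling gap specifically, here is a minimal sketch of what doing it by hand would look like. The node name is an example, and the label key mirrors the nvidia.com/gpu.present label applied by gpu-feature-discovery, but treat both as assumptions for illustration only:

```bash
# Mark a GPU node so workloads can target it with a nodeSelector.
# Node name and label key are examples, not fixed Omnia conventions.
kubectl label node compute-node-01 nvidia.com/gpu.present=true

# Confirm which nodes carry the label.
kubectl get nodes -l nvidia.com/gpu.present=true
```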

lwilson (Collaborator) commented May 8, 2020

I think we should do this the other way around, @j0hnL. Instead of replicating the work NVIDIA is doing in gpu-operator, we should just leverage the operator.

@lwilson lwilson changed the title provide branch with GPU Operator Installed Switch to GPU Operator May 11, 2020
lwilson (Collaborator) commented May 11, 2020

Looks like in order to support the GPU Operator we need to add support for CRI-O and for CoreOS.

We want to do both, but perhaps we should start with CRI-O on CentOS, and then look into CoreOS.

j0hnL (Collaborator, Author) commented May 19, 2020

Since we are not currently going to switch to CRI-O, should we modify this issue, or close it and open a new one? We are currently doing everything except labeling and Grafana. It is still worthwhile to put some effort into automatic labeling and monitoring.

lwilson (Collaborator) commented Nov 10, 2020

@j0hnL looks like the NVIDIA GPU Operator now supports FOSS Kubernetes on CentOS 8: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-gpu-operator

I think it's time to switch from manually installing the GPU drivers to using the operator to handle that task.

FYI, this will reduce the number of different accelerators we have to automatically detect/enable (see #108). We may put that feature on pause while we re-evaluate the operator landscape.
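For reference, the install flow on that page boils down to a couple of Helm commands; a minimal sketch, with an arbitrary release name:

```bash
# Add NVIDIA's Helm repository and deploy the GPU Operator.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# The operator then handles driver install, nvidia-device-plugin,
# node labeling (GFD), and DCGM-based monitoring on GPU nodes.
helm install --wait gpu-operator nvidia/gpu-operator
```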

j0hnL (Collaborator, Author) commented Dec 8, 2020

Here's the updated strategy for handling MIG. It uses Helm to deploy the services supporting GPUs in the k8s cluster:

  • nvdp/nvidia-device-plugin 0.7.0
  • nvgfd/gpu-feature-discovery 0.2.0

https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html
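A minimal sketch of those two Helm deployments at the versions listed above. The repo URLs are the ones NVIDIA publishes for these charts; the release names are arbitrary:

```bash
# Add the chart repositories for the device plugin and GPU feature discovery.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
helm repo update

# Start with whole-GPU scheduling (migStrategy=none).
helm install nvdp nvdp/nvidia-device-plugin --version=0.7.0 \
  --set migStrategy=none
helm install nvgfd nvgfd/gpu-feature-discovery --version=0.2.0 \
  --set migStrategy=none
```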

lwilson (Collaborator) commented Dec 8, 2020

@j0hnL it seems like the MIG functionality is fairly static (i.e., it must be set up with nvidia-smi and is not dynamically detected by the nvidia-device-plugin). What should we use as a recommended setup?
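For context on the static part, a rough sketch of the host-side nvidia-smi setup; the GPU index and profile IDs are examples and depend on the card:

```bash
# Enable MIG mode on GPU 0; takes effect after a GPU reset or reboot.
sudo nvidia-smi -i 0 -mig 1

# Example: carve an A100 into seven 1g.5gb GPU instances (profile 19);
# -C also creates the matching compute instances.
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

# List the resulting GPU instances.
sudo nvidia-smi mig -lgi
```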

j0hnL (Collaborator, Author) commented Dec 8, 2020

@lwilson The first default is to use the entire GPU, i.e. migStrategy=none. We should also consider how to set GPUs up in the other modes. Changing the MIG strategy requires a reset of kubectl, so it would result in a node dropping out and rejoining, or a fresh start of the cluster.

lwilson (Collaborator) commented Dec 8, 2020

@j0hnL definitely agree the default should be none. It appears we can split GPUs in different ways on different nodes (and potentially different ways on different units within the same node). It could become very complicated. How would we represent that in ansible/YAML?

j0hnL (Collaborator, Author) commented Dec 8, 2020

@lwilson There are three basic options: none, single, and mixed. Defaulting to none is simple; we could provide a switch for single and create example deployments for mixed. Mixed will certainly get complex, and I'm open to any ideas for how to represent mixed modes.
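To make the switch idea concrete, a rough sketch of a single knob passed through to both charts; the variable name is hypothetical, and the per-profile resource names are what the mixed strategy exposes:

```bash
# Hypothetical cluster-wide switch: none | single | mixed.
MIG_STRATEGY=single

# Pass the same value to both charts so the device plugin and
# gpu-feature-discovery agree on how MIG devices are advertised.
helm upgrade --install nvdp nvdp/nvidia-device-plugin --version=0.7.0 \
  --set migStrategy="${MIG_STRATEGY}"
helm upgrade --install nvgfd nvgfd/gpu-feature-discovery --version=0.2.0 \
  --set migStrategy="${MIG_STRATEGY}"

# none/single keep advertising nvidia.com/gpu; mixed exposes per-profile
# resources such as nvidia.com/mig-1g.5gb that pods must request explicitly.
```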

@j0hnL j0hnL linked a pull request Dec 8, 2020 that will close this issue
@j0hnL j0hnL closed this as completed in #158 Dec 8, 2020
dweineha pushed a commit to dweineha/omnia that referenced this issue Aug 30, 2024
dweineha pushed a commit to dweineha/omnia that referenced this issue Sep 4, 2024
ghandoura pushed a commit to ghandoura/omnia that referenced this issue Sep 30, 2024