Switch to GPU Operator #29
Comments
Do we need a separate branch, or can we just update the main development line? The Operator is a replacement for the device plugin, correct?
As I understand it, the GPU Operator performs the following tasks:
Omnia is currently missing the last two steps, labeling and monitoring. We also use a newer upstream driver from ELRepo. @lwilson do you think it's better to just add labeling and monitoring rather than go backwards by using the "Operator"?
I think we should do this the other way around, @j0hnL. Instead of us replicating the work NVIDIA is doing in
Looks like in order to support GPU Operator we need to:
We want to do both, but perhaps we should start with CRI-O on CentOS, and then look into CoreOS.
Since we are not currently going to switch to CRI-O, should we modify this issue, or close it and open a new one? We are currently doing everything except labeling and Grafana. It is still worthwhile to put some effort into the auto-labeling and monitoring work.
@j0hnL looks like the NVIDIA GPU Operator now supports FOSS Kubernetes on CentOS 8: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-gpu-operator I think it's time to switch from manually installing the GPU drivers to using the operator to handle that task. FYI, this will reduce the number of different accelerators we have to automatically detect/enable (see #108). We may put that feature on pause while we re-evaluate the operator landscape.
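For reference, a minimal sketch of what the switch could look like from a deployment script. It only builds the Helm command lines described in the getting-started guide linked above and does not run them. The function name `gpu_operator_helm_commands`, the `gpu-operator` namespace, and the choice to expose only the chart's `driver.enabled` value are assumptions for illustration, not something Omnia currently ships.

```python
import shlex
from typing import List

# Helm repository published by NVIDIA for the GPU Operator chart.
NVIDIA_HELM_REPO = "https://helm.ngc.nvidia.com/nvidia"


def gpu_operator_helm_commands(
    namespace: str = "gpu-operator",  # assumed namespace, adjust as needed
    driver_enabled: bool = True,
) -> List[List[str]]:
    """Return the Helm invocations (as argv lists) to install the GPU Operator.

    Set driver_enabled=False if the drivers are already installed on the
    hosts and the operator should not deploy its own driver container.
    """
    install = [
        "helm", "install", "--wait", "--generate-name",
        "--namespace", namespace, "--create-namespace",
        "nvidia/gpu-operator",
    ]
    if not driver_enabled:
        # Chart value that turns off the operator-managed driver.
        install += ["--set", "driver.enabled=false"]
    return [
        ["helm", "repo", "add", "nvidia", NVIDIA_HELM_REPO],
        ["helm", "repo", "update"],
        install,
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them, so they can be reviewed.
    for cmd in gpu_operator_helm_commands(driver_enabled=False):
        print(shlex.join(cmd))
```

Keeping the commands as argv lists means they can be passed to `subprocess.run` unchanged once reviewed, or translated into an Ansible task.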
Here's the updated strategy to handle MIG. This uses
https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html
@j0hnL it seems like the MIG functionality is fairly static (i.e., it must be set up by
@lwilson First default is to use the entire GPU or
@j0hnL definitely agree the default should be
@lwilson three basic options:
… structure (dell#29) Add support for Gaudi prometheus exporter Signed-off-by: Yupeng Zhang yupeng.zhang@intel.com fix merge conflict
Is your feature request related to a problem? Please describe.
We have a branch that uses nvidia-device-plugin, but none that uses the GPU Operator.
Describe the solution you'd like
A new branch that uses the GPU Operator instead of nvidia-device-plugin.
Describe alternatives you've considered
Additional context