# Level 2: GPU Operator Configuration

Now that we have a new GPU node, we will configure our environment to make use of it.

## OpenShift Operators

There are two Operators that help us discover and configure our new GPU node.

1. When new hardware is added to the system, the **Node Feature Discovery Operator** (NFD) automatically detects the new hardware features and system configuration.
2. The **NVIDIA GPU Operator** configures the drivers, plugins and adapters for the GPU.

We can see these Operators by browsing to OpenShift Console > Operators > Installed Operators.

Select the NFD Operator:

![images/openshift-nfd.png](images/openshift-nfd.png)

Then select the NVIDIA GPU Operator:

![images/nvidia-gpu-operator.png](images/nvidia-gpu-operator.png)


The environment already has the two main configuration resources deployed for these Operators:

- The [NodeFeatureDiscovery](https://github.com/eformat/rhoai-policy-collection/blob/main/gitops/applications/gpu/base/nfd-cr.yaml) instance
- The [NVIDIA GPU ClusterPolicy](https://github.com/eformat/rhoai-policy-collection/blob/main/gitops/applications/gpu/base/gpu-cluster-policy.yaml) instance

When a new machine is deployed, the nfd-worker DaemonSet starts a pod on the new node.

Log in to OpenShift.

In [1]:
!oc login -u admin -p ${ADMIN_PASSWORD} --server=https://api.sno.${BASE_DOMAIN}:6443 --insecure-skip-tls-verify


Login successful.

You have access to 106 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "ai-roadshow".


Check the NFD worker pods:

In [2]:
!oc -n openshift-nfd get pods -l app=nfd-worker -o wide

NAME               READY   STATUS    RESTARTS        AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
nfd-worker-8vrhf   1/1     Running   0               5m51s   10.0.15.75    ip-10-0-15-75.us-east-2.compute.internal    <none>           <none>
nfd-worker-mz8s2   1/1     Running   16 (3h7m ago)   4d18h   10.0.29.181   ip-10-0-29-181.us-east-2.compute.internal   <none>           <none>


Check the NVIDIA GPU Operator pods:

In [3]:
!oc -n nvidia-gpu-operator get pods

NAME                                           READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-dp7mv                    2/2     Running     0          3h3m
gpu-feature-discovery-jv4kt                    0/2     Init:0/2    0          3m46s
gpu-operator-8fc78cd6c-npd2d                   1/1     Running     7          4d18h
nvidia-container-toolkit-daemonset-d59mm       1/1     Running     0          3h3m
nvidia-container-toolkit-daemonset-j2kpq       0/1     Init:0/1    0          3m46s
nvidia-cuda-validator-5gv7c                    0/1     Completed   0          3h2m
nvidia-dcgm-exporter-7xkhs                     1/1     Running     0          3h3m
nvidia-dcgm-exporter-95wwx                     0/1     Init:0/2    0          3m46s
nvidia-dcgm-r5hc9                              0/1     Init:0/1    0          3m46s
nvidia-dcgm-tmhnd                              1/1     Running     0          3h3m
nvidia-device-plugin-daemonset-k4cx9           0/2     Init:0/2    0          3m46s

After a few minutes (~5 min), all of the pods in the `openshift-nfd` and `nvidia-gpu-operator` namespaces start up, the node features are discovered, and the node is labelled accordingly.
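
Rather than re-reading the pod listing by eye, the readiness check can be scripted. The following is a minimal sketch and not part of the original workshop: it assumes the pod list has been fetched as JSON (for example with `oc -n nvidia-gpu-operator get pods -o json`), and the helper name `unready_pods` and the sample data are hypothetical.

```python
import json


def unready_pods(pod_list: dict) -> list:
    """Return names of pods that are neither completed nor fully ready."""
    pending = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        # One-shot pods such as nvidia-cuda-validator finish with phase "Succeeded".
        if status.get("phase") == "Succeeded":
            continue
        containers = status.get("containerStatuses", [])
        # A pod still initialising has containers that are not ready (or none reported).
        if not containers or not all(c.get("ready", False) for c in containers):
            pending.append(pod["metadata"]["name"])
    return pending


# Hand-written sample shaped like `oc get pods -o json` output (illustrative only).
sample = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-feature-discovery-dp7mv"},
   "status": {"phase": "Running",
              "containerStatuses": [{"ready": true}, {"ready": true}]}},
  {"metadata": {"name": "gpu-feature-discovery-jv4kt"},
   "status": {"phase": "Pending",
              "containerStatuses": [{"ready": false}, {"ready": false}]}},
  {"metadata": {"name": "nvidia-cuda-validator-5gv7c"},
   "status": {"phase": "Succeeded",
              "containerStatuses": [{"ready": false}]}}
]}
""")
print(unready_pods(sample))  # ['gpu-feature-discovery-jv4kt']
```

An empty list would indicate that every pod in the namespace is either ready or has completed.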

Browse to the Node Features under the NFD Operator in the web console and select the node we added in the [first exercise](Level1_add_gpu_node.ipynb).

![images/node-feature-gpu.png](images/node-feature-gpu.png)

## Configure GPU Sharing

Because of the configuration already present in the cluster, a couple of things have already been set up for us on the A10 node:

```yaml
nvidia.com/gpu.replicas: 8
nvidia.com/gpu-sharing-strategy: time-slicing
```

Time slicing is configured so that, even though we have only 1 physical GPU, we can schedule up to 8 separate workloads on it (GPU sharing).

This configuration is set up in two places in the codebase we used to deploy the environment.
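
As a small worked example of the arithmetic, the number of `nvidia.com/gpu` resources a node advertises is the number of physical GPUs multiplied by the replica count. This is an illustrative sketch; the function name `advertised_gpus` is hypothetical.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources a node advertises under time slicing."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas must be >= 1")
    return physical_gpus * replicas


print(advertised_gpus(1, 8))  # 8: one physical GPU shared by up to 8 workloads
print(advertised_gpus(1, 1))  # 1: no sharing
```

Time slicing only multiplies the advertised resource count. According to NVIDIA's GPU sharing documentation, replicas get no memory or fault isolation from each other, so workloads sharing a GPU still compete for the same device memory.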

(1) The [NVIDIA GPU ClusterPolicy](https://github.com/eformat/rhoai-policy-collection/blob/main/gitops/applications/gpu/base/gpu-cluster-policy.yaml#L62)

```yaml
  devicePlugin:
    enabled: true
    config:
      name: "time-slicing-config"
```

(2) A ConfigMap called [time-slicing-config](https://github.com/eformat/rhoai-policy-collection/blob/main/gitops/applications/gpu/overlay/sno/configmap.yaml)

```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-gpu-operator
data:
    NVIDIA-L4: |-
      version: v1
      flags:
        migStrategy: none
      sharing:
        timeSlicing:
          resources:
          - name: nvidia.com/gpu
            replicas: 8
```

We now want to modify the configuration to take account of our new A10 GPU node and set a different number of replicas for that GPU.

Let's do that now.

We will configure the A10 GPU for **4 replicas**. For example, perhaps we want this GPU to host fewer but more important workloads (so less sharing). We could instead set this to **1** to allow only one workload (no sharing), or use a combination of values. See the [NVIDIA GPU Sharing documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html#applying-multiple-node-specific-configurations) for more details on available values.

In [4]:
!oc -n nvidia-gpu-operator patch cm time-slicing-config --type=merge --patch \
    '"data": {"NVIDIA-A10G": "version: v1\nflags:\n  migStrategy: none\nsharing:\n  timeSlicing:\n    resources:\n    - name: nvidia.com/gpu\n      replicas: 4"}'

configmap/time-slicing-config patched
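
The patch above embeds a multi-line YAML document in a single string with hand-written `\n` escapes, which is easy to get wrong. As an alternative, here is a minimal sketch that builds the same merge patch in Python and lets `json.dumps` handle the escaping. The helper name `build_time_slicing_patch` is hypothetical; the config text mirrors the one used in the cell above.

```python
import json


def build_time_slicing_patch(config_key: str, replicas: int) -> str:
    """Build a JSON merge patch that adds one time-slicing entry to the ConfigMap."""
    config_text = "\n".join([
        "version: v1",
        "flags:",
        "  migStrategy: none",
        "sharing:",
        "  timeSlicing:",
        "    resources:",
        "    - name: nvidia.com/gpu",
        f"      replicas: {replicas}",
    ])
    # json.dumps escapes the newlines inside the string value for us.
    return json.dumps({"data": {config_key: config_text}})


patch = build_time_slicing_patch("NVIDIA-A10G", 4)
print(patch)
```

The resulting string could then be passed to the same `oc patch ... --type=merge --patch` command in place of the hand-written literal.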


In [5]:
!oc -n nvidia-gpu-operator get cm time-slicing-config -o yaml

apiVersion: v1
data:
  NVIDIA-A10G: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
  NVIDIA-L4: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 8
kind: ConfigMap
metadata:
  creationTimestamp: "2025-06-25T04:43:58Z"
  name: time-slicing-config
  namespace: nvidia-gpu-operator
  resourceVersion: "6511538"
  uid: 1d86c8bf-492d-413b-ab8b-5b8ebbbf977c


For this configuration to take effect, we need to label our node with the device plugin config for the A10.

<div class="alert alert-block alert-info">
<b>Tip:</b> We use automation to label our initial SNO node using Policy as Code. You can see the policy <a href="https://github.com/eformat/rhoai-policy-collection/blob/main/gitops/applications/policy-collection/overlays/sno/CM-Configuration-Management/policy-gpu-node-label.yaml#L113-L114" target="_blank"><b>gpu-node-label-sno</b></a> on GitHub, and browse to it from the Governance Policy view in OpenShift Web Console > ACM view.</div>



In [6]:
!oc label node -l beta.kubernetes.io/instance-type=g5.xlarge nvidia.com/device-plugin.config=NVIDIA-A10G --overwrite

node/ip-10-0-15-75.us-east-2.compute.internal labeled
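
The label value acts as a lookup key: the device plugin uses the entry in `time-slicing-config` whose name matches the node's `nvidia.com/device-plugin.config` label. The sketch below illustrates that lookup under simplifying assumptions: the helper name `replicas_for_node` is hypothetical, it reads the first `replicas:` line with a regular expression instead of parsing the YAML, and it ignores any default config that a ClusterPolicy might define.

```python
import re

CONFIG_LABEL = "nvidia.com/device-plugin.config"


def replicas_for_node(node_labels: dict, configmap_data: dict):
    """Return the time-slicing replica count selected by a node's label, or None."""
    config_key = node_labels.get(CONFIG_LABEL)
    config_text = configmap_data.get(config_key) if config_key else None
    if config_text is None:
        return None  # no label, or the label names a key missing from the ConfigMap
    # Simplification: take the first "replicas: N" line rather than parsing YAML.
    match = re.search(r"^\s*replicas:\s*(\d+)\s*$", config_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


# Abbreviated copies of the two ConfigMap entries used in this notebook.
configmap_data = {
    "NVIDIA-A10G": "version: v1\nsharing:\n  timeSlicing:\n    resources:\n"
                   "    - name: nvidia.com/gpu\n      replicas: 4",
    "NVIDIA-L4": "version: v1\nsharing:\n  timeSlicing:\n    resources:\n"
                 "    - name: nvidia.com/gpu\n      replicas: 8",
}
print(replicas_for_node({CONFIG_LABEL: "NVIDIA-A10G"}, configmap_data))  # 4
print(replicas_for_node({CONFIG_LABEL: "NVIDIA-L4"}, configmap_data))    # 8
print(replicas_for_node({}, configmap_data))                             # None
```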


It may take a minute for the change to apply. We can then check how many GPUs are allocatable on our g5.xlarge node.

In [7]:
!oc get $(oc get node -o name -l beta.kubernetes.io/instance-type=g5.xlarge) -o=jsonpath={.status.allocatable} | python -m json.tool

{
    "cpu": "3500m",
    "ephemeral-storage": "114345831029",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "15031512Ki",
    "nvidia.com/gpu": "4",
    "pods": "250"
}


In [8]:
%%bash
while [ "$(oc get "$(oc get node -o name -l beta.kubernetes.io/instance-type=g5.xlarge | head -n1)" \
         -o=jsonpath='{.status.allocatable.nvidia\.com/gpu}')" != "4" ]; do
    printf '.'
    sleep 1
done
echo "GPU count is now 4"
oc get $(oc get node -o name -l beta.kubernetes.io/instance-type=g5.xlarge) -o=jsonpath={.status.allocatable} | python -m json.tool

GPU count is now 4
{
    "cpu": "3500m",
    "ephemeral-storage": "114345831029",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "15031512Ki",
    "nvidia.com/gpu": "4",
    "pods": "250"
}
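
The shell loop above polls forever if the expected value never appears, for instance when the label was applied to the wrong node. As a hedged alternative, this sketch shows the same polling idea in Python with a timeout. The function name `wait_for_gpu_count` and the `fetch_count` callable are hypothetical; in the notebook, `fetch_count` would wrap the `oc get node ... -o=jsonpath=...` call, and here it is replaced by a stand-in so the example runs without a cluster.

```python
import time
from typing import Callable


def wait_for_gpu_count(fetch_count: Callable[[], str], expected: str,
                       timeout_s: float = 300.0, interval_s: float = 1.0) -> bool:
    """Poll fetch_count() until it returns expected; give up after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch_count() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for the real query: illustrative readings that settle on "4".
readings = iter(["8", "8", "4"])
print(wait_for_gpu_count(lambda: next(readings), "4", timeout_s=5, interval_s=0.01))  # True
```

Returning `False` on timeout lets the caller decide whether to retry, inspect the node labels, or fail the cell.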


This now shows 4, as expected. Similarly, we can check our original SNO node:

In [9]:
!oc get $(oc get node -o name -l beta.kubernetes.io/instance-type=g6.8xlarge) -o=jsonpath={.status.allocatable}| python -m json.tool

{
    "cpu": "30500m",
    "ephemeral-storage": "384800664142",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "122748836Ki",
    "nvidia.com/gpu": "8",
    "pods": "500"
}


This still shows 8 GPUs, as expected.

## Add Hardware Profile for RHOAI

Hardware profiles enable administrators to create profiles for additional types of identifiers, limit workload resource allocations, and target workloads to specific nodes by including tolerations and nodeSelectors in profiles. 

They have superseded `accelerator profiles`, which are now labelled `legacy` but can still be deployed.

We need to create a profile for our new GPU node.

![images/a10-hardware-profile.png](images/a10-hardware-profile.png)

In [10]:
%%bash
oc apply -f- << EOF
apiVersion: dashboard.opendatahub.io/v1alpha1
kind: HardwareProfile
metadata:
  annotations:
    opendatahub.io/dashboard-feature-visibility: '[]'
  name: nvidia-a10-shared
  namespace: redhat-ods-applications
spec:
  description: ""
  displayName: Nvidia A10 (Shared)
  enabled: true
  identifiers:
  - defaultCount: 2
    displayName: CPU
    identifier: cpu
    maxCount: 4
    minCount: 1
    resourceType: CPU
  - defaultCount: 4Gi
    displayName: Memory
    identifier: memory
    maxCount: 8Gi
    minCount: 2Gi
    resourceType: Memory
  - defaultCount: 1
    displayName: nvidia.com/gpu
    identifier: nvidia.com/gpu
    minCount: 1
    resourceType: Accelerator
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A10G-SHARED
  tolerations: []
EOF

hardwareprofile.dashboard.opendatahub.io/nvidia-a10-shared created
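
The `nodeSelector` in this profile is what steers workloads to the shared A10 node: a pod with a node selector can only be scheduled onto a node that carries every listed label with the same value. The sketch below illustrates that matching rule. The function name `selector_matches` is hypothetical, and the node label values are assumptions for illustration; the real ones can be checked with `oc get node --show-labels`.

```python
def selector_matches(node_labels: dict, node_selector: dict) -> bool:
    """True when every key/value pair in the selector is present on the node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Selector copied from the HardwareProfile above.
profile_selector = {"nvidia.com/gpu.product": "NVIDIA-A10G-SHARED"}

# Assumed label values for the two nodes in this environment.
a10_node = {"nvidia.com/gpu.product": "NVIDIA-A10G-SHARED"}
l4_node = {"nvidia.com/gpu.product": "NVIDIA-L4-SHARED"}

print(selector_matches(a10_node, profile_selector))  # True
print(selector_matches(l4_node, profile_selector))   # False
```

If the selector value does not exactly match the label on the node, workloads using the profile stay unscheduled, so it is worth confirming the label before relying on the profile.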


<div class="alert alert-block alert-success">
<b>Success:</b> We have successfully configured time slicing with a different configuration for each of our 2 GPUs.
</div>

Continue to the [next notebook](./Level3_new_gpu_workload.ipynb) to learn how to deploy a cool workload on your new GPU worker node.