## Running a Larger LLM on Multiple GPUs and Multiple Nodes

👷‍♂️ Work In Progress 👷‍♂️

This lab has been tested and works. It still needs more documentation, more explanation, and some debugging of the cells.

If you jumped here from the Level4 notebook, carry on! 🪏

The notebook is partly based on the product documentation, with some enhancements. Some useful links:

- https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.21/html/serving_models/serving-large-models_serving-large-models#deploying-models-using-multiple-gpu-nodes_serving-large-models
- https://access.redhat.com/articles/6966373
- https://github.com/rh-aiservices-bu/multi-node-multi-gpu-poc

### GPU Aggregation Overview

GPU partitioning lets a single GPU be shared by small, medium, and large workloads, each running in its own partition. Partitions can be a valid option for Deep Learning workloads, for example training and inference workflows that use smaller datasets. Whether they fit still depends heavily on the size of the data and the model, and users may need to decrease batch sizes. GPU aggregation addresses the opposite case: a workload that is too large for any single GPU.

#### Why GPU Aggregation?

Some Large Language Models (LLMs), such as Llama-3-70B and Falcon 180B, can be too large to fit into the memory (VRAM) of a single GPU. In other cases, GPUs that would be large enough might be difficult to obtain. If you find yourself in such a situation, it is natural to wonder whether several smaller GPUs can be aggregated and used in place of one large GPU.

Thankfully, the answer is essentially yes. We can use more advanced configurations to distribute the LLM workload across several GPUs. One option is tensor parallelism, where the LLM is split across several GPUs, with each GPU processing a portion of the model's tensors. This approach makes efficient use of the available GPUs across one or several workers.

Some serving runtimes, such as vLLM, support tensor parallelism in both single-worker and multi-worker configurations (the difference being whether your GPUs are all in the same machine or spread across machines).
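
To make the idea of tensor parallelism concrete, here is a toy sketch in plain Python. It assumes a column-wise split of a single linear layer's weight matrix; the helper names (`split_columns`, `tensor_parallel_matmul`) are made up for illustration and are not vLLM APIs. Lists stand in for GPU tensors, so this only shows the arithmetic, not real device placement or communication.

```python
# Toy illustration of column-wise tensor parallelism for one linear layer.
# Plain Python lists stand in for GPU tensors; no real devices are involved.

def matmul(x, w):
    """Multiply a (rows x inner) matrix by an (inner x cols) matrix."""
    return [
        [sum(x_row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
        for x_row in x
    ]

def split_columns(w, parts):
    """Split a weight matrix into `parts` equal column blocks, one per device."""
    cols = len(w[0])
    assert cols % parts == 0, "columns must divide evenly across devices"
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def tensor_parallel_matmul(x, w, parts):
    """Each 'device' multiplies the full input by its own weight shard;
    the partial outputs are then concatenated column-wise (an all-gather)."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]                  # activations: 2 tokens, hidden size 2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]      # weights: hidden size 2 -> 4 outputs

full = matmul(x, w)                   # what a single large GPU would compute
sharded = tensor_parallel_matmul(x, w, parts=2)
print(sharded == full)
```

Each shard holds only half of the weight columns, which is why the per-device memory requirement drops while the combined result is unchanged.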

#### Components of GPU Aggregation

GPU aggregation is a complex topic with many components. Fundamentally, there are four core concepts to consider:

* Tensor Parallelism
* Pipeline Parallelism
* Data Parallelism
* Expert Parallelism

<img src="images/gpu-aggregation.png"
     alt="GPU Aggregation"
     style="width:75%;">

### In this lab

We are going to deploy a larger LLM across both of our GPU-enabled nodes. The model needs both GPUs in full to run.

#### 💡 Free up GPU memory to run this exercise

In this lab environment, model deployments are controlled using Policy-as-Code. We can undeploy the existing lab models, and so free their GPU memory, by creating the ConfigMaps that this mechanism uses:

In [None]:
!oc create configmap undeploy-sno-deepseek-qwen3-vllm -n llama-serving

In [None]:
!oc create configmap undeploy-llama3-2-3b -n llama-serving

**Note:** If you need to redeploy these models, simply delete the ConfigMaps. The order matters for correct startup: deploy Llama3 first. We have limited GPU VRAM, so we use vLLM's `gpu_memory_utilization` parameter when loading the models. This parameter works on the available GPU memory, so we need to load Llama3 (`gpu_memory_utilization=0.5`) first and DeepSeek (`gpu_memory_utilization=0.8`) second.

```bash
# 1 - redeploy llama3-2-3b
oc delete configmap undeploy-llama3-2-3b -n llama-serving
# 2 - redeploy deepseek-qwen3
oc delete configmap undeploy-sno-deepseek-qwen3-vllm -n llama-serving
```

### Configure RWX Storage

In [1]:
!oc login -u admin -p ${ADMIN_PASSWORD} --server=https://api.sno.${BASE_DOMAIN}:6443 --insecure-skip-tls-verify


Login successful.

You have access to 116 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "kserve-demo".


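Check that the **efs-sc** storage class is configured. If it is not, check with your cluster admin! The model PVC needs ReadWriteMany (RWX) access because the head and worker pods run on different nodes and both mount the same model files.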

In [2]:
!oc get sc efs-sc

NAME     PROVISIONER       RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
efs-sc   efs.csi.aws.com   Delete          Immediate           false                  3d1h


### Download Larger Model for Inference to Storage

For demonstration purposes, let's select a model that we know will not fit on our single 24Gi GPU. Let's try `RedHatAI/Mistral-Small-24B-Instruct-2501-FP8-dynamic`, a good-quality quantized model that has ~30Gi of safetensors weights and also needs room for the KV cache, so it will definitely not fit on a single GPU.

https://huggingface.co/RedHatAI/Mistral-Small-24B-Instruct-2501-FP8-dynamic/tree/main
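
The fit argument above can be written out as a small calculation. This is a rough sketch that uses only the approximate figures quoted in this notebook and assumes the weights are split evenly across the GPUs, which real runtimes only approximate. The function names are hypothetical helpers, not part of any library.

```python
# Rough fit check using the approximate figures quoted in this notebook (GiB).
WEIGHTS_GIB = 30   # approximate safetensors size quoted above
GPU_GIB = 24       # memory of a single GPU in this lab
NUM_GPUS = 2       # one GPU on each of the two nodes

def per_gpu_share(weights_gib, num_gpus):
    """Weights held by each GPU, assuming an even split across devices."""
    return weights_gib / num_gpus

def headroom(weights_gib, gpu_gib, num_gpus):
    """Memory left per GPU for KV cache and activations (negative = no fit)."""
    return gpu_gib - per_gpu_share(weights_gib, num_gpus)

print(headroom(WEIGHTS_GIB, GPU_GIB, 1))          # single GPU: negative, no fit
print(headroom(WEIGHTS_GIB, GPU_GIB, NUM_GPUS))   # two GPUs: room left for KV cache
```

With one GPU the headroom is negative before any KV cache is allocated. With two GPUs each device holds roughly half of the weights and keeps some memory free for the cache.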

In [3]:
!oc new-project kserve-demo

Error from server (AlreadyExists): project.project.openshift.io "kserve-demo" already exists


Set the model path and create a PVC to hold the downloaded model.


In [15]:
%env MODEL_PATH=mistral-small

env: MODEL_PATH=mistral-small


In [9]:
%%bash
oc apply -f- << EOF
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${MODEL_PATH}-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 30Gi
  storageClassName: efs-sc
EOF

persistentvolumeclaim/mistral-small-pvc unchanged


Let's grab a YAML file that will help us download the Hugging Face model to a PVC.

In [11]:
!curl -o download-model-to-pvc.yaml https://raw.githubusercontent.com/eformat/rhoai-policy-collection/refs/heads/main/gitops/applications/model-download/download-model-to-pvc.yaml


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1552  100  1552    0     0   4958      0 --:--:-- --:--:-- --:--:--  4958


In [None]:
%env PVC_CLAIM_NAME=mistral-small-pvc
%env HF_TOKEN=hf_...
%env MODEL=RedHatAI/Mistral-Small-24B-Instruct-2501-FP8-dynamic

In [29]:
!cat download-model-to-pvc.yaml | envsubst | oc apply -f-

pod/download-model created


Wait until the pod completes successfully (approximately 6-8 minutes).
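
If you would rather not watch the pod by hand, the wait can be scripted. The sketch below is one possible approach, not part of the lab: `pod_phase` and `wait_for_completion` are hypothetical helpers, and the timeout and polling interval are arbitrary choices. It assumes `oc` is on the path and already logged in.

```python
import subprocess
import time

def pod_phase(name, namespace):
    """Ask the cluster for a pod's phase (Pending, Running, Succeeded, Failed)."""
    result = subprocess.run(
        ["oc", "get", "pod", name, "-n", namespace,
         "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def wait_for_completion(get_phase, timeout_s=900, interval_s=15, sleep=time.sleep):
    """Poll get_phase() until the pod succeeds, fails, or the timeout expires."""
    waited = 0
    while waited <= timeout_s:
        phase = get_phase()
        if phase in ("Succeeded", "Failed"):
            return phase
        sleep(interval_s)
        waited += interval_s
    raise TimeoutError(f"pod still not finished after {timeout_s}s")
```

For this lab the call would look like `wait_for_completion(lambda: pod_phase("download-model", "kserve-demo"))`. Passing the phase lookup in as a function keeps the polling loop easy to test without a cluster.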

### Create Inference

RHOAI ships with the templates needed to run multi-node, multi-GPU inference. Let's use them to create the ServingRuntime.

In [None]:
!oc process vllm-multinode-runtime-template -n redhat-ods-applications | oc apply -n kserve-demo -f-

In [2]:
%env INFERENCE_NAME=mistral-small
%env MODEL_PATH=mistral-small

env: INFERENCE_NAME=mistral-small
env: MODEL_PATH=mistral-small


In [None]:
%%bash
oc apply -f- << EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: external
  name: ${INFERENCE_NAME}
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-multinode-runtime
      storageUri: pvc://${PVC_CLAIM_NAME}/${MODEL_PATH}
    workerSpec: {}
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
EOF

The templates set hard resource limits that are excessive for our lab resources. Trim them so that only requests are set, which gives the pods the Burstable QoS class.

In [None]:
# Serving Runtime ... prune resources HARD!
spec:
  containers:
    - resources:
        requests:
          cpu: '1'
          memory: 2Gi
      readinessProbe:


Tail the logs on the inference pod. We should see the safetensors shards loading.

```bash
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:25:53 [ray_distributed_executor.py:357] If certain env vars should NOT be copied to workers, add them to /tmp/.config/vllm/ray_non_carry_over_env_vars.json file
Loading safetensors checkpoint shards:   0% Completed | 0/6 [00:00<?, ?it/s]pid=912) 
Loading safetensors checkpoint shards:  17% Completed | 1/6 [00:10<00:50, 10.05s/it] 
Loading safetensors checkpoint shards:  33% Completed | 2/6 [00:10<00:16,  4.24s/it] 
Loading safetensors checkpoint shards:  50% Completed | 3/6 [00:13<00:10,  3.61s/it] 
Loading safetensors checkpoint shards:  67% Completed | 4/6 [00:20<00:10,  5.01s/it] 
Loading safetensors checkpoint shards:  83% Completed | 5/6 [00:22<00:04,  4.04s/it] 
Loading safetensors checkpoint shards: 100% Completed | 6/6 [00:32<00:00,  6.14s/it] 
Loading safetensors checkpoint shards: 100% Completed | 6/6 [00:32<00:00,  5.46s/it]
```

After some time, the OpenAI-compatible API becomes ready.

```bash
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:28] Available routes are:
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /docs, Methods: HEAD, GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /health, Methods: GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /load, Methods: GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /ping, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /ping, Methods: GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /tokenize, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /detokenize, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/models, Methods: GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /version, Methods: GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/completions, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/embeddings, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /pooling, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /classify, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /score, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/score, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /rerank, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/rerank, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v2/rerank, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /invocations, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /metrics, Methods: GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO:     Started server process [1]
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO:     Waiting for application startup.
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO:     Application startup complete.
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO:     10.128.0.218:55526 - "GET /metrics HTTP/1.1" 200 OK
```

Check Pod Status

In [None]:
!oc get pods -o wide

In [None]:
NAME                                              READY   STATUS      RESTARTS   AGE     IP             NODE                                       NOMINATED NODE   READINESS GATES
download-model                                    0/1     Completed   0          6h9m    10.129.0.47    ip-10-0-37-35.us-east-2.compute.internal   <none>           <none>
mistral-small-predictor-5dbf9cbd8d-gbpbm          1/1     Running     0          6m10s   10.128.1.222   ip-10-0-40-85.us-east-2.compute.internal   <none>           <none>
mistral-small-predictor-worker-7489767864-l6mfg   1/1     Running     0          6m10s   10.129.0.163   ip-10-0-37-35.us-east-2.compute.internal   <none>           <none>
tools-56447bb8b-27wsl                             1/1     Running     0          6h9m    10.129.0.45    ip-10-0-37-35.us-east-2.compute.internal   <none>           <none>


We can also check `nvidia-smi` for GPU VRAM usage stats.

In [None]:
%env DEMO_NAMESPACE=kserve-demo
%env MODEL_NAME=mistral-small

In [None]:
head = !oc get pod -n $DEMO_NAMESPACE -l app=isvc.$MODEL_NAME-predictor --no-headers | cut -d' ' -f1
worker = !oc get pod -n $DEMO_NAMESPACE -l app=isvc.$MODEL_NAME-predictor-worker --no-headers | cut -d' ' -f1
# Each `!` line runs in its own subshell, so export the pod names with %env for later cells
%env podName={head[0]}
%env workerPodName={worker[0]}
!oc -n $DEMO_NAMESPACE wait --for=condition=ready pod/$podName --timeout=300s


We can see the model loaded across both of our GPU nodes.

In [None]:
!echo "### HEAD NODE GPU Memory Size"
!oc -n $DEMO_NAMESPACE exec $podName -- nvidia-smi
!echo "### Worker NODE GPU Memory Size"
!oc -n $DEMO_NAMESPACE exec $workerPodName -- nvidia-smi

In [None]:
### HEAD NODE GPU Memory Size
Defaulted container "kserve-container" out of: kserve-container, ray-tls-generator (init)
Sun Jul  6 08:31:40 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:36:00.0 Off |                    0 |
| N/A   54C    P0             35W /   72W |   20252MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             912      C   ray::RayWorkerWrapper                 20244MiB |
+-----------------------------------------------------------------------------------------+
### Worker NODE GPU Memory Size
Defaulted container "worker-container" out of: worker-container, ray-tls-generator (init)
Sun Jul  6 08:31:43 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   45C    P0             98W /  300W |   20775MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             238      C   ray::RayWorkerWrapper                 20766MiB |
+-----------------------------------------------------------------------------------------+
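
The memory figures in the two tables can be turned into utilization fractions with a few lines of Python. The numbers below are copied from the `nvidia-smi` output above; the `utilization` helper is just for illustration.

```python
# Memory figures copied from the nvidia-smi output above: (used MiB, total MiB).
gpus = {
    "head (NVIDIA L4)": (20252, 23034),
    "worker (NVIDIA A10G)": (20775, 23028),
}

def utilization(used_mib, total_mib):
    """Fraction of a GPU's memory currently in use."""
    return used_mib / total_mib

for name, (used, total) in gpus.items():
    print(f"{name}: {utilization(used, total):.1%} used, {total - used} MiB free")

total_used_gib = sum(used for used, _ in gpus.values()) / 1024
print(f"combined: {total_used_gib:.1f} GiB in use across both GPUs")
```

Both GPUs sit close to 90% utilization. That would be consistent with vLLM's documented default of `gpu_memory_utilization=0.9`, although this is an assumption here because the value set by the runtime template is not shown in this notebook. The combined usage is higher than the size of the weights alone, since vLLM also reserves memory up front for the KV cache.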


Let's create a route so we can test the inference endpoint.

In [None]:
%%bash
oc apply -f- << EOF
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: ${INFERENCE_NAME}
  labels:
    app: isvc.${INFERENCE_NAME}-predictor
    component: predictor
    isvc.generation: "1"
    serving.kserve.io/inferenceservice: ${INFERENCE_NAME}
  annotations:
    openshift.io/host.generated: "true"
spec:
  to:
    kind: Service
    name: ${INFERENCE_NAME}-predictor
    weight: 100
  port:
    targetPort: http
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
  wildcardPolicy: None
EOF

In [None]:
%%bash
oc wait --for=condition=ready pod/${podName} -n $DEMO_NAMESPACE --timeout=300s

In [None]:
route_host = !oc get route -n $DEMO_NAMESPACE | grep $MODEL_NAME | awk '{print $2}'
%env isvc_url={route_host[0]}

Check the endpoint with a test completion request.

In [None]:
%%bash
curl https://$isvc_url/v1/completions \
   -H "Content-Type: application/json" \
   -d "{
        \"model\": \"$MODEL_NAME\",
        \"prompt\": \"What is the biggest mountain in the world?\",
        \"max_tokens\": 100,
        \"temperature\": 0
    }"

Mauna Kea indeed !!

In [None]:
{"id":"cmpl-3a23f5db101f416192910105c6036cc8","object":"text_completion","created":1751791369,"model":"mistral-small","choices":[{"index":0,"text":" The answer is not Mount Everest. The biggest mountain in the world is actually Mauna Kea in Hawaii. Mauna Kea is a dormant volcano that rises 13,796 feet (4,205 meters) above sea level, but it is also 19,680 feet (6,000 meters) tall when measured from its base on the ocean floor. This makes it the tallest mountain in the world when measured from base to peak","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":10,"total_tokens":110,"completion_tokens":100,"prompt_tokens_details":null},"kv_transfer_params":null}
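
The same request can be made from a Python cell. The sketch below uses only the standard library and mirrors the curl call above. `build_completion_request` and `first_completion_text` are hypothetical helpers, and `example.invalid` is a placeholder host that should be replaced with the route host stored in `isvc_url`.

```python
import json
import urllib.request

def build_completion_request(host, model, prompt, max_tokens=100, temperature=0):
    """Build (but do not send) a POST to the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"https://{host}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def first_completion_text(response_body):
    """Extract the generated text from a /v1/completions JSON response."""
    return json.loads(response_body)["choices"][0]["text"]

# "example.invalid" is a placeholder; substitute the route host from isvc_url.
request = build_completion_request(
    "example.invalid", "mistral-small", "What is the biggest mountain in the world?"
)
print(request.full_url)
```

To send it, pass the request to `urllib.request.urlopen` and feed the response body to `first_completion_text`. Building the request separately from sending it makes the payload easy to inspect before it reaches the model.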


vLLM uses Ray internally, and we can check its status from the head pod.

In [None]:
%%bash
oc exec -i pod/${podName} -- /bin/sh -s << EOF
ray status
EOF

In [None]:
Defaulted container "kserve-container" out of: kserve-container, ray-tls-generator (init)
======== Autoscaler status: 2025-07-06 08:44:21.676093 ========
Node status
---------------------------------------------------------------
Active:
 1 node_8a3ca93eb4c1f37584480f0c5611a3a46a998ce028ad0bfd8910e793
 1 node_5e6faa16c3d7c131993ab9d79177f7a4d004b08961024a4c3a221fb3
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/36.0 CPU
 2.0/2.0 GPU (2.0 used of 2.0 reserved in placement groups)
 0B/121.49GiB memory
 0B/13.91GiB object_store_memory

Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)
