## Running a larger LLM on multiple GPUs and multiple nodes

If you jumped here from the Level4 notebook, carry on! 🪏

This notebook is based partly on the product documentation, with some enhancements. Some useful links:

- https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.22/html/serving_models/serving-large-models_serving-large-models#deploying-models-using-multiple-gpu-nodes_serving-large-models
- https://access.redhat.com/articles/6966373
- https://github.com/rh-aiservices-bu/multi-node-multi-gpu-poc

### GPU Aggregation Overview

Compute workloads can benefit from using separate GPU partitions. GPU partitioning allows a single GPU to be shared by small, medium, and large workloads, which makes partitions a valid option for deep learning workloads. Examples include training and inference workflows that use smaller datasets; these remain highly dependent on the size of the data and model, so users may need to decrease batch sizes.

#### Why GPU Aggregation?

Some Large Language Models (LLMs), such as Llama-3-70B and Falcon 180B, can be too large to fit into the memory (VRAM) of a single GPU. In other cases, GPUs that would be large enough might be difficult to obtain. If you find yourself in such a situation, it is natural to wonder whether several smaller GPUs can be used in place of one large GPU.

Thankfully, the answer is essentially yes. We can use more advanced configurations to distribute the LLM workload across several GPUs. One option is tensor parallelism, where the LLM is split across several GPUs, with each GPU processing a portion of the model's tensors. This approach makes use of the available GPUs across one or several workers.

Some serving runtimes, such as vLLM, support tensor parallelism in both single-worker and multi-worker configurations (the difference being whether your GPUs are all in the same machine or spread across machines).
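To get a feel for why such models outgrow a single GPU, here is a rough back-of-the-envelope sketch. It only counts the weights (parameters times bytes per parameter) and ignores KV cache and activations, so real requirements are higher. The function name and the precisions chosen are illustrative assumptions, not something taken from the product documentation.

```python
def weight_footprint_gib(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / 2**30


# 16-bit weights use 2 bytes per parameter, 8-bit weights use 1 byte.
for name, params in [("Llama-3-70B", 70), ("Falcon 180B", 180)]:
    for label, width in [("16-bit", 2), ("8-bit", 1)]:
        gib = weight_footprint_gib(params, width)
        print(f"{name} @ {label}: ~{gib:.0f} GiB of weights")
```

Even at 8-bit precision, both models need far more memory than a 24Gi card offers, which is the gap that aggregation is meant to close.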

#### Components of GPU Aggregation

GPU aggregation is a complex topic with many components. Fundamentally, there are four core concepts to consider:

* Tensor Parallelism
* Pipeline Parallelism
* Data Parallelism
* Expert Parallelism

In tensor parallelism, each GPU processes a slice of a tensor and only aggregates the full tensor when necessary for specific operations. This approach allows larger models to run efficiently across multiple devices while maintaining performance.
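The idea can be sketched in plain Python, with lists standing in for GPU tensors. This is a toy illustration under simplifying assumptions (a single matrix multiply, column-wise sharding, two pretend devices); it is not how vLLM implements it, and the helper names are made up for this example.

```python
def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def split_columns(w, parts):
    """Split matrix w column-wise into `parts` equally sized shards."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must divide evenly"
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each "GPU" holds one shard of the weights and computes a slice of the output.
shards = split_columns(w, parts=2)
partials = [matvec(x, shard) for shard in shards]

# Gathering the slices reproduces the result of the unsplit multiply.
gathered = [value for part in partials for value in part]
print(gathered)
```

Each device only ever stores half of `w`; the full output is assembled only at the gather step.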

Pipeline parallelism differs from tensor parallelism in that it splits the model vertically (across layers) rather than horizontally (across tensor dimensions). Each GPU or node in the pipeline processes a complete subset of the model's layers, making it particularly effective for extremely large models like DeepSeek R1 or Llama 3.1 405B that cannot fit on a single node.
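A minimal sketch of the layer split, again in plain Python. The "layers" here are trivial functions and the stage assignment is a simple contiguous split; both are assumptions for illustration, and a real runtime may balance stages differently.

```python
def assign_stages(num_layers, num_stages):
    """Split layer indices into contiguous stages of near-equal size."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage_index in range(num_stages):
        size = base + (1 if stage_index < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def run_pipeline(x, layers, stages):
    """Pass the activation through each stage in order, as if hopping between GPUs."""
    for stage in stages:
        for layer_index in stage:
            x = layers[layer_index](x)
    return x


# Four toy layers split over two stages (for example, one stage per node).
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]
stages = assign_stages(len(layers), num_stages=2)
print([list(stage) for stage in stages])
print(run_pipeline(5, layers, stages))
```

Only the activation crosses the stage boundary, which is why this layout tolerates slower links between nodes better than splitting every tensor would.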

Data Parallelism (DP) replicates the model across multiple GPUs. Data batches are evenly distributed between GPUs and the data-parallel GPUs process them independently. While the computation workload is efficiently distributed across GPUs, inter-GPU communication is required in order to keep the model replicas consistent between training steps.
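The synchronization step can be illustrated with a toy one-parameter model. This sketch assumes equally sized shards and a mean-squared-error loss for `y = w * x`; the averaging line stands in for the all-reduce that real frameworks perform over the network.

```python
def grad(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_grad(w, batch, replicas):
    """Shard the batch, compute a local gradient per replica, then average them."""
    assert len(batch) % replicas == 0, "batch must split evenly across replicas"
    size = len(batch) // replicas
    shards = [batch[r * size:(r + 1) * size] for r in range(replicas)]
    local_grads = [grad(w, shard) for shard in shards]
    return sum(local_grads) / replicas  # stands in for an all-reduce average


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
print(grad(1.0, batch))
print(data_parallel_grad(1.0, batch, replicas=2))
```

Both calls yield the same gradient, so every replica applies the same update and the copies stay identical.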

Expert parallelism is a specialized distributed computing technique designed specifically for Mixture of Experts (MoE) models. Unlike traditional parallelism strategies that distribute computation across all model parameters, expert parallelism leverages the sparse activation pattern of MoE architectures where only a subset of experts are activated for each input token.
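A small sketch of the routing idea follows. The router scores, the top-k value, and the placement of experts on devices are all hypothetical values chosen for illustration.

```python
def top_k_experts(router_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda expert: router_scores[expert],
                    reverse=True)
    return sorted(ranked[:k])


def device_for_expert(expert, experts_per_device):
    """Experts are placed on devices in contiguous blocks."""
    return expert // experts_per_device


# Hypothetical router output for one token over four experts.
scores = [0.6, 0.3, 0.05, 0.05]
active = top_k_experts(scores, k=2)
devices = sorted({device_for_expert(e, experts_per_device=2) for e in active})
print(active, devices)
```

With this placement, the token only touches the device that hosts experts 0 and 1; the other device does no work for it.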

<img src="images/gpu-aggregation.png"
     alt="GPU Aggregation"
     style="width:75%;">

### In this lab

We are going to deploy a larger LLM across both of our GPU-enabled nodes. The deployment needs both GPUs in full to run.

#### 💡 Free up GPU memory to run this exercise

We can stop the vLLM inference model servers that are running in the namespace `llama-serving`.

Browse to the Models > Model deployments page in the Red Hat OpenShift AI web console. **Stop** both model deployments.

![images/model-serving-stop-start.png](images/model-serving-stop-start.png)

⚠️ **Note:** If you need to redeploy these models, simply start them up again. The **order** in which you start them **matters** for correct startup.

Deploy Llama3 first, then DeepSeek second.

We have limited GPU VRAM, so we use vLLM's `gpu_memory_utilization` parameter when loading the models. This works on available GPU memory, so we need to load Llama3 (`gpu_memory_utilization=0.5`) first, then DeepSeek (`gpu_memory_utilization=0.8`) second.


### Configure RWX Storage

In [None]:
!oc login -u admin -p ${ADMIN_PASSWORD} --server=https://api.${BASE_DOMAIN}:6443 --insecure-skip-tls-verify


Login successful.

You have access to 115 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "ai-roadshow".


Check that the **efs-sc** storage class is configured. If it is not, check with your cluster admin!

In [4]:
!oc get sc efs-sc

NAME     PROVISIONER       RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
efs-sc   efs.csi.aws.com   Delete          Immediate           false                  19h


### Download Larger Model for Inference to Storage

For demonstration purposes, let's select a model that we know will not fit on our single 24Gi GPU. Let's try RedHatAI/Mistral-Small-24B-Instruct-2501-FP8-dynamic, a good-quality quantized model that has ~30Gi of safetensors weights and also needs KV cache, so it will definitely not fit on a single GPU.

https://huggingface.co/RedHatAI/Mistral-Small-24B-Instruct-2501-FP8-dynamic/tree/main
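As a quick sanity check of that claim, the sketch below compares the weight size against GPU capacity for one GPU and for two. The 24 and 30 figures come from the text above; the assumption that weights divide evenly across GPUs is a simplification, and the function name is made up for this example.

```python
def per_gpu_weights_gib(total_weights_gib, pipeline_parallel, tensor_parallel):
    """Weights held by each GPU, assuming an even split across all GPUs."""
    return total_weights_gib / (pipeline_parallel * tensor_parallel)


GPU_GIB = 24      # capacity of one GPU, from the text above
WEIGHTS_GIB = 30  # approximate safetensors weight size, from the text above

single = per_gpu_weights_gib(WEIGHTS_GIB, pipeline_parallel=1, tensor_parallel=1)
split = per_gpu_weights_gib(WEIGHTS_GIB, pipeline_parallel=2, tensor_parallel=1)

print(f"1 GPU : {single:.0f} GiB per GPU, fits: {single < GPU_GIB}")
print(f"2 GPUs: {split:.0f} GiB per GPU, fits: {split < GPU_GIB}, "
      f"headroom for KV cache: ~{GPU_GIB - split:.0f} GiB")
```

Splitting across two GPUs leaves room on each card for KV cache, which a single GPU cannot provide.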

In [None]:
!oc new-project kserve-demo

Download the model into our PVC.


In [6]:
%env MODEL_PATH=mistral-small

env: MODEL_PATH=mistral-small


In [19]:
%%bash
oc apply -f- << EOF
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${MODEL_PATH}-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 30Gi
  storageClassName: efs-sc
EOF

persistentvolumeclaim/mistral-small-pvc created


Let's grab a YAML file that will help us download the Hugging Face model to a PVC.

In [8]:
!curl -o download-model-to-pvc.yaml https://raw.githubusercontent.com/eformat/rhoai-policy-collection/refs/heads/main/gitops/applications/model-download/download-model-to-pvc.yaml


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1552  100  1552    0     0   5602      0 --:--:-- --:--:-- --:--:--  5582


Make sure to set your Hugging Face token **HF_TOKEN**

In [None]:
%env PVC_CLAIM_NAME=mistral-small-pvc
%env HF_TOKEN=hf_
%env MODEL=RedHatAI/Mistral-Small-24B-Instruct-2501-FP8-dynamic

Now create the downloader pod

In [38]:
!cat download-model-to-pvc.yaml | envsubst | oc apply -f-

pod/download-model created


Wait until the pod completes successfully (approximately 6-8 minutes).

Follow the logs:

In [21]:
!oc -n kserve-demo logs -c download-model download-model

Collecting pip
  Downloading pip-25.2-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.2-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 122.4 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.2
    Uninstalling pip-24.2:
      Successfully uninstalled pip-24.2
Successfully installed pip-25.2
Collecting huggingface_hub
  Downloading huggingface_hub-0.34.4-py3-none-any.whl.metadata (14 kB)
Collecting filelock (from huggingface_hub)
  Downloading filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting fsspec>=2023.5.0 (from huggingface_hub)
  Downloading fsspec-2025.7.0-py3-none-any.whl.metadata (12 kB)
Collecting packaging>=20.9 (from huggingface_hub)
  Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB)
Collecting pyyaml>=5.1 (from huggingface_hub)
  Downloading PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting re

Wait until the download completes.
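If you would rather not watch the logs, a small polling helper can do the waiting. This is a sketch that assumes you are logged in with `oc` and that the pod is named `download-model` in `kserve-demo`, as above; the helper names, timeout, and interval are arbitrary choices for illustration.

```python
import subprocess
import time


def pod_phase(name, namespace):
    """Return the pod's status.phase by calling `oc get pod -o jsonpath`."""
    result = subprocess.run(
        ["oc", "get", "pod", name, "-n", namespace,
         "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_phase(get_phase, wanted="Succeeded", timeout_s=900,
                   interval_s=15, sleep=time.sleep):
    """Poll get_phase() until it returns `wanted`; False if the timeout passes."""
    waited = 0
    while waited <= timeout_s:
        phase = get_phase()
        if phase == wanted:
            return True
        if phase == "Failed":
            raise RuntimeError("pod entered phase Failed")
        sleep(interval_s)
        waited += interval_s
    return False


# Usage (needs a logged-in `oc` session):
# wait_for_phase(lambda: pod_phase("download-model", "kserve-demo"))
```

Passing the phase lookup in as a function keeps the polling loop easy to try out without a cluster.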

### Create Inference

RHOAI comes with the templates needed to run multi-node, multi-GPU serving. Let's use them to create the ServingRuntime.

In [28]:
!oc process vllm-multinode-runtime-template -n redhat-ods-applications | oc apply -n kserve-demo -f-

servingruntime.serving.kserve.io/vllm-multinode-runtime created


The important part of the template is the GPU Aggregation and Sharing settings.

```yaml
      pipelineParallelSize: 2  # the number of nodes we have
      tensorParallelSize: 1    # the number of GPUs that are available for vLLM on a node
```
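The rule of thumb in those comments can be written down as a tiny helper. This is only a sketch of the arithmetic: the function name is hypothetical, and the total is simply the product of the two sizes, which is how many GPUs the deployment needs overall.

```python
def parallel_sizes(nodes, gpus_per_node):
    """Map cluster shape to the two settings, following the comments above."""
    return {
        "pipelineParallelSize": nodes,        # one pipeline stage per node
        "tensorParallelSize": gpus_per_node,  # split tensors within a node
        "total_gpus": nodes * gpus_per_node,
    }


# This lab: two nodes with one GPU each.
print(parallel_sizes(nodes=2, gpus_per_node=1))
```

For a cluster with, say, two nodes of four GPUs each, the same rule would suggest a pipeline size of 2 and a tensor size of 4.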

Let's create the inference service now.

The template sets hard limits that are excessive for our resources. Trim them so that only requests are set, which gives the pods the Burstable QoS class.

In [29]:
!oc patch servingruntime vllm-multinode-runtime -n kserve-demo --type='json' -p='[{"op": "remove", "path": "/spec/containers/0/resources"}]'

servingruntime.serving.kserve.io/vllm-multinode-runtime patched


In [30]:
!oc patch servingruntime vllm-multinode-runtime -n kserve-demo --type='json' -p='[{"op": "add", "path": "/spec/containers/0/resources", "value": {"requests":{"cpu":"1","memory":"2Gi"}}}]'

servingruntime.serving.kserve.io/vllm-multinode-runtime patched


In [31]:
!oc patch servingruntime vllm-multinode-runtime -n kserve-demo --type='json' -p='[{"op": "remove", "path": "/spec/workerSpec/containers/0/resources"}]'

servingruntime.serving.kserve.io/vllm-multinode-runtime patched


In [32]:
!oc patch servingruntime vllm-multinode-runtime -n kserve-demo --type='json' -p='[{"op": "add", "path": "/spec/workerSpec/containers/0/resources", "value": {"requests":{"cpu":"1","memory":"2Gi"}}}]'

servingruntime.serving.kserve.io/vllm-multinode-runtime patched
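Hand-quoting JSON inside a shell command is easy to get wrong, so one option is to generate the patch with `json.dumps`. The sketch below also folds each remove/add pair into a single JSON Patch `replace` operation, which should have the same effect as long as the `resources` field already exists at that path (the successful `remove` calls above suggest it does). It only prints the commands; the helper name is made up for this example.

```python
import json
import shlex


def resources_patch(path, cpu="1", memory="2Gi"):
    """Build a JSON Patch that replaces `resources` with requests only."""
    operations = [{
        "op": "replace",
        "path": path,
        "value": {"requests": {"cpu": cpu, "memory": memory}},
    }]
    return json.dumps(operations)


paths = (
    "/spec/containers/0/resources",
    "/spec/workerSpec/containers/0/resources",
)
for path in paths:
    command = ["oc", "patch", "servingruntime", "vllm-multinode-runtime",
               "-n", "kserve-demo", "--type=json", "-p", resources_patch(path)]
    # Print rather than execute, so the command can be reviewed first.
    print(shlex.join(command))
```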


In [39]:
%env INFERENCE_NAME=mistral-small
%env MODEL_PATH=mistral-small

env: INFERENCE_NAME=mistral-small
env: MODEL_PATH=mistral-small


In [40]:
%%bash
oc apply -f- << EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: external
  name: ${INFERENCE_NAME}
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-multinode-runtime
      storageUri: pvc://${PVC_CLAIM_NAME}/${MODEL_PATH}
    workerSpec: {}
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
EOF

inferenceservice.serving.kserve.io/mistral-small created


Tail the logs on the inference pod. We should see the safetensors shards loading.

```bash
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:25:53 [ray_distributed_executor.py:357] If certain env vars should NOT be copied to workers, add them to /tmp/.config/vllm/ray_non_carry_over_env_vars.json file
Loading safetensors checkpoint shards:   0% Completed | 0/6 [00:00<?, ?it/s]pid=912) 
Loading safetensors checkpoint shards:  17% Completed | 1/6 [00:10<00:50, 10.05s/it] 
Loading safetensors checkpoint shards:  33% Completed | 2/6 [00:10<00:16,  4.24s/it] 
Loading safetensors checkpoint shards:  50% Completed | 3/6 [00:13<00:10,  3.61s/it] 
Loading safetensors checkpoint shards:  67% Completed | 4/6 [00:20<00:10,  5.01s/it] 
Loading safetensors checkpoint shards:  83% Completed | 5/6 [00:22<00:04,  4.04s/it] 
Loading safetensors checkpoint shards: 100% Completed | 6/6 [00:32<00:00,  6.14s/it] 
Loading safetensors checkpoint shards: 100% Completed | 6/6 [00:32<00:00,  5.46s/it]
```

After some time, the OpenAI-compatible API becomes ready.

```bash
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:28] Available routes are:
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /docs, Methods: HEAD, GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /health, Methods: GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /load, Methods: GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /ping, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /ping, Methods: GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /tokenize, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /detokenize, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/models, Methods: GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /version, Methods: GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/completions, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/embeddings, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /pooling, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /classify, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /score, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/score, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /rerank, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v1/rerank, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /v2/rerank, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /invocations, Methods: POST
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO 07-06 08:28:20 [launcher.py:36] Route: /metrics, Methods: GET
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO:     Started server process [1]
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO:     Waiting for application startup.
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO:     Application startup complete.
mistral-small-predictor-5dbf9cbd8d-gbpbm kserve-container INFO:     10.128.0.218:55526 - "GET /metrics HTTP/1.1" 200 OK
```

Check the pod status.

In [41]:
!oc get pods -o wide

NAME                                           READY   STATUS    RESTARTS   AGE    IP             NODE                                             NOMINATED NODE   READINESS GATES
mistral-small-predictor-587648f5c4-97vf5       0/1     Running   0          2m2s   10.129.0.54    ip-10-0-94-103.ap-southeast-2.compute.internal   <none>           <none>
mistral-small-predictor-worker-dcc588b-95xdc   0/1     Running   0          2m2s   10.128.1.166   ip-10-0-84-186.ap-southeast-2.compute.internal   <none>           <none>


We can also check `nvidia-smi` for GPU VRAM usage stats.

In [43]:
%env DEMO_NAMESPACE=kserve-demo
%env MODEL_NAME=mistral-small

env: DEMO_NAMESPACE=kserve-demo
env: MODEL_NAME=mistral-small


In [46]:
%%bash
podName=$(oc get pod -n $DEMO_NAMESPACE -l app=isvc.$MODEL_NAME-predictor --no-headers|cut -d' ' -f1)
workerPodName=$(oc get pod -n $DEMO_NAMESPACE -l app=isvc.$MODEL_NAME-predictor-worker --no-headers|cut -d' ' -f1)
oc -n $DEMO_NAMESPACE wait --for=condition=ready pod/${podName} --timeout=300s


pod/mistral-small-predictor-587648f5c4-97vf5 condition met


We can see the model loaded across both of our GPU nodes.

In [53]:
%%bash
podName=$(oc get pod -n $DEMO_NAMESPACE -l app=isvc.$MODEL_NAME-predictor --no-headers|cut -d' ' -f1)
workerPodName=$(oc get pod -n $DEMO_NAMESPACE -l app=isvc.$MODEL_NAME-predictor-worker --no-headers|cut -d' ' -f1)
echo "### HEAD NODE GPU Memory Size"
oc -n $DEMO_NAMESPACE exec $podName -c kserve-container -- nvidia-smi
echo "### Worker NODE GPU Memory Size"
oc -n $DEMO_NAMESPACE exec $workerPodName -c worker-container -- nvidia-smi

### HEAD NODE GPU Memory Size
Wed Aug 13 01:55:38 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   34C    P0             92W /  300W |   20751MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
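
To turn that table into a single number, the memory column can be pulled out with a regular expression. This is a small sketch that parses the `used MiB / total MiB` pattern seen in the output above; the function name is made up for this example, and the sample line is copied from the head node output.

```python
import re


def gpu_memory_usage(nvidia_smi_text):
    """Return (used_mib, total_mib) pairs found in nvidia-smi table output."""
    pairs = re.findall(r"(\d+)MiB\s*/\s*(\d+)MiB", nvidia_smi_text)
    return [(int(used), int(total)) for used, total in pairs]


line = "|  0%   34C    P0             92W /  300W |   20751MiB /  23028MiB |      0%      Default |"
used, total = gpu_memory_usage(line)[0]
print(f"{used} MiB of {total} MiB in use ({used / total:.1%})")
```

For the head node above this works out to roughly 90% of the GPU's memory in use.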
                  

Let's create a route so we can test the inference endpoint.

In [54]:
%%bash
oc apply -f- << EOF
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: ${INFERENCE_NAME}
  labels:
    app: isvc.${INFERENCE_NAME}-predictor
    component: predictor
    isvc.generation: "1"
    serving.kserve.io/inferenceservice: ${INFERENCE_NAME}
  annotations:
    openshift.io/host.generated: "true"
spec:
  to:
    kind: Service
    name: ${INFERENCE_NAME}-predictor
    weight: 100
  port:
    targetPort: http
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
  wildcardPolicy: None
EOF

route.route.openshift.io/mistral-small created


Check the endpoint.

In [70]:
%%bash
isvc_url=$(oc get route -n $DEMO_NAMESPACE |grep $MODEL_NAME| awk '{print $2}')

curl -s https://$isvc_url/v1/completions \
   -H "Content-Type: application/json" \
   -d "{
        \"model\": \"$MODEL_NAME\",
        \"prompt\": \"What is the biggest mountain in the world?\",
        \"max_tokens\": 100,
        \"temperature\": 0
    }" | python -m json.tool

{
    "id": "cmpl-b91ccade2c334f46b8d1e28d3079fad1",
    "object": "text_completion",
    "created": 1755050407,
    "model": "mistral-small",
    "choices": [
        {
            "index": 0,
            "text": " The answer is not Mount Everest. The biggest mountain in the world is actually Mauna Kea in Hawaii. While Mount Everest is the highest peak above sea level, Mauna Kea is the tallest when measured from base to peak. Mauna Kea is a dormant volcano that rises about 33,500 feet (10,210 meters) from its base on the ocean floor to its peak, which is 13,796 feet (",
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 10,
        "total_tokens": 110,
        "completion_tokens": 100,
        "prompt_tokens_details": null
    },
    "kv_transfer_params": null
}


Mauna Kea indeed!!
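The same request can be made from Python using only the standard library. This is a sketch: the host name below is a placeholder rather than the real route, the helper names are made up, and a cluster with a self-signed router certificate may need extra TLS configuration before `urlopen` succeeds.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=100, temperature=0):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt):
    """Send the request and return the text of the first choice."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Placeholder host: substitute the host reported by `oc get route`.
request = build_completion_request(
    "https://mistral-small.example.com", "mistral-small",
    "What is the biggest mountain in the world?")
print(request.get_method(), request.full_url)
```

Calling `complete(...)` with the real route host returns the generated text directly, without the surrounding JSON.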

Multi-node vLLM uses [Ray](https://www.ray.io) to distribute the model across multiple nodes and vLLM manages the Ray instance for you.

The multi-node vLLM distribution does not depend on any external Ray instances or RHOAI's distributed training tooling such as KubeRay or CodeFlare.

We can check the Ray status.

In [72]:
%%bash
podName=$(oc get pod -n $DEMO_NAMESPACE -l app=isvc.$MODEL_NAME-predictor --no-headers|cut -d' ' -f1)
oc exec -i pod/${podName} -- /bin/sh -s << EOF
ray status
EOF

Defaulted container "kserve-container" out of: kserve-container, ray-tls-generator (init)


Node status
---------------------------------------------------------------
Active:
 1 node_e05ad693f65a468ac50a6829ae2a508bdbb1f4e5196c3a4c830f6a14
 1 node_622ab5f41cc58269dba5009434f8f1a6989bb824b43ed297054d1338
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/36.0 CPU
 2.0/2.0 GPU (2.0 used of 2.0 reserved in placement groups)
 0B/120.06GiB memory
 0B/15.86GiB object_store_memory

Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)
