# Summary

## GPU-enabled Kubeflow notebook
In this notebook we simulate training a model across multiple compute nodes, each using multiple GPUs, with the TensorFlow framework. We demonstrate this with a convolutional neural network for image classification trained on the CIFAR-10 dataset.

This example was tested on: 
* [Charmed Kubeflow](https://charmed-kubeflow.io/)
* in a [MicroK8s](https://microk8s.io/) Kubernetes cluster
* deployed across multiple NVIDIA DGX nodes, each with 8 GPUs
* using drivers: NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7
* using `tensorflow-gpu==2.10.0` (see `requirements.txt` for other dependencies)

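This example was executed from a Kubeflow Notebook Server with 4 CPUs, 16 GB RAM, and 8 GPUs, but it could run with as few as 2 GPUs, provided the `GPU_VISIBLE_DEVICES` mapping below is changed to reference GPU indices that exist on the machine.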

This is a local demonstration of multi-node, multi-GPU distributed training. It runs two separate instances of the same training process and configures them to coordinate as if they were on separate compute nodes. In a real deployment these processes would typically run on different machines and communicate over the network.

This demo is a modified version of the TensorFlow [Multi-worker training with Keras](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras) tutorial.

# Setup and Helpers

In [1]:
import json
import os
import tensorflow as tf

from gpu_distributed_utilities import build_and_compile_cnn_model, get_cifar10_dataset

2023-02-24 14:35:24.130201: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-24 14:35:24.283509: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-02-24 14:35:25.040054: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvrtc.so.11.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-24 14:35:25.040215: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load d

Distributed training is configured via the `TF_CONFIG` environment variable. We clear it here and set it explicitly later.

In [2]:
os.environ.pop('TF_CONFIG', None)

To simulate multi-node training, each instance sees only a subset of the GPUs on this node.

In [3]:
GPU_VISIBLE_DEVICES = {
    0: "0,1",
    1: "6,7"
}
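
The mapping above is written by hand. As a minimal sketch, assuming GPU indices are contiguous and start at 0, a mapping like it could be generated and checked for overlap as follows. The helper names `partition_gpus` and `check_disjoint` are hypothetical and not part of the original notebook.

```python
def partition_gpus(num_gpus, num_workers):
    """Split GPU indices 0..num_gpus-1 into contiguous, equal-sized groups.

    Returns a dict in the same shape as GPU_VISIBLE_DEVICES:
    worker index -> comma-separated GPU indices.
    """
    if num_gpus <= 0 or num_workers <= 0 or num_gpus % num_workers != 0:
        raise ValueError("num_gpus must be a positive multiple of num_workers")
    per_worker = num_gpus // num_workers
    return {
        worker: ",".join(
            str(gpu) for gpu in range(worker * per_worker, (worker + 1) * per_worker)
        )
        for worker in range(num_workers)
    }


def check_disjoint(mapping):
    """Raise ValueError if any GPU index is assigned to more than one worker."""
    seen = set()
    for worker, devices in mapping.items():
        ids = set(devices.split(","))
        overlap = seen & ids
        if overlap:
            raise ValueError(f"worker {worker} reuses GPUs {sorted(overlap)}")
        seen |= ids


# The hand-written mapping from this notebook passes the check.
check_disjoint({0: "0,1", 1: "6,7"})
```

Keeping the groups disjoint matters here because both simulated workers run on the same physical machine.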

Helpers for setting up the separate node environments below:

In [4]:
def emit_tf_config(role_id):
    tf_config = {
        'cluster': {
            # This defines two workers in the group, each reachable on localhost
            'worker': ['localhost:12345', 'localhost:23456']
        },
        # This defines the type and index of this particular worker
        'task': {'type': 'worker', 'index': role_id}
    }
    os.environ['TF_CONFIG'] = json.dumps(tf_config)

def emit_visible_devices(role_id):
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = GPU_VISIBLE_DEVICES[role_id]
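
The helpers above modify the notebook kernel's own environment, which every process launched afterwards inherits. As an alternative sketch, the same settings can be built into a separate dictionary and handed to a single child process, leaving the kernel environment untouched. The function `build_worker_env` and its parameters are hypothetical names introduced for illustration.

```python
import json
import os


def build_worker_env(role_id, workers, visible_devices):
    """Return a copy of the current environment configured for one worker.

    workers:          list of "host:port" strings, one per worker
    visible_devices:  dict of worker index -> CUDA_VISIBLE_DEVICES string
    """
    env = dict(os.environ)  # copy, so the caller's environment is not changed
    env["TF_CONFIG"] = json.dumps({
        "cluster": {"worker": list(workers)},
        "task": {"type": "worker", "index": role_id},
    })
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = visible_devices[role_id]
    return env


env = build_worker_env(
    role_id=1,
    workers=["localhost:12345", "localhost:23456"],
    visible_devices={0: "0,1", 1: "6,7"},
)
print(env["CUDA_VISIBLE_DEVICES"])
print(env["TF_CONFIG"])
```

The resulting dictionary can be passed as the `env` argument of `subprocess.Popen` when launching the worker script.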

Kill any previous background processes, just in case...

In [5]:
%killbgscripts

All background processes were killed.


# Deploy the training workload

Below we deploy two workers, both defined by the included script `gpu_distributed_worker.py`. In essence, that script does the following:

```python
def main(epochs, batch_size):
    # Get some data
    X_train_scaled, y_train_encoded, X_test_scaled, y_test_encoded = get_cifar10_dataset()

    # Set up a CNN model using the tf.distribute.MultiWorkerMirroredStrategy() scope, which tells the 
    # model to coordinate with other nodes
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = build_and_compile_cnn_model()

    # Fit the model, as per some settings
    model.fit(x=X_train_scaled, y=y_train_encoded, epochs=epochs, batch_size=batch_size)
```

The model setup and data loading helpers are imported from `gpu_distributed_utilities.py`.

Executing `gpu_distributed_worker.py` results in a worker that has data, knows how to coordinate with other workers (inferred from the `TF_CONFIG` environment variable, as shown below), and is waiting for everyone to be ready. Once all workers are online, training begins.
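
The worker script is launched later with `--batch-size` and `--epochs` flags. The actual argument handling is not shown in this notebook; the following is only a sketch of how such an entry point could parse those two flags with `argparse`. The default values and help strings are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags used when launching the worker in this notebook."""
    parser = argparse.ArgumentParser(
        description="Distributed training worker (illustrative sketch)"
    )
    # Defaults are assumptions chosen to match the commands used in this notebook.
    parser.add_argument("--batch-size", type=int, default=256,
                        help="batch size passed to model.fit")
    parser.add_argument("--epochs", type=int, default=10,
                        help="number of training epochs")
    return parser.parse_args(argv)


args = parse_args(["--batch-size=256", "--epochs", "10"])
print(args.batch_size, args.epochs)
```

`argparse` accepts both the `--flag=value` and `--flag value` forms, which is why the mixed style in the launch command works.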

## Start the chief worker

The 0th worker is the chief, and often takes on additional tasks like coordination between workers or saving checkpoints.  In our case, the chief has no additional roles.
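
If the chief did take on extra duties such as saving checkpoints, each process would need to decide whether it is the chief. A minimal sketch of that decision, read directly from `TF_CONFIG`, is shown below. The function `is_chief` is a hypothetical helper; its rule follows the convention described in the TensorFlow multi-worker tutorial, as far as we understand it.

```python
import json
import os


def is_chief(tf_config_json=None):
    """Decide whether this process is the chief, based on TF_CONFIG.

    If tf_config_json is None, the TF_CONFIG environment variable is read.
    """
    if tf_config_json is None:
        tf_config_json = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(tf_config_json)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    task_type = task.get("type")
    task_index = task.get("index")

    if task_type is None:
        # No distributed configuration: a single process is its own chief.
        return True
    if task_type == "chief":
        return True
    # Without a dedicated "chief" job, worker 0 takes the chief role.
    return task_type == "worker" and task_index == 0 and "chief" not in cluster


cluster = {"worker": ["localhost:12345", "localhost:23456"]}
for index in (0, 1):
    cfg = json.dumps({"cluster": cluster, "task": {"type": "worker", "index": index}})
    print(index, is_chief(cfg))
```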

Set the `TF_CONFIG` and visible-device environment variables so that the process we launch inherits this specific configuration.

In [6]:
role_id = 0  # Chief

emit_tf_config(role_id)
emit_visible_devices(role_id)

We can confirm these were set properly:

In [7]:
print(f"CUDA_VISIBLE_DEVICES = {os.environ['CUDA_VISIBLE_DEVICES']}")
print(f"TF_CONFIG = {os.environ['TF_CONFIG']}")

CUDA_VISIBLE_DEVICES = 0,1
TF_CONFIG = {"cluster": {"worker": ["localhost:12345", "localhost:23456"]}, "task": {"type": "worker", "index": 0}}


Now execute this chief worker in the background:

In [8]:
%%bash --bg -s "$role_id"
echo working on job_$1
python gpu_distributed_worker.py --batch-size=256 --epochs 10 &> job_$1.log

And if we `tail -f job_0.log`, we should see something like:

```
tail -f job_0.log

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-23 22:02:00.803116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38406 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:07:00.0, compute capability: 8.0
2023-02-23 22:02:00.805756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 38406 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:0f:00.0, compute capability: 8.0
2023-02-23 22:02:00.841467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:worker/replica:0/task:0/device:GPU:0 with 38406 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:07:00.0, compute capability: 8.0
2023-02-23 22:02:00.843643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:worker/replica:0/task:0/device:GPU:1 with 38406 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:0f:00.0, compute capability: 8.0
2023-02-23 22:02:00.879935: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12345, 1 -> localhost:23456}
2023-02-23 22:02:00.880131: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12345, 1 -> localhost:23456}
2023-02-23 22:02:00.881226: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:438] Started server with target: grpc://localhost:12345
2023-02-23 22:02:00.884138: I tensorflow/core/distributed_runtime/coordination/coordination_service.cc:526] /job:worker/replica:0/task:0 has connected to coordination service. Incarnation: 4418798052974762676
2023-02-23 22:02:00.890014: I tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:281] Coordination agent has successfully connected.
```

Here the worker has started its gRPC server, connected to the coordination service, and is waiting for the others.
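
Instead of watching the log by hand, the notebook could poll it from Python until a known message appears. This is a sketch only: `wait_for_log_line` is a hypothetical helper, and the message it looks for is taken from the sample log above, so it may differ in other TensorFlow versions.

```python
import time


def wait_for_log_line(path, needle, timeout=60.0, poll_interval=0.5):
    """Poll a log file until a line containing `needle` appears.

    Returns the matching line (without its trailing newline), or None if the
    timeout expires first. A missing file is treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path, "r", errors="replace") as log:
                for line in log:
                    if needle in line:
                        return line.rstrip("\n")
        except FileNotFoundError:
            pass
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

For example, `wait_for_log_line("job_0.log", "Coordination agent has successfully connected")` would return once the chief reports that it has connected, or `None` after 60 seconds.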

## Start the additional worker(s)

We repeat the same task as above, except we pass a different `role_id` to the process.

In [9]:
role_id = 1  # Worker

emit_tf_config(role_id)
emit_visible_devices(role_id)

We can confirm these were set properly:

In [10]:
print(f"CUDA_VISIBLE_DEVICES = {os.environ['CUDA_VISIBLE_DEVICES']}")
print(f"TF_CONFIG = {os.environ['TF_CONFIG']}")

CUDA_VISIBLE_DEVICES = 6,7
TF_CONFIG = {"cluster": {"worker": ["localhost:12345", "localhost:23456"]}, "task": {"type": "worker", "index": 1}}


This shows the worker index of 1 and the updated visible CUDA devices.

Now execute this worker in the background:

In [11]:
%%bash --bg -s "$role_id"
echo working on job_$1
python gpu_distributed_worker.py --batch-size=256 --epochs 10 &> job_$1.log

If we `tail -f job_1.log`, we should see this second worker set up like the first, and then see both workers start training the model.

With training running, we should also be able to see activity in the GPUs using the `nvidia-smi` tool.  For example:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   30C    P0    79W / 400W |  39872MiB / 40536MiB |      8%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   30C    P0    68W / 400W |  39872MiB / 40536MiB |      6%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   29C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   33C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   32C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   34C    P0    82W / 400W |  39872MiB / 40536MiB |      7%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   37C    P0    75W / 400W |  39872MiB / 40536MiB |      7%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
```

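Here GPUs 0, 1, 6, and 7 show memory in use and non-zero utilization, matching the devices assigned to the two workers, while GPUs 2 through 5 remain idle.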
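
The same check can be scripted using the CSV query mode of `nvidia-smi`. The sketch below assumes every queried field is numeric (some configurations report `[N/A]`, which this parser does not handle). The helper names are hypothetical, and the sample text is transcribed from the table above.

```python
import subprocess

# Query index, utilization (%) and used memory (MiB) as plain CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'index, util, mem' lines into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "util_pct": int(util), "mem_used_mib": int(mem)})
    return rows


def busy_gpus(rows, min_mem_mib=1):
    """Return the indices of GPUs with at least min_mem_mib of memory in use."""
    return [row["index"] for row in rows if row["mem_used_mib"] >= min_mem_mib]


def query_gpus():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_gpu_csv(result.stdout)


# Values transcribed from the nvidia-smi table shown above.
sample = """0, 8, 39872
1, 6, 39872
2, 0, 0
3, 0, 0
4, 0, 0
5, 0, 0
6, 7, 39872
7, 7, 39872"""
print(busy_gpus(parse_gpu_csv(sample)))
```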

This demonstrates how we can coordinate multiple processes using GPUs to train the same model.  To extend this to multiple nodes, all we need to do is update the `TF_CONFIG` environment variables to add workers from different machines.
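
As a sketch of that change, the `TF_CONFIG` value for each machine could be generated from a shared list of hosts. The hostnames `node-a` and `node-b`, the function `make_tf_config`, and the choice of a single common port are assumptions made for illustration.

```python
import json


def make_tf_config(hosts, index, port=12345):
    """Build the TF_CONFIG JSON string for the worker at position `index`.

    Every machine uses the same `hosts` list in the same order; only `index`
    differs from one machine to the next.
    """
    if not 0 <= index < len(hosts):
        raise ValueError(f"index {index} is out of range for {len(hosts)} hosts")
    return json.dumps({
        "cluster": {"worker": [f"{host}:{port}" for host in hosts]},
        "task": {"type": "worker", "index": index},
    })


hosts = ["node-a", "node-b"]  # hypothetical hostnames
for index in range(len(hosts)):
    print(make_tf_config(hosts, index))
```

On separate machines each worker can normally use all of its local GPUs, so the `CUDA_VISIBLE_DEVICES` restriction used in this single-node simulation would no longer be needed.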