# NERSC Cluster Deploy Tutorial: Tuning Hyperparameters of a Distributed TensorFlow Model using Ray Train & Tune

📖 [Back to Table of Contents](../README.md)<br>
⬅ [Previous notebook](./ex_01_pytorch_ray_train_tune.ipynb) <br>

----


## Introduction

This notebook runs an example Ray Train & Tune workload: tuning the hyperparameters of a distributed TensorFlow model. It follows the code in this Ray example:
https://docs.ray.io/en/latest/train/examples/tf/tune_tensorflow_mnist_example.html

> **Note**:
> To set up the environment for this notebook, run `./setup.sh 2` on the command line, then select the `tensorflow-2.9.0` kernel in the notebook.

The Ray cluster is set up using the NERSC TensorFlow module and deployed on Perlmutter.



# Starting Ray Cluster

## Superfacility API

To deploy the Ray cluster via the NERSC Superfacility API you require a valid API client. 

To create a valid client visit your profile page in [Iris](https://iris.nersc.gov/):

<img src="img/iris_profile_header.png" width="800" />

Then scroll down to the **Superfacility API Clients** section and click the "+ New Client" button which will produce this window:

<img src="img/new_sf_api_client.png" width="400" />


To submit and deploy a Ray cluster we require the highest security level (<span style="color:red">RED</span>). **[This client id is valid for 2 days]**

Once the client is created, save the `client_id` string and the `private_key` dictionary (you can also save the private key in PEM format) for use with the `SuperfacilityAPI` library.

> **Note**:
> This step only needs to be repeated if your client has expired.


For more information about the NERSC Superfacility API, see the [documentation](https://docs.nersc.gov/services/sfapi/).
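The `load_secrets` helper used in the next cell comes from this tutorial's `utility` module, and its implementation is not shown here. As a rough sketch only, a helper of that kind might read the saved credentials from disk. The function name `load_secrets_sketch` and the file names `client_id.txt` and `private_key.json` are hypothetical choices for illustration, not part of the tutorial's code.

```python
import json
from pathlib import Path


def load_secrets_sketch(secrets_dir):
    """Read a client ID and a private-key dictionary from a directory.

    Hypothetical layout (not defined by the tutorial):
      <secrets_dir>/client_id.txt    -> client ID string
      <secrets_dir>/private_key.json -> private key saved as a JSON object
    """
    secrets_dir = Path(secrets_dir)
    # Strip the trailing newline that editors usually add to text files.
    client_id = (secrets_dir / "client_id.txt").read_text().strip()
    # The private key is parsed into a plain Python dictionary.
    private_key = json.loads((secrets_dir / "private_key.json").read_text())
    return client_id, private_key
```

Whatever storage layout you choose, keep the credential files out of version control and readable only by your own user.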

In [1]:
from SuperfacilityAPI import SuperfacilityAPI, SuperfacilityAccessToken
from utility import load_secrets

# Replace with your client id string and private key dictionary
client_id, private_key = load_secrets()
# client_id = "<your client id string>"
# private_key = "<your private key dict>"

api_key = SuperfacilityAccessToken(
    client_id = client_id,
    private_key = private_key
)
sfp_api = SuperfacilityAPI(api_key)

## Creating Ray Cluster

To create a Ray cluster on NERSC compute nodes, call the `deploy_ray_cluster` function with your desired Slurm `sbatch` options.

In [2]:
from nersc_cluster_deploy import deploy_ray_cluster
from utility import user_account

slurm_options = {
    'qos': 'debug',
    'account': user_account(),
    'nodes': '2',
    't': '00:30:00'
}
site = 'perlmutter'
module_load = 'tensorflow/2.9.0'

job = deploy_ray_cluster(
    sfp_api,
    slurm_options,
    site,
    job_setup = [f'module load {module_load}']
)
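How `deploy_ray_cluster` turns the `slurm_options` dictionary into `sbatch` directives is not shown in this notebook. The sketch below illustrates one plausible convention, stated here as an assumption: one-letter keys such as `t` become short flags and longer keys such as `qos` become long flags. The function `render_sbatch_directives` is hypothetical and is not part of `nersc_cluster_deploy`.

```python
def render_sbatch_directives(options):
    """Render a dict of Slurm options as '#SBATCH' lines.

    Assumed convention (not confirmed by nersc_cluster_deploy):
    one-letter keys are short flags, longer keys are long flags.
    """
    lines = []
    for key, value in options.items():
        if len(key) == 1:
            lines.append(f"#SBATCH -{key} {value}")
        else:
            lines.append(f"#SBATCH --{key}={value}")
    return lines


# Example values mirroring the options used in this notebook.
example_options = {"qos": "debug", "nodes": "2", "t": "00:30:00"}
print("\n".join(render_sbatch_directives(example_options)))
```

Seeing the options in this form can help when comparing the notebook settings with a batch script you already use on Perlmutter.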

In [3]:
job

{'error': None, 'jobid': '5906487', 'task_id': '11934'}
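The response above is a dictionary with `error`, `jobid`, and `task_id` keys. Before continuing, it can be worth checking that the submission actually succeeded. The helper below is a minimal sketch; `require_job_id` is a hypothetical name, and the check relies only on the keys visible in the output above.

```python
def require_job_id(response):
    """Return the job ID from a submission response, or raise if it failed."""
    error = response.get("error")
    if error is not None:
        raise RuntimeError(f"Job submission failed: {error}")
    jobid = response.get("jobid")
    if not jobid:
        raise RuntimeError(f"No job ID in response: {response}")
    return jobid


# Example using the response shown above.
print(require_job_id({"error": None, "jobid": "5906487", "task_id": "11934"}))
```

Failing early here avoids confusing errors later, for example when looking up a job log for a job that was never queued.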

Now that the job has been submitted, check its status:

In [4]:
import os
import pandas as pd
sqs_table = sfp_api.get_jobs(site=site, user=os.getlogin(), sacct=False)
sqs_df = pd.DataFrame(sqs_table['output'])
sqs_df

| | account | tres_per_node | min_cpus | min_tmp_disk | end_time | features | group | over_subscribe | jobid | name | ... | partition | nodelist(reason) | start_time | state | uid | submit_time | licenses | core_spec | schednodes | work_dir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dasrepo_g | | 128 | 0 | 2023-03-03T20:04:04 | gpu&a100&hbm40g | 75235 | NO | 5906487 | sbatch | ... | gpu_ss11 | nid[001084,002408] | 2023-03-03T19:34:04 | RUNNING | 75235 | 2023-03-03T19:34:03 | u2:1 | | (null) | /global/u2/a/asnaylor |
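A job in the `debug` queue may sit in a pending state before it reaches `RUNNING`, so the status check often has to be repeated. The sketch below polls until the job reaches a target state. The function `wait_for_job_state` and its parameters are hypothetical; it takes any zero-argument callable that returns job records with `jobid` and `state` keys, like those in the table above.

```python
import time


def wait_for_job_state(fetch_jobs, jobid, target="RUNNING",
                       poll_seconds=10.0, timeout=600.0):
    """Poll fetch_jobs() until the job reaches `target`, or raise on timeout.

    fetch_jobs: zero-argument callable returning a list of job records
    (dicts with 'jobid' and 'state' keys).
    """
    deadline = time.monotonic() + timeout
    while True:
        for record in fetch_jobs():
            same_job = str(record.get("jobid")) == str(jobid)
            if same_job and record.get("state") == target:
                return record
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job {jobid} did not reach {target} within {timeout} s"
            )
        time.sleep(poll_seconds)
```

In this notebook, `fetch_jobs` could be `lambda: sfp_api.get_jobs(site=site, user=os.getlogin(), sacct=False)['output']`, which is the same call used in the status cell.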


Check job log

In [13]:
!cat ~/slurm-{job['jobid']}.out

In case of issues, please refer to our known issues: https://docs.nersc.gov/current/
and open a help ticket if your issue is not listed: https://help.nersc.gov/
[slurm] - Starting ray HEAD
2023-03-03 19:34:09,561	INFO usage_lib.py:435 -- Usage stats collection is disabled.
2023-03-03 19:34:09,561	INFO scripts.py:710 -- Local node IP: nid001084
2023-03-03 19:34:12,083	SUCC scripts.py:747 -- --------------------
2023-03-03 19:34:12,083	SUCC scripts.py:748 -- Ray runtime started.
2023-03-03 19:34:12,083	SUCC scripts.py:749 -- --------------------
2023-03-03 19:34:12,083	INFO scripts.py:751 -- Next steps
2023-03-03 19:34:12,083	INFO scripts.py:752 -- To connect to this Ray runtime from another node, run
2023-03-03 19:34:12,083	INFO scripts.py:755 --   ray start --address='nid001084:6379'
2023-03-03 19:34:12,083	INFO scripts.py:771 -- Alternatively, use the following Python code:
2023-03-03 19:34:12,084	INFO scripts.py:773 
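The log shows the head node address in the `ray start --address=...` line. The next section uses `get_ray_cluster_address` to obtain it, which is the approach this tutorial relies on. Purely as an illustration of where that information appears, the sketch below extracts the address from log text with a regular expression; `find_head_address` is a hypothetical helper.

```python
import re

# Matches the connection hint that `ray start --head` prints to the log.
ADDRESS_PATTERN = re.compile(r"ray start --address='([^']+)'")


def find_head_address(log_text):
    """Return the 'host:port' from a Ray head-node log, or None if absent."""
    match = ADDRESS_PATTERN.search(log_text)
    return match.group(1) if match else None


sample = "INFO scripts.py:755 --   ray start --address='nid001084:6379'"
print(find_head_address(sample))  # nid001084:6379
```

If this returns `None`, the head node has probably not finished starting, and the log is worth re-reading a few seconds later.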

## Connect to Ray Cluster

Get the IP address of the Ray cluster head node, then connect to the cluster:

In [7]:
from nersc_cluster_deploy import get_ray_cluster_address
import ray

cluster_address = get_ray_cluster_address(
    sfp_api,
    job['jobid'],
    site
)
ray.init(cluster_address)

Python version: 3.9.15
Ray version:    2.3.0
Dashboard:      http://127.0.0.1:8265


Check all nodes connected to cluster

In [12]:
from nersc_cluster_deploy import ray_cluster_summary

ray_cluster_summary()

Cluster Summary
---------------
Nodes: 2
CPU:   256
GPU:   8
RAM:   307.8 GB


## Set Up TensorFlow Model

In [14]:
from ray import tune
from ray.train.tensorflow import TensorflowTrainer
from ray.air.config import ScalingConfig

from ray.train.examples.tf.tensorflow_mnist_example import train_func
from ray.tune.tune_config import TuneConfig
from ray.tune.tuner import Tuner

In [15]:
def tune_tensorflow_mnist(
    num_workers: int = 2, num_samples: int = 2, use_gpu: bool = False
):
    """Tune the learning rate and batch size of the distributed MNIST example."""
    # Each trial runs train_func on `num_workers` distributed workers.
    trainer = TensorflowTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    )
    # Sample `num_samples` configurations and rank them by accuracy.
    tuner = Tuner(
        trainer,
        tune_config=TuneConfig(num_samples=num_samples, metric="accuracy", mode="max"),
        param_space={
            "train_loop_config": {
                "lr": tune.loguniform(1e-4, 1e-1),
                "batch_size": tune.choice([32, 64, 128]),
                "epochs": 3,
            }
        },
    )
    results = tuner.fit()
    # Note: despite the wording, this prints the best trial's accuracy value.
    best_accuracy = results.get_best_result().metrics["accuracy"]
    print(f"Best accuracy config: {best_accuracy}")
    return results
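The search space draws the learning rate log-uniformly between `1e-4` and `1e-1` and picks the batch size from three choices. To make that concrete, the sketch below imitates the sampling with the Python standard library. It is a stand-in for illustration, not Ray Tune's implementation, and `sample_loguniform` is a hypothetical helper.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for _ in range(3):
    config = {
        "lr": sample_loguniform(1e-4, 1e-1, rng),
        "batch_size": rng.choice([32, 64, 128]),
        "epochs": 3,
    }
    print(config)
```

Log-uniform sampling spreads trials evenly across orders of magnitude, which is usually what you want for a learning rate: values near `1e-4` are as likely as values near `1e-2`.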

## Train Model

In [16]:
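# Total resources across every node in the Ray cluster
node_resources = ray.cluster_resources()
# One training worker per GPU, so each trial spans all GPUs in the cluster
num_workers = int(node_resources['GPU'])
use_gpu = True

# Number of hyperparameter configurations to sample
num_samples = 2
smoke_test = False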
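Because `num_workers` is set to the total number of GPUs, each trial requests every GPU in the cluster, so trials run one after another rather than side by side. The sketch below estimates how many trials could hold their GPUs at once for a given split. It assumes one GPU per worker and ignores CPU and memory limits; `max_concurrent_trials` is a hypothetical helper.

```python
def max_concurrent_trials(total_gpus, workers_per_trial, gpus_per_worker=1):
    """Upper bound on trials that can hold their GPUs at the same time."""
    gpus_per_trial = workers_per_trial * gpus_per_worker
    if gpus_per_trial <= 0:
        raise ValueError("each trial must request at least one GPU")
    return total_gpus // gpus_per_trial


print(max_concurrent_trials(total_gpus=8, workers_per_trial=8))  # 1
print(max_concurrent_trials(total_gpus=8, workers_per_trial=4))  # 2
```

Using fewer workers per trial trades per-trial speed for more trials in flight, which can matter when `num_samples` is large.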

In [17]:
results = tune_tensorflow_mnist(
            num_workers=num_workers,
            num_samples=num_samples,
            use_gpu=use_gpu,
)

Current time: 2023-03-03 19:36:04
Running for:  00:00:56.40
Memory:       58.5/251.3 GiB

| Trial name | status | loc | train_loop_config/batch_size | train_loop_config/lr | iter | total time (s) | loss | accuracy | _timestamp |
|---|---|---|---|---|---|---|---|---|---|
| TensorflowTrainer_8f7c2_00000 | TERMINATED | pid=77617 | 64 | 0.00248349 | 3 | 24.3731 | 1.7592 | 0.641936 | 1677900935 |
| TensorflowTrainer_8f7c2_00001 | TERMINATED | 128.55.66.148:8644 | 32 | 0.000111707 | 3 | 18.1692 | 2.29254 | 0.116016 | 1677900960 |


[2m[36m(RayTrainWorker pid=4550, ip=128.55.66.148)[0m 2023-03-03 19:35:18.950723: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
[2m[36m(RayTrainWorker pid=4550, ip=128.55.66.148)[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2m[36m(RayTrainWorker pid=77896)[0m 2023-03-03 19:35:18.988012: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
[2m[36m(RayTrainWorker pid=77896)[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2m[36m(RayTrainWorker pid=77894)[0m 2023-03-03 19:35:19.019755: I tensorflow/core/platform/cpu_feature_guard.cc:193] T

[2m[36m(RayTrainWorker pid=4550, ip=128.55.66.148)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=4549, ip=128.55.66.148)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=4548, ip=128.55.66.148)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=4551, ip=128.55.66.148)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=77897)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=77896)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=77894)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=77895)[0m Epoch 1/3


[2m[36m(RayTrainWorker pid=4550, ip=128.55.66.148)[0m 2023-03-03 19:35:27.075037: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=4549, ip=128.55.66.148)[0m 2023-03-03 19:35:27.085144: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=4548, ip=128.55.66.148)[0m 2023-03-03 19:35:27.070050: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=4551, ip=128.55.66.148)[0m 2023-03-03 19:35:27.096142: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=77897)[0m 2023-03-03 19:35:27.051567: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=77896)[0m 2023-03-03 19:35:27.057737: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=77894)[0m 2023-03-03 19:35:27.04

 1/70 [..............................] - ETA: 7:23 - loss: 2.3114 - accuracy: 0.0996
 1/70 [..............................] - ETA: 7:23 - loss: 2.3114 - accuracy: 0.0996
 1/70 [..............................] - ETA: 7:23 - loss: 2.3114 - accuracy: 0.0996
 1/70 [..............................] - ETA: 7:23 - loss: 2.3114 - accuracy: 0.0996
 1/70 [..............................] - ETA: 7:26 - loss: 2.3114 - accuracy: 0.0996
 1/70 [..............................] - ETA: 7:26 - loss: 2.3114 - accuracy: 0.0996
 1/70 [..............................] - ETA: 7:26 - loss: 2.3114 - accuracy: 0.0996
 1/70 [..............................] - ETA: 7:25 - loss: 2.3114 - accuracy: 0.0996
 3/70 [>.............................] - ETA: 2s - loss: 2.3143 - accuracy: 0.0977  
 3/70 [>.............................] - ETA: 2s - loss: 2.3143 - accuracy: 0.0977  
 3/70 [>.............................] - ETA: 2s - loss: 2.3143 - accuracy: 0.0977  
 3/70 [>.............................] - ETA: 2s - loss: 2.3143 -

[2m[36m(RayTrainWorker pid=77897)[0m Exception ignored in: <function Pool.__del__ at 0x151d27b3f790>
[2m[36m(RayTrainWorker pid=77897)[0m Traceback (most recent call last):
[2m[36m(RayTrainWorker pid=77897)[0m   File "/global/common/software/nersc/pm-2022q4/sw/tensorflow/2.9.0/lib/python3.9/multiprocessing/pool.py", line 268, in __del__
[2m[36m(RayTrainWorker pid=77897)[0m     self._change_notifier.put(None)
[2m[36m(RayTrainWorker pid=77897)[0m   File "/global/common/software/nersc/pm-2022q4/sw/tensorflow/2.9.0/lib/python3.9/multiprocessing/queues.py", line 377, in put
[2m[36m(RayTrainWorker pid=77897)[0m     self._writer.send_bytes(obj)
[2m[36m(RayTrainWorker pid=77897)[0m   File "/global/common/software/nersc/pm-2022q4/sw/tensorflow/2.9.0/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
[2m[36m(RayTrainWorker pid=77897)[0m     self._send_bytes(m[offset:offset + size])
[2m[36m(RayTrainWorker pid=77897)[0m   File "/global/common/software/

[2m[36m(RayTrainWorker pid=82925)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=8896, ip=128.55.66.148)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=8897, ip=128.55.66.148)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=8895, ip=128.55.66.148)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=8898, ip=128.55.66.148)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=82922)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=82924)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=82923)[0m Epoch 1/3


[2m[36m(RayTrainWorker pid=82925)[0m 2023-03-03 19:35:52.704582: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=82923)[0m 2023-03-03 19:35:52.709968: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=82924)[0m 2023-03-03 19:35:52.725927: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=8896, ip=128.55.66.148)[0m 2023-03-03 19:35:52.743536: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=8897, ip=128.55.66.148)[0m 2023-03-03 19:35:52.701424: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=8895, ip=128.55.66.148)[0m 2023-03-03 19:35:52.743394: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=8898, ip=128.55.66.148)[0m 2023-03-03 19:35:52.72

 1/70 [..............................] - ETA: 4:41 - loss: 2.3077 - accuracy: 0.0859
 1/70 [..............................] - ETA: 4:41 - loss: 2.3077 - accuracy: 0.0859
 1/70 [..............................] - ETA: 4:43 - loss: 2.3077 - accuracy: 0.0859
 1/70 [..............................] - ETA: 4:42 - loss: 2.3077 - accuracy: 0.0859
 3/70 [>.............................] - ETA: 2s - loss: 2.3137 - accuracy: 0.0625  
 3/70 [>.............................] - ETA: 2s - loss: 2.3137 - accuracy: 0.0625  
 3/70 [>.............................] - ETA: 2s - loss: 2.3137 - accuracy: 0.0625  
 3/70 [>.............................] - ETA: 2s - loss: 2.3137 - accuracy: 0.0625  
 5/70 [=>............................] - ETA: 2s - loss: 2.3128 - accuracy: 0.0617
 5/70 [=>............................] - ETA: 2s - loss: 2.3128 - accuracy: 0.0617
 5/70 [=>............................] - ETA: 2s - loss: 2.3128 - accuracy: 0.0617
 5/70 [=>............................] - ETA: 2s - loss: 2.3128 - accur

[2m[36m(RayTrainWorker pid=82924)[0m Exception ignored in: <function Pool.__del__ at 0x145a2c069790>
[2m[36m(RayTrainWorker pid=82924)[0m Traceback (most recent call last):
[2m[36m(RayTrainWorker pid=82924)[0m   File "/global/common/software/nersc/pm-2022q4/sw/tensorflow/2.9.0/lib/python3.9/multiprocessing/pool.py", line 268, in __del__
[2m[36m(RayTrainWorker pid=82924)[0m     self._change_notifier.put(None)
[2m[36m(RayTrainWorker pid=82924)[0m   File "/global/common/software/nersc/pm-2022q4/sw/tensorflow/2.9.0/lib/python3.9/multiprocessing/queues.py", line 377, in put
[2m[36m(RayTrainWorker pid=82924)[0m     self._writer.send_bytes(obj)
[2m[36m(RayTrainWorker pid=82924)[0m   File "/global/common/software/nersc/pm-2022q4/sw/tensorflow/2.9.0/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
[2m[36m(RayTrainWorker pid=82924)[0m     self._send_bytes(m[offset:offset + size])
[2m[36m(RayTrainWorker pid=82924)[0m   File "/global/common/software/

Best accuracy config: 0.6419363617897034


[2m[36m(TunerInternal pid=77000)[0m 2023-03-03 19:36:04,602	INFO tune.py:798 -- Total run time: 56.46 seconds (56.40 seconds for the tuning loop).


In [18]:
log_dir = str(results.get_best_result().log_dir)
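`get_best_result()` selects a trial according to the `metric="accuracy"` and `mode="max"` settings passed to `TuneConfig`. The sketch below reproduces that selection in plain Python on the two trials reported in the results table above. It is an illustration of the ranking logic only, not a call into the Ray API, and the dictionary keys are chosen here for readability.

```python
# Values copied from the trial table printed by the tuning run.
trials = [
    {"trial": "TensorflowTrainer_8f7c2_00000", "batch_size": 64,
     "lr": 0.00248349, "accuracy": 0.641936},
    {"trial": "TensorflowTrainer_8f7c2_00001", "batch_size": 32,
     "lr": 0.000111707, "accuracy": 0.116016},
]

# mode="max" on metric="accuracy" means the highest accuracy wins.
best = max(trials, key=lambda t: t["accuracy"])
print(best["trial"], best["accuracy"])
```

With only two samples and three epochs, the gap between the trials mostly reflects the learning rate: the lower rate barely moves the model away from chance-level accuracy.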

## Close cluster connection and stop job

In [19]:
ray.shutdown()

In [20]:
sfp_api.delete_job(site, job['jobid'])

{'task_id': '0', 'status': 'OK', 'error': None}

## Explore Training in TensorBoard

In [26]:
import nersc_tensorboard_helper
%load_ext tensorboard

In [27]:
%tensorboard --logdir $log_dir --port 0

In [28]:
nersc_tensorboard_helper.tb_address()