# NERSC Cluster Deploy Tutorial: Tuning Hyperparameters of a Distributed TensorFlow Model using Ray Train & Tune


----


## Introduction

This notebook runs an example Ray Train and Ray Tune workload that tunes the hyperparameters of a distributed TensorFlow model. It follows the code in this Ray example:
https://docs.ray.io/en/latest/train/examples/tf/tune_tensorflow_mnist_example.html

> **Note**:
> To set up the environment for this notebook, run `./setup.sh 2` on the command line, then select the `tensorflow-2.9.0` kernel in the notebook.

The Ray cluster is set up using the NERSC TensorFlow module and deployed on Perlmutter.



In [2]:
import sys

In [3]:
# !{sys.executable} -m pip install --upgrade git+https://github.com/asnaylor/nersc_cluster_deploy.git

## Creating Ray Cluster via Jupyter Config Job

To create a Ray cluster on NERSC compute nodes, call the `deploy_ray_cluster` function with your desired Slurm `sbatch` options.

In [5]:
from nersc_cluster_deploy import deploy_ray_cluster

module_load = 'tensorflow/2.9.0'

rayCluster = deploy_ray_cluster(
    job_setup = f'module load {module_load}'
)

2023-05-04 13:20:22,609 INFO <Service-Manager> Starting up cluster
2023-05-04 13:20:22,648 INFO <RayHead> Starting service
2023-05-04 13:20:22,648 DEBUG <RayHead> Running cmd: /bin/bash -c "export RAY_GRAFANA_IFRAME_HOST=https://jupyter.nersc.gov/user/asnaylor/muller-configurable-gpu/proxy/3000; module load tensorflow/2.9.0;srun --nodes=1 --ntasks=1 --cpus-per-task=124 --gpus-per-task=4 -w nid001069  ray start --head --block --port 6379 --num-cpus=124 --num-gpus=4"
In case of issues, please refer to our known issues: https://docs.nersc.gov/current/
and open a help ticket if your issue is not listed: https://help.nersc.gov/


2023-05-04 13:20:25,039	INFO usage_lib.py:435 -- Usage stats collection is disabled.
2023-05-04 13:20:25,039	INFO scripts.py:710 -- Local node IP: 128.55.173.50
2023-05-04 13:20:28,370	SUCC scripts.py:747 -- --------------------
2023-05-04 13:20:28,370	SUCC scripts.py:748 -- Ray runtime started.
2023-05-04 13:20:28,370	SUCC scripts.py:749 -- --------------------
2023-05-04 13:20:28,370	INFO scripts.py:751 -- Next steps
2023-05-04 13:20:28,370	INFO scripts.py:752 -- To connect to this Ray runtime from another node, run
2023-05-04 13:20:28,370	INFO scripts.py:755 --   ray start --address='128.55.173.50:6379'
2023-05-04 13:20:28,370	INFO scripts.py:771 -- Alternatively, use the following Python code:
2023-05-04 13:20:28,370	INFO scripts.py:773 -- import ray
2023-05-04 13:20:28,370	INFO scripts.py:777 -- ray.init(address='auto')
2023-05-04 13:20:28,370	INFO script

2023-05-04 13:20:32,688 INFO <Prometheus> Starting service
2023-05-04 13:20:32,688 DEBUG <Prometheus> Running cmd: srun --nodes=1 --ntasks=1 --cpus-per-task=2 --gpus-per-task=0 -w nid001069 shifter --image=prom/prometheus:v2.42.0 --volume=/mscratch/sd/a/asnaylor/ray_cluster/prometheus:/prometheus /bin/prometheus --config.file=/tmp/ray/session_latest/metrics/prometheus/prometheus.yml --storage.tsdb.path=/prometheus
2023-05-04 13:20:32,697 INFO <Grafana> Starting service
2023-05-04 13:20:32,698 DEBUG <Grafana> Running cmd: srun --nodes=1 --ntasks=1 --cpus-per-task=2 --gpus-per-task=0 -w nid001069 shifter --image=grafana/grafana-oss:9.4.3 --volume=/mscratch/sd/a/asnaylor/ray_cluster/grafana:/grafana --env GF_PATHS_DATA=/grafana --env GF_PATHS_PLUGINS=/grafana/plugins --env GF_SERVER_ROOT_URL=https://jupyter.nersc.gov/user/asnaylor/muller-configurable-gpu/proxy/3000/ --env GF_PATHS_CONFIG=/tmp/ray/session_latest/metrics/grafana/grafana.ini --env GF_PATHS_PROVISIONING=/tmp/ray/session_lates

Creating Ray workers via srun
[slurm] starting script...




[slurm] - Starting 2 Ray worker nodes
    - 0 at nid001072
    - 1 at nid001073


ts=2023-05-04T20:20:33.666Z caller=main.go:512 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2023-05-04T20:20:33.666Z caller=main.go:556 level=info msg="Starting Prometheus Server" mode=server version="(version=2.42.0, branch=HEAD, revision=225c61122d88b01d1f0eaaee0e05b6f3e0567ac0)"
ts=2023-05-04T20:20:33.666Z caller=main.go:561 level=info build_context="(go=go1.19.5, platform=linux/amd64, user=root@c67d48967507, date=20230201-07:53:32)"
ts=2023-05-04T20:20:33.666Z caller=main.go:562 level=info host_details="(Linux 5.14.21-150400.24.46_12.0.71-cray_shasta_c #1 SMP Sun Apr 30 16:36:43 UTC 2023 (92112fd) x86_64 nid001069 )"
ts=2023-05-04T20:20:33.666Z caller=main.go:563 level=info fd_limits="(soft=131072, hard=131072)"
ts=2023-05-04T20:20:33.666Z caller=main.go:564 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2023-05-04T20:20:33.670Z caller=web.go:561 level=info component=web msg="Start listening for connections" addr

logger=settings t=2023-05-04T13:20:34.071154249-07:00 level=info msg="Starting Grafana" version=9.4.3 commit=cf0a135595 branch=HEAD compiled=2023-03-02T12:28:42-08:00
logger=settings t=2023-05-04T13:20:34.07136202-07:00 level=warn msg="\"sentry\" frontend logging provider is deprecated and will be removed in the next major version. Use \"grafana\" provider instead."
logger=settings t=2023-05-04T13:20:34.071373011-07:00 level=info msg="Config loaded from" file=/usr/share/grafana/conf/defaults.ini
logger=settings t=2023-05-04T13:20:34.07137771-07:00 level=info msg="Config loaded from" file=/tmp/ray/session_latest/metrics/grafana/grafana.ini
logger=settings t=2023-05-04T13:20:34.071380876-07:00 level=info msg="Config overridden from command line" arg="default.paths.data=/grafana"
logger=settings t=2023-05-04T13:20:34.071383571-07:00 level=info msg="Config overridden from command line" arg="default.paths.logs=/var/log/grafana"
logger=settings t=2023-05-04T13:20:34.071386156-07:00 level=inf

[2023-05-04 13:20:36,308 W 65434 65434] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-05-04 13:20:36,729 W 13188 13188] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?


2023-05-04 13:20:36,168	INFO scripts.py:866 -- Local node IP: 10.250.1.90
2023-05-04 13:20:37,310	SUCC scripts.py:878 -- --------------------
2023-05-04 13:20:37,310	SUCC scripts.py:879 -- Ray runtime started.
2023-05-04 13:20:37,310	SUCC scripts.py:880 -- --------------------
2023-05-04 13:20:37,310	INFO scripts.py:882 -- To terminate the Ray runtime, run
2023-05-04 13:20:37,310	INFO scripts.py:883 --   ray stop
2023-05-04 13:20:37,310	INFO scripts.py:888 -- --block
2023-05-04 13:20:37,310	INFO scripts.py:889 -- This command will now block forever until terminated by a signal.
2023-05-04 13:20:37,310	INFO scripts.py:892 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2023-05-04 13:20:36,600	INFO scripts.py:866 -- Local node IP: 10.250.1.57
2023-05-04 13:2

## Connect to Ray + Grafana dashboards

Open the URL returned by the cell below to view the Ray dashboard:

In [6]:
rayCluster.ray_dashboard_url

'https://jupyter.nersc.gov/user/asnaylor/muller-configurable-gpu/proxy/localhost:8265/'


## Connect to Ray Cluster

In [7]:
import ray

# Drop any existing connection before attaching to the cluster
if ray.is_initialized():
    ray.shutdown()

ray.init(address='auto')

2023-05-04 13:22:08,961	INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 128.55.173.50:6379...
2023-05-04 13:22:08,967	INFO worker.py:1544 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265


| | |
|---|---|
| Python version: | 3.9.15 |
| Ray version: | 2.3.0 |
| Dashboard: | http://127.0.0.1:8265 |


Check that all nodes are connected to the cluster:

In [8]:
from nersc_cluster_deploy.ray import cluster_summary

cluster_summary()

Nodes: 3
CPU:   380
GPU:   12
RAM:   479.44 GB
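
The same kind of check can be sketched by hand. The example below is a minimal illustration, not the implementation of `cluster_summary`: it assumes node records shaped like the entries returned by `ray.nodes()` (dictionaries with `Alive` and `Resources` keys), and the helper name `alive_node_summary` and the per-node values are hypothetical, chosen only so that the totals match the summary above.

```python
from typing import Dict, List


def alive_node_summary(nodes: List[dict]) -> Dict[str, float]:
    """Count live nodes and total their CPU and GPU resources."""
    alive = [n for n in nodes if n.get("Alive")]
    return {
        "nodes": len(alive),
        "CPU": sum(n.get("Resources", {}).get("CPU", 0) for n in alive),
        "GPU": sum(n.get("Resources", {}).get("GPU", 0) for n in alive),
    }


# Hypothetical records in the shape of ray.nodes() output:
# one head node, two workers, and one node that has gone away.
example_nodes = [
    {"Alive": True, "Resources": {"CPU": 124.0, "GPU": 4.0}},
    {"Alive": True, "Resources": {"CPU": 128.0, "GPU": 4.0}},
    {"Alive": True, "Resources": {"CPU": 128.0, "GPU": 4.0}},
    {"Alive": False, "Resources": {}},
]

summary = alive_node_summary(example_nodes)
print(summary)
```

On a live cluster, passing `ray.nodes()` instead of `example_nodes` should give a comparable count, which can then be compared against the number of nodes requested from Slurm.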


## Set Up TensorFlow Model

In [9]:
from ray import tune
from ray.train.tensorflow import TensorflowTrainer
from ray.air.config import ScalingConfig

from ray.train.examples.tf.tensorflow_mnist_example import train_func
from ray.tune.tune_config import TuneConfig
from ray.tune.tuner import Tuner

In [10]:
def tune_tensorflow_mnist(
    num_workers: int = 2, num_samples: int = 2, use_gpu: bool = False
):
    trainer = TensorflowTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    )
    tuner = Tuner(
        trainer,
        tune_config=TuneConfig(num_samples=num_samples, metric="accuracy", mode="max"),
        param_space={
            "train_loop_config": {
                "lr": tune.loguniform(1e-4, 1e-1),
                "batch_size": tune.choice([32, 64, 128]),
                "epochs": 3,
            }
        },
    )
    results = tuner.fit()
    best_accuracy = results.get_best_result().metrics["accuracy"]
    print(f"Best accuracy config: {best_accuracy}")
    return results
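
To make the search space above concrete, the following standalone sketch mimics, in plain Python, what one draw from `param_space` might look like. It is not Ray Tune's implementation: `sample_loguniform` and `sample_config` are hypothetical helper names, and the sketch assumes only that a log-uniform draw is uniform in the logarithm of the value, so each decade between `1e-4` and `1e-1` is equally likely.

```python
import math
import random


def sample_loguniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter set mirroring the notebook's search space."""
    return {
        "lr": sample_loguniform(1e-4, 1e-1, rng),
        "batch_size": rng.choice([32, 64, 128]),
        "epochs": 3,  # fixed value, not searched
    }


rng = random.Random(0)  # seeded so the draws are repeatable
configs = [sample_config(rng) for _ in range(2)]  # like num_samples=2
for cfg in configs:
    print(cfg)
```

Sampling the learning rate on a log scale spends as many draws between `1e-4` and `1e-3` as between `1e-2` and `1e-1`, which a plain uniform draw would not.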

## Train Model

In [11]:
node_resources = ray.cluster_resources()
num_workers = int(node_resources['GPU'])
use_gpu = True

num_samples = 2
smoke_test = False
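
The cell above indexes `node_resources['GPU']` directly, which would raise a `KeyError` if the cluster reported no GPU entry. A more defensive variant is sketched below. The helper `workers_per_trial` and its parameters are hypothetical and not part of Ray or `nersc_cluster_deploy`; the sketch assumes one GPU per training worker and falls back to two CPU workers, the default of `tune_tensorflow_mnist`.

```python
from typing import Tuple


def workers_per_trial(
    resources: dict,
    concurrent_trials: int = 1,
    cpu_fallback_workers: int = 2,
) -> Tuple[int, bool]:
    """Return (num_workers, use_gpu) for one trial.

    GPUs are split evenly across the trials meant to run at once.
    If there are fewer GPUs than concurrent trials, fall back to CPU workers.
    """
    if concurrent_trials < 1:
        raise ValueError("concurrent_trials must be at least 1")
    gpus = int(resources.get("GPU", 0))
    if gpus >= concurrent_trials:
        return gpus // concurrent_trials, True
    return cpu_fallback_workers, False


# Values shaped like ray.cluster_resources() for the cluster in this notebook
example_resources = {"CPU": 380.0, "GPU": 12.0}
print(workers_per_trial(example_resources))                       # all GPUs in one trial
print(workers_per_trial(example_resources, concurrent_trials=2))  # GPUs shared by two trials
```

In the run recorded in this notebook each trial requests all 12 GPUs, and the trial timestamps suggest the two trials ran one after the other. Splitting the GPUs between trials is one possible way to let them run concurrently.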

In [12]:
results = tune_tensorflow_mnist(
            num_workers=num_workers,
            num_samples=num_samples,
            use_gpu=use_gpu,
)

| | |
|---|---|
| Current time: | 2023-05-04 13:23:49 |
| Running for: | 00:01:20.24 |
| Memory: | 49.3/251.3 GiB |

| Trial name | status | loc | train_loop_config/batch_size | train_loop_config/lr | iter | total time (s) | loss | accuracy | _timestamp |
|---|---|---|---|---|---|---|---|---|---|
| TensorflowTrainer_64592_00000 | TERMINATED | 10.250.1.90:67193 | 128 | 0.0237213 | 3 | 34.3788 | 0.337646 | 0.904846 | 1683231790 |
| TensorflowTrainer_64592_00001 | TERMINATED | 128.55.173.50:71652 | 128 | 0.0987073 | 3 | 27.3244 | 0.214705 | 0.937519 | 1683231826 |


[2m[36m(RayTrainWorker pid=67426, ip=128.55.173.52)[0m 2023-05-04 13:22:46.774803: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
[2m[36m(RayTrainWorker pid=67426, ip=128.55.173.52)[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2m[36m(RayTrainWorker pid=65112)[0m 2023-05-04 13:22:46.757586: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
[2m[36m(RayTrainWorker pid=65112)[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2m[36m(RayTrainWorker pid=65111)[0m 2023-05-04 13:22:46.724651: I tensorflow/core/platform/cpu_feature_guard.cc:193]

[2m[36m(RayTrainWorker pid=65111)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=65112)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=67428, ip=128.55.173.52)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=67425, ip=128.55.173.52)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=67427, ip=128.55.173.52)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=67426, ip=128.55.173.52)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=65114)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=65113)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=14961, ip=128.55.173.51)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=14960, ip=128.55.173.51)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=14959, ip=128.55.173.51)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=14958, ip=128.55.173.51)[0m Epoch 1/3


[2m[36m(RayTrainWorker pid=67428, ip=128.55.173.52)[0m 2023-05-04 13:22:54.786681: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=67425, ip=128.55.173.52)[0m 2023-05-04 13:22:54.810038: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=67427, ip=128.55.173.52)[0m 2023-05-04 13:22:54.800374: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=67426, ip=128.55.173.52)[0m 2023-05-04 13:22:54.769584: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=65114)[0m 2023-05-04 13:22:54.814059: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=65113)[0m 2023-05-04 13:22:54.814243: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=65112)[0m 2023-05-04 13:22:5

 1/70 [..............................] - ETA: 5:53 - loss: 2.3059 - accuracy: 0.1335
 1/70 [..............................] - ETA: 5:54 - loss: 2.3059 - accuracy: 0.1335
 1/70 [..............................] - ETA: 5:55 - loss: 2.3059 - accuracy: 0.1335
 1/70 [..............................] - ETA: 5:55 - loss: 2.3059 - accuracy: 0.1335
 2/70 [..............................] - ETA: 4s - loss: 2.2952 - accuracy: 0.1670  
 2/70 [..............................] - ETA: 4s - loss: 2.2952 - accuracy: 0.1670  
 2/70 [..............................] - ETA: 4s - loss: 2.2952 - accuracy: 0.1670  
 2/70 [..............................] - ETA: 4s - loss: 2.2952 - accuracy: 0.1670  
 2/70 [..............................] - ETA: 4s - loss: 2.2952 - accuracy: 0.1670  
 2/70 [..............................] - ETA: 4s - loss: 2.2952 - accuracy: 0.1670  
 2/70 [..............................] - ETA: 4s - loss: 2.2952 - accuracy: 0.1670  
 2/70 [..............................] - ETA: 4s - loss: 2.2952 -

Trial name,_time_this_iter_s,_timestamp,_training_iteration,accuracy,date,done,episodes_total,experiment_id,experiment_tag,hostname,iterations_since_restore,loss,node_ip,pid,should_checkpoint,time_since_restore,time_this_iter_s,time_total_s,timestamp,timesteps_since_restore,timesteps_total,training_iteration,trial_id,warmup_time
TensorflowTrainer_64592_00000,5.03086,1683231790,3,0.904846,2023-05-04_13-23-10,True,,ee96ebb085364252bb1a9f284b8f65d9,"0_batch_size=128,lr=0.0237",nid001073,3,0.337646,10.250.1.90,67193,True,34.3788,5.03014,34.3788,1683231790,0,,3,64592_00000,0.0223877
TensorflowTrainer_64592_00001,4.89204,1683231826,3,0.937519,2023-05-04_13-23-46,True,,0d6b5dce0aae48d4b06a6ecdb90c1eeb,"1_batch_size=128,lr=0.0987",nid001069,3,0.214705,128.55.173.50,71652,True,27.3244,4.88759,27.3244,1683231826,0,,3,64592_00001,0.00834465


[2m[36m(RayTrainWorker pid=67428, ip=128.55.173.52)[0m Epoch 2/3
[2m[36m(RayTrainWorker pid=67425, ip=128.55.173.52)[0m Epoch 2/3
[2m[36m(RayTrainWorker pid=67427, ip=128.55.173.52)[0m Epoch 2/3
[2m[36m(RayTrainWorker pid=67426, ip=128.55.173.52)[0m Epoch 2/3
[2m[36m(RayTrainWorker pid=65114)[0m Epoch 2/3
[2m[36m(RayTrainWorker pid=65113)[0m Epoch 2/3
[2m[36m(RayTrainWorker pid=65112)[0m Epoch 2/3
[2m[36m(RayTrainWorker pid=65111)[0m Epoch 2/3
[2m[36m(RayTrainWorker pid=14958, ip=128.55.173.51)[0m Epoch 2/3
[2m[36m(RayTrainWorker pid=14961, ip=128.55.173.51)[0m Epoch 2/3
[2m[36m(RayTrainWorker pid=14960, ip=128.55.173.51)[0m Epoch 2/3
[2m[36m(RayTrainWorker pid=14959, ip=128.55.173.51)[0m Epoch 2/3
 1/70 [..............................] - ETA: 4s - loss: 0.4688 - accuracy: 0.8796
 1/70 [..............................] - ETA: 4s - loss: 0.4688 - accuracy: 0.8796
 1/70 [..............................] - ETA: 4s - loss: 0.4688 - accuracy: 0.8796
 1/70 [

[2m[36m(RayTrainWorker pid=67425, ip=128.55.173.52)[0m Exception ignored in: <function Pool.__del__ at 0x7f4744baad30>
[2m[36m(RayTrainWorker pid=67425, ip=128.55.173.52)[0m Traceback (most recent call last):
[2m[36m(RayTrainWorker pid=67425, ip=128.55.173.52)[0m   File "/global/common/software/nersc/pm-2022q4/sw/tensorflow/2.9.0/lib/python3.9/multiprocessing/pool.py", line 268, in __del__
[2m[36m(RayTrainWorker pid=67425, ip=128.55.173.52)[0m     self._change_notifier.put(None)
[2m[36m(RayTrainWorker pid=67425, ip=128.55.173.52)[0m   File "/global/common/software/nersc/pm-2022q4/sw/tensorflow/2.9.0/lib/python3.9/multiprocessing/queues.py", line 377, in put
[2m[36m(RayTrainWorker pid=67425, ip=128.55.173.52)[0m     self._writer.send_bytes(obj)
[2m[36m(RayTrainWorker pid=67425, ip=128.55.173.52)[0m   File "/global/common/software/nersc/pm-2022q4/sw/tensorflow/2.9.0/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
[2m[36m(RayTrainWorker pid=674



[2m[36m(RayTrainWorker pid=19201, ip=128.55.173.51)[0m 2023-05-04 13:23:23.841159: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
[2m[36m(RayTrainWorker pid=19201, ip=128.55.173.51)[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2m[36m(RayTrainWorker pid=73366, ip=128.55.173.52)[0m 2023-05-04 13:23:23.843376: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
[2m[36m(RayTrainWorker pid=73366, ip=128.55.173.52)[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2m[36m(RayTrainWorker pid=19196, ip=128.55.173.51)[0m 2023-05-04 13:23:23.988249

[2m[36m(RayTrainWorker pid=71882)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=71883)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=71885)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=73367, ip=128.55.173.52)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=73366, ip=128.55.173.52)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=73365, ip=128.55.173.52)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=73368, ip=128.55.173.52)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=19199, ip=128.55.173.51)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=19201, ip=128.55.173.51)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=19195, ip=128.55.173.51)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=19196, ip=128.55.173.51)[0m Epoch 1/3
[2m[36m(RayTrainWorker pid=71884)[0m Epoch 1/3


[2m[36m(RayTrainWorker pid=19199, ip=128.55.173.51)[0m 2023-05-04 13:23:30.889145: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=19201, ip=128.55.173.51)[0m 2023-05-04 13:23:30.887971: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=19195, ip=128.55.173.51)[0m 2023-05-04 13:23:30.886963: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=19196, ip=128.55.173.51)[0m 2023-05-04 13:23:30.874601: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=73367, ip=128.55.173.52)[0m 2023-05-04 13:23:30.892859: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWorker pid=73366, ip=128.55.173.52)[0m 2023-05-04 13:23:30.947236: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8302
[2m[36m(RayTrainWork

 1/70 [..............................] - ETA: 5:19 - loss: 2.3187 - accuracy: 0.0820
 1/70 [..............................] - ETA: 5:18 - loss: 2.3187 - accuracy: 0.0820
 1/70 [..............................] - ETA: 5:18 - loss: 2.3187 - accuracy: 0.0820
 1/70 [..............................] - ETA: 5:19 - loss: 2.3187 - accuracy: 0.0820
 1/70 [..............................] - ETA: 5:19 - loss: 2.3187 - accuracy: 0.0820
 1/70 [..............................] - ETA: 5:19 - loss: 2.3187 - accuracy: 0.0820
 1/70 [..............................] - ETA: 5:20 - loss: 2.3187 - accuracy: 0.0820
 1/70 [..............................] - ETA: 5:20 - loss: 2.3187 - accuracy: 0.0820
 2/70 [..............................] - ETA: 4s - loss: 2.2817 - accuracy: 0.2051  
 2/70 [..............................] - ETA: 4s - loss: 2.2817 - accuracy: 0.2051  
 2/70 [..............................] - ETA: 4s - loss: 2.2817 - accuracy: 0.2051  
 2/70 [..............................] - ETA: 4s - loss: 2.2817 -

2023-05-04 13:23:49,481	INFO tune.py:798 -- Total run time: 80.31 seconds (80.24 seconds for the tuning loop).


Best accuracy config: 0.9375185966491699


[2m[36m(RayTrainWorker pid=73368, ip=128.55.173.52)[0m [2023-05-04 13:23:49,764 E 73368 76760] logging.cc:104: Stack trace: 
[2m[36m(RayTrainWorker pid=73368, ip=128.55.173.52)[0m  /global/homes/a/asnaylor/.local/muller/tensorflow2.9.0/lib/python3.9/site-packages/ray/_raylet.so(+0xd5621a) [0x7fb30654321a] ray::operator<<()
[2m[36m(RayTrainWorker pid=73368, ip=128.55.173.52)[0m /global/homes/a/asnaylor/.local/muller/tensorflow2.9.0/lib/python3.9/site-packages/ray/_raylet.so(+0xd589d8) [0x7fb3065459d8] ray::TerminateHandler()
[2m[36m(RayTrainWorker pid=73368, ip=128.55.173.52)[0m /global/common/software/nersc/pm-2022q4/sw/tensorflow/2.9.0/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fb30545f35a] __cxxabiv1::__terminate()
[2m[36m(RayTrainWorker pid=73368, ip=128.55.173.52)[0m /global/common/software/nersc/pm-2022q4/sw/tensorflow/2.9.0/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fb30545f3c5]
[2m[36m(RayTrainWorker pid=73368, ip=128.55.173.52)[0m /global/common/software/nersc/pm-20


In [13]:
log_dir = str(results.get_best_result().log_dir)

In [14]:
log_dir

'/global/homes/a/asnaylor/ray_results/TensorflowTrainer_2023-05-04_13-22-29/TensorflowTrainer_64592_00001_1_batch_size=128,lr=0.0987_2023-05-04_13-23-16'
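
The log directory above belongs to trial `00001`, the trial with the higher accuracy. As a rough illustration of what selecting by `metric="accuracy"` and `mode="max"` amounts to, the sketch below picks the best row from the two trial results reported in this run. The helper `best_trial` is a hypothetical stand-in written for this example, not Ray's `get_best_result`.

```python
# Trial results copied from the status table of this run
trials = [
    {"trial": "TensorflowTrainer_64592_00000", "batch_size": 128,
     "lr": 0.0237213, "loss": 0.337646, "accuracy": 0.904846},
    {"trial": "TensorflowTrainer_64592_00001", "batch_size": 128,
     "lr": 0.0987073, "loss": 0.214705, "accuracy": 0.937519},
]


def best_trial(rows, metric="accuracy", mode="max"):
    """Return the row with the largest (or smallest) value of `metric`."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    return pick(rows, key=lambda row: row[metric])


best = best_trial(trials)
print(best["trial"], best["lr"], best["accuracy"])
```

Selecting on `loss` with `mode="min"` picks the same trial here, since trial `00001` also has the lower loss.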

## Close cluster connection and stop job

In [19]:
ray.shutdown()

In [20]:
rayCluster.shutdown()

{'task_id': '0', 'status': 'OK', 'error': None}

In [None]:
!scancel -u "$USER"

## Explore Training in TensorBoard

In [15]:
import nersc_tensorboard_helper
%load_ext tensorboard

In [16]:
%tensorboard --logdir $log_dir --port 0

In [17]:
nersc_tensorboard_helper.tb_address()

