# NERSC Cluster Deploy Tutorial: Tuning Hyperparameters of a Distributed PyTorch Model with PBT using Ray Train & Tune

📖 [Back to Table of Contents](../README.md)<br>
➡ [Next notebook](./ex_02_tensorflow_ray_train_tune.ipynb) <br>

----


## Introduction

This tutorial runs an example that combines Ray Train and Ray Tune: tuning the hyperparameters of a distributed PyTorch model with Population Based Training (PBT). It follows the code in this Ray example: https://docs.ray.io/en/latest/train/examples/pytorch/tune_cifar_torch_pbt_example.html

> **Note**:
> To set up the environment for the notebook, run `./setup.sh 1` on the command line, then select the `pytorch-1.13.1` kernel in the notebook.

The Ray cluster is set up using the NERSC PyTorch module and deployed on Perlmutter.



# Starting Ray Cluster

## Superfacility API

To deploy the Ray cluster via the NERSC Superfacility API you require a valid API client. 

To create a valid client visit your profile page in [Iris](https://iris.nersc.gov/):

<img src="img/iris_profile_header.png" width="800" />

Then scroll down to the **Superfacility API Clients** section and click the "+ New Client" button which will produce this window:

<img src="img/new_sf_api_client.png" width="400" />

To submit and deploy a Ray cluster we require the highest security level (<span style="color:red">RED</span>). **[This client id is valid for 2 days]**

Once the client is created, save the `client_id` string and the `private_key` dictionary (you can also save the private key in PEM format), ready for use with the `SuperfacilityAPI` library.

> **Note**:
> This step only needs to be repeated once your client has expired.


For more information about the NERSC Superfacility API, see the [documentation](https://docs.nersc.gov/services/sfapi/).
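
The next cell reads the credentials through `load_secrets` from the notebook's own `utility` module, whose implementation is not shown here. As a rough sketch only, a helper of that kind could look like the following. The function name `load_secrets_sketch` and the file names `clientid.txt` and `priv_key.jwk` are assumptions for illustration, not the repository's actual layout.

```python
import json
from pathlib import Path


def load_secrets_sketch(directory):
    """Hypothetical stand-in for a secrets loader: read the saved client files.

    Assumes the client id was saved as plain text and the private key
    as a JSON (JWK) dictionary. Both file names are illustrative.
    """
    directory = Path(directory)
    client_id = (directory / "clientid.txt").read_text().strip()
    private_key = json.loads((directory / "priv_key.jwk").read_text())
    return client_id, private_key
```

Keeping the key in a file outside the notebook avoids pasting it into a cell that might later be shared or committed.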

In [4]:
from SuperfacilityAPI import SuperfacilityAPI, SuperfacilityAccessToken
from utility import load_secrets

# Replace with your client id string and private key dictionary
client_id, private_key = load_secrets()
# client_id = "<your client id string>"
# private_key = "<your private key dict>"

api_key = SuperfacilityAccessToken(
    client_id = client_id,
    private_key = private_key
)
sfp_api = SuperfacilityAPI(api_key)

## Creating Ray Cluster

To create a Ray cluster on NERSC compute nodes, call the `deploy_ray_cluster` function with your desired Slurm `sbatch` options.

In [5]:
from nersc_cluster_deploy import deploy_ray_cluster
from utility import user_account

slurm_options = {
    'qos': 'debug',
    'account': user_account(),
    'nodes': '2',
    't': '00:30:00'
}
site = 'perlmutter'
module_load = 'pytorch/1.13.1'

job = deploy_ray_cluster(
    sfp_api,
    slurm_options,
    site,
    job_setup = [f'module load {module_load}']
)

In [6]:
job

{'error': None, 'jobid': '5906466', 'task_id': '11931'}
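
The keys in `slurm_options` correspond to `sbatch` options (`qos`, `account`, `nodes`, and the short form `t` for the time limit). How `deploy_ray_cluster` renders them internally is not shown in this notebook; the sketch below only illustrates one plausible mapping from such a dictionary to `#SBATCH` directives. The helper name `sbatch_directives` and the account `m0000` are placeholders.

```python
def sbatch_directives(options):
    """Render a dict of Slurm options as #SBATCH lines (illustrative only)."""
    lines = []
    for key, value in options.items():
        # Single-letter keys use the short form (-t ...),
        # longer keys use the long form (--qos=...).
        flag = f"-{key} {value}" if len(key) == 1 else f"--{key}={value}"
        lines.append(f"#SBATCH {flag}")
    return "\n".join(lines)


# Placeholder account name; substitute your own project.
example = {"qos": "debug", "account": "m0000", "nodes": "2", "t": "00:30:00"}
print(sbatch_directives(example))
```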

Now that the job has been submitted, check its status:

In [28]:
import os
import pandas as pd
sqs_table = sfp_api.get_jobs(site=site, user=os.getlogin(), sacct=False)
sqs_df = pd.DataFrame(sqs_table['output'])
sqs_df

| | account | tres_per_node | min_cpus | min_tmp_disk | end_time | features | group | over_subscribe | jobid | name | ... | partition | nodelist(reason) | start_time | state | uid | submit_time | licenses | core_spec | schednodes | work_dir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dasrepo_g | | 128 | 0 | 2023-03-03T19:45:32 | gpu&a100&hbm40g | 75235 | NO | 5906466 | sbatch | ... | gpu_ss11 | nid[003044-003045] | 2023-03-03T19:15:32 | RUNNING | 75235 | 2023-03-03T19:15:32 | u2:1 | | (null) | /global/u2/a/asnaylor |
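
A job in the `debug` QOS may sit in the queue for a while before its `state` becomes `RUNNING`. Rather than re-running the cell by hand, the check can be wrapped in a small polling loop. This is a minimal sketch: `wait_for_state` is a hypothetical helper, and it only assumes that each row has the `jobid` and `state` fields seen in the table above.

```python
import time


def wait_for_state(fetch_jobs, jobid, target="RUNNING", poll_s=10.0, timeout_s=600.0):
    """Poll `fetch_jobs()` until job `jobid` reaches `target`.

    `fetch_jobs` is any zero-argument callable returning a list of dicts
    with 'jobid' and 'state' keys. Returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        for row in fetch_jobs():
            if str(row.get("jobid")) == str(jobid) and row.get("state") == target:
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

In this notebook it could be called as `wait_for_state(lambda: sfp_api.get_jobs(site=site, user=os.getlogin(), sacct=False)['output'], job['jobid'])`.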


Check the job log:

In [27]:
!cat ~/slurm-{job['jobid']}.out

In case of issues, please refer to our known issues: https://docs.nersc.gov/current/
and open a help ticket if your issue is not listed: https://help.nersc.gov/
[slurm] - Starting ray HEAD
2023-03-03 19:15:38,460	INFO usage_lib.py:435 -- Usage stats collection is disabled.
2023-03-03 19:15:38,460	INFO scripts.py:710 -- Local node IP: nid003044
2023-03-03 19:15:40,974	SUCC scripts.py:747 -- --------------------
2023-03-03 19:15:40,974	SUCC scripts.py:748 -- Ray runtime started.
2023-03-03 19:15:40,974	SUCC scripts.py:749 -- --------------------
2023-03-03 19:15:40,974	INFO scripts.py:751 -- Next steps
2023-03-03 19:15:40,974	INFO scripts.py:752 -- To connect to this Ray runtime from another node, run
2023-03-03 19:15:40,974	INFO scripts.py:755 --   ray start --address='nid003044:6379'
2023-03-03 19:15:40,974	INFO scripts.py:771 -- Alternatively, use the following Python code:
2023-03-03 19:15:40,974	INFO scripts.py:773 
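
The log contains the head node address in the `ray start --address='...'` hint. If you ever need that address without the helper library, it can be pulled out of the log text with a regular expression. This is only a sketch; whether `get_ray_cluster_address` works this way is not stated in this notebook, and `head_address_from_log` is a hypothetical name.

```python
import re

# Matches the address printed in the "ray start --address='host:port'" hint.
ADDRESS_RE = re.compile(r"ray start --address='([^']+)'")


def head_address_from_log(log_text):
    """Return 'host:port' for the Ray head node, or None if the hint is absent."""
    match = ADDRESS_RE.search(log_text)
    return match.group(1) if match else None


sample = "INFO scripts.py:755 --   ray start --address='nid003044:6379'"
print(head_address_from_log(sample))  # nid003044:6379
```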

## Connect to Ray Cluster

Get the IP address of the Ray cluster head node, then connect to the cluster:

In [29]:
from nersc_cluster_deploy import get_ray_cluster_address
import ray

cluster_address = get_ray_cluster_address(
    sfp_api,
    job['jobid'],
    site
)
ray.init(cluster_address)

| | |
|---|---|
| Python version: | 3.9.15 |
| Ray version: | 2.3.0 |
| Dashboard: | http://127.0.0.1:8265 |


Check that all nodes are connected to the cluster:

In [30]:
from nersc_cluster_deploy import ray_cluster_summary

ray_cluster_summary()

Cluster Summary
---------------
Nodes: 2
CPU:   256
GPU:   8
RAM:   309.42 GB
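
A summary like this can also be derived from Ray's own resource report. The sketch below formats a dictionary shaped like the return value of `ray.cluster_resources()`, which reports `CPU`, `GPU`, and `memory` (in bytes). The function `format_cluster_summary` and the choice of decimal gigabytes are assumptions; the actual `ray_cluster_summary` may compute its figures differently.

```python
def format_cluster_summary(resources, num_nodes):
    """Format a summary from a dict shaped like `ray.cluster_resources()`."""
    ram_gb = resources.get("memory", 0) / 1e9  # bytes -> decimal gigabytes
    return (
        f"Nodes: {num_nodes}\n"
        f"CPU:   {int(resources.get('CPU', 0))}\n"
        f"GPU:   {int(resources.get('GPU', 0))}\n"
        f"RAM:   {ram_gb:.2f} GB"
    )


# Illustrative values only, chosen to resemble the output above.
print(format_cluster_summary({"CPU": 256.0, "GPU": 8.0, "memory": 309.42e9}, 2))
```

On a live cluster the inputs would be `ray.cluster_resources()` and the number of entries in `ray.nodes()` whose `"Alive"` field is true.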


## Setup PyTorch Model

In [31]:
import argparse
import os

import torch
import torch.nn as nn
import torchvision.transforms as transforms
from filelock import FileLock
from torch.utils.data import DataLoader, Subset
from torchvision.datasets import CIFAR10
from torchvision.models import resnet18

import ray
import ray.train as train
from ray import tune
from ray.air import session
from ray.air.checkpoint import Checkpoint
from ray.air.config import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune.schedulers import PopulationBasedTraining
from ray.tune.tune_config import TuneConfig
from ray.tune.tuner import Tuner

In [32]:
def train_epoch(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset) // session.get_world_size()
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def validate_epoch(dataloader, model, loss_fn):
    size = len(dataloader.dataset) // session.get_world_size()
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(
        f"Test Error: \n "
        f"Accuracy: {(100 * correct):>0.1f}%, "
        f"Avg loss: {test_loss:>8f} \n"
    )
    return {"loss": test_loss}


def update_optimizer_config(optimizer, config):
    for param_group in optimizer.param_groups:
        for param, val in config.items():
            param_group[param] = val


def train_func(config):
    epochs = config.get("epochs", 3)

    model = resnet18()

    # Note that `prepare_model` needs to be called before setting optimizer.
    if not session.get_checkpoint():  # fresh start
        model = train.torch.prepare_model(model)

    # Create optimizer.
    optimizer_config = {
        "lr": config.get("lr"),
        "momentum": config.get("momentum"),
    }
    optimizer = torch.optim.SGD(model.parameters(), **optimizer_config)

    starting_epoch = 0
    if session.get_checkpoint():
        checkpoint_dict = session.get_checkpoint().to_dict()

        # Load in model
        model_state = checkpoint_dict["model"]
        model.load_state_dict(model_state)
        model = train.torch.prepare_model(model)

        # Load in optimizer
        optimizer_state = checkpoint_dict["optimizer_state_dict"]
        optimizer.load_state_dict(optimizer_state)

        # Optimizer configs (`lr`, `momentum`) are being mutated by PBT and passed in
        # through config, so we need to update the optimizer loaded from the checkpoint
        update_optimizer_config(optimizer, optimizer_config)

        # The current epoch increments the loaded epoch by 1
        checkpoint_epoch = checkpoint_dict["epoch"]
        starting_epoch = checkpoint_epoch + 1

    # Load in training and validation data.
    transform_train = transforms.Compose(
        [
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
        ]
    )  # meanstd transformation

    transform_test = transforms.Compose(
        [
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
        ]
    )

    data_dir = config.get("data_dir", os.path.expanduser("~/data"))
    os.makedirs(data_dir, exist_ok=True)
    with FileLock(os.path.join(data_dir, ".ray.lock")):
        train_dataset = CIFAR10(
            root=data_dir, train=True, download=True, transform=transform_train
        )
        validation_dataset = CIFAR10(
            root=data_dir, train=False, download=False, transform=transform_test
        )

    if config.get("test_mode"):
        train_dataset = Subset(train_dataset, list(range(64)))
        validation_dataset = Subset(validation_dataset, list(range(64)))

    worker_batch_size = config["batch_size"] // session.get_world_size()

    train_loader = DataLoader(train_dataset, batch_size=worker_batch_size)
    validation_loader = DataLoader(validation_dataset, batch_size=worker_batch_size)

    train_loader = train.torch.prepare_data_loader(train_loader)
    validation_loader = train.torch.prepare_data_loader(validation_loader)

    # Create loss.
    criterion = nn.CrossEntropyLoss()

    for epoch in range(starting_epoch, epochs):
        train_epoch(train_loader, model, criterion, optimizer)
        result = validate_epoch(validation_loader, model, criterion)
        checkpoint = Checkpoint.from_dict(
            {
                "epoch": epoch,
                "model": model.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
            }
        )

        session.report(result, checkpoint=checkpoint)


## Train Model

In [33]:
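SCRATCH = os.getenv('SCRATCH')  # NERSC scratch directory, used to store the dataset

node_resources = ray.cluster_resources()
num_workers = int(node_resources['GPU'])  # one training worker per GPU
use_gpu = True

data_dir = os.path.join(SCRATCH, 'CIFAR10')
num_epochs = 5
smoke_test = False  # True trains on a 64-image subset (see `test_mode` in train_func)
synch = False  # False: PBT perturbs trials asynchronously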

In [34]:
trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(
            num_workers=num_workers, use_gpu=use_gpu
        ),
    )

In [35]:
pbt_scheduler = PopulationBasedTraining(
        time_attr="training_iteration",
        perturbation_interval=1,
        hyperparam_mutations={
            "train_loop_config": {
                # distribution for resampling
                "lr": tune.loguniform(0.001, 0.1),
                # allow perturbations within this set of categorical values
                "momentum": [0.8, 0.9, 0.99],
            }
        },
        synch=synch,
    )
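
To make the scheduler's "explore" step more concrete, here is a simplified, pure-Python sketch of how a PBT perturbation can work: a continuous value is either resampled or multiplied by 1.2 or 0.8, and a categorical value is either resampled or shifted to a neighbouring entry in its list. The 0.25 resample probability and the 1.2/0.8 factors are Ray Tune's documented defaults, but the function itself is an illustration and not Ray's implementation; `explore` and `loguniform` are hypothetical names.

```python
import math
import random


def explore(config, mutations, resample_probability=0.25, rng=random):
    """Simplified PBT explore step (illustrative; not Ray's actual code)."""
    new_config = dict(config)
    for key, space in mutations.items():
        if isinstance(space, list):
            # Categorical: resample, or shift to a neighbouring value
            # (clamped at the ends, so a shift can be a no-op).
            if rng.random() < resample_probability or config[key] not in space:
                new_config[key] = rng.choice(space)
            else:
                idx = space.index(config[key]) + rng.choice([-1, 1])
                new_config[key] = space[min(max(idx, 0), len(space) - 1)]
        else:
            # Continuous: resample from the callable, or scale by 1.2 or 0.8.
            if rng.random() < resample_probability:
                new_config[key] = space()
            else:
                new_config[key] = config[key] * rng.choice([1.2, 0.8])
    return new_config


def loguniform(low, high, rng=random):
    """Sample log-uniformly between low and high."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
mutations = {"lr": lambda: loguniform(0.001, 0.1, rng), "momentum": [0.8, 0.9, 0.99]}
print(explore({"lr": 0.05, "momentum": 0.8}, mutations, rng=rng))
```

Starting from `momentum = 0.8`, the first entry in the list, a shift to the left has nowhere to go and leaves the value unchanged.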

In [36]:
tuner = Tuner(
        trainer,
        param_space={
            "train_loop_config": {
                "lr": tune.grid_search([0.001, 0.01, 0.05, 0.1]),
                "momentum": 0.8,
                "batch_size": 128 * num_workers,
                "test_mode": smoke_test,  # whether to subset the data
                "data_dir": data_dir,
                "epochs": num_epochs,
            }
        },
        tune_config=TuneConfig(
            num_samples=1, metric="loss", mode="min", scheduler=pbt_scheduler
        ),
        run_config=RunConfig(
            stop={"training_iteration": 3 if smoke_test else num_epochs},
            failure_config=FailureConfig(max_failures=3),  # used for fault tolerance
        ),
    )
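
The `batch_size` in `param_space` is the global batch size; `train_func` divides it by the world size to get each worker's batch size. The arithmetic below works through the numbers for this run, assuming 8 workers (one per GPU in the cluster summary) and assuming that `prepare_data_loader` shards the 50,000 CIFAR-10 training images evenly across workers.

```python
import math

train_images = 50_000           # CIFAR-10 training set size
world_size = 8                  # assumed: one worker per GPU on this cluster
global_batch = 128 * world_size

per_worker_samples = train_images // world_size
per_worker_batch = global_batch // world_size
batches_per_epoch = math.ceil(per_worker_samples / per_worker_batch)

print(per_worker_samples, per_worker_batch, batches_per_epoch)  # 6250 128 49
```

With 49 batches per worker per epoch, the `batch % 100 == 0` check in `train_epoch` is true only for batch 0, so each worker prints a single loss line per epoch.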


In [37]:
results = tuner.fit()

| | |
|---|---|
| Current time: | 2023-03-03 19:29:11 |
| Running for: | 00:08:22.13 |
| Memory: | 63.6/251.3 GiB |

| Trial name | status | loc | train_loop_config/lr | iter | total time (s) | loss | _timestamp | _time_this_iter_s |
|---|---|---|---|---|---|---|---|---|
| TorchTrainer_8f5ae_00000 | TERMINATED | pid=78729 | 0.06 | 5 | 99.1326 | 0.985433 | 1677900479 | 15.4285 |
| TorchTrainer_8f5ae_00001 | TERMINATED | 10.249.19.154:108482 | 0.072 | 5 | 99.1398 | 1.01094 | 1677900502 | 15.3436 |
| TorchTrainer_8f5ae_00002 | TERMINATED | pid=80619 | 0.05 | 5 | 99.0159 | 1.00415 | 1677900526 | 15.4466 |
| TorchTrainer_8f5ae_00003 | TERMINATED | pid=81595 | 0.04 | 5 | 99.4392 | 1.04827 | 1677900550 | 15.3058 |
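
Once `tuner.fit()` returns, the `results` object (a Tune `ResultGrid`) can be queried for the best trial instead of reading it off the table. The helper below is a small sketch: `describe_best` is a hypothetical name, and it assumes the hyperparameters live under the `train_loop_config` key, as in the `param_space` used in this notebook.

```python
def describe_best(result_grid, metric="loss", mode="min"):
    """Return the best trial's hyperparameters and metric value."""
    best = result_grid.get_best_result(metric=metric, mode=mode)
    return {
        "config": best.config["train_loop_config"],
        metric: best.metrics[metric],
    }
```

Calling `describe_best(results)` after the run gives the configuration of the trial with the lowest reported loss.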


[2m[36m(RayTrainWorker pid=61784)[0m 2023-03-03 19:20:58,111	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=61784)[0m 2023-03-03 19:21:00,799	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=98660, ip=128.55.69.178)[0m 2023-03-03 19:21:00,855	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=61784)[0m 2023-03-03 19:21:02,923	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.
[2m[36m(RayTrainWorker pid=98660, ip=128.55.69.178)[0m 2023-03-03 19:21:02,934	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.


[2m[36m(RayTrainWorker pid=61784)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=61787)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=61786)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=61785)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=98657, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=98658, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=98660, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=98659, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=98659, ip=128.55.69.178)[0m loss: 7.233888  [    0/ 6250]
[2m[36m(RayTrainWorker pid=98658, ip=128.55.69.178)[0m loss: 7.088008  [    0/ 6250]
[2m[36m(RayTrainWorker pid=98657, ip=128.55.69.178)[0m loss: 7.143627  [    0/ 6250]
[2m[36m(RayTrainWorker pid=98660, ip=1

[2m[36m(RayTrainWorker pid=62849)[0m 2023-03-03 19:21:23,577	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=62849)[0m 2023-03-03 19:21:25,488	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=99128, ip=128.55.69.178)[0m 2023-03-03 19:21:25,529	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=99128, ip=128.55.69.178)[0m 2023-03-03 19:21:27,023	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.
[2m[36m(RayTrainWorker pid=62849)[0m 2023-03-03 19:21:27,085	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.


[2m[36m(RayTrainWorker pid=62850)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=62849)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=62852)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99131, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99130, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99129, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99128, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=62851)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=62852)[0m loss: 7.217724  [    0/ 6250]
[2m[36m(RayTrainWorker pid=62851)[0m loss: 7.198362  [    0/ 6250]
[2m[36m(RayTrainWorker pid=62850)[0m loss: 7.171299  [    0/ 6250]
[2m[36m(RayTrainWorker pid=62849)[0m loss: 7.220249  [    0/ 6250]
[2m[36m(RayTrainWorker

[2m[36m(RayTrainWorker pid=63837)[0m 2023-03-03 19:21:45,564	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=63837)[0m 2023-03-03 19:21:47,329	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=99625, ip=128.55.69.178)[0m 2023-03-03 19:21:47,395	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=99625, ip=128.55.69.178)[0m 2023-03-03 19:21:48,908	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.
[2m[36m(RayTrainWorker pid=63837)[0m 2023-03-03 19:21:49,001	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.


[2m[36m(RayTrainWorker pid=99628, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99626, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=63839)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=63838)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=63837)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=63840)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99625, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99627, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99627, ip=128.55.69.178)[0m loss: 7.026409  [    0/ 6250]
[2m[36m(RayTrainWorker pid=99628, ip=128.55.69.178)[0m loss: 7.051173  [    0/ 6250]
[2m[36m(RayTrainWorker pid=99626, ip=128.55.69.178)[0m loss: 7.136610  [    0/ 6250]
[2m[36m(RayTrainWorker pid=99625, ip=1

[2m[36m(RayTrainWorker pid=100198, ip=128.55.69.178)[0m 2023-03-03 19:22:09,156	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=64724)[0m 2023-03-03 19:22:11,104	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=100198, ip=128.55.69.178)[0m 2023-03-03 19:22:11,074	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=64724)[0m 2023-03-03 19:22:12,625	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.
[2m[36m(RayTrainWorker pid=100198, ip=128.55.69.178)[0m 2023-03-03 19:22:12,775	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.


[2m[36m(RayTrainWorker pid=100201, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100198, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=64724)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100200, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=64725)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=64727)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=64726)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100199, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=64725)[0m loss: 7.380192  [    0/ 6250]
[2m[36m(RayTrainWorker pid=64724)[0m loss: 7.478827  [    0/ 6250]
[2m[36m(RayTrainWorker pid=64727)[0m loss: 7.547189  [    0/ 6250]
[2m[36m(RayTrainWorker pid=64726)[0m loss: 7.476220  [    0/ 6250]
[2m[36m(RayTrainWo

[2m[36m(TorchTrainer pid=100861, ip=128.55.69.178)[0m 2023-03-03 19:22:31,259	INFO trainable.py:791 -- Restored on 10.249.19.154 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00000_0_lr=0.0010_2023-03-03_19-20-49/checkpoint_tmpb1d240
[2m[36m(TorchTrainer pid=100861, ip=128.55.69.178)[0m 2023-03-03 19:22:31,259	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 25.8347487449646, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=100958, ip=128.55.69.178)[0m 2023-03-03 19:22:33,803	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=100958, ip=128.55.69.178)[0m 2023-03-03 19:22:37,293	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=65641)[0m 2023-03-03 19:22:37,252	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pi

[2m[36m(RayTrainWorker pid=65641)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=65646)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=65644)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100958, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100959, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100960, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100961, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=65642)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100960, ip=128.55.69.178)[0m loss: 2.387163  [    0/ 6250]
[2m[36m(RayTrainWorker pid=100961, ip=128.55.69.178)[0m loss: 2.425263  [    0/ 6250]
[2m[36m(RayTrainWorker pid=100959, ip=128.55.69.178)[0m loss: 2.359572  [    0/ 6250]
[2m[36m(RayTrainWorker pid=1009

[2m[36m(TunerInternal pid=61223)[0m 2023-03-03 19:22:51,121	INFO pbt.py:804 -- 
[2m[36m(TunerInternal pid=61223)[0m 
[2m[36m(TunerInternal pid=61223)[0m [PopulationBasedTraining] [Exploit] Cloning trial 8f5ae_00002 (score = -1.416393) into trial 8f5ae_00000 (score = -1.994845)
[2m[36m(TunerInternal pid=61223)[0m 
[2m[36m(TunerInternal pid=61223)[0m 2023-03-03 19:22:51,122	INFO pbt.py:831 -- 
[2m[36m(TunerInternal pid=61223)[0m 
[2m[36m(TunerInternal pid=61223)[0m [PopulationBasedTraining] [Explore] Perturbed the hyperparameter config of trial8f5ae_00000:
[2m[36m(TunerInternal pid=61223)[0m train_loop_config : 
[2m[36m(TunerInternal pid=61223)[0m     lr : 0.05 --- (* 1.2) --> 0.06
[2m[36m(TunerInternal pid=61223)[0m     momentum : 0.8 --- (resample) --> 0.9
[2m[36m(TunerInternal pid=61223)[0m 


[2m[36m(TunerInternal pid=61223)[0m Result for TorchTrainer_8f5ae_00000:
[2m[36m(TunerInternal pid=61223)[0m   _time_this_iter_s: 15.14406132698059
[2m[36m(TunerInternal pid=61223)[0m   _timestamp: 1677900170
[2m[36m(TunerInternal pid=61223)[0m   _training_iteration: 1
[2m[36m(TunerInternal pid=61223)[0m   date: 2023-03-03_19-22-51
[2m[36m(TunerInternal pid=61223)[0m   done: false
[2m[36m(TunerInternal pid=61223)[0m   experiment_id: 3bc6f6f6e6f14ed7acb0989850ce3a74
[2m[36m(TunerInternal pid=61223)[0m   hostname: nid003045
[2m[36m(TunerInternal pid=61223)[0m   iterations_since_restore: 1
[2m[36m(TunerInternal pid=61223)[0m   loss: 1.9948448538780212
[2m[36m(TunerInternal pid=61223)[0m   node_ip: 10.249.19.154
[2m[36m(TunerInternal pid=61223)[0m   pid: 100861
[2m[36m(TunerInternal pid=61223)[0m   should_checkpoint: true
[2m[36m(TunerInternal pid=61223)[0m   time_since_restore: 19.860088348388672
[2m[36m(TunerInternal pid=61223)[0m   time_this_

[2m[36m(TorchTrainer pid=66486)[0m 2023-03-03 19:22:54,203	INFO trainable.py:791 -- Restored on nid003044 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00000_0_lr=0.0010_2023-03-03_19-20-49/checkpoint_tmp2525d5
[2m[36m(TorchTrainer pid=66486)[0m 2023-03-03 19:22:54,203	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 18.63059902191162, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=66620)[0m 2023-03-03 19:22:56,676	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=66620)[0m 2023-03-03 19:23:00,000	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=101469, ip=128.55.69.178)[0m 2023-03-03 19:22:59,993	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=101469, ip=128.55.69.178)[0m 2023-03-03 19:23:01,560	INFO

[2m[36m(RayTrainWorker pid=66620)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=66622)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=66625)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=66624)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=101472, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=101471, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=101469, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=101470, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=66625)[0m loss: 1.564933  [    0/ 6250]
[2m[36m(RayTrainWorker pid=66622)[0m loss: 1.461504  [    0/ 6250]
[2m[36m(RayTrainWorker pid=66624)[0m loss: 1.509070  [    0/ 6250]
[2m[36m(RayTrainWorker pid=66620)[0m loss: 1.462853  [    0/ 6250]
[2m[36m(RayTrainWo

[2m[36m(TorchTrainer pid=67450)[0m 2023-03-03 19:23:17,222	INFO trainable.py:791 -- Restored on nid003044 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00001_1_lr=0.0100_2023-03-03_19-21-18/checkpoint_tmp1b7202
[2m[36m(TorchTrainer pid=67450)[0m 2023-03-03 19:23:17,222	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 18.498899221420288, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=67592)[0m 2023-03-03 19:23:19,877	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=67592)[0m 2023-03-03 19:23:23,308	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=101998, ip=128.55.69.178)[0m 2023-03-03 19:23:23,364	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=101998, ip=128.55.69.178)[0m 2023-03-03 19:23:24,929	INF

[2m[36m(RayTrainWorker pid=67592)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=102000, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=101999, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=67594)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=102001, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=101998, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=67595)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=67593)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=67595)[0m loss: 1.654110  [    0/ 6250]
[2m[36m(RayTrainWorker pid=67594)[0m loss: 1.756918  [    0/ 6250]
[2m[36m(RayTrainWorker pid=67592)[0m loss: 1.708636  [    0/ 6250]
[2m[36m(RayTrainWorker pid=67593)[0m loss: 1.584543  [    0/ 6250]
[2m[36m(RayTrainWo

[2m[36m(TorchTrainer pid=68415)[0m 2023-03-03 19:23:40,219	INFO trainable.py:791 -- Restored on nid003044 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00002_2_lr=0.0500_2023-03-03_19-21-40/checkpoint_tmp6c8151
[2m[36m(TorchTrainer pid=68415)[0m 2023-03-03 19:23:40,219	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 18.63059902191162, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=68568)[0m 2023-03-03 19:23:42,740	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=68568)[0m 2023-03-03 19:23:46,097	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=102467, ip=128.55.69.178)[0m 2023-03-03 19:23:46,065	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=102467, ip=128.55.69.178)[0m 2023-03-03 19:23:47,571	INFO

[2m[36m(RayTrainWorker pid=102468, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=68571)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=102469, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=68569)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=102467, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=68570)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=68568)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=102470, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=102468, ip=128.55.69.178)[0m loss: 1.514647  [    0/ 6250]
[2m[36m(RayTrainWorker pid=102470, ip=128.55.69.178)[0m loss: 1.545555  [    0/ 6250]
[2m[36m(RayTrainWorker pid=102469, ip=128.55.69.178)[0m loss: 1.541115  [    0/ 6250]
[2m[36m(RayTrainWorker pid=1024

[2m[36m(TorchTrainer pid=69448)[0m 2023-03-03 19:24:04,176	INFO trainable.py:791 -- Restored on nid003044 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00003_3_lr=0.1000_2023-03-03_19-22-02/checkpoint_tmp6aeefd
[2m[36m(TorchTrainer pid=69448)[0m 2023-03-03 19:24:04,176	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 18.72793674468994, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=69597)[0m 2023-03-03 19:24:06,888	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=103011, ip=128.55.69.178)[0m 2023-03-03 19:24:10,224	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=69597)[0m 2023-03-03 19:24:10,231	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=103011, ip=128.55.69.178)[0m 2023-03-03 19:24:11,788	INFO

[2m[36m(RayTrainWorker pid=103012, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=103013, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=103014, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=103011, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=69599)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=69600)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=69597)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=69598)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=103013, ip=128.55.69.178)[0m loss: 1.696166  [    0/ 6250]
[2m[36m(RayTrainWorker pid=103014, ip=128.55.69.178)[0m loss: 1.796499  [    0/ 6250]
[2m[36m(RayTrainWorker pid=103012, ip=128.55.69.178)[0m loss: 1.820550  [    0/ 6250]
[2m[36m(RayTrainWorker pid=1030

[2m[36m(TunerInternal pid=61223)[0m 2023-03-03 19:24:24,313	INFO pbt.py:804 -- 
[2m[36m(TunerInternal pid=61223)[0m 
[2m[36m(TunerInternal pid=61223)[0m [PopulationBasedTraining] [Exploit] Cloning trial 8f5ae_00002 (score = -1.303070) into trial 8f5ae_00003 (score = -1.515052)
[2m[36m(TunerInternal pid=61223)[0m 
[2m[36m(TunerInternal pid=61223)[0m 2023-03-03 19:24:24,314	INFO pbt.py:831 -- 
[2m[36m(TunerInternal pid=61223)[0m 
[2m[36m(TunerInternal pid=61223)[0m [PopulationBasedTraining] [Explore] Perturbed the hyperparameter config of trial8f5ae_00003:
[2m[36m(TunerInternal pid=61223)[0m train_loop_config : 
[2m[36m(TunerInternal pid=61223)[0m     lr : 0.05 --- (* 0.8) --> 0.04000000000000001
[2m[36m(TunerInternal pid=61223)[0m     momentum : 0.8 --- (shift left (noop)) --> 0.8
[2m[36m(TunerInternal pid=61223)[0m 


[2m[36m(TunerInternal pid=61223)[0m Result for TorchTrainer_8f5ae_00003:
[2m[36m(TunerInternal pid=61223)[0m   _time_this_iter_s: 15.373709678649902
[2m[36m(TunerInternal pid=61223)[0m   _timestamp: 1677900264
[2m[36m(TunerInternal pid=61223)[0m   _training_iteration: 1
[2m[36m(TunerInternal pid=61223)[0m   date: 2023-03-03_19-24-24
[2m[36m(TunerInternal pid=61223)[0m   done: false
[2m[36m(TunerInternal pid=61223)[0m   experiment_id: d5bbb6808c4c4baba6fd4d28281a9a8e
[2m[36m(TunerInternal pid=61223)[0m   hostname: nid003044
[2m[36m(TunerInternal pid=61223)[0m   iterations_since_restore: 1
[2m[36m(TunerInternal pid=61223)[0m   loss: 1.515051782131195
[2m[36m(TunerInternal pid=61223)[0m   node_ip: nid003044
[2m[36m(TunerInternal pid=61223)[0m   pid: 69448
[2m[36m(TunerInternal pid=61223)[0m   should_checkpoint: true
[2m[36m(TunerInternal pid=61223)[0m   time_since_restore: 20.13590693473816
[2m[36m(TunerInternal pid=61223)[0m   time_this_iter_s

[2m[36m(TorchTrainer pid=103480, ip=128.55.69.178)[0m 2023-03-03 19:24:27,891	INFO trainable.py:791 -- Restored on 10.249.19.154 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00000_0_lr=0.0010_2023-03-03_19-20-49/checkpoint_tmp683cd2
[2m[36m(TorchTrainer pid=103480, ip=128.55.69.178)[0m 2023-03-03 19:24:27,891	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 2, '_timesteps_total': None, '_time_total': 38.65552496910095, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=103577, ip=128.55.69.178)[0m 2023-03-03 19:24:30,537	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=103577, ip=128.55.69.178)[0m 2023-03-03 19:24:33,983	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=70572)[0m 2023-03-03 19:24:33,947	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker p

[2m[36m(RayTrainWorker pid=70575)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=70573)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=70574)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=70572)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=103577, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=103578, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=103579, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=103580, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=70575)[0m loss: 1.464871  [    0/ 6250]
[2m[36m(RayTrainWorker pid=70573)[0m loss: 1.346256  [    0/ 6250]
[2m[36m(RayTrainWorker pid=70574)[0m loss: 1.403846  [    0/ 6250]
[2m[36m(RayTrainWorker pid=70572)[0m loss: 1.382998  [    0/ 6250]
[2m[36m(RayTrainWo

[2m[36m(TorchTrainer pid=71464)[0m 2023-03-03 19:24:52,221	INFO trainable.py:791 -- Restored on nid003044 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00001_1_lr=0.0100_2023-03-03_19-21-18/checkpoint_tmpc678c2
[2m[36m(TorchTrainer pid=71464)[0m 2023-03-03 19:24:52,221	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 2, '_timesteps_total': None, '_time_total': 38.698747396469116, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=71627)[0m 2023-03-03 19:24:54,754	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=104206, ip=128.55.69.178)[0m 2023-03-03 19:24:58,057	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=71627)[0m 2023-03-03 19:24:58,011	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=71627)[0m 2023-03-03 19:24:59,548	INFO train_loop_utils.

[2m[36m(RayTrainWorker pid=104207, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=71631)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=71630)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=71629)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=71627)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=104206, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=104209, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=104208, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=104208, ip=128.55.69.178)[0m loss: 1.521926  [    0/ 6250]
[2m[36m(RayTrainWorker pid=104207, ip=128.55.69.178)[0m loss: 1.589796  [    0/ 6250]
[2m[36m(RayTrainWorker pid=104206, ip=128.55.69.178)[0m loss: 1.441181  [    0/ 6250]
[2m[36m(RayTrainWorker pid=1042

[2m[36m(TorchTrainer pid=72529)[0m 2023-03-03 19:25:15,692	INFO trainable.py:791 -- Restored on nid003044 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00002_2_lr=0.0500_2023-03-03_19-21-40/checkpoint_tmp517b40
[2m[36m(TorchTrainer pid=72529)[0m 2023-03-03 19:25:15,692	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 2, '_timesteps_total': None, '_time_total': 38.46820569038391, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=72714)[0m 2023-03-03 19:25:18,440	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=104674, ip=128.55.69.178)[0m 2023-03-03 19:25:21,744	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=72714)[0m 2023-03-03 19:25:21,784	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=104674, ip=128.55.69.178)[0m 2023-03-03 19:25:23,274	INFO

[2m[36m(RayTrainWorker pid=104674, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=104675, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=72714)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=72717)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=104677, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=104676, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=72718)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=72716)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=104677, ip=128.55.69.178)[0m loss: 1.434514  [    0/ 6250]
[2m[36m(RayTrainWorker pid=104676, ip=128.55.69.178)[0m loss: 1.328148  [    0/ 6250]
[2m[36m(RayTrainWorker pid=104675, ip=128.55.69.178)[0m loss: 1.252781  [    0/ 6250]
[2m[36m(RayTrainWorker pid=1046

[2m[36m(TorchTrainer pid=73697)[0m 2023-03-03 19:25:39,220	INFO trainable.py:791 -- Restored on nid003044 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00003_3_lr=0.1000_2023-03-03_19-22-02/checkpoint_tmp259468
[2m[36m(TorchTrainer pid=73697)[0m 2023-03-03 19:25:39,220	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 2, '_timesteps_total': None, '_time_total': 38.46820569038391, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=73846)[0m 2023-03-03 19:25:41,755	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=73846)[0m 2023-03-03 19:25:45,195	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=105143, ip=128.55.69.178)[0m 2023-03-03 19:25:45,235	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=73846)[0m 2023-03-03 19:25:46,777	INFO train_loop_utils.p

[2m[36m(RayTrainWorker pid=105144, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=105146, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=73846)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=73847)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=105143, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=105145, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=73849)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=73848)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=73847)[0m loss: 1.261367  [    0/ 6250]
[2m[36m(RayTrainWorker pid=73849)[0m loss: 1.346612  [    0/ 6250]
[2m[36m(RayTrainWorker pid=73848)[0m loss: 1.294568  [    0/ 6250]
[2m[36m(RayTrainWorker pid=73846)[0m loss: 1.314752  [    0/ 6250]
[2m[36m(RayTrainWo

[2m[36m(TorchTrainer pid=105642, ip=128.55.69.178)[0m 2023-03-03 19:26:02,316	INFO trainable.py:791 -- Restored on 10.249.19.154 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00000_0_lr=0.0010_2023-03-03_19-20-49/checkpoint_tmpaf2927
[2m[36m(TorchTrainer pid=105642, ip=128.55.69.178)[0m 2023-03-03 19:26:02,316	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 3, '_timesteps_total': None, '_time_total': 58.79319167137146, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=105742, ip=128.55.69.178)[0m 2023-03-03 19:26:05,119	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=105742, ip=128.55.69.178)[0m 2023-03-03 19:26:08,481	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=74922)[0m 2023-03-03 19:26:08,494	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker p

[2m[36m(RayTrainWorker pid=105742, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=74924)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=105743, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=105745, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=74923)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=74922)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=105744, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=74925)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=74922)[0m loss: 1.349034  [    0/ 6250]
[2m[36m(RayTrainWorker pid=74924)[0m loss: 1.310535  [    0/ 6250]
[2m[36m(RayTrainWorker pid=74925)[0m loss: 1.201815  [    0/ 6250]
[2m[36m(RayTrainWorker pid=74923)[0m loss: 1.107988  [    0/ 6250]
[2m[36m(RayTrainWo

[2m[36m(TorchTrainer pid=75785)[0m 2023-03-03 19:26:27,498	INFO trainable.py:791 -- Restored on nid003044 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00001_1_lr=0.0100_2023-03-03_19-21-18/checkpoint_tmpf6c962
[2m[36m(TorchTrainer pid=75785)[0m 2023-03-03 19:26:27,498	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 3, '_timesteps_total': None, '_time_total': 58.453394174575806, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=75931)[0m 2023-03-03 19:26:30,443	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=106324, ip=128.55.69.178)[0m 2023-03-03 19:26:33,880	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=75931)[0m 2023-03-03 19:26:33,880	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=106324, ip=128.55.69.178)[0m 2023-03-03 19:26:35,464	INF

[2m[36m(RayTrainWorker pid=106326, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=75932)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=75933)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=75931)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=106325, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=106324, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=106327, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=75934)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=75934)[0m loss: 1.496544  [    0/ 6250]
[2m[36m(RayTrainWorker pid=75933)[0m loss: 1.306821  [    0/ 6250]
[2m[36m(RayTrainWorker pid=75932)[0m loss: 1.372525  [    0/ 6250]
[2m[36m(RayTrainWorker pid=75931)[0m loss: 1.321198  [    0/ 6250]
[2m[36m(RayTrainWo

[2m[36m(TunerInternal pid=61223)[0m 2023-03-03 19:26:47,897	INFO pbt.py:804 -- 
[2m[36m(TunerInternal pid=61223)[0m 
[2m[36m(TunerInternal pid=61223)[0m [PopulationBasedTraining] [Exploit] Cloning trial 8f5ae_00000 (score = -1.043841) into trial 8f5ae_00001 (score = -1.271376)
[2m[36m(TunerInternal pid=61223)[0m 
[2m[36m(TunerInternal pid=61223)[0m 2023-03-03 19:26:47,897	INFO pbt.py:831 -- 
[2m[36m(TunerInternal pid=61223)[0m 
[2m[36m(TunerInternal pid=61223)[0m [PopulationBasedTraining] [Explore] Perturbed the hyperparameter config of trial8f5ae_00001:
[2m[36m(TunerInternal pid=61223)[0m train_loop_config : 
[2m[36m(TunerInternal pid=61223)[0m     lr : 0.06 --- (* 1.2) --> 0.072
[2m[36m(TunerInternal pid=61223)[0m     momentum : 0.9 --- (resample) --> 0.9
[2m[36m(TunerInternal pid=61223)[0m 


[2m[36m(TunerInternal pid=61223)[0m Result for TorchTrainer_8f5ae_00001:
[2m[36m(TunerInternal pid=61223)[0m   _time_this_iter_s: 15.301666021347046
[2m[36m(TunerInternal pid=61223)[0m   _timestamp: 1677900407
[2m[36m(TunerInternal pid=61223)[0m   _training_iteration: 1
[2m[36m(TunerInternal pid=61223)[0m   date: 2023-03-03_19-26-47
[2m[36m(TunerInternal pid=61223)[0m   done: false
[2m[36m(TunerInternal pid=61223)[0m   experiment_id: 6a7c912915ea446c99c8693563996683
[2m[36m(TunerInternal pid=61223)[0m   hostname: nid003044
[2m[36m(TunerInternal pid=61223)[0m   iterations_since_restore: 1
[2m[36m(TunerInternal pid=61223)[0m   loss: 1.271375799179077
[2m[36m(TunerInternal pid=61223)[0m   node_ip: nid003044
[2m[36m(TunerInternal pid=61223)[0m   pid: 75785
[2m[36m(TunerInternal pid=61223)[0m   should_checkpoint: true
[2m[36m(TunerInternal pid=61223)[0m   time_since_restore: 20.396260023117065
[2m[36m(TunerInternal pid=61223)[0m   time_this_iter_

[2m[36m(TorchTrainer pid=76751)[0m 2023-03-03 19:26:51,221	INFO trainable.py:791 -- Restored on nid003044 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00002_2_lr=0.0500_2023-03-03_19-21-40/checkpoint_tmp55a02f
[2m[36m(TorchTrainer pid=76751)[0m 2023-03-03 19:26:51,221	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 3, '_timesteps_total': None, '_time_total': 58.759873151779175, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=76922)[0m 2023-03-03 19:26:53,689	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=76922)[0m 2023-03-03 19:26:57,225	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=106833, ip=128.55.69.178)[0m 2023-03-03 19:26:57,244	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=76922)[0m 2023-03-03 19:26:58,771	INFO train_loop_utils.

[2m[36m(RayTrainWorker pid=106834, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=76924)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=76925)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=106833, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=76922)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=106836, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=106835, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=76923)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=76923)[0m loss: 1.227200  [    0/ 6250]
[2m[36m(RayTrainWorker pid=76922)[0m loss: 1.102683  [    0/ 6250]
[2m[36m(RayTrainWorker pid=76925)[0m loss: 1.293283  [    0/ 6250]
[2m[36m(RayTrainWorker pid=76924)[0m loss: 1.176510  [    0/ 6250]
[2m[36m(RayTrainWo

[2m[36m(TorchTrainer pid=107303, ip=128.55.69.178)[0m 2023-03-03 19:27:14,878	INFO trainable.py:791 -- Restored on 10.249.19.154 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00003_3_lr=0.1000_2023-03-03_19-22-02/checkpoint_tmpd44d41
[2m[36m(TorchTrainer pid=107303, ip=128.55.69.178)[0m 2023-03-03 19:27:14,878	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 3, '_timesteps_total': None, '_time_total': 58.60694169998169, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=107400, ip=128.55.69.178)[0m 2023-03-03 19:27:17,763	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=107400, ip=128.55.69.178)[0m 2023-03-03 19:27:21,092	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=77813)[0m 2023-03-03 19:27:21,168	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker p

[2m[36m(RayTrainWorker pid=107400, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=107401, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=77815)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=77813)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=77814)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=107402, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=77816)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=107403, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=107403, ip=128.55.69.178)[0m loss: 1.278616  [    0/ 6250]
[2m[36m(RayTrainWorker pid=107402, ip=128.55.69.178)[0m loss: 1.252916  [    0/ 6250]
[2m[36m(RayTrainWorker pid=107401, ip=128.55.69.178)[0m loss: 1.092820  [    0/ 6250]
[2m[36m(RayTrainWorker pid=1074

[2m[36m(TorchTrainer pid=78729)[0m 2023-03-03 19:27:39,299	INFO trainable.py:791 -- Restored on nid003044 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00000_0_lr=0.0010_2023-03-03_19-20-49/checkpoint_tmp77ebb4
[2m[36m(TorchTrainer pid=78729)[0m 2023-03-03 19:27:39,299	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 4, '_timesteps_total': None, '_time_total': 79.13151025772095, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=78867)[0m 2023-03-03 19:27:41,820	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=78867)[0m 2023-03-03 19:27:45,178	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=107977, ip=128.55.69.178)[0m 2023-03-03 19:27:45,126	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=107977, ip=128.55.69.178)[0m 2023-03-03 19:27:46,684	INFO

[2m[36m(RayTrainWorker pid=107977, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=78869)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=107978, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=78867)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=78870)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=107980, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=78868)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=107979, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=107979, ip=128.55.69.178)[0m loss: 1.043445  [    0/ 6250]
[2m[36m(RayTrainWorker pid=107980, ip=128.55.69.178)[0m loss: 0.997631  [    0/ 6250]
[2m[36m(RayTrainWorker pid=107978, ip=128.55.69.178)[0m loss: 1.033666  [    0/ 6250]
[2m[36m(RayTrainWorker pid=1079

[2m[36m(TorchTrainer pid=108482, ip=128.55.69.178)[0m 2023-03-03 19:28:02,278	INFO trainable.py:791 -- Restored on 10.249.19.154 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00001_1_lr=0.0100_2023-03-03_19-21-18/checkpoint_tmp7b9bc9
[2m[36m(TorchTrainer pid=108482, ip=128.55.69.178)[0m 2023-03-03 19:28:02,278	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 4, '_timesteps_total': None, '_time_total': 79.13151025772095, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=108581, ip=128.55.69.178)[0m 2023-03-03 19:28:04,784	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=108581, ip=128.55.69.178)[0m 2023-03-03 19:28:08,187	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=79770)[0m 2023-03-03 19:28:08,197	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker p

[2m[36m(RayTrainWorker pid=108584, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=79771)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=108583, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=79774)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=108582, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=79770)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=79772)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=108581, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=79770)[0m loss: 1.079005  [    0/ 6250]
[2m[36m(RayTrainWorker pid=79774)[0m loss: 1.201817  [    0/ 6250]
[2m[36m(RayTrainWorker pid=79771)[0m loss: 1.013621  [    0/ 6250]
[2m[36m(RayTrainWorker pid=79772)[0m loss: 0.977002  [    0/ 6250]
[2m[36m(RayTrainWo

[2m[36m(TorchTrainer pid=80619)[0m 2023-03-03 19:28:27,252	INFO trainable.py:791 -- Restored on nid003044 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00002_2_lr=0.0500_2023-03-03_19-21-40/checkpoint_tmpf40f8a
[2m[36m(TorchTrainer pid=80619)[0m 2023-03-03 19:28:27,252	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 4, '_timesteps_total': None, '_time_total': 79.00595831871033, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=80753)[0m 2023-03-03 19:28:29,786	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=80753)[0m 2023-03-03 19:28:33,048	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=109245, ip=128.55.69.178)[0m 2023-03-03 19:28:33,058	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=109245, ip=128.55.69.178)[0m 2023-03-03 19:28:34,578	INFO

[2m[36m(RayTrainWorker pid=109245, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=109247, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=80756)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=80754)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=80753)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=80755)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=109248, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=109246, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=109245, ip=128.55.69.178)[0m loss: 1.069553  [    0/ 6250]
[2m[36m(RayTrainWorker pid=109248, ip=128.55.69.178)[0m loss: 1.230193  [    0/ 6250]
[2m[36m(RayTrainWorker pid=109247, ip=128.55.69.178)[0m loss: 1.148094  [    0/ 6250]
[2m[36m(RayTrainWorker pid=1092

[2m[36m(TorchTrainer pid=81595)[0m 2023-03-03 19:28:50,685	INFO trainable.py:791 -- Restored on nid003044 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00003_3_lr=0.1000_2023-03-03_19-22-02/checkpoint_tmpba0165
[2m[36m(TorchTrainer pid=81595)[0m 2023-03-03 19:28:50,686	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 4, '_timesteps_total': None, '_time_total': 79.01619005203247, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=81741)[0m 2023-03-03 19:28:53,614	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=8]
[2m[36m(RayTrainWorker pid=81741)[0m 2023-03-03 19:28:57,087	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=109753, ip=128.55.69.178)[0m 2023-03-03 19:28:57,067	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=109753, ip=128.55.69.178)[0m 2023-03-03 19:28:58,582	INFO

[2m[36m(RayTrainWorker pid=109754, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=109753, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=109755, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=81743)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=81741)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=81742)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=81744)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=109756, ip=128.55.69.178)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=109754, ip=128.55.69.178)[0m loss: 1.128427  [    0/ 6250]
[2m[36m(RayTrainWorker pid=109753, ip=128.55.69.178)[0m loss: 1.078292  [    0/ 6250]
[2m[36m(RayTrainWorker pid=109755, ip=128.55.69.178)[0m loss: 1.117896  [    0/ 6250]
[2m[36m(RayTrainWorker pid=1097

[2m[36m(TunerInternal pid=61223)[0m 2023-03-03 19:29:11,126	INFO tune.py:798 -- Total run time: 502.20 seconds (502.13 seconds for the tuning loop).


In [38]:
print(results.get_best_result(metric="loss", mode="min"))

Result(metrics={'loss': 0.9854333639144898, '_timestamp': 1677900479, '_time_this_iter_s': 15.428494215011597, '_training_iteration': 1, 'should_checkpoint': True, 'done': True, 'trial_id': '8f5ae_00000', 'experiment_tag': '0_lr=0.0010@perturbed[train_loop_config=lr_0_06_momentum_0_9_batch_size_1024_test_mode_False_data_dir_pscratch_sd_a_asnaylor_CIFAR10_epochs_5]'}, error=None, log_dir=PosixPath('/global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-03_19-20-47/TorchTrainer_8f5ae_00000_0_lr=0.0010_2023-03-03_19-20-49'))
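
The `@perturbed[...]` suffix in the `experiment_tag` above and the `[Explore]` messages in the log both come from the PBT explore step. The following is a simplified sketch of that step, written for illustration only; it is not Ray's implementation. The `* 0.8` / `* 1.2` factors and the "shift" and "resample" behaviours are taken from the log messages. The function name `explore`, the candidate list `[0.8, 0.9, 0.99]` for momentum, and the log-uniform sampler for `lr` are assumptions.

```python
import random


def explore(config, mutations, resample_probability=0.25, rng=random):
    """Return a perturbed copy of `config` (simplified PBT explore step)."""
    new_config = dict(config)
    for key, spec in mutations.items():
        if isinstance(spec, list):
            # Categorical choice: resample, or shift to a neighbouring entry.
            if rng.random() < resample_probability or config[key] not in spec:
                new_config[key] = rng.choice(spec)
            else:
                index = spec.index(config[key])
                step = -1 if rng.random() > 0.5 else 1
                # Clamping at the ends gives the "(noop)" shifts seen in the log.
                new_index = min(max(index + step, 0), len(spec) - 1)
                new_config[key] = spec[new_index]
        else:
            # Continuous value: resample from the sampler, or scale by 0.8 / 1.2.
            if rng.random() < resample_probability:
                new_config[key] = spec()
            else:
                factor = 1.2 if rng.random() > 0.5 else 0.8
                new_config[key] = config[key] * factor
    return new_config


rng = random.Random(0)
# Assumed search space: log-uniform lr sampler and a list of momentum values.
mutations = {
    "lr": lambda: 10 ** rng.uniform(-3, 0),
    "momentum": [0.8, 0.9, 0.99],
}
# Config of the cloned trial, as reported in the first [Explore] message.
donor_config = {"lr": 0.05, "momentum": 0.8}

new_config = explore(donor_config, mutations, resample_probability=0.0, rng=rng)
print(new_config)
```

With `resample_probability=0.0` the learning rate can only be scaled by 0.8 or 1.2, and a momentum of 0.8 can only stay at 0.8 or move to 0.9, which matches the kind of changes reported in the log.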


## Close cluster connection and stop job

In [39]:
ray.shutdown()

In [40]:
sfp_api.delete_job(site, job['jobid'])

{'task_id': '0', 'status': 'OK', 'error': None}
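
If the disconnect step fails, the batch job keeps running and keeps consuming allocation. One possible way to guard against that, sketched here under assumptions, is to wrap both steps so the cancellation always runs and its response is checked. The helper name `stop_cluster` and the stand-in functions are hypothetical; in the notebook the two callables would be `ray.shutdown` and a function wrapping `sfp_api.delete_job(site, job['jobid'])`.

```python
def stop_cluster(disconnect, cancel_job):
    """Disconnect from the cluster, then cancel the job even if disconnect fails."""
    try:
        disconnect()
    finally:
        # Always attempt the cancellation so the job does not linger.
        response = cancel_job()
    # The response shape follows the output shown above.
    if response.get("status") != "OK":
        raise RuntimeError(f"Job cancellation failed: {response.get('error')}")
    return response


# Stand-ins so the sketch runs without a cluster (hypothetical).
calls = []


def fake_disconnect():
    calls.append("disconnect")


def fake_cancel():
    calls.append("cancel")
    return {"task_id": "0", "status": "OK", "error": None}


response = stop_cluster(fake_disconnect, fake_cancel)
print(calls, response["status"])

# In the notebook this might look like (assumption):
# stop_cluster(ray.shutdown, lambda: sfp_api.delete_job(site, job['jobid']))
```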

## Explore Training in TensorBoard

In [39]:
import nersc_tensorboard_helper
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [40]:
log_dir = str(results.get_best_result(metric="loss", mode="min").log_dir)

In [41]:
%tensorboard --logdir $log_dir --port 0

In [42]:
nersc_tensorboard_helper.tb_address()