# NERSC Cluster Deploy Tutorial: Tuning Hyperparameters of a Distributed PyTorch Model with PBT using Ray Train & Tune

📖 [Back to Table of Contents](../README.md)<br>

----


## Introduction

This tutorial runs an example that combines Ray Train and Ray Tune: tuning the hyperparameters of a distributed PyTorch model with Population Based Training (PBT). It follows the code in this Ray example: https://docs.ray.io/en/latest/train/examples/pytorch/tune_cifar_torch_pbt_example.html

> **Note**:
> To set up the environment for the notebook, run `./setup.sh 2` on the command line, then select the `pytorch-1.13.1` kernel in the notebook.

The Ray cluster is set up using the NERSC PyTorch module and deployed on Perlmutter.



# Starting Ray Cluster

## Superfacility API

To deploy the Ray cluster via the NERSC Superfacility API you require a valid API client. 

To create a valid client visit your profile page in [Iris](https://iris.nersc.gov/):

<img src="img/iris_profile_header.png" width="800" />

Then scroll down to the **Superfacility API Clients** section and click the "+ New Client" button, which opens this window:

<img src="img/new_sf_api_client.png" width="400" />

Submitting and deploying a Ray cluster requires the highest security level (<span style="color:red">RED</span>). **A client at this level is valid for 2 days.**

Once the client is created, save the `client_id` string and the `private_key` dictionary (you can also save the private key in PEM format) for use with the `SuperfacilityAPI` library.

> **Note**:
> Repeat this step only if your client has expired.


For more information about the NERSC Superfacility API, see the [documentation](https://docs.nersc.gov/services/sfapi/).

In [1]:
from SuperfacilityAPI import SuperfacilityAPI, SuperfacilityAccessToken
from utility import load_secrets

# Replace with your client id string and private key dictionary
client_id, private_key = load_secrets()
# client_id = "<your client id string>"
# private_key = "<your private key dict>"

api_key = SuperfacilityAccessToken(
    client_id = client_id,
    private_key = private_key
)
sfp_api = SuperfacilityAPI(api_key)
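
The `load_secrets` helper comes from this repository's `utility` module, and its implementation is not shown in this tutorial. As a rough sketch of what a helper of this kind might do, the following reads a client id and a PEM private key from two files. The function name `read_sfapi_secrets` and the file names `client_id.txt` and `priv_key.pem` are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def read_sfapi_secrets(secrets_dir):
    """Return (client_id, private_key_pem) read from two files in secrets_dir.

    Hypothetical layout (an assumption, not the repository's actual one):
      <secrets_dir>/client_id.txt  - the client id string
      <secrets_dir>/priv_key.pem   - the private key in PEM format
    """
    secrets_dir = Path(secrets_dir).expanduser()
    client_id = (secrets_dir / "client_id.txt").read_text().strip()
    private_key = (secrets_dir / "priv_key.pem").read_text()

    # Fail early with a clear message instead of sending bad credentials.
    if not client_id:
        raise ValueError("client_id.txt is empty")
    if "PRIVATE KEY" not in private_key:
        raise ValueError("priv_key.pem does not look like a PEM private key")
    return client_id, private_key
```

Whatever layout you choose, keep these files out of version control and readable only by your own user.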

## Creating Ray Cluster

To create a Ray cluster on NERSC compute nodes, call the `deploy_ray_cluster` function with your desired Slurm `sbatch` options.

In [2]:
from nersc_cluster_deploy import deploy_ray_cluster
from utility import user_account

slurm_options = {
    'qos': 'debug',
    'account': user_account(),
    'nodes': '3',
    't': '00:30:00'
}
site = 'perlmutter'
module_load = 'pytorch/1.13.1'

job = deploy_ray_cluster(
    sfp_api,
    slurm_options,
    site,
    job_setup = [f'module load {module_load}']
)
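
To make the shape of `slurm_options` more concrete, the sketch below renders such a dictionary as `#SBATCH` directives. It is only an illustration: how `deploy_ray_cluster` actually builds the batch script is not shown in this tutorial. The helper name `render_sbatch_directives` is hypothetical, the account `m0000` is a placeholder, and the mapping (single-letter keys become short flags, longer keys become long options) is an assumption based on standard `sbatch` syntax.

```python
def render_sbatch_directives(options):
    """Render a dict of sbatch options as '#SBATCH' lines (illustrative only)."""
    lines = []
    for key, value in options.items():
        if len(key) == 1:
            # Short option, e.g. 't' -> '-t 00:30:00'
            lines.append(f"#SBATCH -{key} {value}")
        else:
            # Long option, e.g. 'qos' -> '--qos=debug'
            lines.append(f"#SBATCH --{key}={value}")
    return lines


# 'm0000' is a placeholder account name, not a real project.
example_options = {"qos": "debug", "account": "m0000", "nodes": "3", "t": "00:30:00"}
for line in render_sbatch_directives(example_options):
    print(line)
```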

Now that the job has been submitted, check its status:

In [8]:
import os
import pandas as pd
sqs_table = sfp_api.get_jobs(site=site, user=os.getlogin(), sacct=False)
sqs_df = pd.DataFrame(sqs_table['output'])
sqs_df

| | account | tres_per_node | min_cpus | min_tmp_disk | end_time | features | group | over_subscribe | jobid | name | ... | partition | nodelist(reason) | start_time | state | uid | submit_time | licenses | core_spec | schednodes | work_dir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dasrepo_g | | 128 | 0 | 2023-03-21T10:54:12 | gpu&a100&hbm40g | 75235 | NO | 6304042 | sbatch | ... | gpu_ss11 | nid[001164,001361,001364] | 2023-03-21T10:24:12 | RUNNING | 75235 | 2023-03-21T10:19:37 | u2:1 | | (null) | /global/u2/a/asnaylor |
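
A debug-queue job may sit in `PENDING` for a while before it reaches `RUNNING`. Rather than re-running the status cell by hand, one option is a small polling loop. The sketch below is deliberately independent of the Superfacility API: `wait_for_state` is a hypothetical helper that takes any zero-argument callable returning the current Slurm state string, so how you obtain that string (for example from the `state` column above) is left to you.

```python
import time


def wait_for_state(get_state, target="RUNNING",
                   failed=("FAILED", "CANCELLED", "TIMEOUT"),
                   poll_seconds=10.0, max_polls=60, sleep=time.sleep):
    """Poll get_state() until it returns `target`.

    get_state: zero-argument callable returning the job state as a string.
    Raises RuntimeError if the job ends in one of the `failed` states and
    TimeoutError if `target` is not seen within `max_polls` polls.
    """
    for _ in range(max_polls):
        state = get_state()
        if state == target:
            return state
        if state in failed:
            raise RuntimeError(f"job ended in state {state}")
        sleep(poll_seconds)
    raise TimeoutError(f"job did not reach {target} after {max_polls} polls")


# Stand-in for a real status query: PENDING twice, then RUNNING.
fake_states = iter(["PENDING", "PENDING", "RUNNING"])
print(wait_for_state(lambda: next(fake_states), sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the helper easy to try out without actually waiting.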


Check the job log:

In [9]:
!cat ~/slurm-{job['jobid']}.out

In case of issues, please refer to our known issues: https://docs.nersc.gov/current/
and open a help ticket if your issue is not listed: https://help.nersc.gov/
[slurm] - Starting Ray HEAD
2023-03-21 10:24:24,850	INFO usage_lib.py:435 -- Usage stats collection is disabled.
2023-03-21 10:24:24,850	INFO scripts.py:710 -- Local node IP: nid001164
2023-03-21 10:24:26,520	SUCC scripts.py:747 -- --------------------
2023-03-21 10:24:26,520	SUCC scripts.py:748 -- Ray runtime started.
2023-03-21 10:24:26,520	SUCC scripts.py:749 -- --------------------
2023-03-21 10:24:26,520	INFO scripts.py:751 -- Next steps
2023-03-21 10:24:26,520	INFO scripts.py:752 -- To connect to this Ray runtime from another node, run
2023-03-21 10:24:26,520	INFO scripts.py:755 --   ray start --address='nid001164:6379'
2023-03-21 10:24:26,520	INFO scripts.py:771 -- Alternatively, use the following Python code:
2023-03-21 10:24:26,520	INFO scripts.py:773 
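
The log above contains the head node address in the `ray start --address='...'` line. This tutorial later obtains the address with `get_ray_cluster_address`, so the following is only an illustration of what that log line carries. The helper name `find_head_address` and the regular expression are assumptions, not part of `nersc_cluster_deploy`.

```python
import re

# Matches the host and port in: ray start --address='nid001164:6379'
ADDRESS_PATTERN = re.compile(r"ray start --address='([^':]+):(\d+)'")


def find_head_address(log_text):
    """Return (host, port) from a Ray head-node log, or None if not found."""
    match = ADDRESS_PATTERN.search(log_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


sample = "INFO scripts.py:755 --   ray start --address='nid001164:6379'"
print(find_head_address(sample))  # ('nid001164', 6379)
```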

## Connect to Ray + Grafana dashboards

Forward the ports for both dashboards and get their URLs:

In [10]:
from nersc_cluster_deploy import connect_ray_dashboard


ray_dashboard_url, grafana_dashboard_url = connect_ray_dashboard(
    sfp_api,
    job['jobid'],
    site
)

In [11]:
ray_dashboard_url

'https://jupyter.nersc.gov/user/asnaylor/perlmutter-shared-node-cpu/proxy/localhost:8265/#/new/overview'

In [12]:
grafana_dashboard_url

'https://jupyter.nersc.gov/user/asnaylor/perlmutter-shared-node-cpu/proxy/3000/d/rayDefaultDashboard'

## Connect to Ray Cluster

Get the IP address of the Ray cluster head node, then connect to the cluster:

In [13]:
from nersc_cluster_deploy import get_ray_cluster_address
import ray

cluster_address = get_ray_cluster_address(
    sfp_api,
    job['jobid'],
    site
)
ray.init(cluster_address)

| | |
|---|---|
| Python version: | 3.9.15 |
| Ray version: | 2.3.0 |
| Dashboard: | http://127.0.0.1:8265 |


Check that all nodes have connected to the cluster:

In [14]:
from nersc_cluster_deploy import ray_cluster_summary

ray_cluster_summary()

Cluster Summary
---------------
Nodes: 3
CPU:   318
GPU:   12
RAM:   465.96 GB
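
The implementation of `ray_cluster_summary` is not shown here. As a rough idea of how such a summary could be produced, the sketch below formats a resource dictionary of the kind returned by `ray.cluster_resources()`, which reports totals under keys such as `CPU`, `GPU`, and `memory` (in bytes). The helper name `format_cluster_summary` and the choice to report memory in GiB are assumptions for illustration.

```python
def format_cluster_summary(resources, node_count=None):
    """Format a Ray-style resource dict as a short text summary.

    resources: dict with optional 'CPU', 'GPU' and 'memory' (bytes) entries.
    node_count: optional number of nodes, supplied by the caller.
    """
    gib = resources.get("memory", 0) / 1024**3
    lines = ["Cluster Summary", "---------------"]
    if node_count is not None:
        lines.append(f"Nodes: {node_count}")
    lines.append(f"CPU:   {int(resources.get('CPU', 0))}")
    lines.append(f"GPU:   {int(resources.get('GPU', 0))}")
    lines.append(f"RAM:   {gib:.2f} GiB")
    return "\n".join(lines)


# Made-up memory value; in the notebook the dict would come from ray.cluster_resources().
print(format_cluster_summary({"CPU": 318.0, "GPU": 12.0, "memory": 2 * 1024**3}, node_count=3))
```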


## Setup PyTorch Model

In [16]:
import argparse
import os

import torch
import torch.nn as nn
import torchvision.transforms as transforms
from filelock import FileLock
from torch.utils.data import DataLoader, Subset
from torchvision.datasets import CIFAR10
from torchvision.models import resnet18

import ray
import ray.train as train
from ray import tune
from ray.air import session
from ray.air.checkpoint import Checkpoint
from ray.air.config import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune.schedulers import PopulationBasedTraining
from ray.tune.tune_config import TuneConfig
from ray.tune.tuner import Tuner

In [17]:
def train_epoch(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset) // session.get_world_size()
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def validate_epoch(dataloader, model, loss_fn):
    size = len(dataloader.dataset) // session.get_world_size()
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(
        f"Test Error: \n "
        f"Accuracy: {(100 * correct):>0.1f}%, "
        f"Avg loss: {test_loss:>8f} \n"
    )
    return {"loss": test_loss}


def update_optimizer_config(optimizer, config):
    for param_group in optimizer.param_groups:
        for param, val in config.items():
            param_group[param] = val


def train_func(config):
    epochs = config.get("epochs", 3)

    model = resnet18()

    # Note that `prepare_model` needs to be called before setting optimizer.
    if not session.get_checkpoint():  # fresh start
        model = train.torch.prepare_model(model)

    # Create optimizer.
    optimizer_config = {
        "lr": config.get("lr"),
        "momentum": config.get("momentum"),
    }
    optimizer = torch.optim.SGD(model.parameters(), **optimizer_config)

    starting_epoch = 0
    if session.get_checkpoint():
        checkpoint_dict = session.get_checkpoint().to_dict()

        # Load in model
        model_state = checkpoint_dict["model"]
        model.load_state_dict(model_state)
        model = train.torch.prepare_model(model)

        # Load in optimizer
        optimizer_state = checkpoint_dict["optimizer_state_dict"]
        optimizer.load_state_dict(optimizer_state)

        # Optimizer configs (`lr`, `momentum`) are being mutated by PBT and passed in
        # through config, so we need to update the optimizer loaded from the checkpoint
        update_optimizer_config(optimizer, optimizer_config)

        # The current epoch increments the loaded epoch by 1
        checkpoint_epoch = checkpoint_dict["epoch"]
        starting_epoch = checkpoint_epoch + 1

    # Load in training and validation data.
    transform_train = transforms.Compose(
        [
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
        ]
    )  # meanstd transformation

    transform_test = transforms.Compose(
        [
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
        ]
    )

    data_dir = config.get("data_dir", os.path.expanduser("~/data"))
    os.makedirs(data_dir, exist_ok=True)
    with FileLock(os.path.join(data_dir, ".ray.lock")):
        train_dataset = CIFAR10(
            root=data_dir, train=True, download=True, transform=transform_train
        )
        validation_dataset = CIFAR10(
            root=data_dir, train=False, download=False, transform=transform_test
        )

    if config.get("test_mode"):
        train_dataset = Subset(train_dataset, list(range(64)))
        validation_dataset = Subset(validation_dataset, list(range(64)))

    worker_batch_size = config["batch_size"] // session.get_world_size()

    train_loader = DataLoader(train_dataset, batch_size=worker_batch_size)
    validation_loader = DataLoader(validation_dataset, batch_size=worker_batch_size)

    train_loader = train.torch.prepare_data_loader(train_loader)
    validation_loader = train.torch.prepare_data_loader(validation_loader)

    # Create loss.
    criterion = nn.CrossEntropyLoss()

    for epoch in range(starting_epoch, epochs):
        train_epoch(train_loader, model, criterion, optimizer)
        result = validate_epoch(validation_loader, model, criterion)
        checkpoint = Checkpoint.from_dict(
            {
                "epoch": epoch,
                "model": model.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
            }
        )

        session.report(result, checkpoint=checkpoint)
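
The resume branch of `train_func` does two things that are easy to miss: it overwrites the restored optimizer's hyperparameters with the values PBT passed in through `config`, and it restarts at the epoch after the one stored in the checkpoint. The sketch below isolates that logic without PyTorch. `resume_state` is a hypothetical helper, and a `SimpleNamespace` with a `param_groups` list stands in for a real optimizer.

```python
from types import SimpleNamespace


def update_optimizer_config(optimizer, config):
    # Same logic as in the tutorial: overwrite each param group's settings.
    for param_group in optimizer.param_groups:
        for param, val in config.items():
            param_group[param] = val


def resume_state(checkpoint_dict, optimizer, new_config):
    """Apply PBT's new hyperparameters and return the epoch to start from."""
    if checkpoint_dict is None:
        return 0  # fresh start
    update_optimizer_config(optimizer, new_config)
    return checkpoint_dict["epoch"] + 1


# Stand-in optimizer restored from a checkpoint taken at the end of epoch 0.
fake_optimizer = SimpleNamespace(param_groups=[{"lr": 0.05, "momentum": 0.8}])
start = resume_state({"epoch": 0}, fake_optimizer, {"lr": 0.06, "momentum": 0.8})
print(start, fake_optimizer.param_groups[0]["lr"])  # 1 0.06
```

Without the `update_optimizer_config` call, the restored optimizer would keep the learning rate saved in the checkpoint and PBT's perturbation would have no effect.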


## Train Model

In [18]:
SCRATCH = os.getenv('SCRATCH')

node_resources = ray.cluster_resources()
num_workers = int(node_resources['GPU'])
use_gpu = True

data_dir = os.path.join(SCRATCH, 'CIFAR10')
num_epochs = 5
smoke_test = False
synch = False

In [19]:
trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(
            num_workers=num_workers, use_gpu=use_gpu
        ),
    )

In [20]:
pbt_scheduler = PopulationBasedTraining(
        time_attr="training_iteration",
        perturbation_interval=1,
        hyperparam_mutations={
            "train_loop_config": {
                # distribution for resampling
                "lr": tune.loguniform(0.001, 0.1),
                # allow perturbations within this set of categorical values
                "momentum": [0.8, 0.9, 0.99],
            }
        },
        synch=synch,
    )
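
To give a feel for what the scheduler does at each perturbation interval, here is a simplified, standalone sketch of PBT's two steps: *exploit* (a poorly performing trial clones a well performing one) and *explore* (the cloned hyperparameters are perturbed or resampled). This is not Ray's implementation. The function names, the default factors `(1.2, 0.8)`, the resample probability `0.25`, and the quantile fraction `0.25` are assumptions chosen to mirror commonly used PBT settings.

```python
import random


def exploit_pairs(scores, quantile_fraction=0.25, rng=random):
    """Pair each bottom-quantile trial with a top-quantile trial to clone.

    scores: dict of trial name -> score, where higher is better.
    """
    ranked = sorted(scores, key=scores.get)  # worst first
    count = max(1, int(len(ranked) * quantile_fraction))
    bottom, top = ranked[:count], ranked[-count:]
    return [(loser, rng.choice(top)) for loser in bottom]


def explore(config, mutations, resample_probability=0.25,
            factors=(1.2, 0.8), rng=random):
    """Return a perturbed copy of config.

    mutations maps a key to either a list of allowed values or a
    zero-argument callable that draws a fresh sample.
    """
    new_config = dict(config)
    for key, spec in mutations.items():
        if isinstance(spec, list):
            if rng.random() < resample_probability or config[key] not in spec:
                new_config[key] = rng.choice(spec)
            else:
                # Move to a neighbouring value, clamped to the list bounds.
                index = spec.index(config[key]) + rng.choice([-1, 1])
                new_config[key] = spec[max(0, min(len(spec) - 1, index))]
        elif rng.random() < resample_probability:
            new_config[key] = spec()
        else:
            new_config[key] = config[key] * rng.choice(factors)
    return new_config


# Scores are negated losses, since the metric is minimised.
print(exploit_pairs({"876a2_00000": -2.224115, "876a2_00002": -1.572477}))

mutations = {
    "lr": lambda: 10 ** random.uniform(-3, -1),  # log-uniform in [0.001, 0.1]
    "momentum": [0.8, 0.9, 0.99],
}
print(explore({"lr": 0.05, "momentum": 0.8}, mutations))
```

With only the two scores shown, the weaker trial `876a2_00000` is paired with `876a2_00002`, which matches the kind of clone-then-perturb event reported in the tuning logs.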

In [21]:
tuner = Tuner(
        trainer,
        param_space={
            "train_loop_config": {
                "lr": tune.grid_search([0.001, 0.01, 0.05, 0.1]),
                "momentum": 0.8,
                "batch_size": 128 * num_workers,
                "test_mode": smoke_test,  # whether to subset the data
                "data_dir": data_dir,
                "epochs": num_epochs,
            }
        },
        tune_config=TuneConfig(
            num_samples=1, metric="loss", mode="min", scheduler=pbt_scheduler
        ),
        run_config=RunConfig(
            stop={"training_iteration": 3 if smoke_test else num_epochs},
            failure_config=FailureConfig(max_failures=3),  # used for fault tolerance
        ),
    )
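
The `batch_size` passed here is the global batch size; `train_func` divides it by the world size to get each worker's batch size. As a quick check of the arithmetic, the sketch below reproduces those divisions for 12 workers, assuming the full CIFAR-10 training set of 50,000 images. The helper name `per_worker_sizes` is hypothetical.

```python
def per_worker_sizes(num_workers, per_worker_batch=128, dataset_size=50_000):
    """Reproduce the batch-size arithmetic used in the tutorial."""
    global_batch = per_worker_batch * num_workers    # value passed as "batch_size"
    worker_batch = global_batch // num_workers       # computed inside train_func
    samples_per_worker = dataset_size // num_workers  # `size` in train_epoch
    return global_batch, worker_batch, samples_per_worker


print(per_worker_sizes(12))  # (1536, 128, 4166)
```

Scaling the global batch with the number of workers keeps the per-GPU batch fixed at 128 regardless of how many GPUs the cluster provides.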


In [22]:
results = tuner.fit()

| | |
|---|---|
| Current time: | 2023-03-21 10:36:24 |
| Running for: | 00:09:49.22 |
| Memory: | 51.1/251.3 GiB |

| Trial name | status | loc | train_loop_config/lr | iter | total time (s) | loss | _timestamp | _time_this_iter_s |
|---|---|---|---|---|---|---|---|---|
| TorchTrainer_876a2_00000 | TERMINATED | 128.55.66.67:7315 | 0.06 | 5 | 117.741 | 1.20098 | 1679420097 | 19.068 |
| TorchTrainer_876a2_00001 | TERMINATED | 128.55.66.67:8132 | 0.01 | 5 | 120.926 | 1.32064 | 1679420126 | 18.9816 |
| TorchTrainer_876a2_00002 | TERMINATED | pid=16032 | 0.05 | 5 | 118.06 | 1.1953 | 1679420156 | 19.0394 |
| TorchTrainer_876a2_00003 | TERMINATED | pid=17557 | 0.048 | 5 | 117.507 | 1.14191 | 1679420184 | 19.2617 |
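
Once tuning has finished, the usual next step is to pick the trial with the lowest loss. Ray Tune's result grid offers `get_best_result` for this, though the exact options may depend on your Ray version. As a library-free illustration, the sketch below selects the best trial from the final learning rates and losses listed in the table above.

```python
# Final values copied from the trial table above.
trials = [
    {"name": "TorchTrainer_876a2_00000", "lr": 0.06, "loss": 1.20098},
    {"name": "TorchTrainer_876a2_00001", "lr": 0.01, "loss": 1.32064},
    {"name": "TorchTrainer_876a2_00002", "lr": 0.05, "loss": 1.1953},
    {"name": "TorchTrainer_876a2_00003", "lr": 0.048, "loss": 1.14191},
]

# mode="min": the best trial is the one with the smallest loss.
best = min(trials, key=lambda trial: trial["loss"])
print(best["name"], best["lr"], best["loss"])
```

Here the best trial is `TorchTrainer_876a2_00003`, which ended with a learning rate of 0.048 and a loss of 1.14191.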


[2m[36m(RayTrainWorker pid=126806, ip=128.55.66.67)[0m 2023-03-21 10:26:46,542	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=126806, ip=128.55.66.67)[0m 2023-03-21 10:26:48,612	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=125989)[0m 2023-03-21 10:26:48,586	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=93576, ip=128.55.67.168)[0m 2023-03-21 10:26:48,607	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=126806, ip=128.55.66.67)[0m 2023-03-21 10:26:50,690	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.
[2m[36m(RayTrainWorker pid=93576, ip=128.55.67.168)[0m 2023-03-21 10:26:50,930	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.
[2m[36m(RayTrainWorker pid=125989)[0m 2023-03-21 10:26:51,138	INFO train_loop_utils

[2m[36m(RayTrainWorker pid=126806, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=126807, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=126808, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=126809, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=93576, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=93579, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=93578, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=93577, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=125989)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=125990)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=125992)[0m Files already downloaded and veri

[2m[36m(RayTrainWorker pid=127168)[0m 2023-03-21 10:27:17,174	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=127821, ip=128.55.66.67)[0m 2023-03-21 10:27:18,989	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=94097, ip=128.55.67.168)[0m 2023-03-21 10:27:19,019	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=127168)[0m 2023-03-21 10:27:19,005	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=127821, ip=128.55.66.67)[0m 2023-03-21 10:27:20,501	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.
[2m[36m(RayTrainWorker pid=94097, ip=128.55.67.168)[0m 2023-03-21 10:27:20,591	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.
[2m[36m(RayTrainWorker pid=127168)[0m 2023-03-21 10:27:20,741	INFO train_loop_utils.py:315 -- Wrappi

[2m[36m(RayTrainWorker pid=127171)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=127170)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=127821, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=127823, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=127169)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=127168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=94098, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=94097, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=127824, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=127822, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=94099, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(Ra

[2m[36m(RayTrainWorker pid=128269)[0m 2023-03-21 10:27:44,281	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=128313, ip=128.55.66.67)[0m 2023-03-21 10:27:46,099	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=128269)[0m 2023-03-21 10:27:46,063	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=94589, ip=128.55.67.168)[0m 2023-03-21 10:27:46,092	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=94589, ip=128.55.67.168)[0m 2023-03-21 10:27:47,610	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.
[2m[36m(RayTrainWorker pid=128269)[0m 2023-03-21 10:27:47,646	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.
[2m[36m(RayTrainWorker pid=128313, ip=128.55.66.67)[0m 2023-03-21 10:27:47,721	INFO train_loop_utils.py:315 -- Wrappi

[2m[36m(RayTrainWorker pid=94592, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=94589, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=94590, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=94591, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=128272)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=128314, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=128315, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=128316, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=128271)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=128269)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=128270)[0m Files already downloaded and verified
[2m[36m(Ra

[2m[36m(RayTrainWorker pid=128939, ip=128.55.66.67)[0m 2023-03-21 10:28:09,090	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=95110, ip=128.55.67.168)[0m 2023-03-21 10:28:11,016	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=128939, ip=128.55.66.67)[0m 2023-03-21 10:28:11,021	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=129203)[0m 2023-03-21 10:28:11,026	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=129203)[0m 2023-03-21 10:28:12,519	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.
[2m[36m(RayTrainWorker pid=95110, ip=128.55.67.168)[0m 2023-03-21 10:28:12,609	INFO train_loop_utils.py:315 -- Wrapping provided model in DistributedDataParallel.
[2m[36m(RayTrainWorker pid=128939, ip=128.55.66.67)[0m 2023-03-21 10:28:12,581	INFO train_loop_utils

[2m[36m(RayTrainWorker pid=128939, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=129205)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=129206)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=95110, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=95112, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=95111, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=129204)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=129203)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=128942, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=128941, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=128940, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(Ra

[2m[36m(TorchTrainer pid=129630, ip=128.55.66.67)[0m 2023-03-21 10:28:33,750	INFO trainable.py:791 -- Restored on 128.55.66.67 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00000_0_lr=0.0010_2023-03-21_10-26-35/checkpoint_tmp0f7a48
[2m[36m(TorchTrainer pid=129630, ip=128.55.66.67)[0m 2023-03-21 10:28:33,750	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 25.868615865707397, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=129760, ip=128.55.66.67)[0m 2023-03-21 10:28:36,266	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=130215)[0m 2023-03-21 10:28:40,372	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=95603, ip=128.55.67.168)[0m 2023-03-21 10:28:40,361	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid

[2m[36m(RayTrainWorker pid=129760, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=130218)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=130216)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=130217)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=95604, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=95603, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=95606, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=95605, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=129761, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=129763, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=129762, ip=128.55.66.67)[0m Files already downloaded and veri

[2m[36m(TunerInternal pid=125486)[0m 2023-03-21 10:28:57,510	INFO pbt.py:804 -- 
[2m[36m(TunerInternal pid=125486)[0m 
[2m[36m(TunerInternal pid=125486)[0m [PopulationBasedTraining] [Exploit] Cloning trial 876a2_00002 (score = -1.572477) into trial 876a2_00000 (score = -2.224115)
[2m[36m(TunerInternal pid=125486)[0m 
[2m[36m(TunerInternal pid=125486)[0m 2023-03-21 10:28:57,510	INFO pbt.py:831 -- 
[2m[36m(TunerInternal pid=125486)[0m 
[2m[36m(TunerInternal pid=125486)[0m [PopulationBasedTraining] [Explore] Perturbed the hyperparameter config of trial876a2_00000:
[2m[36m(TunerInternal pid=125486)[0m train_loop_config : 
[2m[36m(TunerInternal pid=125486)[0m     lr : 0.05 --- (* 1.2) --> 0.06
[2m[36m(TunerInternal pid=125486)[0m     momentum : 0.8 --- (resample) --> 0.8
[2m[36m(TunerInternal pid=125486)[0m 


[2m[36m(TunerInternal pid=125486)[0m Result for TorchTrainer_876a2_00000:
[2m[36m(TunerInternal pid=125486)[0m   _time_this_iter_s: 19.077677965164185
[2m[36m(TunerInternal pid=125486)[0m   _timestamp: 1679419737
[2m[36m(TunerInternal pid=125486)[0m   _training_iteration: 1
[2m[36m(TunerInternal pid=125486)[0m   date: 2023-03-21_10-28-57
[2m[36m(TunerInternal pid=125486)[0m   done: false
[2m[36m(TunerInternal pid=125486)[0m   experiment_id: 53278e032fae459da159b97d261f2e68
[2m[36m(TunerInternal pid=125486)[0m   hostname: nid001361
[2m[36m(TunerInternal pid=125486)[0m   iterations_since_restore: 1
[2m[36m(TunerInternal pid=125486)[0m   loss: 2.2241149629865373
[2m[36m(TunerInternal pid=125486)[0m   node_ip: 128.55.66.67
[2m[36m(TunerInternal pid=125486)[0m   pid: 129630
[2m[36m(TunerInternal pid=125486)[0m   should_checkpoint: true
[2m[36m(TunerInternal pid=125486)[0m   time_since_restore: 23.756930112838745
[2m[36m(TunerInternal pid=125486)[

[2m[36m(TorchTrainer pid=578)[0m 2023-03-21 10:29:00,647	INFO trainable.py:791 -- Restored on nid001164 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00000_0_lr=0.0010_2023-03-21_10-26-35/checkpoint_tmp2a9b55
[2m[36m(TorchTrainer pid=578)[0m 2023-03-21 10:29:00,647	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 21.938036918640137, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=957)[0m 2023-03-21 10:29:03,227	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=96149, ip=128.55.67.168)[0m 2023-03-21 10:29:07,316	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=957)[0m 2023-03-21 10:29:07,296	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=130262, ip=128.55.66.67)[0m 2023-03-21 10:29:07,324	INFO train_l

[2m[36m(RayTrainWorker pid=957)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=959)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=960)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=958)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=130264, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=130263, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=130262, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=130265, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=96149, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=96151, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=96150, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker

[2m[36m(TorchTrainer pid=130773, ip=128.55.66.67)[0m 2023-03-21 10:29:27,711	INFO trainable.py:791 -- Restored on 128.55.66.67 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00001_1_lr=0.0100_2023-03-21_10-27-11/checkpoint_tmpbf85be
[2m[36m(TorchTrainer pid=130773, ip=128.55.66.67)[0m 2023-03-21 10:29:27,711	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 23.511271953582764, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=130907, ip=128.55.66.67)[0m 2023-03-21 10:29:30,307	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=2007)[0m 2023-03-21 10:29:34,210	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=130907, ip=128.55.66.67)[0m 2023-03-21 10:29:34,264	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=9

[2m[36m(RayTrainWorker pid=96646, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=96644, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=96647, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=2007)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=2009)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=2010)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=2008)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=130909, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=130910, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=130908, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=130907, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWo



[2m[36m(RayTrainWorker pid=2010)[0m Test Error: 
[2m[36m(RayTrainWorker pid=2010)[0m  Accuracy: 50.7%, Avg loss: 1.366644 
[2m[36m(RayTrainWorker pid=2010)[0m 
[2m[36m(RayTrainWorker pid=2009)[0m Test Error: 
[2m[36m(RayTrainWorker pid=2009)[0m  Accuracy: 44.5%, Avg loss: 1.498748 
[2m[36m(RayTrainWorker pid=2009)[0m 
[2m[36m(RayTrainWorker pid=2008)[0m Test Error: 
[2m[36m(RayTrainWorker pid=2008)[0m  Accuracy: 45.1%, Avg loss: 1.470742 
[2m[36m(RayTrainWorker pid=2008)[0m 
[2m[36m(RayTrainWorker pid=2007)[0m Test Error: 
[2m[36m(RayTrainWorker pid=2007)[0m  Accuracy: 48.5%, Avg loss: 1.446913 
[2m[36m(RayTrainWorker pid=2007)[0m 
[2m[36m(RayTrainWorker pid=96646, ip=128.55.67.168)[0m Test Error: 
[2m[36m(RayTrainWorker pid=96646, ip=128.55.67.168)[0m  Accuracy: 48.5%, Avg loss: 1.418464 
[2m[36m(RayTrainWorker pid=96646, ip=128.55.67.168)[0m 
[2m[36m(RayTrainWorker pid=96644, ip=128.55.67.168)[0m Test Error: 
[2m[36m(RayTrainWorker pi

[2m[36m(TorchTrainer pid=3068)[0m 2023-03-21 10:29:56,906	INFO trainable.py:791 -- Restored on nid001164 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00002_2_lr=0.0500_2023-03-21_10-27-38/checkpoint_tmpd5b780
[2m[36m(TorchTrainer pid=3068)[0m 2023-03-21 10:29:56,906	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 21.938036918640137, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=3216)[0m 2023-03-21 10:29:59,937	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=97177, ip=128.55.67.168)[0m 2023-03-21 10:30:03,971	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=1264, ip=128.55.66.67)[0m 2023-03-21 10:30:04,003	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=3216)[0m 2023-03-21 10:30:03,996	INFO train

[2m[36m(RayTrainWorker pid=97179, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=97177, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=97178, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=97176, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=1264, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=1262, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=1265, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=3217)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=3216)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=3219)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=3218)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker p

[2m[36m(TorchTrainer pid=4158)[0m 2023-03-21 10:30:24,674	INFO trainable.py:791 -- Restored on nid001164 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00003_3_lr=0.1000_2023-03-21_10-28-03/checkpoint_tmp292e17
[2m[36m(TorchTrainer pid=4158)[0m 2023-03-21 10:30:24,674	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 1, '_timesteps_total': None, '_time_total': 21.84858274459839, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=4303)[0m 2023-03-21 10:30:27,268	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=4303)[0m 2023-03-21 10:30:31,361	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=97695, ip=128.55.67.168)[0m 2023-03-21 10:30:31,342	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=1803, ip=128.55.66.67)[0m 2023-03-21 10:30:31,355	INFO train_

[2m[36m(RayTrainWorker pid=97698, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=97696, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=97695, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=4306)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=4304)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=4305)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=4303)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=1804, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=1802, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=1805, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=1803, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid

[2m[36m(TunerInternal pid=125486)[0m 2023-03-21 10:30:48,665	INFO pbt.py:804 -- 
[2m[36m(TunerInternal pid=125486)[0m 
[2m[36m(TunerInternal pid=125486)[0m [PopulationBasedTraining] [Exploit] Cloning trial 876a2_00000 (score = -1.438839) into trial 876a2_00003 (score = -1.614763)
[2m[36m(TunerInternal pid=125486)[0m 
[2m[36m(TunerInternal pid=125486)[0m 2023-03-21 10:30:48,665	INFO pbt.py:831 -- 
[2m[36m(TunerInternal pid=125486)[0m 
[2m[36m(TunerInternal pid=125486)[0m [PopulationBasedTraining] [Explore] Perturbed the hyperparameter config of trial876a2_00003:
[2m[36m(TunerInternal pid=125486)[0m train_loop_config : 
[2m[36m(TunerInternal pid=125486)[0m     lr : 0.06 --- (* 0.8) --> 0.048
[2m[36m(TunerInternal pid=125486)[0m     momentum : 0.8 --- (shift left (noop)) --> 0.8
[2m[36m(TunerInternal pid=125486)[0m 


[2m[36m(TunerInternal pid=125486)[0m Result for TorchTrainer_876a2_00003:
[2m[36m(TunerInternal pid=125486)[0m   _time_this_iter_s: 19.27050805091858
[2m[36m(TunerInternal pid=125486)[0m   _timestamp: 1679419848
[2m[36m(TunerInternal pid=125486)[0m   _training_iteration: 1
[2m[36m(TunerInternal pid=125486)[0m   date: 2023-03-21_10-30-48
[2m[36m(TunerInternal pid=125486)[0m   done: false
[2m[36m(TunerInternal pid=125486)[0m   experiment_id: fdcc74443faf410997f018944b736ffa
[2m[36m(TunerInternal pid=125486)[0m   hostname: nid001164
[2m[36m(TunerInternal pid=125486)[0m   iterations_since_restore: 1
[2m[36m(TunerInternal pid=125486)[0m   loss: 1.6147634301866804
[2m[36m(TunerInternal pid=125486)[0m   node_ip: nid001164
[2m[36m(TunerInternal pid=125486)[0m   pid: 4158
[2m[36m(TunerInternal pid=125486)[0m   should_checkpoint: true
[2m[36m(TunerInternal pid=125486)[0m   time_since_restore: 23.987836360931396
[2m[36m(TunerInternal pid=125486)[0m   t

[2m[36m(TorchTrainer pid=5326)[0m 2023-03-21 10:30:51,677	INFO trainable.py:791 -- Restored on nid001164 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00000_0_lr=0.0010_2023-03-21_10-26-35/checkpoint_tmp59e384
[2m[36m(TorchTrainer pid=5326)[0m 2023-03-21 10:30:51,677	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 2, '_timesteps_total': None, '_time_total': 45.66729164123535, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=5477)[0m 2023-03-21 10:30:54,365	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=5477)[0m 2023-03-21 10:30:58,466	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=2335, ip=128.55.66.67)[0m 2023-03-21 10:30:58,499	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=98198, ip=128.55.67.168)[0m 2023-03-21 10:30:58,490	INFO train_

[2m[36m(RayTrainWorker pid=98201, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=98200, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=98198, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=98199, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=5478)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=5479)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=5480)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=5477)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=2336, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=2338, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=2337, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker p

[2m[36m(TorchTrainer pid=2842, ip=128.55.66.67)[0m 2023-03-21 10:31:18,717	INFO trainable.py:791 -- Restored on 128.55.66.67 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00001_1_lr=0.0100_2023-03-21_10-27-11/checkpoint_tmp2fe551
[2m[36m(TorchTrainer pid=2842, ip=128.55.66.67)[0m 2023-03-21 10:31:18,717	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 2, '_timesteps_total': None, '_time_total': 47.58713221549988, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=2946, ip=128.55.66.67)[0m 2023-03-21 10:31:21,840	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=98736, ip=128.55.67.168)[0m 2023-03-21 10:31:26,112	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=6477)[0m 2023-03-21 10:31:26,122	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=2946, ip

[2m[36m(RayTrainWorker pid=98736, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=6479)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=6477)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=98737, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=2949, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=2946, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=2948, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=2947, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=98738, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=6480)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=6478)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid

[2m[36m(TorchTrainer pid=3573, ip=128.55.66.67)[0m 2023-03-21 10:31:48,628	INFO trainable.py:791 -- Restored on 128.55.66.67 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00002_2_lr=0.0500_2023-03-21_10-27-38/checkpoint_tmpeaa95f
[2m[36m(TorchTrainer pid=3573, ip=128.55.66.67)[0m 2023-03-21 10:31:48,628	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 2, '_timesteps_total': None, '_time_total': 46.152037143707275, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=3676, ip=128.55.66.67)[0m 2023-03-21 10:31:51,400	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=7472)[0m 2023-03-21 10:31:55,277	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=99260, ip=128.55.67.168)[0m 2023-03-21 10:31:55,322	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=3676, i

[2m[36m(RayTrainWorker pid=99263, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99262, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99260, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99261, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=7474)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=7472)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=3676, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=7473)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=7475)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=3679, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=3677, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker p

[2m[36m(TorchTrainer pid=8426)[0m 2023-03-21 10:32:16,655	INFO trainable.py:791 -- Restored on nid001164 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00003_3_lr=0.1000_2023-03-21_10-28-03/checkpoint_tmp57b1d9
[2m[36m(TorchTrainer pid=8426)[0m 2023-03-21 10:32:16,655	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 2, '_timesteps_total': None, '_time_total': 45.66729164123535, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=8570)[0m 2023-03-21 10:32:19,278	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=4283, ip=128.55.66.67)[0m 2023-03-21 10:32:23,332	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=99755, ip=128.55.67.168)[0m 2023-03-21 10:32:23,284	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=8570)[0m 2023-03-21 10:32:23,351	INFO train_

[2m[36m(RayTrainWorker pid=4284, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=4285, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=4283, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=8570)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=8572)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=8573)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=8571)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99755, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99758, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99757, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=99756, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker p

[2m[36m(TorchTrainer pid=9502)[0m 2023-03-21 10:32:43,733	INFO trainable.py:791 -- Restored on nid001164 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00000_0_lr=0.0010_2023-03-21_10-26-35/checkpoint_tmp728f3b
[2m[36m(TorchTrainer pid=9502)[0m 2023-03-21 10:32:43,733	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 3, '_timesteps_total': None, '_time_total': 69.62242889404297, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=9646)[0m 2023-03-21 10:32:46,344	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=9646)[0m 2023-03-21 10:32:50,409	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=4806, ip=128.55.66.67)[0m 2023-03-21 10:32:50,401	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=100279, ip=128.55.67.168)[0m 2023-03-21 10:32:50,428	INFO train

[2m[36m(RayTrainWorker pid=100279, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=9646)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=9647)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=9648)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=9649)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100281, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100280, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100282, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=4806, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=4803, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=4804, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWork

[2m[36m(TorchTrainer pid=5302, ip=128.55.66.67)[0m 2023-03-21 10:33:10,697	INFO trainable.py:791 -- Restored on 128.55.66.67 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00001_1_lr=0.0100_2023-03-21_10-27-11/checkpoint_tmpd4d796
[2m[36m(TorchTrainer pid=5302, ip=128.55.66.67)[0m 2023-03-21 10:33:10,697	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 3, '_timesteps_total': None, '_time_total': 73.20204281806946, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=5433, ip=128.55.66.67)[0m 2023-03-21 10:33:13,298	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=5433, ip=128.55.66.67)[0m 2023-03-21 10:33:17,279	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=10636)[0m 2023-03-21 10:33:17,282	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=100773, i

[2m[36m(RayTrainWorker pid=100774, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100776, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100775, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=100773, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=10638)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=10637)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=10639)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=5434, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=5436, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=5433, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=5435, ip=128.55.66.67)[0m Files already downloaded and verified
[

[2m[36m(TorchTrainer pid=11899)[0m 2023-03-21 10:33:38,705	INFO trainable.py:791 -- Restored on nid001164 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00002_2_lr=0.0500_2023-03-21_10-27-38/checkpoint_tmp2a50e8
[2m[36m(TorchTrainer pid=11899)[0m 2023-03-21 10:33:38,705	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 3, '_timesteps_total': None, '_time_total': 70.00265097618103, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=12068)[0m 2023-03-21 10:33:41,238	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=6067, ip=128.55.66.67)[0m 2023-03-21 10:33:45,239	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=101295, ip=128.55.67.168)[0m 2023-03-21 10:33:45,258	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=12068)[0m 2023-03-21 10:33:45,287	INFO t

[2m[36m(RayTrainWorker pid=12068)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=12069)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=12071)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=12072)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=101296, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=101298, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=101295, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=6068, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=6070, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=6067, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=6069, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWor

[2m[36m(TorchTrainer pid=6589, ip=128.55.66.67)[0m 2023-03-21 10:34:05,712	INFO trainable.py:791 -- Restored on 128.55.66.67 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00003_3_lr=0.1000_2023-03-21_10-28-03/checkpoint_tmpb96546
[2m[36m(TorchTrainer pid=6589, ip=128.55.66.67)[0m 2023-03-21 10:34:05,713	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 3, '_timesteps_total': None, '_time_total': 69.630544424057, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=6694, ip=128.55.66.67)[0m 2023-03-21 10:34:08,298	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=13027)[0m 2023-03-21 10:34:12,305	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=6694, ip=128.55.66.67)[0m 2023-03-21 10:34:12,325	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=101799, ip=

[2m[36m(RayTrainWorker pid=101800, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=101802, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=6694, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=101799, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=13028)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=13027)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=13029)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=6697, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=13031)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=6695, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=6696, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWor

[2m[36m(TorchTrainer pid=7315, ip=128.55.66.67)[0m 2023-03-21 10:34:33,705	INFO trainable.py:791 -- Restored on 128.55.66.67 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00000_0_lr=0.0010_2023-03-21_10-26-35/checkpoint_tmp5bddc5
[2m[36m(TorchTrainer pid=7315, ip=128.55.66.67)[0m 2023-03-21 10:34:33,705	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 4, '_timesteps_total': None, '_time_total': 93.53399157524109, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=7418, ip=128.55.66.67)[0m 2023-03-21 10:34:36,817	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=14077)[0m 2023-03-21 10:34:40,632	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=7418, ip=128.55.66.67)[0m 2023-03-21 10:34:40,695	INFO train_loop_utils.py:255 -- Moving model to device: cuda:1
[2m[36m(RayTrainWorker pid=102328, i

[2m[36m(RayTrainWorker pid=102328, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=102327, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=102329, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=102330, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=14082)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=14080)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=14077)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=14076)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=7419, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=7420, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=7421, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrain

[2m[36m(TorchTrainer pid=8132, ip=128.55.66.67)[0m 2023-03-21 10:35:03,208	INFO trainable.py:791 -- Restored on 128.55.66.67 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00001_1_lr=0.0100_2023-03-21_10-27-11/checkpoint_tmp2b4bc6
[2m[36m(TorchTrainer pid=8132, ip=128.55.66.67)[0m 2023-03-21 10:35:03,208	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 4, '_timesteps_total': None, '_time_total': 96.90374326705933, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=8237, ip=128.55.66.67)[0m 2023-03-21 10:35:06,191	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=8237, ip=128.55.66.67)[0m 2023-03-21 10:35:10,100	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=15084)[0m 2023-03-21 10:35:10,082	INFO train_loop_utils.py:255 -- Moving model to device: cuda:3
[2m[36m(RayTrainWorker pid=102851, i

[2m[36m(RayTrainWorker pid=8237, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=8240, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=8238, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=8239, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=15086)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=15088)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=15084)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=15089)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=102852, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=102853, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=102851, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWor

[2m[36m(TorchTrainer pid=16032)[0m 2023-03-21 10:35:32,667	INFO trainable.py:791 -- Restored on nid001164 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00002_2_lr=0.0500_2023-03-21_10-27-38/checkpoint_tmpcc26c9
[2m[36m(TorchTrainer pid=16032)[0m 2023-03-21 10:35:32,667	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 4, '_timesteps_total': None, '_time_total': 93.77648591995239, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=16180)[0m 2023-03-21 10:35:35,764	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=8949, ip=128.55.66.67)[0m 2023-03-21 10:35:39,778	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=16180)[0m 2023-03-21 10:35:39,808	INFO train_loop_utils.py:255 -- Moving model to device: cuda:0
[2m[36m(RayTrainWorker pid=103368, ip=128.55.67.168)[0m 2023-03-21 10:35:39,813	INFO t

[2m[36m(RayTrainWorker pid=103369, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=103368, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=16183)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=16180)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=16182)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=16181)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=103370, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=103371, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=8951, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=8952, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=8949, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrain

[2m[36m(TorchTrainer pid=17557)[0m 2023-03-21 10:35:59,933	INFO trainable.py:791 -- Restored on nid001164 from checkpoint: /global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00003_3_lr=0.1000_2023-03-21_10-28-03/checkpoint_tmpe4edea
[2m[36m(TorchTrainer pid=17557)[0m 2023-03-21 10:35:59,933	INFO trainable.py:800 -- Current state after restoring: {'_iteration': 4, '_timesteps_total': None, '_time_total': 93.14122271537781, '_episodes_total': None}
[2m[36m(RayTrainWorker pid=17707)[0m 2023-03-21 10:36:02,920	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=12]
[2m[36m(RayTrainWorker pid=103865, ip=128.55.67.168)[0m 2023-03-21 10:36:06,871	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=17707)[0m 2023-03-21 10:36:06,893	INFO train_loop_utils.py:255 -- Moving model to device: cuda:2
[2m[36m(RayTrainWorker pid=9445, ip=128.55.66.67)[0m 2023-03-21 10:36:06,894	INFO t

[2m[36m(RayTrainWorker pid=103866, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=103868, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=9447, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=17708)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=9445, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=17710)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=17707)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=9446, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=9448, ip=128.55.66.67)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=103865, ip=128.55.67.168)[0m Files already downloaded and verified
[2m[36m(RayTrainWorker pid=17709)[0m Files already downloaded and verified
[2m[36m(RayTrainWor

In [23]:
print(results.get_best_result(metric="loss", mode="min"))

Result(metrics={'loss': 1.1419129882540022, '_timestamp': 1679420184, '_time_this_iter_s': 19.26172709465027, '_training_iteration': 1, 'should_checkpoint': True, 'done': True, 'trial_id': '876a2_00003', 'experiment_tag': '3_lr=0.1000@perturbed[train_loop_config=lr_0_048_momentum_0_8_batch_size_1536_test_mode_False_data_dir_pscratch_sd_a_asnaylor_CIFAR10_epochs_5]'}, error=None, log_dir=PosixPath('/global/homes/a/asnaylor/ray_results/TorchTrainer_2023-03-21_10-26-33/TorchTrainer_876a2_00003_3_lr=0.1000_2023-03-21_10-28-03'))
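
The experiment tag of the best trial ends in `@perturbed[...lr_0_048_momentum_0_8...]`, which matches the `[Explore]` message logged earlier: PBT cloned a better-scoring trial into `876a2_00003`, then scaled `lr` from 0.06 to 0.048, while the momentum shift was a no-op. The sketch below reproduces only that arithmetic in plain Python. It is a simplified illustration rather than Ray's implementation: the function names are hypothetical, and the momentum candidate list `[0.8, 0.9, 0.99]` is an assumption that does not appear in this output.

```python
def perturb_continuous(value, factor):
    """Scale a continuous hyperparameter by a perturbation factor (e.g. 0.8 or 1.2)."""
    return value * factor


def shift_categorical(value, choices, step):
    """Move one position through a list of candidate values, clamped at both ends.

    `step` is -1 (shift left) or +1 (shift right). At the edge of the list the
    shift has no effect, which is what the log reports as "(noop)".
    """
    idx = choices.index(value)
    new_idx = min(max(idx + step, 0), len(choices) - 1)
    return choices[new_idx]


# Values taken from the [Explore] log message above.
new_lr = perturb_continuous(0.06, 0.8)

# ASSUMPTION: candidate list chosen for illustration only.
momentum_choices = [0.8, 0.9, 0.99]
new_momentum = shift_categorical(0.8, momentum_choices, step=-1)

print(round(new_lr, 6), new_momentum)
```

Because 0.8 is already the first entry of the assumed list, shifting left leaves it unchanged, while the learning rate moves to roughly 0.048.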


## Close cluster connection and stop job

In [24]:
# Disconnect this notebook from the Ray cluster
ray.shutdown()

[2m[36m(TunerInternal pid=125486)[0m Result for TorchTrainer_876a2_00003:
[2m[36m(TunerInternal pid=125486)[0m   _time_this_iter_s: 19.26172709465027
[2m[36m(TunerInternal pid=125486)[0m   _timestamp: 1679420184
[2m[36m(TunerInternal pid=125486)[0m   _training_iteration: 1
[2m[36m(TunerInternal pid=125486)[0m   date: 2023-03-21_10-36-24
[2m[36m(TunerInternal pid=125486)[0m   done: true
[2m[36m(TunerInternal pid=125486)[0m   experiment_id: 2f564592309a4b05aee19a4ec9d1a599
[2m[36m(TunerInternal pid=125486)[0m   hostname: nid001164
[2m[36m(TunerInternal pid=125486)[0m   iterations_since_restore: 1
[2m[36m(TunerInternal pid=125486)[0m   loss: 1.1419129882540022
[2m[36m(TunerInternal pid=125486)[0m   node_ip: nid001164
[2m[36m(TunerInternal pid=125486)[0m   pid: 17557
[2m[36m(TunerInternal pid=125486)[0m   should_checkpoint: true
[2m[36m(TunerInternal pid=125486)[0m   time_since_restore: 24.36621594429016
[2m[36m(TunerInternal pid=125486)[0m   ti

[2m[36m(TunerInternal pid=125486)[0m 2023-03-21 10:36:24,317	INFO tune.py:798 -- Total run time: 589.28 seconds (589.22 seconds for the tuning loop).


In [25]:
# Cancel the batch job hosting the Ray cluster so the nodes are released
sfp_api.delete_job(site, job['jobid'])

Connection to nid001164 closed by remote host.
Connection to nid001164 closed by remote host.


{'task_id': '0', 'status': 'OK', 'error': None}
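
The dictionary above is the only confirmation that the job was cancelled, so it can be worth checking it in code rather than by eye. The helper below is a minimal sketch: `ensure_ok` is a hypothetical name, and the response shape is assumed from the single output shown here (`task_id`, `status`, `error`), not from the library's documentation.

```python
def ensure_ok(response):
    """Return the response unchanged, or raise if it does not report success."""
    status = response.get("status")
    error = response.get("error")
    if status != "OK" or error is not None:
        raise RuntimeError(f"Request failed: status={status!r}, error={error!r}")
    return response


# Response copied from the cell output above.
ensure_ok({"task_id": "0", "status": "OK", "error": None})
```

In the notebook this would wrap the call directly, for example `ensure_ok(sfp_api.delete_job(site, job['jobid']))`, assuming the call always returns a dictionary of this form.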

## Explore Training in TensorBoard

In [39]:
import nersc_tensorboard_helper
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [40]:
# Directory of the trial with the lowest reported loss
log_dir = str(results.get_best_result(metric="loss", mode="min").log_dir)
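
TensorBoard is not the only way to look at this directory. Assuming Ray's default JSON logger was active, the trial directory should also contain a `result.json` file with one JSON object per reported result. The sketch below reads a single metric from that file using only the standard library. `load_metric_history` is a hypothetical helper name, and the presence and layout of `result.json` is an assumption to verify against your own `log_dir`.

```python
import json
from pathlib import Path


def load_metric_history(log_dir, metric="loss"):
    """Collect the values of one metric from a trial's result.json.

    Assumes the file holds one JSON object per line; lines that are empty
    or that do not contain the metric are skipped.
    """
    history = []
    result_file = Path(log_dir) / "result.json"
    with result_file.open() as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if metric in record:
                history.append(record[metric])
    return history
```

Calling `load_metric_history(log_dir)` would then return the loss values reported by the best trial, in the order they were written.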

In [41]:
# Launch TensorBoard on the best trial's logs; --port 0 requests any unused port
%tensorboard --logdir $log_dir --port 0

In [42]:
nersc_tensorboard_helper.tb_address()