# Programming for Data Science and Artificial Intelligence

## 09.22 - Deep Learning - PyTorch - Hyperparameter tuning with Ray Tune

- Ray Tune docs - https://docs.ray.io/en/latest/index.html

### Hyperparameter tuning with Ray Tune

Hyperparameter tuning can make the difference between an average model and a highly
accurate one. Often simple things like choosing a different learning rate or changing
a network layer size can have a dramatic impact on your model performance.

Fortunately, there are tools that help with finding the best combination of parameters.
`Ray Tune <https://docs.ray.io/en/latest/tune.html>`_ is an industry standard tool for distributed hyperparameter tuning. Ray Tune includes the latest hyperparameter search algorithms, integrates with TensorBoard and other analysis libraries, and natively supports distributed training through `Ray's distributed machine learning engine <https://ray.io/>`_.

In this tutorial, we will show you how to integrate Ray Tune into your PyTorch training workflow. We will extend `this tutorial from the PyTorch documentation
<https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html>`_ for training a CIFAR10 image classifier.

As you will see, we only need to add some slight modifications. In particular, we need to

1. wrap data loading and training in functions,
2. make some network parameters configurable,
3. add checkpointing (optional),
4. and define the search space for the model tuning


To run this tutorial, please make sure the following packages are installed:

-  ``ray[tune]``: Distributed hyperparameter tuning library
-  ``tabulate``: Table handling library
-  ``torchvision``: For the data transformers

Let's start with the imports:

In [2]:
from functools import partial
import numpy as np
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

Most of the imports are needed for building the PyTorch model. Only the last three imports are for Ray Tune.

### Data loaders

We wrap the data loaders in their own function and pass a global data directory. This way we can share a data directory between different trials.

In [3]:
def load_data(data_dir="./data"):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    trainset = torchvision.datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=transform)

    testset = torchvision.datasets.CIFAR10(
        root=data_dir, train=False, download=True, transform=transform)

    return trainset, testset

### Configurable neural network

We can only tune those parameters that are configurable. In this example, we can specify the layer sizes of the fully connected layers:

In [4]:
class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

### The train function

Now it gets interesting, because we introduce some changes to the example `from the PyTorch documentation <https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html>`_.

We wrap the training script in a function ``train_cifar(config, checkpoint_dir=None, data_dir=None)``.
As you can guess, the ``config`` parameter will receive the hyperparameters we would like to train with. The ``checkpoint_dir`` parameter is used to restore checkpoints. The ``data_dir`` specifies the directory where we load and store the data, so multiple runs can share the same data source.

<code>
    net = Net(config["l1"], config["l2"])
    if checkpoint_dir:
        model_state, optimizer_state = torch.load(
            os.path.join(checkpoint_dir, "checkpoint"))
        net.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)
</code>

The learning rate of the optimizer is made configurable, too:

<code>
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)
</code>

We also split the training data into a training and validation subset. We thus train on 80% of the data and calculate the validation loss on the remaining 20%. The batch sizes with which we iterate through the training and test sets are configurable as well.

### Adding (multi) GPU support with DataParallel

Image classification benefits largely from GPUs. Luckily, we can continue to use PyTorch's abstractions in Ray Tune. Thus, we can wrap our model in ``nn.DataParallel`` to support data parallel training on multiple GPUs:

<code>
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)
</code>

By using a ``device`` variable we make sure that training also works when we have no GPUs available. PyTorch requires us to send our data to the GPU memory explicitly, like this:

<code>
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
</code>

The code now supports training on CPUs, on a single GPU, and on multiple GPUs. Notably, Ray also supports `fractional GPUs <https://docs.ray.io/en/master/using-ray-with-gpus.html#fractional-gpus>`_ so we can share GPUs among trials, as long as the model still fits on the GPU memory. We'll come back to that later.

### Communicating with Ray Tune

The most interesting part is the communication with Ray Tune:

<code>
    with tune.checkpoint_dir(epoch) as checkpoint_dir:
        path = os.path.join(checkpoint_dir, "checkpoint")
        torch.save((net.state_dict(), optimizer.state_dict()), path)
    tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
</code>

Here we first save a checkpoint and then report some metrics back to Ray Tune. Specifically, we send the validation loss and accuracy back to Ray Tune. Ray Tune can then use these metrics
to decide which hyperparameter configuration lead to the best results. These metrics can also be used to stop bad performing trials early in order to avoid wasting resources on those trials.

The checkpoint saving is optional, however, it is necessary if we wanted to use advanced schedulers like
`Population Based Training <https://docs.ray.io/en/master/tune/tutorials/tune-advanced-tutorial.html>`_. Also, by saving the checkpoint we can later load the trained models and validate them
on a test set.

### Full training function

The full code example looks like this:

In [5]:
def train_cifar(config, checkpoint_dir=None, data_dir=None):
    net = Net(config["l1"], config["l2"])

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

    if checkpoint_dir:
        model_state, optimizer_state = torch.load(
            os.path.join(checkpoint_dir, "checkpoint"))
        net.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)

    trainset, testset = load_data(data_dir)

    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs])

    trainloader = torch.utils.data.DataLoader(
        train_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)
    valloader = torch.utils.data.DataLoader(
        val_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)

    for epoch in range(10):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
                                                running_loss / epoch_steps))
                running_loss = 0.0

        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

        with tune.checkpoint_dir(epoch) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            torch.save((net.state_dict(), optimizer.state_dict()), path)

        tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
    print("Finished Training")

### Test set accuracy

Commonly the performance of a machine learning model is tested on a hold-out test set with data that has not been used for training the model. We also wrap this in a function:

In [6]:
def test_accuracy(net, device="cpu"):
    trainset, testset = load_data()

    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2)

    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return correct / total

The function also expects a ``device`` parameter, so we can do the test set validation on a GPU.

### Configuring the search space

Lastly, we need to define Ray Tune's search space. Here is an example:

<code>
    config = {
        "l1": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),
        "l2": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16])
    }
</code>

The ``tune.sample_from()`` function makes it possible to define your own sample methods to obtain hyperparameters. In this example, the ``l1`` and ``l2`` parameters should be powers of 2 between 4 and 256, so either 4, 8, 16, 32, 64, 128, or 256.
The ``lr`` (learning rate) should be uniformly sampled between 0.0001 and 0.1. Lastly, the batch size is a choice between 2, 4, 8, and 16.

At each trial, Ray Tune will now randomly sample a combination of parameters from these search spaces. It will then train a number of models in parallel and find the best performing one among these. We also use the ``ASHAScheduler`` which will terminate bad performing trials early.

We wrap the ``train_cifar`` function with ``functools.partial`` to set the constant ``data_dir`` parameter. We can also tell Ray Tune what resources should be available for each trial:

<code>
    gpus_per_trial = 2
    # ...
    result = tune.run(
        partial(train_cifar, data_dir=data_dir),
        resources_per_trial={"cpu": 8, "gpu": gpus_per_trial},
        config=config,
        num_samples=num_samples,
        scheduler=scheduler,
        progress_reporter=reporter,
        checkpoint_at_end=True)
</code>

You can specify the number of CPUs, which are then available e.g. to increase the ``num_workers`` of the PyTorch ``DataLoader`` instances. The selected number of GPUs are made visible to PyTorch in each trial. Trials do not have access to
GPUs that haven't been requested for them - so you don't have to care about two trials using the same set of resources.

Here we can also specify fractional GPUs, so something like ``gpus_per_trial=0.5`` is completely valid. The trials will then share GPUs among each other. You just have to make sure that the models still fit in the GPU memory.

After training the models, we will find the best performing one and load the trained network from the checkpoint file. We then obtain the test set accuracy and report everything by printing.

The full main function looks like this:

In [9]:
def main(num_samples=10, max_num_epochs=10, gpus_per_trial=2):
    data_dir = os.path.abspath("./data")
    load_data(data_dir)
    config = {
        "l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16])
    }
    scheduler = ASHAScheduler(
        metric="loss",
        mode="min",
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2)
    reporter = CLIReporter(
        # parameter_columns=["l1", "l2", "lr", "batch_size"],
        metric_columns=["loss", "accuracy", "training_iteration"])
    result = tune.run(
        partial(train_cifar, data_dir=data_dir),
        resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
        config=config,
        num_samples=num_samples,
        scheduler=scheduler,
        progress_reporter=reporter)

    best_trial = result.get_best_trial("loss", "min", "last")
    print("Best trial config: {}".format(best_trial.config))
    print("Best trial final validation loss: {}".format(
        best_trial.last_result["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_trial.last_result["accuracy"]))

    best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if gpus_per_trial > 1:
            best_trained_model = nn.DataParallel(best_trained_model)
    best_trained_model.to(device)

    best_checkpoint_dir = best_trial.checkpoint.value
    model_state, optimizer_state = torch.load(os.path.join(
        best_checkpoint_dir, "checkpoint"))
    best_trained_model.load_state_dict(model_state)

    test_acc = test_accuracy(best_trained_model, device)
    print("Best trial test set accuracy: {}".format(test_acc))


if __name__ == "__main__":
    # You can change the number of GPUs per trial here:
    main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)

Files already downloaded and verified
Files already downloaded and verified


2020-11-24 16:14:56,856	INFO registry.py:64 -- Detected unknown callable for trainable. Converting to class.


== Status ==
Memory usage on this node: 10.4/16.0 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 2/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/chaklam/ray_results/DEFAULT_2020-11-24_16-14-56
Number of trials: 1/10 (1 RUNNING)
+---------------------+----------+-------+--------------+------+------+-----------+
| Trial name          | status   | loc   |   batch_size |   l1 |   l2 |        lr |
|---------------------+----------+-------+--------------+------+------+-----------|
| DEFAULT_85401_00000 | RUNNING  |       |            2 |    4 |    4 | 0.0799485 |
+---------------------+----------+-------+--------------+------+------+-----------+


[2m[36m(pid=20804)[0m Files already downloaded and verified
[2m[36m(pid=20806)[0m Files already downloaded and verified
[2m[36m(pid=20795)[0m Files already downloaded and verified
[2m[36m(pid=20796)[0m Files

Result for DEFAULT_85401_00002:
  accuracy: 0.1672
  date: 2020-11-24_16-19-38
  done: false
  experiment_id: f73cdf768b3b476682e13ff506d7744e
  experiment_tag: 2_batch_size=16,l1=16,l2=128,lr=0.071116
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 2
  loss: 2.171962975692749
  node_ip: 203.159.32.153
  pid: 20796
  should_checkpoint: true
  time_since_restore: 280.28642106056213
  time_this_iter_s: 131.73800110816956
  time_total_s: 280.28642106056213
  timestamp: 1606209578
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: '85401_00002'
  
Result for DEFAULT_85401_00001:
  accuracy: 0.238
  date: 2020-11-24_16-19-41
  done: false
  experiment_id: 1450e70e45764e7ca11f46dcf90c7ddf
  experiment_tag: 1_batch_size=16,l1=16,l2=8,lr=0.00045915
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 2
  loss: 2.130254528236389
  node_ip: 203.159.32.153
  pid: 20804
  should_checkpoint: true
  time_since_restore: 283.14179015159607
  time_this_iter_s: 133.56160

[2m[36m(pid=20795)[0m [4,  2000] loss: 2.280
Result for DEFAULT_85401_00004:
  accuracy: 0.4581
  date: 2020-11-24_16-22-13
  done: false
  experiment_id: afeac2669df34f9f99167c92039c0b0e
  experiment_tag: 4_batch_size=16,l1=32,l2=8,lr=0.0027645
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 2
  loss: 1.463725819683075
  node_ip: 203.159.32.153
  pid: 20793
  should_checkpoint: true
  time_since_restore: 244.1287488937378
  time_this_iter_s: 119.67727279663086
  time_total_s: 244.1287488937378
  timestamp: 1606209733
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: '85401_00004'
  
== Status ==
Memory usage on this node: 11.5/16.0 GiB
Using AsyncHyperBand: num_stopped=1
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -2.151108751964569 | Iter 1.000: -2.2531853813171385
Resources requested: 8/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/chaklam/ray_results/DEFAULT_2020-11-24_16-14-56
Number of trials: 6/10 (1 

Result for DEFAULT_85401_00004:
  accuracy: 0.5127
  date: 2020-11-24_16-24-16
  done: false
  experiment_id: afeac2669df34f9f99167c92039c0b0e
  experiment_tag: 4_batch_size=16,l1=32,l2=8,lr=0.0027645
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 3
  loss: 1.3657977365493774
  node_ip: 203.159.32.153
  pid: 20793
  should_checkpoint: true
  time_since_restore: 367.25441098213196
  time_this_iter_s: 123.12566208839417
  time_total_s: 367.25441098213196
  timestamp: 1606209856
  timesteps_since_restore: 0
  training_iteration: 3
  trial_id: '85401_00004'
  
== Status ==
Memory usage on this node: 12.1/16.0 GiB
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 8.000: None | Iter 4.000: -2.255589689254761 | Iter 2.000: -2.151108751964569 | Iter 1.000: -2.2531853813171385
Resources requested: 8/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/chaklam/ray_results/DEFAULT_2020-11-24_16-14-56
Number of trials: 7/10 (1 PENDING, 4 RUNNING, 2 TERMINATE

[2m[36m(pid=20795)[0m [6,  2000] loss: 2.122
Result for DEFAULT_85401_00004:
  accuracy: 0.5606
  date: 2020-11-24_16-26-28
  done: false
  experiment_id: afeac2669df34f9f99167c92039c0b0e
  experiment_tag: 4_batch_size=16,l1=32,l2=8,lr=0.0027645
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 4
  loss: 1.2514871767044067
  node_ip: 203.159.32.153
  pid: 20793
  should_checkpoint: true
  time_since_restore: 499.0302529335022
  time_this_iter_s: 131.77584195137024
  time_total_s: 499.0302529335022
  timestamp: 1606209988
  timesteps_since_restore: 0
  training_iteration: 4
  trial_id: '85401_00004'
  
== Status ==
Memory usage on this node: 10.3/16.0 GiB
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 8.000: None | Iter 4.000: -1.966771477508545 | Iter 2.000: -2.151108751964569 | Iter 1.000: -2.2531853813171385
Resources requested: 8/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/chaklam/ray_results/DEFAULT_2020-11-24_16-14-56
Number of t

[2m[36m(pid=20795)[0m [7,  2000] loss: 2.056
Result for DEFAULT_85401_00004:
  accuracy: 0.5746
  date: 2020-11-24_16-28-39
  done: false
  experiment_id: afeac2669df34f9f99167c92039c0b0e
  experiment_tag: 4_batch_size=16,l1=32,l2=8,lr=0.0027645
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 5
  loss: 1.2007209563732146
  node_ip: 203.159.32.153
  pid: 20793
  should_checkpoint: true
  time_since_restore: 630.0681540966034
  time_this_iter_s: 131.0379011631012
  time_total_s: 630.0681540966034
  timestamp: 1606210119
  timesteps_since_restore: 0
  training_iteration: 5
  trial_id: '85401_00004'
  
== Status ==
Memory usage on this node: 9.9/16.0 GiB
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 8.000: None | Iter 4.000: -1.966771477508545 | Iter 2.000: -2.151108751964569 | Iter 1.000: -2.2392182922363277
Resources requested: 8/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/chaklam/ray_results/DEFAULT_2020-11-24_16-14-56
Number of tri

Result for DEFAULT_85401_00001:
  accuracy: 0.4507
  date: 2020-11-24_16-30-26
  done: false
  experiment_id: 1450e70e45764e7ca11f46dcf90c7ddf
  experiment_tag: 1_batch_size=16,l1=16,l2=8,lr=0.00045915
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 7
  loss: 1.518710731124878
  node_ip: 203.159.32.153
  pid: 20804
  should_checkpoint: true
  time_since_restore: 927.8521041870117
  time_this_iter_s: 123.06536388397217
  time_total_s: 927.8521041870117
  timestamp: 1606210226
  timesteps_since_restore: 0
  training_iteration: 7
  trial_id: '85401_00001'
  
== Status ==
Memory usage on this node: 9.4/16.0 GiB
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 8.000: None | Iter 4.000: -1.966771477508545 | Iter 2.000: -2.130254528236389 | Iter 1.000: -2.2392182922363277
Resources requested: 8/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/chaklam/ray_results/DEFAULT_2020-11-24_16-14-56
Number of trials: 7/10 (1 PENDING, 4 RUNNING, 2 TERMINATED)


[2m[36m(pid=20794)[0m [4,  8000] loss: 0.340
[2m[36m(pid=20794)[0m [4, 10000] loss: 0.272
Result for DEFAULT_85401_00003:
  accuracy: 0.2695
  date: 2020-11-24_16-32-10
  done: false
  experiment_id: 77aacf4bed5d4e0aa705175c0f88ee27
  experiment_tag: 3_batch_size=16,l1=4,l2=32,lr=0.00010738
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 8
  loss: 1.9869304985046388
  node_ip: 203.159.32.153
  pid: 20795
  should_checkpoint: true
  time_since_restore: 1031.4242887496948
  time_this_iter_s: 120.79872369766235
  time_total_s: 1031.4242887496948
  timestamp: 1606210330
  timesteps_since_restore: 0
  training_iteration: 8
  trial_id: '85401_00003'
  
== Status ==
Memory usage on this node: 8.6/16.0 GiB
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 8.000: -1.9869304985046388 | Iter 4.000: -1.966771477508545 | Iter 2.000: -2.130254528236389 | Iter 1.000: -2.2392182922363277
Resources requested: 8/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /

Result for DEFAULT_85401_00005:
  accuracy: 0.4932
  date: 2020-11-24_16-33-43
  done: false
  experiment_id: 56b2048cf7bf4528a9dd0a038f89f72b
  experiment_tag: 5_batch_size=4,l1=32,l2=128,lr=0.0026431
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 4
  loss: 1.4387769424036145
  node_ip: 203.159.32.153
  pid: 20794
  should_checkpoint: true
  time_since_restore: 584.8464779853821
  time_this_iter_s: 142.55190992355347
  time_total_s: 584.8464779853821
  timestamp: 1606210423
  timesteps_since_restore: 0
  training_iteration: 4
  trial_id: '85401_00005'
  
== Status ==
Memory usage on this node: 9.3/16.0 GiB
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 8.000: -1.7359124258995058 | Iter 4.000: -1.677953265762329 | Iter 2.000: -2.130254528236389 | Iter 1.000: -2.2392182922363277
Resources requested: 8/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/chaklam/ray_results/DEFAULT_2020-11-24_16-14-56
Number of trials: 7/10 (1 PENDING, 4 RUNNING

[2m[36m(pid=20795)[0m [10,  2000] loss: 1.883
Result for DEFAULT_85401_00004:
  accuracy: 0.6056
  date: 2020-11-24_16-34-48
  done: false
  experiment_id: afeac2669df34f9f99167c92039c0b0e
  experiment_tag: 4_batch_size=16,l1=32,l2=8,lr=0.0027645
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 8
  loss: 1.1289723748683929
  node_ip: 203.159.32.153
  pid: 20793
  should_checkpoint: true
  time_since_restore: 998.7450039386749
  time_this_iter_s: 124.01496601104736
  time_total_s: 998.7450039386749
  timestamp: 1606210488
  timesteps_since_restore: 0
  training_iteration: 8
  trial_id: '85401_00004'
  
== Status ==
Memory usage on this node: 9.9/16.0 GiB
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 8.000: -1.4848943532943726 | Iter 4.000: -1.677953265762329 | Iter 2.000: -2.130254528236389 | Iter 1.000: -2.2392182922363277
Resources requested: 8/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/chaklam/ray_results/DEFAULT_2020-11-24_16-14

Result for DEFAULT_85401_00001:
  accuracy: 0.5014
  date: 2020-11-24_17-58-34
  done: true
  experiment_id: 1450e70e45764e7ca11f46dcf90c7ddf
  experiment_tag: 1_batch_size=16,l1=16,l2=8,lr=0.00045915
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 10
  loss: 1.3830208913803101
  node_ip: 203.159.32.153
  pid: 20804
  should_checkpoint: true
  time_since_restore: 6215.440702199936
  time_this_iter_s: 5042.499175071716
  time_total_s: 6215.440702199936
  timestamp: 1606215514
  timesteps_since_restore: 0
  training_iteration: 10
  trial_id: '85401_00001'
  
== Status ==
Memory usage on this node: 11.1/16.0 GiB
Using AsyncHyperBand: num_stopped=4
Bracket: Iter 8.000: -1.4848943532943726 | Iter 4.000: -1.677953265762329 | Iter 2.000: -2.130254528236389 | Iter 1.000: -2.2392182922363277
Resources requested: 8/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/chaklam/ray_results/DEFAULT_2020-11-24_16-14-56
Number of trials: 8/10 (1 PENDING, 4 RUNNIN

[2m[36m(pid=23063)[0m Files already downloaded and verified
[2m[36m(pid=23063)[0m Files already downloaded and verified
Result for DEFAULT_85401_00007:
  accuracy: 0.2942
  date: 2020-11-24_18-00-48
  done: false
  experiment_id: 48b079f640fd4b03a449383928b2b722
  experiment_tag: 7_batch_size=16,l1=32,l2=64,lr=0.024697
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 1
  loss: 1.8508925651550292
  node_ip: 203.159.32.153
  pid: 22878
  should_checkpoint: true
  time_since_restore: 132.92341208457947
  time_this_iter_s: 132.92341208457947
  time_total_s: 132.92341208457947
  timestamp: 1606215648
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: '85401_00007'
  
== Status ==
Memory usage on this node: 8.4/16.0 GiB
Using AsyncHyperBand: num_stopped=5
Bracket: Iter 8.000: -1.4848943532943726 | Iter 4.000: -1.677953265762329 | Iter 2.000: -2.130254528236389 | Iter 1.000: -2.2392182922363277
Resources requested: 8/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61

Result for DEFAULT_85401_00007:
  accuracy: 0.3352
  date: 2020-11-24_18-02-55
  done: false
  experiment_id: 48b079f640fd4b03a449383928b2b722
  experiment_tag: 7_batch_size=16,l1=32,l2=64,lr=0.024697
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 2
  loss: 1.8862203100204469
  node_ip: 203.159.32.153
  pid: 22878
  should_checkpoint: true
  time_since_restore: 259.1835949420929
  time_this_iter_s: 126.26018285751343
  time_total_s: 259.1835949420929
  timestamp: 1606215775
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: '85401_00007'
  
== Status ==
Memory usage on this node: 8.7/16.0 GiB
Using AsyncHyperBand: num_stopped=6
Bracket: Iter 8.000: -1.4848943532943726 | Iter 4.000: -1.677953265762329 | Iter 2.000: -2.008237419128418 | Iter 1.000: -2.2252512031555174
Resources requested: 8/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/chaklam/ray_results/DEFAULT_2020-11-24_16-14-56
Number of trials: 10/10 (4 RUNNING, 6 TERMINA

[2m[36m(pid=20794)[0m [8,  2000] loss: 1.263
[2m[36m(pid=20794)[0m [8,  4000] loss: 0.645
[2m[36m(pid=20794)[0m [8,  6000] loss: 0.435
[2m[36m(pid=20794)[0m [8,  8000] loss: 0.324
[2m[36m(pid=20794)[0m [8, 10000] loss: 0.261
Result for DEFAULT_85401_00008:
  accuracy: 0.4793
  date: 2020-11-24_18-04-41
  done: false
  experiment_id: b0970b386d6d47c882b6966ca9275850
  experiment_tag: 8_batch_size=16,l1=8,l2=8,lr=0.0035444
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 2
  loss: 1.4434231176376342
  node_ip: 203.159.32.153
  pid: 23063
  should_checkpoint: true
  time_since_restore: 242.9028868675232
  time_this_iter_s: 120.0637698173523
  time_total_s: 242.9028868675232
  timestamp: 1606215881
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: '85401_00008'
  
== Status ==
Memory usage on this node: 8.7/16.0 GiB
Using AsyncHyperBand: num_stopped=7
Bracket: Iter 8.000: -1.4848943532943726 | Iter 4.000: -1.677953265762329 | Iter 2.000: -1.88622031

[2m[36m(pid=22878)[0m [4,  2000] loss: 1.800
[2m[36m(pid=20794)[0m [9,  2000] loss: 1.249
[2m[36m(pid=20794)[0m [9,  4000] loss: 0.655
[2m[36m(pid=20794)[0m [9,  6000] loss: 0.436
[2m[36m(pid=20794)[0m [9,  8000] loss: 0.323
[2m[36m(pid=20794)[0m [9, 10000] loss: 0.256
Result for DEFAULT_85401_00008:
  accuracy: 0.506
  date: 2020-11-24_21-09-19
  done: false
  experiment_id: b0970b386d6d47c882b6966ca9275850
  experiment_tag: 8_batch_size=16,l1=8,l2=8,lr=0.0035444
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 3
  loss: 1.3948659106254577
  node_ip: 203.159.32.153
  pid: 23063
  should_checkpoint: true
  time_since_restore: 11320.961384057999
  time_this_iter_s: 11078.058497190475
  time_total_s: 11320.961384057999
  timestamp: 1606226959
  timesteps_since_restore: 0
  training_iteration: 3
  trial_id: '85401_00008'
  
== Status ==
Memory usage on this node: 8.7/16.0 GiB
Using AsyncHyperBand: num_stopped=7
Bracket: Iter 8.000: -1.419052294562757 | Iter 4.

[2m[36m(pid=20794)[0m [10,  2000] loss: 1.223
[2m[36m(pid=20794)[0m [10,  4000] loss: 0.638
Result for DEFAULT_85401_00008:
  accuracy: 0.4925
  date: 2020-11-25_02-33-22
  done: false
  experiment_id: b0970b386d6d47c882b6966ca9275850
  experiment_tag: 8_batch_size=16,l1=8,l2=8,lr=0.0035444
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 4
  loss: 1.4151583589553833
  node_ip: 203.159.32.153
  pid: 23063
  should_checkpoint: true
  time_since_restore: 30763.478137016296
  time_this_iter_s: 19442.516752958298
  time_total_s: 30763.478137016296
  timestamp: 1606246402
  timesteps_since_restore: 0
  training_iteration: 4
  trial_id: '85401_00008'
  
== Status ==
Memory usage on this node: 9.2/16.0 GiB
Using AsyncHyperBand: num_stopped=8
Bracket: Iter 8.000: -1.419052294562757 | Iter 4.000: -1.677953265762329 | Iter 2.000: -1.8862203100204469 | Iter 1.000: -2.2392182922363277
Resources requested: 4/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /

[2m[36m(pid=23063)[0m [6,  2000] loss: 1.283
Result for DEFAULT_85401_00008:
  accuracy: 0.5278
  date: 2020-11-25_08-04-43
  done: false
  experiment_id: b0970b386d6d47c882b6966ca9275850
  experiment_tag: 8_batch_size=16,l1=8,l2=8,lr=0.0035444
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 6
  loss: 1.355634395647049
  node_ip: 203.159.32.153
  pid: 23063
  should_checkpoint: true
  time_since_restore: 50645.224475860596
  time_this_iter_s: 429.18142080307007
  time_total_s: 50645.224475860596
  timestamp: 1606266283
  timesteps_since_restore: 0
  training_iteration: 6
  trial_id: '85401_00008'
  
== Status ==
Memory usage on this node: 9.8/16.0 GiB
Using AsyncHyperBand: num_stopped=9
Bracket: Iter 8.000: -1.419052294562757 | Iter 4.000: -1.677953265762329 | Iter 2.000: -1.8862203100204469 | Iter 1.000: -2.2392182922363277
Resources requested: 2/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/chaklam/ray_results/DEFAULT_2020-11-24_16-14-

[2m[36m(pid=23063)[0m [9,  2000] loss: 1.225
Result for DEFAULT_85401_00008:
  accuracy: 0.5425
  date: 2020-11-25_08-10-45
  done: false
  experiment_id: b0970b386d6d47c882b6966ca9275850
  experiment_tag: 8_batch_size=16,l1=8,l2=8,lr=0.0035444
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 9
  loss: 1.2924960026741028
  node_ip: 203.159.32.153
  pid: 23063
  should_checkpoint: true
  time_since_restore: 51006.531791210175
  time_this_iter_s: 120.36501407623291
  time_total_s: 51006.531791210175
  timestamp: 1606266645
  timesteps_since_restore: 0
  training_iteration: 9
  trial_id: '85401_00008'
  
== Status ==
Memory usage on this node: 9.5/16.0 GiB
Using AsyncHyperBand: num_stopped=9
Bracket: Iter 8.000: -1.3532102358311415 | Iter 4.000: -1.677953265762329 | Iter 2.000: -1.8862203100204469 | Iter 1.000: -2.2392182922363277
Resources requested: 2/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/chaklam/ray_results/DEFAULT_2020-11-24_16-1

2020-11-25 08:12:46,030	INFO tune.py:439 -- Total run time: 57469.17 seconds (57469.14 seconds for the tuning loop).


Result for DEFAULT_85401_00008:
  accuracy: 0.532
  date: 2020-11-25_08-12-45
  done: true
  experiment_id: b0970b386d6d47c882b6966ca9275850
  experiment_tag: 8_batch_size=16,l1=8,l2=8,lr=0.0035444
  hostname: MacBook-Pro-2.local
  iterations_since_restore: 10
  loss: 1.3192587754249572
  node_ip: 203.159.32.153
  pid: 23063
  should_checkpoint: true
  time_since_restore: 51127.197018146515
  time_this_iter_s: 120.66522693634033
  time_total_s: 51127.197018146515
  timestamp: 1606266765
  timesteps_since_restore: 0
  training_iteration: 10
  trial_id: '85401_00008'
  
== Status ==
Memory usage on this node: 10.2/16.0 GiB
Using AsyncHyperBand: num_stopped=10
Bracket: Iter 8.000: -1.3532102358311415 | Iter 4.000: -1.677953265762329 | Iter 2.000: -1.8862203100204469 | Iter 1.000: -2.2392182922363277
Resources requested: 2/8 CPUs, 0/0 GPUs, 0.0/4.79 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/chaklam/ray_results/DEFAULT_2020-11-24_16-14-56
Number of trials: 10/10 (1 RUNNING, 9 TER

So that's it! You can now tune the parameters of your PyTorch models.