# Hyperparameter Optimization with Dask and Coiled

This example will walk through the following:

* **Getting and processing the data.**
* **Defining a model and parameters.**
* **Finding the best parameters,** and some details on why we're using the chosen search algorithm.
* **Scoring** and deploying.

All of these tasks will be performed on the New York City Taxi Cab dataset.

## Setup cluster

In [1]:
# Create cluster with Coiled
import coiled

cluster = coiled.Cluster(
    n_workers=20,
    configuration="coiled-examples/pytorch",
)

Creating Cluster. This takes about a minute ...

In [2]:
# Connect Dask to the cluster
import dask.distributed

client = dask.distributed.Client(cluster)
client

Client
  Scheduler: tls://ec2-3-18-106-62.us-east-2.compute.amazonaws.com:8786
  Dashboard: http://ec2-3-18-106-62.us-east-2.compute.amazonaws.com:8787/status
Cluster
  Workers: 20
  Cores: 80
  Memory: 343.60 GB


#### ☝️ Don’t forget to click the "Dashboard" link above to view the cluster dashboard!

## Get and pre-process data

This example will mirror the Kaggle "[NYC Taxi Trip Duration][1]" example with different data.

These data have records on 84 million taxi rides.

[1]:https://www.kaggle.com/c/nyc-taxi-trip-duration/

In [3]:
import dask.dataframe as dd

features = ["passenger_count", "trip_distance", "fare_amount"]
categorical_features = ["RatecodeID", "payment_type"]
output = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]

df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv", 
    parse_dates=output,
    usecols=features + categorical_features + output,
    dtype={
        "passenger_count": "UInt8",
        "RatecodeID": "category",
        "payment_type": "category",
    },
    blocksize="16 MiB",
)

df = df.repartition(partition_size="10 MiB").persist()

# one hot encode the categorical columns
df = df.categorize(categorical_features)
df = dd.get_dummies(df, columns=categorical_features)

# persist so only download once
df = df.persist()

data = df[[c for c in df.columns if c not in output]]
data = data.fillna(0)

In [4]:
durations = (df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]).dt.total_seconds() / 60  # minutes
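The duration arithmetic above can be checked on a single ride with plain `datetime` objects. This is only a minimal sketch: the pickup and dropoff timestamps are made-up values, not rows from the dataset.

```python
from datetime import datetime

# Hypothetical timestamps for one ride (not taken from the dataset)
pickup = datetime(2019, 1, 1, 0, 46, 40)
dropoff = datetime(2019, 1, 1, 0, 53, 20)

# Same arithmetic as the Dask expression: timedelta -> seconds -> minutes
duration_minutes = (dropoff - pickup).total_seconds() / 60
print(duration_minutes)
```

The Dask version applies the same subtraction and conversion to every row of the two datetime columns.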

In [5]:
from dask_ml.model_selection import train_test_split
import dask

X = data.to_dask_array(lengths=True).astype("float32")
y = durations.to_dask_array(lengths=True).astype("float32")
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2, shuffle=True)

# persist the data so it's not re-computed
X_train, X_test, y_train, y_test = dask.persist(X_train, X_test, y_train, y_test)

## Define model and hyperparameters

Let's use a simple neural network from [PyTorch], wrapped with [Skorch], which provides a Scikit-Learn API for PyTorch models.

The network is kept small for demonstration purposes. If desired, we could use much larger networks on GPUs.

[PyTorch]:https://pytorch.org/
[skorch]:https://skorch.readthedocs.io/en/stable/

In [6]:
# Import our HiddenLayerNet pytorch model from a local torch_model.py module
from torch_model import HiddenLayerNet
# Send module with HiddenLayerNet to workers on cluster
client.upload_file("torch_model.py")

In [7]:
# Print contents of torch_model.py module
!cat torch_model.py

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F

class HiddenLayerNet(nn.Module):
    def __init__(self, n_features=10, n_outputs=1, n_hidden=100, activation="relu"):
        super().__init__()
        self.fc1 = nn.Linear(n_features, n_hidden)
        self.fc2 = nn.Linear(n_hidden, n_outputs)
        self.activation = getattr(F, activation)

    def forward(self, x, **kwargs):
        return self.fc2(self.activation(self.fc1(x)))

In [8]:
import torch
import torch.optim as optim
import torch.nn as nn
from skorch import NeuralNetRegressor

niceties = {
    "callbacks": False,
    "warm_start": True,
    "train_split": None,
    "max_epochs": 1,
}

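class NonNanLossRegressor(NeuralNetRegressor):
    def get_loss(self, y_pred, y_true, X=None, training=False):
        # Guard against diverging models: when the mean absolute error explodes,
        # return a constant zero loss that is disconnected from the network,
        # so this batch contributes no gradient to the weights.
        if torch.abs(y_true - y_pred).mean() > 1e6:
            return torch.tensor([0.0], requires_grad=True)
        return super().get_loss(y_pred, y_true, X=X, training=training)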

model = NonNanLossRegressor(
    module=HiddenLayerNet,
    module__n_features=X_train.shape[1],
    optimizer=optim.SGD,
    criterion=nn.MSELoss,
    lr=0.0001,
    **niceties,
)

In [9]:
from scipy.stats import loguniform, uniform

params = {
    "module__activation": ["relu", "elu", "softsign", "leaky_relu", "rrelu"],
    "batch_size": [32, 64, 128, 256],
    "optimizer__lr": loguniform(1e-4, 1e-3),
    "optimizer__weight_decay": loguniform(1e-6, 1e-3),
    "optimizer__momentum": uniform(0, 1),
    "optimizer__nesterov": [True],
}

Of these parameters, only `module__activation` changes the model architecture. The rest (`batch_size` and the `optimizer__*` settings such as `optimizer__lr`) are optimization parameters: they control how the best model of a particular architecture is found.
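To make the search space more concrete, the sketch below draws one candidate configuration by hand using only the standard library. It is an illustration, not how the search samples internally: `sample_loguniform` is a hypothetical helper that mimics a log-uniform distribution (the logarithm of the value is uniform between the bounds), and `scipy.stats.uniform(0, 1)` is taken to mean a uniform distribution on [0, 1].

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(2)  # seeded so repeated runs give the same draw

# Hypothetical hand-rolled draw from a search space shaped like `params`
candidate = {
    "module__activation": rng.choice(["relu", "elu", "softsign", "leaky_relu", "rrelu"]),
    "batch_size": rng.choice([32, 64, 128, 256]),
    "optimizer__lr": sample_loguniform(1e-4, 1e-3, rng),
    "optimizer__weight_decay": sample_loguniform(1e-6, 1e-3, rng),
    "optimizer__momentum": rng.uniform(0, 1),
    "optimizer__nesterov": True,
}
print(candidate)
```

Sampling on a log scale spreads candidates evenly across orders of magnitude, which is why it suits the learning rate and weight decay ranges.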

## Find the best hyperparameters

Our search is "computationally constrained" because (hypothetically) it requires GPUs and has a fairly complicated search space; in reality it has neither of those features. It is also "memory constrained" because the dataset doesn't fit in memory.

[Dask-ML's documentation on hyperparameter searches][2] indicates that we should use `HyperbandSearchCV`.

[2]:https://ml.dask.org/hyper-parameter-search.html

In [10]:
from dask_ml.model_selection import HyperbandSearchCV
search = HyperbandSearchCV(model, params, random_state=2, verbose=True, max_iter=9)

By default, `HyperbandSearchCV` will call `partial_fit` on each chunk of the Dask Array. `HyperbandSearchCV`'s rule of thumb specifies how to train for longer or sample more parameters.
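To get a feel for how `max_iter` shapes the search, the sketch below computes bracket sizes with the textbook Hyperband formulas and an aggressiveness of 3. It is an assumption that these formulas match Dask-ML's internals exactly; `hyperband_brackets` is a hypothetical helper, not part of the Dask-ML API, and Dask-ML may schedule the individual `partial_fit` and scoring calls differently.

```python
def hyperband_brackets(max_iter, eta=3):
    """Return (bracket, n_models, initial partial_fit budget) for each bracket."""
    # Largest s with eta**s <= max_iter, found with integer arithmetic
    s_max = 0
    while eta ** (s_max + 1) <= max_iter:
        s_max += 1

    brackets = []
    for s in range(s_max, -1, -1):
        # n = ceil((s_max + 1) / (s + 1) * eta**s), computed without floats
        n_models = -(-(s_max + 1) * eta**s // (s + 1))
        # r = max_iter / eta**s: textbook budget before the first elimination
        first_budget = max_iter // eta**s
        brackets.append((s, n_models, first_budget))
    return brackets


for bracket, n_models, first_budget in hyperband_brackets(max_iter=9):
    print(f"bracket={bracket}: {n_models} models, initial budget {first_budget}")
```

For `max_iter=9` this gives brackets of 9, 5, and 3 models. Brackets with more models start each one on a smaller budget and eliminate aggressively, while brackets with fewer models train each one for longer.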

In [11]:
y_train2 = y_train.reshape(-1, 1).persist()
search.fit(X_train, y_train2)

[CV, bracket=2] creating 9 models
[CV, bracket=1] creating 5 models
[CV, bracket=0] creating 3 models
[CV, bracket=0] For training there are between 119153 and 249047 examples in each chunk
[CV, bracket=1] For training there are between 119153 and 249047 examples in each chunk
[CV, bracket=2] For training there are between 119153 and 249047 examples in each chunk
[CV, bracket=1] validation score of 0.0202 received after 1 partial_fit calls
[CV, bracket=0] validation score of -3.3790 received after 1 partial_fit calls
[CV, bracket=1] validation score of 0.0210 received after 3 partial_fit calls
[CV, bracket=2] validation score of 0.0229 received after 1 partial_fit calls
[CV, bracket=1] validation score of -299404463816680.2500 received after 9 partial_fit calls
[CV, bracket=0] validation score of -11.9127 received after 9 partial_fit calls
[CV, bracket=2] validation score of 0.0232 received after 3 partial_fit calls
[CV, bracket=2] validation score of 0.0280 received after 9 partial_fit calls

HyperbandSearchCV(estimator=<class '__main__.NonNanLossRegressor'>[uninitialized](
  module=<class 'torch_model.HiddenLayerNet'>,
  module__n_features=15,
),
                  max_iter=9,
                  parameters={'batch_size': [32, 64, 128, 256],
                              'module__activation': ['relu', 'elu', 'softsign',
                                                     'leaky_relu', 'rrelu'],
                              'optimizer__lr': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff9de9e5450>,
                              'optimizer__momentum': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff9df575e50>,
                              'optimizer__nesterov': [True],
                              'optimizer__weight_decay': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7ff9de9e5850>},
                  random_state=2, verbose=True)

## Score

`HyperbandSearchCV` and Dask-ML's other search classes mirror the Scikit-Learn model selection interface, so the familiar attributes of Scikit-Learn's [RandomizedSearchCV][rscv], such as `best_score_`, `best_params_`, and `best_estimator_`, are available:

[rscv]:https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

In [12]:
search.best_score_

0.028028356182226544

In [13]:
search.best_params_

{'batch_size': 256,
 'module__activation': 'softsign',
 'optimizer__lr': 0.00015404537696021744,
 'optimizer__momentum': 0.15141540401838427,
 'optimizer__nesterov': True,
 'optimizer__weight_decay': 0.000576470051148445}

In [14]:
search.best_estimator_

<class '__main__.NonNanLossRegressor'>[initialized](
  module_=HiddenLayerNet(
    (fc1): Linear(in_features=15, out_features=100, bias=True)
    (fc2): Linear(in_features=100, out_features=1, bias=True)
  ),
)

This means we can deploy the best model and score on the testing dataset:

In [15]:
from dask_ml.wrappers import ParallelPostFit
deployed_model = ParallelPostFit(search.best_estimator_)
deployed_model.score(X_test, y_test)

0.028248285332490686
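To interpret these numbers, it helps to recall what the score measures. Assuming the default `score` of a Scikit-Learn-style regressor is used here, the value is the coefficient of determination (R²). The sketch below is a hypothetical pure-Python version of that metric on made-up values: 1.0 is a perfect fit, 0.0 matches always predicting the mean, and negative values are worse than predicting the mean.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up trip durations in minutes (not from the dataset)
y_true = [10.0, 20.0, 30.0, 40.0]

perfect = r2_score(y_true, y_true)                     # predictions equal the targets
baseline = r2_score(y_true, [25.0] * 4)                # always predict the mean
reversed_ = r2_score(y_true, [40.0, 30.0, 20.0, 10.0]) # worse than the mean
print(perfect, baseline, reversed_)
```

Under that assumption, a test score near 0.028 means the small demonstration network is only slightly better than predicting the mean duration, and the large negative validation scores in the search log correspond to models that diverged.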