# Ray Crash Course - Distributed HPO with Ray Tune's TuneGridSearchCV and Scikit-learn

© 2019-2022, Anyscale. All Rights Reserved

This demo introduces **Ray Tune's** key concepts using a classification example. Newcomers can get started with Ray Tune by following a basic three-step pattern. Here we swap Scikit-learn's `GridSearchCV` for Ray Tune's distributed drop-in replacement, `TuneGridSearchCV`, and run both on the same parameter grid.

See also the [Understanding Hyperparameter Tuning](https://github.com/anyscale/academy/blob/main/ray-tune/02-Understanding-Hyperparameter-Tuning.ipynb) notebook and the [Tune documentation](http://tune.io), in particular, the [API reference](https://docs.ray.io/en/latest/tune/api_docs/overview.html). 


In [1]:
from sklearn.model_selection import GridSearchCV
# Import Tune's replacement
from ray.tune.sklearn import TuneGridSearchCV

# Other relevant imports
from sklearn.model_selection import train_test_split

# Use the stochastic gradient descent (SGD) classifier
from sklearn.linear_model import SGDClassifier

# import the classification dataset
from sklearn.datasets import make_classification
import numpy as np
import time
import logging
import ray



In [2]:
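CONNECT_TO_ANYSCALE = False

# ray.is_initialized is a function; it must be called to get a boolean
if ray.is_initialized():
    ray.shutdown()

# Start (or connect to) Ray whether or not a previous session existed
if CONNECT_TO_ANYSCALE:
    ray.init("anyscale://jsd-weekly-demo")
else:
    ray.init()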

2022-03-27 12:44:31,046	INFO services.py:1412 -- View the Ray dashboard at http://127.0.0.1:8265


### Create Feature Set

 * 250K rows
 * 250 features
 * 2 classes

In [3]:
def create_classification_data() -> (np.ndarray, np.ndarray):
    X, y = make_classification(
        n_samples=250000,
        n_features=250,
        n_informative=50,
        n_redundant=0,
        n_classes=2,
        class_sep=2.5)
    return X, y
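
As a rough sizing check for the feature set described above, the sketch below estimates how much memory the feature matrix occupies. It assumes each value is stored as a 64-bit float, which is what `make_classification` is expected to return; the helper name `feature_matrix_bytes` is hypothetical and not part of the notebook.

```python
def feature_matrix_bytes(n_samples: int, n_features: int, bytes_per_value: int = 8) -> int:
    """Size of a dense feature matrix in bytes (assumes float64 by default)."""
    return n_samples * n_features * bytes_per_value


# 250K rows x 250 features, as configured in create_classification_data()
size = feature_matrix_bytes(250_000, 250)
print(f"{size / 1e6:.0f} MB")  # 500 MB
```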

### Create classification data and define parameter search space

In [4]:
X, y = create_classification_data()
# Split the dataset into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=10000)

# Example parameter grid for tuning SGDClassifier
parameter_grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}
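
To see how much work this grid implies, the sketch below expands it into individual candidates with the standard library and counts the resulting fits. It assumes 5-fold cross-validation, which is Scikit-learn's default when `cv` is not given, and folds of equal size; the fold sizes are therefore approximate.

```python
from itertools import product

parameter_grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}

# Assumption: 5 folds, Scikit-learn's default when cv is not specified.
n_folds = 5
# Rows left for training after holding out test_size=10000 of the 250000 samples.
n_train = 250_000 - 10_000

# Every combination of one value per parameter is one candidate.
names = list(parameter_grid)
candidates = [dict(zip(names, values)) for values in product(*parameter_grid.values())]

total_fits = len(candidates) * n_folds
rows_per_validation_fold = n_train // n_folds
rows_per_fit = n_train - rows_per_validation_fold

for candidate in candidates:
    print(candidate)
print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
print(f"each fit trains on about {rows_per_fit} rows and validates on about {rows_per_validation_fold}")
```

Scikit-learn's own `ParameterGrid` class performs the same expansion; the manual version is shown only to make the combinatorics explicit.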

## Use Regular Scikit-learn GridSearch
This will run on a single node using all its cores.

In [5]:
# n_jobs=-1 uses all available cores
sklearn_search = GridSearchCV(SGDClassifier(),
                    parameter_grid,
                    n_jobs=-1,
                    verbose=True)

In [6]:
%%time
sklearn_search.fit(x_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
CPU times: user 1.72 s, sys: 320 ms, total: 2.04 s
Wall time: 15.2 s


GridSearchCV(estimator=SGDClassifier(), n_jobs=-1,
             param_grid={'alpha': [0.0001, 0.1, 1], 'epsilon': [0.01, 0.1]},
             verbose=True)

In [7]:
print(f"Standard Scikit-learn GridSearchCV Best params: {sklearn_search.best_params_}")

Standard Scikit-learn GridSearchCV Best params: {'alpha': 0.1, 'epsilon': 0.01}


## Use Ray's Scikit-learn drop-in replacement TuneGridSearchCV
This runs the same search using all cores of a Ray cluster or of the local host.

In [8]:
# Now run the same search with Tune's drop-in replacement
# Note: If early_stopping=True, TuneGridSearchCV will default to using Tune’s ASHAScheduler.
tune_sklearn = TuneGridSearchCV(SGDClassifier(), 
                    parameter_grid,
                    early_stopping=True,
                    max_iters=30,
                    n_jobs=12,    # Use 40 cores if running on a cluster
                    mode="min",
                    verbose=True)
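
The early-stopping idea behind ASHA (Asynchronous Successive Halving) can be illustrated with a simplified, synchronous sketch. This is not Ray's implementation: real ASHA promotes trials asynchronously, and the `grace_period` and `reduction_factor` values, the function names, and the scores below are all made up for illustration. Only `max_iters=30` and the count of six candidates come from this notebook. In the sketch, higher scores are better.

```python
from typing import List


def rung_milestones(max_iters: int, grace_period: int, reduction_factor: int) -> List[int]:
    """Return the iteration counts at which trials are compared and pruned."""
    milestones = []
    t = grace_period
    while t < max_iters:
        milestones.append(t)
        t *= reduction_factor
    return milestones


def successive_halving(trials, score_at, milestones, reduction_factor):
    """Keep the top 1/reduction_factor of trials at each milestone."""
    survivors = list(trials)
    for t in milestones:
        ranked = sorted(survivors, key=lambda trial: score_at(trial, t), reverse=True)
        keep = max(1, len(ranked) // reduction_factor)
        survivors = ranked[:keep]
    return survivors


# Made-up final qualities for six anonymous trials (one per grid candidate).
final_quality = {0: 0.71, 1: 0.90, 2: 0.64, 3: 0.85, 4: 0.55, 5: 0.78}


def score_at(trial: int, t: int) -> float:
    # Toy learning curve: approaches the trial's final quality as t grows.
    return final_quality[trial] * t / (t + 1)


# Illustrative settings; only max_iters=30 is taken from the notebook.
milestones = rung_milestones(max_iters=30, grace_period=1, reduction_factor=3)
winners = successive_halving(final_quality, score_at, milestones, reduction_factor=3)
print(milestones)  # [1, 3, 9, 27]
print(winners)     # [1]
```

The point of the scheme is that weak candidates are dropped after a few iterations, so most of the 30-iteration budget is spent on the candidates that look promising early.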



In [9]:
%%time
tune_sklearn.fit(x_train, y_train)

2022-03-27 12:47:54,444	INFO tune.py:639 -- Total run time: 45.09 seconds (44.24 seconds for the tuning loop).


CPU times: user 2.08 s, sys: 1.19 s, total: 3.27 s
Wall time: 46.4 s


TuneGridSearchCV(early_stopping=<ray.tune.schedulers.async_hyperband.AsyncHyperBandScheduler object at 0x7fc67248ae20>,
                 estimator=SGDClassifier(),
                 loggers=[<class 'ray.tune.logger.JsonLogger'>,
                          <class 'ray.tune.logger.CSVLogger'>],
                 max_iters=30, mode='min', n_jobs=12,
                 param_grid={'alpha': [0.0001, 0.1, 1], 'epsilon': [0.01, 0.1]},
                 scoring={'score': <function _passthrough_scorer at 0x7fc670aff790>},
                 sk_n_jobs=1, verbose=True)

In [10]:
print(f"Ray Tune Scikit-learn TuneGridSearchCV Best params: {tune_sklearn.best_params}")

Ray Tune Scikit-learn TuneGridSearchCV Best params: {'alpha': 0.1, 'epsilon': 0.01}


In [11]:
ray.shutdown()