# Distributed HPO with Ray Tune and XGBoost-Ray

This demo introduces **Ray Tune's** key concepts using a classification example derived from [Hyperparameter Tuning with Ray Tune and XGBoost-Ray](https://github.com/ray-project/xgboost_ray#hyperparameter-tuning). Newcomers can get started with Ray Tune by following a basic three-step pattern.

Three simple steps:

 1. Set up your config space and define your trainable and objective function
 2. Use Tune to execute your hyperparameter sweep, supplying the appropriate arguments, including the search space and, optionally, [search algorithms](https://docs.ray.io/en/latest/tune/api_docs/suggestion.html#summary) or [trial schedulers](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-schedulers)
 3. Examine or analyse the results returned
 
 <img src="https://docs.ray.io/en/latest/_images/tune-workflow.png" height="50%" width="60%">
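To make the shape of the three steps concrete before bringing in Ray, here is a minimal sketch that uses only the Python standard library. It is not Ray Tune: the `search_space`, `objective`, and `run_sweep` names are hypothetical stand-ins, the objective is a toy function, and the sweep is a plain serial random search.

```python
import random

# Step 1: a config space and an objective (both hypothetical stand-ins).
search_space = {
    "x": lambda rng: rng.uniform(-5.0, 5.0),
    "y": lambda rng: rng.choice([1, 2, 3]),
}


def objective(config):
    """Toy loss with its minimum at x = 2, y = 1."""
    return (config["x"] - 2.0) ** 2 + config["y"]


# Step 2: run the sweep (serial random search; Tune would distribute this).
def run_sweep(num_samples, seed=0):
    rng = random.Random(seed)
    results = []
    for trial_id in range(num_samples):
        config = {name: sample(rng) for name, sample in search_space.items()}
        results.append(
            {"trial_id": trial_id, "config": config, "loss": objective(config)}
        )
    return results


# Step 3: examine the results and pick the trial with the lowest loss.
results = run_sweep(num_samples=20)
best = min(results, key=lambda r: r["loss"])
print(best["config"], best["loss"])
```

In the rest of this notebook, Ray Tune takes over step 2 and runs the trials in parallel on the cluster.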


See also the [Understanding Hyperparameter Tuning](https://github.com/anyscale/academy/blob/main/ray-tune/02-Understanding-Hyperparameter-Tuning.ipynb) notebook and the [Tune documentation](http://tune.io), in particular, the [API reference](https://docs.ray.io/en/latest/tune/api_docs/overview.html). 


In [11]:
from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

import ray
from ray import tune
CONNECT_TO_ANYSCALE = True  # set to False to start a local Ray instance instead

In [13]:
# Restart Ray cleanly if a previous session is still active
if ray.is_initialized():
    ray.shutdown()

if CONNECT_TO_ANYSCALE:
    ray.init("anyscale://jsd-weekly-demo")
else:
    ray.init()

Output
(anyscale +0.2s) .anyscale.yaml found in project_dir. Directory is attached to a project.
(anyscale +0.3s) Using project (name: prj-weekly-demo, project_dir: /Users/jules/git-repos/ray-core-tutorial, id: prj_5rvR1w2ciyUs9RM27FeZ6FVB).
(anyscale +2.1s) cluster jsd-weekly-demo is currently running, the cluster will not be restarted.


2022-02-03 09:53:31,717	INFO packaging.py:352 -- Creating a file package for local directory '/Users/jules/git-repos/ray-core-tutorial'.
2022-02-03 09:53:31,764	INFO packaging.py:221 -- Pushing file package 'gcs://_ray_pkg_449719b58f870f86.zip' (6.35MiB) to Ray cluster...
2022-02-03 09:53:37,314	INFO packaging.py:224 -- Successfully pushed file package 'gcs://_ray_pkg_449719b58f870f86.zip'.


(anyscale +19.4s) Connected to jsd-weekly-demo, see: https://console.anyscale.com/projects/prj_5rvR1w2ciyUs9RM27FeZ6FVB/clusters/ses_jUg93ra8KHWTzAMZv5nig2Rb
(anyscale +19.4s) URL for head node of cluster: https://session-jug93ra8khwtzamzv5nig2rb.i.anyscaleuserdata.com


## Step 1: Define a 'Trainable' training function to use with Ray Tune's `tune.run(...)`

In [14]:
NUM_OF_ACTORS = 4           # number of XGBoost-Ray training actors used by each trial
NUM_OF_CPUS_PER_ACTOR = 1   # number of CPUs per actor

ray_params = RayParams(num_actors=NUM_OF_ACTORS, cpus_per_actor=NUM_OF_CPUS_PER_ACTOR)

In [15]:
def train_func_model(config: dict, checkpoint_dir=None):
    # Load the breast cancer dataset as a feature matrix and labels
    train_X, train_y = load_breast_cancer(return_X_y=True)
    # Convert to the RayDMatrix data structure used by XGBoost-Ray
    train_set = RayDMatrix(train_X, train_y)

    # Empty dictionary for the evaluation results reported back
    # to Tune
    evals_result = {}

    # Train the model with XGBoost-Ray's train()
    bst = train(
        params=config,                       # hyperparameters sampled for this trial
        dtrain=train_set,                    # our RayDMatrix data structure
        evals_result=evals_result,           # placeholder for results
        evals=[(train_set, "train")],        # evaluate on the training set, named "train"
        verbose_eval=False,
        ray_params=ray_params)               # distributed training settings for XGBoost-Ray

    # Save the trained booster
    bst.save_model("model.xgb")
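
The metric names used later in this notebook (`train-error`, `train-logloss`) combine the evaluation set name (`"train"`) with each entry in `eval_metric`. XGBoost fills `evals_result` as a nested dictionary keyed first by evaluation set and then by metric, with one value per boosting round. The sketch below shows how such a dictionary could be flattened into that naming scheme. The helper `flatten_evals_result` is hypothetical and the numbers are made up; this illustrates the naming only and is not the library's own reporting code.

```python
def flatten_evals_result(evals_result):
    """Flatten {eval_name: {metric: [history]}} into {"eval_name-metric": last_value}."""
    report = {}
    for eval_name, metrics in evals_result.items():
        for metric_name, history in metrics.items():
            # Keep only the value from the most recent boosting round
            report[f"{eval_name}-{metric_name}"] = history[-1]
    return report


# Made-up values for two boosting rounds, for illustration only
example_evals_result = {
    "train": {
        "logloss": [0.69, 0.68],
        "error": [0.03, 0.02],
    }
}

report = flatten_evals_result(example_evals_result)
print(report)
```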

## Step 2: Define a hyperparameter search space

In [16]:
# Search space: constant entries are passed to every trial unchanged;
# tune.* entries are sampled separately for each trial
config = {
    "tree_method": "approx",
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
    "eta": tune.loguniform(1e-4, 1e-1),
    "subsample": tune.uniform(0.5, 1.0),
    "max_depth": tune.randint(1, 9)
}
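
To build intuition for what each sampled entry produces, the following sketch mimics the three distributions with the standard library only. It is an approximation for illustration, not Ray Tune's implementation: `loguniform` and `sample_config` are hypothetical helpers. It assumes that `tune.loguniform` samples uniformly in log space and that the upper bound of `tune.randint` is exclusive, so `max_depth` ranges over 1 to 8.

```python
import math
import random


def loguniform(rng, low, high):
    """Sample so that log(value) is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Draw one configuration shaped like the search space above."""
    return {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "eta": loguniform(rng, 1e-4, 1e-1),   # spread over three orders of magnitude
        "subsample": rng.uniform(0.5, 1.0),   # uniform fraction of rows
        "max_depth": rng.randrange(1, 9),     # integers 1..8 (upper bound excluded)
    }


rng = random.Random(42)
samples = [sample_config(rng) for _ in range(1000)]
print(samples[0])
```

Because `eta` is drawn in log space, small learning rates such as 0.0002 are about as likely as larger ones such as 0.02, which a plain uniform draw over the same interval would rarely produce.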

## Step 3: Run the Ray Tune trainer and examine the results

Ray Tune will launch distributed HPO as four trials (`num_samples=4`), each running its own instance of the trainable function, and each trial in turn trains with four remote XGBoost-Ray actors.

<img src="images/ray_tune_dist_hpo.png" height="60%" width="70%"> 
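
A quick back-of-the-envelope check of the CPU footprint helps when sizing the cluster. The run in this step samples four trials, and `ray_params` requests four actors with one CPU each. The sketch below assumes each trial also reserves one CPU for the trainable function itself. That extra CPU is an assumption inferred from the `Resources requested: 20.0/80 CPUs` line in the output further down, not something stated by the library, and `cpus_requested` is a hypothetical helper.

```python
def cpus_requested(num_trials, num_actors, cpus_per_actor, cpus_for_trainable=1):
    """Estimate total CPUs when all trials run concurrently.

    cpus_for_trainable is an assumed per-trial overhead for the
    trainable function that drives the training actors.
    """
    per_trial = cpus_for_trainable + num_actors * cpus_per_actor
    return num_trials * per_trial


total_cpus = cpus_requested(num_trials=4, num_actors=4, cpus_per_actor=1)
print(total_cpus)  # 20
```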

In [18]:
# Run the sweep: sample 4 configurations from the search space and
# minimize the training error reported by each trial
analysis = tune.run(
    train_func_model,
    config=config,
    metric="train-error",
    mode="min",
    num_samples=4,
    verbose=1,
    resources_per_trial=ray_params.get_tune_resources()
)

(run pid=27255) == Status ==
(run pid=27255) Current time: 2022-02-03 09:55:33 (running for 00:00:00.12)
(run pid=27255) Memory usage on this node: 4.9/61.4 GiB
(run pid=27255) Using FIFO scheduling algorithm.
(run pid=27255) Resources requested: 0/80 CPUs, 0/0 GPUs, 0.0/215.83 GiB heap, 0.0/92.18 GiB objects
(run pid=27255) Result logdir: /home/ray/ray_results/train_func_model_2022-02-03_09-55-33
(run pid=27255) Number of trials: 4/4 (4 PENDING)
(run pid=27255)
(run pid=27255)


(ImplicitFunc pid=2639, ip=172.31.98.221) 2022-02-03 09:55:35,391	INFO main.py:979 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
(ImplicitFunc pid=2473, ip=172.31.124.19) 2022-02-03 09:55:35,380	INFO main.py:979 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
(ImplicitFunc pid=27982) 2022-02-03 09:55:35,537	INFO main.py:979 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
(ImplicitFunc pid=2221, ip=172.31.127.237) 2022-02-03 09:55:35,540	INFO main.py:979 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
(ImplicitFunc pid=2473, ip=172.31.124.19) 2022-02-03 09:55:37,202	INFO main.py:1024 -- [RayXGBoost] Starting XGBoost training.
(ImplicitFunc pid=2639, ip=172.31.98.221) 2022-02-03 09:55:37,210	INFO main.py:

(run pid=27255) 2022-02-03 09:55:38,915	INFO command_runner.py:357 -- Fetched IP: 172.31.124.19
(run pid=27255) 2022-02-03 09:55:38,915	INFO log_timer.py:25 -- NodeUpdater: ins_v94pQxcwcMPuqSZ2ri7zrC6k: Got IP  [LogTimer=31ms]
(run pid=27255) == Status ==
(run pid=27255) Current time: 2022-02-03 09:55:39 (running for 00:00:06.33)
(run pid=27255) Memory usage on this node: 5.7/61.4 GiB
(run pid=27255) Using FIFO scheduling algorithm.
(run pid=27255) Resources requested: 20.0/80 CPUs, 0/0 GPUs, 0.0/215.83 GiB heap, 0.0/92.18 GiB objects
(run pid=27255) Current best trial: 7bb96_00000 with train-error=0.040422 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.001562556356302479, 'subsample': 0.6628662961853669, 'max_depth': 4, 'nthread': 1, 'n_jobs': 1}
(run pid=27255) Result logdir: /home/ray/ray_results/

(ImplicitFunc pid=2473, ip=172.31.124.19) 2022-02-03 09:55:42,878	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 7.53 seconds (5.67 pure XGBoost training time).
(ImplicitFunc pid=2639, ip=172.31.98.221) 2022-02-03 09:55:42,889	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 7.53 seconds (5.67 pure XGBoost training time).
(ImplicitFunc pid=27982) 2022-02-03 09:55:42,864	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 7.36 seconds (5.38 pure XGBoost training time).
(ImplicitFunc pid=2221, ip=172.31.127.237) 2022-02-03 09:55:42,878	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 7.37 seconds (5.51 pure XGBoost training time).
(run pid=27255) 2022-02-03 09:55:43,015	INFO tune.py:626 -- Total run time: 9.71 seconds (9.56 seconds for 

(run pid=27255) == Status ==
(run pid=27255) Current time: 2022-02-03 09:55:42 (running for 00:00:09.57)
(run pid=27255) Memory usage on this node: 5.4/61.4 GiB
(run pid=27255) Using FIFO scheduling algorithm.
(run pid=27255) Resources requested: 0/80 CPUs, 0/0 GPUs, 0.0/215.83 GiB heap, 0.0/92.18 GiB objects
(run pid=27255) Current best trial: 7bb96_00001 with train-error=0.010545 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0024262743496167613, 'subsample': 0.9902790091697815, 'max_depth': 7, 'nthread': 1, 'n_jobs': 1}
(run pid=27255) Result logdir: /home/ray/ray_results/train_func_model_2022-02-03_09-55-33
(run pid=27255) Number of trials: 4/4 (4 TERMINATED)
(run pid=27255)
(run pid=27255)


In [19]:
print("Best hyperparameters", analysis.best_config)

Best hyperparameters {'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0024262743496167613, 'subsample': 0.9902790091697815, 'max_depth': 7}


In [20]:
analysis.results_df.head(5)

| trial_id | train-logloss | train-error | time_this_iter_s | done | timesteps_total | episodes_total | training_iteration | experiment_id | date | timestamp | ... | iterations_since_restore | experiment_tag | config.tree_method | config.objective | config.eval_metric | config.eta | config.subsample | config.max_depth | config.nthread | config.n_jobs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7bb96_00000 | 0.680007 | 0.022847 | 0.006234 | True | | | 10 | 859538adf98e4d9d9a51b9bf8ed24c90 | 2022-02-03_09-55-42 | 1643910942 | ... | 10 | 0_eta=0.0015626,max_depth=4,subsample=0.66287 | approx | binary:logistic | [logloss, error] | 0.001563 | 0.662866 | 4 | 1 | 1 |
| 7bb96_00001 | 0.671894 | 0.010545 | 0.00632 | True | | | 10 | 2b13d621f94c43dc85a1125686c7ef59 | 2022-02-03_09-55-42 | 1643910942 | ... | 10 | 1_eta=0.0024263,max_depth=7,subsample=0.99028 | approx | binary:logistic | [logloss, error] | 0.002426 | 0.990279 | 7 | 1 | 1 |
| 7bb96_00002 | 0.656298 | 0.073814 | 0.005506 | True | | | 10 | 7623639d937a4aa09455efdaef961293 | 2022-02-03_09-55-42 | 1643910942 | ... | 10 | 2_eta=0.00548,max_depth=1,subsample=0.78274 | approx | binary:logistic | [logloss, error] | 0.00548 | 0.782738 | 1 | 1 | 1 |
| 7bb96_00003 | 0.691746 | 0.01406 | 0.016235 | True | | | 10 | 3852c6357dfd4112b5923de76dae78d3 | 2022-02-03_09-55-42 | 1643910942 | ... | 10 | 3_eta=0.00015802,max_depth=4,subsample=0.99192 | approx | binary:logistic | [logloss, error] | 0.000158 | 0.99192 | 4 | 1 | 1 |
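
As a sanity check on how the best trial follows from `metric="train-error"` and `mode="min"`, the sketch below redoes the selection by hand in plain Python. The `train-error` values are copied from the results table above, and the `train_error` dictionary is just an illustrative container, not part of the Tune API.

```python
# train-error per trial, copied from the results table
train_error = {
    "7bb96_00000": 0.022847,
    "7bb96_00001": 0.010545,
    "7bb96_00002": 0.073814,
    "7bb96_00003": 0.01406,
}

# mode="min": the best trial is the one with the smallest metric value
best_trial = min(train_error, key=train_error.get)
print(best_trial, train_error[best_trial])  # 7bb96_00001 0.010545
```

The selected trial, `7bb96_00001`, is the one whose `eta`, `subsample`, and `max_depth` values appear in the `analysis.best_config` output.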


---

In [21]:
ray.shutdown()

## References

 * [Ray Tune: Scalable Hyperparameter Tuning](https://docs.ray.io/en/master/tune/index.html)
 * [Introducing Distributed XGBoost Training with Ray](https://www.anyscale.com/blog/distributed-xgboost-training-with-ray)
 * [How to Speed Up XGBoost Model Training](https://www.anyscale.com/blog/how-to-speed-up-xgboost-model-training)
 * [XGBoost-Ray Project](https://github.com/ray-project/xgboost_ray)
 * [Distributed XGBoost on Ray](https://docs.ray.io/en/latest/xgboost-ray.html)