# Ray crash course - Distributed HPO with Ray Tune and XGBoost-Ray

© 2019-2022, Anyscale. All Rights Reserved

This demo introduces **Ray Tune's** key concepts using a classification example, derived from [Hyperparameter Tuning with Ray Tune and XGBoost-Ray](https://github.com/ray-project/xgboost_ray#hyperparameter-tuning). As a newcomer, you can get started with Ray Tune by following a three-step pattern.

Three simple steps:

 1. Set up your config space and define your trainable and objective function
 2. Use Tune to execute your hyperparameter sweep, supplying the appropriate arguments, including the search space and, optionally, [search algorithms](https://docs.ray.io/en/latest/tune/api_docs/suggestion.html#summary) or [trial schedulers](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-schedulers)
 3. Examine or analyze the returned results
 
 <img src="https://docs.ray.io/en/latest/_images/tune-workflow.png" height="50%" width="60%">
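
The three steps above can be sketched without any distributed machinery. The following is a minimal, standard-library-only illustration of the pattern that Tune automates and parallelizes; it does not use the Ray Tune API. The `objective` function is a hypothetical toy error surface standing in for model training, and `search_space`, `run_sweep`, and the parameter ranges are assumptions made for illustration.

```python
import random

# Step 1: a search space and an objective.
# Each entry maps a hyperparameter name to a sampling function (hypothetical ranges).
search_space = {
    "eta": lambda rng: rng.uniform(0.0, 1.0),
    "max_depth": lambda rng: rng.randrange(1, 9),  # integers 1..8
}


def objective(config):
    # Toy error surface (not a real model); its minimum is at eta=0.3, max_depth=4.
    return (config["eta"] - 0.3) ** 2 + 0.01 * (config["max_depth"] - 4) ** 2


# Step 2: run the sweep, drawing one configuration per trial.
def run_sweep(num_samples, seed=0):
    rng = random.Random(seed)
    trials = []
    for trial_id in range(num_samples):
        config = {name: sample(rng) for name, sample in search_space.items()}
        trials.append(
            {"trial_id": trial_id, "config": config, "error": objective(config)}
        )
    return trials


# Step 3: examine the results and keep the trial with the lowest error.
trials = run_sweep(num_samples=4)
best = min(trials, key=lambda t: t["error"])
print("Best config:", best["config"], "error:", round(best["error"], 4))
```

In the rest of this notebook, Tune plays the role of `run_sweep`: it samples configurations, runs the trials in parallel on the cluster, and collects the reported metrics.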


See also the [Understanding Hyperparameter Tuning](https://github.com/anyscale/academy/blob/main/ray-tune/02-Understanding-Hyperparameter-Tuning.ipynb) notebook and the [Tune documentation](http://tune.io), in particular, the [API reference](https://docs.ray.io/en/latest/tune/api_docs/overview.html). 


In [1]:
from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

import ray
from ray import tune

CONNECT_TO_ANYSCALE = True  # set to False to start a local Ray instance instead



In [2]:
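# ray.is_initialized is a function and must be called; shut down any existing session first
if ray.is_initialized():
    ray.shutdown()

# Connect to the Anyscale cluster, or start Ray locally
if CONNECT_TO_ANYSCALE:
    ray.init("anyscale://jsd-ray-core-tutorial")
else:
    ray.init()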

[1m[36mAuthenticating[0m
Loaded Anyscale authentication token from ANYSCALE_CLI_TOKEN.

[1m[36mOutput[0m
[1m[36m(anyscale +0.4s)[0m Using default project, id: prj_DKZuDR2pUwMzpVZD6PybXaUK.
[1m[36m(anyscale +0.6s)[0m cluster jsd-ray-core-tutorial is currently running, the cluster will not be restarted.
[1m[36m(anyscale +10.4s)[0m Connected to jsd-ray-core-tutorial, see: https://console.anyscale.com/projects/prj_DKZuDR2pUwMzpVZD6PybXaUK/clusters/ses_DabrzRMt26MfefUwmgcZSxcu
[1m[36m(anyscale +10.4s)[0m URL for head node of cluster: https://session-dabrzrmt26mfefuwmgczsxcu.i.anyscaleuserdata.com


## Step 1: Define a 'Trainable' training function to use with Ray Tune `tune.run(...)`

In [3]:
NUM_OF_ACTORS = 4           # number of XGBoost-Ray training actors per trial; each trial shards its training data across them
NUM_OF_CPUS_PER_ACTOR = 1   # number of CPUs reserved for each actor

ray_params = RayParams(num_actors=NUM_OF_ACTORS, cpus_per_actor=NUM_OF_CPUS_PER_ACTOR)

In [4]:
def train_func_model(config: dict, checkpoint_dir=None):
    # Load the dataset
    train_X, train_y = load_breast_cancer(return_X_y=True)
    # Convert to the RayDMatrix data structure
    train_set = RayDMatrix(train_X, train_y)

    # Empty dictionary that XGBoost fills with the evaluation results
    evals_result = {}

    # Train the model with the XGBoost-Ray train function
    bst = train(
        params=config,                       # hyperparameters sampled for this trial
        dtrain=train_set,                    # our RayDMatrix data structure
        evals_result=evals_result,           # placeholder for results
        evals=[(train_set, "train")],        # evaluate on the training set under the name "train"
        verbose_eval=False,
        ray_params=ray_params)               # distributed training configuration for XGBoost-Ray

    # Save the trained booster
    bst.save_model("model.xgb")
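
Because the trainable evaluates on a dataset named `"train"`, the metrics that show up later are called `train-logloss` and `train-error`. As a rough guide to reading them, here is a small standard-library sketch of the two quantities as XGBoost documents them for binary classification: `error` is the fraction of misclassified examples when predictions above 0.5 count as positive, and `logloss` is the mean negative log-likelihood. The helper names and the toy predictions are hypothetical; this is not XGBoost's implementation.

```python
import math


def binary_error(labels, probs, threshold=0.5):
    # Fraction of examples whose thresholded prediction disagrees with the label.
    wrong = sum((p > threshold) != (y == 1) for y, p in zip(labels, probs))
    return wrong / len(labels)


def log_loss(labels, probs):
    # Mean negative log-likelihood of the true labels.
    total = 0.0
    for y, p in zip(labels, probs):
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(labels)


labels = [1, 1, 0, 0]
# Predictions that are all on the correct side of 0.5, but only barely.
timid = [0.51, 0.52, 0.49, 0.48]

print("error:  ", binary_error(labels, timid))
print("logloss:", round(log_loss(labels, timid), 4))
print("ln(2):  ", round(math.log(2), 4))
```

The toy predictions are all correct yet have a log loss close to ln(2), the value for a constant 0.5 prediction. A model trained for few rounds with a small learning rate can look similar: low error, but a log loss that has barely moved.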

## Step 2: Define a hyperparameter search space

In [5]:
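# Specify the hyperparameter search space
config = {
    # Fixed settings, identical for every trial
    "tree_method": "approx",
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
    # Sampled settings, drawn independently for each trial
    "eta": tune.loguniform(1e-4, 1e-1),   # learning rate, sampled uniformly in log space
    "subsample": tune.uniform(0.5, 1.0),  # row subsampling ratio
    "max_depth": tune.randint(1, 9)       # integer tree depth; the upper bound is exclusive
}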
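
To make the three sampling primitives in this search space concrete, here is a standard-library sketch that mimics their semantics. The `sample_*` helpers are hypothetical stand-ins written for illustration, not Tune's implementation: `tune.loguniform` draws uniformly in log space, `tune.uniform` draws uniformly on the interval, and `tune.randint` draws an integer with an exclusive upper bound.

```python
import math
import random

rng = random.Random(42)


def sample_loguniform(low, high):
    # Uniform in log space: each decade between low and high is equally likely.
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_uniform(low, high):
    return rng.uniform(low, high)


def sample_randint(low, high):
    # Upper bound is exclusive, so (1, 9) yields integers 1 through 8.
    return rng.randrange(low, high)


samples = [
    {
        "eta": sample_loguniform(1e-4, 1e-1),
        "subsample": sample_uniform(0.5, 1.0),
        "max_depth": sample_randint(1, 9),
    }
    for _ in range(1000)
]

# The range 1e-4..1e-1 spans three decades, so roughly a third of the
# eta samples should fall in the lowest decade (below 1e-3).
share_small_eta = sum(s["eta"] < 1e-3 for s in samples) / len(samples)
print("share of eta < 1e-3:", share_small_eta)
print("max_depth values seen:", sorted({s["max_depth"] for s in samples}))
```

Sampling the learning rate in log space is why the trials in this notebook cover values as different as 0.00026 and 0.076 with only four samples.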

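## Step 3: Run the Ray Tune main trainer and examine the results

Ray Tune launches the distributed HPO run as four trials (`num_samples=4`), each with its own instance of the trainable function and its own configuration sampled from the search space. Within each trial, XGBoost-Ray trains with four remote actors.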

<img src="images/ray_tune_dist_hpo.png" height="60%" width="70%"> 
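
The status output of the run in this notebook reports 20 CPUs requested while four trials are running and 15 CPUs while three are running, that is, five CPUs per trial. The arithmetic below reproduces those numbers under one assumption that is not stated in the notebook: that each trial reserves one CPU for the process running the trainable itself, in addition to the CPUs of its training actors. `TRAINABLE_CPUS` and the helper functions are hypothetical names used for illustration.

```python
NUM_OF_ACTORS = 4
NUM_OF_CPUS_PER_ACTOR = 1
TRAINABLE_CPUS = 1  # assumption: one CPU for the process that runs the trainable


def cpus_per_trial(num_actors, cpus_per_actor, trainable_cpus=TRAINABLE_CPUS):
    # One trial = the trainable process plus its training actors.
    return trainable_cpus + num_actors * cpus_per_actor


def cpus_requested(running_trials):
    return running_trials * cpus_per_trial(NUM_OF_ACTORS, NUM_OF_CPUS_PER_ACTOR)


print(cpus_requested(4))  # 20, matching the log while 4 trials are RUNNING
print(cpus_requested(3))  # 15, matching the log while 3 trials are RUNNING
```

The same arithmetic gives a quick check of how many trials can run concurrently on a given cluster before raising `num_samples` or the number of actors.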

In [6]:
# Run tune
analysis = tune.run(
    train_func_model,
    config=config,
    metric="train-error",
    mode="min",
    num_samples=4,
    verbose=1,
    resources_per_trial=ray_params.get_tune_resources()
)

[2m[36m(run pid=769)[0m == Status ==
[2m[36m(run pid=769)[0m Current time: 2022-03-28 14:49:08 (running for 00:00:00.23)
[2m[36m(run pid=769)[0m Memory usage on this node: 2.3/61.4 GiB
[2m[36m(run pid=769)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=769)[0m Resources requested: 0/80 CPUs, 0/0 GPUs, 0.0/215.68 GiB heap, 0.0/91.5 GiB objects
[2m[36m(run pid=769)[0m Result logdir: /home/ray/ray_results/train_func_model_2022-03-28_14-49-07
[2m[36m(run pid=769)[0m Number of trials: 4/4 (4 PENDING)
[2m[36m(run pid=769)[0m 
[2m[36m(run pid=769)[0m 


[2m[36m(train_func_model pid=950)[0m 2022-03-28 14:49:10,503	INFO main.py:985 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
[2m[36m(train_func_model pid=950)[0m 2022-03-28 14:49:12,522	INFO main.py:1030 -- [RayXGBoost] Starting XGBoost training.
[2m[36m(_RemoteRayXGBoostActor pid=1029)[0m [14:49:12] task [xgboost.ray]:140334922639488 got new rank 3
[2m[36m(_RemoteRayXGBoostActor pid=1026)[0m [14:49:12] task [xgboost.ray]:140259044624512 got new rank 0
[2m[36m(_RemoteRayXGBoostActor pid=1027)[0m [14:49:12] task [xgboost.ray]:140157224888448 got new rank 1
[2m[36m(_RemoteRayXGBoostActor pid=1028)[0m [14:49:12] task [xgboost.ray]:139946679531648 got new rank 2


[2m[36m(run pid=769)[0m == Status ==
[2m[36m(run pid=769)[0m Current time: 2022-03-28 14:49:18 (running for 00:00:10.30)
[2m[36m(run pid=769)[0m Memory usage on this node: 2.9/61.4 GiB
[2m[36m(run pid=769)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=769)[0m Resources requested: 20.0/80 CPUs, 0/0 GPUs, 0.0/215.68 GiB heap, 0.0/91.5 GiB objects
[2m[36m(run pid=769)[0m Result logdir: /home/ray/ray_results/train_func_model_2022-03-28_14-49-07
[2m[36m(run pid=769)[0m Number of trials: 4/4 (4 RUNNING)
[2m[36m(run pid=769)[0m 
[2m[36m(run pid=769)[0m 


[2m[36m(train_func_model pid=185, ip=172.31.90.159)[0m 2022-03-28 14:49:18,374	INFO main.py:985 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
[2m[36m(train_func_model pid=184, ip=172.31.89.181)[0m 2022-03-28 14:49:18,319	INFO main.py:985 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
[2m[36m(train_func_model pid=184, ip=172.31.69.42)[0m 2022-03-28 14:49:18,391	INFO main.py:985 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
[2m[36m(train_func_model pid=950)[0m 2022-03-28 14:49:18,471	INFO main.py:1509 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 8.15 seconds (5.94 pure XGBoost training time).


[2m[36m(run pid=769)[0m == Status ==
[2m[36m(run pid=769)[0m Current time: 2022-03-28 14:49:19 (running for 00:00:11.51)
[2m[36m(run pid=769)[0m Memory usage on this node: 2.9/61.4 GiB
[2m[36m(run pid=769)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=769)[0m Resources requested: 15.0/80 CPUs, 0/0 GPUs, 0.0/215.68 GiB heap, 0.0/91.5 GiB objects
[2m[36m(run pid=769)[0m Current best trial: e4ca2_00003 with train-error=0.031634 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.00026114429490650996, 'subsample': 0.5976619476448475, 'max_depth': 3, 'nthread': 1, 'n_jobs': 1}
[2m[36m(run pid=769)[0m Result logdir: /home/ray/ray_results/train_func_model_2022-03-28_14-49-07
[2m[36m(run pid=769)[0m Number of trials: 4/4 (3 RUNNING, 1 TERMINATED)
[2m[36m(run pid=769)[0m 
[2m[36m(run pid=769)[0m 


[2m[36m(train_func_model pid=184, ip=172.31.89.181)[0m 2022-03-28 14:49:20,238	INFO main.py:1030 -- [RayXGBoost] Starting XGBoost training.
[2m[36m(_RemoteRayXGBoostActor pid=265, ip=172.31.89.181)[0m [14:49:20] task [xgboost.ray]:139626710455104 got new rank 0
[2m[36m(_RemoteRayXGBoostActor pid=268, ip=172.31.89.181)[0m [14:49:20] task [xgboost.ray]:140345526932288 got new rank 3
[2m[36m(_RemoteRayXGBoostActor pid=266, ip=172.31.89.181)[0m [14:49:20] task [xgboost.ray]:139622177989440 got new rank 1
[2m[36m(_RemoteRayXGBoostActor pid=267, ip=172.31.89.181)[0m [14:49:20] task [xgboost.ray]:140242534789904 got new rank 2
[2m[36m(train_func_model pid=184, ip=172.31.69.42)[0m 2022-03-28 14:49:20,409	INFO main.py:1030 -- [RayXGBoost] Starting XGBoost training.
[2m[36m(_RemoteRayXGBoostActor pid=232, ip=172.31.69.42)[0m [14:49:20] task [xgboost.ray]:140236976400512 got new rank 0
[2m[36m(_RemoteRayXGBoostActor pid=237, ip=172.31.69.42)[0m [14:49:20] task [xgboost.ray

[2m[36m(run pid=769)[0m 2022-03-28 14:49:21,346	INFO commands.py:292 -- Checking External environment settings
[2m[36m(run pid=769)[0m 2022-03-28 14:49:23,796	WARN util.py:132 -- The `head_node` field is deprecated and will be ignored. Use `head_node_type` and `available_node_types` instead.
[2m[36m(run pid=769)[0m 2022-03-28 14:49:23,796	WARN util.py:137 -- The `worker_nodes` field is deprecated and will be ignored. Use `available_node_types` instead.
[2m[36m(run pid=769)[0m [1m[36mAuthenticating[0m
[2m[36m(run pid=769)[0m 


[2m[36m(run pid=769)[0m Loaded Anyscale authentication token from variable.


[2m[36m(run pid=769)[0m 2022-03-28 14:49:25,719	INFO command_runner.py:357 -- Fetched IP: 172.31.90.159
[2m[36m(run pid=769)[0m 2022-03-28 14:49:25,719	INFO log_timer.py:25 -- NodeUpdater: ins_xq9kwXSZLKUYZFz6dfDxT3jD: Got IP  [LogTimer=29ms]
[2m[36m(run pid=769)[0m == Status ==
[2m[36m(run pid=769)[0m Current time: 2022-03-28 14:49:26 (running for 00:00:18.90)
[2m[36m(run pid=769)[0m Memory usage on this node: 2.5/61.4 GiB
[2m[36m(run pid=769)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=769)[0m Resources requested: 15.0/80 CPUs, 0/0 GPUs, 0.0/215.68 GiB heap, 0.0/91.5 GiB objects
[2m[36m(run pid=769)[0m Current best trial: e4ca2_00003 with train-error=0.031634 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.00026114429490650996, 'subsample': 0.5976619476448475, 'max_depth': 3, 'nthread': 1, 'n_jobs': 1}
[2m[36m(run pid=769)[0m Result logdir: /home/ray/ray_results/train_func_model_

[2m[36m(train_func_model pid=184, ip=172.31.69.42)[0m 2022-03-28 14:49:30,770	INFO main.py:1509 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 12.57 seconds (10.36 pure XGBoost training time).
[2m[36m(train_func_model pid=185, ip=172.31.90.159)[0m 2022-03-28 14:49:30,779	INFO main.py:1509 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 12.60 seconds (10.38 pure XGBoost training time).
[2m[36m(train_func_model pid=184, ip=172.31.89.181)[0m 2022-03-28 14:49:30,782	INFO main.py:1509 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 12.67 seconds (10.54 pure XGBoost training time).


[2m[36m(run pid=769)[0m == Status ==
[2m[36m(run pid=769)[0m Current time: 2022-03-28 14:49:31 (running for 00:00:23.20)
[2m[36m(run pid=769)[0m Memory usage on this node: 2.3/61.4 GiB
[2m[36m(run pid=769)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=769)[0m Resources requested: 0/80 CPUs, 0/0 GPUs, 0.0/215.68 GiB heap, 0.0/91.5 GiB objects
[2m[36m(run pid=769)[0m Current best trial: e4ca2_00002 with train-error=0.010545 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0005792447612165456, 'subsample': 0.9104148298038599, 'max_depth': 8, 'nthread': 1, 'n_jobs': 1}
[2m[36m(run pid=769)[0m Result logdir: /home/ray/ray_results/train_func_model_2022-03-28_14-49-07
[2m[36m(run pid=769)[0m Number of trials: 4/4 (4 TERMINATED)
[2m[36m(run pid=769)[0m 
[2m[36m(run pid=769)[0m 


[2m[36m(run pid=769)[0m 2022-03-28 14:49:31,195	INFO tune.py:639 -- Total run time: 24.22 seconds (23.19 seconds for the tuning loop).


In [7]:
print("Best hyperparameters", analysis.best_config)

Best hyperparameters {'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0005792447612165456, 'subsample': 0.9104148298038599, 'max_depth': 8}


In [8]:
analysis.results_df.head(5)



| trial_id | train-logloss | train-error | time_this_iter_s | done | timesteps_total | episodes_total | training_iteration | experiment_id | date | timestamp | ... | iterations_since_restore | experiment_tag | config.tree_method | config.objective | config.eval_metric | config.eta | config.subsample | config.max_depth | config.nthread | config.n_jobs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| e4ca2_00000 | 0.370544 | 0.056239 | 0.004565 | True | | | 10 | 0bf6b4af62bb41eea562179a49dabd1c | 2022-03-28_14-49-30 | 1648504170 | ... | 10 | 0_eta=0.07609,max_depth=1,subsample=0.70435 | approx | binary:logistic | [logloss, error] | 0.07609 | 0.704347 | 1 | 1 | 1 |
| e4ca2_00001 | 0.686168 | 0.035149 | 0.00404 | True | | | 10 | 1298d113d7f74c5aa535118099aac140 | 2022-03-28_14-49-30 | 1648504170 | ... | 10 | 1_eta=0.00085375,max_depth=2,subsample=0.99038 | approx | binary:logistic | [logloss, error] | 0.000854 | 0.990381 | 2 | 1 | 1 |
| e4ca2_00002 | 0.688075 | 0.010545 | 0.005621 | True | | | 10 | ef3e8d20fb7f4365bbd6fa24932ddbab | 2022-03-28_14-49-30 | 1648504170 | ... | 10 | 2_eta=0.00057924,max_depth=8,subsample=0.91041 | approx | binary:logistic | [logloss, error] | 0.000579 | 0.910415 | 8 | 1 | 1 |
| e4ca2_00003 | 0.690972 | 0.031634 | 0.004894 | True | | | 10 | 962043591a5047cc85260daedcd74af3 | 2022-03-28_14-49-18 | 1648504158 | ... | 10 | 3_eta=0.00026114,max_depth=3,subsample=0.59766 | approx | binary:logistic | [logloss, error] | 0.000261 | 0.597662 | 3 | 1 | 1 |
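
The `metric="train-error"` and `mode="min"` arguments passed to `tune.run` decide which trial counts as best. The standard-library sketch below applies the same selection rule to the final metrics copied from the results table above; `best_trial` is a hypothetical helper written for illustration, not part of the Tune API.

```python
# Final metrics per trial, copied from the results table in this notebook.
results = {
    "e4ca2_00000": {"train-logloss": 0.370544, "train-error": 0.056239},
    "e4ca2_00001": {"train-logloss": 0.686168, "train-error": 0.035149},
    "e4ca2_00002": {"train-logloss": 0.688075, "train-error": 0.010545},
    "e4ca2_00003": {"train-logloss": 0.690972, "train-error": 0.031634},
}


def best_trial(results, metric, mode):
    # mode="min" keeps the trial with the lowest metric, mode="max" the highest.
    pick = min if mode == "min" else max
    return pick(results, key=lambda trial_id: results[trial_id][metric])


print(best_trial(results, "train-error", "min"))    # e4ca2_00002
print(best_trial(results, "train-logloss", "min"))  # e4ca2_00000
```

The two metrics disagree here: trial `e4ca2_00002` has the lowest training error, while trial `e4ca2_00000` has by far the lowest log loss. The choice of `metric` is therefore part of the experiment design, and both metrics are measured on the training set only, so neither says how well a configuration generalizes.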


---

In [9]:
ray.shutdown()

## References

 * [Ray Tune: Scalable Hyperparameter Tuning](https://docs.ray.io/en/master/tune/index.html)
 * [Introducing Distributed XGBoost Training with Ray](https://www.anyscale.com/blog/distributed-xgboost-training-with-ray)
 * [How to Speed Up XGBoost Model Training](https://www.anyscale.com/blog/how-to-speed-up-xgboost-model-training)
 * [XGBoost-Ray Project](https://github.com/ray-project/xgboost_ray)
 * [Distributed XGBoost on Ray](https://docs.ray.io/en/latest/xgboost-ray.html)