# Model training

For this demo we'll use the freely available Statlog (German Credit Data) Data Set, which can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)). This dataset classifies customers into two credit risk groups, good or bad, based on a set of attributes. Most of the attributes in this dataset are categorical and symbolically encoded. For example, attribute 1 represents the status of an existing checking account and can take one of the following values:

A11 : ... < 0 DM

A12 : 0 <= ... < 200 DM

A13 : ... >= 200 DM / salary assignments for at least 1 year

A14 : no checking account

A comprehensive list of all attributes and symbol codes is given in the [document](https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc) that accompanies the original dataset. 
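To make the symbolic encoding concrete, the following sketch decodes the four checking-account codes listed above with a plain dictionary. The names `CHECKING_STATUS` and `decode_checking_status` are illustrative and not part of any dataset tooling.

```python
# Mapping of the attribute 1 codes listed above to their meaning.
CHECKING_STATUS = {
    "A11": "... < 0 DM",
    "A12": "0 <= ... < 200 DM",
    "A13": "... >= 200 DM / salary assignments for at least 1 year",
    "A14": "no checking account",
}


def decode_checking_status(code: str) -> str:
    """Return the description for a checking-account code (hypothetical helper)."""
    try:
        return CHECKING_STATUS[code]
    except KeyError:
        raise ValueError(f"Unknown checking-account code: {code!r}") from None


for code in ("A11", "A14"):
    print(code, "->", decode_checking_status(code))
```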

The data we use in this demo has also been balanced and upsampled (see the [Data Generation](./data_generation.ipynb) notebook for reference).

## Setting up and connecting to Ray


Let's start by loading all the libraries needed for the notebook and by setting up default data paths.


In [1]:
import os
import ray
import glob
import eli5

import xgboost_ray as xgbr
import xgboost as xgb
import pandas as pd

from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from ray import tune

DATA_ROOT = os.path.join("/mnt/data", os.environ["DOMINO_PROJECT_NAME"], "data") 
MODEL_ROOT = "/mnt/artifacts"
TUNE_ROOT = os.path.join("/mnt/data", os.environ["DOMINO_PROJECT_NAME"], "ray_results")

In this demo we'll use a dataset of modest size (approx. 700 MB). Unfortunately, the standard Python libraries for data processing and machine learning, Pandas and NumPy, were not designed with large datasets in mind. They rely on being able to fit the entire dataset in memory, and the in-memory representation of a Pandas data frame is often several times larger than the CSV files it was read from. In practice, the amount of data these libraries can handle is therefore limited by the physical memory available to the container that runs them, so they may struggle even with the 700 MB needed for our demo dataset. Trying to load our training data into a single Pandas data frame using the code below will likely crash the Jupyter kernel.

``` 
# Do not run this code - it will likely crash the Jupyter kernel 
# (depending on the HW tier running the kernel)

import pandas as pd
import glob
import os

all_files = glob.glob(DATA_ROOT + "/train_data_*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

training_df = pd.concat(li, axis=0, ignore_index=True)
training_df.head()
```
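If all you need is a quick look at the shards, one memory-friendly alternative is to stream them row by row rather than concatenating them. The sketch below counts label values with the standard `csv` module. It writes two tiny stand-in shards to a temporary directory so that it runs anywhere; the helper name `count_labels` and the `duration` column are hypothetical, while `credit` is the target column used later in this notebook.

```python
import csv
import glob
import os
import tempfile
from collections import Counter


def count_labels(pattern: str, label_col: str) -> Counter:
    """Stream every CSV matching `pattern` and count the values of `label_col`.

    Only one row is held in memory at a time.
    """
    counts = Counter()
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                counts[row[label_col]] += 1
    return counts


# Stand-in shards (assumption: the real shards are CSV files with a header row).
with tempfile.TemporaryDirectory() as tmp_dir:
    for i, labels in enumerate((["0", "1", "1"], ["0", "0"])):
        shard_path = os.path.join(tmp_dir, f"train_data_{i}.csv")
        with open(shard_path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["duration", "credit"])
            for label in labels:
                writer.writerow(["12", label])

    label_counts = count_labels(os.path.join(tmp_dir, "train_data_*.csv"), "credit")

print(label_counts)
```

Streaming like this only suits simple aggregates; model training still needs a framework that can distribute the data.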

To circumvent these restrictions Domino provides support for a number of industry-standard distributed computing frameworks like Ray, Dask, and Spark. In this demo we will use [On-Demand Ray](https://docs.dominodatalab.com/en/latest/user_guide/d13903/on-demand-ray-overview/). 

Ray is a general framework that enables you to quickly parallelize existing Python code, but it is also described as a "framework for building frameworks". Indeed, there is a growing number of domain-specific libraries that work on top of Ray.

![Ray](./images/ray.png)

For example:

* RaySGD - a library for distributed deep learning, which provides wrappers around PyTorch and TensorFlow
* RLlib - a library for reinforcement learning, which also natively supports TensorFlow, TensorFlow Eager, and PyTorch
* Ray Serve - a scalable, model-serving library
* Ray Tune - a hyperparameter optimization framework, most commonly used for deep and reinforcement learning

In this demo we'll use [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) for hyperparameter optimisation and [XGBoost on Ray](https://github.com/ray-project/xgboost_ray) for model training.

In [2]:
# We'll run training with 3 Ray actors, 4 CPUs each (12 CPUs in total).

RAY_ACTORS = 3
RAY_CPUS_PER_ACTOR = 4

Let's connect to Ray.

In [3]:
if not ray.is_initialized():
    service_host = os.environ["RAY_HEAD_SERVICE_HOST"]
    service_port = os.environ["RAY_HEAD_SERVICE_PORT"]
    ray.init(f"ray://{service_host}:{service_port}")

Let's confirm we have the expected cluster configuration.

In [4]:
ray.nodes()

[{'NodeID': '5ac62f74ae8cb9722a46aa5f8763fc6c2ce81350cda268e0d302fe8d',
  'Alive': True,
  'NodeManagerAddress': '100.64.71.114',
  'NodeManagerHostname': 'ray-6793bb629cb041514b9307ed-ray-head-0',
  'NodeManagerPort': 2385,
  'ObjectManagerPort': 2384,
  'ObjectStoreSocketName': '/tmp/ray/session_2025-01-24_08-10-18_441645_1/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2025-01-24_08-10-18_441645_1/sockets/raylet',
  'MetricsExportPort': 64793,
  'NodeName': '100.64.71.114',
  'alive': True,
  'Resources': {'object_store_memory': 4753630003.0,
   'memory': 9507260007.0,
   'node:100.64.71.114': 1.0}}]
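The raw `ray.nodes()` output is verbose. A small helper can condense it. The sketch below works on a plain list of dictionaries shaped like the output above, so it runs without a cluster; `summarize_nodes` is a hypothetical helper, and the second entry in the sample input is a made-up worker added for illustration.

```python
from typing import Any, Dict, List


def summarize_nodes(nodes: List[Dict[str, Any]]) -> Dict[str, float]:
    """Count alive nodes and sum their CPU and memory resources."""
    alive = [node for node in nodes if node.get("Alive")]
    return {
        "alive_nodes": len(alive),
        "cpus": sum(node["Resources"].get("CPU", 0.0) for node in alive),
        "memory_gib": sum(node["Resources"].get("memory", 0.0) for node in alive) / 2**30,
    }


# Illustrative input: the head node from the output above (trimmed) plus a made-up worker.
sample_nodes = [
    {
        "NodeName": "100.64.71.114",
        "Alive": True,
        "Resources": {"object_store_memory": 4753630003.0, "memory": 9507260007.0},
    },
    {
        "NodeName": "worker-example",
        "Alive": True,
        "Resources": {"CPU": 4.0, "memory": 8 * 2**30},
    },
]

summary = summarize_nodes(sample_nodes)
print(summary)
```

In a live session the same helper could be called as `summarize_nodes(ray.nodes())`.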

Now let's create a list of all the shards for our training, validation, and test sets.

In [5]:
train_files = glob.glob(os.path.join(DATA_ROOT, "train_data*"))
val_files = glob.glob(os.path.join(DATA_ROOT, "validation_data*"))

test_file = os.path.join(DATA_ROOT, "test_data.csv")

target_col = "credit"

XGBoost-Ray provides a drop-in replacement for XGBoost's train function. To pass data, instead of using xgb.DMatrix we have to use xgboost_ray.RayDMatrix. The RayDMatrix loads data lazily and stores it sharded in the Ray object store. The Ray XGBoost actors then access these shards to run their training. Let's wrap our training, validation, and test sets into RayDMatrix objects.

In [6]:
# Although it is possible to specify the number of Actors when initializing the RayDMatrix, it is not necessary,
#  and can cause a conflict if different from the number of Actors chosen for training.

rdm_train = xgbr.RayDMatrix(train_files, label=target_col)
rdm_val = xgbr.RayDMatrix(val_files, label=target_col)

df_test = pd.read_csv(test_file)
rdm_test = xgbr.RayDMatrix(df_test, label=target_col)

In [7]:
# This function verifies whether the data will support splitting into a given number of shards.
# We use this to validate that our splits are compatible with the selected Ray cluster configuration (i.e. number of Ray nodes)

rdm_train.assert_enough_shards_for_actors(len(train_files))
rdm_val.assert_enough_shards_for_actors(len(val_files))
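The number of shards matters because each training actor needs at least one file to read. The following sketch shows a simple round-robin assignment of files to actors to illustrate the constraint. It is not XGBoost-Ray's actual implementation, and `assign_shards` and the file names are hypothetical.

```python
from typing import Dict, List


def assign_shards(files: List[str], num_actors: int) -> Dict[int, List[str]]:
    """Distribute files over actors round-robin; fail if an actor would get none."""
    if num_actors < 1:
        raise ValueError("num_actors must be at least 1")
    if len(files) < num_actors:
        raise ValueError(
            f"{len(files)} shard(s) cannot feed {num_actors} actors; "
            "each actor needs at least one shard"
        )
    assignment: Dict[int, List[str]] = {rank: [] for rank in range(num_actors)}
    for index, path in enumerate(sorted(files)):
        assignment[index % num_actors].append(path)
    return assignment


# Hypothetical file names, for illustration only.
example_files = [f"train_data_{i}.csv" for i in range(7)]
example_assignment = assign_shards(example_files, num_actors=3)
for rank, shard_files in example_assignment.items():
    print(rank, shard_files)
```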

In [8]:
print("Will the read be distributed?", rdm_train.distributed)
print("Has any data been read yet?", rdm_train.loaded) # Remember, lazy loading

Will the read be distributed? True
Has any data been read yet? False


## Model training

Let's first try to train a single model in order to validate our setup. Feel free to switch to the Ray Web UI tab and observe the distribution of workload among the individual Ray nodes.

A few things to note:

* We are using “binary:logistic” – logistic regression for binary classification (*credit* is in {0,1}), which outputs probability
* We are calculating both logloss and error as evaluation metrics. They don't impact the model fitting
* We are passing the cluster topology via the xgb_ray_params object so that the workload can be correctly distributed
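For intuition about the two evaluation metrics, the sketch below computes them by hand from predicted probabilities. It follows the definitions in the XGBoost documentation (negative log-likelihood for `logloss`, and the share of wrong predictions at a 0.5 threshold for `error`); the function names and sample values are illustrative.

```python
import math
from typing import Sequence


def logloss(labels: Sequence[int], probs: Sequence[float], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # Clip to avoid log(0).
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def error_rate(labels: Sequence[int], probs: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of examples whose thresholded prediction differs from the label."""
    wrong = sum(int(p > threshold) != y for y, p in zip(labels, probs))
    return wrong / len(labels)


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
print(f"logloss={logloss(y_true, y_prob):.4f}  error={error_rate(y_true, y_prob):.2f}")
```

A model that always predicts 0.5 has a logloss of ln 2 (about 0.693), which is why the first boosting rounds in the training logs start close to that value.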


In [9]:
# Set a few hyperparameters to specific values
param = {
    "seed": 1234,
    "max_depth": 3,
    "eta": 0.1,
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"]
}

xgb_ray_params = xgbr.RayParams(
    num_actors=RAY_ACTORS,
    cpus_per_actor=RAY_CPUS_PER_ACTOR
)

# Train the model
evals_result = {}

bst = xgbr.train(
    param,
    rdm_train,
    num_boost_round=50,
    verbose_eval=True,
    evals_result=evals_result,
    evals=[(rdm_train, "train"), (rdm_val, "val")],
    ray_params=xgb_ray_params
)

print("Final training error: {:.4f}".format(evals_result["train"]["error"][-1]))
print("Final validation error: {:.4f}".format(evals_result["val"]["error"][-1]))

Use get_node_id() instead
  current_node_id = ray.get_runtime_context().node_id.hex()


(autoscaler +4s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +4s) Error: No available node types can fulfill resource request {'CPU': 4.0}. Add suitable node types to this cluster to resolve this issue.


(_wrapped pid=454) 2025-01-24 08:12:35,485	INFO main.py:1047 -- [RayXGBoost] Created 3 new actors (3 total actors). Waiting until actors are ready for training.
(_wrapped pid=454) 2025-01-24 08:12:49,201	INFO main.py:1092 -- [RayXGBoost] Starting XGBoost training.
(_RemoteRayXGBoostActor pid=171, ip=100.64.32.41) [08:12:49] task [xgboost.ray]:131061809247136 got new rank 0
(_RemoteRayXGBoostActor pid=171, ip=100.64.59.160) [08:12:49] task [xgboost.ray]:136176507720224 got new rank 1
(_RemoteRayXGBoostActor pid=211, ip=100.64.77.77) [08:12:49] task [xgboost.ray]:130504528101488 got new rank 2


(_wrapped pid=454) [0]	train-logloss:0.65890	train-error:0.22747	val-logloss:0.65631	val-error:0.20883
(_wrapped pid=454) [1]	train-logloss:0.63109	train-error:0.22331	val-logloss:0.62514	val-error:0.20021
(_wrapped pid=454) [2]	train-logloss:0.60870	train-error:0.22340	val-logloss:0.60081	val-error:0.20959
(_wrapped pid=454) [3]	train-logloss:0.58690	train-error:0.18629	val-logloss:0.57680	val-error:0.15792
(_wrapped pid=454) [4]	train-logloss:0.56898	train-error:0.19034	val-logloss:0.55650	val-error:0.16499
(_wrapped pid=454) [5]	train-logloss:0.55210	train-error:0.20139	val-logloss:0.53913	val-error:0.17972
(_wrapped pid=454) [6]	train-logloss:0.53646	train-error:0.18746	val-logloss:0.52258	val-error:0.16193
(_wrapped pid=454) [7]	train-logloss:0.52353	train-error:0.18899	val-logloss:0.50922	val-error:0.16659
(_wrapped pid=454) [8]	train-logloss:0.51097	train-error:0.

(_wrapped pid=454) 2025-01-24 08:13:19,846	INFO main.py:1175 -- Training in progress (31 seconds since last restart).


(_wrapped pid=454) [33]	train-logloss:0.34684	train-error:0.11448	val-logloss:0.33241	val-error:0.10723
(_wrapped pid=454) [34]	train-logloss:0.34377	train-error:0.11517	val-logloss:0.32920	val-error:0.10568
(_wrapped pid=454) [35]	train-logloss:0.33971	train-error:0.11360	val-logloss:0.32577	val-error:0.10412
(_wrapped pid=454) [36]	train-logloss:0.33605	train-error:0.10761	val-logloss:0.32224	val-error:0.10097
(_wrapped pid=454) [37]	train-logloss:0.33335	train-error:0.10382	val-logloss:0.31979	val-error:0.09627
(_wrapped pid=454) [38]	train-logloss:0.32945	train-error:0.10460	val-logloss:0.31619	val-error:0.09786
(_wrapped pid=454) [39]	train-logloss:0.32654	train-error:0.10505	val-logloss:0.31294	val-error:0.09787
(_wrapped pid=454) [40]	train-logloss:0.32362	train-error:0.10278	val-logloss:0.31013	val-error:0.09549
(_wrapped pid=454) [41]	train-logloss:0.32094	train

(_wrapped pid=454) 2025-01-24 08:13:32,659	INFO main.py:1587 -- [RayXGBoost] Finished XGBoost training on training data with total N=2,100,000 in 82.53 seconds (43.45 pure XGBoost training time).


Now that we've confirmed the pipeline works, we can move on to hyperparameter tuning to find a better model.
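Before moving on, note that `evals_result` is a nested dictionary keyed by evaluation set name and then by metric, with one value per boosting round, as the indexing in the print statements of the training cell shows. A small helper can locate the best round. The sketch uses a hand-written dictionary with made-up values rather than the real training output, and `best_round` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

EvalsResult = Dict[str, Dict[str, List[float]]]


def best_round(evals_result: EvalsResult, eval_set: str, metric: str) -> Tuple[int, float]:
    """Return (round index, value) of the lowest metric value for an evaluation set."""
    history = evals_result[eval_set][metric]
    index = min(range(len(history)), key=history.__getitem__)
    return index, history[index]


# Made-up history for illustration.
example_result: EvalsResult = {
    "train": {"error": [0.23, 0.19, 0.15, 0.12]},
    "val": {"error": [0.21, 0.16, 0.17, 0.13]},
}

round_index, value = best_round(example_result, "val", "error")
print(f"best validation error {value:.2f} at round {round_index}")
```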

## Hyperparameter tuning

Hyperparameter tuning requires training many copies of a model, each with a different set of hyperparameters, and seeing which one performs the best. Each time we train a model, that is one trial. To do this in our Ray cluster, we can specify what resources to use:

* Required CPU, Memory, and/or GPU per trial
* Where to store intermediate results

The `xgboost_ray` library includes a built-in method for generating a `PlacementGroupFactory` to pass to Ray Tune, based on the `RayParams` object used for XGBoost training. Resources can also be requested in a simpler dictionary format, e.g. `{"cpu": 2.0}`. As described in the [Tune docs](https://docs.ray.io/en/latest/tune/tutorials/tune-resources.html), by default Ray Tune will schedule N concurrent trials, using 1 CPU per trial, where N is the total number of CPUs available in the cluster.

In [10]:
# Get the placement group factory to pass to Ray Tune
# Compare the request with RAY_ACTORS * RAY_CPUS_PER_ACTOR: in the run shown below they are equal (12 CPUs).
xgb_tune_resources = xgb_ray_params.get_tune_resources()
print(f"We will pass a {type(xgb_tune_resources)} to Ray Tune.")
print(f"It will request {xgb_tune_resources.required_resources} per trial.")
print(f"The cluster has {ray.cluster_resources()['CPU']} CPU total.")

We will pass a <class 'ray.tune.execution.placement_groups.PlacementGroupFactory'> to Ray Tune.
It will request {'CPU': 12.0} per trial.
The cluster has 12.0 CPU total.
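Because each trial holds on to its resource request while it runs, the number of trials that can run side by side follows from simple division. The sketch below shows the arithmetic with the numbers printed above; `max_concurrent_trials` is a hypothetical helper, not a Ray Tune function, and it considers CPUs only.

```python
import math


def max_concurrent_trials(cluster_cpus: float, cpus_per_trial: float) -> int:
    """Upper bound on simultaneously running trials, considering CPUs only."""
    if cpus_per_trial <= 0:
        raise ValueError("cpus_per_trial must be positive")
    return math.floor(cluster_cpus / cpus_per_trial)


# Values from the output above: 12 CPUs requested per trial, 12 CPUs in the cluster.
print(max_concurrent_trials(cluster_cpus=12.0, cpus_per_trial=12.0))

# A smaller, hypothetical request of 4 CPUs per trial.
print(max_concurrent_trials(cluster_cpus=12.0, cpus_per_trial=4.0))
```

With the configuration used in this notebook only one trial fits at a time, so the trials run one after another.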


In [11]:
print("Saving intermediate tune results to", TUNE_ROOT)

Saving intermediate tune results to /mnt/data/Demo-Credit-Default-Model/ray_results


A very simple search strategy is *grid search*. It involves searching over a predefined grid of hyperparameter choices, and it's easy to imagine writing a simple for loop to implement it. However, for $n$ choices for each of $k$ hyperparameters, a full grid search requires $O(n^k)$ trials and quickly becomes prohibitively expensive.
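The growth in the number of trials is easy to verify with the standard library. In the sketch below the grid values are hypothetical and `grid_size` is an illustrative helper.

```python
from itertools import product

# Hypothetical grid: 3 candidate values for each of 2 hyperparameters.
grid = {
    "eta": [0.003, 0.03, 0.3],
    "max_depth": [2, 4, 6],
}

# A full grid search trains one model per combination of values.
trials = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(trials))


def grid_size(n_choices: int, n_hyperparameters: int) -> int:
    """Number of trials in a full grid with n choices for each of k hyperparameters."""
    return n_choices ** n_hyperparameters


for k in (2, 5, 10):
    print(k, grid_size(3, k))
```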

Ray Tune provides much more sophisticated options for optimization. Instead of pre-defining a fixed grid to search over, Ray Tune allows specifying a [search space](https://docs.ray.io/en/releases-1.11.0/tune/key-concepts.html#search-spaces) with distributions of parameters, which is the approach we take in this demo. The number of trials over the search space is specified at a later stage in the `tune.run()` function.

In [12]:
config = {
    "seed": 1234,
    "eta": tune.loguniform(3e-3, 3e-1),
    "max_depth": tune.randint(2, 6),
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"]
}
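To see what such a search space means in practice, the following sketch draws configurations with the standard library only. It assumes that a log-uniform distribution is uniform in the logarithm of the value and that the upper bound of the integer range is exclusive, which is how these samplers are described in the Ray Tune documentation. `sample_loguniform` and `sample_config` are hypothetical helpers and do not reflect how Ray Tune is implemented.

```python
import math
import random


def sample_loguniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng: random.Random) -> dict:
    """Draw one configuration from a space shaped like the `config` above."""
    return {
        "seed": 1234,
        "eta": sample_loguniform(rng, 3e-3, 3e-1),
        "max_depth": rng.randrange(2, 6),  # 2, 3, 4 or 5; the upper bound is exclusive.
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }


rng = random.Random(0)
samples = [sample_config(rng) for _ in range(10)]
for sample in samples:
    print(f"eta={sample['eta']:.4f}  max_depth={sample['max_depth']}")
```

Sampling on a log scale spreads the trials evenly across orders of magnitude, which suits a learning rate such as `eta`.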

For each trial, a config dictionary like the one we just defined, with the single value for each hyperparameter chosen for that trial, will be passed into a [trainable](https://docs.ray.io/en/releases-1.11.0/tune/key-concepts.html#search-algorithms) that we define and pass to Ray Tune. Below we have defined such a function to wrap training a single XGBoost model on Ray.

In [13]:
def my_trainer(config):
    evals_result = {}
    bst = xgbr.train(
        params=config,
        dtrain=rdm_train,
        num_boost_round=50,
        evals_result=evals_result,
        evals=[(rdm_train, "train"), (rdm_val, "val")],
        ray_params=xgb_ray_params
    )
    bst.save_model("model.xgb") # Saved in the trial's working directory under TUNE_ROOT

Finally, we can now run our trials. Here we bring together the previous few sections:

* The training function
* The search space defined in the config
* The resources per trial and results location

We control the number of trials over the search space via the `num_samples` argument (currently set to 10). We also rank the models based on the lowest validation set error.

In [14]:
analysis = tune.run(
    my_trainer,
    config=config,
    resources_per_trial=xgb_tune_resources,
    local_dir=TUNE_ROOT,
    metric="val-error",
    mode="min",
    num_samples=10,
    verbose=1,
    progress_reporter=tune.JupyterNotebookReporter(overwrite=True)
)

* Current time: 2025-01-24 08:21:32
* Running for: 00:07:56.39
* Memory: 7.8/30.8 GiB

| Trial name | # failures | error file |
|---|---|---|
| my_trainer_28b6d_00005 | 1 | /mnt/data/Demo-Credit-Default-Model/ray_results/my_trainer_2025-01-24_08-13-33/my_trainer_28b6d_00005_5_eta=0.0105,max_depth=3_2025-01-24_08-17-56/error.txt |

| Trial name | status | loc | eta | max_depth | iter | total time (s) | train-logloss | train-error | val-logloss |
|---|---|---|---|---|---|---|---|---|---|
| my_trainer_28b6d_00000 | TERMINATED | 100.64.77.77:295 | 0.00429871 | 5 | 50.0 | 51.3373 | 0.598943 | 0.12756 | 0.594395 |
| my_trainer_28b6d_00001 | TERMINATED | 100.64.77.77:295 | 0.0125163 | 5 | 50.0 | 51.9643 | 0.481246 | 0.118656 | 0.470189 |
| my_trainer_28b6d_00002 | TERMINATED | 100.64.77.77:295 | 0.108287 | 2 | 50.0 | 49.6379 | 0.375754 | 0.139993 | 0.363039 |
| my_trainer_28b6d_00003 | TERMINATED | 100.64.77.77:295 | 0.207217 | 5 | 50.0 | 50.6038 | 0.0757972 | 0.00486095 | 0.0705301 |
| my_trainer_28b6d_00004 | TERMINATED | 100.64.77.77:295 | 0.0664882 | 5 | 50.0 | 51.4298 | 0.212149 | 0.0441219 | 0.199608 |
| my_trainer_28b6d_00006 | TERMINATED | 100.64.77.77:1478 | 0.00450201 | 2 | 50.0 | 49.8746 | 0.647064 | 0.266268 | 0.643892 |
| my_trainer_28b6d_00007 | TERMINATED | 100.64.77.77:1478 | 0.0521433 | 3 | 50.0 | 49.9487 | 0.385515 | 0.126247 | 0.372188 |
| my_trainer_28b6d_00008 | TERMINATED | 100.64.77.77:1478 | 0.00564319 | 5 | 50.0 | 51.6142 | 0.57524 | 0.127462 | 0.569507 |
| my_trainer_28b6d_00009 | TERMINATED | 100.64.77.77:1478 | 0.00459217 | 2 | 50.0 | 49.3611 | 0.646307 | 0.266268 | 0.643061 |
| my_trainer_28b6d_00005 | ERROR | 100.64.77.77:295 | 0.0105371 | 3 | | | | | |


(_RemoteRayXGBoostActor pid=255, ip=100.64.32.41) [08:13:46] task [xgboost.ray]:124607901438736 got new rank 0
(_RemoteRayXGBoostActor pid=253, ip=100.64.59.160) [08:13:46] task [xgboost.ray]:123804710577056 got new rank 1
(_RemoteRayXGBoostActor pid=338, ip=100.64.77.77) [08:13:46] task [xgboost.ray]:130961744736112 got new rank 2


(my_trainer pid=295, ip=100.64.77.77) [0]	train-logloss:0.69087	train-error:0.16109	val-logloss:0.69084	val-error:0.15954
(my_trainer pid=295, ip=100.64.77.77) [1]	train-logloss:0.68861	train-error:0.16109	val-logloss:0.68855	val-error:0.15954
(my_trainer pid=295, ip=100.64.77.77) [2]	train-logloss:0.68637	train-error:0.16109	val-logloss:0.68630	val-error:0.15954
(my_trainer pid=295, ip=100.64.77.77) [3]	train-logloss:0.68415	train-error:0.16109	val-logloss:0.68405	val-error:0.15954
(my_trainer pid=295, ip=100.64.77.77) [4]	train-logloss:0.68196	train-error:0.15997	val-logloss:0.68185	val-error:0.15799
(my_trainer pid=295, ip=100.64.77.77) [5]	train-logloss:0.67977	train-error:0.15997	val-logloss:0.67963	val-error:0.15799
(my_trainer pid=295, ip=100.64.77.77) [6]	train-logloss:0.67762	train-error:0.15997	val-logloss:0.67746	val-error:0.15799
(my_trainer pid=295, ip=100.64.77.77) [7]

(my_trainer pid=295, ip=100.64.77.77) 2025-01-24 08:14:16,959	INFO main.py:1175 -- Training in progress (31 seconds since last restart).


(my_trainer pid=295, ip=100.64.77.77) [31]	train-logloss:0.62914	train-error:0.14940	val-logloss:0.62632	val-error:0.13385
(my_trainer pid=295, ip=100.64.77.77) [32]	train-logloss:0.62737	train-error:0.14906	val-logloss:0.62440	val-error:0.13308
(my_trainer pid=295, ip=100.64.77.77) [33]	train-logloss:0.62565	train-error:0.14759	val-logloss:0.62255	val-error:0.12998
(my_trainer pid=295, ip=100.64.77.77) [34]	train-logloss:0.62384	train-error:0.14838	val-logloss:0.62066	val-error:0.13153
(my_trainer pid=295, ip=100.64.77.77) [35]	train-logloss:0.62213	train-error:0.14692	val-logloss:0.61885	val-error:0.12998
(my_trainer pid=295, ip=100.64.77.77) [36]	train-logloss:0.62040	train-error:0.14366	val-logloss:0.61699	val-error:0.12765
(my_trainer pid=295, ip=100.64.77.77) [37]	train-logloss:0.61866	train-error:0.14366	val-logloss:0.61520	val-error:0.12765
(my_trainer pid=295, ip=100.64.77.77)

(_RemoteRayXGBoostActor pid=377, ip=100.64.32.41) [08:14:38] task [xgboost.ray]:130079426449856 got new rank 0
(_RemoteRayXGBoostActor pid=373, ip=100.64.59.160) [08:14:38] task [xgboost.ray]:138372102857536 got new rank 1
(_RemoteRayXGBoostActor pid=484, ip=100.64.77.77) [08:14:38] task [xgboost.ray]:126805135118400 got new rank 2


(my_trainer pid=295, ip=100.64.77.77) [0]	train-logloss:0.68654	train-error:0.16109	val-logloss:0.68645	val-error:0.15954
(my_trainer pid=295, ip=100.64.77.77) [1]	train-logloss:0.68009	train-error:0.16109	val-logloss:0.67999	val-error:0.15954
(my_trainer pid=295, ip=100.64.77.77) [2]	train-logloss:0.67384	train-error:0.15846	val-logloss:0.67362	val-error:0.15730
(my_trainer pid=295, ip=100.64.77.77) [3]	train-logloss:0.66773	train-error:0.15846	val-logloss:0.66749	val-error:0.15730
(my_trainer pid=295, ip=100.64.77.77) [4]	train-logloss:0.66170	train-error:0.15846	val-logloss:0.66136	val-error:0.15730
(my_trainer pid=295, ip=100.64.77.77) [5]	train-logloss:0.65596	train-error:0.16153	val-logloss:0.65518	val-error:0.15960
(my_trainer pid=295, ip=100.64.77.77) [6]	train-logloss:0.65034	train-error:0.15814	val-logloss:0.64911	val-error:0.15729
(my_trainer pid=295, ip=100.64.77.77) [7]

(my_trainer pid=295, ip=100.64.77.77) 2025-01-24 08:15:09,252	INFO main.py:1175 -- Training in progress (31 seconds since last restart).


(my_trainer pid=295, ip=100.64.77.77) [31]	train-logloss:0.53831	train-error:0.12846	val-logloss:0.53062	val-error:0.11591
(my_trainer pid=295, ip=100.64.77.77) [32]	train-logloss:0.53483	train-error:0.12879	val-logloss:0.52686	val-error:0.11592
(my_trainer pid=295, ip=100.64.77.77) [33]	train-logloss:0.53141	train-error:0.12791	val-logloss:0.52329	val-error:0.11513
(my_trainer pid=295, ip=100.64.77.77) [34]	train-logloss:0.52800	train-error:0.13015	val-logloss:0.51956	val-error:0.11669
(my_trainer pid=295, ip=100.64.77.77) [35]	train-logloss:0.52470	train-error:0.12846	val-logloss:0.51608	val-error:0.11511
(my_trainer pid=295, ip=100.64.77.77) [36]	train-logloss:0.52143	train-error:0.12712	val-logloss:0.51252	val-error:0.11355
(my_trainer pid=295, ip=100.64.77.77) [37]	train-logloss:0.51823	train-error:0.12542	val-logloss:0.50920	val-error:0.11280
(my_trainer pid=295, ip=100.64.77.77)

(_RemoteRayXGBoostActor pid=460, ip=100.64.32.41) [08:15:30] task [xgboost.ray]:127364996489808 got new rank 0
(_RemoteRayXGBoostActor pid=496, ip=100.64.59.160) [08:15:30] task [xgboost.ray]:125295200031072 got new rank 1
(_RemoteRayXGBoostActor pid=709, ip=100.64.77.77) [08:15:31] task [xgboost.ray]:133228812800448 got new rank 2


(my_trainer pid=295, ip=100.64.77.77) [0]	train-logloss:0.66778	train-error:0.30285	val-logloss:0.66628	val-error:0.29079
(my_trainer pid=295, ip=100.64.77.77) [1]	train-logloss:0.64664	train-error:0.26686	val-logloss:0.64339	val-error:0.24841
(my_trainer pid=295, ip=100.64.77.77) [2]	train-logloss:0.62921	train-error:0.27810	val-logloss:0.62424	val-error:0.26415
(my_trainer pid=295, ip=100.64.77.77) [3]	train-logloss:0.61223	train-error:0.26346	val-logloss:0.60586	val-error:0.24932
(my_trainer pid=295, ip=100.64.77.77) [4]	train-logloss:0.59764	train-error:0.25796	val-logloss:0.59144	val-error:0.24541
(my_trainer pid=295, ip=100.64.77.77) [5]	train-logloss:0.58507	train-error:0.25029	val-logloss:0.57784	val-error:0.22892
(my_trainer pid=295, ip=100.64.77.77) [6]	train-logloss:0.57321	train-error:0.24804	val-logloss:0.56515	val-error:0.23061
(my_trainer pid=295, ip=100.64.77.77) [7]

(my_trainer pid=295, ip=100.64.77.77) 2025-01-24 08:16:01,473	INFO main.py:1175 -- Training in progress (31 seconds since last restart).


(my_trainer pid=295, ip=100.64.77.77) [33]	train-logloss:0.41660	train-error:0.15396	val-logloss:0.40467	val-error:0.14480
(my_trainer pid=295, ip=100.64.77.77) [34]	train-logloss:0.41414	train-error:0.15375	val-logloss:0.40213	val-error:0.14479
(my_trainer pid=295, ip=100.64.77.77) [35]	train-logloss:0.40999	train-error:0.14842	val-logloss:0.39734	val-error:0.13926
(my_trainer pid=295, ip=100.64.77.77) [36]	train-logloss:0.40665	train-error:0.15058	val-logloss:0.39398	val-error:0.13925
(my_trainer pid=295, ip=100.64.77.77) [37]	train-logloss:0.40393	train-error:0.15340	val-logloss:0.39082	val-error:0.14160
(my_trainer pid=295, ip=100.64.77.77) [38]	train-logloss:0.40090	train-error:0.15014	val-logloss:0.38813	val-error:0.14236
(my_trainer pid=295, ip=100.64.77.77) [39]	train-logloss:0.39808	train-error:0.14630	val-logloss:0.38537	val-error:0.13612
(my_trainer pid=295, ip=100.64.77.77)

(_RemoteRayXGBoostActor pid=585, ip=100.64.32.41) [08:16:20] task [xgboost.ray]:132773705380768 got new rank 0
(_RemoteRayXGBoostActor pid=616, ip=100.64.59.160) [08:16:20] task [xgboost.ray]:139949385659248 got new rank 1
(_RemoteRayXGBoostActor pid=892, ip=100.64.77.77) [08:16:20] task [xgboost.ray]:137329241047536 got new rank 2


(my_trainer pid=295, ip=100.64.77.77) [0]	train-logloss:0.59439	train-error:0.16109	val-logloss:0.59310	val-error:0.15954
(my_trainer pid=295, ip=100.64.77.77) [1]	train-logloss:0.52654	train-error:0.14589	val-logloss:0.51986	val-error:0.13769
(my_trainer pid=295, ip=100.64.77.77) [2]	train-logloss:0.47541	train-error:0.12346	val-logloss:0.46558	val-error:0.11588
(my_trainer pid=295, ip=100.64.77.77) [3]	train-logloss:0.42880	train-error:0.10766	val-logloss:0.41875	val-error:0.10165
(my_trainer pid=295, ip=100.64.77.77) [4]	train-logloss:0.39274	train-error:0.09594	val-logloss:0.38196	val-error:0.08763
(my_trainer pid=295, ip=100.64.77.77) [5]	train-logloss:0.36400	train-error:0.09249	val-logloss:0.35204	val-error:0.08380
(my_trainer pid=295, ip=100.64.77.77) [6]	train-logloss:0.33944	train-error:0.08536	val-logloss:0.32608	val-error:0.07436
(my_trainer pid=295, ip=100.64.77.77) [7]

(my_trainer pid=295, ip=100.64.77.77) 2025-01-24 08:16:51,167	INFO main.py:1175 -- Training in progress (31 seconds since last restart).


(my_trainer pid=295, ip=100.64.77.77) [32]	train-logloss:0.12157	train-error:0.01570	val-logloss:0.11190	val-error:0.01098
(my_trainer pid=295, ip=100.64.77.77) [33]	train-logloss:0.11625	train-error:0.01424	val-logloss:0.10740	val-error:0.01100
(my_trainer pid=295, ip=100.64.77.77) [34]	train-logloss:0.11369	train-error:0.01424	val-logloss:0.10509	val-error:0.01100
(my_trainer pid=295, ip=100.64.77.77) [35]	train-logloss:0.11043	train-error:0.01255	val-logloss:0.10209	val-error:0.01024
(my_trainer pid=295, ip=100.64.77.77) [36]	train-logloss:0.10569	train-error:0.01186	val-logloss:0.09707	val-error:0.00864
(my_trainer pid=295, ip=100.64.77.77) [37]	train-logloss:0.10074	train-error:0.00961	val-logloss:0.09212	val-error:0.00629
(my_trainer pid=295, ip=100.64.77.77) [38]	train-logloss:0.09815	train-error:0.00769	val-logloss:0.08966	val-error:0.00470
(my_trainer pid=295, ip=100.64.77.77)

(_RemoteRayXGBoostActor pid=704, ip=100.64.32.41) [08:17:11] task [xgboost.ray]:131409218359840 got new rank 0
(_RemoteRayXGBoostActor pid=740, ip=100.64.59.160) [08:17:11] task [xgboost.ray]:124295231262576 got new rank 1
(_RemoteRayXGBoostActor pid=1076, ip=100.64.77.77) [08:17:11] task [xgboost.ray]:136771580309360 got new rank 2


(my_trainer pid=295, ip=100.64.77.77) [0]	train-logloss:0.65899	train-error:0.16109	val-logloss:0.65855	val-error:0.15954
(my_trainer pid=295, ip=100.64.77.77) [1]	train-logloss:0.62963	train-error:0.15153	val-logloss:0.62689	val-error:0.13858
(my_trainer pid=295, ip=100.64.77.77) [2]	train-logloss:0.60215	train-error:0.14772	val-logloss:0.59887	val-error:0.13705
(my_trainer pid=295, ip=100.64.77.77) [3]	train-logloss:0.57825	train-error:0.13895	val-logloss:0.57319	val-error:0.12379
(my_trainer pid=295, ip=100.64.77.77) [4]	train-logloss:0.55623	train-error:0.13373	val-logloss:0.55039	val-error:0.12066
(my_trainer pid=295, ip=100.64.77.77) [5]	train-logloss:0.53662	train-error:0.12645	val-logloss:0.52941	val-error:0.11121
(my_trainer pid=295, ip=100.64.77.77) [6]	train-logloss:0.51835	train-error:0.12357	val-logloss:0.50976	val-error:0.10809
(my_trainer pid=295, ip=100.64.77.77) [7]

(my_trainer pid=295, ip=100.64.77.77) 2025-01-24 08:17:42,114	INFO main.py:1175 -- Training in progress (31 seconds since last restart).


(my_trainer pid=295, ip=100.64.77.77) [32]	train-logloss:0.27633	train-error:0.07198	val-logloss:0.26399	val-error:0.06346
(my_trainer pid=295, ip=100.64.77.77) [33]	train-logloss:0.27216	train-error:0.06997	val-logloss:0.26000	val-error:0.06190
(my_trainer pid=295, ip=100.64.77.77) [34]	train-logloss:0.26740	train-error:0.06759	val-logloss:0.25541	val-error:0.06346
(my_trainer pid=295, ip=100.64.77.77) [35]	train-logloss:0.26266	train-error:0.06339	val-logloss:0.25040	val-error:0.05718
(my_trainer pid=295, ip=100.64.77.77) [36]	train-logloss:0.25838	train-error:0.06352	val-logloss:0.24589	val-error:0.05718
(my_trainer pid=295, ip=100.64.77.77) [37]	train-logloss:0.25402	train-error:0.06352	val-logloss:0.24150	val-error:0.05718
(my_trainer pid=295, ip=100.64.77.77) [38]	train-logloss:0.24984	train-error:0.06103	val-logloss:0.23731	val-error:0.05167
(my_trainer pid=295, ip=100.64.77.77)

(_RemoteRayXGBoostActor pid=786, ip=100.64.32.41) [08:18:02] task [xgboost.ray]:130721911181952 got new rank 0
(_RemoteRayXGBoostActor pid=821, ip=100.64.59.160) [08:18:02] task [xgboost.ray]:138018465751392 got new rank 1
(_RemoteRayXGBoostActor pid=1301, ip=100.64.77.77) [08:18:02] task [xgboost.ray]:128136061624480 got new rank 2
(my_trainer pid=295, ip=100.64.77.77) 2025-01-24 08:18:04,720	INFO elastic.py:155 -- Actor status: 3 alive, 0 dead (3 total)
(run pid=454) 2025-01-24 08:18:04,762	ERROR trial_runner.py:1450 -- Trial my_trainer_28b6d_00005: Error happened when processing _ExecutorEventType.TRAINING_RESULT.
(run pid=454) ray.exceptions.RayTaskError(RuntimeError): ray::ImplicitFunc.train() (pid=295, ip=100.64.77.77, repr=my_trainer)
(run pid=454) ray.exceptions.RayTaskError(RayXGBoostTrainingError): ray::_RemoteRayXGBoostActor.train() (pid=1301, ip=100.64.77.77, repr=

(my_trainer pid=1478, ip=100.64.77.77) [0]	train-logloss:0.69203	train-error:0.30285	val-logloss:0.69197	val-error:0.29079
(my_trainer pid=1478, ip=100.64.77.77) [1]	train-logloss:0.69093	train-error:0.30285	val-logloss:0.69081	val-error:0.29079
(my_trainer pid=1478, ip=100.64.77.77) [2]	train-logloss:0.68984	train-error:0.30285	val-logloss:0.68965	val-error:0.29079
(my_trainer pid=1478, ip=100.64.77.77) [3]	train-logloss:0.68876	train-error:0.30285	val-logloss:0.68851	val-error:0.29079
(my_trainer pid=1478, ip=100.64.77.77) [4]	train-logloss:0.68768	train-error:0.26686	val-logloss:0.68736	val-error:0.24841
(my_trainer pid=1478, ip=100.64.77.77) [5]	train-logloss:0.68662	train-error:0.26686	val-logloss:0.68623	val-error:0.24841
(my_trainer pid=1478, ip=100.64.77.77) [6]	train-logloss:0.68556	train-error:0.26686	val-logloss:0.68510	val-error:0.24841
(my_trainer pid=1478, ip=100.64.77.77)

(my_trainer pid=1478, ip=100.64.77.77) 2025-01-24 08:18:46,141	INFO main.py:1175 -- Training in progress (31 seconds since last restart).


(my_trainer pid=1478, ip=100.64.77.77) [33]	train-logloss:0.65996	train-error:0.26686	val-logloss:0.65770	val-error:0.24841
(my_trainer pid=1478, ip=100.64.77.77) [34]	train-logloss:0.65911	train-error:0.26686	val-logloss:0.65678	val-error:0.24841
(my_trainer pid=1478, ip=100.64.77.77) [35]	train-logloss:0.65827	train-error:0.26686	val-logloss:0.65588	val-error:0.24841
(my_trainer pid=1478, ip=100.64.77.77) [36]	train-logloss:0.65743	train-error:0.26686	val-logloss:0.65497	val-error:0.24841
(my_trainer pid=1478, ip=100.64.77.77) [37]	train-logloss:0.65660	train-error:0.26686	val-logloss:0.65409	val-error:0.24841
(my_trainer pid=1478, ip=100.64.77.77) [38]	train-logloss:0.65577	train-error:0.26686	val-logloss:0.65319	val-error:0.24841
(my_trainer pid=1478, ip=100.64.77.77) [39]	train-logloss:0.65495	train-error:0.26686	val-logloss:0.65232	val-error:0.24841
(my_trainer pid=1478, ip=100.64.

(_RemoteRayXGBoostActor pid=1028, ip=100.64.32.41) [08:19:06] task [xgboost.ray]:127572619977488 got new rank 0
(_RemoteRayXGBoostActor pid=1022, ip=100.64.59.160) [08:19:06] task [xgboost.ray]:123593326957424 got new rank 1
(_RemoteRayXGBoostActor pid=1664, ip=100.64.77.77) [08:19:06] task [xgboost.ray]:136037348905744 got new rank 2


(my_trainer pid=1478, ip=100.64.77.77) [0]	train-logloss:0.67484	train-error:0.22747	val-logloss:0.67348	val-error:0.20883
(my_trainer pid=1478, ip=100.64.77.77) [1]	train-logloss:0.65878	train-error:0.22331	val-logloss:0.65557	val-error:0.20021
(my_trainer pid=1478, ip=100.64.77.77) [2]	train-logloss:0.64425	train-error:0.21114	val-logloss:0.63917	val-error:0.18773
(my_trainer pid=1478, ip=100.64.77.77) [3]	train-logloss:0.63139	train-error:0.20921	val-logloss:0.62525	val-error:0.19475
(my_trainer pid=1478, ip=100.64.77.77) [4]	train-logloss:0.61860	train-error:0.21226	val-logloss:0.61084	val-error:0.19787
(my_trainer pid=1478, ip=100.64.77.77) [5]	train-logloss:0.60763	train-error:0.20412	val-logloss:0.59873	val-error:0.18924
(my_trainer pid=1478, ip=100.64.77.77) [6]	train-logloss:0.59703	train-error:0.20445	val-logloss:0.58741	val-error:0.18927
(my_trainer pid=1478, ip=100.64.77.77)

(my_trainer pid=1478, ip=100.64.77.77) 2025-01-24 08:19:36,759	INFO main.py:1175 -- Training in progress (31 seconds since last restart).


(my_trainer pid=1478, ip=100.64.77.77) [33]	train-logloss:0.43529	train-error:0.15373	val-logloss:0.42150	val-error:0.14323
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [34]	train-logloss:0.43165	train-error:0.15511	val-logloss:0.41796	val-error:0.14326
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [35]	train-logloss:0.42847	train-error:0.15149	val-logloss:0.41462	val-error:0.14009
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [36]	train-logloss:0.42469	train-error:0.15489	val-logloss:0.41103	val-error:0.14481
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [37]	train-logloss:0.42140	train-error:0.15072	val-logloss:0.40757	val-error:0.14092
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [38]	train-logloss:0.41789	train-error:0.15284	val-logloss:0.40365	val-error:0.14326
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [39]	train-logloss:0.41452	train-error:0.15062	val-logloss:0.40046	val-error:0.14012
[2m[36m(my_trainer pid=1478, ip=100.64.

[2m[36m(_RemoteRayXGBoostActor pid=1150, ip=100.64.32.41)[0m [08:19:56] task [xgboost.ray]:125304905958064 got new rank 0
[2m[36m(_RemoteRayXGBoostActor pid=1143, ip=100.64.59.160)[0m [08:19:56] task [xgboost.ray]:127581404348480 got new rank 1
[2m[36m(_RemoteRayXGBoostActor pid=1849, ip=100.64.77.77)[0m [08:19:56] task [xgboost.ray]:133184630681664 got new rank 2


[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [0]	train-logloss:0.69016	train-error:0.16109	val-logloss:0.69012	val-error:0.15954
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [1]	train-logloss:0.68720	train-error:0.16109	val-logloss:0.68715	val-error:0.15954
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [2]	train-logloss:0.68428	train-error:0.16109	val-logloss:0.68419	val-error:0.15954
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [3]	train-logloss:0.68140	train-error:0.15997	val-logloss:0.68130	val-error:0.15799
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [4]	train-logloss:0.67854	train-error:0.15997	val-logloss:0.67840	val-error:0.15799
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [5]	train-logloss:0.67573	train-error:0.15846	val-logloss:0.67557	val-error:0.15730
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [6]	train-logloss:0.67295	train-error:0.15846	val-logloss:0.67274	val-error:0.15730
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)

[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m 2025-01-24 08:20:27,532	INFO main.py:1175 -- Training in progress (31 seconds since last restart).


[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [31]	train-logloss:0.61193	train-error:0.14300	val-logloss:0.60810	val-error:0.12530
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [32]	train-logloss:0.60970	train-error:0.13724	val-logloss:0.60577	val-error:0.12065
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [33]	train-logloss:0.60757	train-error:0.13871	val-logloss:0.60349	val-error:0.12296
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [34]	train-logloss:0.60542	train-error:0.13658	val-logloss:0.60126	val-error:0.11830
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [35]	train-logloss:0.60333	train-error:0.13658	val-logloss:0.59902	val-error:0.11830
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [36]	train-logloss:0.60121	train-error:0.12913	val-logloss:0.59682	val-error:0.11205
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [37]	train-logloss:0.59915	train-error:0.13073	val-logloss:0.59462	val-error:0.11363
[2m[36m(my_trainer pid=1478, ip=100.64.

[2m[36m(_RemoteRayXGBoostActor pid=1230, ip=100.64.32.41)[0m [08:20:48] task [xgboost.ray]:135716432957296 got new rank 0
[2m[36m(_RemoteRayXGBoostActor pid=1307, ip=100.64.59.160)[0m [08:20:48] task [xgboost.ray]:124045621858608 got new rank 1
[2m[36m(_RemoteRayXGBoostActor pid=2033, ip=100.64.77.77)[0m [08:20:48] task [xgboost.ray]:126457693573808 got new rank 2


[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [0]	train-logloss:0.69201	train-error:0.30285	val-logloss:0.69195	val-error:0.29079
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [1]	train-logloss:0.69089	train-error:0.30285	val-logloss:0.69076	val-error:0.29079
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [2]	train-logloss:0.68977	train-error:0.30285	val-logloss:0.68958	val-error:0.29079
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [3]	train-logloss:0.68867	train-error:0.30285	val-logloss:0.68842	val-error:0.29079
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [4]	train-logloss:0.68758	train-error:0.26686	val-logloss:0.68724	val-error:0.24841
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [5]	train-logloss:0.68649	train-error:0.26686	val-logloss:0.68609	val-error:0.24841
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [6]	train-logloss:0.68541	train-error:0.26686	val-logloss:0.68494	val-error:0.24841
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)

[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m 2025-01-24 08:21:19,323	INFO main.py:1175 -- Training in progress (31 seconds since last restart).


[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [33]	train-logloss:0.65938	train-error:0.26686	val-logloss:0.65708	val-error:0.24841
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [34]	train-logloss:0.65852	train-error:0.26686	val-logloss:0.65616	val-error:0.24841
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [35]	train-logloss:0.65766	train-error:0.26686	val-logloss:0.65523	val-error:0.24841
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [36]	train-logloss:0.65681	train-error:0.26686	val-logloss:0.65431	val-error:0.24841
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [37]	train-logloss:0.65597	train-error:0.26686	val-logloss:0.65341	val-error:0.24841
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [38]	train-logloss:0.65513	train-error:0.26686	val-logloss:0.65252	val-error:0.24841
[2m[36m(my_trainer pid=1478, ip=100.64.77.77)[0m [39]	train-logloss:0.65430	train-error:0.26686	val-logloss:0.65162	val-error:0.24841
[2m[36m(my_trainer pid=1478, ip=100.64.

type: [36mray::run()[39m (pid=454, ip=100.64.71.114)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/tune.py", line 939, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [my_trainer_28b6d_00005])

Ray Tune returns an `ExperimentAnalysis` object that contains the results of the trials. We are only interested in its `best_config` property, which holds the hyperparameter configuration of the best-performing trial (according to our evaluation criteria).

In [None]:
analysis.best_config
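
Conceptually, picking the best configuration amounts to ranking the finished trials by the chosen metric. The sketch below illustrates that idea in plain Python. It does not call Ray Tune; the `select_best_config` helper, the trial records and the hyperparameter values are all hypothetical and only serve to show the selection logic.

```python
# Conceptual sketch only: this mimics how a "best" trial could be chosen from
# a set of finished trials. All trial records are made-up illustrative values.

def select_best_config(trials, metric, mode="min"):
    """Return the config of the trial with the best value for `metric`."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    pick = min if mode == "min" else max
    best = pick(trials, key=lambda t: t["metrics"][metric])
    return best["config"]

trials = [
    {"config": {"max_depth": 3, "eta": 0.01}, "metrics": {"val-error": 0.25}},
    {"config": {"max_depth": 5, "eta": 0.10}, "metrics": {"val-error": 0.14}},
    {"config": {"max_depth": 7, "eta": 0.03}, "metrics": {"val-error": 0.11}},
]

# A lower validation error is better, so we minimise the metric.
best_config = select_best_config(trials, metric="val-error", mode="min")
print(best_config)
```

The `mode` argument matters: an error metric should be minimised, whereas a metric such as accuracy would be maximised.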

We now have the hyperparameters (*depth* and *learning rate*) that produce the best model. Luckily, we don't have to retrain it from scratch, as our training function automatically persists each attempted model. All we need to do now is copy the already trained variant to `/mnt/artifacts` (our `MODEL_ROOT`) and ignore the others. We'll name the selected model `tune_best.xgb`.

In [None]:
import shutil

shutil.copy(
    os.path.join(analysis.best_logdir, "model.xgb"),
    os.path.join(MODEL_ROOT, "tune_best.xgb")
)

Recall that the model was selected using a validation set. We don't know its actual generalisation capability until we measure it on the test set.
Let's go ahead and test how well it performs on unseen data. Note that here we are also using Ray for the inference. This is not necessary. Later you will see that we can simply load the saved model and use standard XGBoost for the purposes of operationalisation.

In [None]:
# Inference using Ray

# Load the serialized model
bst = xgb.Booster(model_file=os.path.join(MODEL_ROOT, "tune_best.xgb"))

# Resources for the actors that run the distributed prediction
xgb_ray_params = xgbr.RayParams(
    num_actors=RAY_ACTORS,
    cpus_per_actor=RAY_CPUS_PER_ACTOR
)

# Make predictions on the test data (returns probabilities of the positive class)
predictions = xgbr.predict(bst, rdm_test, ray_params=xgb_ray_params)
# Turn probabilities into 0/1 class labels
pred_class = (predictions > 0.5).astype("int")
actuals = df_test[target_col]
# accuracy_score expects the true labels first, then the predicted labels
print("Accuracy on test: {:.2f}".format(accuracy_score(actuals, pred_class)))

In [None]:
list(predictions)
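
The raw predictions are probabilities, which the inference cell above turns into class labels with a 0.5 threshold before computing accuracy. The following is a minimal pure-Python sketch of those two steps; the helper names `to_class` and `accuracy` and the sample values are hypothetical and are not part of the notebook's pipeline.

```python
# Pure-Python equivalent of `(predictions > 0.5).astype("int")` followed by an
# accuracy calculation. The probabilities and labels are illustrative values.

def to_class(probabilities, threshold=0.5):
    """Map probabilities to 0/1 labels; only values above the threshold become 1."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

probs = [0.91, 0.42, 0.50, 0.77, 0.08]
labels = [1, 0, 1, 1, 0]

classes = to_class(probs)
print(classes)
print("Accuracy: {:.2f}".format(accuracy(labels, classes)))
```

Because the comparison is strict, a probability of exactly 0.5 is mapped to class 0, which is why the third sample counts as a miss here.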

## Model explainability

Interest in the interpretation of machine learning models has accelerated rapidly over the last decade. This can be attributed to the popularity that machine learning algorithms, and more specifically deep learning, have been gaining in various domains.

According to Fox et al. (2017), the need for explainable AI is mainly motivated by the following three reasons:

* The need for **trust** - if a doctor is recommending a treatment protocol based on a prediction from a neural network, this doctor must have absolute trust in the network's capability. This trust must be paramount when human lives are at stake.
* The need for **interaction** - complex decision making systems often rely on Human–Autonomy Teaming (HAT), where the outcome is produced by joint efforts of one or more humans and one or more autonomous agents. This form of cooperation requires that the human operator is able to interact with the model for the purposes of better understanding or improving the automated recommendations.
* The need for **transparency** - if a network makes an inappropriate recommendation or disagrees with a human expert, its behaviour must be explainable. There should be mechanisms that allow us to inspect the inner workings of the model's decision making process and gain insight into what this decision was based on.

In addition, regulators are introducing legal requirements around the use of automated decision making. For example, [Article 22 of the General Data Protection Regulation](https://gdpr-info.eu/art-22-gdpr/) (GDPR) introduces the right to explanation - the power of an individual to demand an explanation of the reasons behind a model-based decision and to challenge the decision if it leads to a negative impact for the individual. The Defense Advanced Research Projects Agency (DARPA) in the US is supporting a major effort that seeks to facilitate AI explainability (see Turek, DARPA XAI).

In this section of the notebook, we'll look into interpreting the inner workings of the model to better understand the encoded inductive biases.

Let's begin by loading the model as a normal XGBoost model. We are no longer using Ray, as the model itself and the inference don't process large amounts of data.

We'll also run another accuracy calculation on the test set (this time using a pure Pandas data frame) and make sure that the numbers agree.

In [None]:
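# Load the saved model with plain XGBoost (no Ray involved)
xgc = xgb.Booster(model_file=os.path.join(MODEL_ROOT, "tune_best.xgb"))
# Separate the features from the target and wrap them in a DMatrix
df_test_X = df_test.drop(target_col, axis=1)
xgtest = xgb.DMatrix(df_test_X)

predictions = xgc.predict(xgtest)

# Turn probabilities into 0/1 class labels
pred_class = (predictions > 0.5).astype("int")
actuals = df_test[target_col]
# accuracy_score expects the true labels first, then the predicted labels
print("Accuracy on test: {:.2f}".format(accuracy_score(actuals, pred_class)))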

Generally speaking, feature importance quantifies how useful each feature was in the construction of the model. We can interrogate a fitted XGBoost model on the feature importance and get the numbers for each one of the individual features.

Indirectly, this tells us how much each feature contributes to the model predictions. XGBoost provides a function called `plot_importance`, which plots the feature importance based on the fitted trees. This function accepts an argument named `importance_type`, which takes one of the following values and controls how importance is calculated:

* gain --- average gain of splits which use the feature. When looking at two features, the one with the higher gain is more important for generating a prediction. Typically, gain is the most relevant measure for interpreting the relative importance of each feature.
* weight --- number of times a feature appears in a tree. 
* cover --- average coverage of splits which use the feature where coverage is defined as the number of samples affected by the split. This basically gives us the relative number of observations related to a feature.
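
To make the three definitions above concrete, here is a small pure-Python sketch that derives weight, gain and cover from a list of splits. The split records and the `importance` helper are hypothetical; this is not how XGBoost stores its trees internally, only an illustration of the arithmetic behind each importance type.

```python
from collections import defaultdict

# Each tuple is one split somewhere in the ensemble: (feature, gain, cover).
# The values are hypothetical and chosen only to make the arithmetic visible.
splits = [
    ("checking_account_A14", 12.0, 400.0),
    ("checking_account_A14", 8.0, 200.0),
    ("duration", 3.0, 300.0),
    ("duration", 2.0, 150.0),
    ("duration", 1.0, 150.0),
]

def importance(splits, importance_type):
    """Compute per-feature 'weight', 'gain' or 'cover' from a list of splits."""
    count = defaultdict(int)
    total_gain = defaultdict(float)
    total_cover = defaultdict(float)
    for feature, gain, cover in splits:
        count[feature] += 1
        total_gain[feature] += gain
        total_cover[feature] += cover
    if importance_type == "weight":
        # Number of splits that use the feature
        return dict(count)
    if importance_type == "gain":
        # Average gain over the splits that use the feature
        return {f: total_gain[f] / count[f] for f in count}
    if importance_type == "cover":
        # Average number of samples affected by the feature's splits
        return {f: total_cover[f] / count[f] for f in count}
    raise ValueError("unknown importance_type: " + importance_type)

for kind in ("weight", "gain", "cover"):
    print(kind, importance(splits, kind))
```

In this toy data `duration` is used in more splits (higher weight), yet `checking_account_A14` has the higher average gain, which shows why the rankings produced by the different importance types can disagree.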

In [None]:
xgb.plot_importance(xgc, importance_type="gain", max_num_features=10, show_values=False);

Based on the above we see that the top three features driving the predictions of the model are:

* checking_account_A14 - lack of a checking account
* credit_history_A34 - critical account / has credits outside of the bank
* property_A121 - real estate

We could also look at the other importance metrics, just for completeness.

In [None]:
xgb.plot_importance(xgc, importance_type="weight", max_num_features=10);

In [None]:
xgb.plot_importance(xgc, importance_type="cover", max_num_features=10, show_values=False);

ELI5 is another popular library for model explainability. This package is used to debug machine learning classifiers and explain their predictions. 

Unlike XGBoost, which is confined to explaining its own models only, ELI5 provides support for other frameworks like *scikit-learn*, *Keras*, *LightGBM* and others. It can also explain black-box models (e.g. neural networks) using [LIME](https://www.dominodatalab.com/blog/explaining-black-box-models-using-attribute-importance-pdps-and-lime).

First, ELI5 also provides a way of calculating the feature importance. Let's test it and make sure it agrees with the original XGBoost calculation (based on gain).

In [None]:
eli5.show_weights(xgc)

A more interesting function is `show_prediction`, which returns an explanation of the decision behind an individual prediction. In other words, we can see which features drove the model to predict one outcome or the other.

Feel free to experiment with the code below, changing the `id` value and observing what features the model uses to calculate its prediction, and if the prediction agrees with the actual value. The `id` variable represents an observation number from the test dataset.

In [None]:
id = 3 # <- change this to see results for different observations  

print("Actual Label: %s" % actuals.iloc[id])
print("Predicted: %s" % pred_class[id])
eli5.show_prediction(xgc, df_test_X.iloc[id], 
                     feature_names=list(df_test_X.columns),
                     show_feature_values=True)
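
As far as we understand from the ELI5 documentation, explanations for tree ensembles are obtained by following the decision path of a sample through each tree and crediting every change in the expected score to the feature tested at that node. The sketch below illustrates this idea for a single hand-written tree. The tree structure, the node scores, the sample and the `explain` helper are hypothetical; this is a simplified illustration, not ELI5's implementation.

```python
# Conceptual sketch of decision-path feature contributions for ONE tree.
# The tree, its node scores and the sample are hypothetical.

tree = {
    "value": 0.10,  # expected score at the root (acts as the bias term)
    "feature": "checking_account_A14", "threshold": 0.5,
    "left": {
        "value": -0.30,
        "feature": "duration", "threshold": 0.4,
        "left": {"value": -0.10},
        "right": {"value": -0.60},
    },
    "right": {"value": 0.70},
}

def explain(tree, sample):
    """Return (bias, per-feature contributions, leaf score) for one sample."""
    contributions = {}
    node = tree
    while "feature" in node:  # leaves have no "feature" key
        feature = node["feature"]
        # Go left when the feature value is below the threshold, else right.
        child = node["left"] if sample[feature] < node["threshold"] else node["right"]
        # Credit the change in expected score to the feature that was tested.
        delta = child["value"] - node["value"]
        contributions[feature] = contributions.get(feature, 0.0) + delta
        node = child
    return tree["value"], contributions, node["value"]

sample = {"checking_account_A14": 0, "duration": 0.47}
bias, contributions, leaf_score = explain(tree, sample)
print("bias:", bias)
print("contributions:", contributions)
print("leaf score:", leaf_score)
```

By construction, the bias plus the sum of all contributions equals the score of the leaf the sample ends up in. For an ensemble, the same bookkeeping would be repeated per tree and the contributions summed.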


This concludes the model training notebook demo.

In [None]:
# Set some default values
column_names_all = ['duration', 'credit_amount', 'installment_rate', 'residence', 'age', 'credits', 'dependents', 'checking_account_A11', 'checking_account_A12', 'checking_account_A13', 'checking_account_A14', 'credit_history_A30', 'credit_history_A31',
                    'credit_history_A32', 'credit_history_A33', 'credit_history_A34', 'purpose_A40', 'purpose_A41', 'purpose_A410', 'purpose_A42', 'purpose_A43', 'purpose_A44', 'purpose_A45', 'purpose_A46', 'purpose_A48', 'purpose_A49', 'savings_A61', 
                    'savings_A62', 'savings_A63', 'savings_A64', 'savings_A65', 'employment_since_A71', 'employment_since_A72', 'employment_since_A73', 'employment_since_A74', 'employment_since_A75', 'status_A91', 'status_A92', 'status_A93', 'status_A94', 
                    'debtors_guarantors_A101', 'debtors_guarantors_A102', 'debtors_guarantors_A103', 'property_A121', 'property_A122', 'property_A123', 'property_A124', 'other_installments_A141', 'other_installments_A142', 'other_installments_A143', 'housing_A151', 
                    'housing_A152', 'housing_A153', 'job_A171', 'job_A172', 'job_A173', 'job_A174', 'telephone_A191', 'telephone_A192', 'foreign_worker_A201', 'foreign_worker_A202']

sample_data = [[0.4705882352941176, 0.3685484758446132, 0.3333333333333333, 0.3333333333333333, 
                0.2857142857142857, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 
                1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
                1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 
                1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]]

In [None]:
df_all = pd.DataFrame(sample_data, columns=column_names_all)

In [None]:
# The first 7 columns are numeric; every remaining column is a one-hot
# indicator, so cast those to integers.
indicator_cols = column_names_all[7:]
df_all[indicator_cols] = df_all[indicator_cols].astype('int')


In [None]:
eli5.show_prediction(xgc, df_all.iloc[0], 
                         feature_names=list(df_all.columns),
                         show_feature_values=True)

In [None]:
df_all.iloc[0]

In [None]:
df_all.iloc[0]["checking_account_A14"]

In [None]:
df_prediction = eli5.explain_prediction_df(xgc, df_all.iloc[0], 
                         feature_names=list(df_all.columns))

In [None]:
df_prediction.head(10).style.background_gradient(cmap = "Greens").hide()