# Scaling ML workflows with Ray

We are going to scale an end-to-end machine learning workload:

1. Data loading
2. Training
3. Hyperparameter tuning
4. Inference

First, we run the workload on a single local node with a small dataset and time it. Then we run it on an Anyscale cluster with multiple nodes and multiple cores.

We should observe a noticeable difference in run time.

<img src="https://images.ctfassets.net/xjan103pcp94/ETLioF3e6PLPYr5DggvUl/7f8247e1bf79ab49295882259bd5420d/localMachineCloud.png" width="40%" height="30%">

In [1]:
import ray
import numpy as np
import os
import pandas as pd
import glob

from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split

import time
import tqdm
import xgboost as xgb
import xgboost_ray
from xgboost.callback import TrainingCallback

## Read the data files into various chunks

Read the classification data features and labels into 300MB, 3GB, and 11GB chunks to
illustrate training, tuning, and inference at scale.

The data is generated with `sklearn.datasets.make_classification`, which produces 2 classes as labels by default. More rows or classes can easily be generated. For this demo, we
will use the following:

 * 10,000,000 rows
 * 40 feature columns
 * 2 classes

In [2]:

def get_parquet_files(path, size=10):
    """Get all parquet parts from a directory."""
    size *= 10
    files = sorted(glob.glob(path))
    while size > len(files):
        files = files + files
    files = files[0:size]
    return files

def load_parquet_dataset(files):
    """Load all parquet files into a pandas df."""
    df = pd.read_parquet(files[0])
    for i in tqdm.tqdm(range(1, len(files), 50)):
        df = pd.concat((df, pd.read_parquet(files[i:i+50])))
        memory_usage = df.memory_usage(deep=True).sum()/1e9
        tqdm.tqdm.write(f"Dataset size: {memory_usage} GB")
        if memory_usage > 12:
            raise MemoryError(f"Dataset too big to fit into memory!")
    return df[sorted(df.columns)].drop("partition", axis=1)

class TqdmCallback(TrainingCallback):
    """Simple callback to print a progress bar"""
    def __init__(self, num_samples: int) -> None:
        self.num_samples = num_samples
        super().__init__()

    def before_training(self, model):
        if xgb.rabit.get_rank() == 0:
            self.pbar = tqdm.tqdm(total=self.num_samples)
        return model

    def after_iteration(self, model, epoch, evals_log):
        if xgb.rabit.get_rank() == 0:
            self.pbar.update(1)

    def after_training(self, model):
        if xgb.rabit.get_rank() == 0:
            self.pbar.close()
        return model

data_path = f"/home/ec2-user/data/classification.parquet/**/*.parquet"

data_files_300MB = get_parquet_files(data_path, size=1)
data_files_3GB = get_parquet_files(data_path, size=10)
data_files_11GB = get_parquet_files(data_path, size=30)
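
To make the selection logic concrete, the sketch below reproduces what `get_parquet_files` does after the `glob` call: it sorts the paths, doubles the list until it holds at least `size * 10` entries, then truncates. The helper name `pick_files` and the 50 file names are hypothetical, chosen only for illustration; the real directory contents may differ.

```python
def pick_files(files, size=10):
    """Same selection logic as get_parquet_files, minus the glob call."""
    target = size * 10
    files = sorted(files)  # lexicographic sort, so part_10 comes before part_2
    while target > len(files):
        files = files + files  # duplicate the list until it is long enough
    return files[:target]


# Hypothetical file names, for illustration only.
available = [f"part_{i}.parquet" for i in range(50)]

small = pick_files(available, size=1)
large = pick_files(available, size=30)

print(len(small), small[:3])
print(len(large), "entries,", len(set(large)), "distinct files")
```

Because the list is doubled rather than extended with new data, the larger datasets repeat rows from the same files. They scale the volume of data processed, not the amount of distinct information.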

# 1a. Training (regular XGBoost)

In [3]:
# XGBoost config.
xgboost_params = {
    "tree_method": "approx",
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
}

from xgboost import DMatrix, train

def train_xgboost(config, files, progress_bar=True):
    start_time = time.time()

    target_column = "labels"

    # local loading of a parquet dataset
    # needs conversion to pandas as vanilla
    # xgboost can't work with parquet directly
    train_df = load_parquet_dataset(files[:int(len(files) * 0.75)])
    test_df = load_parquet_dataset(files[int(len(files) * 0.75):])
    train_x = train_df.drop(target_column, axis=1)
    train_y = train_df[target_column]
    test_x = test_df.drop(target_column, axis=1)
    test_y = test_df[target_column]

    train_set = DMatrix(train_x, train_y)
    test_set = DMatrix(test_x, test_y)

    evals_result = {}

    # Train the classifier
    bst = train(params=config,
                dtrain=train_set,
                evals=[(test_set, "eval")],
                evals_result=evals_result,
                verbose_eval=False,
                num_boost_round=10,
                callbacks=[TqdmCallback(10)] if progress_bar else [])
    print(f"Total time taken: {time.time()-start_time}")

    model_path = "model.xgb"
    bst.save_model(model_path)
    print("Final validation error: {:.4f}".format(
        evals_result["eval"]["error"][-1]))

    return bst

Try a smaller dataset (300MB) to ensure that our XGBoost trainer works.

In [4]:
bst = train_xgboost(xgboost_params, data_files_300MB)

100%|██████████| 1/1 [00:00<00:00,  1.63it/s]


Dataset size: 0.23590022 GB


100%|██████████| 1/1 [00:00<00:00,  5.88it/s]


Dataset size: 0.101100124 GB


100%|██████████| 10/10 [00:16<00:00,  1.63s/it]

Total time taken: 18.82234835624695
Final validation error: 0.1478





Try a larger dataset (3GB) to ensure that our XGBoost trainer still works.

In [6]:
bst = train_xgboost(xgboost_params, data_files_3GB)

 50%|█████     | 1/2 [00:13<00:13, 13.10s/it]

Dataset size: 1.718702504 GB


100%|██████████| 2/2 [00:21<00:00, 10.71s/it]


Dataset size: 2.7896 GB


100%|██████████| 1/1 [00:06<00:00,  6.97s/it]


Dataset size: 0.842500748 GB


100%|██████████| 10/10 [03:41<00:00, 22.15s/it]


Total time taken: 254.95771074295044
Final validation error: 0.1472


# 1b. Training (XGBoost on Ray)

Only a few lines of code need to change for XGBoost to use distributed training on Ray:

 * `from xgboost_ray import RayDMatrix, train, RayParams`
 * Use `RayDMatrix` distributed [sharding and reading](https://github.com/ray-project/xgboost_ray#distributed-data-loading) to convert the data to XGBoost's internal data structure
 * Pass an additional `ray_params` argument (a `RayParams` object) to `train(...)` to set the level of parallelism
 
 This diagram depicts how distributed XGBoost-Ray works on a Ray cluster:
 
 <img src="images/xgboost_distributed.png" width="40%" height="30%">
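
As a rough mental model of sharding, the sketch below assigns data partitions to training actors in round-robin order. It illustrates the general idea only and is not xgboost_ray's implementation; the function name `assign_round_robin` and the block names are hypothetical.

```python
def assign_round_robin(partitions, num_actors):
    """Give partition i to actor i % num_actors."""
    shards = {actor: [] for actor in range(num_actors)}
    for index, partition in enumerate(partitions):
        shards[index % num_actors].append(partition)
    return shards


# Eight hypothetical data blocks shared among four actors.
partitions = [f"block_{i}" for i in range(8)]
shards = assign_round_robin(partitions, num_actors=4)

for actor, blocks in shards.items():
    print(f"actor {actor}: {blocks}")
```

When the number of partitions is a multiple of the number of actors, every actor receives the same number of blocks, which keeps the per-actor workload balanced.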

Let's start Ray on the local host.

In [5]:

ray.init()

{'node_ip_address': '10.0.0.231',
 'raylet_ip_address': '10.0.0.231',
 'redis_address': '10.0.0.231:6379',
 'object_store_address': '/tmp/ray/session_2022-01-26_19-15-54_415773_15641/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2022-01-26_19-15-54_415773_15641/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2022-01-26_19-15-54_415773_15641',
 'metrics_export_port': 63950,
 'node_id': '7e2bd7cfb7e67522dcadb22a0c0e8b55d02f0a3e3ae71f32ef5de8c9'}

In [6]:
# XGBoost config.
xgboost_params = {
    "tree_method": "approx",
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
}

from xgboost_ray import RayDMatrix, train, RayParams

def train_xgboost_remote(config, files, ray_params, progress_bar=True):
    start_time = time.time()

    target_column = "labels"
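    ds = ray.data.read_parquet(files)
    # repartition() returns a new dataset, so keep the result
    ds = ds.repartition(32)

    # Shuffle, then hold out 30% of the rows for validation
    split_index = int(ds.count() * (1 - 0.3))
    X = ds.random_shuffle()
    X_train, X_valid = X.split_at_indices([split_index])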

    ds_train = X_train.repartition(8)
    ds_valid = X_valid.repartition(8)
    train_set = RayDMatrix(ds_train, target_column, ignore=["partition"])
    test_set = RayDMatrix(ds_valid, target_column, ignore=["partition"])

    evals_result = {}

    # Train the classifier
    bst = train(params=config,
                dtrain=train_set,
                evals=[(test_set, "eval")],
                evals_result=evals_result,
                verbose_eval=False,
                num_boost_round=10,                       # equivalent to epochs or iterations
                callbacks=[TqdmCallback(10)] if progress_bar else [],
                ray_params=ray_params)                    # Ray parameters for parallelism
    print(f"Total time taken: {time.time()-start_time}")

    model_path = "model.xgb"
    bst.save_model(model_path)
    print("Final validation error: {:.4f}".format(
        evals_result["eval"]["error"][-1]))

    return bst
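
The shuffle-then-split step in `train_xgboost_remote` can be illustrated without Ray. The sketch below applies the same `int(count * (1 - 0.3))` index calculation to a plain Python list; `shuffle_and_split` and its parameters are hypothetical names, and the row count of 1,000,000 is an assumption.

```python
import random


def shuffle_and_split(rows, valid_fraction=0.3, seed=0):
    """Shuffle rows, then cut them into train and validation parts."""
    split_index = int(len(rows) * (1 - valid_fraction))
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:split_index], shuffled[split_index:]


train_rows, valid_rows = shuffle_and_split(range(1_000_000))
print(len(train_rows), len(valid_rows))
```

Shuffling before splitting matters here because the Parquet files are read in sorted order; without it, the validation set would consist only of the last files.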

#### Make the dataset available in cloud object storage
 * We uploaded the dataset to cloud storage (S3) and rewrote the local path of each file in our three datasets as an S3 URL.

In [7]:
s3_data_files_300MB = [str(i).replace('/home/ec2-user/', 's3://anyscale-demo/') for i in data_files_300MB]
s3_data_files_3GB = [str(i).replace('/home/ec2-user/', 's3://anyscale-demo/') for i in data_files_3GB]
s3_data_files_11GB = [str(i).replace('/home/ec2-user/', 's3://anyscale-demo/') for i in data_files_11GB]

In [8]:
s3_data_files_300MB

['s3://anyscale-demo/data/classification.parquet/partition=0/part_0.parquet',
 's3://anyscale-demo/data/classification.parquet/partition=1/part_1.parquet',
 's3://anyscale-demo/data/classification.parquet/partition=10/part_10.parquet',
 's3://anyscale-demo/data/classification.parquet/partition=11/part_11.parquet',
 's3://anyscale-demo/data/classification.parquet/partition=12/part_12.parquet',
 's3://anyscale-demo/data/classification.parquet/partition=13/part_13.parquet',
 's3://anyscale-demo/data/classification.parquet/partition=14/part_14.parquet',
 's3://anyscale-demo/data/classification.parquet/partition=15/part_15.parquet',
 's3://anyscale-demo/data/classification.parquet/partition=16/part_16.parquet',
 's3://anyscale-demo/data/classification.parquet/partition=17/part_17.parquet']
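
`str.replace` rewrites every occurrence of the substring, not just the leading one. A slightly stricter variant, sketched below, only rewrites the prefix and fails loudly on unexpected paths. The helper name `to_s3_url` is hypothetical; the two prefixes are the ones used above.

```python
LOCAL_PREFIX = "/home/ec2-user/"
S3_PREFIX = "s3://anyscale-demo/"


def to_s3_url(local_path, local_prefix=LOCAL_PREFIX, s3_prefix=S3_PREFIX):
    """Swap the local prefix for the S3 prefix, leaving the rest untouched."""
    if not local_path.startswith(local_prefix):
        raise ValueError(f"{local_path!r} does not start with {local_prefix!r}")
    return s3_prefix + local_path[len(local_prefix):]


example = "/home/ec2-user/data/classification.parquet/partition=0/part_0.parquet"
print(to_s3_url(example))
```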

#### Define RayParams for XGBoost level of parallelism
 * Local run: four actors, each using 1 CPU
 * Anyscale cluster run: eight actors, each using 2 CPUs

In [9]:
remote_train_xgboost_remote = ray.remote(train_xgboost_remote)
bst = ray.get(remote_train_xgboost_remote.remote(xgboost_params, s3_data_files_300MB, RayParams(num_actors=4, cpus_per_actor=1)))

Repartition:   0%|          | 0/32 [00:00<?, ?it/s]
Repartition:  31%|███▏      | 10/32 [00:00<00:00, 99.20it/s]
Repartition:  88%|████████▊ | 28/32 [00:00<00:00, 143.14it/s]
Repartition: 100%|██████████| 32/32 [00:01<00:00, 16.20it/s] 
Shuffle Map:   0%|          | 0/10 [00:00<?, ?it/s]
Shuffle Map:  10%|█         | 1/10 [00:00<00:01,  5.03it/s]
Shuffle Map: 100%|██████████| 10/10 [00:00<00:00, 28.59it/s]
Shuffle Reduce:   0%|          | 0/10 [00:00<?, ?it/s]
Shuffle Reduce: 100%|██████████| 10/10 [00:00<00:00, 72.52it/s]
Repartition:   0%|          | 0/8 [00:00<?, ?it/s]
Repartition: 100%|██████████| 8/8 [00:00<00:00, 124.54it/s]
Repartition:   0%|          | 0/8 [00:00<?, ?it/s]
Repartition: 100%|██████████| 8/8 [00:00<00:00, 239.14it/s]
[2m[36m(train_xgboost_remote pid=15864)[0m 2022-01-26 19:18:28,931	INFO main.py:979 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
[2m[36m(train_xgboost_remote pid=15864)[0m 2022-01-26 19:18

[2m[36m(train_xgboost_remote pid=15864)[0m Total time taken: 47.01284575462341
[2m[36m(train_xgboost_remote pid=15864)[0m Final validation error: 0.1511


100%|██████████| 10/10 [00:14<00:00,  1.46s/it] 
[2m[36m(train_xgboost_remote pid=15864)[0m 2022-01-26 19:18:46,298	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=700,000 in 17.43 seconds (15.26 pure XGBoost training time).


Let's now point our training job to an Anyscale cluster.

In [10]:
ray.shutdown()
ray.init("anyscale://xgboost-demo2")

[1m[36mAuthenticating[0m
Loaded Anyscale authentication token from ~/.anyscale/credentials.json.

[1m[36mOutput[0m
[1m[36m(anyscale +0.7s)[0m .anyscale.yaml found in project_dir. Directory is attached to a project.
[1m[36m(anyscale +0.9s)[0m Using project (name: phi-demos, project_dir: /home/ec2-user/working_dir, id: prj_26wrN4CCEw3fGhXMxzRAwvuk).
[1m[36m(anyscale +2.0s)[0m cluster xgboost-demo2 is currently running, the cluster will not be restarted.


2022-01-26 19:19:24,408	INFO packaging.py:352 -- Creating a file package for local directory '/home/ec2-user/working_dir'.
2022-01-26 19:19:24,427	INFO packaging.py:221 -- Pushing file package 'gcs://_ray_pkg_7954c209209536de.zip' (0.78MiB) to Ray cluster...
2022-01-26 19:19:24,899	INFO packaging.py:224 -- Successfully pushed file package 'gcs://_ray_pkg_7954c209209536de.zip'.


[1m[36m(anyscale +14.1s)[0m Connected to xgboost-demo2, see: https://console.anyscale.com/projects/prj_26wrN4CCEw3fGhXMxzRAwvuk/clusters/ses_aqKYRsRQpAUFD4gcfPrxnZ8e
[1m[36m(anyscale +14.1s)[0m URL for head node of cluster: https://session-aqkyrsrqpaufd4gcfprxnz8e.i.anyscaleuserdata.com


AnyscaleClientContext(dashboard_url='https://session-aqkyrsrqpaufd4gcfprxnz8e.i.anyscaleuserdata.com/auth/?token=af19b5ad-165f-4d57-a596-3bd9d882b96e&redirect_to=dashboard', python_version='3.8.5', ray_version='1.9.1', ray_commit='2cdbf974ea63caf4323aacbccaef2394a14a8562', protocol_version='2021-09-22', _num_clients=1, _context_to_restore=None)

We can now run our training job on the 11GB dataset.

In [11]:

bst = ray.get(remote_train_xgboost_remote.remote(xgboost_params, s3_data_files_11GB, RayParams(num_actors=8, cpus_per_actor=2)))

Metadata Fetch Progress:   0%|          | 0/50 [00:00<?, ?it/s]
Metadata Fetch Progress:   2%|▏         | 1/50 [00:05<04:48,  5.89s/it]
Metadata Fetch Progress:   4%|▍         | 2/50 [00:06<02:25,  3.03s/it]
Metadata Fetch Progress:   6%|▌         | 3/50 [00:08<01:41,  2.16s/it]
Metadata Fetch Progress:   8%|▊         | 4/50 [00:09<01:22,  1.80s/it]
Metadata Fetch Progress:  10%|█         | 5/50 [00:09<00:57,  1.27s/it]
Metadata Fetch Progress:  12%|█▏        | 6/50 [00:10<00:49,  1.13s/it]
Metadata Fetch Progress:  14%|█▍        | 7/50 [00:10<00:34,  1.23it/s]
Metadata Fetch Progress:  16%|█▌        | 8/50 [00:11<00:38,  1.09it/s]
Metadata Fetch Progress:  18%|█▊        | 9/50 [00:11<00:27,  1.50it/s]
Metadata Fetch Progress:  20%|██        | 10/50 [00:12<00:30,  1.29it/s]
Metadata Fetch Progress:  22%|██▏       | 11/50 [00:13<00:22,  1.70it/s]
Metadata Fetch Progress:  26%|██▌       | 13/50 [00:14<00:19,  1.87it/s]
Metadata Fetch Progress:  28%|██▊       | 14/50 [00:14<00:17,  2.11it

[2m[36m(train_xgboost_remote pid=45381)[0m Total time taken: 523.420551776886
[2m[36m(train_xgboost_remote pid=45381)[0m Final validation error: 0.1488


# 2. Hyperparameter Tuning

[Ray Tune](https://docs.ray.io/en/latest/tune/index.html) will launch distributed HPO with 8 trials (`num_samples=8`), each running its own instance of the trainable function `train_xgboost_remote` defined above with 8 remote training actors.

<img src="https://raw.githubusercontent.com/dmatrix/ray-core-tutorial/main/images/ray_tune_dist_hpo.png" width="40%" height="30%">
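
Each trial draws one configuration from the search space used in the tuning cell of this section: `eta` log-uniform between 1e-4 and 1e-1, `subsample` uniform between 0.5 and 1.0, and `max_depth` an integer from 1 up to but excluding 9. The sketch below mimics that sampling in plain Python to show what the distributions mean; it is not Ray Tune's sampler, and `sample_config` is a hypothetical name.

```python
import math
import random


def sample_config(rng):
    """Draw one hyperparameter configuration from the search space."""
    return {
        # Log-uniform: uniform in log space, so small values are well covered
        "eta": math.exp(rng.uniform(math.log(1e-4), math.log(1e-1))),
        "subsample": rng.uniform(0.5, 1.0),
        # randrange excludes the upper bound, giving depths 1 through 8
        "max_depth": rng.randrange(1, 9),
    }


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(8)]  # one per trial
for cfg in configs:
    print(cfg)
```

Sampling the learning rate in log space spreads trials evenly across orders of magnitude instead of concentrating them near the upper bound.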

# Weights & Biases integration
We add the `WandbLoggerCallback` to the `callbacks` argument of `tune.run` to log all of our experiment metrics to Weights & Biases (W&B).

In [17]:
from ray import tune
import os
from ray.tune.integration.wandb import WandbLoggerCallback

project_name = f"XGBoost-Tune-Experiment-demo"
api_key = "xxxxxxxxxxxx" # TODO: change this if you have your own API key
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"

# Set XGBoost config.
config = {
    "tree_method": "approx",
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
    "eta": tune.loguniform(1e-4, 1e-1),
    "subsample": tune.uniform(0.5, 1.0),
    "max_depth": tune.randint(1, 9)
}

ray_params = RayParams(
    max_actor_restarts=1,
    gpus_per_actor=0,
    cpus_per_actor=2,
    num_actors=8)

analysis = tune.run(
    tune.with_parameters(train_xgboost_remote, files=s3_data_files_300MB, ray_params=ray_params, progress_bar=False),
    # Use the `get_tune_resources` helper function to set the resources.
    resources_per_trial=ray_params.get_tune_resources(),
    config=config,
    num_samples=8,
    metric="eval-error",
    mode="min",
    verbose=1,
    callbacks=[WandbLoggerCallback(
            project= project_name,
            api_key= api_key,
            log_config=True)]
    )

accuracy = 1. - analysis.best_result["eval-error"]
print(f"Best model parameters: {analysis.best_config}")
print(f"Best model total accuracy: {accuracy:.4f}")
print(analysis.best_config)
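
The status output of this run reports 17.0 CPUs requested while one trial is running and 102.0 while six are running, out of 136 CPUs in the cluster. These figures are consistent with each trial reserving `num_actors * cpus_per_actor` CPUs for its training actors plus one CPU for the trial itself. The extra CPU is inferred from the log, not confirmed from the library documentation, and `cpus_per_trial` and `driver_cpus` are hypothetical names.

```python
def cpus_per_trial(num_actors, cpus_per_actor, driver_cpus=1):
    """CPUs reserved by one trial: training actors plus the trial driver.

    driver_cpus=1 is an assumption inferred from the status log.
    """
    return num_actors * cpus_per_actor + driver_cpus


CLUSTER_CPUS = 136  # taken from the status output

per_trial = cpus_per_trial(num_actors=8, cpus_per_actor=2)
six_trials = 6 * per_trial
max_concurrent = CLUSTER_CPUS // per_trial

print(per_trial, six_trials, max_concurrent)
```

Under this assumption the cluster has room for all eight trials at once, so the two pending trials in the log are waiting on scheduling rather than on CPU capacity.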



[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:08:26 (running for 00:00:00.39)
[2m[36m(run pid=5676)[0m Memory usage on this node: 12.1/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 17.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (7 PENDING, 1 RUNNING)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m wandb: Currently logged in as: sublimotion (use `wandb login --relogin` to force relogin)
[2m[36m(run pid=5676)[0m wandb: wandb version 0.12.9 is available!  To upgrade, please run:
[2m[36m(run pid=5676)[0m wandb:  $ pip install wandb --upgrade
[2m[36m(run pid=5676)[0m wandb: Currently logged in as: sublimotion (use `wandb login --relogin` to force relogin)
[2m[36m(run pid=5676)[0m wandb: Currently logged in as: sublimotion (use `wandb login --relogin` to force relogin)
[2m[36m(run pid=5676)[0m wandb: Currently logged in as: sublimotion (use `wandb login --relogin` to force relogin)
[2m[36m(run pid=5676)[0m wandb: wandb version 0.12.9 is available!  To upgrade, please run:
[2m[36m(run pid=5676)[0m wandb:  $ pip install wandb --upgrade
[2m[36m(run pid=5676)[0m wandb: Currently logged in as: sublimotion (use `wandb login --relogin` to force relogin)
[2m[36m(run pid=5676)[0m wandb: wandb version 0.12.9 is available!  To upgrade, pleas

[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:08:31 (running for 00:00:05.25)
[2m[36m(run pid=5676)[0m Memory usage on this node: 12.5/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 102.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (2 PENDING, 6 RUNNING)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m E0125 08:08:32.348522237   20921 chttp2_transport.cc:1103]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
[2m[36m(run pid=5676)[0m wandb: Tracking run with wandb version 0.12.5
[2m[36m(run pid=5676)[0m wandb: Syncing run train_xgboost_remote_8b1d5_00000
[2m[36m(run pid=5676)[0m wandb:  View project at https://wandb.ai/sublimotion/XGBoost-Tune-Experiment-demo
[2m[36m(run pid=5676)[0m wandb:  View run at https://wandb.ai/sublimotion/XGBoost-Tune-Experiment-demo/runs/8b1d5_00000
[2m[36m(run pid=5676)[0m wandb: Run data is saved locally in /tmp/ray/session_2022-01-25_07-11-21_216444_163/runtime_resources/working_dir_files/_ray_pkg_226d4ea0735c6585/wandb/run-20220125_080827-8b1d5_00000
[2m[36m(run pid=5676)[0m wandb: Run `wandb offline` to turn off syncing.


[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m wandb: Tracking run with wandb version 0.12.5
[2m[36m(run pid=5676)[0m wandb: Syncing run train_xgboost_remote_8b1d5_00002
[2m[36m(run pid=5676)[0m wandb:  View project at https://wandb.ai/sublimotion/XGBoost-Tune-Experiment-demo
[2m[36m(run pid=5676)[0m wandb:  View run at https://wandb.ai/sublimotion/XGBoost-Tune-Experiment-demo/runs/8b1d5_00002
[2m[36m(run pid=5676)[0m wandb: Run data is saved locally in /tmp/ray/session_2022-01-25_07-11-21_216444_163/runtime_resources/working_dir_files/_ray_pkg_226d4ea0735c6585/wandb/run-20220125_080827-8b1d5_00002
[2m[36m(run pid=5676)[0m wandb: Run `wandb offline` to turn off syncing.
[2m[36m(run pid=5676)[0m wandb: Tracking run with wandb version 0.12.5
[2m[36m(run pid=5676)[0m wandb: Syncing run train_xgboost_remote_8b1d5_00001
[2m[36m(run pid=5676)[0m wandb:  View project at https://wandb.ai/sublimotion/XGBoost-Tune-Experiment-demo
[2m[36m(run pid=5676)[0m wandb:  View run at https://wandb

[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:08:37 (running for 00:00:11.17)
[2m[36m(run pid=5676)[0m Memory usage on this node: 12.9/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 102.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (2 PENDING, 6 RUNNING)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m *** SIGSEGV received at time=1643126922 on cpu 1 ***
[2m[36m(run pid=5676)[0m PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m     @     0x7ff2b647b980  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m [2022-01-25 08:08:42,350 E 20810 20918] logging.cc:317: *** SIGSEGV received at time=1643126922 on cpu 1 ***
[2m[36m(run pid=5676)[0m [2022-01-25 08:08:42,350 E 20810 20918] logging.cc:317: PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m [2022-01-25 08:08:42,351 E 20810 20918] logging.cc:317:     @     0x7ff2b647b980  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m Fatal Python error: Segmentation fault
[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:08:42 (running for 00:00:16.18)
[2m[36m(run pid=5676)[0m Memory usage on this node: 12.6/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 102.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (2 PENDING, 6 RUNNING)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:08:47 (running for 00:00:21.20)
[2m[36m(run pid=5676)[0m Memory usage on this node: 12.6/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 102.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[3

[2m[36m(run pid=5676)[0m *** SIGSEGV received at time=1643126932 on cpu 3 ***
[2m[36m(run pid=5676)[0m PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m     @     0x7ff2b647b980  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m [2022-01-25 08:08:52,350 E 20902 20927] logging.cc:317: *** SIGSEGV received at time=1643126932 on cpu 3 ***
[2m[36m(run pid=5676)[0m [2022-01-25 08:08:52,351 E 20902 20927] logging.cc:317: PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m [2022-01-25 08:08:52,351 E 20902 20927] logging.cc:317:     @     0x7ff2b647b980  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m Fatal Python error: Segmentation fault
[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:08:52 (running for 00:00:26.22)
[2m[36m(run pid=5676)[0m Memory usage on this node: 12.6/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 102.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (2 PENDING, 6 RUNNING)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:08:57 (running for 00:00:31.33)
[2m[36m(run pid=5676)[0m Memory usage on this node: 13.1/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 102.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[3

Repartition:   0%|          | 0/32 [00:00<?, ?it/s]
Repartition:  72%|███████▏  | 23/32 [00:00<00:00, 220.55it/s]
Repartition:   0%|          | 0/32 [00:00<?, ?it/s]
Repartition:   0%|          | 0/32 [00:00<?, ?it/s]
Repartition: 100%|██████████| 32/32 [00:00<00:00, 220.93it/s]
Shuffle Map:   0%|          | 0/10 [00:00<?, ?it/s]
Repartition:  31%|███▏      | 10/32 [00:00<00:00, 89.39it/s]
Repartition: 100%|██████████| 32/32 [00:00<00:00, 196.67it/s]
Shuffle Map:   0%|          | 0/10 [00:00<?, ?it/s]
Shuffle Map:  20%|██        | 2/10 [00:00<00:00, 13.04it/s]
Repartition:  66%|██████▌   | 21/32 [00:00<00:00, 97.52it/s]
Shuffle Map:  10%|█         | 1/10 [00:00<00:00,  9.93it/s]
Shuffle Map:  40%|████      | 4/10 [00:00<00:00, 14.29it/s]
Repartition: 100%|██████████| 32/32 [00:00<00:00, 102.51it/s]
Shuffle Map:   0%|          | 0/10 [00:00<?, ?it/s]
Shuffle Map:  70%|███████   | 7/10 [00:00<00:00, 19.89it/s]
Shuffle Map:  10%|█         | 1/10 [00:00<00:01,  8.60it/s]
Repartition:   0%|

[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:09:07 (running for 00:00:41.37)
[2m[36m(run pid=5676)[0m Memory usage on this node: 12.6/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 102.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (2 PENDING, 6 RUNNING)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 


Repartition: 100%|██████████| 8/8 [00:00<00:00, 23.95it/s]
Repartition:   0%|          | 0/8 [00:00<?, ?it/s]
Repartition:   0%|          | 0/32 [00:00<?, ?it/s]
Repartition:   0%|          | 0/8 [00:00<?, ?it/s] 
Repartition: 100%|██████████| 8/8 [00:00<00:00, 127.21it/s]
Repartition:  56%|█████▋    | 18/32 [00:00<00:00, 148.71it/s]
Repartition:  88%|████████▊ | 7/8 [00:00<00:00, 66.97it/s]
Repartition:   0%|          | 0/8 [00:00<?, ?it/s] 
Repartition: 100%|██████████| 32/32 [00:00<00:00, 134.46it/s]
Shuffle Map:   0%|          | 0/10 [00:00<?, ?it/s]
Repartition: 100%|██████████| 8/8 [00:00<00:00, 39.36it/s]
Repartition: 100%|██████████| 8/8 [00:00<00:00, 109.83it/s]
Shuffle Map:  10%|█         | 1/10 [00:00<00:01,  8.04it/s]
[2m[36m(ImplicitFunc pid=592, ip=10.0.0.176)[0m 2022-01-25 08:09:08,778	INFO main.py:979 -- [RayXGBoost] Created 8 new actors (8 total actors). Waiting until actors are ready for training.
Shuffle Map:  30%|███       | 3/10 [00:00<00:00, 12.81it/s]
[2m[36

[2m[36m(_shuffle_reduce pid=637, ip=10.0.0.199)[0m 
[2m[36m(run pid=5676)[0m 2022-01-25 08:09:10,700	INFO commands.py:292 -- Checking External environment settings


Repartition:  72%|███████▏  | 23/32 [00:00<00:00, 204.77it/s]
Repartition: 100%|██████████| 32/32 [00:00<00:00, 208.82it/s]
Shuffle Map:   0%|          | 0/10 [00:00<?, ?it/s]
Shuffle Map:  10%|█         | 1/10 [00:00<00:01,  7.54it/s]
Shuffle Map: 100%|██████████| 10/10 [00:00<00:00, 42.92it/s]
Shuffle Reduce:   0%|          | 0/10 [00:00<?, ?it/s]
Repartition: 100%|██████████| 8/8 [00:00<00:00, 186.28it/s]
Shuffle Reduce:  10%|█         | 1/10 [00:00<00:01,  8.10it/s]
Shuffle Reduce: 100%|██████████| 10/10 [00:00<00:00, 34.71it/s]
[2m[36m(ImplicitFunc pid=608, ip=10.0.0.199)[0m 2022-01-25 08:09:11,399	INFO main.py:979 -- [RayXGBoost] Created 8 new actors (8 total actors). Waiting until actors are ready for training.
Repartition:   0%|          | 0/8 [00:00<?, ?it/s]
[2m[36m(_RemoteRayXGBoostActor pid=597, ip=10.0.0.166)[0m [08:09:11] task [xgboost.ray]:140314864567776 got new rank 2
[2m[36m(_RemoteRayXGBoostActor pid=658, ip=10.0.0.166)[0m [08:09:11] task [xgboost.ray]:13977

[2m[36m(run pid=5676)[0m 2022-01-25 08:09:12,595	WARN util.py:141 -- The `worker_nodes` field is deprecated and will be ignored. Use `available_node_types` instead.


[2m[36m(ImplicitFunc pid=874, ip=10.0.0.65)[0m 2022-01-25 08:09:12,697	INFO main.py:979 -- [RayXGBoost] Created 8 new actors (8 total actors). Waiting until actors are ready for training.


[2m[36m(run pid=5676)[0m [1m[36mAuthenticating[0m
[2m[36m(run pid=5676)[0m Loaded Anyscale authentication token from variable.
[2m[36m(run pid=5676)[0m 


[2m[36m(_RemoteRayXGBoostActor pid=594, ip=10.0.0.14)[0m [08:09:15] task [xgboost.ray]:140375843993296 got new rank 3
[2m[36m(_RemoteRayXGBoostActor pid=603, ip=10.0.0.14)[0m [08:09:15] task [xgboost.ray]:140059199810048 got new rank 4
[2m[36m(_RemoteRayXGBoostActor pid=597, ip=10.0.0.14)[0m [08:09:15] task [xgboost.ray]:140276367636896 got new rank 2
[2m[36m(_RemoteRayXGBoostActor pid=6246)[0m [08:09:15] task [xgboost.ray]:140140036855168 got new rank 0
[2m[36m(_RemoteRayXGBoostActor pid=7286)[0m [08:09:15] task [xgboost.ray]:140386444357104 got new rank 1
[2m[36m(_RemoteRayXGBoostActor pid=602, ip=10.0.0.65)[0m [08:09:15] task [xgboost.ray]:139639722249040 got new rank 5
[2m[36m(_RemoteRayXGBoostActor pid=646, ip=10.0.0.65)[0m [08:09:15] task [xgboost.ray]:139738108487088 got new rank 7
[2m[36m(_RemoteRayXGBoostActor pid=643, ip=10.0.0.65)[0m [08:09:15] task [xgboost.ray]:139870442774688 got new rank 6
[2m[36m(ImplicitFunc pid=874, ip=10.0.0.65)[0m 2022-01-

[2m[36m(run pid=5676)[0m 2022-01-25 08:09:16,901	INFO command_runner.py:357 -- Fetched IP: 10.0.0.176
[2m[36m(run pid=5676)[0m 2022-01-25 08:09:16,901	INFO log_timer.py:25 -- NodeUpdater: ins_jWYB6jsj1QzcDXgxiShXUb5Y: Got IP  [LogTimer=576ms]


[2m[36m(run pid=5676)[0m *** SIGSEGV received at time=1643126957 on cpu 7 ***
[2m[36m(run pid=5676)[0m PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m     @     0x7ff2b647b980  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m [2022-01-25 08:09:17,351 E 20860 20926] logging.cc:317: *** SIGSEGV received at time=1643126957 on cpu 7 ***
[2m[36m(run pid=5676)[0m [2022-01-25 08:09:17,352 E 20860 20926] logging.cc:317: PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m [2022-01-25 08:09:17,352 E 20860 20926] logging.cc:317:     @     0x7ff2b647b980  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m Fatal Python error: Segmentation fault
[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:09:18 (running for 00:00:51.79)
[2m[36m(run pid=5676)[0m Memory usage on this node: 13.0/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 102.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Current best trial: 8b1d5_00004 with eval-error=0.30766 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0020346618470171484, 'subsample': 0.7909773018031685, 'max_depth': 3, 'nthread': 2, 'n_jobs': 2}
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (2 PENDING, 6 RUNNING)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 2022-01-25 08:09:18,395	WARN commands.py:269 -- Loaded ca



[2m[36m(run pid=5676)[0m 2022-01-25 08:09:20,860	INFO command_runner.py:357 -- Fetched IP: 10.0.0.65
[2m[36m(run pid=5676)[0m 2022-01-25 08:09:20,860	INFO log_timer.py:25 -- NodeUpdater: ins_1K7MdWdebpK4i77CsekEKiLW: Got IP  [LogTimer=105ms]




[2m[36m(run pid=5676)[0m 2022-01-25 08:09:24,322	INFO command_runner.py:357 -- Fetched IP: 10.0.0.199
[2m[36m(run pid=5676)[0m 2022-01-25 08:09:24,322	INFO log_timer.py:25 -- NodeUpdater: ins_2zfRTDth5fbbiYXk2tDPi2ru: Got IP  [LogTimer=86ms]
[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:09:25 (running for 00:00:58.84)
[2m[36m(run pid=5676)[0m Memory usage on this node: 13.0/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 102.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Current best trial: 8b1d5_00000 with eval-error=0.24240699999999998 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0036490762865544015, 'subsample': 0.5594855338871798, 'max_depth': 6, 'nthread': 2, 'n_jobs': 2}
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_resul



[2m[36m(run pid=5676)[0m 2022-01-25 08:09:28,590	INFO command_runner.py:357 -- Fetched IP: 10.0.0.13
[2m[36m(run pid=5676)[0m 2022-01-25 08:09:28,590	INFO log_timer.py:25 -- NodeUpdater: ins_ac6zgMX3nvXgY9XpRJgDwwuh: Got IP  [LogTimer=531ms]


[2m[36m(run pid=5676)[0m *** SIGSEGV received at time=1643126972 on cpu 3 ***
[2m[36m(run pid=5676)[0m PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m     @     0x7ff2b647b980  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m [2022-01-25 08:09:32,350 E 20773 20917] logging.cc:317: *** SIGSEGV received at time=1643126972 on cpu 3 ***
[2m[36m(run pid=5676)[0m [2022-01-25 08:09:32,350 E 20773 20917] logging.cc:317: PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m [2022-01-25 08:09:32,350 E 20773 20917] logging.cc:317:     @     0x7ff2b647b980  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m Fatal Python error: Segmentation fault
[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m 2022-01-25 08:09:33,002	INFO command_runner.py:357 -- Fetched IP: 10.0.0.85
[2m[36m(run pid=5676)[0m 2022-01-25 08:09:33,003	INFO log_timer.py:25 -- NodeUpdater: ins_U8mQnSeXzt6f6GwrquWfd36S: Got IP  [LogTimer=208ms]




[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:09:34 (running for 00:01:07.53)
[2m[36m(run pid=5676)[0m Memory usage on this node: 13.1/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 102.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Current best trial: 8b1d5_00000 with eval-error=0.24240699999999998 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0036490762865544015, 'subsample': 0.5594855338871798, 'max_depth': 6, 'nthread': 2, 'n_jobs': 2}
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (2 PENDING, 6 RUNNING)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 2022-01-25 08:09:36,517	INFO command_runner.p

[2m[36m(run pid=5676)[0m *** SIGSEGV received at time=1643126977 on cpu 5 ***
[2m[36m(run pid=5676)[0m PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m     @     0x7ff2b647b980  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m [2022-01-25 08:09:37,350 E 20801 20922] logging.cc:317: *** SIGSEGV received at time=1643126977 on cpu 5 ***
[2m[36m(run pid=5676)[0m [2022-01-25 08:09:37,350 E 20801 20922] logging.cc:317: PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m [2022-01-25 08:09:37,350 E 20801 20922] logging.cc:317:     @     0x7ff2b647b980  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m Fatal Python error: Segmentation fault
[2m[36m(run pid=5676)[0m 
[2m[36m(ImplicitFunc pid=1090, ip=10.0.0.13)[0m 2022-01-25 08:09:37,645	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=700,000 in 28.61 seconds (27.81 pure XGBoost training time).


[2m[36m(ImplicitFunc pid=1090, ip=10.0.0.13)[0m Total time taken: 70.41572999954224
[2m[36m(ImplicitFunc pid=1090, ip=10.0.0.13)[0m Final validation error: 0.2861
[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m wandb: Waiting for W&B process to finish, PID 20902... (success).
[2m[36m(run pid=5676)[0m *** SIGSEGV received at time=1643126982 on cpu 1 ***
[2m[36m(run pid=5676)[0m PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m     @     0x7ff2b647b980  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m [2022-01-25 08:09:42,350 E 20784 20848] logging.cc:317: *** SIGSEGV received at time=1643126982 on cpu 1 ***
[2m[36m(run pid=5676)[0m [2022-01-25 08:09:42,350 E 20784 20848] logging.cc:317: PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m [2022-01-25 08:09:42,350 E 20784 20848] logging.cc:317:     @     0x7ff2b647b980  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m Fatal Python error: Segmentation fault
[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:09:47 (running for 00:01:21.19)
[2m[36m(run pid=5676)[0m Memory usage on this node: 13.0/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 85.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Current best trial: 8b1d5_00001 with eval-error=0.17505 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0002752208184848541, 'subsample': 0.5922580193607261, 'max_depth': 8, 'nthread': 2, 'n_jobs': 2}
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (2 PENDING, 5 RUNNING, 1 TERMINATED)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 
[2m[36m(ImplicitFunc pid=592, ip=10.0.0.176)[0m Total time taken: 80.

[2m[36m(ImplicitFunc pid=757, ip=10.0.0.247)[0m 2022-01-25 08:09:47,813	INFO main.py:1099 -- Training in progress (38 seconds since last restart).
[2m[36m(ImplicitFunc pid=592, ip=10.0.0.176)[0m 2022-01-25 08:09:47,821	INFO main.py:1099 -- Training in progress (38 seconds since last restart).
[2m[36m(ImplicitFunc pid=592, ip=10.0.0.176)[0m 2022-01-25 08:09:47,827	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=700,000 in 39.09 seconds (38.43 pure XGBoost training time).
[2m[36m(ImplicitFunc pid=757, ip=10.0.0.247)[0m 2022-01-25 08:09:47,837	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=700,000 in 38.98 seconds (38.41 pure XGBoost training time).
[2m[36m(run pid=5676)[0m wandb: Currently logged in as: sublimotion (use `wandb login --relogin` to force relogin)
[2m[36m(run pid=5676)[0m wandb: wandb version 0.12.9 is available!  To upgrade, please run:
[2m[36m(run pid=5676)[0m wandb: 

[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m wandb: Tracking run with wandb version 0.12.5
[2m[36m(run pid=5676)[0m wandb: Syncing run train_xgboost_remote_8b1d5_00006
[2m[36m(run pid=5676)[0m wandb:  View project at https://wandb.ai/sublimotion/XGBoost-Tune-Experiment-demo
[2m[36m(run pid=5676)[0m wandb:  View run at https://wandb.ai/sublimotion/XGBoost-Tune-Experiment-demo/runs/8b1d5_00006
[2m[36m(run pid=5676)[0m wandb: Run data is saved locally in /tmp/ray/session_2022-01-25_07-11-21_216444_163/runtime_resources/working_dir_files/_ray_pkg_226d4ea0735c6585/wandb/run-20220125_080947-8b1d5_00006
[2m[36m(run pid=5676)[0m wandb: Run `wandb offline` to turn off syncing.


[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:09:57 (running for 00:01:31.38)
[2m[36m(run pid=5676)[0m Memory usage on this node: 13.3/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 85.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Current best trial: 8b1d5_00001 with eval-error=0.173947 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0002752208184848541, 'subsample': 0.5922580193607261, 'max_depth': 8, 'nthread': 2, 'n_jobs': 2}
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (1 PENDING, 5 RUNNING, 2 TERMINATED)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m wandb: Currently logged in as: sublimotion (use `wandb login --relogin` to force relogin)
[2m[36m(run pid=5676)[0m wandb: wandb version 0.12.9 is available!  To upgrade, please run:
[2m[36m(run pid=5676)[0m wandb:  $ pip install wandb --upgrade


[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m wandb: Tracking run with wandb version 0.12.5
[2m[36m(run pid=5676)[0m wandb: Syncing run train_xgboost_remote_8b1d5_00007
[2m[36m(run pid=5676)[0m wandb:  View project at https://wandb.ai/sublimotion/XGBoost-Tune-Experiment-demo
[2m[36m(run pid=5676)[0m wandb:  View run at https://wandb.ai/sublimotion/XGBoost-Tune-Experiment-demo/runs/8b1d5_00007
[2m[36m(run pid=5676)[0m wandb: Run data is saved locally in /tmp/ray/session_2022-01-25_07-11-21_216444_163/runtime_resources/working_dir_files/_ray_pkg_226d4ea0735c6585/wandb/run-20220125_080958-8b1d5_00007
[2m[36m(run pid=5676)[0m wandb: Run `wandb offline` to turn off syncing.


[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:10:07 (running for 00:01:41.45)
[2m[36m(run pid=5676)[0m Memory usage on this node: 13.6/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 85.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Current best trial: 8b1d5_00001 with eval-error=0.173947 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0002752208184848541, 'subsample': 0.5922580193607261, 'max_depth': 8, 'nthread': 2, 'n_jobs': 2}
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (5 RUNNING, 3 TERMINATED)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 
[2m[36m(ImplicitFunc pid=608, ip=10.0.0.199)[0m Total time taken: 100.883468627

[2m[36m(ImplicitFunc pid=617, ip=10.0.0.85)[0m 2022-01-25 08:10:08,001	INFO main.py:1099 -- Training in progress (58 seconds since last restart).
[2m[36m(ImplicitFunc pid=617, ip=10.0.0.85)[0m 2022-01-25 08:10:08,006	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=700,000 in 59.21 seconds (58.06 pure XGBoost training time).
[2m[36m(ImplicitFunc pid=608, ip=10.0.0.199)[0m 2022-01-25 08:10:07,991	INFO main.py:1099 -- Training in progress (56 seconds since last restart).
[2m[36m(ImplicitFunc pid=608, ip=10.0.0.199)[0m 2022-01-25 08:10:07,998	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=700,000 in 56.66 seconds (56.26 pure XGBoost training time).


[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m wandb: Waiting for W&B process to finish, PID 20860... (success).
Repartition: 100%|██████████| 32/32 [00:00<00:00, 203.09it/s]


[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:10:18 (running for 00:01:51.56)
[2m[36m(run pid=5676)[0m Memory usage on this node: 13.6/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 68.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Current best trial: 8b1d5_00001 with eval-error=0.173947 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0002752208184848541, 'subsample': 0.5922580193607261, 'max_depth': 8, 'nthread': 2, 'n_jobs': 2}
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (4 RUNNING, 4 TERMINATED)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 


Shuffle Map: 100%|██████████| 10/10 [00:00<00:00, 19.13it/s]
Shuffle Reduce: 100%|██████████| 10/10 [00:02<00:00,  4.28it/s]
[2m[36m(run pid=5676)[0m *** SIGSEGV received at time=1643127022 on cpu 0 ***
[2m[36m(run pid=5676)[0m PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m     @     0x7ff2b647b980  (unknown)  (unknown)
[2m[36m(run pid=5676)[0m [2022-01-25 08:10:22,350 E 20792 20924] logging.cc:317: *** SIGSEGV received at time=1643127022 on cpu 0 ***
[2m[36m(run pid=5676)[0m [2022-01-25 08:10:22,350 E 20792 20924] logging.cc:317: PC: @     0x7ff2ae3430a1  (unknown)  (unknown)
[2m[36m(run pid=5676)[

[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:10:28 (running for 00:02:01.60)
[2m[36m(run pid=5676)[0m Memory usage on this node: 13.5/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 51.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Current best trial: 8b1d5_00001 with eval-error=0.173947 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0002752208184848541, 'subsample': 0.5922580193607261, 'max_depth': 8, 'nthread': 2, 'n_jobs': 2}
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (3 RUNNING, 5 TERMINATED)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 


Repartition: 100%|██████████| 32/32 [00:00<00:00, 140.71it/s]
Shuffle Map: 100%|██████████| 10/10 [00:00<00:00, 37.97it/s]
Shuffle Reduce: 100%|██████████| 10/10 [00:00<00:00, 54.35it/s]


[2m[36m(run pid=5676)[0m 2022-01-25 08:10:31,466	INFO command_runner.py:357 -- Fetched IP: 10.0.0.13
[2m[36m(run pid=5676)[0m 2022-01-25 08:10:31,466	INFO log_timer.py:25 -- NodeUpdater: ins_ac6zgMX3nvXgY9XpRJgDwwuh: Got IP  [LogTimer=483ms]


Repartition: 100%|██████████| 8/8 [00:00<00:00, 199.78it/s]
[2m[36m(ImplicitFunc pid=725, ip=10.0.0.247)[0m 2022-01-25 08:10:32,056	INFO main.py:979 -- [RayXGBoost] Created 8 new actors (8 total actors). Waiting until actors are ready for training.
[2m[36m(ImplicitFunc pid=874, ip=10.0.0.65)[0m 2022-01-25 08:10:32,288	INFO main.py:1099 -- Training in progress (77 seconds since last restart).
[2m[36m(ImplicitFunc pid=874, ip=10.0.0.65)[0m 2022-01-25 08:10:32,298	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=700,000 in 79.66 seconds (77.22 pure XGBoost training time).


[2m[36m(ImplicitFunc pid=874, ip=10.0.0.65)[0m Total time taken: 125.38759446144104
[2m[36m(ImplicitFunc pid=874, ip=10.0.0.65)[0m Final validation error: 0.2145


[2m[36m(ImplicitFunc pid=725, ip=10.0.0.247)[0m 2022-01-25 08:10:32,591	INFO main.py:1024 -- [RayXGBoost] Starting XGBoost training.
[2m[36m(_RemoteRayXGBoostActor pid=593, ip=10.0.0.14)[0m [08:10:32] task [xgboost.ray]:139967660168720 got new rank 1
[2m[36m(_RemoteRayXGBoostActor pid=617, ip=10.0.0.14)[0m [08:10:32] task [xgboost.ray]:139946725706384 got new rank 2
[2m[36m(_RemoteRayXGBoostActor pid=624, ip=10.0.0.14)[0m [08:10:32] task [xgboost.ray]:140694627156176 got new rank 0
[2m[36m(_RemoteRayXGBoostActor pid=612, ip=10.0.0.247)[0m [08:10:32] task [xgboost.ray]:140475920207152 got new rank 3
[2m[36m(_RemoteRayXGBoostActor pid=834, ip=10.0.0.247)[0m [08:10:32] task [xgboost.ray]:140587667979328 got new rank 5
[2m[36m(_RemoteRayXGBoostActor pid=864, ip=10.0.0.247)[0m [08:10:32] task [xgboost.ray]:139803864056880 got new rank 6
[2m[36m(_RemoteRayXGBoostActor pid=894, ip=10.0.0.247)[0m [08:10:32] task [xgboost.ray]:139988594528160 got new rank 7
[2m[36m(_Re

[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:10:42 (running for 00:02:15.83)
[2m[36m(run pid=5676)[0m Memory usage on this node: 13.1/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 34.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Current best trial: 8b1d5_00006 with eval-error=0.17334 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.04340709367317894, 'subsample': 0.8957024019504654, 'max_depth': 8, 'nthread': 2, 'n_jobs': 2}
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (2 RUNNING, 6 TERMINATED)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 




[2m[36m(run pid=5676)[0m 2022-01-25 08:10:45,913	INFO command_runner.py:357 -- Fetched IP: 10.0.0.247
[2m[36m(run pid=5676)[0m 2022-01-25 08:10:45,913	INFO log_timer.py:25 -- NodeUpdater: ins_w4xZksByUZcYpNtJ6zNQMs5t: Got IP  [LogTimer=568ms]


[2m[36m(ImplicitFunc pid=610, ip=10.0.0.13)[0m 2022-01-25 08:10:46,792	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=700,000 in 23.26 seconds (20.62 pure XGBoost training time).
[2m[36m(ImplicitFunc pid=725, ip=10.0.0.247)[0m 2022-01-25 08:10:46,787	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=700,000 in 14.77 seconds (14.19 pure XGBoost training time).


[2m[36m(ImplicitFunc pid=610, ip=10.0.0.13)[0m Total time taken: 58.88376712799072
[2m[36m(ImplicitFunc pid=610, ip=10.0.0.13)[0m Final validation error: 0.1610
[2m[36m(ImplicitFunc pid=725, ip=10.0.0.247)[0m Total time taken: 48.71080565452576
[2m[36m(ImplicitFunc pid=725, ip=10.0.0.247)[0m Final validation error: 0.3317
[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m wandb: Waiting for W&B process to finish, PID 22524... (success).
[2m[36m(run pid=5676)[0m wandb: - 0.00MB of 0.00MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: \ 0.00MB of 0.00MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: | 0.00MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: / 0.00MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: - 0.02MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: \ 0.02MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: | 0.02MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: / 0.02MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: - 0.02MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: \ 0.02MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: | 0.02MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0

[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:10:52 (running for 00:02:26.31)
[2m[36m(run pid=5676)[0m Memory usage on this node: 12.9/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 17.0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Current best trial: 8b1d5_00006 with eval-error=0.16095 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.04340709367317894, 'subsample': 0.8957024019504654, 'max_depth': 8, 'nthread': 2, 'n_jobs': 2}
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (1 RUNNING, 7 TERMINATED)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m wandb: Run history:
[2m[36m(run pid=5676)[0m wandb:                 config/eta ▁▁▁▁▁▁▁▁▁▁
[2m[36m(run pid=5676)[0m wandb:           config/max_depth ▁▁▁▁▁▁▁▁▁▁
[2m[36m(run pid=5676)[0m wandb:              config/n_jobs ▁▁▁▁▁▁▁▁▁▁
[2m[36m(run pid=5676)[0m wandb:             config/nthread ▁▁▁▁▁▁▁▁▁▁
[2m[36m(run pid=5676)[0m wandb:           config/subsample ▁▁▁▁▁▁▁▁▁▁
[2m[36m(run pid=5676)[0m wandb:                 eval-error █▂▇▆▅▅▅▅▆▁
[2m[36m(run pid=5676)[0m wandb:               eval-logloss ████▇▇▇▃▁▁
[2m[36m(run pid=5676)[0m wandb:   iterations_since_restore ▁▂▃▃▄▅▆▆▇█
[2m[36m(run pid=5676)[0m wandb:         time_since_restore ▁█████████
[2m[36m(run pid=5676)[0m wandb:           time_this_iter_s █▄▁▁▁▁▁▁▁▁
[2m[36m(run pid=5676)[0m wandb:               time_total_s ▁█████████
[2m[36m(run pid=5676)[0m wandb:                  timestamp ▁█████████
[2m[36m(run pid=5676)[0m wandb:    timesteps_since_restore ▁▁▁▁▁▁▁▁▁▁
[2

[2m[36m(run pid=5676)[0m 


[2m[36m(run pid=5676)[0m wandb: Waiting for W&B process to finish, PID 22358... (success).
[2m[36m(run pid=5676)[0m wandb: - 0.00MB of 0.00MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: \ 0.00MB of 0.00MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: | 0.00MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: / 0.00MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: - 0.02MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: \ 0.02MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: | 0.02MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: / 0.02MB of 0.02MB uploaded (0.00MB deduped)
[2m[36m(run pid=5676)[0m wandb: - 0.02MB of 0.02MB uploaded (0.00MB deduped)
wandb:                                                                                


[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:10:58 (running for 00:02:31.86)
[2m[36m(run pid=5676)[0m Memory usage on this node: 12.6/30.9 GiB
[2m[36m(run pid=5676)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=5676)[0m Resources requested: 0/136 CPUs, 0/0 GPUs, 0.0/376.72 GiB heap, 0.0/156.92 GiB objects
[2m[36m(run pid=5676)[0m Current best trial: 8b1d5_00006 with eval-error=0.16095 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.04340709367317894, 'subsample': 0.8957024019504654, 'max_depth': 8, 'nthread': 2, 'n_jobs': 2}
[2m[36m(run pid=5676)[0m Result logdir: /home/ray/ray_results/train_xgboost_remote_2022-01-25_08-04-58
[2m[36m(run pid=5676)[0m Number of trials: 8/8 (8 TERMINATED)
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m 
[2m[36m(run pid=5676)[0m == Status ==
[2m[36m(run pid=5676)[0m Current time: 2022-01-25 08:10

[2m[36m(run pid=5676)[0m wandb: Run history:
[2m[36m(run pid=5676)[0m wandb:                 config/eta ▁▁▁▁▁▁▁▁▁▁
[2m[36m(run pid=5676)[0m wandb:           config/max_depth ▁▁▁▁▁▁▁▁▁▁
[2m[36m(run pid=5676)[0m wandb:              config/n_jobs ▁▁▁▁▁▁▁▁▁▁
[2m[36m(run pid=5676)[0m wandb:             config/nthread ▁▁▁▁▁▁▁▁▁▁
[2m[36m(run pid=5676)[0m wandb:           config/subsample ▁▁▁▁▁▁▁▁▁▁
[2m[36m(run pid=5676)[0m wandb:                 eval-error █▅▅▄▃▃▂▂▁▁
[2m[36m(run pid=5676)[0m wandb:               eval-logloss █▇▆▅▄▄▃▂▂▁
[2m[36m(run pid=5676)[0m wandb:   iterations_since_restore ▁▂▃▃▄▅▆▆▇█
[2m[36m(run pid=5676)[0m wandb:         time_since_restore ▁▃▃▃▃▃████
[2m[36m(run pid=5676)[0m wandb:           time_this_iter_s █▂▁▁▁▁▄▁▁▁
[2m[36m(run pid=5676)[0m wandb:               time_total_s ▁▃▃▃▃▃████
[2m[36m(run pid=5676)[0m wandb:                  timestamp ▁▃▃▃▃▃████
[2m[36m(run pid=5676)[0m wandb:    timesteps_since_restore ▁▁▁▁▁▁▁▁▁▁
[2

Best model parameters: {'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.04340709367317894, 'subsample': 0.8957024019504654, 'max_depth': 8}
Best model total accuracy: 0.8390
{'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.04340709367317894, 'subsample': 0.8957024019504654, 'max_depth': 8}


[2m[36m(run pid=5676)[0m --- Logging error ---
[2m[36m(run pid=5676)[0m Traceback (most recent call last):
[2m[36m(run pid=5676)[0m   File "/home/ray/anaconda3/lib/python3.8/logging/__init__.py", line 1085, in emit
[2m[36m(run pid=5676)[0m     self.flush()
[2m[36m(run pid=5676)[0m   File "/home/ray/anaconda3/lib/python3.8/logging/__init__.py", line 1065, in flush
[2m[36m(run pid=5676)[0m     self.stream.flush()
[2m[36m(run pid=5676)[0m OSError: [Errno 28] No space left on device
[2m[36m(run pid=5676)[0m Call stack:
[2m[36m(run pid=5676)[0m   File "/home/ray/anaconda3/lib/python3.8/threading.py", line 890, in _bootstrap
[2m[36m(run pid=5676)[0m     self._bootstrap_inner()
[2m[36m(run pid=5676)[0m   File "/home/ray/anaconda3/lib/python3.8/threading.py", line 932, in _bootstrap_inner
[2m[36m(run pid=5676)[0m     self.run()
[2m[36m(run pid=5676)[0m   File "/home/ray/anaconda3/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 52, i

# 3a. Inference (regular XGBoost)

In [19]:
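start_time = time.time()

# Load the 300MB dataset and drop the non-feature columns before prediction.
df = load_parquet_dataset(data_files_300MB).drop(["labels", "partition"], axis=1, errors="ignore")
# Wrap the features in an XGBoost DMatrix; `bst` is the booster trained earlier.
inference_df = xgb.DMatrix(df)
results = bst.predict(inference_df)

print(f"Total time taken: {time.time()-start_time}")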

100%|██████████| 1/1 [00:00<00:00,  1.31it/s]


Dataset size: 0.337000372 GB
Total time taken: 1.5613970756530762


In [20]:
results

array([0.7038215 , 0.18560845, 0.17597654, ..., 0.18545696, 0.87184376,
       0.30593213], dtype=float32)
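
The values in `results` are floats between 0 and 1. With the `binary:logistic` objective shown in the tuning output, XGBoost returns the predicted probability of the positive class rather than a class label. A minimal sketch of turning those probabilities into labels follows. The 0.5 threshold and the helper name `probabilities_to_labels` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np


def probabilities_to_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels (hypothetical helper)."""
    probabilities = np.asarray(probabilities)
    # Anything above the threshold is assigned to class 1, the rest to class 0.
    return (probabilities > threshold).astype(int)


# The first three values are copied from the `results` array printed above.
sample = np.array([0.7038215, 0.18560845, 0.17597654], dtype=np.float32)
labels = probabilities_to_labels(sample)
print(labels)
```

In the notebook itself, the same helper could be applied to the full array with `probabilities_to_labels(results)`.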

# 3b. Inference (XGBoost on Ray)


In [12]:
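@ray.remote
def xgb_batch_inference(bst, files):
    """Run distributed batch prediction with XGBoost-Ray inside a Ray task."""
    start_time = time.time()

    # RayDMatrix shards the input files across the prediction actors;
    # `ignore` drops the non-feature columns.
    inference_df = xgboost_ray.RayDMatrix(files, ignore=["labels", "partition"])
    results = xgboost_ray.predict(
        bst, inference_df, ray_params=xgboost_ray.RayParams(num_actors=8)
    )

    print(f"Total time taken: {time.time()-start_time}")
    return results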

In [13]:
batch_results = ray.get(xgb_batch_inference.remote(bst, s3_data_files_300MB))


[2m[36m(xgb_batch_inference pid=46848)[0m 2022-01-26 11:30:40,396	INFO main.py:1543 -- [RayXGBoost] Created 8 remote actors.
[2m[36m(xgb_batch_inference pid=46848)[0m 2022-01-26 11:30:53,481	INFO main.py:1560 -- [RayXGBoost] Starting XGBoost prediction.


[2m[36m(xgb_batch_inference pid=46848)[0m Total time taken: 13.341913938522339


In [14]:
batch_results

array([0.6322733 , 0.70809305, 0.36546   , ..., 0.17210634, 0.90754145,
       0.25357166], dtype=float32)

---

# References
 * [Introducing Distributed XGBoost Training with Ray](https://www.anyscale.com/blog/distributed-xgboost-training-with-ray)
 * [How to Speed Up XGBoost Model Training](https://www.anyscale.com/blog/how-to-speed-up-xgboost-model-training)
 * [Distributed XGBoost on Ray](https://docs.ray.io/en/latest/xgboost-ray.html)

