# Distributed Training of an XGBoost Model on Anyscale


<div align="left">
<a target="_blank" href="https://console.anyscale.com/"><img src="https://img.shields.io/badge/🚀 Run_on-Anyscale-9hf"></a>&nbsp;
<a href="https://github.com/anyscale/e2e-xgboost" role="button"><img src="https://img.shields.io/static/v1?label=&amp;message=View%20On%20GitHub&amp;color=586069&amp;logo=github&amp;labelColor=2f363d"></a>&nbsp;
</div>

In this tutorial, we'll run a distributed training workload that connects the following heterogeneous steps:
- Preprocessing the dataset with Ray Data
- Distributed training of an XGBoost model with Ray Train
- Saving model artifacts to a model registry (MLflow)

**Note**: We won't be tuning our model in this tutorial, but be sure to check out [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) for experiment execution and hyperparameter tuning at any scale.

<img src="https://raw.githubusercontent.com/anyscale/e2e-xgboost/refs/heads/main/images/distributed_training.png" width=800>


Let's start by setting up the notebook environment and initializing Ray:

In [1]:
%load_ext autoreload
%autoreload all

In [2]:
# enable importing from dist_xgboost module
import os
import sys

sys.path.append(os.path.abspath(".."))

In [3]:
# Enable Ray Train v2. This will be the default in an upcoming release.
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"
# Now it's safe to import from ray.train

# Enable uv on Ray
# https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#using-uv-for-package-management
os.environ.pop("RAY_RUNTIME_ENV_HOOK", None)

import ray

ray.init(runtime_env={"py_executable": "uv run", "working_dir": "."})

In [4]:
from dist_xgboost.constants import local_storage_path, preprocessor_path

In [5]:
# make ray data less verbose
ray.data.DataContext.get_current().enable_progress_bars = False
ray.data.DataContext.get_current().print_on_execution_start = False

## Dataset Preparation

For this example, we're using the ["Breast Cancer Wisconsin (Diagnostic)"](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) dataset, which contains features computed from digitized images of breast mass cell nuclei.

We'll split the data into:
- 70% for training
- 15% for validation
- 15% for testing
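
One way to get these proportions is with two successive splits: hold out 30% of the rows, then divide the held-out part in half. The short sketch below only checks the arithmetic; the variable names are illustrative and are not part of any library API.

```python
# Fractions produced by two successive splits (values taken from the list above).
first_test_size = 0.3   # share of all rows held out by the first split
second_test_size = 0.5  # share of the held-out rows that becomes the test set

train_frac = 1 - first_test_size
valid_frac = first_test_size * (1 - second_test_size)
test_frac = first_test_size * second_test_size

print(f"train={train_frac:.2f}, valid={valid_frac:.2f}, test={test_frac:.2f}")
```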

In [6]:
from ray.data import Dataset


def prepare_data() -> tuple[Dataset, Dataset, Dataset]:
    """Load and split the dataset into train, validation, and test sets."""
    # Load the dataset from S3
    dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
    seed = 42

    # Split 70% for training
    train_dataset, rest = dataset.train_test_split(test_size=0.3, shuffle=True, seed=seed)
    # Split the remaining 30% into 15% validation and 15% testing
    valid_dataset, test_dataset = rest.train_test_split(test_size=0.5, shuffle=True, seed=seed)
    return train_dataset, valid_dataset, test_dataset

In [None]:
# Load and split the dataset
train_dataset, valid_dataset, _test_dataset = prepare_data()
train_dataset.take(1)

Looking at the output, we can see the dataset contains features characterizing cell nuclei in breast mass, such as radius, texture, perimeter, area, smoothness, compactness, concavity, symmetry, and more.

## Data Preprocessing

Notice that the features have different magnitudes and ranges. While tree-based models like XGBoost aren't as sensitive to this, feature scaling can still improve numerical stability in some cases.

Ray Data offers built-in preprocessors that simplify common feature preprocessing tasks, especially for tabular data. These can be seamlessly integrated with Ray Datasets, allowing you to preprocess your data in a fault-tolerant and distributed way.

In this example, we'll use Ray's built-in `StandardScaler` to zero-center and normalize the features:

In [None]:
from ray.data.preprocessors import StandardScaler

# Select all columns except the target for scaling
columns_to_scale = [c for c in train_dataset.columns() if c != "target"]

# Initialize the preprocessor
preprocessor = StandardScaler(columns=columns_to_scale)
# Fit the preprocessor on the training set only
# (this prevents data leakage)
preprocessor.fit(train_dataset)

Now that we've fit the preprocessor, we'll save it to a file. Later, we'll register this artifact in MLflow so we can reuse it in downstream pipelines:

In [9]:
import pickle

with open(preprocessor_path, "wb") as f:
    pickle.dump(preprocessor, f)
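
A downstream pipeline can restore the artifact with `pickle.load`. The round trip below is a minimal sketch that uses a plain dictionary as a stand-in for the fitted preprocessor; the statistics and the file name are made up for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted preprocessor (hypothetical statistics, illustration only).
fitted_stats = {"mean": {"mean radius": 14.1}, "std": {"mean radius": 3.5}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preprocessor.pkl")  # hypothetical file name

    # Save the object, as the cell above does for the real preprocessor.
    with open(path, "wb") as f:
        pickle.dump(fitted_stats, f)

    # Load it back, as a downstream pipeline would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == fitted_stats)
```

Only unpickle files that come from a source you trust, because loading a pickle can execute arbitrary code.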

Next, we'll transform our datasets using the fitted preprocessor. Note that the `transform()` operation is lazy - it won't be applied to the data until it's required by the train workers:

In [None]:
train_dataset = preprocessor.transform(train_dataset)
valid_dataset = preprocessor.transform(valid_dataset)
train_dataset.take(1)

Using `take()`, we can see that the values are now zero-centered and rescaled to unit variance, so most of them fall within a few units of zero.
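
As a rough illustration of what standardization does to a single column, the sketch below applies `(x - mean) / std` to a few made-up values. It assumes the population standard deviation; the exact estimator used by the library's scaler may differ.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 10.0]  # made-up column, not from the dataset

mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation (assumption)

# Standardize: subtract the mean, then divide by the standard deviation.
scaled = [(v - mean) / std for v in values]

print([round(s, 3) for s in scaled])
```

The scaled column has zero mean and unit standard deviation, but individual values (here the outlier `10.0`) can still land outside the range -1 to 1.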

> **Data Processing Note**:  
> For more advanced data loading and preprocessing techniques, check out the [comprehensive guide](https://docs.ray.io/en/latest/train/user-guides/data-loading-preprocessing.html). Ray Data also supports performant joins, filters, aggregations, and other operations that more structured data processing workloads may require.

## Model Training with XGBoost

### Checkpointing Configuration

Checkpointing is a powerful feature that enables you to resume training from the last checkpoint in case of interruptions. This is particularly useful for long-running training sessions.

[`XGBoostTrainer`](https://docs.ray.io/en/latest/train/api/doc/ray.train.xgboost.XGBoostTrainer.html) implements checkpointing out of the box. We just need to configure [`CheckpointConfig`](https://docs.ray.io/en/latest/train/api/doc/ray.train.CheckpointConfig.html) to set the checkpointing frequency:

In [11]:
from ray.train import CheckpointConfig, Result, RunConfig, ScalingConfig

# Configure checkpointing to save progress during training
run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        # Checkpoint every 10 iterations
        checkpoint_frequency=10,
        # Only keep the latest checkpoint
        num_to_keep=1,
    ),
    ## For multi-node clusters, configure storage that is accessible
    ## across all worker nodes with `storage_path="s3://..."`
    storage_path=local_storage_path,
)

> **Note**: Once you enable checkpointing, you can follow [this guide](https://docs.ray.io/en/latest/train/user-guides/fault-tolerance.html) to enable fault tolerance.
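
To build intuition for how a checkpoint frequency and a retention limit interact, here is a toy simulation in plain Python. It is not Ray Train's implementation; the function name is hypothetical, and it assumes checkpoints are written on every multiple of the frequency and that the oldest ones are evicted first.

```python
from collections import deque


def simulate_checkpoints(num_iterations: int, frequency: int, num_to_keep: int) -> list[int]:
    """Return the iterations whose checkpoints remain at the end (toy model)."""
    # A bounded deque drops its oldest entry once it is full.
    kept = deque(maxlen=num_to_keep)
    for iteration in range(1, num_iterations + 1):
        if iteration % frequency == 0:
            kept.append(iteration)
    return list(kept)


print(simulate_checkpoints(num_iterations=35, frequency=10, num_to_keep=1))  # [30]
print(simulate_checkpoints(num_iterations=35, frequency=10, num_to_keep=2))  # [20, 30]
```

Keeping only the latest checkpoint minimizes storage use, while a larger retention limit leaves earlier states available to roll back to.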

### Training with XGBoost

The training parameters are passed as a dictionary, similar to the original [`xgboost.train()`](https://xgboost.readthedocs.io/en/stable/parameter.html) function:

In [12]:
import xgboost
from ray.train.xgboost import RayTrainReportCallback, XGBoostTrainer


def train_fn_per_worker(config: dict):
    """Training function that runs on each worker.

    This function:
    1. Gets the dataset shard for this worker
    2. Converts to pandas for XGBoost
    3. Separates features and labels
    4. Creates DMatrix objects
    5. Trains the model using distributed communication
    """
    # Get this worker's dataset shard
    train_ds, val_ds = (
        ray.train.get_dataset_shard("train"),
        ray.train.get_dataset_shard("validation"),
    )

    # Materialize the data and convert to pandas
    train_ds = train_ds.materialize().to_pandas()
    val_ds = val_ds.materialize().to_pandas()

    # Separate the labels from the features
    train_X, train_y = train_ds.drop("target", axis=1), train_ds["target"]
    eval_X, eval_y = val_ds.drop("target", axis=1), val_ds["target"]

    # Convert the data into DMatrix format for XGBoost
    dtrain = xgboost.DMatrix(train_X, label=train_y)
    deval = xgboost.DMatrix(eval_X, label=eval_y)

    # Do distributed data-parallel training
    # Ray Train sets up the necessary coordinator processes and
    # environment variables for workers to communicate with each other
    _booster = xgboost.train(
        config["xgboost_params"],
        dtrain=dtrain,
        evals=[(dtrain, "train"), (deval, "validation")],
        num_boost_round=10,
        # Handles metric logging and checkpointing
        callbacks=[RayTrainReportCallback()],
    )


# Parameters for the XGBoost model
model_config = {
    "xgboost_params": {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }
}

trainer = XGBoostTrainer(
    train_fn_per_worker,
    train_loop_config=model_config,
    # Register the data subsets
    datasets={"train": train_dataset, "validation": valid_dataset},
    # See the "Scaling Strategies" section below for more details
    scaling_config=ScalingConfig(
        # Number of workers for data parallelism.
        num_workers=5,
        # Set to True to use GPU acceleration
        use_gpu=True,
    ),
    run_config=run_config,
)

> **Ray Train Benefits**:
> 
> - **Multi-node orchestration**: Automatically handles multi-node, multi-GPU setup without manual SSH or hostfile configurations
> - **Built-in fault tolerance**: Supports automatic retry of failed workers and can continue from the last checkpoint
> - **Flexible training strategies**: Supports various parallelism strategies beyond just data parallel training
> - **Heterogeneous cluster support**: Define per-worker resource requirements and run on mixed hardware
> 
> Ray Train integrates with popular frameworks like PyTorch, TensorFlow, XGBoost, and more. For enterprise needs, [RayTurbo Train](https://docs.anyscale.com/rayturbo/rayturbo-train) offers additional features like elastic training, advanced monitoring, and performance optimization.
>
> <img src="https://raw.githubusercontent.com/anyscale/e2e-xgboost/refs/heads/main/images/train_integrations.png" width=500>

Now let's train our model:

In [None]:
result: Result = trainer.fit()
result

Ray Train returns a [`ray.train.Result`](https://docs.ray.io/en/latest/train/api/doc/ray.train.Result.html) object, which contains important properties such as metrics, checkpoint info, and error details:

In [None]:
metrics = result.metrics
metrics

Expected output (your values may differ):

```python
OrderedDict([('train-logloss', 0.05463397157248817),
             ('train-error', 0.00506329113924051),
             ('validation-logloss', 0.06741214815308066),
             ('validation-error', 0.01176470588235294)])
```

We see that Ray Train logged metrics based on the values we configured in `eval_metric` and `evals`.
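
For reference, the two metric types can be computed by hand from predicted probabilities. The sketch below uses made-up labels and predictions; the helper names are illustrative, and the 0.5 threshold follows the XGBoost definition of the binary `error` metric.

```python
import math


def logloss(labels: list[int], probs: list[float]) -> float:
    """Mean negative log-likelihood of the true labels."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(labels, probs))
    return -total / len(labels)


def error_rate(labels: list[int], probs: list[float], threshold: float = 0.5) -> float:
    """Fraction of rows whose thresholded prediction disagrees with the label."""
    wrong = sum((p > threshold) != bool(y) for y, p in zip(labels, probs))
    return wrong / len(labels)


labels = [1, 0, 1, 0]          # made-up ground truth
probs = [0.9, 0.2, 0.4, 0.1]   # made-up predicted probabilities of class 1

print(f"logloss={logloss(labels, probs):.4f}, error={error_rate(labels, probs):.2f}")
```

Here the third row is misclassified (0.4 is below the threshold while the label is 1), so the error is 0.25.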

We can also reconstruct the trained model from the checkpoint directory:

In [None]:
booster = RayTrainReportCallback.get_model(result.checkpoint)
booster

## Model Registry

Now that we've trained our model, let's save it to a model registry for future use. We'll use MLflow for this purpose, storing it in our [Anyscale user storage](https://docs.anyscale.com/configuration/storage/#user-storage). Ray also integrates with [other experiment trackers](https://docs.ray.io/en/latest/train/user-guides/experiment-tracking.html).

In [None]:
import shutil
from tempfile import TemporaryDirectory

import mlflow

from dist_xgboost.constants import (
    experiment_name,
    model_fname,
    model_registry,
    preprocessor_fname,
)

# clean up old runs
if os.path.isdir(model_registry):
    shutil.rmtree(model_registry)
# mlflow.delete_experiment(experiment_name)
os.makedirs(model_registry, exist_ok=True)


# create a model registry in our user storage
mlflow.set_tracking_uri(f"file:{model_registry}")

# create a new experiment and log metrics and artifacts
mlflow.set_experiment(experiment_name)
with mlflow.start_run(description="xgboost breast cancer classifier on all features"):
    mlflow.log_params(model_config)
    mlflow.log_metrics(metrics)

    # Selectively log just the preprocessor and model weights
    with TemporaryDirectory() as tmp_dir:
        shutil.copy(
            os.path.join(result.checkpoint.path, model_fname),
            os.path.join(tmp_dir, model_fname),
        )
        shutil.copy(
            preprocessor_path,
            os.path.join(tmp_dir, preprocessor_fname),
        )

        mlflow.log_artifacts(tmp_dir)

We can start the MLflow server to view our experiments:

`mlflow server -h 0.0.0.0 -p 8080 --backend-store-uri {model_registry}`

To view the dashboard, go to the **Overview tab** → **Open Ports** → `8080`.

<img src="https://raw.githubusercontent.com/anyscale/e2e-xgboost/refs/heads/main/images/mlflow.png" width=685>

You can also view the Ray Dashboard and Train workload dashboards:

<img src="https://raw.githubusercontent.com/anyscale/e2e-xgboost/refs/heads/main/images/train_metrics.png" width=700>

We can retrieve our best model from the registry:

In [None]:
from dist_xgboost.data import get_best_model_from_registry

best_model, artifacts_dir = get_best_model_from_registry()
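
The implementation of `get_best_model_from_registry` isn't shown here. As a purely hypothetical sketch of the kind of selection such a helper might perform, the snippet below picks the run with the lowest validation error from a list of made-up run records; the record layout, run IDs, and `select_best_run` are assumptions for illustration, not the project's code or the MLflow API.

```python
# Made-up run records; a real registry would supply these.
runs = [
    {"run_id": "run-a", "metrics": {"validation-error": 0.035}},
    {"run_id": "run-b", "metrics": {"validation-error": 0.012}},
    {"run_id": "run-c", "metrics": {"validation-error": 0.020}},
]


def select_best_run(runs: list[dict], metric: str = "validation-error") -> dict:
    """Return the run with the smallest value of the given metric."""
    return min(runs, key=lambda run: run["metrics"][metric])


best_run = select_best_run(runs)
print(best_run["run_id"])
```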

### Production Deployment with Anyscale Jobs

We can wrap our training workload as a production-grade [Anyscale Job](https://docs.anyscale.com/platform/jobs/) ([API ref](https://docs.anyscale.com/reference/job-api/)):

In [None]:
%%bash
# Production batch job
anyscale job submit --name=train-xgboost-breast-cancer-model \
  --containerfile="/home/ray/default/containerfile" \
  --working-dir="/home/ray/default" \
  --exclude="" \
  --max-retries=0 \
  -- python dist_xgboost/train.py

> **Note**: 
> - We're using a `containerfile` to define dependencies, but you could also use a pre-built image
> - You can specify compute requirements as a [compute config](https://docs.anyscale.com/configuration/compute-configuration/) or inline in a [job config](https://docs.anyscale.com/reference/job-api#job-cli)
> - When launched from a workspace without specifying compute, it defaults to the workspace's compute configuration

## Scaling Strategies

One of the key advantages of Ray Train is its ability to effortlessly scale your training workloads. By adjusting the [`ScalingConfig`](https://docs.ray.io/en/latest/train/api/doc/ray.train.ScalingConfig.html), you can optimize resource utilization and reduce training time.

### Scaling Examples

**Multi-node CPU Example** (4 nodes with 8 CPUs each):

```python
scaling_config = ScalingConfig(
    num_workers=4,
    resources_per_worker={"CPU": 8},
)
```

**Single-node multi-GPU Example** (1 node with 8 CPUs and 4 GPUs):

```python
scaling_config = ScalingConfig(
    num_workers=4,
    use_gpu=True,
)
```

**Multi-node multi-GPU Example** (4 nodes with 8 CPUs and 4 GPUs each):

```python
scaling_config = ScalingConfig(
    num_workers=16,
    use_gpu=True,
)
```

> **Important**: For multi-node clusters, you must specify a shared storage location (such as cloud storage or NFS) in the `run_config`. Using a local path will raise an error during checkpointing.
>
> ```python
> trainer = XGBoostTrainer(
>     ..., run_config=ray.train.RunConfig(storage_path="s3://...")
> )
> ```

### Worker Configuration Guidelines

The optimal number of workers depends on your workload and cluster setup:

- For **CPU-only training**, generally use one worker per node (XGBoost can leverage multiple CPUs with threading)
- For **multi-GPU training**, use one worker per GPU
- For **heterogeneous clusters**, consider the greatest common divisor of CPU counts
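
For the heterogeneous case, sizing each worker to the greatest common divisor of the per-node CPU counts lets workers pack evenly onto every node. The helper below is a small sketch of that arithmetic; `plan_cpu_workers` is a hypothetical name, and the node sizes are made up.

```python
import math
from functools import reduce


def plan_cpu_workers(cpus_per_node: list[int]) -> tuple[int, int]:
    """Return (num_workers, cpus_per_worker) for a mixed-size CPU cluster."""
    # The GCD is the largest worker size that divides every node's CPU count.
    cpus_per_worker = reduce(math.gcd, cpus_per_node)
    # Each node hosts as many workers as fit exactly.
    num_workers = sum(cpus // cpus_per_worker for cpus in cpus_per_node)
    return num_workers, cpus_per_worker


# Three nodes with 4, 8, and 12 CPUs.
print(plan_cpu_workers([4, 8, 12]))  # (6, 4)
```

The result maps onto `ScalingConfig(num_workers=6, resources_per_worker={"CPU": 4})`.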

### GPU Acceleration

To use GPUs for training:

1. Start one actor per GPU with `use_gpu=True`
2. Set GPU-compatible parameters (e.g., `device="cuda"` with `tree_method="hist"` in XGBoost 2.0 and later, or `tree_method="gpu_hist"` in older releases)
3. Divide CPUs evenly across actors on each machine

Example:

```python
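trainer = XGBoostTrainer(
    train_fn_per_worker,
    train_loop_config={
        "xgboost_params": {
            # GPU-specific parameters (XGBoost 2.0 and later;
            # older releases use tree_method="gpu_hist" instead)
            "device": "cuda",
            "tree_method": "hist",
            "eval_metric": ["logloss", "error"],
        }
    },
    datasets={"train": train_dataset, "validation": valid_dataset},
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=2,
        # Whether to use GPU acceleration.
        use_gpu=True,
    ),
    run_config=run_config,
)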
```
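
The "one worker per GPU, CPUs divided evenly" rule can be turned into a small calculation. The sketch below is a hypothetical helper (not part of Ray) that derives the worker count and per-worker resources from a homogeneous cluster shape.

```python
def gpu_worker_layout(num_nodes: int, cpus_per_node: int, gpus_per_node: int) -> dict:
    """One worker per GPU, with each node's CPUs split evenly among its workers."""
    if gpus_per_node < 1:
        raise ValueError("this sketch assumes at least one GPU per node")
    cpus_per_worker = cpus_per_node // gpus_per_node
    return {
        "num_workers": num_nodes * gpus_per_node,
        "resources_per_worker": {"CPU": cpus_per_worker, "GPU": 1},
    }


# 4 nodes with 8 CPUs and 4 GPUs each, as in the multi-node example above.
layout = gpu_worker_layout(num_nodes=4, cpus_per_node=8, gpus_per_node=4)
print(layout)
```

These values could then be passed to `ScalingConfig` through its `num_workers` and `resources_per_worker` arguments, together with `use_gpu=True`. In practice you may want to assign fewer CPUs per worker so that some remain free for data loading.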

For more advanced topics, explore:
- [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) for hyperparameter optimization
- [Ray Serve](https://docs.ray.io/en/latest/serve/index.html) for model deployment
- [Ray Data](https://docs.ray.io/en/latest/data/data.html) for more advanced data processing