# Model validation using offline batch inference

<div align="left">
<a target="_blank" href="https://console.anyscale.com/"><img src="https://img.shields.io/badge/🚀 Run_on-Anyscale-9hf"></a>&nbsp;
<a href="https://github.com/anyscale/e2e-xgboost" role="button"><img src="https://img.shields.io/static/v1?label=&amp;message=View%20On%20GitHub&amp;color=586069&amp;logo=github&amp;labelColor=2f363d"></a>&nbsp;
</div>

In this tutorial, we'll execute a batch inference workload that connects the following heterogeneous workloads:
- distributed read from cloud storage
- apply distributed preprocessing
- batch inference
- distributed aggregation of summary metrics

<img src="https://raw.githubusercontent.com/anyscale/e2e-xgboost/refs/heads/main/images/batch_inference.png" width=800>

The above figure illustrates how different chunks of data can be processed concurrently at various stages of the pipeline. This parallel execution maximizes resource utilization and throughput.

Note that this diagram is a simplification for various reasons:

* Backpressure mechanisms may throttle upstream operators to prevent overwhelming downstream stages
* Dynamic repartitioning often happens as data moves through the pipeline, changing block counts and sizes
* Available resources change as the cluster autoscales
* System failures may disrupt the clean sequential flow shown in the diagram
* Etc.
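
The backpressure point can be illustrated with a minimal sketch in plain Python, assuming a bounded queue between two stages. The names `BUFFER_BLOCKS`, `upstream`, and `downstream` are hypothetical and this is not how Ray Data implements backpressure internally; it only shows how a full buffer makes a fast producer wait for a slow consumer.

```python
import queue
import threading
import time

BUFFER_BLOCKS = 2  # hypothetical cap on blocks buffered between two stages
block_buffer = queue.Queue(maxsize=BUFFER_BLOCKS)
max_buffered = 0
processed = []


def upstream(n_blocks: int) -> None:
    """Fast producer: put() waits whenever the buffer is already full."""
    global max_buffered
    for block_id in range(n_blocks):
        block_buffer.put(block_id)
        max_buffered = max(max_buffered, block_buffer.qsize())
    block_buffer.put(None)  # sentinel marking the end of the stream


def downstream() -> None:
    """Slow consumer: drains blocks until it sees the sentinel."""
    while (block_id := block_buffer.get()) is not None:
        time.sleep(0.001)  # simulate slower downstream work
        processed.append(block_id)


threads = [
    threading.Thread(target=upstream, args=(10,)),
    threading.Thread(target=downstream),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(processed, max_buffered)
```

However fast the producer is, the number of buffered blocks never exceeds the cap, so memory use between the two stages stays bounded.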

<div class="alert alert-block alert"> <b> Ray Data Streaming Execution</b> 

❌ Traditional batch execution (non-streaming, like Spark without pipelining or SageMaker Batch Transform):
- Reads the entire dataset into memory or a persistent intermediate format.
- Only then starts applying transformations (like .map, .filter, etc).
- Higher memory pressure and startup latency.

✅ Streaming execution (Ray Data):
- Starts processing chunks ("blocks") as they are loaded (no need to wait for the entire dataset to load)
- Reduces memory footprint (lowering the risk of OOMs) and speeds up time to first output
- Increases resource utilization by reducing idle time
- Enables online-style inference pipelines with minimal latency

<img src="https://raw.githubusercontent.com/anyscale/e2e-xgboost/refs/heads/main/images/streaming.gif" width=700>

**Note**: Ray Data is not a real-time stream processing engine like Flink or Kafka Streams. Instead, it's batch processing with streaming execution, which is especially useful for iterative ML workloads, ETL pipelines, and preprocessing before training or inference. In fact, Ray typically has a [**2-17x throughput improvement**](https://www.anyscale.com/blog/offline-batch-inference-comparing-ray-apache-spark-and-sagemaker#-results-of-throughput-from-experiments) over solutions like Spark and SageMaker Batch Transform.
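
To make the difference concrete, here is a minimal sketch of streaming execution using plain Python generators. The functions `read_blocks` and `transform` are hypothetical stand-ins, not Ray Data APIs; the point is only that each block moves through the whole pipeline before the next one is read.

```python
from typing import Iterable, Iterator

events: list[str] = []  # records the order in which the stages run


def read_blocks(n_blocks: int) -> Iterator[list[int]]:
    """Yield small blocks one at a time instead of loading everything first."""
    for i in range(n_blocks):
        events.append(f"read {i}")
        yield [i * 10 + j for j in range(3)]


def transform(blocks: Iterable[list[int]]) -> Iterator[list[int]]:
    """Transform each block as soon as it arrives from the reader."""
    for i, block in enumerate(blocks):
        events.append(f"transform {i}")
        yield [x * 2 for x in block]


results = list(transform(read_blocks(3)))
print(events)
```

The recorded events interleave reads and transforms, which is why only a few blocks need to be in memory at any time. A non-streaming engine would instead record all reads before the first transform.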


In [None]:
%load_ext autoreload
%autoreload all

In [None]:
# Enable importing from dist_xgboost package
import os
import sys

sys.path.append(os.path.abspath(".."))

In [None]:
# Enable Ray Train v2
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"
# now it's safe to import from ray.train

In [None]:
# make Ray Data less verbose
import ray

ray.data.DataContext.get_current().enable_progress_bars = False
ray.data.DataContext.get_current().print_on_execution_start = False

### Data ingestion

We'll start by reading our data from a public cloud storage bucket.

<div class="alert alert-block alert"> <b> ✍️ Distributed READ/WRITE</b> 

Ray Data supports a wide range of data sources for both [loading](https://docs.ray.io/en/latest/data/loading-data.html) and [saving](https://docs.ray.io/en/latest/data/saving-data.html), from generic binary files in cloud storage to structured data formats used by modern data platforms. In this example, we'll read data from a public S3 bucket that we've prepared with our dataset. This `read` operation, like the `write` operation you'll see later, runs in a distributed fashion. As a result, the data is processed in parallel across the cluster and doesn't need to be loaded entirely into memory at once, making it scalable and memory-efficient.

## Validating our XGBoost model using Ray Data

In `notebooks/01-Distributed-Training.ipynb`, we trained an XGBoost model and stored it in our MLflow artifact storage. Now, we can use it to make predictions on our hold-out set.

First, let's load the test dataset using the same procedure as before:

In [None]:
from ray.data import Dataset


def prepare_data() -> tuple[Dataset, Dataset, Dataset]:
    """Load and split the dataset into train, validation, and test sets."""
    dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
    seed = 42
    train_dataset, rest = dataset.train_test_split(
        test_size=0.3, shuffle=True, seed=seed
    )
    # 15% for validation, 15% for testing
    valid_dataset, test_dataset = rest.train_test_split(
        test_size=0.5, shuffle=True, seed=seed
    )
    return train_dataset, valid_dataset, test_dataset


_, _, test_dataset = prepare_data()
test_dataset.take(1)

2025-04-09 13:46:00,301	INFO worker.py:1709 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
2025-04-09 13:46:03,084	INFO dataset.py:2679 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.


[{'mean radius': 14.9,
  'mean texture': 22.53,
  'mean perimeter': 102.1,
  'mean area': 685.0,
  'mean smoothness': 0.09947,
  'mean compactness': 0.2225,
  'mean concavity': 0.2733,
  'mean concave points': 0.09711,
  'mean symmetry': 0.2041,
  'mean fractal dimension': 0.06898,
  'radius error': 0.253,
  'texture error': 0.8749,
  'perimeter error': 3.466,
  'area error': 24.19,
  'smoothness error': 0.006965,
  'compactness error': 0.06213,
  'concavity error': 0.07926,
  'concave points error': 0.02234,
  'symmetry error': 0.01499,
  'fractal dimension error': 0.005784,
  'worst radius': 16.35,
  'worst texture': 27.57,
  'worst perimeter': 125.4,
  'worst area': 832.7,
  'worst smoothness': 0.1419,
  'worst compactness': 0.709,
  'worst concavity': 0.9019,
  'worst concave points': 0.2475,
  'worst symmetry': 0.2866,
  'worst fractal dimension': 0.1155,
  'target': 0}]
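
The two-stage split in `prepare_data` (30% held out, then halved) yields roughly 70/15/15 proportions. The arithmetic can be checked with a small plain-Python sketch; the `split` helper and the 1,000-row dataset are hypothetical, and Ray Data's own rounding on real datasets may differ slightly.

```python
import random


def split(rows: list[int], test_size: float, seed: int) -> tuple[list[int], list[int]]:
    """Shuffle a copy, then cut off the last `test_size` fraction as the second part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_size)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(1000))  # hypothetical 1,000-row dataset
train, rest = split(rows, test_size=0.3, seed=42)
valid, test = split(rest, test_size=0.5, seed=42)

print(len(train), len(valid), len(test))
```

Every row ends up in exactly one of the three subsets, so the test set stays disjoint from the data used for training and validation.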

<div class="alert alert-block alert"> <b>💡 Ray Data best practices</b>

- **trigger lazy execution**: we use [`take`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.take.html) to trigger the execution because Ray Data executes lazily, which helps reduce execution time and memory utilization. This means that an operation such as `take`, `count`, or `write` is needed to actually execute our workflow DAG.
- **`materialize` during development**: we use [`materialize`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.materialize.html) to execute and materialize our dataset into Ray's [shared-memory object store](https://docs.ray.io/en/latest/ray-core/objects.html). This way, we save a checkpoint at this point and future operations on our dataset can start from it (we won't rerun all operations on the dataset again from scratch). This is a convenient feature during development (especially in a stateful environment like Jupyter notebooks) because we can run from our saved checkpoints.
- **shuffling strategies**: we've chosen to shuffle our dataset because it's all ordered by class, so we'll randomly shuffle the ordering of input files before reading. Ray Data also provides an extensive list of [shuffling strategies](https://docs.ray.io/en/latest/data/shuffling-data.html) such as local shuffles, per-epoch shuffles, etc.
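
Lazy execution can be sketched with a toy class in plain Python. `LazyPipeline` is a hypothetical stand-in, not Ray Data's implementation: `map` only records a step, and nothing runs until `take` consumes results.

```python
from typing import Callable


class LazyPipeline:
    """Toy stand-in for a lazily executed dataset (not Ray's implementation)."""

    def __init__(self, rows: list[int]):
        self._rows = rows
        self._ops: list[Callable[[int], int]] = []

    def map(self, fn: Callable[[int], int]) -> "LazyPipeline":
        self._ops.append(fn)  # only record the step; nothing runs yet
        return self

    def take(self, n: int) -> list[int]:
        """Consuming results is what triggers the recorded steps."""
        out = []
        for row in self._rows[:n]:
            for fn in self._ops:
                row = fn(row)
            out.append(row)
        return out


calls = 0


def add_one(x: int) -> int:
    global calls
    calls += 1
    return x + 1


pipeline = LazyPipeline([1, 2, 3]).map(add_one).map(add_one)
calls_before_take = calls     # map() has not invoked add_one yet
first_two = pipeline.take(2)  # runs both steps on the first two rows only
print(calls_before_take, first_two, calls)
```

Because only the requested rows are processed, the third row is never touched, which is one way laziness saves work.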

We need to transform the input data in the same way we did during training. Here, we fetch the preprocessor from the artifact registry.

In [None]:
from dist_xgboost.data import get_best_model_from_registry
from dist_xgboost.constants import preprocessor_fname
import pickle


best_run, best_artifacts_dir = get_best_model_from_registry()

with open(os.path.join(best_artifacts_dir, preprocessor_fname), "rb") as f:
    preprocessor = pickle.load(f)

Next, we will define the transformation step in our Ray Data pipeline. We could do this on each item of the dataset individually using `.map()`, but that would be inefficient. Instead, we can optimize this step by processing entire batches of data at once and using vectorized operations.

For this, we will use Ray Data's [`map_batches`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html) method paired with the [`Preprocessor.transform_batch`](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessor.Preprocessor.transform_batch.html#ray.data.preprocessor.Preprocessor.transform_batch) method.
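
As a rough illustration of why batching helps, the plain-Python sketch below counts how often the user function is invoked per row versus per batch. The helpers `iter_batches`, `scale_row`, and `scale_batch` are hypothetical. In a real pipeline the gain also comes from vectorized operations inside each batch, which this sketch does not model.

```python
from typing import Iterator


def iter_batches(rows: list[float], batch_size: int) -> Iterator[list[float]]:
    """Yield consecutive slices of at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


call_count = {"row": 0, "batch": 0}


def scale_row(x: float) -> float:
    call_count["row"] += 1  # one invocation per row
    return x / 10


def scale_batch(batch: list[float]) -> list[float]:
    call_count["batch"] += 1  # one invocation per batch
    return [x / 10 for x in batch]


rows = [float(i) for i in range(2500)]

per_row = [scale_row(x) for x in rows]
per_batch = [y for batch in iter_batches(rows, 1000) for y in scale_batch(batch)]

print(call_count)
```

Both approaches produce the same values, but the batched version pays the per-call overhead 3 times instead of 2,500 times.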

In [None]:
def transform_with_preprocessor(batch_df):
    # The preprocessor does not know about the `target` column,
    # so we need to remove it temporarily then add it back
    target = batch_df.pop("target")
    transformed_features = preprocessor.transform_batch(batch_df)
    transformed_features["target"] = target
    return transformed_features


# Apply the transformation to each batch
test_dataset = test_dataset.map_batches(
    transform_with_preprocessor, batch_format="pandas", batch_size=1000
)
test_dataset.show(1)

{'mean radius': 0.28316742882286705, 'mean texture': 0.7523604985155783, 'mean perimeter': 0.48669896676905805, 'mean area': 0.1468368988849869, 'mean smoothness': 0.2369449961779808, 'mean compactness': 2.2320249533254315, 'mean concavity': 2.3565557480507966, 'mean concave points': 1.3269287289169034, 'mean symmetry': 0.7690746828912589, 'mean fractal dimension': 0.776477048769651, 'radius error': -0.5412079178215152, 'texture error': -0.6312569374264887, 'perimeter error': 0.3773420778631571, 'area error': -0.3322040297744336, 'smoothness error': -0.06207156278394198, 'compactness error': 1.8842687990223528, 'concavity error': 1.4011696908922329, 'concave points error': 1.6050000169029714, 'symmetry error': -0.6904479637201796, 'fractal dimension error': 0.644062455623064, 'worst radius': 0.06652007879384304, 'worst texture': 0.2951992750088203, 'worst perimeter': 0.6001736321032014, 'worst area': -0.04273924639797485, 'worst smoothness': 0.4222472095938187, 'worst compactness': 2.7

We now have our preprocessing pipeline defined and are ready to run batch inference to get the model predictions on the test set.

Let's load the model from the artifact registry:

In [None]:
from ray.train.xgboost import RayTrainReportCallback
from ray.train import Checkpoint

checkpoint = Checkpoint.from_directory(best_artifacts_dir)
model = RayTrainReportCallback.get_model(checkpoint)

Next, we actually run the inference step. We want to avoid repeatedly loading the model for each batch, so we define a reusable class that can recycle the XGBoost model for different batches.

In [None]:
import xgboost
import pandas as pd


class Predictor:
    def __init__(self, model):
        self.model = model

    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        # remove the target column for inference
        target = batch.pop("target")
        dmatrix = xgboost.DMatrix(batch)
        predictions = self.model.predict(dmatrix)

        results = pd.DataFrame({"prediction": predictions, "target": target})
        return results
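
The benefit of a stateful callable can be sketched in plain Python. Here `load_model`, `ToyPredictor`, and `predict_stateless` are hypothetical; unlike the `Predictor` above, which receives an already loaded model, this sketch loads the model inside `__init__` purely to make the loading cost visible.

```python
load_count = 0


def load_model() -> dict:
    """Hypothetical stand-in for an expensive model load."""
    global load_count
    load_count += 1
    return {"bias": 0.5}


class ToyPredictor:
    def __init__(self):
        self.model = load_model()  # runs once per instance (replica)

    def __call__(self, batch: list[float]) -> list[float]:
        return [x + self.model["bias"] for x in batch]


def predict_stateless(batch: list[float]) -> list[float]:
    model = load_model()  # repeated for every batch
    return [x + model["bias"] for x in batch]


batches = [[1.0, 2.0], [3.0], [4.0, 5.0]]

predictor = ToyPredictor()  # constructed once, reused for all batches
stateful_out = [predictor(b) for b in batches]
loads_stateful = load_count

stateless_out = [predict_stateless(b) for b in batches]
loads_stateless = load_count - loads_stateful

print(loads_stateful, loads_stateless)
```

The outputs are identical, but the class-based version loads the model once while the stateless function loads it once per batch.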

Now, we can parallelize inference across replicas of the model. As before, we speed up this stage by processing the data in batches.

In [None]:
test_predictions = test_dataset.map_batches(
    Predictor,
    fn_constructor_kwargs={"model": model},
    concurrency=4,  # Number of model replicas
    batch_format="pandas",
)

test_predictions.show(1)



{'prediction': 0.043412428349256516, 'target': 0}


Now that we have the predictions, it's time to evaluate the model's accuracy, precision, recall, and F1-score. For this, we need to calculate the number of true positives, true negatives, false positives, and false negatives across the test subset. We split this step into its own stage to give us the flexibility to run the previous stage on a different type of machine if we so wish.

In [None]:
from sklearn.metrics import confusion_matrix


def confusion_matrix_batch(batch, threshold=0.5):
    # apply a threshold to the predictions to get binary labels
    batch["prediction"] = (batch["prediction"] > threshold).astype(int)

    result = {}
    cm = confusion_matrix(batch["target"], batch["prediction"], labels=[0, 1])
    result["TN"] = cm[0, 0]
    result["FP"] = cm[0, 1]
    result["FN"] = cm[1, 0]
    result["TP"] = cm[1, 1]
    return pd.DataFrame(result, index=[0])


test_results = test_predictions.map_batches(
    confusion_matrix_batch, batch_format="pandas", batch_size=1000
)

Finally, we can aggregate the confusion matrix results from all the batches in order to get the global counts. For this, we can use Ray Data's built-in `sum()` aggregator. Note that this will finally materialize the dataset and execute all the previously declared lazy transformations.

In [None]:
# Sum all confusion matrix values across batches
cm_sums = test_results.sum(["TN", "FP", "FN", "TP"])

# Extract confusion matrix components
tn = cm_sums["sum(TN)"]
fp = cm_sums["sum(FP)"]
fn = cm_sums["sum(FN)"]
tp = cm_sums["sum(TP)"]

# Calculate metrics
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

metrics = {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}



In [None]:
print("Validation results:")
for key, value in metrics.items():
    print(f"{key}: {value:.4f}")


This should output something like:

```
Validation results:
precision: 0.9574
recall: 1.0000
f1: 0.9783
accuracy: 0.9767
```
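
The aggregation and metric formulas can be sanity-checked in plain Python. The per-batch counts below are hypothetical and not taken from the actual run; they are simply one set of counts whose totals happen to reproduce the figures shown above.

```python
def metrics_from_counts(tn: int, fp: int, fn: int, tp: int) -> dict[str, float]:
    """Compute the same four metrics from global confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical per-batch confusion counts (illustrative only)
batch_counts = [
    {"TN": 20, "FP": 1, "FN": 0, "TP": 22},
    {"TN": 19, "FP": 1, "FN": 0, "TP": 23},
]

# Summing per-batch counts gives the global confusion matrix
totals = {key: sum(b[key] for b in batch_counts) for key in ("TN", "FP", "FN", "TP")}
toy_metrics = metrics_from_counts(totals["TN"], totals["FP"], totals["FN"], totals["TP"])

for key, value in toy_metrics.items():
    print(f"{key}: {value:.4f}")
```

Summing counts first and computing ratios afterwards matters: averaging per-batch precision or recall would weight small and large batches equally and generally give a different answer.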

## Summary

In this notebook, we've demonstrated how to validate a machine learning model using Ray Data:

1. We loaded the breast cancer dataset testing subset with distributed reads
2. We transformed the dataset in a streaming fashion
3. We created a validation pipeline using Ray Data's transformations to:
   - Make predictions on the test data
   - Calculate confusion matrix components (TP, TN, FP, FN) for each batch
   - Aggregate results across all batches
4. We computed key performance metrics

Ray Data enables you to efficiently run this pipeline on terabyte-scale datasets without changing your code.

In the next notebook, we'll see how we can serve this XGBoost model for online inference jobs using Ray Serve.