# Scalable online XGBoost inference with Ray Serve

In this tutorial, we create a distributed service for our trained XGBoost model so that it can serve real-time predictions in a scalable way.

In [None]:
%load_ext autoreload
%autoreload all

In [None]:
# enable loading of the dist_xgboost module
import os
import sys

sys.path.append(os.path.abspath(".."))

In [None]:
# Enable Ray Train v2
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"
# now it's safe to import from ray.train

In [None]:
# make ray data less verbose
import ray

ray.data.DataContext.get_current().enable_progress_bars = False
ray.data.DataContext.get_current().print_on_execution_start = False

## Loading the Model

Next, we load the pre-trained preprocessor and XGBoost model from the MLflow registry, as demonstrated in the validation notebook.

In [None]:
from dist_xgboost.data import get_best_model_from_registry
from dist_xgboost.constants import preprocessor_fname
import pickle
from ray.train.xgboost import RayTrainReportCallback
from ray.train import Checkpoint


best_run, best_artifacts_dir = get_best_model_from_registry()

with open(os.path.join(best_artifacts_dir, preprocessor_fname), "rb") as f:
    preprocessor = pickle.load(f)


checkpoint = Checkpoint.from_directory(best_artifacts_dir)
model = RayTrainReportCallback.get_model(checkpoint)

## Creating a Ray Serve Deployment

We'll now define our Ray Serve endpoint. We'll use a reusable class to avoid reloading the model and preprocessor for each request. Our deployment will support both Pythonic and HTTP requests.

In [None]:
from ray import serve
from starlette.requests import Request
import pandas as pd
import xgboost


@serve.deployment
class XGBoostModel:
    def __init__(self, preprocessor, model):
        self.model = model
        self.preprocessor = preprocessor

    def pythonic_call(self, input_data: dict) -> dict:
        # Convert to DataFrame
        input_df = pd.DataFrame([input_data])
        # Preprocess the input
        preprocessed_batch = self.preprocessor.transform_batch(input_df)
        # Create DMatrix for prediction
        dmatrix = xgboost.DMatrix(preprocessed_batch)
        # Get predictions
        predictions = self.model.predict(dmatrix)
        return {"predictions": predictions.tolist()}

    async def __call__(self, request: Request) -> dict:
        # Parse the request body as JSON
        input_data = await request.json()
        return self.pythonic_call(input_data)
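
The deployment hands the parsed JSON straight to the preprocessor, so a malformed payload may only surface as an error further down in the preprocessor or in XGBoost. Below is a minimal sketch of an up-front check, assuming you keep a list of the feature names the preprocessor expects. The helper `find_feature_problems` and the shortened three-name feature list are hypothetical and exist only for illustration; they are not part of the tutorial's code.

```python
def find_feature_problems(input_data: dict, expected_features: list) -> list:
    """Return human-readable problems found in a request payload (hypothetical helper)."""
    problems = []
    for name in expected_features:
        if name not in input_data:
            problems.append(f"missing feature: {name}")
            continue
        value = input_data[name]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"non-numeric feature: {name}")
    for name in input_data:
        if name not in expected_features:
            problems.append(f"unexpected feature: {name}")
    return problems


# Shortened feature list, for illustration only (the real model uses 30 features)
expected = ["mean radius", "mean texture", "mean perimeter"]
payload = {"mean radius": 14.9, "mean texture": "high", "mean areaa": 685.0}

problems = find_feature_problems(payload, expected)
for problem in problems:
    print(problem)
```

A check like this could run at the top of `pythonic_call`, returning the problem list to the caller instead of attempting a prediction.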

First, let's make sure no existing Serve application is still running, shutting it down with [`serve.shutdown()`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.shutdown.html#ray.serve.shutdown) if needed:

In [None]:
# Shut down Serve if any application is still running from a previous session
if any(
    app.status == "RUNNING" for app in serve.status().applications.values()
):
    print("Shutting down existing serve application")
    serve.shutdown()

2025-04-09 14:27:30,601	INFO worker.py:1709 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8267


Now that we've defined the deployment, we can create our `ray.serve.Application` using the [`.bind()`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.Deployment.html#ray.serve.Deployment) method:

In [None]:
xgboost_model = XGBoostModel.bind(preprocessor, model)

## Preparing Test Data

Let's prepare some example data to test our deployment. We'll use a sample from our hold-out set:

In [None]:
sample_input = {
    "mean radius": 14.9,
    "mean texture": 22.53,
    "mean perimeter": 102.1,
    "mean area": 685.0,
    "mean smoothness": 0.09947,
    "mean compactness": 0.2225,
    "mean concavity": 0.2733,
    "mean concave points": 0.09711,
    "mean symmetry": 0.2041,
    "mean fractal dimension": 0.06898,
    "radius error": 0.253,
    "texture error": 0.8749,
    "perimeter error": 3.466,
    "area error": 24.19,
    "smoothness error": 0.006965,
    "compactness error": 0.06213,
    "concavity error": 0.07926,
    "concave points error": 0.02234,
    "symmetry error": 0.01499,
    "fractal dimension error": 0.005784,
    "worst radius": 16.35,
    "worst texture": 27.57,
    "worst perimeter": 125.4,
    "worst area": 832.7,
    "worst smoothness": 0.1419,
    "worst compactness": 0.709,
    "worst concavity": 0.9019,
    "worst concave points": 0.2475,
    "worst symmetry": 0.2866,
    "worst fractal dimension": 0.1155,
}
sample_target = 0  # Ground truth label

## Running the Service

There are two ways to run a Ray Serve service:

1) **Serve CLI**: use the [`serve run`](https://docs.ray.io/en/latest/serve/getting_started.html#running-a-ray-serve-application) command, e.g. `serve run tutorial:xgboost_model`.
2) **Pythonic API**: use the [`serve.run`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.run.html#ray.serve.run) function from `ray.serve`, e.g. `serve.run(xgboost_model)`.
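
The `tutorial:xgboost_model` argument in the first option follows a `module:attribute` convention: the part before the colon names an importable Python module and the part after it names the application object inside that module. The sketch below illustrates that convention with the standard library only. The function `resolve_import_path` is hypothetical and is not Ray Serve's actual loader, which may behave differently.

```python
import importlib


def resolve_import_path(import_path: str):
    """Resolve a 'module:attribute' string to the object it names (illustrative only)."""
    module_name, sep, attr_name = import_path.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Demonstrated with a standard-library module so the snippet runs anywhere
dumps = resolve_import_path("json:dumps")
print(dumps({"ok": True}))
```

Under this convention, `tutorial:xgboost_model` would refer to the `xgboost_model` variable defined in a file named `tutorial.py` that is importable from the working directory.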

For this example, we'll use the Pythonic API:

In [None]:
from ray.serve.handle import DeploymentHandle

handle: DeploymentHandle = serve.run(
    xgboost_model, name="xgboost-breast-cancer-classifier"
)

INFO 2025-04-09 14:27:31,954 serve 42336 -- Started Serve in namespace "serve".
INFO 2025-04-09 14:27:33,066 serve 42336 -- Application 'xgboost-breast-cancer-classifier' is ready at http://127.0.0.1:8000/.


(ProxyActor pid=42374) INFO 2025-04-09 14:27:31,923 proxy 127.0.0.1 -- Proxy starting on node 94fbfec9b89f75d90b7512b30d7076cf04fcad9cb384734bcd49a346 (HTTP port: 8000).
(ProxyActor pid=42374) INFO 2025-04-09 14:27:31,942 proxy 127.0.0.1 -- Got updated endpoints: {}.
(ServeController pid=42385) INFO 2025-04-09 14:27:31,964 controller 42385 -- Deploying new version of Deployment(name='XGBoostModel', app='xgboost-breast-cancer-classifier') (initial target replicas: 1).
(ProxyActor pid=42374) INFO 2025-04-09 14:27:31,965 proxy 127.0.0.1 -- Got updated endpoints: {Deployment(name='XGBoostModel', app='xgboost-breast-cancer-classifier'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ProxyActor pid=42374) INFO 2025-04-09 14:27:31,969 proxy 127.0.0.1 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x10e8c0350>.
(ServeController pid=42385) INFO 2025-04-09 14:27:32,066 controller 42385 -- Adding 1 replica 

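The output should include logs like the following, indicating that the service is running locally. The application name in your logs matches the `name` passed to `serve.run`, or `default` if none is given: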

```bash
INFO 2025-04-09 14:06:55,760 serve 31684 -- Started Serve in namespace "serve".
INFO 2025-04-09 14:06:57,875 serve 31684 -- Application 'default' is ready at http://127.0.0.1:8000/.
```

We can also check whether it is running using [`serve.status()`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.status.html#ray.serve.status):

In [None]:
serve.status().applications["xgboost-breast-cancer-classifier"].status == "RUNNING"

True
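
A single status check can be wrapped in a small polling loop when the application is started elsewhere, for example from the CLI or another process, and you want to wait for it before sending requests. The following is a minimal sketch; `wait_until` is a hypothetical helper, and the timeout and interval values are arbitrary assumptions. It takes any zero-argument callable, so the demo uses a stand-in predicate instead of a live Serve instance.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in predicate that reports "ready" on its third call
calls = {"n": 0}


def ready_on_third_call() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(ready_on_third_call, timeout_s=5.0, interval_s=0.01))
```

With Serve running, the predicate could be the status expression from the cell above, e.g. `lambda: serve.status().applications["xgboost-breast-cancer-classifier"].status == "RUNNING"`.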

## Querying the Service

### Using HTTP
The most common way to query services is via an HTTP request. This invokes the `__call__` method we defined earlier:

In [None]:
import requests

url = "http://127.0.0.1:8000/"
response = requests.post(url, json=sample_input).json()

print(f"Prediction: {response['predictions'][0]:.4f}")
print(f"Ground truth: {sample_target}")

Prediction: 0.0434
Ground truth: 0
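
The service returns a raw score rather than a class label. Assuming the model was trained with a binary logistic objective, so that the score is the predicted probability of class 1, a label can be derived by thresholding. This is a sketch under that assumption: the helper `to_class_label` and the 0.5 cutoff are illustrative choices, not something the tutorial's model or Ray Serve prescribes.

```python
def to_class_label(score: float, threshold: float = 0.5) -> int:
    """Map a predicted probability to a 0/1 label (assumes a binary classifier)."""
    return int(score >= threshold)


# Score and ground truth taken from the HTTP response shown above
score = 0.0434
sample_target = 0

label = to_class_label(score)
print(f"Predicted label: {label}, matches ground truth: {label == sample_target}")
```

The cutoff is worth tuning on validation data if false positives and false negatives carry different costs.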


### Using Python

For a more direct Pythonic way to query the model, you can use the deployment handle:

In [None]:
response = await handle.pythonic_call.remote(sample_input)
print(response)

INFO 2025-04-09 14:27:45,116 serve 42336 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x15f230110>.


{'predictions': [0.043412428349256516]}


This approach is useful if you need to interact with the service from a different process in the same Ray cluster. If you need to obtain the deployment handle again, you can use [`serve.get_deployment_handle`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.get_deployment_handle.html):

In [None]:
handle = serve.get_deployment_handle("XGBoostModel", "xgboost-breast-cancer-classifier")