# Scalable online XGBoost inference with Ray Serve

<div align="left">
<a target="_blank" href="https://console.anyscale.com/"><img src="https://img.shields.io/badge/🚀 Run_on-Anyscale-9hf"></a>&nbsp;
<a href="https://github.com/anyscale/e2e-xgboost" role="button"><img src="https://img.shields.io/static/v1?label=&amp;message=View%20On%20GitHub&amp;color=586069&amp;logo=github&amp;labelColor=2f363d"></a>&nbsp;
</div>

In this tutorial, we'll launch an online service that:
- deploys our trained XGBoost model artifacts to generate predictions
- autoscales based on real-time incoming traffic

We'll also cover observability and debugging for the service.

<div class="alert alert-block alert">

[Ray Serve](https://docs.ray.io/en/latest/serve/index.html) is a highly scalable and flexible model serving library for building online inference APIs. It lets us:
- wrap our models and business logic as separate [serve deployments](https://docs.ray.io/en/latest/serve/key-concepts.html#deployment) and [connect](https://docs.ray.io/en/latest/serve/model_composition.html) them together (pipeline, ensemble, etc.)
- avoid one large service that is network- and compute-bound (an inefficient use of resources)
- utilize fractional, heterogeneous [resources](https://docs.ray.io/en/latest/serve/resource-allocation.html) (**not possible** with SageMaker, Vertex, KServe, etc.) and horizontally scale (`num_replicas`)
- [autoscale](https://docs.ray.io/en/latest/serve/autoscaling-guide.html) up and down based on traffic
- integrate with [FastAPI and HTTP](https://docs.ray.io/en/latest/serve/http-guide.html)
- set up a [gRPC service](https://docs.ray.io/en/latest/serve/advanced-guides/grpc-guide.html#set-up-a-grpc-service) to build distributed systems and microservices
- enable [dynamic batching](https://docs.ray.io/en/latest/serve/advanced-guides/dyn-req-batch.html) (based on batch size, time, etc.)
- use a suite of [utilities for serving LLMs](https://docs.ray.io/en/latest/serve/llm/serving-llms.html) (multi-LoRA, inference-engine agnostic, etc.)

<img src="https://github.com/anyscale/e2e-xgboost/blob/main/images/ray_serve.png?raw=true" width=600>
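To make the autoscaling bullet concrete, here is a simplified sketch of how a replica count could follow traffic. The dictionary keys mirror Ray Serve's `autoscaling_config` as described in the autoscaling guide, but the values are illustrative and `desired_replicas` is a hypothetical helper: the real autoscaler also applies delays and smoothing that this sketch ignores.

```python
import math

# Keys mirror Ray Serve's `autoscaling_config`; the values are illustrative only.
autoscaling_config = {
    "min_replicas": 1,
    "max_replicas": 4,
    "target_ongoing_requests": 2,
}


def desired_replicas(total_ongoing_requests: int, config: dict) -> int:
    """Simplified rule: enough replicas to keep each one near its target load,
    clamped to the configured [min_replicas, max_replicas] range."""
    raw = math.ceil(total_ongoing_requests / config["target_ongoing_requests"])
    return max(config["min_replicas"], min(config["max_replicas"], raw))


# Replica counts for a few hypothetical traffic levels.
replica_plan = {load: desired_replicas(load, autoscaling_config) for load in (0, 3, 5, 20)}
print(replica_plan)  # {0: 1, 3: 2, 5: 3, 20: 4}
```

A dictionary like this could be passed as `@serve.deployment(autoscaling_config=...)` in place of a fixed `num_replicas`.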

In [None]:
%load_ext autoreload
%autoreload all

In [None]:
# enable loading of the dist_xgboost module
import os
import sys

sys.path.append(os.path.abspath(".."))

In [None]:
# Enable Ray Train v2
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"
# now it's safe to import from ray.train

# Enable uv on Ray
# https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#using-uv-for-package-management
os.environ["RAY_RUNTIME_ENV_HOOK"] = "ray._private.runtime_env.uv_runtime_env_hook.hook"

In [None]:
# make ray data less verbose
import ray

ray.data.DataContext.get_current().enable_progress_bars = False
ray.data.DataContext.get_current().print_on_execution_start = False

## Loading the Model

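Next, we load the pre-trained preprocessor and XGBoost model from the MLflow registry, as demonstrated in the validation notebook. In this notebook, the loading happens inside the deployment defined below, through `load_model_and_preprocessor()`.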

## Creating a Ray Serve Deployment

We'll now define our Ray Serve endpoint. We'll use a class so that the model and preprocessor are loaded once, when each replica starts, rather than on every request. Our deployment will support both Pythonic and HTTP requests.

In [None]:
import pandas as pd
import xgboost
from ray import serve
from starlette.requests import Request

from dist_xgboost.data import load_model_and_preprocessor


@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 2})
class XGBoostModel:
    def __init__(self):
        self.preprocessor, self.model = load_model_and_preprocessor()

    def pythonic_call(self, input_data: dict) -> dict:
        # Convert to DataFrame
        input_df = pd.DataFrame([input_data])
        # Preprocess the input
        preprocessed_batch = self.preprocessor.transform_batch(input_df)
        # Create DMatrix for prediction
        dmatrix = xgboost.DMatrix(preprocessed_batch)
        # Get predictions
        predictions = self.model.predict(dmatrix)
        return {"predictions": predictions.tolist()}

    async def __call__(self, request: Request) -> dict:
        # Parse the request body as JSON
        input_data = await request.json()
        return self.pythonic_call(input_data)
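The deployment above passes the request body straight into a DataFrame, so a missing or non-numeric feature only surfaces as an error deeper in preprocessing or prediction. As a minimal sketch, a check like the following could run at the top of `pythonic_call`. `validate_payload` and the three-feature `EXPECTED_FEATURES` list are hypothetical; a real check would list all 30 features the model was trained on.

```python
from numbers import Real

# Hypothetical subset for illustration; the real model expects all 30 features.
EXPECTED_FEATURES = ["mean radius", "mean texture", "mean perimeter"]


def validate_payload(payload: dict, expected_features: list) -> list:
    """Return human-readable problems; an empty list means the payload looks usable."""
    problems = []
    missing = [name for name in expected_features if name not in payload]
    unexpected = [name for name in payload if name not in expected_features]
    if missing:
        problems.append(f"missing features: {missing}")
    if unexpected:
        problems.append(f"unexpected features: {unexpected}")
    for name in expected_features:
        if name not in payload:
            continue
        value = payload[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(f"feature {name!r} is not numeric: {value!r}")
    return problems


good = validate_payload(
    {"mean radius": 14.9, "mean texture": 22.53, "mean perimeter": 102.1},
    EXPECTED_FEATURES,
)
bad = validate_payload({"mean radius": "14.9", "mean texture": 22.53}, EXPECTED_FEATURES)
print(good)
print(bad)
```

Returning a list of problems rather than raising on the first one lets the service report everything wrong with a request in a single response.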

<div class="alert alert-block alert"> <b>🧱 Model composition</b>

Ray Serve makes [model composition](https://docs.ray.io/en/latest/serve/model_composition.html) straightforward: we can compose multiple deployments containing ML models or business logic into a single application, and we can independently scale (even with fractional resources) and configure each deployment.

<img src="https://raw.githubusercontent.com/anyscale/foundational-ray-app/refs/heads/main/images/serve_composition.png" width=800>

Let's ensure that we don't have any existing deployments first using [`serve.shutdown()`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.shutdown.html#ray.serve.shutdown):

In [None]:
if "default" in serve.status().applications and serve.status().applications["default"].status == "RUNNING":
    print("Shutting down existing serve application")
    serve.shutdown()

2025-04-09 14:27:30,601	INFO worker.py:1709 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8267


Now that we've defined the deployment, we can create our `ray.serve.Application` using the [`.bind()`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.Deployment.html#ray.serve.Deployment) method:

In [17]:
# Define the app
xgboost_model = XGBoostModel.bind()

## Preparing Test Data

Let's prepare some example data to test our deployment. We'll use a sample from our hold-out set:

In [16]:
sample_input = {
    "mean radius": 14.9,
    "mean texture": 22.53,
    "mean perimeter": 102.1,
    "mean area": 685.0,
    "mean smoothness": 0.09947,
    "mean compactness": 0.2225,
    "mean concavity": 0.2733,
    "mean concave points": 0.09711,
    "mean symmetry": 0.2041,
    "mean fractal dimension": 0.06898,
    "radius error": 0.253,
    "texture error": 0.8749,
    "perimeter error": 3.466,
    "area error": 24.19,
    "smoothness error": 0.006965,
    "compactness error": 0.06213,
    "concavity error": 0.07926,
    "concave points error": 0.02234,
    "symmetry error": 0.01499,
    "fractal dimension error": 0.005784,
    "worst radius": 16.35,
    "worst texture": 27.57,
    "worst perimeter": 125.4,
    "worst area": 832.7,
    "worst smoothness": 0.1419,
    "worst compactness": 0.709,
    "worst concavity": 0.9019,
    "worst concave points": 0.2475,
    "worst symmetry": 0.2866,
    "worst fractal dimension": 0.1155,
}
sample_target = 0  # Ground truth label

## Running the Service

There are two ways to run a Ray Serve service:

1) **Serve CLI**: use the [`serve run`](https://docs.ray.io/en/latest/serve/getting_started.html#running-a-ray-serve-application) CLI command, e.g. `serve run tutorial:xgboost_model`
2) **Pythonic API**: use `ray.serve`'s [`serve.run` command](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.run.html#ray.serve.run), e.g. `serve.run(xgboost_model)`

For this example, we'll use the Pythonic API:

In [None]:
from ray.serve.handle import DeploymentHandle

handle: DeploymentHandle = serve.run(xgboost_model, name="xgboost-breast-cancer-classifier")

INFO 2025-04-09 14:27:31,954 serve 42336 -- Started Serve in namespace "serve".
INFO 2025-04-09 14:27:33,066 serve 42336 -- Application 'xgboost-breast-cancer-classifier' is ready at http://127.0.0.1:8000/.

(ProxyActor pid=42374) INFO 2025-04-09 14:27:31,923 proxy 127.0.0.1 -- Proxy starting on node 94fbfec9b89f75d90b7512b30d7076cf04fcad9cb384734bcd49a346 (HTTP port: 8000).
(ProxyActor pid=42374) INFO 2025-04-09 14:27:31,942 proxy 127.0.0.1 -- Got updated endpoints: {}.
(ServeController pid=42385) INFO 2025-04-09 14:27:31,964 controller 42385 -- Deploying new version of Deployment(name='XGBoostModel', app='xgboost-breast-cancer-classifier') (initial target replicas: 1).
(ProxyActor pid=42374) INFO 2025-04-09 14:27:31,965 proxy 127.0.0.1 -- Got updated endpoints: {Deployment(name='XGBoostModel', app='xgboost-breast-cancer-classifier'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ProxyActor pid=42374) INFO 2025-04-09 14:27:31,969 proxy 127.0.0.1 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x10e8c0350>.
(ServeController pid=42385) INFO 2025-04-09 14:27:32,066 controller 42385 -- Adding 1 replica 

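We should see logs indicating that the service is running locally, similar to the following. The application name in your logs matches the `name` passed to `serve.run`; the example below comes from a run with the name `default`: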

```bash
INFO 2025-04-09 14:06:55,760 serve 31684 -- Started Serve in namespace "serve".
INFO 2025-04-09 14:06:57,875 serve 31684 -- Application 'default' is ready at http://127.0.0.1:8000/.
```

We can also check whether it is running using [`serve.status()`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.status.html#ray.serve.status):

In [None]:
serve.status().applications["xgboost-breast-cancer-classifier"].status == "RUNNING"

True

## Querying the Service

### Using HTTP
The most common way to query services is via an HTTP request. This invokes the `__call__` method we defined earlier:

In [None]:
import requests

url = "http://127.0.0.1:8000/"
response = requests.post(url, json=sample_input).json()

print(f"Prediction: {response['predictions'][0]:.4f}")
print(f"Ground truth: {sample_target}")

Prediction: 0.0434
Ground truth: 0
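The service returns a score rather than a class label. Assuming the model was trained with a binary logistic objective, so that the score is the predicted probability of class 1, a label can be derived with a threshold. `to_label` and the 0.5 cutoff are illustrative choices, not part of the tutorial's code.

```python
def to_label(probability: float, threshold: float = 0.5) -> int:
    """Map a predicted probability of class 1 to a 0/1 label."""
    return int(probability >= threshold)


# Same shape as the JSON returned by the service above.
example_response = {"predictions": [0.0434]}
predicted_label = to_label(example_response["predictions"][0])
print(predicted_label)  # 0
```

The threshold is a product decision: lowering it trades more false positives for fewer false negatives.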


### Using Python

For a more direct Pythonic way to query the model, you can use the deployment handle:

In [None]:
response = await handle.pythonic_call.remote(sample_input)
print(response)

INFO 2025-04-09 14:27:45,116 serve 42336 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x15f230110>.


{'predictions': [0.043412428349256516]}


This approach is useful if you need to interact with the service from a different process in the same Ray cluster. If you need to recreate the handle, you can use [`serve.get_deployment_handle`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.get_deployment_handle.html), passing the deployment name and the application name:

In [None]:
handle = serve.get_deployment_handle("XGBoostModel", "xgboost-breast-cancer-classifier")

<div class="alert alert-block alert"> <b>🔎 Observability for Services</b>

Observability data for Ray Serve applications is automatically captured in the Ray dashboard, specifically in the [Serve view](https://docs.ray.io/en/latest/ray-observability/getting-started.html#serve-view). Here we can view our service [deployments and their replicas](https://docs.ray.io/en/latest/serve/key-concepts.html#serve-key-concepts-deployment) and time-series metrics to see our service's health.

<img src="https://raw.githubusercontent.com/anyscale/e2e-xgboost/refs/heads/main/images/serve_dashboard.png" width=800>

In [None]:
# Shutdown service
serve.shutdown()

<div class="alert alert-block alert"> <b>Anyscale Services</b>

[Anyscale Services](https://docs.anyscale.com/platform/services/) ([API ref](https://docs.anyscale.com/reference/service-api/)) offers an extremely fault-tolerant, scalable, and optimized way to serve our Ray Serve applications. With it, we can:
- [roll out and update](https://docs.anyscale.com/platform/services/update-a-service) our services with canary deployment (zero-downtime upgrades)
- [monitor](https://docs.anyscale.com/platform/services/monitoring) our Services through a dedicated Service page, unified log viewer, tracing, alerts, etc.
- scale a service (`num_replicas=auto`) and utilize replica compaction to consolidate nodes that are fractionally utilized
- rely on [head node fault tolerance](https://docs.anyscale.com/platform/services/production-best-practices#head-node-ft) (OSS Ray recovers from failed workers and replicas but not head node crashes)
- serve [multiple applications](https://docs.anyscale.com/platform/services/multi-app) in a single Service

<img src="https://raw.githubusercontent.com/anyscale/e2e-xgboost/refs/heads/main/images/canary.png" width=1000>

[RayTurbo Serve](https://docs.anyscale.com/rayturbo/rayturbo-serve) on Anyscale has even more functionality on top of Ray Serve:
- **fast autoscaling and model loading** to get our services up and running even faster ([5x improvements](https://www.anyscale.com/blog/autoscale-large-ai-models-faster) even for LLMs)
- 54% **higher QPS** and up to 3x **streaming tokens per second** for high-traffic serving use cases (no proxy bottlenecks)
- **replica compaction** into fewer nodes where possible to reduce resource fragmentation and improve hardware utilization
- **zero-downtime** [incremental rollouts](https://docs.anyscale.com/platform/services/update-a-service/#resource-constrained-updates) so your service is never interrupted
- [**different environments**](https://docs.anyscale.com/platform/services/multi-app/#multiple-applications-in-different-containers) for each service in a multi-serve application
- **multi availability-zone** aware scheduling of Ray Serve replicas to provide higher redundancy to availability zone failures



**Note**: 
- we're using a `containerfile` to define our dependencies, but we could just as easily use a pre-built image.
- we can specify the compute as a [compute config](https://docs.anyscale.com/configuration/compute-configuration/) or inline in a [Service config](https://docs.anyscale.com/reference/service-api/) file.
- when we don't specify compute and launch from a workspace, the service defaults to the compute configuration of the workspace.

In [None]:
%%bash
# Production online service
anyscale service deploy dist_xgboost.serve:xgboost_model --name=e2e-xgboost \
    --containerfile="/home/ray/default/containerfile" \
    --working-dir="/home/ray/default" \
    --exclude=""

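Your service is now in production! The link to your endpoint should be visible in the logs. Here's an example of how you might query it; replace the URL and `<BEARER_TOKEN>` with the values for your own service. The request body must be a JSON object with the same 30 features as `sample_input`, assumed here to be saved in a hypothetical file named `sample_input.json`. The path `/` assumes the default route used earlier in this notebook:

```sh
curl -X POST "https://e2e-xgboost-bxauk.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/" \
     -H "Authorization: Bearer <BEARER_TOKEN>" \
     -H "Content-Type: application/json" \
     -d @sample_input.json
```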
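A request like the curl example above can also be built from Python. The sketch below uses only the standard library and stops short of sending anything. The `SERVICE_URL` and `SERVICE_TOKEN` environment variable names, the `https://example.com/` fallback, and the two-feature payload are placeholders, and a real request would carry all 30 features from `sample_input`.

```python
import json
import os
import urllib.request


def build_prediction_request(service_url: str, token: str, features: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url=service_url,
        data=json.dumps(features).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical environment variable names; fall back to obvious placeholders.
prediction_request = build_prediction_request(
    service_url=os.environ.get("SERVICE_URL", "https://example.com/"),
    token=os.environ.get("SERVICE_TOKEN", "<BEARER_TOKEN>"),
    features={"mean radius": 14.9, "mean texture": 22.53},  # truncated for brevity
)

# To actually send it against a live service:
# with urllib.request.urlopen(prediction_request, timeout=10) as resp:
#     print(json.load(resp))
```

Reading the token from the environment keeps it out of the notebook and out of version control.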

To tear down the service, you can run:

In [None]:
%%bash
# Terminate service
anyscale service terminate --name e2e-xgboost

<div class="alert alert-block alert"> <b>CI/CD</b>

While Anyscale [Jobs](https://docs.anyscale.com/platform/jobs/) and [Services](https://docs.anyscale.com/platform/services/) are useful atomic concepts for productionizing our workloads, they also work well as nodes in a larger ML DAG or [CI/CD workflow](https://docs.anyscale.com/ci-cd/). You can chain Jobs together, store results, and then serve your application with those artifacts. From there, you can trigger updates to your service (and retrigger the Jobs) based on events, time, etc. And while we can simply use the Anyscale CLI to integrate with any orchestration platform, Anyscale also supports some purpose-built integrations ([Airflow](https://docs.anyscale.com/ci-cd/apache-airflow/), [Prefect](https://github.com/anyscale/prefect-anyscale)).

<img src="https://raw.githubusercontent.com/anyscale/e2e-xgboost/refs/heads/main/images/cicd.png" width=700>

