# Intro to Ray Serve

This notebook will introduce you to Ray Serve, a framework for building and deploying scalable ML applications.

<div class="alert alert-block alert-info">
    
<b>Here is the roadmap for this notebook:</b>

<ul>
    <li><b>1.</b> When to consider Ray Serve</li>
    <li><b>2.</b> Implement an image classification service</li>
    <li><b>3.</b> Key concepts in Ray Serve</li>
    <li><b>4.</b> Model composition with Ray Serve</li>
</ul>
</div>

__Installing Dependencies__

In [None]:
%%bash
uv pip install -r python_depset.lock --no-cache-dir --no-deps --system

## Imports


In [None]:
import subprocess
import logging
from typing import Any
from langdetect import detect

import json
import fastapi
import numpy as np
import requests
import torch
from sentence_transformers import SentenceTransformer
from pydantic import BaseModel
from ray import serve
from ray.serve.handle import DeploymentHandle
from starlette.requests import Request

## 1. When to Consider Ray Serve

Consider using Ray Serve for your project if it meets one or more of the following criteria:

| **Challenge** | **Details** | **Ray Serve Solution** |
|---------------|------------------|--------------------------|
| **Slow iteration speed for ML engineers** | - Developers need to containerize and rollout components on Kubernetes to test changes<br>- Developers need to use complex protocols (e.g. gRPC) to achieve acceptable performance | - Provides a Python-first API to develop lightweight services<br>- Services are lightweight [Ray actors](https://docs.ray.io/en/latest/ray-core/actors.html)<br>- Ray Serve can be run locally for development |
| **Need to efficiently compose multiple components** | - Requiring efficient data sharing between components<br>- Implementing performant streaming protocols (e.g. gRPC) is a complex task | - Relies on [Ray's object store](https://docs.ray.io/en/latest/ray-core/objects.html) to share data optimally<br>- Avoids the need to implement gRPC streaming |
| **Poor utilization of expensive hardware** | Naive request handling, e.g. passing requests one at a time to GPUs or accelerators | - Offers [dynamic batching of requests](https://docs.ray.io/en/latest/serve/advanced-guides/dyn-req-batch.html) to improve hardware utilization<br>- Leverages Ray Core's support for accelerators and custom resources:<br>&nbsp;&nbsp;&nbsp;&nbsp;• [Multi-node/multi-GPU serving](https://docs.ray.io/en/latest/serve/tutorials/vllm-example.html)<br>&nbsp;&nbsp;&nbsp;&nbsp;• [Fractional compute resource usage](https://docs.ray.io/en/latest/serve/configure-serve-deployment.html)<br>- RayTurbo Serve offers [replica compaction](https://www.anyscale.com/blog/new-feature-replica-compaction) |
| **High-latency outliers when juggling many models** | Stuck with naive load balancing and expensive state loading (e.g. ML models) | - Provides [model multiplexing](https://docs.ray.io/en/latest/serve/model-multiplexing.html) to avoid unnecessary load times<br>- Routes to replicas that already have a model loaded |

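The dynamic batching listed in the table can be pictured with a small, Ray-free sketch. Ray Serve exposes this idea through its `@serve.batch` decorator; the `MicroBatcher` class below, its parameters, and the doubling "model" are hypothetical names chosen for illustration and are not Ray Serve's implementation. Requests queue up, and a background task hands them to the handler in groups instead of one at a time.

```python
import asyncio


class MicroBatcher:
    """Toy dynamic batcher: groups concurrent requests into one handler call."""

    def __init__(self, handler, max_batch_size=4, wait_timeout_s=0.05):
        self.handler = handler              # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.wait_timeout_s = wait_timeout_s
        self.queue = asyncio.Queue()
        self.batch_sizes = []               # recorded only for inspection
        self._worker = None

    async def submit(self, item):
        # Start the background worker lazily, inside the running event loop.
        if self._worker is None:
            self._worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.wait_timeout_s
            # Keep collecting until the batch is full or the wait budget is spent.
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # One handler call serves the whole batch.
            outputs = self.handler([item for item, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def demo():
    # The lambda stands in for a model that processes a list of inputs at once.
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4)
    results = await asyncio.gather(*(batcher.submit(i) for i in range(8)))
    batcher._worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(demo())
```

Here eight concurrent requests are served in batches of at most four, and each caller still receives only its own result. Inside a notebook, where an event loop is already running, `await demo()` would replace `asyncio.run(demo())`.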

### Ray vs K8s 

Here are some key points to keep in mind when comparing Ray to Kubernetes:

<table style="border-collapse: collapse; width: 100%; font-family: sans-serif; font-size: 15px;">
  <thead>
    <tr style="text-align: left; border-bottom: 1px solid #ddd;">
      <th style="padding: 6px; color: #444;">Category</th>
      <th style="padding: 6px; color: #444;">Kubernetes (Traditional Microservices)</th>
      <th style="padding: 6px; color: #444;">Ray on Kubernetes (AI/ML-Native Runtime)</th>
    </tr>
  </thead>
  <tbody>
    <tr style="border-bottom: 1px solid #f0f0f0;">
      <td style="padding: 6px;"></td>
      <td style="padding: 6px;"><img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-serve-deep-dive/k8s.png" width="500"></td>
      <td style="padding: 6px;"><img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-serve-deep-dive/ray_on_k8s.png" width="500"></td>
    </tr>
    <tr style="border-bottom: 1px solid #f0f0f0;">
      <td style="padding: 6px;">Definition of Work</td>
      <td style="padding: 6px;">Microservices defined by <strong>Pods</strong></td>
      <td style="padding: 6px;">Work defined by <strong>Tasks/Actors</strong></td>
    </tr>
    <tr style="border-bottom: 1px solid #f0f0f0;">
      <td style="padding: 6px;">Interface</td>
      <td style="padding: 6px;">Declarative configs (YAML)</td>
      <td style="padding: 6px;">Programmatic API (Python-native)</td>
    </tr>
    <tr style="border-bottom: 1px solid #f0f0f0;">
      <td style="padding: 6px;">Orchestration</td>
      <td style="padding: 6px;">Pods orchestrated on shared compute</td>
      <td style="padding: 6px;">Tasks/Actors orchestrated on any substrate (k8s shown)</td>
    </tr>
    <tr style="border-bottom: 1px solid #f0f0f0;">
      <td style="padding: 6px;">State</td>
      <td style="padding: 6px;">Hard separation of stateless/stateful pods</td>
      <td style="padding: 6px;">Built-in object store and stateful actor model</td>
    </tr>
    <tr style="border-bottom: 1px solid #f0f0f0;">
      <td style="padding: 6px;">Scaling</td>
      <td style="padding: 6px;">Pods scaled independently</td>
      <td style="padding: 6px;">Dynamic scheduling + autoscaling built into programming model</td>
    </tr>
    <tr style="border-bottom: 1px solid #f0f0f0;">
      <td style="padding: 6px;">AI/ML Fit</td>
      <td style="padding: 6px;">General-purpose; evolving to meet AI/ML needs</td>
      <td style="padding: 6px;">Optimized for AI/ML workloads, deeply integrated with GPUs/accelerators</td>
    </tr>
    <tr>
      <td style="padding: 6px;">Granularity</td>
      <td style="padding: 6px;">Coarse-grained (~seconds per container, ~500ms startup)</td>
      <td style="padding: 6px;">Fine-grained (~milliseconds per task, &lt;5ms startup)</td>
    </tr>
  </tbody>
</table>


## 2. Implement an image classification service

Let’s jump right in and get a simple ML service up and running on Ray Serve. 

Here is an image classification service that performs inference on a batch of handwritten digits using an `MNISTClassifier` model.


In [None]:
class MNISTClassifier:
    def __init__(self, remote_path: str, local_path: str, device: str):
        subprocess.run(f"aws s3 cp {remote_path} {local_path}", shell=True, check=True)
        
        self.device = device
        self.model = torch.jit.load(local_path).to(device).eval()

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        return self.predict(batch)
    
    def predict(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        images = torch.tensor(batch["image"]).float().to(self.device)

        with torch.no_grad():
            logits = self.model(images).cpu().numpy()

        batch["predicted_label"] = np.argmax(logits, axis=1)
        return batch

First we need to load the classifier model


In [None]:
classifier = MNISTClassifier(remote_path="s3://anyscale-public-materials/ray-ai-libraries/mnist/model/model.pt", local_path="/mnt/cluster_storage/model.pt", device="cpu")

Then we can run inference to generate predicted labels


In [None]:
output = classifier({"image": np.random.rand(1, 1, 28, 28).astype(np.float32)})  # Example input (B, C, H, W)
output["predicted_label"]  # Should be a numpy array with the predicted label

Now, if we want to migrate to an online inference setting, we can transform this into a Ray Serve Deployment by applying the `@serve.deployment` decorator.



In [None]:
@serve.deployment() # this is the decorator to add
class OnlineMNISTClassifier:
    # same code as MNISTClassifier.__init__
    def __init__(self, remote_path: str, local_path: str, device: str):
        subprocess.run(f"aws s3 cp {remote_path} {local_path}", shell=True, check=True)
        
        self.device = device
        self.model = torch.jit.load(local_path).to(device).eval()

    async def __call__(self, request: Request) -> dict[str, Any]:  # __call__ now takes a Request object
        batch = json.loads(await request.json())  # the client sends a JSON-encoded string, so the body is decoded a second time here
        return await self.predict(batch)
    
    # same code as MNISTClassifier.predict
    async def predict(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        images = torch.tensor(batch["image"]).float().to(self.device)

        with torch.no_grad():
            logits = self.model(images).cpu().numpy()

        batch["predicted_label"] = np.argmax(logits, axis=1)
        return batch

We have now defined our Ray Serve deployment


In [None]:
OnlineMNISTClassifier

We can now build an Application using the `OnlineMNISTClassifier` deployment


In [None]:
mnist_app = OnlineMNISTClassifier.bind(remote_path="s3://anyscale-public-materials/ray-ai-libraries/mnist/model/model.pt", local_path="/mnt/cluster_storage/model.pt", device="cpu")
mnist_app

<div class="alert alert-block alert-info">

**Note:** `.bind` is a method that takes in the arguments to pass to the Deployment constructor.

</div>

We can then run the application 


In [None]:
mnist_app_handle = serve.run(mnist_app, name='mnist_classifier', blocking=False)
mnist_app_handle

We can test it as an HTTP endpoint


In [None]:
images = np.random.rand(2, 1, 28, 28).tolist()
json_request = json.dumps({"image": images})
response = requests.post("http://localhost:8000/", json=json_request)
response.json()["predicted_label"]
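The `json.loads` call in `OnlineMNISTClassifier.__call__` is needed because the request above passes an already-serialized string to `json=`, so the body ends up JSON-encoded twice. A minimal standard-library sketch of that effect, using a tiny made-up payload:

```python
import json

payload = {"image": [[0.0, 1.0]]}   # small stand-in for the image batch

encoded_once = json.dumps(payload)         # what json=payload would put on the wire
encoded_twice = json.dumps(encoded_once)   # what json=json.dumps(payload) puts on the wire

# First decode: mirrors `await request.json()` on the server side.
first_parse = json.loads(encoded_twice)
# Second decode: mirrors the extra json.loads(...) in __call__.
second_parse = json.loads(first_parse)

print(type(first_parse).__name__, type(second_parse).__name__)
```

The first decode yields a string and only the second yields the dictionary. Passing the dictionary directly (`json={"image": images}`) and dropping the server-side `json.loads` is the simpler pairing, but both sides have to change together.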

We can also call the deployment from Python through its `DeploymentHandle`, which skips the HTTP layer


In [None]:
batch = {"image": np.random.rand(10, 1, 28, 28)}
response = await mnist_app_handle.predict.remote(batch)
response["predicted_label"]

For more details on the recommended development workflow, read the [docs here](https://docs.ray.io/en/latest/serve/advanced-guides/dev-workflow.html#development-workflow)

For unit testing and debugging, Ray Serve provides a local testing mode. For more details, see the [docs here](https://docs.ray.io/en/latest/serve/advanced-guides/dev-workflow.html#local-testing-mode)

## 3. Key concepts in Ray Serve

Serve is a framework for serving ML applications. 

### Applications

Here is a high-level overview of the architecture of a Ray Serve Application.

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/serve_architecture.png' width=700/>

A Ray Serve cluster is made up of one or more Applications.

An Application is composed of one or more Deployments that work together. Key characteristics:
- Applications are coarse-grained units of functionality
- They can be **independently upgraded** without affecting other applications running on the same cluster
- They provide isolation and separate deployment lifecycles

### Deployments

A Deployment is the fundamental building block in Ray Serve's architecture.

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/deployment.png' width=600/>

Deployments enable:
- Separation of concerns (e.g., different models, business logic, data transformations)
- **Independent scaling**, including autoscaling capabilities
- Multiple replicas for handling concurrent requests


### Replicas
Each Replica is a worker process (Ray actor) with its own request processing queue. Replicas offer flexible configuration options:

- Specify its own hardware and resource requirements (e.g., GPUs)
- Specify its own runtime environments (e.g., libraries)
- Maintain state (e.g., models)

This architecture provides a clean separation of concerns while enabling high scalability and efficient resource utilization.
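These deployment-level and replica-level settings are passed as keyword arguments to `@serve.deployment`. The sketch below writes one possible set of options as a plain dictionary so that it runs without a cluster; the specific values, the pinned package, and the `describe` helper are illustrative assumptions rather than recommendations.

```python
# Options that could be unpacked into the decorator:
#   @serve.deployment(**embedding_options)
embedding_options = {
    # Deployment level: scale between 1 and 4 replicas based on load.
    "autoscaling_config": {"min_replicas": 1, "max_replicas": 4},
    # Replica level: resources and environment requested by each Ray actor.
    "ray_actor_options": {
        "num_cpus": 1,
        "num_gpus": 0.5,
        "runtime_env": {"pip": ["sentence-transformers"]},
    },
}


def describe(options: dict) -> str:
    """Summarize the replica range and per-replica resources as text."""
    scaling = options["autoscaling_config"]
    actor = options["ray_actor_options"]
    return (
        f"{scaling['min_replicas']}-{scaling['max_replicas']} replicas, "
        f"each with {actor['num_cpus']} CPU and {actor['num_gpus']} GPU"
    )


summary = describe(embedding_options)
print(summary)
```

Keeping the options in one dictionary makes it easy to see which settings belong to the deployment as a whole (`autoscaling_config`) and which apply to every replica (`ray_actor_options`).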

## 4. Model composition with Ray Serve

Below is a sample Serve instance that we will build to better understand Ray Serve and its architecture

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-serve-deep-dive/Serve+architecture+-+instance.png" width="800px" loading="lazy">

We can break down the above diagram into the following steps:
1. HTTP or gRPC requests come in
2. The load balancer routes the request to one of the cluster nodes
3. The request is handled by a proxy
4. The proxy routes the request to the relevant deployment replica
5. The replica processes the request and returns the response
6. The proxy returns the response to the client


### 4.1 Building out an ingress deployment

Let's first build out the ingress deployment


In [None]:
@serve.deployment
class DeploymentA:
    def __init__(
        self, deployment_b: DeploymentHandle, deployment_c: DeploymentHandle
    ) -> None:
        # Keep the handles passed in through .bind()
        self.deployment_b = deployment_b
        self.deployment_c = deployment_c

    # __call__ corresponds to post("/") endpoint
    async def __call__(self, request: Request):
        payload = await request.json()  # parse starlette Request
        text = payload["text"]
        payload_language = detect(text)

        if payload_language == "en":
            # English text is embedded by DeploymentB
            await self.deployment_b.run.remote(text)
            return "English embedding done"
        elif payload_language == "de":
            # German text is embedded by DeploymentC
            await self.deployment_c.run.remote(text)
            return "German embedding done"
        else:
            return "Not supported language"

### 4.2 Integrating with FastAPI

Ray Serve can be integrated with FastAPI to provide:
- HTTP routing
- Pydantic model validation
- OpenAPI documentation

To integrate a Deployment with FastAPI, we can use the `@serve.ingress` decorator to designate a FastAPI app as the entrypoint for HTTP requests to our Serve application.


In [None]:
fastapi_app = fastapi.FastAPI()


class Payload(BaseModel):
    text: str


@serve.deployment
@serve.ingress(fastapi_app)
class DeploymentA:
    def __init__(self, deployment_b: DeploymentHandle, deployment_c: DeploymentHandle) -> None:
        self.deployment_b = deployment_b
        self.deployment_c = deployment_c

    @fastapi_app.post("/predict")
    async def run(self, payload: Payload):
        logger = logging.getLogger("ray.serve")
        logger.info(f"{payload=}")
        payload_language = detect(payload.text)
        logger.info(f"Detected language: {payload_language}")

        if payload_language == "en":
            await self.deployment_b.run.remote(payload.text)
            return "English embedding done"

        elif payload_language == "de":
            await self.deployment_c.run.remote(payload.text)
            return "German embedding done"

        else:
            raise fastapi.HTTPException(
                status_code=400,
                detail="Unsupported language detected. Only English and German are supported."
            )

### 4.3 Resource specification

Then we can build out `DeploymentB` and `DeploymentC`. In this example, each deployment loads a different model and declares its own hardware requirements through `ray_actor_options`.


In [None]:
@serve.deployment(ray_actor_options={"num_gpus": 1})
class DeploymentB:
    def __init__(self) -> None:
        self.model = SentenceTransformer("intfloat/multilingual-e5-large", trust_remote_code=True, device="cuda")

    def run(self, input: str) -> list[float]:
        return self.model.encode(input)


@serve.deployment(ray_actor_options={"num_gpus": 1})
class DeploymentC:
    def __init__(self) -> None:
        self.model = SentenceTransformer("intfloat/multilingual-e5-small", trust_remote_code=True, device="cuda")

    def run(self, input: str) -> list[float]:
        return self.model.encode(input)


<div class="alert alert-info">

Deployment boundaries allow for independent scaling but introduce latency overhead from serialization/deserialization and data transfer. If both deployments require similar resources, it may be better to fuse them into a single deployment.

</div>
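To get a rough feel for the serialization part of that overhead, the sketch below round-trips an embedding-sized payload through `pickle`. This is a stand-in only: Ray uses its own serialization path and object store, and the 1024-element vector length is an assumed size chosen for illustration.

```python
import pickle
import time

# Assumed payload: one embedding vector of 1024 floats.
embedding = [float(i) / 1024 for i in range(1024)]

start = time.perf_counter()
blob = pickle.dumps(embedding)    # serialize before crossing a process boundary
restored = pickle.loads(blob)     # deserialize on the receiving side
elapsed_s = time.perf_counter() - start

payload_bytes = len(blob)
print(f"{payload_bytes} bytes, round trip took {elapsed_s * 1e6:.0f} microseconds")
```

The cost per call is small for a single vector, but it is paid on every hop between deployments, which is why fusing deployments with similar resource needs can be worthwhile.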

Continue to build out the application


In [None]:
serve_app = DeploymentA.bind(
    deployment_b=DeploymentB.bind(),
    deployment_c=DeploymentC.bind()
)
serve_app

#### 4.3.1 Fractional GPU Usage

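Fractional GPU usage allows for more efficient use of GPU resources by allowing multiple replicas to share a single GPU. Ray treats `num_gpus` as a logical resource for scheduling and does not enforce GPU memory limits, so the models sharing a GPU must fit in its memory together.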





In [None]:
@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class DeploymentB:
    def __init__(self) -> None:
        self.model = SentenceTransformer("intfloat/multilingual-e5-large", trust_remote_code=True, device="cuda")

    def run(self, input: str) -> list[float]:
        return self.model.encode(input)


@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class DeploymentC:
    def __init__(self) -> None:
        self.model = SentenceTransformer("intfloat/multilingual-e5-small", trust_remote_code=True, device="cuda")

    def run(self, input: str) -> list[float]:
        return self.model.encode(input)
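With `num_gpus` set to 0.5 on both deployments, one replica of each can be scheduled onto the same GPU. A rough sketch of that accounting follows; `replicas_per_gpu` and `gpus_needed` are hypothetical helpers written for this illustration, and they model only the logical resource arithmetic, not Ray's scheduler.

```python
from fractions import Fraction
import math


def replicas_per_gpu(num_gpus_per_replica: float) -> int:
    """How many replicas with this fractional request fit on one GPU."""
    request = Fraction(str(num_gpus_per_replica))  # exact arithmetic, avoids float drift
    if not 0 < request <= 1:
        raise ValueError("expected a fractional request in (0, 1]")
    return math.floor(1 / request)


def gpus_needed(requests: list) -> int:
    """Lower bound on the number of GPUs needed for a set of requests."""
    total = sum(Fraction(str(r)) for r in requests)
    return math.ceil(total)


shared = replicas_per_gpu(0.5)
needed_before = gpus_needed([1, 1])       # DeploymentB and DeploymentC at num_gpus=1
needed_after = gpus_needed([0.5, 0.5])    # both at num_gpus=0.5
print(shared, needed_before, needed_after)
```

Under these assumptions the two-model application drops from two GPUs to one, provided both models actually fit in that GPU's memory.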


Let's rebuild the application 


In [None]:
serve_app = DeploymentA.bind(
    deployment_b=DeploymentB.bind(),
    deployment_c=DeploymentC.bind()
)
serve_app

Finally, let's run the application:


In [None]:
serve_app_handle = serve.run(serve_app, route_prefix="/composed")

We can test the running application via HTTP requests:


In [None]:
response = requests.post("http://localhost:8000/composed/predict", json={"text": "hello there"})
print(response.json())

In [None]:
response = requests.post("http://localhost:8000/composed/predict", json={"text": "Ein, zwei, drei, vier"})
print(response.json())

### 4.4 Activity: Extend the model composition to other languages

Here is what you need to do:

1. Create a new model deployment `DeploymentD` to handle French text
    1. `DeploymentD` should use the `intfloat/multilingual-e5-small` model
2. Define a new `DeploymentAV2` which forwards French text to `DeploymentD`
3. Run the application (use `route_prefix="/new"`)
4. Test it by sending a request with French text (e.g. "Quelle est la capitale de la France?")

<div class="alert alert-block alert-info">

<details>
<summary>Click to view hints</summary>

```python
new_fastapi_app = fastapi.FastAPI()


# Hint: Create a new model deployment DeploymentD
@serve.deployment(...)
class DeploymentD:
    def __init__(self):
        self.model = ...

    def run(self, input: str) -> list[float]:
        ...

# Hint: Define a new `DeploymentAV2` which forwards French text to `DeploymentD`
@serve.deployment(...)
@serve.ingress(new_fastapi_app)
class DeploymentAV2:
    def __init__(self, ...):
        ...

    @new_fastapi_app.post("/predict")
    async def run(self, payload: Payload):
        ...

# Hint: Run the application
serve_app = DeploymentAV2.bind(...)
serve.run(serve_app, route_prefix="/new")

# Hint: Test the application with French text
requests.post("http://localhost:8000/new/predict", json=...)
```

</details>
</div>

<div class="alert alert-block alert-info">

<details>
<summary>Click to view the solution</summary>

```python
new_fastapi_app = fastapi.FastAPI()

# Create a new model deployment DeploymentD
@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class DeploymentD:
    def __init__(self) -> None:
        self.model = SentenceTransformer("intfloat/multilingual-e5-small", trust_remote_code=True, device="cuda")

    def run(self, input: str) -> list[float]:
        return self.model.encode(input)

# Define a new DeploymentAV2 so it forwards French text to DeploymentD
@serve.deployment
@serve.ingress(new_fastapi_app)
class DeploymentAV2:
    def __init__(self, deployment_b: DeploymentHandle, deployment_c: DeploymentHandle, deployment_d: DeploymentHandle) -> None:
        self.deployment_b = deployment_b
        self.deployment_c = deployment_c
        self.deployment_d = deployment_d

    @new_fastapi_app.post("/predict")
    async def run(self, payload: Payload):
        logger = logging.getLogger("ray.serve")
        logger.info(f"{payload=}")
        payload_language = detect(payload.text)
        
        if payload_language == "en":
            await self.deployment_b.run.remote(payload.text)
            return "English embedding done"

        elif payload_language == "de":
            await self.deployment_c.run.remote(payload.text)
            return "German embedding done"

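        elif payload_language == "fr":
            await self.deployment_d.run.remote(payload.text)
            return "French embedding done"

        else:
            # Reject other languages explicitly instead of returning null
            raise fastapi.HTTPException(
                status_code=400,
                detail="Unsupported language detected. Only English, German, and French are supported."
            )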

# Run the application
serve_app = DeploymentAV2.bind(
    deployment_b=DeploymentB.bind(),
    deployment_c=DeploymentC.bind(),
    deployment_d=DeploymentD.bind(),
)
serve.run(serve_app, route_prefix="/new")

# Test the application with French text
print(requests.post("http://localhost:8000/new/predict", json={"text": "Quelle est la capitale de la France ?"}).json())
```

</details>

</div>

In [None]:
# cleanup 
!serve shutdown -y