# Scalable, Flexible Serving with Ray Serve

<img src='images/servelogo.svg' width=400>

Earlier, we saw a server that performed a simple request/response operation.

While it's nice to have that feature without deploying additional software, that pattern is fairly well understood and easily scaled with existing technology.

Ray Serve provides more value as we move to more complex patterns, such as:
- stateful services
- batching
- composition
- integration with model registries

Let's start with a simple service deployment that could work with just about any model or logic. This will also help "take some of the magic away" from that 3-line serve demo featuring an `XGBoostPredictor` and `PredictorDeployment`.

In [3]:
from starlette.requests import Request
from typing import Dict
import json
import ray
from ray import serve

class MyModel:
    def __init__(self, demo_param: int):
        self._demo_param = demo_param
        
    def predict(self, data):
        return data * self._demo_param

@serve.deployment(route_prefix="/", num_replicas=2)
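class GenericDeployment:
    def __init__(self, demo_param: int):
        # Runs once per replica, so the model is built once, not per request.
        self._model = MyModel(demo_param)

    async def __call__(self, request: Request) -> Dict:
        # The client below posts a JSON-encoded *string*, so request.json()
        # returns a str that has to be decoded a second time to get a dict.
        data = await request.json()
        data = json.loads(data)
        return {"result": self._model.predict(data["input"])}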

serve.run(GenericDeployment.bind(demo_param=42))

2022-12-27 09:51:25,742 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8267
(ServeController pid=76357) INFO 2022-12-27 09:51:27,827 controller 76357 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-8f2863c93327a10a49eab8494bb169b1fc40c2a17807c4534e395f6d' on node '8f2863c93327a10a49eab8494bb169b1fc40c2a17807c4534e395f6d' listening on '127.0.0.1:8000'
(HTTPProxyActor pid=76360) INFO:     Started server process [76360]
(ServeController pid=76357) INFO 2022-12-27 09:51:28,857 controller 76357 deployment_state.py:1310 - Adding 2 replicas to deployment 'GenericDeployment'.


RayServeSyncHandle(deployment='GenericDeployment')

In [4]:
sample_json = '{"input":7}'
sample_json

'{"input":7}'
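Note that `sample_json` is a string, not a dict. When the next cell passes it through `json=`, it gets JSON-encoded a second time, which is why the deployment calls `json.loads` on the value returned by `request.json()`. Here is a small standard-library sketch of that round trip. It assumes `json.dumps` is a fair stand-in for what `requests` does with the `json=` argument, and the `wire_body`, `first_decode`, and `second_decode` names are just for illustration.

```python
import json

sample_json = '{"input":7}'  # a str, exactly as in the cell above

# Assumption: this mirrors what requests.post(..., json=sample_json) sends,
# i.e. the JSON encoding of the string itself (a quoted, escaped string).
wire_body = json.dumps(sample_json)

# First decode: plays the role of `await request.json()` in the deployment.
first_decode = json.loads(wire_body)

# Second decode: the extra `json.loads(data)` in GenericDeployment.__call__.
second_decode = json.loads(first_decode)

print(type(first_decode).__name__, type(second_decode).__name__)
print(second_decode["input"] * 42)
```

Posting a dict instead (`json={"input": 7}`) would make the second decode unnecessary, but then the `json.loads` call in the deployment would have to be removed as well.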

In [5]:
import requests

print(requests.post("http://localhost:8000/", json = sample_json).json())

{'result': 294}


(HTTPProxyActor pid=76360) INFO 2022-12-27 09:51:34,599 http_proxy 127.0.0.1 http_proxy.py:361 - POST / 200 4.1ms
(ServeReplica:GenericDeployment pid=76362) INFO 2022-12-27 09:51:34,598 GenericDeployment GenericDeployment#kToazM replica.py:505 - HANDLE __call__ OK 0.2ms


OK, that illustrates the basic framework pattern.

Next let's look at some more interesting features that can power more complex use cases.
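The first of these is state. A deployment class keeps ordinary instance attributes, and each replica is its own instance, so state lives per replica rather than per deployment. As a rough mental model, here is a plain-Python sketch with no Ray involved. `LocalCounter` and the round-robin `router` are illustrative assumptions, not how Serve actually routes requests.

```python
from itertools import cycle


class LocalCounter:
    """Plain-Python stand-in for one replica of a stateful deployment."""

    def __init__(self):
        self.count = 0

    def __call__(self):
        self.count += 1
        return {"count": self.count}


# Two independent "replicas", each with its own count.
replicas = [LocalCounter(), LocalCounter()]

# Assumption: simple round-robin routing, chosen only to make the effect visible.
router = cycle(replicas)

responses = [next(router)() for _ in range(4)]
print(responses)
```

With two replicas, four calls give each instance a count of 2 rather than one shared count of 4. The `Counter` deployment in the next cell runs with a single replica, so its count simply goes up by one per call.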

In [6]:
@serve.deployment
class Counter:
    def __init__(self):
        self.count = 0

    def __call__(self, *args):
        self.count += 1
        return {"count": self.count}

Counter.deploy()

(ServeController pid=76357) INFO 2022-12-27 09:52:38,044 controller 76357 deployment_state.py:1310 - Adding 1 replica to deployment 'Counter'.


In [9]:
import requests

requests.get("http://127.0.0.1:8000/Counter").json()

{'count': 3}

(HTTPProxyActor pid=76360) INFO 2022-12-27 09:53:10,139 http_proxy 127.0.0.1 http_proxy.py:361 - GET /Counter 200 3.3ms
(ServeReplica:Counter pid=76371) INFO 2022-12-27 09:53:10,137 Counter Counter#SoYBaz replica.py:505 - HANDLE __call__ OK 0.1ms


We can also invoke these services directly from other applications or services in our Ray application, using a deployment handle instead of HTTP:

In [10]:
ray.get(Counter.get_handle().remote())

{'count': 4}

(ServeReplica:Counter pid=76371) INFO 2022-12-27 09:53:43,638 Counter Counter#SoYBaz replica.py:505 - HANDLE __call__ OK 0.1ms


Many models achieve much better per-record performance when evaluating batches of records.

We can use Ray Serve to build that batching layer.

In [11]:
import numpy as np
import time

@serve.deployment(route_prefix="/adder")
class BatchAdder:
    @serve.batch(max_batch_size=4)
    async def handle_batch(self, numbers):
        input_array = np.array(numbers)
        print("Our input array has shape:", input_array.shape)
        # Sleep for 200ms to stand in for CPU-intensive computation in a
        # real model. Note that time.sleep blocks the replica's event loop,
        # much as CPU-bound work would.
        time.sleep(0.2)
        output_array = input_array + 1
        return output_array.astype(int).tolist()

    async def __call__(self, request):
        return await self.handle_batch(int(request.query_params["number"]))

In [12]:
BatchAdder.deploy()

(ServeController pid=76357) INFO 2022-12-27 09:53:58,649 controller 76357 deployment_state.py:1310 - Adding 1 replica to deployment 'BatchAdder'.


In [13]:
def make_request(i):
    return requests.get("http://localhost:8000/adder?number={}".format(i)).text

In [14]:
make_request(17)

(ServeReplica:BatchAdder pid=76384) Our input array has shape: (1,)


'18'

(HTTPProxyActor pid=76360) INFO 2022-12-27 09:54:04,015 http_proxy 127.0.0.1 http_proxy.py:361 - GET /adder 200 209.7ms
(ServeReplica:BatchAdder pid=76384) INFO 2022-12-27 09:54:04,013 BatchAdder BatchAdder#LKmjvF replica.py:505 - HANDLE __call__ OK 204.2ms


In [15]:
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor()

results = executor.map(make_request, range(0, 20, 2))

(ServeReplica:BatchAdder pid=76384) Our input array has shape: (1,)
(ServeReplica:BatchAdder pid=76384) Our input array has shape: (1,)


(ServeReplica:BatchAdder pid=76384) INFO 2022-12-27 09:54:08,075 BatchAdder BatchAdder#LKmjvF replica.py:505 - HANDLE __call__ OK 201.5ms
(HTTPProxyActor pid=76360) INFO 2022-12-27 09:54:08,283 http_proxy 127.0.0.1 http_proxy.py:361 - GET /adder 200 412.0ms
(HTTPProxyActor pid=76360) INFO 2022-12-27 09:54:08,285 http_proxy 127.0.0.1 http_proxy.py:361 - GET /adder 200 413.0ms
(ServeReplica:BatchAdder pid=76384) INFO 2022-12-27 09:54:08,281 BatchAdder BatchAdder#LKmjvF replica.py:505 - HANDLE __call__ OK 206.9ms


(ServeReplica:BatchAdder pid=76384) Our input array has shape: (1,)


(HTTPProxyActor pid=76360) INFO 2022-12-27 09:54:08,491 http_proxy 127.0.0.1 http_proxy.py:361 - GET /adder 200 618.5ms
(ServeReplica:BatchAdder pid=76384) INFO 2022-12-27 09:54:08,488 BatchAdder BatchAdder#LKmjvF replica.py:505 - HANDLE __call__ OK 203.8ms


(ServeReplica:BatchAdder pid=76384) Our input array has shape: (4,)
(ServeReplica:BatchAdder pid=76384) Our input array has shape: (3,)


(ServeReplica:BatchAdder pid=76384) INFO 2022-12-27 09:54:08,897 BatchAdder BatchAdder#LKmjvF replica.py:505 - HANDLE __call__ OK 408.9ms
(ServeReplica:BatchAdder pid=76384) INFO 2022-12-27 09:54:08,898 BatchAdder BatchAdder#LKmjvF replica.py:505 - HANDLE __call__ OK 409.0ms
(ServeReplica:BatchAdder pid=76384) INFO 2022-12-27 09:54:08,898 BatchAdder BatchAdder#LKmjvF replica.py:505 - HANDLE __call__ OK 409.0ms
(ServeReplica:BatchAdder pid=76384) INFO 2022-12-27 09:54:08,898 BatchAdder BatchAdder#LKmjvF replica.py:505 - HANDLE __call__ OK 409.0ms
(ServeReplica:BatchAdder pid=76384) INFO 2022-12-27 09:54:08,898 BatchAdder BatchAdder#LKmjvF replica.py:505 - HANDLE __call__ OK 409.0ms
(ServeReplica:BatchAdder pid=76384) INFO 2022-12-27 09:54:08,898 BatchAdder BatchAdder#LKmjvF replica.py:505 - HANDLE __call__ OK 409.0ms
(ServeReplica:BatchAdder pid=76384) INFO 2022-12-27 09:54:08,899 BatchAdder Batch

In [16]:
list(results)

['1', '3', '5', '7', '9', '11', '13', '15', '17', '19']
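To take some of the magic out of `@serve.batch`, here is a minimal asyncio sketch of the general idea behind dynamic batching: callers submit single items, a background task gathers whatever arrives within a short window (up to a maximum size), calls the batch function once, and hands each caller its own result. This is not Ray's implementation. `MicroBatcher`, `wait_timeout_s`, `add_one`, and `seen_batch_sizes` are hypothetical names made up for the sketch.

```python
import asyncio


class MicroBatcher:
    """Collects single items and passes them to `batch_fn` as one list.

    Construct it inside a running event loop (see `demo` below).
    """

    def __init__(self, batch_fn, max_batch_size=4, wait_timeout_s=0.01):
        self._batch_fn = batch_fn
        self._max_batch_size = max_batch_size
        self._wait_timeout_s = wait_timeout_s
        self._queue = asyncio.Queue()
        self._worker = None

    async def submit(self, item):
        # Start the background worker lazily on first use.
        if self._worker is None:
            self._worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future  # resolved once the item's batch has been processed

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one item is waiting.
            batch = [await self._queue.get()]
            deadline = loop.time() + self._wait_timeout_s
            # Keep filling the batch until it is full or the window closes.
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(
                        await asyncio.wait_for(self._queue.get(), remaining)
                    )
                except asyncio.TimeoutError:
                    break
            results = await self._batch_fn([item for item, _ in batch])
            # Hand each caller the result at its own position in the batch.
            for (_, future), result in zip(batch, results):
                future.set_result(result)


seen_batch_sizes = []


async def add_one(numbers):
    seen_batch_sizes.append(len(numbers))
    return [n + 1 for n in numbers]


async def demo():
    batcher = MicroBatcher(add_one, max_batch_size=4)
    return await asyncio.gather(*(batcher.submit(i) for i in range(0, 20, 2)))


results = asyncio.run(demo())
print(results)
print(seen_batch_sizes)  # exact sizes depend on timing, never more than 4
```

The trade-off is the usual one: a longer wait window gives fuller batches but adds latency for the first item in each batch. `@serve.batch` accepts a `batch_wait_timeout_s` argument alongside `max_batch_size` for this kind of tuning.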

__Featurization/Model Composition__

Our pipeline will be structured as follows:
- Input comes in, and the composed model sends it to `model_one`.
- `model_one` outputs a random number between 0 and 1. If the value is
  greater than 0.5, the data is also sent to `model_two`.
- Otherwise, the response is returned to the user without calling `model_two`.
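Before wiring this up with deployments, the routing rule itself can be sketched in plain asyncio, with the scoring function injected so that both branches can be exercised deterministically. Nothing here is Ray API: `route`, `low_score`, `high_score`, `second_stage`, and `calls` are hypothetical names for the sketch.

```python
import asyncio


async def route(data, score_fn, second_stage, threshold=0.5):
    """Call `second_stage` only when the first stage's score exceeds `threshold`."""
    score = await score_fn(data)
    if score > threshold:
        await second_stage(data)
        return {"model_used": 2, "score": score}
    return {"model_used": 1, "score": score}


calls = []  # records every input that reaches the second stage


async def low_score(data):
    return 0.25


async def high_score(data):
    return 0.75


async def second_stage(data):
    calls.append(data)
    return data


low = asyncio.run(route(b"hey!", low_score, second_stage))
high = asyncio.run(route(b"hey!", high_score, second_stage))
print(low, high, calls)
```

In the Serve version that follows, the two stages are separate deployments and the awaits go through deployment handles, but the branch logic is the same.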

In [17]:
from random import random

@serve.deployment(route_prefix='/one')  # route is only needed for the standalone HTTP demo below
def model_one(data):
    print("Model 1 called with data ", data)
    return random()

model_one.deploy()

@serve.deployment
def model_two(data):
    print("Model 2 called with data ", data)
    return data

model_two.deploy()

(ServeController pid=76357) INFO 2022-12-27 09:54:32,436 controller 76357 deployment_state.py:1310 - Adding 1 replica to deployment 'model_one'.
(ServeController pid=76357) INFO 2022-12-27 09:54:34,419 controller 76357 deployment_state.py:1310 - Adding 1 replica to deployment 'model_two'.


In [18]:
resp = requests.get("http://127.0.0.1:8000/one", data="hey!") # stateless demo only
resp.json()

0.822915679518294

(HTTPProxyActor pid=76360) INFO 2022-12-27 09:54:37,853 http_proxy 127.0.0.1 http_proxy.py:361 - GET /one 200 5.3ms
(ServeReplica:model_one pid=76390) INFO 2022-12-27 09:54:37,851 model_one model_one#LQThHy replica.py:505 - HANDLE __call__ OK 0.2ms


(ServeReplica:model_one pid=76390) Model 1 called with data  <starlette.requests.Request object at 0x7fbfd8a24df0>


In [19]:
# max_concurrent_queries is optional. By default, if you pass in an async
# function, Ray Serve sets the limit to a high number.
@serve.deployment(max_concurrent_queries=10, route_prefix="/composed")
class ComposedModel:
    def __init__(self):
        self.model_one = model_one.get_handle()
        self.model_two = model_two.get_handle()

    # This method can be called concurrently!
    async def __call__(self, starlette_request):
        data = await starlette_request.body()

        score = await self.model_one.remote(data=data)
        if score > 0.5:
            # model_two's return value is not used in this demo; we only
            # report that it was called.
            await self.model_two.remote(data=data)
            result = {"model_used": 2, "score": score}
        else:
            result = {"model_used": 1, "score": score}

        return result

ComposedModel.deploy()

(ServeController pid=76357) INFO 2022-12-27 09:55:02,462 controller 76357 deployment_state.py:1310 - Adding 1 replica to deployment 'ComposedModel'.


In [20]:
for _ in range(5):
    resp = requests.get("http://127.0.0.1:8000/composed", data="hey!")
    print(resp.json())

{'model_used': 1, 'score': 0.4401545829688882}
{'model_used': 1, 'score': 0.03534013582644946}
{'model_used': 2, 'score': 0.7653627795826481}
{'model_used': 1, 'score': 0.42902257419633383}
{'model_used': 1, 'score': 0.14861545856688863}
(ServeReplica:model_one pid=76390) Model 1 called with data  b'hey!'
(ServeReplica:model_one pid=76390) Model 1 called with data  b'hey!'
(ServeReplica:model_one pid=76390) Model 1 called with data  b'hey!'
(ServeReplica:model_one pid=76390) Model 1 called with data  b'hey!'
(ServeReplica:model_one pid=76390) Model 1 called with data  b'hey!'
(ServeReplica:model_two pid=76391) Model 2 called with data  b'hey!'


(HTTPProxyActor pid=76360) INFO 2022-12-27 09:55:06,077 http_proxy 127.0.0.1 http_proxy.py:361 - GET /composed 200 9.3ms
(HTTPProxyActor pid=76360) INFO 2022-12-27 09:55:06,086 http_proxy 127.0.0.1 http_proxy.py:361 - GET /composed 200 6.0ms
(HTTPProxyActor pid=76360) INFO 2022-12-27 09:55:06,098 http_proxy 127.0.0.1 http_proxy.py:361 - GET /composed 200 9.3ms
(HTTPProxyActor pid=76360) INFO 2022-12-27 09:55:06,106 http_proxy 127.0.0.1 http_proxy.py:361 - GET /composed 200 4.9ms
(HTTPProxyActor pid=76360) INFO 2022-12-27 09:55:06,113 http_proxy 127.0.0.1 http_proxy.py:361 - GET /composed 200 5.1ms
(ServeReplica:model_one pid=76390) INFO 2022-12-27 09:55:06,074 model_one model_one#LQThHy replica.py:505 - HANDLE __call__ OK 0.2ms
(ServeReplica:model_one pid=76390) INFO 2022-12-27 09:55:06,083 model_one model_one#LQThHy replica.py:505 - HANDLE __call__ OK 0.3ms
(ServeReplica:model_one pid=7

## Loading models

Commonly, we want to load a model once per process, not on every request.

This might be because the model is large or expensive to load, or because we're retrieving it from another system, such as a model registry or model database, and want to minimize traffic against that system while caching the model locally for performance.

With Ray Serve, the pattern is to load or create the model in the deployment's constructor (`__init__`), which runs once per replica, assign it to an instance variable, and then use that instance variable as needed for prediction. An example is shown at https://docs.ray.io/en/latest/serve/ml-models.html#integration-with-model-registries
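As a rough illustration of the load-once pattern, here is a plain-Python sketch without Ray. `load_model_from_registry`, `LOAD_CALLS`, and `RegistryBackedService` are hypothetical stand-ins; in a real deployment the class would carry `@serve.deployment` and the loader would call whatever registry client you use.

```python
LOAD_CALLS = 0  # counts how often the (pretend) expensive load happens


def load_model_from_registry(name):
    """Hypothetical stand-in for an expensive download from a model registry."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    # The "model" here is just a function that doubles its input.
    return lambda x: x * 2


class RegistryBackedService:
    def __init__(self, model_name):
        # Load once, when the instance (the replica, in Serve terms) is created.
        self._model = load_model_from_registry(model_name)

    def __call__(self, value):
        # Every request reuses the already-loaded model.
        return {"result": self._model(value)}


service = RegistryBackedService("demo-model")
outputs = [service(v) for v in (1, 2, 3)]
print(outputs, "loads:", LOAD_CALLS)
```

Three requests, one load. Moving the `load_model_from_registry` call into `__call__` would trigger a load on every request, which is exactly the cost this pattern avoids.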

If you're loading very large models and want to improve performance further, there are a few cool Ray tricks explained in this article, where the authors speed up loading a ~420MB flavor of the BERT language model: https://medium.com/ibm-data-ai/how-to-load-pytorch-models-340-times-faster-with-ray-8be751a6944c

Just to pull back the curtain and take some of the magic away... what if our model didn't "plug right in" to a matching Predictor and Deployment class?

