# Scalable, Flexible Serving with Ray Serve

<img src='images/servelogo.svg' width=400>

Earlier, we saw a server that performed a simple request/response operation.

While it's nice to have that feature without deploying additional software, that pattern is fairly well understood and easily scaled with existing technology.

Ray Serve provides more value as we move to more complex patterns, such as:
- stateful services
- batching
- composition
- integration with model registries

Let's start with a simple service deployment that could work with just about any model or logic. This will also help "take some of the magic away" from that 3-line serve demo featuring an `XGBoostPredictor` and `PredictorDeployment`.

In [None]:
from starlette.requests import Request
from typing import Dict
import json
import ray
from ray import serve

class MyModel:
    def __init__(self, demo_param: int):
        self._demo_param = demo_param
        
    def predict(self, data):
        return data * self._demo_param

@serve.deployment(route_prefix="/", num_replicas=2)
class GenericDeployment:
    def __init__(self, demo_param: int):
        self._model = MyModel(demo_param)

    async def __call__(self, request: Request) -> Dict:
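        # The client below posts an already-serialized JSON string via `json=`,
        # so the body is double-encoded: request.json() yields a str, not a dict.
        data = await request.json()
        data = json.loads(data)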
        return { "result" : self._model.predict(data['input']) }

serve.run(GenericDeployment.bind(demo_param=42))

In [None]:
sample_json = '{"input":7}'
sample_json

In [None]:
import requests

print(requests.post("http://localhost:8000/", json = sample_json).json())

OK, that illustrates the framework pattern a little.
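
One detail in the example above is easy to miss: the service calls `json.loads` on a value that has already been through `request.json()`. The sketch below uses only the standard library (no Ray, no HTTP) to reproduce the encoding steps and show why two decodes are needed when the client passes a pre-serialized string to the `json=` argument. The variable names `wire_body`, `once`, and `twice` are illustrative only.

```python
import json

# Already a JSON-encoded string, exactly as in the client cell above.
sample_json = '{"input":7}'

# requests' `json=` argument serializes its value with json.dumps, so a str
# payload is encoded a second time, into a JSON string literal.
wire_body = json.dumps(sample_json)

# First decode (what `await request.json()` does on the server): yields a str.
once = json.loads(wire_body)

# Second decode (the explicit json.loads in __call__): yields the dict.
twice = json.loads(once)

print(type(once).__name__, type(twice).__name__, twice["input"] * 42)
```

Posting a dict instead (`json={"input": 7}`) would remove the need for the second decode, provided the server-side `json.loads` call is dropped at the same time.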

Next let's look at some more interesting features that can power more complex use cases.

In [None]:
@serve.deployment
class Counter:
    def __init__(self):
        self.count = 0

    def __call__(self, *args):
        self.count += 1
        return {"count": self.count}

Counter.deploy()

In [None]:
import requests

requests.get("http://127.0.0.1:8000/Counter").json()

We can also invoke these services directly from other applications or services in our Ray application.

In [None]:
ray.get(Counter.get_handle().remote())

Many models achieve much better per-record performance when evaluating batches of records.

We can use Ray Serve to build that batching layer.
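
To make the idea concrete before using it, here is a toy micro-batcher written with plain `asyncio`. It is a sketch of the general technique (collect individual calls, run one vectorized function over the group, hand each caller its own result) and is not how Ray Serve implements `@serve.batch`. The class `MicroBatcher`, its `wait_s` timeout, and the helper `add_one` are all hypothetical names introduced for illustration.

```python
import asyncio

class MicroBatcher:
    """Toy illustration of request batching; NOT Ray Serve's implementation."""

    def __init__(self, batch_fn, max_batch_size=4, wait_s=0.01):
        self._batch_fn = batch_fn      # takes a list, returns a list of equal length
        self._max = max_batch_size
        self._wait_s = wait_s          # how long a partial batch may wait
        self._pending = []             # (item, future) pairs awaiting a flush
        self._timer = None

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        self._pending.append((item, fut))
        if len(self._pending) >= self._max:
            self._flush()              # batch is full: run it now
        elif self._timer is None:
            self._timer = asyncio.create_task(self._flush_later())
        return await fut

    async def _flush_later(self):
        await asyncio.sleep(self._wait_s)
        self._timer = None
        self._flush()                  # timeout reached: run the partial batch

    def _flush(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        batch, self._pending = self._pending, []
        if not batch:
            return
        results = self._batch_fn([item for item, _ in batch])
        for (_, fut), res in zip(batch, results):
            fut.set_result(res)        # each caller receives only its own result

shapes = []

def add_one(batch):
    shapes.append(len(batch))          # record batch sizes for inspection
    return [x + 1 for x in batch]

async def main():
    batcher = MicroBatcher(add_one, max_batch_size=4)
    return await asyncio.gather(*(batcher.submit(i) for i in range(10)))

results = asyncio.run(main())
print(results, shapes)
```

Each caller still sees a single-item interface, while `add_one` only ever sees lists of at most `max_batch_size` items. The Ray Serve version below follows the same contract: the batched method receives a list and must return a list of the same length.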

In [None]:
import numpy as np
import time

@serve.deployment(route_prefix="/adder")
class BatchAdder:
    @serve.batch(max_batch_size=4)
    async def handle_batch(self, numbers):
        input_array = np.array(numbers)
        print("Our input array has shape:", input_array.shape)
        # Sleep for 200 ms to simulate the CPU-intensive work
        # a real model would do on the whole batch.
        time.sleep(0.2)
        output_array = input_array + 1
        return output_array.astype(int).tolist()

    async def __call__(self, request):
        return await self.handle_batch(int(request.query_params["number"]))

In [None]:
BatchAdder.deploy()

In [None]:
def make_request(i):
    return requests.get("http://localhost:8000/adder?number={}".format(i)).text

In [None]:
make_request(17)

In [None]:
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor()

results = executor.map(make_request, range(0, 20, 2))

In [None]:
list(results)

__Featurization/Model Composition__

Our pipeline will be structured as follows:
- Input comes in, and the composed model sends it to `model_one`.
- `model_one` outputs a random number between 0 and 1. If the value is
  greater than 0.5, the data is sent to `model_two`.
- Otherwise, the data is returned to the user.
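
The routing logic in this list can be sketched without any serving framework. The version below uses plain `async` functions as stand-ins for the two models. The `scorer` and `threshold` parameters, the `"output"` key, and the helpers `always_high` and `always_low` are assumptions added so that the branch taken can be controlled and observed; they are not part of the Ray Serve API.

```python
import asyncio
from random import random

async def model_one(data):
    # Stand-in scorer: ignores the data and returns a number in [0, 1).
    return random()

async def model_two(data):
    # Stand-in second stage: echoes the data back.
    return data

async def composed(data, scorer=model_one, threshold=0.5):
    score = await scorer(data)
    if score > threshold:
        # High score: forward the data to the second model.
        output = await model_two(data)
        return {"model_used": 2, "score": score, "output": output}
    # Low score: answer directly without calling the second model.
    return {"model_used": 1, "score": score}

# Deterministic scorers (hypothetical) to exercise both branches.
async def always_high(data):
    return 0.9

async def always_low(data):
    return 0.1

high = asyncio.run(composed("hey!", scorer=always_high))
low = asyncio.run(composed("hey!", scorer=always_low))
print(high)
print(low)
```

In the Ray Serve version, the direct function calls become calls through deployment handles, so each model can be scaled and placed independently of the composing service.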

In [None]:
from random import random

@serve.deployment(route_prefix='/one') # remove /refactor this (route) for stateless?
def model_one(data):
    print("Model 1 called with data ", data)
    return random()

model_one.deploy()

@serve.deployment
def model_two(data):
    print("Model 2 called with data ", data)
    return data

model_two.deploy()

In [None]:
resp = requests.get("http://127.0.0.1:8000/one", data="hey!") # stateless demo only
resp.json()

In [None]:
# max_concurrent_queries is optional. By default, if you pass in an async
# function, Ray Serve sets the limit to a high number.
@serve.deployment(max_concurrent_queries=10, route_prefix="/composed")
class ComposedModel:
    def __init__(self):
        self.model_one = model_one.get_handle()
        self.model_two = model_two.get_handle()

    # This method can be called concurrently!
    async def __call__(self, starlette_request):
        data = await starlette_request.body()

        score = await self.model_one.remote(data=data)
        if score > 0.5:
            # model_two's output is discarded in this demo; only the routing is reported.
            await self.model_two.remote(data=data)
            result = {"model_used": 2, "score": score}
        else:
            result = {"model_used": 1, "score": score}

        return result

ComposedModel.deploy()

In [None]:
for _ in range(5):
    resp = requests.get("http://127.0.0.1:8000/composed", data="hey!")
    print(resp.json())

## Loading models

Commonly, we want to load a model once per process, not on every request.

This might be because the model is large or expensive to load, or because we're retrieving it from another system, such as a model registry or model database, and we want to minimize traffic against that system while caching the model locally for performance.

With Ray Serve, the pattern is to load or create the model in the service constructor (`__init__`), assign it to an instance variable, and then use that instance variable as needed for prediction. An example is shown in the Ray documentation: https://docs.ray.io/en/latest/serve/ml-models.html#integration-with-model-registries
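
As a minimal, framework-free sketch of that constructor pattern, the class below loads its "model" once in `__init__` and reuses it on every call. The class name `RegistryBackedService`, the `_load_model` helper, the `load_calls` counter, and the URI string are all hypothetical; a real service would fetch the model from whatever registry is in use, and under Ray Serve the class would additionally carry the `@serve.deployment` decorator.

```python
class RegistryBackedService:
    # Class-level counter, present only to make the demo observable.
    load_calls = 0

    def __init__(self, model_uri):
        # Expensive step: runs once, when the instance is constructed.
        self._model = self._load_model(model_uri)

    @classmethod
    def _load_model(cls, model_uri):
        cls.load_calls += 1
        # Hypothetical stand-in for a registry fetch: a trivial "model"
        # that doubles its input.
        return lambda x: x * 2

    def __call__(self, x):
        # Per-request path: reuses the cached model, never reloads it.
        return self._model(x)

service = RegistryBackedService("models:/demo/1")  # hypothetical URI
outputs = [service(i) for i in range(5)]
print(outputs, RegistryBackedService.load_calls)
```

The load cost is paid once per instance, so with several replicas each replica would load its own copy.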

If you're loading very large models and want to improve performance further, there are a few cool Ray tricks explained in this article, where the authors speed up loading a ~420MB flavor of the BERT language model: https://medium.com/ibm-data-ai/how-to-load-pytorch-models-340-times-faster-with-ray-8be751a6944c

Just to pull back the curtain and take some of the magic away... what if our model didn't "plug right in" to a matching Predictor and Deployment class?

