# Ray Serve - Model Serving Challenges

© 2019-2022, Anyscale. All Rights Reserved

📖 [Back to Table of Contents](./ex_00_tutorial_overview.ipynb)<br>
➡ [Next notebook](./ex_02_ray_serve_mlflow.ipynb) <br>

### Learning Objective:
In this introductory tutorial, you will:

* learn about model serving challenges
* understand why Ray Serve is useful, along with its concepts, components, and architecture
* utilize Ray Serve APIs to create and serve deployments
* access deployments using two different methods
* learn how to scale deployments using replicas

## The Challenges of Model Serving

Model development happens in a data science research environment. There are many challenges, such as feature engineering, model selection, missing or messy data, yet there are tools at the data scientists' disposal. By contrast, model deployment to production faces an entirely different set of challenges and requires different tools. We must bridge the divide as much as possible.

So what are some of the challenges of model serving?

<img src="images/serve_challenges.png" width="70%" height="40%">

### 1. It Should Be Framework Agnostic

First, model serving frameworks must be able to serve models from popular frameworks and libraries like TensorFlow, PyTorch, scikit-learn, or even arbitrary Python functions. Even within the same organization, it is common to use several machine learning frameworks, in order to get the best model. 

Second, machine learning models are typically surrounded by (or work in conjunction with) lots of application or business logic. For example, some model serving is implemented as a RESTful service to which scoring requests are made. Often this is too restrictive: the scoring process may need additional processing, such as fetching data from an online feature store to augment the request, and the performance overhead of the extra remote calls may be suboptimal.

### 2. Pure Python or Pythonic

In general, model serving should be intuitive for developers and simple to configure and run. Hence, it is desirable to use pure Python and to avoid verbose configurations using YAML files or other means. 

Data scientists and engineers use Python and Python-based ML frameworks to develop their machine learning models, so they should also be able to use Python to deploy their machine learning applications. This need is growing more critical as online learning applications combine training and serving in the same applications.

### 3. Simple and Scalable

Model serving must be simple to scale on demand across many machines. It must also be easy to upgrade models dynamically, over time. Meeting production uptime and performance requirements is essential for success.

### 4. DevOps/MLOps Integrations

Model serving deployments need to integrate with existing "DevOps" CI/CD practices for controlled, audited, and predictable releases. Patterns like [Canary Deployment](https://martinfowler.com/bliki/CanaryRelease.html) are particularly useful for testing the efficacy of a new model before replacing existing models, just as this pattern is useful for other software deployments.
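
To make the canary idea concrete, here is a minimal sketch in plain Python of weighted routing between a stable model and a candidate. It is not Ray Serve's API; the names `stable_model`, `canary_model`, and `make_canary_router`, as well as the 10% canary fraction, are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def stable_model(x: float) -> str:
    """Stand-in for the model currently in production."""
    return "stable"


def canary_model(x: float) -> str:
    """Stand-in for the new model under evaluation."""
    return "canary"


def make_canary_router(canary_fraction: float, seed: int = 0):
    """Return a function that sends roughly `canary_fraction` of calls to the canary."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the sketch is reproducible

    def route(x: float) -> str:
        model = canary_model if rng.random() < canary_fraction else stable_model
        return model(x)

    return route


route = make_canary_router(canary_fraction=0.1)
counts = Counter(route(float(i)) for i in range(10_000))
print(counts)  # most requests reach the stable model, a small share the canary
```

In a real rollout the canary fraction would typically be raised step by step while the new model's metrics are compared against the stable one.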

### 5. Flexible Deployment Patterns

There are unique deployment patterns, too. For example, it should be easy to deploy a forest of models, to split traffic to different instances, and to score data in batches for greater efficiency.
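
As a rough illustration of why batch scoring helps, the sketch below groups incoming inputs into fixed-size batches so that the model is invoked once per batch instead of once per input. It is plain Python, not Ray Serve code; `batched`, `score_batch`, and the doubling "model" are hypothetical stand-ins.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[float], batch_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def score_batch(batch: Sequence[float]) -> List[float]:
    """Hypothetical model: one call scores a whole batch (here, doubles each input)."""
    return [2.0 * x for x in batch]


requests_data = [0.1, 0.2, 0.3, 0.4, 0.5]
calls = 0
predictions: List[float] = []
for batch in batched(requests_data, batch_size=2):
    predictions.extend(score_batch(batch))
    calls += 1

print(f"{calls} model calls for {len(requests_data)} inputs")
```

With models that vectorize well, fewer and larger calls usually amortize per-call overhead, at the cost of some added latency while a batch fills.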

See also this [Ray blog post](https://medium.com/distributed-computing-with-ray/the-simplest-way-to-serve-your-nlp-model-in-production-with-pure-python-d42b6a97ad55) on the challenges of model serving and the way Ray Serve addresses them. It also provides an example of starting with a simple model, then deploying a more sophisticated model into the running application. Along the same lines, the blog post [Serving ML Models in Production Common Patterns](https://www.anyscale.com/blog/serving-ml-models-in-production-common-patterns) discusses common deployment patterns for model serving and how you can implement them with Ray Serve. Additionally, watch this webinar: [Building a scalable ML model serving API with Ray Serve](https://www.anyscale.com/events/2021/09/09/building-a-scalable-ml-model-serving-api-with-ray-serve). This introductory webinar highlights how Ray Serve makes it easy to deploy, operate, and scale a machine learning API.

<img src="images/PatternsMLProduction.png" width="70%" height="40%"> 


## Why Ray Serve?

[Ray Serve](https://docs.ray.io/en/latest/serve/index.html) is a scalable, framework-agnostic and Python-first model serving library built on [Ray](https://ray.io).

<img src="images/ray_serve_overview.png" width="70%" height="40%"> 

For users, Ray Serve offers these benefits:

* **Framework Agnostic**: You can use the same toolkit to serve everything from deep learning models built with [PyTorch](https://docs.ray.io/en/latest/serve/tutorials/pytorch.html#serve-pytorch-tutorial), [TensorFlow](https://docs.ray.io/en/latest/serve/tutorials/tensorflow.html#serve-tensorflow-tutorial), or [Keras](https://docs.ray.io/en/latest/serve/tutorials/tensorflow.html#serve-tensorflow-tutorial), to [scikit-learn](https://docs.ray.io/en/latest/serve/tutorials/sklearn.html#serve-sklearn-tutorial) models, to arbitrary business logic.
* **Python First:** Configure your model serving with pure Python code. No YAML or JSON configurations required.

Since Serve is built on Ray, it also allows you to scale to many machines, in your datacenter or in cloud environments, and it allows you to leverage all of the other Ray frameworks.

## Ray Serve Architecture and Components

<img src="images/architecture.png" height="40%" width="70%">

There are three kinds of actors that are created to make up a Serve instance:

**Controller**: A global actor unique to each Serve instance that manages the control plane. The Controller is responsible for creating, updating, and destroying other actors. Serve API calls like creating or getting a deployment make remote calls to the Controller.

**Router**: There is one router per node. Each router is a Uvicorn HTTP server that accepts incoming requests, forwards them to replicas, and responds once they are completed.

**Worker Replica**: Worker replicas actually execute the code in response to a request. For example, they may contain an instantiation of an ML model. Each replica processes individual requests from the routers (they may be batched by the replica using `@serve.batch`, see the [batching docs](https://docs.ray.io/en/latest/serve/ml-models.html#serve-batching)).

<img src="images/request_flow.png" height="50%" width="75%">

For more details, see the [key concepts](https://docs.ray.io/en/latest/serve/index.html) and [architecture](https://docs.ray.io/en/latest/serve/architecture.html) documentation.

### Lifetime of a Request

When an HTTP request is sent to the router, the following things happen:

 * The HTTP request is received and parsed.

 * The correct deployment associated with the HTTP url path is looked up. The request is placed on a queue.

 * For each request in a deployment queue, an available replica is looked up and the request is sent to it. If there are no available replicas (there are more than `max_concurrent_queries` requests outstanding), the request is left in the queue until an outstanding request is finished.

Each replica maintains a queue of requests and executes them one at a time, possibly using asyncio to process them concurrently. If the handler (the function for the deployment or `__call__`) is async, the replica will not wait for the handler to run; otherwise, the replica will block until the handler returns.
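
The queueing behaviour described above can be approximated with a small asyncio sketch. It is a simplified model, not Ray Serve's implementation: `MAX_CONCURRENT` plays the role of `max_concurrent_queries`, and `replica_handler` is a hypothetical async handler whose `asyncio.sleep` stands in for inference.

```python
import asyncio

MAX_CONCURRENT = 2  # plays the role of max_concurrent_queries in this sketch


async def replica_handler(request_id: int, stats: dict, slots: asyncio.Semaphore) -> str:
    """Process one request once a replica slot is free."""
    async with slots:  # waits here while the replica is at capacity
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for async model inference
        stats["in_flight"] -= 1
    return f"done {request_id}"


async def main() -> tuple:
    slots = asyncio.Semaphore(MAX_CONCURRENT)
    stats = {"in_flight": 0, "peak": 0}
    # Six requests arrive at once; only MAX_CONCURRENT run at a time.
    results = await asyncio.gather(
        *(replica_handler(i, stats, slots) for i in range(6))
    )
    return results, stats["peak"]


results, peak = asyncio.run(main())
print(results, peak)
```

Because the handler is async, requests overlap up to the limit; the remaining ones wait for a free slot, much like requests left in the queue. Note that `asyncio.run` cannot be called from a notebook cell with a running event loop; there, `await main()` would be used instead.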



## Two Simple Ray Serve Examples

We'll explore a more detailed example later in this notebook, where we actually serve ML models. Here we show how simple deployments are with Ray Serve. We will first use a function that does "scoring," sufficient for _stateless_ scenarios, then use a class, which enables _stateful_ scenarios.

<img src="images/func_class_deployment.png" width="80%" height="50%">

But first, initialize Ray as before:

In [1]:
import os
import warnings
import time
import logging

In [2]:
import ray
from ray import serve

import requests  # for making web requests

In [3]:
warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"

In [4]:
if ray.is_initialized():
    ray.shutdown()
ray.init(logging_level=logging.ERROR)

Python version: 3.8.13
Ray version: 2.0.0rc0
Dashboard: http://127.0.0.1:8266


Now we initialize Ray Serve itself. Note that we did not have to start a Ray cluster explicitly. If one is not running, `serve.start()` will automatically launch a Ray cluster; otherwise, it will connect to an existing instance.

In [5]:
serve.start()

[2m[36m(ServeController pid=29932)[0m INFO 2022-08-05 20:27:00,838 controller 29932 http_state.py:123 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:NBHGiv:SERVE_PROXY_ACTOR-d5d37d9a470e2fb2b980c6dec0747d81483471a4ba6f52fb1de45718' on node 'd5d37d9a470e2fb2b980c6dec0747d81483471a4ba6f52fb1de45718' listening on '127.0.0.1:8000'


<ray.serve._private.client.ServeControllerClient at 0x11e616fd0>

[2m[36m(HTTPProxyActor pid=29934)[0m INFO:     Started server process [29934]


Next, let's define a simple stateless function that will be served by Ray. As with Ray tasks, we decorate the function with `@serve.deployment`, which means it will be deployed on Ray Serve as a function to which we can send HTTP requests (Ray Serve passes them in as Starlette `Request` objects).

It takes in a `request`, extracts the query parameter with the key `name`, and returns an echoed string.

The example is deliberately simple, to illustrate that Ray Serve can also serve plain Python functions.

### Create a Python function deployment 

In [6]:
@serve.deployment
def hello(request):
    name = request.query_params["name"]
    return f"Hello {name}!"

Use the `<func_name>.deploy()` method to deploy it on Ray Serve.

### Deploy a Python function for serving

In [7]:
hello.deploy()

[2m[36m(ServeController pid=29932)[0m INFO 2022-08-05 20:27:04,501 controller 29932 deployment_state.py:1232 - Adding 1 replicas to deployment 'hello'.


### Send some requests to our Python function

In [8]:
for i in range(10):
    response = requests.get(f"http://127.0.0.1:8000/hello?name=request_{i}").text
    print(f'{i:2d}: {response}')

 0: Hello request_0!
 1: Hello request_1!
 2: Hello request_2!
 3: Hello request_3!
 4: Hello request_4!
 5: Hello request_5!
 6: Hello request_6!
 7: Hello request_7!
 8: Hello request_8!
 9: Hello request_9!


[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:06,137 http_proxy 127.0.0.1 http_proxy.py:315 - GET /hello 200 4.1ms
[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:06,146 http_proxy 127.0.0.1 http_proxy.py:315 - GET /hello 200 6.7ms
[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:06,157 http_proxy 127.0.0.1 http_proxy.py:315 - GET /hello 200 6.5ms
[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:06,160 http_proxy 127.0.0.1 http_proxy.py:315 - GET /hello 200 1.6ms
[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:06,164 http_proxy 127.0.0.1 http_proxy.py:315 - GET /hello 200 1.8ms
[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:06,167 http_proxy 127.0.0.1 http_proxy.py:315 - GET /hello 200 1.7ms
[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:06,171 http_proxy 127.0.0.1 http_proxy.py:315 - GET /hello 200 1.4ms
[2m[36m(ServeReplica:hello pid=29936)[0m INFO 2022-08-05 20:27:06,136 hello hell

You should see `Hello request_N!` in the output.
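
The handler above raises a `KeyError` when the `name` query parameter is missing. As a minimal sketch, the same logic can be exercised without starting an HTTP server by passing in a stand-in request object. `FakeRequest`, `hello_logic`, and the `"stranger"` default are hypothetical names introduced here; the only assumption about the real request is that it exposes a dict-like `query_params`.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeRequest:
    """Hypothetical stand-in exposing only the `query_params` mapping the handler reads."""
    query_params: Dict[str, str] = field(default_factory=dict)


def hello_logic(request) -> str:
    """Same logic as `hello`, but with a fallback when `name` is missing."""
    name = request.query_params.get("name", "stranger")
    return f"Hello {name}!"


print(hello_logic(FakeRequest({"name": "request_0"})))
print(hello_logic(FakeRequest()))
```

Keeping the request-parsing logic in a plain function like this makes it easy to unit test before wrapping it in `@serve.deployment`.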

Now let's serve another "model" in the same Ray Serve instance:

In [9]:
from random import random
import starlette
from starlette.requests import Request

@serve.deployment
class SimpleModel:
    def __init__(self):
        self.weight = 0.5
        self.bias = 1
        self.prediction = 0.0

    def __call__(self, starlette_request):
        data = starlette_request.query_params['data']
        self.prediction = float(data) * self.weight * random() + self.bias
        return {"prediction": self.prediction}

In [10]:
SimpleModel.deploy()

[2m[36m(ServeController pid=29932)[0m INFO 2022-08-05 20:27:08,332 controller 29932 deployment_state.py:1232 - Adding 1 replicas to deployment 'SimpleModel'.


### Send some requests to our Model

In [11]:
url = "http://127.0.0.1:8000/SimpleModel"
for i in range(5):
    print(f"prediction  : {requests.get(url, params={'data': random()}).text}")

prediction  : {"prediction": 1.2492219367040882}
prediction  : {"prediction": 1.0216153948312}
prediction  : {"prediction": 1.0299163807604585}
prediction  : {"prediction": 1.0428100425301339}
prediction  : {"prediction": 1.2859704013650775}


[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:09,642 http_proxy 127.0.0.1 http_proxy.py:315 - GET /SimpleModel 200 3.5ms
[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:09,646 http_proxy 127.0.0.1 http_proxy.py:315 - GET /SimpleModel 200 1.7ms
[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:09,649 http_proxy 127.0.0.1 http_proxy.py:315 - GET /SimpleModel 200 1.7ms
[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:09,653 http_proxy 127.0.0.1 http_proxy.py:315 - GET /SimpleModel 200 1.5ms
[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:09,656 http_proxy 127.0.0.1 http_proxy.py:315 - GET /SimpleModel 200 1.2ms
[2m[36m(ServeReplica:SimpleModel pid=29939)[0m INFO 2022-08-05 20:27:09,641 SimpleModel SimpleModel#TGQttZ replica.py:482 - HANDLE __call__ OK 0.2ms
[2m[36m(ServeReplica:SimpleModel pid=29939)[0m INFO 2022-08-05 20:27:09,645 SimpleModel SimpleModel#TGQttZ replica.py:482 - HANDLE __call__ OK 0.1ms
[2m[36m(S

### List Deployments

In [12]:
serve.list_deployments()

{'hello': Deployment(name=hello,version=None,route_prefix=/hello),
 'SimpleModel': Deployment(name=SimpleModel,version=None,route_prefix=/SimpleModel)}
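
Each deployment in the listing is reachable over HTTP at its `route_prefix`. The helper below is a small sketch of that mapping; `deployment_url` is a hypothetical function, and the default host and port are simply the `127.0.0.1:8000` address reported in the logs above.

```python
def deployment_url(route_prefix: str, host: str = "127.0.0.1", port: int = 8000) -> str:
    """Build the HTTP URL for a deployment from its route prefix."""
    if not route_prefix.startswith("/"):
        raise ValueError("route_prefix must start with '/'")
    return f"http://{host}:{port}{route_prefix}"


# Route prefixes copied from the listing above
route_prefixes = {"hello": "/hello", "SimpleModel": "/SimpleModel"}
urls = {name: deployment_url(prefix) for name, prefix in route_prefixes.items()}
print(urls)
```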

## Serving ML Models

Now let's see how to deploy ML models and query them via two methods:
 1. **ServeHandle API** gives you control and a Pythonic interface to your deployments.
 2. **HTTP** lets any HTTP client or web interface access your deployments. This is suitable for a web application that sends HTTP requests to your model deployment.
 <img src="images/func_class_deployment_2.png" width="80%" height="50%">

Below is a simple stand-in for a model stored in pickled format at an accessible path in cloud storage or a model registry, from which it can be reloaded and deserialized into a model instance. Once deployed in Ray Serve, we can use it for prediction. The prediction is fake: if the input is greater than a threshold of 0.5, a random value is added to it; otherwise the input is returned unchanged.

In [13]:
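class Model:
    def __init__(self, path):
        # Path the real model would be loaded from (nothing is loaded here)
        self.path = path

    def predict(self, data):
        # Fake prediction: add random noise only when the input exceeds 0.5
        return (random() + data) if data > 0.5 else data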
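
The `Model` class only records its path. As a hedged sketch of what actually loading a pickled artifact could look like, the variant below writes a small parameter dictionary to a temporary file and reads it back in `__init__`. `PickledModel`, the `threshold` key, and the temporary file are assumptions made for illustration; a real model registry or cloud store would need its own loading code.

```python
import os
import pickle
import tempfile
from random import random


class PickledModel:
    """Variant of `Model` that reads its parameters from a pickle file."""

    def __init__(self, path: str) -> None:
        self.path = path
        with open(path, "rb") as f:
            # Only unpickle files you trust: pickle can execute arbitrary code.
            self.params = pickle.load(f)

    def predict(self, data: float) -> float:
        threshold = self.params["threshold"]
        return (random() + data) if data > threshold else data


# Write a hypothetical parameter file to a temporary directory, then load it.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rep-1.pkl")
    with open(model_path, "wb") as f:
        pickle.dump({"threshold": 0.5}, f)
    model = PickledModel(model_path)

print(model.predict(0.25))  # not above the threshold, so the input comes back unchanged
```

Loading in `__init__` means the cost is paid once per replica at start-up rather than on every request.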

In [14]:
import os

@serve.deployment
class Deployment:
    # Take in a path to load your desired model
    def __init__(self, path: str) -> None:
        self.path = path
        self.model = Model(path)
        # Get the pid of the process this deployment replica is running in
        self.pid = os.getpid()

    # Deployments are callable. Here we simply return a prediction from
    # our request
    def __call__(self, starlette_request) -> str:
        # Request came via HTTP
        if isinstance(starlette_request, starlette.requests.Request):
            data = starlette_request.query_params['data']
        else:
            # Request came via a ServeHandle API method call.
            data = starlette_request
        pred = self.model.predict(float(data))
        return f"(pid: {self.pid}); path: {self.path}; data: {float(data):.3f}; prediction: {pred:.3f}"

Create two distinct deployments of the same class, each with a single replica.
Associate each deployment with a unique `name`. This name can be used to fetch its respective ServeHandle.

In [15]:
Deployment.options(name="rep-1", num_replicas=1).deploy("/model/rep-1.pkl")
Deployment.options(name="rep-2", num_replicas=1).deploy("/model/rep-2.pkl")

[2m[36m(ServeController pid=29932)[0m INFO 2022-08-05 20:27:24,135 controller 29932 deployment_state.py:1232 - Adding 1 replicas to deployment 'rep-1'.
[2m[36m(ServeController pid=29932)[0m INFO 2022-08-05 20:27:25,172 controller 29932 deployment_state.py:1232 - Adding 1 replicas to deployment 'rep-2'.


### List deployments again

In [16]:
print(serve.list_deployments())

{'hello': Deployment(name=hello,version=None,route_prefix=/hello), 'SimpleModel': Deployment(name=SimpleModel,version=None,route_prefix=/SimpleModel), 'rep-1': Deployment(name=rep-1,version=None,route_prefix=/rep-1), 'rep-2': Deployment(name=rep-2,version=None,route_prefix=/rep-2)}


### Method 1: Access each deployment using the ServeHandle API

In [17]:
for _ in range(2):
    for d_name in ["rep-1", "rep-2"]:
        # Get a handle to each deployment and invoke its method.
        # Which replica the request is dispatched to is determined
        # by the Router actor.
        handle = serve.get_deployment(d_name).get_handle()
        print(f"handle name : {d_name}")
        print(f"prediction  : {ray.get(handle.remote(random()))}")
        print("-" * 2)

You are retrieving a sync handle inside an asyncio loop. Try getting Deployment.get_handle(.., sync=False) to get better performance. Learn more at https://docs.ray.io/en/latest/serve/handle-guide.html#sync-and-async-handles
You are retrieving a sync handle inside an asyncio loop. Try getting Deployment.get_handle(.., sync=False) to get better performance. Learn more at https://docs.ray.io/en/latest/serve/handle-guide.html#sync-and-async-handles


handle name : rep-1
prediction  : (pid: 29945); path: /model/rep-1.pkl; data: 0.170; prediction: 0.170
--
handle name : rep-2
prediction  : (pid: 29947); path: /model/rep-2.pkl; data: 0.664; prediction: 0.956
--
handle name : rep-1
prediction  : (pid: 29945); path: /model/rep-1.pkl; data: 0.769; prediction: 1.061
--
handle name : rep-2
prediction  : (pid: 29947); path: /model/rep-2.pkl; data: 0.566; prediction: 1.535
--


[2m[36m(ServeReplica:rep-1 pid=29945)[0m INFO 2022-08-05 20:27:26,562 rep-1 rep-1#ZEFtCs replica.py:482 - HANDLE __call__ OK 0.1ms
[2m[36m(ServeReplica:rep-1 pid=29945)[0m INFO 2022-08-05 20:27:26,573 rep-1 rep-1#ZEFtCs replica.py:482 - HANDLE __call__ OK 0.1ms
[2m[36m(ServeReplica:rep-2 pid=29947)[0m INFO 2022-08-05 20:27:26,570 rep-2 rep-2#tklbvF replica.py:482 - HANDLE __call__ OK 0.1ms
[2m[36m(ServeReplica:rep-2 pid=29947)[0m INFO 2022-08-05 20:27:26,576 rep-2 rep-2#tklbvF replica.py:482 - HANDLE __call__ OK 0.1ms


### Method 2: Access deployment via HTTP Request

In [18]:
for _ in range(2):
    for d_name in ["rep-1", "rep-2"]:
        # Send HTTP request along with data payload
        url = f"http://127.0.0.1:8000/{d_name}"
        print(f"handle name : {d_name}")
        print(f"prediction  : {requests.get(url, params={'data': random()}).text}")

handle name : rep-1
prediction  : (pid: 29945); path: /model/rep-1.pkl; data: 0.791; prediction: 1.760
handle name : rep-2
prediction  : (pid: 29947); path: /model/rep-2.pkl; data: 0.195; prediction: 0.195
handle name : rep-1
prediction  : (pid: 29945); path: /model/rep-1.pkl; data: 0.865; prediction: 1.116
handle name : rep-2
prediction  : (pid: 29947); path: /model/rep-2.pkl; data: 0.112; prediction: 0.112


[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:28,562 http_proxy 127.0.0.1 http_proxy.py:315 - GET /rep-1 200 5.7ms
[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:28,567 http_proxy 127.0.0.1 http_proxy.py:315 - GET /rep-2 200 2.8ms
[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:28,570 http_proxy 127.0.0.1 http_proxy.py:315 - GET /rep-1 200 1.4ms
[2m[36m(HTTPProxyActor pid=29934)[0m INFO 2022-08-05 20:27:28,573 http_proxy 127.0.0.1 http_proxy.py:315 - GET /rep-2 200 1.5ms
[2m[36m(ServeReplica:rep-1 pid=29945)[0m INFO 2022-08-05 20:27:28,562 rep-1 rep-1#ZEFtCs replica.py:482 - HANDLE __call__ OK 0.2ms
[2m[36m(ServeReplica:rep-1 pid=29945)[0m INFO 2022-08-05 20:27:28,570 rep-1 rep-1#ZEFtCs replica.py:482 - HANDLE __call__ OK 0.1ms
[2m[36m(ServeReplica:rep-2 pid=29947)[0m INFO 2022-08-05 20:27:28,567 rep-2 rep-2#tklbvF replica.py:482 - HANDLE __call__ OK 0.1ms
[2m[36m(ServeReplica:rep-2 pid=29947)[0m INFO 2022-08-05 20:27:28,573 r

### Shut down Ray Serve

In [19]:
serve.shutdown()

[2m[36m(ServeController pid=29932)[0m INFO 2022-08-05 20:27:30,376 controller 29932 deployment_state.py:1257 - Removing 1 replicas from deployment 'hello'.
[2m[36m(ServeController pid=29932)[0m INFO 2022-08-05 20:27:30,379 controller 29932 deployment_state.py:1257 - Removing 1 replicas from deployment 'SimpleModel'.
[2m[36m(ServeController pid=29932)[0m INFO 2022-08-05 20:27:30,384 controller 29932 deployment_state.py:1257 - Removing 1 replicas from deployment 'rep-1'.
[2m[36m(ServeController pid=29932)[0m INFO 2022-08-05 20:27:30,385 controller 29932 deployment_state.py:1257 - Removing 1 replicas from deployment 'rep-2'.


In [20]:
ray.shutdown()

## Exercise

Here are some things you can try:

1. Increase the number of replicas. For each of Method 1 and Method 2, send ten requests.
2. Do requests get sent to different replicas? (Check the pids or the Ray Dashboard.)
3. Write a function or class and deploy it. You can modify the class `Deployment`.
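
For the second exercise, one possible way to check which replicas served your requests is to tally the pids embedded in the response strings. The sketch below assumes responses keep the `(pid: N); path: ...` format returned by the `Deployment` class; `count_pids` and `PID_PATTERN` are hypothetical helper names.

```python
import re
from collections import Counter
from typing import Iterable

PID_PATTERN = re.compile(r"\(pid: (\d+)\)")


def count_pids(responses: Iterable[str]) -> Counter:
    """Count how many responses each replica process (pid) produced."""
    counts: Counter = Counter()
    for text in responses:
        match = PID_PATTERN.search(text)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Sample responses copied from the outputs earlier in this notebook
sample = [
    "(pid: 29945); path: /model/rep-1.pkl; data: 0.170; prediction: 0.170",
    "(pid: 29947); path: /model/rep-2.pkl; data: 0.664; prediction: 0.956",
    "(pid: 29945); path: /model/rep-1.pkl; data: 0.769; prediction: 1.061",
]
pid_counts = count_pids(sample)
print(pid_counts)
```

With more than one replica per deployment, collecting the response texts from your ten requests and passing them to `count_pids` should show whether more than one pid appears for the same deployment.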

### Homework

* Try the tutorials below with Ray Serve

### Framework-Specific Tutorials

Ray Serve seamlessly integrates with popular Python ML libraries. Below are tutorials with some of these frameworks to help get you started.

 * [PyTorch Tutorial](https://docs.ray.io/en/latest/serve/tutorials/pytorch.html#serve-pytorch-tutorial)
 * [Scikit-Learn Tutorial](https://docs.ray.io/en/latest/serve/tutorials/sklearn.html#serve-sklearn-tutorial)
 * [Keras and TensorFlow Tutorial](https://docs.ray.io/en/latest/serve/tutorials/tensorflow.html#serve-tensorflow-tutorial)
 * [Ray Serve MLflow Deployment Plugin](https://github.com/ray-project/mlflow-ray-serve)


### Next
We will learn how to use the Ray Serve integration with [MLflow](https://mlflow.org/).

📖 [Back to Table of Contents](./ex_00_tutorial_overview.ipynb)<br>
➡ [Next notebook](./ex_02_ray_serve_mlflow.ipynb) <br>
