# Ray Serve - Model Serving Challenges

© 2019-2022, Anyscale. All Rights Reserved

📖 [Back to Table of Contents](./ex_00_tutorial_overview.ipynb)<br>
➡ [Next notebook](./ex_02_ray_serve_fastapi.ipynb) <br>

### Learning Objective:
In this introductory tutorial, you will:

* learn about model serving challenges
* understand why Ray Serve exists, along with its concepts, components, and architecture
* utilize Ray Serve APIs to create and serve deployments

## The Challenges of Model Serving

Model development happens in a data science research environment. There are many challenges, such as feature engineering, model selection, missing or messy data, yet there are tools at the data scientists' disposal. By contrast, model deployment to production faces an entirely different set of challenges and requires different tools. We must bridge the divide as much as possible.

So what are some of the challenges of model serving?

<img src="images/serve_challenges.png" width="70%" height="40%">

### 1. It Should Be Framework Agnostic

First, model serving frameworks must be able to serve models from popular frameworks and libraries like TensorFlow, PyTorch, scikit-learn, or even arbitrary Python functions. Even within the same organization, it is common to use several machine learning frameworks, in order to get the best model. 

Second, machine learning models are typically surrounded by (or work in conjunction with) 
lots of application or business logic. For example, some model serving is implemented as a RESTful service to which scoring requests are made. Often this is too restrictive: the scoring process may need additional steps, such as fetching data from an online feature store to augment the request, and the performance overhead of the extra remote calls may be suboptimal.
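
As a rough illustration of scoring logic that augments the request before calling the model, here is a minimal sketch in plain Python. Everything in it is hypothetical: `FEATURE_STORE` is an in-memory stand-in for an online feature store, and `toy_model` is a fixed linear score rather than a trained model.

```python
# Hypothetical in-memory stand-in for an online feature store.
FEATURE_STORE = {
    "user_1": {"avg_basket": 42.0, "visits_30d": 7},
    "user_2": {"avg_basket": 13.5, "visits_30d": 1},
}

# Defaults used when a user has no stored features (an assumption of this sketch).
DEFAULT_FEATURES = {"avg_basket": 0.0, "visits_30d": 0}


def toy_model(features):
    """Stand-in for a trained model: a fixed linear score."""
    return (
        0.01 * features["avg_basket"]
        + 0.1 * features["visits_30d"]
        + 0.5 * features["cart_value"]
    )


def score(request):
    """Augment the raw request with stored features, then score it in-process."""
    stored = FEATURE_STORE.get(request["user_id"], DEFAULT_FEATURES)
    features = {**stored, "cart_value": request["cart_value"]}
    return {"user_id": request["user_id"], "score": toy_model(features)}


result = score({"user_id": "user_1", "cart_value": 2.0})
print(result)
```

Because the lookup and the model call live in the same Python function, no extra remote call is needed between them.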

### 2. Pure Python or Pythonic

In general, model serving should be intuitive for developers and simple to configure and run. Hence, it is desirable to use pure Python and to avoid verbose configurations using YAML files or other means. 

Data scientists and engineers use Python and Python-based ML frameworks to develop their machine learning models, so they should also be able to use Python to deploy their machine learning applications. This need is growing more critical as online learning applications combine training and serving in the same applications.

### 3. Simple and Scalable

Model serving must be simple to scale on demand across many machines. It must also be easy to upgrade models dynamically, over time. Meeting production uptime and performance requirements is essential for success.

### 4. DevOps/MLOps Integrations

Model serving deployments need to integrate with existing "DevOps" CI/CD practices for controlled, audited, and predictable releases. Patterns like [Canary Deployment](https://martinfowler.com/bliki/CanaryRelease.html) are particularly useful for testing the efficacy of a new model before replacing existing models, just as this pattern is useful for other software deployments.

### 5. Flexible Deployment Patterns

There are unique deployment patterns, too. For example, it should be easy to deploy a forest of models, to split traffic to different instances, and to score data in batches for greater efficiency.
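
To make traffic splitting concrete, here is a minimal sketch in plain Python, with no Ray involved. The model names, the 90/10 weights, and the `make_traffic_splitter` helper are assumptions made for illustration; they are not part of any Ray Serve API.

```python
import random
from collections import Counter


def make_traffic_splitter(weights, seed=None):
    """Return a function that picks a model name according to `weights`.

    `weights` maps a (hypothetical) model name to its share of traffic.
    """
    rng = random.Random(seed)
    names = list(weights)
    shares = [weights[name] for name in names]

    def choose():
        return rng.choices(names, weights=shares, k=1)[0]

    return choose


# Hypothetical stand-ins for an existing model and a canary candidate.
models = {
    "model_v1": lambda x: 2.0 * x,
    "model_v2": lambda x: 2.0 * x + 0.1,
}

# Send roughly 90% of requests to the existing model and 10% to the candidate.
choose = make_traffic_splitter({"model_v1": 0.9, "model_v2": 0.1}, seed=42)

counts = Counter()
for i in range(1000):
    name = choose()
    counts[name] += 1
    _ = models[name](float(i))  # score the request with the chosen model

print(counts)
```

The same weighted choice is the core of a canary release: the candidate's share starts small and grows only while its results look healthy.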

See also this [Ray blog post](https://medium.com/distributed-computing-with-ray/the-simplest-way-to-serve-your-nlp-model-in-production-with-pure-python-d42b6a97ad55) on the challenges of model serving and the way Ray Serve addresses them. It also provides an example of starting with a simple model, then deploying a more sophisticated model into the running application. Along the same lines, the blog post [Serving ML Models in Production: Common Patterns](https://www.anyscale.com/blog/serving-ml-models-in-production-common-patterns) discusses deployment patterns for model serving and how you can implement them with Ray Serve. Additionally, watch this webinar: [Building a scalable ML model serving API with Ray Serve](https://www.anyscale.com/events/2021/09/09/building-a-scalable-ml-model-serving-api-with-ray-serve). This introductory webinar highlights how Ray Serve makes it easy to deploy, operate, and scale a machine learning API.

<img src="images/PatternsMLProduction.png" width="70%" height="40%"> 


## Why Ray Serve?

[Ray Serve](https://docs.ray.io/en/latest/serve/index.html) is a scalable, framework-agnostic and Python-first model serving library built on [Ray](https://ray.io).

<img src="images/ray_serve_overview.png" width="70%" height="40%"> 

For users, Ray Serve offers these benefits:

* **Framework Agnostic**: You can use the same toolkit to serve everything from deep learning models built with [PyTorch](https://docs.ray.io/en/latest/serve/tutorials/pytorch.html#serve-pytorch-tutorial), [TensorFlow](https://docs.ray.io/en/latest/serve/tutorials/tensorflow.html#serve-tensorflow-tutorial), or [Keras](https://docs.ray.io/en/latest/serve/tutorials/tensorflow.html#serve-tensorflow-tutorial), to [scikit-learn](https://docs.ray.io/en/latest/serve/tutorials/sklearn.html#serve-sklearn-tutorial) models, to arbitrary business logic.
* **Python First:** Configure your model serving with pure Python code. No YAML or JSON configurations required.

Since Serve is built on Ray, it also allows you to scale to many machines, in your datacenter or in cloud environments, and it allows you to leverage all of the other Ray frameworks.

## Ray Serve Architecture and Components

<img src="images/serve-architecture-2.0.png" height="40%" width="70%">

There are three kinds of actors that are created to make up a Serve instance:
- **Controller**: A global actor unique to each Serve instance that manages
  the control plane. The Controller is responsible for creating, updating, and
  destroying other actors. Serve API calls like creating or getting a deployment
  make remote calls to the Controller.
- **HTTP Proxy**: By default there is one HTTP proxy actor on the head node. This actor runs a [Uvicorn](https://www.uvicorn.org/) HTTP
  server that accepts incoming requests, forwards them to replicas, and
  responds once they are completed.  For scalability and high availability,
  you can also run a proxy on each node in the cluster via the `location` field of [`http_options`](core-apis).
- **Replicas**: Actors that actually execute the code in response to a
  request. For example, they may contain an instantiation of an ML model. Each
  replica processes individual requests from the HTTP proxy (these may be batched
  by the replica using `@serve.batch`, see the [batching](https://docs.ray.io/en/latest/serve/ml-models.html#serve-batching) docs).
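
To give a feel for what batching inside a replica means, here is a toy micro-batcher written with plain `asyncio`. It is a sketch of the general idea only, not how `@serve.batch` is implemented; `MicroBatcher`, `score_batch`, and the size and wait values are all hypothetical.

```python
import asyncio


class MicroBatcher:
    """Collect individual calls and hand them to `batch_fn` as one list."""

    def __init__(self, batch_fn, max_batch_size=4, wait_s=0.01):
        self.batch_fn = batch_fn
        self.max_batch_size = max_batch_size
        self.wait_s = wait_s
        self.pending = []       # (item, future) pairs waiting for a batch
        self.flush_task = None  # timer that flushes a partially filled batch

    async def submit(self, item):
        future = asyncio.get_running_loop().create_future()
        self.pending.append((item, future))
        if len(self.pending) >= self.max_batch_size:
            self._flush()  # batch is full: run it right away
        elif self.flush_task is None:
            self.flush_task = asyncio.create_task(self._flush_later())
        return await future

    async def _flush_later(self):
        await asyncio.sleep(self.wait_s)
        self.flush_task = None
        self._flush()

    def _flush(self):
        if self.flush_task is not None:
            self.flush_task.cancel()
            self.flush_task = None
        batch, self.pending = self.pending, []
        if not batch:
            return
        results = self.batch_fn([item for item, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)


batch_sizes = []


def score_batch(items):
    """Stand-in for a vectorized model call; records the batch size it saw."""
    batch_sizes.append(len(items))
    return [item * 2 for item in items]


async def main():
    batcher = MicroBatcher(score_batch, max_batch_size=4)
    return await asyncio.gather(*(batcher.submit(i) for i in range(10)))


results = asyncio.run(main())
print(results)
print(batch_sizes)
```

Each caller still submits and awaits a single item, while the scoring function only ever sees lists, which is where the efficiency gain of batching comes from.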

For more details, see the [key concepts](https://docs.ray.io/en/latest/serve/index.html) and [architecture](https://docs.ray.io/en/latest/serve/architecture.html) documentation.

### Lifetime of a Request

When an HTTP request is sent to the router, the following things happen:

 * The HTTP request is received and parsed.

 * The correct deployment associated with the HTTP url path is looked up. The request is placed on a queue.

 * For each request in a deployment queue, an available replica is looked up and the request is sent to it. If there are no available replicas (there are more than `max_concurrent_queries` requests outstanding), the request is left in the queue until an outstanding request is finished.

Each replica maintains a queue of requests and executes one at a time, possibly using `asyncio` to process them concurrently. If the handler (the function for the deployment or `__call__`) is `async`, the replica will not wait for the handler to run; otherwise, the replica will block until the handler returns.
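
The queueing behaviour described above can be imitated with a small `asyncio` simulation. This is a toy model under stated assumptions, not Ray Serve's router: the `Replica` and `Router` classes, the cap of 2, and the sleep that stands in for inference are all made up for illustration.

```python
import asyncio

MAX_CONCURRENT_QUERIES = 2  # toy per-replica cap on outstanding requests


class Replica:
    def __init__(self, name):
        self.name = name
        self.in_flight = 0
        self.peak_in_flight = 0

    async def handle(self, request):
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        try:
            await asyncio.sleep(0.01)  # stand-in for model inference
            return f"{self.name} handled {request}"
        finally:
            self.in_flight -= 1


class Router:
    def __init__(self, replicas):
        # One queue entry per allowed outstanding request on each replica.
        self.slots = asyncio.Queue()
        for replica in replicas:
            for _ in range(MAX_CONCURRENT_QUERIES):
                self.slots.put_nowait(replica)

    async def route(self, request):
        # Waits here while every replica is at its cap.
        replica = await self.slots.get()
        try:
            return await replica.handle(request)
        finally:
            self.slots.put_nowait(replica)  # free the slot for a waiting request


async def main():
    replicas = [Replica("replica-0"), Replica("replica-1")]
    router = Router(replicas)  # created inside the running event loop
    responses = await asyncio.gather(
        *(router.route(f"req-{i}") for i in range(10))
    )
    return replicas, responses


replicas, responses = asyncio.run(main())
print(len(responses), [replica.peak_in_flight for replica in replicas])
```

With two replicas and a cap of two, at most four requests are in flight at once; the remaining ones wait until a slot is released.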



## Two Simple Ray Serve Examples

We'll explore a more detailed example later in this notebook, where we actually serve ML models. Here we show how simple deployments are with Ray Serve. We will first use a function that does "scoring," sufficient for _stateless_ scenarios, then use a class, which enables _stateful_ scenarios.

But first, initialize Ray as before:

In [19]:
import os
import warnings
import time
import logging

In [20]:
import ray
from ray import serve

import requests  # for making web requests

In [21]:
warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"

In [22]:
if ray.is_initialized():
    ray.shutdown()
ray.init(logging_level=logging.ERROR)

Python version: 3.8.11
Ray version: 3.0.0.dev0
Dashboard: http://127.0.0.1:8265


Let's define a simple function that will be served by Ray Serve. Much as `@ray.remote` turns a function into a Ray task, decorating this function with `@serve.deployment` marks it as a Serve deployment to which we can send HTTP requests. Serve passes each one to the function as a Starlette request object.

The function takes in a `request`, extracts the query parameter with the key "name",
and returns an echoed string. 

The example is deliberately simple, to illustrate that Ray Serve can also serve plain Python functions.

### Create a Python function deployment 

In [23]:
@serve.deployment
def hello(request):
    name = request.query_params["name"]
    return f"Hello {name}!"
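
To make the `query_params` lookup concrete, the sketch below extracts the same `name` parameter from a URL using only the standard library. The `greet_from_url` helper and its `"stranger"` default are assumptions of this sketch; the `hello` deployment itself indexes `query_params` directly and supplies no default.

```python
from urllib.parse import parse_qs, urlsplit


def greet_from_url(url, default="stranger"):
    """Read the `name` query parameter from `url` and build the greeting."""
    query = parse_qs(urlsplit(url).query)  # e.g. {"name": ["request_0"]}
    name = query.get("name", [default])[0]
    return f"Hello {name}!"


print(greet_from_url("http://127.0.0.1:8000/hello?name=request_0"))
print(greet_from_url("http://127.0.0.1:8000/hello"))
```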

Use the `<func_name>.bind()` method to bind the function to a deployment; the object it returns is what we then hand to Ray Serve to deploy.

### Deploy a Python function for serving by binding it to a deployment

In [24]:
handle = hello.bind()

To run Ray Serve and deploy the model, simply call `serve.run(bound_handle)`.

In [25]:
serve.run(handle)

(ServeController pid=18569) INFO 2022-08-08 13:33:33,369 controller 18569 http_state.py:123 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-16c743807f0d7c9853a49f5a543dbd87ddf424518fbbb297dce157b1' on node '16c743807f0d7c9853a49f5a543dbd87ddf424518fbbb297dce157b1' listening on '127.0.0.1:8000'
(ServeController pid=18569) INFO 2022-08-08 13:33:34,816 controller 18569 deployment_state.py:1277 - Adding 1 replicas to deployment 'hello'.
(HTTPProxyActor pid=18578) INFO:     Started server process [18578]


RayServeSyncHandle(deployment='hello')

### Send some requests to our Python function

In [26]:
for i in range(10):
    response = requests.get(f"http://127.0.0.1:8000/hello?name=request_{i}").text
    print(f'{i:2d}: {response}')

 0: Hello request_0!
 1: Hello request_1!
 2: Hello request_2!
 3: Hello request_3!
 4: Hello request_4!
 5: Hello request_5!
 6: Hello request_6!
 7: Hello request_7!
 8: Hello request_8!
 9: Hello request_9!


(HTTPProxyActor pid=18578) INFO 2022-08-08 13:33:36,888 http_proxy 127.0.0.1 http_proxy.py:315 - GET / 200 3.2ms
(HTTPProxyActor pid=18578) INFO 2022-08-08 13:33:36,894 http_proxy 127.0.0.1 http_proxy.py:315 - GET / 200 2.0ms
(HTTPProxyActor pid=18578) INFO 2022-08-08 13:33:36,898 http_proxy 127.0.0.1 http_proxy.py:315 - GET / 200 1.7ms
(HTTPProxyActor pid=18578) INFO 2022-08-08 13:33:36,902 http_proxy 127.0.0.1 http_proxy.py:315 - GET / 200 1.6ms
(HTTPProxyActor pid=18578) INFO 2022-08-08 13:33:36,907 http_proxy 127.0.0.1 http_proxy.py:315 - GET / 200 1.9ms
(HTTPProxyActor pid=18578) INFO 2022-08-08 13:33:36,912 http_proxy 127.0.0.1 http_proxy.py:315 - GET / 200 1.9ms
(HTTPProxyActor pid=18578) INFO 2022-08-08 13:33:36,917 http_proxy 127.0.0.1 http_proxy.py:315 - GET / 200 1.9ms
(ServeReplica:hello pid=18583) INFO 2022-08-08 13:33:36,887 hello hello#aYfVqn replica.py:505 - HANDLE __

You should see `Hello request_N!` in the output. 

Now let's serve another "model" in the same Ray Serve instance:

In [27]:
from random import random
from starlette.requests import Request

@serve.deployment
class SimpleModel:
    def __init__(self):
        # State is set once per replica and kept across requests
        self.weight = 0.5
        self.bias = 1
        self.prediction = 0.0  # most recent prediction

    def __call__(self, starlette_request: Request):
        # Query parameters arrive as strings, so convert before doing arithmetic
        data = starlette_request.query_params['data']
        self.prediction = float(data) * self.weight * random() + self.bias
        return {"prediction": self.prediction}

(HTTPProxyActor pid=18578) INFO 2022-08-08 13:33:36,929 http_proxy 127.0.0.1 http_proxy.py:315 - GET / 200 2.0ms
(HTTPProxyActor pid=18578) INFO 2022-08-08 13:33:36,934 http_proxy 127.0.0.1 http_proxy.py:315 - GET / 200 2.1ms
(ServeReplica:hello pid=18583) INFO 2022-08-08 13:33:36,928 hello hello#aYfVqn replica.py:505 - HANDLE __call__ OK 0.1ms
(ServeReplica:hello pid=18583) INFO 2022-08-08 13:33:36,933 hello hello#aYfVqn replica.py:505 - HANDLE __call__ OK 0.1ms


In [28]:
simple_model = SimpleModel.bind()
serve.run(simple_model)

(ServeController pid=18569) INFO 2022-08-08 13:33:37,401 controller 18569 deployment_state.py:1277 - Adding 1 replicas to deployment 'SimpleModel'.
(ServeController pid=18569) INFO 2022-08-08 13:33:39,373 controller 18569 deployment_state.py:1302 - Removing 1 replicas from deployment 'hello'.


RayServeSyncHandle(deployment='SimpleModel')
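
Before sending HTTP requests, it can help to check the scoring arithmetic on its own. The sketch below repeats the logic of `SimpleModel` in an undecorated class and calls it with a stub request, so it runs without Ray or Starlette. `SimpleModelLogic` and `fake_request` are hypothetical names introduced for this example.

```python
from random import random
from types import SimpleNamespace


class SimpleModelLogic:
    """Same arithmetic as SimpleModel, without the @serve.deployment decorator."""

    def __init__(self):
        self.weight = 0.5
        self.bias = 1
        self.prediction = 0.0

    def __call__(self, starlette_request):
        data = starlette_request.query_params["data"]
        self.prediction = float(data) * self.weight * random() + self.bias
        return {"prediction": self.prediction}


# Stub that exposes only the attribute the handler reads.
fake_request = SimpleNamespace(query_params={"data": "0.8"})

model = SimpleModelLogic()
result = model(fake_request)

# random() lies in [0, 1), so with data=0.8 the prediction lies in [1.0, 1.4).
print(result)
```

The same reasoning bounds the served model: for `data` values in [0, 1), every prediction falls between 1.0 and 1.5.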

### Send some requests to our Model

In [29]:
url = "http://127.0.0.1:8000/SimpleModel"
for i in range(5):
    print(f"prediction  : {requests.get(url, params={'data': random()}).text}")

prediction  : {"prediction": 1.2083630123910705}
prediction  : {"prediction": 1.0758797678071188}
prediction  : {"prediction": 1.201087171997287}
prediction  : {"prediction": 1.125423103835021}
prediction  : {"prediction": 1.07045986879704}


(HTTPProxyActor pid=18578) INFO 2022-08-08 13:33:42,484 http_proxy 127.0.0.1 http_proxy.py:315 - GET / 200 3.5ms
(HTTPProxyActor pid=18578) INFO 2022-08-08 13:33:42,488 http_proxy 127.0.0.1 http_proxy.py:315 - GET / 200 1.8ms
(ServeReplica:SimpleModel pid=18604) INFO 2022-08-08 13:33:42,483 SimpleModel SimpleModel#OvBGsm replica.py:505 - HANDLE __call__ OK 0.2ms
(ServeReplica:SimpleModel pid=18604) INFO 2022-08-08 13:33:42,488 SimpleModel SimpleModel#OvBGsm replica.py:505 - HANDLE __call__ OK 0.2ms
(ServeReplica:SimpleModel pid=18604) INFO 2022-08-08 13:33:42,493 SimpleModel SimpleModel#OvBGsm replica.py:505 - HANDLE __call__ OK 0.2ms
(HTTPProxyActor pid=18578) INFO 2022-08-08 13:33:42,494 http_proxy 127.0.0.1 http_proxy.py:315 - GET / 200 2.9ms


### Shut down Ray Serve

In [35]:
serve.shutdown()

(ServeController pid=18569) INFO 2022-08-08 13:33:49,694 controller 18569 deployment_state.py:1302 - Removing 2 replicas from deployment 'Predictor'.
(ServeController pid=18569) INFO 2022-08-08 13:33:49,697 controller 18569 deployment_state.py:1302 - Removing 2 replicas from deployment 'Predictor_1'.
(ServeController pid=18569) INFO 2022-08-08 13:33:49,699 controller 18569 deployment_state.py:1302 - Removing 1 replicas from deployment 'ServeHandleDemo'.
(HTTPProxyActor pid=18578) INFO 2022-08-08 13:33:49,625 http_proxy 127.0.0.1 http_proxy.py:315 - GET / 200 45.6ms
(ServeReplica:ServeHandleDemo pid=18655) INFO 2022-08-08 13:33:49,625 ServeHandleDemo ServeHandleDemo#YoxtQo replica.py:505 - HANDLE __call__ OK 41.8ms
(ServeReplica:Predictor_1 pid=18653) INFO 2022-08-08 13:33:49,624 Predictor_1 Predictor_1#LnMFmK replica.py:505 - HANDLE predict OK 0.2ms


In [36]:
ray.shutdown()

## Exercise

Here are some things you can try:

1. Write a function or class and deploy it. You can modify the `SimpleModel` class.

### Homework

* Try the framework-specific tutorials below with Ray Serve.

### Framework-Specific Tutorials

Ray Serve seamlessly integrates with popular Python ML libraries. Below are tutorials with some of these frameworks to help get you started.

 * [PyTorch Tutorial](https://docs.ray.io/en/latest/serve/tutorials/pytorch.html#serve-pytorch-tutorial)
 * [Scikit-Learn Tutorial](https://docs.ray.io/en/latest/serve/tutorials/sklearn.html#serve-sklearn-tutorial)
 * [Keras and TensorFlow Tutorial](https://docs.ray.io/en/latest/serve/tutorials/tensorflow.html#serve-tensorflow-tutorial)


### Next
We will learn how to use Ray Serve's integration with [MLflow](https://mlflow.org/).

