# Ray Serve - Model Serving Challenges

© 2019-2022, Anyscale. All Rights Reserved

## The Challenges of Model Serving

Model development happens in a data science research environment. There are many challenges, such as feature engineering, model selection, missing or messy data, yet there are tools at the data scientists' disposal. By contrast, model deployment to production faces an entirely different set of challenges and requires different tools. We must bridge the divide as much as possible.

So what are some of the challenges of model serving?

<img src="https://images.ctfassets.net/xjan103pcp94/6IcTIir1U1WBJdSbdygQ08/70ceeb0e4f5c8b72b7007c61cb19eed8/WhereRayServeFitsIn.png" width="70%" height="40%">

### 1. It Should Be Framework Agnostic

First, model serving frameworks must be able to serve models from popular frameworks and libraries like TensorFlow, PyTorch, scikit-learn, or even arbitrary Python functions. Even within the same organization, it is common to use several machine learning frameworks, in order to get the best model. 

Second, machine learning models are typically surrounded by (or work in conjunction with) lots of application or business logic. For example, some model serving is implemented as a RESTful service to which scoring requests are made. Often this is too restrictive: the scoring process may need additional processing, such as fetching data from an online feature store to augment the request, and the performance overhead of remote calls may be suboptimal.
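As a rough illustration of keeping business logic next to the model, the sketch below augments a request with stored features and scores it in the same process. Everything in it is a hypothetical stand-in: `FEATURE_STORE`, `fetch_features`, `toy_model`, and `score` are illustrative names, and a real deployment would query an actual online feature store.

```python
# Hypothetical in-memory stand-in for an online feature store.
FEATURE_STORE = {
    "user_1": {"avg_basket": 42.0, "visits_last_week": 3},
    "user_2": {"avg_basket": 10.0, "visits_last_week": 1},
}


def fetch_features(user_id):
    """Look up extra features for a user; return an empty dict if unknown."""
    return FEATURE_STORE.get(user_id, {})


def toy_model(features):
    """Hypothetical linear "model", used only for illustration."""
    return (0.1 * features.get("avg_basket", 0.0)
            + 0.5 * features.get("visits_last_week", 0))


def score(request):
    """Augment the request with stored features, then score it in-process."""
    features = {**request.get("features", {}),
                **fetch_features(request["user_id"])}
    return {"user_id": request["user_id"], "score": toy_model(features)}


print(score({"user_id": "user_1"}))
```

Because the lookup and the scoring happen in one Python call, no extra remote hop is needed between the business logic and the model.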

### 2. Pure Python or Pythonic

In general, model serving should be intuitive for developers and simple to configure and run. Hence, it is desirable to use pure Python and to avoid verbose configurations using YAML files or other means. 

Data scientists and engineers use Python and Python-based ML frameworks to develop their machine learning models, so they should also be able to use Python to deploy their machine learning applications. This need is growing more critical as online learning applications combine training and serving in the same applications.

### 3. Simple and Scalable

Model serving must be simple to scale on demand across many machines. It must also be easy to upgrade models dynamically, over time. Meeting production uptime and performance requirements is essential for success.

### 4. DevOps/MLOps Integrations

Model serving deployments need to integrate with existing "DevOps" CI/CD practices for controlled, audited, and predictable releases. Patterns like [Canary Deployment](https://martinfowler.com/bliki/CanaryRelease.html) are particularly useful for testing the efficacy of a new model before replacing existing models, just as this pattern is useful for other software deployments.
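The idea behind a canary release can be sketched in plain Python: send a small, configurable fraction of requests to the new model and the rest to the existing one. This is only an illustrative sketch, not how Ray Serve implements traffic control; `make_canary_router`, `model_v1`, and `model_v2` are hypothetical names.

```python
import random


def make_canary_router(stable_model, canary_model, canary_fraction, seed=None):
    """Return a function that sends roughly `canary_fraction` of calls to the canary."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be between 0 and 1")
    rng = random.Random(seed)  # private generator so the split is reproducible

    def route(payload):
        # rng.random() is in [0.0, 1.0), so a fraction of 0 never picks the canary.
        model = canary_model if rng.random() < canary_fraction else stable_model
        return model(payload)

    return route


def model_v1(x):
    """Hypothetical existing model."""
    return {"version": "v1", "prediction": 2 * x}


def model_v2(x):
    """Hypothetical new model under evaluation."""
    return {"version": "v2", "prediction": 2 * x + 0.1}


route = make_canary_router(model_v1, model_v2, canary_fraction=0.1, seed=0)
versions = [route(1.0)["version"] for _ in range(1000)]
print(f"share of traffic sent to the canary: {versions.count('v2') / len(versions):.3f}")
```

Raising `canary_fraction` step by step, while comparing the two models' results, is the usual way such a rollout proceeds.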

### 5. Flexible Deployment Patterns

There are unique deployment patterns, too. For example, it should be easy to deploy a forest of models, to split traffic to different instances, and to score data in batches for greater efficiency.
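Batch scoring, for instance, amounts to grouping requests so that the model is called once per group rather than once per item. The sketch below shows the idea in plain Python; `batched`, `score_batch`, and `score_all` are hypothetical helpers, and the "model" is a toy formula.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def score_batch(batch):
    """Hypothetical vectorized model: one call scores the whole batch."""
    return [2 * x + 1 for x in batch]


def score_all(items, batch_size):
    """Score every item, calling the model once per batch."""
    results = []
    for batch in batched(items, batch_size):
        results.extend(score_batch(batch))
    return results


print(score_all([1, 2, 3, 4, 5], batch_size=2))  # [3, 5, 7, 9, 11]
```

The efficiency gain comes from models whose cost per call is largely fixed, so that scoring several inputs together costs little more than scoring one.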

See also this [Ray blog post](https://medium.com/distributed-computing-with-ray/the-simplest-way-to-serve-your-nlp-model-in-production-with-pure-python-d42b6a97ad55) on the challenges of model serving and the way Ray Serve addresses them. It also provides an example of starting with a simple model, then deploying a more sophisticated model into the running application. Along the same lines, the blog post [Serving ML Models in Production Common Patterns](https://www.anyscale.com/blog/serving-ml-models-in-production-common-patterns) discusses common deployment patterns for model serving and how you can use Ray Serve with them. Additionally, see this introductory webinar, [Building a scalable ML model serving API with Ray Serve](https://www.anyscale.com/events/2021/09/09/building-a-scalable-ml-model-serving-api-with-ray-serve), which highlights how Ray Serve makes it easy to deploy, operate, and scale a machine learning API.

<img src="images/PatternsMLProduction.png" width="70%" height="40%"> 


## Why Ray Serve?

[Ray Serve](https://docs.ray.io/en/latest/serve/index.html) is a scalable, framework-agnostic and Python-first model serving library built on [Ray](https://ray.io).

<img src="images/ray_serve_overview.png" width="70%" height="40%"> 

For users, Ray Serve offers these benefits:

* **Framework Agnostic**: You can use the same toolkit to serve everything from deep learning models built with [PyTorch](https://docs.ray.io/en/latest/serve/tutorials/pytorch.html#serve-pytorch-tutorial), [TensorFlow](https://docs.ray.io/en/latest/serve/tutorials/tensorflow.html#serve-tensorflow-tutorial), or [Keras](https://docs.ray.io/en/latest/serve/tutorials/tensorflow.html#serve-tensorflow-tutorial), to [scikit-learn](https://docs.ray.io/en/latest/serve/tutorials/sklearn.html#serve-sklearn-tutorial) models, to arbitrary business logic.
* **Python First:** Configure your model serving with pure Python code. No YAML or JSON configurations required.

Since Serve is built on Ray, it also allows you to scale to many machines, in your datacenter or in cloud environments, and it allows you to leverage all of the other Ray frameworks.

## Ray Serve Architecture and Components

<img src="images/architecture.png" height="40%" width="70%">

There are three kinds of actors that are created to make up a Serve instance:

**Controller**: A global actor unique to each Serve instance that manages the control plane. The Controller is responsible for creating, updating, and destroying other actors. Serve API calls like creating or getting a deployment make remote calls to the Controller.

**Router**: There is one router per node. Each router is a Uvicorn HTTP server that accepts incoming requests, forwards them to replicas, and responds once they are completed.

**Worker Replica**: Worker replicas actually execute the code in response to a request. For example, they may contain an instantiation of an ML model. Each replica processes individual requests from the routers (they may be batched by the replica using `@serve.batch`, see the [batching docs](https://docs.ray.io/en/latest/serve/ml-models.html#serve-batching)).

For more details, see the [key concepts](https://docs.ray.io/en/latest/serve/index.html) and [architecture](https://docs.ray.io/en/latest/serve/architecture.html) documentation.

### Lifetime of a Request

When an HTTP request is sent to the router, the following things happen:

 * The HTTP request is received and parsed.

 * The correct deployment associated with the HTTP URL path is looked up. The request is placed on a queue.

 * For each request in a deployment queue, an available replica is looked up and the request is sent to it. If there are no available replicas (there are more than `max_concurrent_queries` requests outstanding), the request is left in the queue until an outstanding request is finished.

Each replica maintains a queue of requests and executes one at a time, possibly using asyncio to process them concurrently. If the handler (the function for the deployment or `__call__`) is async, the replica will not wait for the handler to run; otherwise, the replica will block until the handler returns.
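The queueing behavior described above can be approximated with a small `asyncio` simulation. This is only an illustrative sketch, not Ray Serve's implementation: `simulate`, `route`, and the slot queue are hypothetical names, and `asyncio.sleep` stands in for the handler's work. Each replica contributes `max_concurrent_queries` slots, and a request waits while no slot is free.

```python
import asyncio


async def simulate(num_requests, num_replicas=2, max_concurrent_queries=2):
    """Route requests to replicas without exceeding each replica's limit."""
    # One queue entry per free slot; an empty queue means every replica is saturated.
    free_slots = asyncio.Queue()
    for replica_id in range(num_replicas):
        for _ in range(max_concurrent_queries):
            free_slots.put_nowait(replica_id)

    in_flight = [0] * num_replicas
    peak_in_flight = [0] * num_replicas

    async def route(request_id):
        # Wait here while all replicas are at their limit.
        replica_id = await free_slots.get()
        in_flight[replica_id] += 1
        peak_in_flight[replica_id] = max(peak_in_flight[replica_id],
                                         in_flight[replica_id])
        try:
            await asyncio.sleep(0.01)  # stand-in for running the handler
            return f"request {request_id} -> replica {replica_id}"
        finally:
            # Finishing a request frees its slot for a waiting request.
            in_flight[replica_id] -= 1
            free_slots.put_nowait(replica_id)

    results = await asyncio.gather(*(route(i) for i in range(num_requests)))
    return results, peak_in_flight


results, peaks = asyncio.run(simulate(num_requests=10))
print(len(results), "requests served; peak outstanding per replica:", peaks)
```

With ten requests and four slots in total, the later requests stay queued until earlier ones finish, and no replica ever has more than two requests outstanding.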



## Two Simple Ray Serve Examples

We'll explore a more detailed example in the next lesson, where we actually serve ML models. Here we show how simple deployments are with Ray Serve. We will first use a function that does "scoring," which is sufficient for _stateless_ scenarios, then use a class, which enables _stateful_ scenarios.

<img src="images/func_class_deployment.png" width="80%" height="50%">

But first, import Ray and Ray Serve, as before:

In [1]:
import ray
from ray import serve

import requests  # for making web requests

Now we initialize Ray Serve itself. Note that we did not have to start a Ray cluster explicitly. If one is not running, `serve.start()` automatically launches a Ray cluster; otherwise, it connects to the existing instance.

In [2]:
serve.start()

2022-04-19 11:27:00,593	INFO services.py:1460 -- View the Ray dashboard at http://127.0.0.1:8266
(ServeController pid=3302) 2022-04-19 11:27:04,425	INFO checkpoint_path.py:15 -- Using RayInternalKVStore for controller checkpoint and recovery.
(ServeController pid=3302) 2022-04-19 11:27:04,530	INFO http_state.py:106 -- Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:hGqEft:SERVE_PROXY_ACTOR-node:127.0.0.1-0' on node 'node:127.0.0.1-0' listening on '127.0.0.1:8000'
(HTTPProxyActor pid=3304) INFO:     Started server process [3304]
2022-04-19 11:27:05,829	INFO api.py:797 -- Started Serve instance in namespace 'serve'.


<ray.serve.api.Client at 0x7fd2307742e0>

Next, we define a stateless function for processing requests. As with Ray tasks, we decorate the function, this time with `@serve.deployment`, which marks it for deployment on Ray Serve as a function to which we can send HTTP requests. Each request arrives as a Starlette request object.

The function takes a `request`, extracts the query parameter with key `name`, and returns a greeting that echoes it.

This simple example illustrates that Ray Serve can also serve plain Python functions.

### Create a Python function deployment 

In [3]:
@serve.deployment
def hello(request):
    name = request.query_params["name"]
    return f"Hello {name}!"

Use the `<func_name>.deploy()` method to deploy it on Ray Serve.

### Deploy a Python function for serving

In [4]:
hello.deploy()

2022-04-19 11:27:09,057	INFO api.py:618 -- Updating deployment 'hello'. component=serve deployment=hello
(ServeController pid=3302) 2022-04-19 11:27:09,125	INFO deployment_state.py:1210 -- Adding 1 replicas to deployment 'hello'. component=serve deployment=hello
2022-04-19 11:27:11,070	INFO api.py:633 -- Deployment 'hello' is ready at `http://127.0.0.1:8000/hello`. component=serve deployment=hello


### Send some requests to our Python function

In [5]:
for i in range(10):
    response = requests.get(f"http://127.0.0.1:8000/hello?name=request_{i}").text
    print(f'{i:2d}: {response}')

 0: Hello request_0!
 1: Hello request_1!
 2: Hello request_2!
 3: Hello request_3!
 4: Hello request_4!
 5: Hello request_5!
 6: Hello request_6!
 7: Hello request_7!
 8: Hello request_8!
 9: Hello request_9!


You should see `Hello request_N!` on each line of the output, for `N` from 0 to 9.
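Since the handler body is ordinary Python, its logic can also be checked without a running Serve instance. The sketch below is an assumption-laden illustration: `hello_logic` is an undecorated copy of the handler, and `fake_request` is a hypothetical stand-in that exposes only the `query_params` attribute the handler reads.

```python
from types import SimpleNamespace


def hello_logic(request):
    """Undecorated copy of the `hello` handler body."""
    name = request.query_params["name"]
    return f"Hello {name}!"


# Hypothetical fake request: only the attribute the handler uses is provided.
fake_request = SimpleNamespace(query_params={"name": "request_0"})
print(hello_logic(fake_request))  # Hello request_0!
```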

Now let's serve another "model" in the same Ray Serve instance:

In [6]:
from random import random

from starlette.requests import Request

@serve.deployment
class SimpleModel:
    def __init__(self):
        # Toy "model" parameters; the last prediction is kept as instance state.
        self.weight = 0.5
        self.bias = 1
        self.prediction = 0.0

    def __call__(self, starlette_request):
        if isinstance(starlette_request, Request):
            # Request came via HTTP: read the `data` query parameter.
            data = starlette_request.query_params['data']
        else:
            # Request came via a ServeHandle API method call.
            data = starlette_request
        self.prediction = float(data) * self.weight * random() + self.bias
        return {"prediction": self.prediction}

In [7]:
SimpleModel.deploy()

2022-04-19 11:27:37,760	INFO api.py:618 -- Updating deployment 'SimpleModel'. component=serve deployment=SimpleModel
(ServeController pid=3302) 2022-04-19 11:27:37,825	INFO deployment_state.py:1210 -- Adding 1 replicas to deployment 'SimpleModel'. component=serve deployment=SimpleModel
2022-04-19 11:27:39,766	INFO api.py:633 -- Deployment 'SimpleModel' is ready at `http://127.0.0.1:8000/SimpleModel`. component=serve deployment=SimpleModel


### Send some requests to our Model

In [8]:
url = "http://127.0.0.1:8000/SimpleModel"
for i in range(5):
    print(f"prediction  : {requests.get(url, params={'data': random()}).text}")

prediction  : {
  "prediction": 1.0000931509310669
}
prediction  : {
  "prediction": 1.1667181099089832
}
prediction  : {
  "prediction": 1.1590113094186356
}
prediction  : {
  "prediction": 1.1115950636861034
}
prediction  : {
  "prediction": 1.3231067880072571
}
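The predictions above all fall between 1 and 1.5. That follows from the formula `data * weight * random() + bias`: the request data comes from `random()`, which lies in [0, 1), so each result is bounded below by `bias` and above by `bias + weight * data`. The sketch below checks this bound without Ray Serve; `SimpleModelLogic` and `predict` are hypothetical names for an undecorated copy of the scoring logic.

```python
from random import random


class SimpleModelLogic:
    """Undecorated, hypothetical copy of the SimpleModel scoring logic."""

    def __init__(self):
        self.weight = 0.5
        self.bias = 1
        self.prediction = 0.0  # state that persists between calls

    def predict(self, data):
        # random() is in [0.0, 1.0), so it can only scale the weighted input down.
        self.prediction = float(data) * self.weight * random() + self.bias
        return {"prediction": self.prediction}


model = SimpleModelLogic()
for data in (0.0, 0.5, 0.999):
    prediction = model.predict(data)["prediction"]
    upper_bound = model.bias + model.weight * data
    assert model.bias <= prediction <= upper_bound
    print(f"data={data}: {model.bias} <= {prediction:.4f} <= {upper_bound}")
```

The instance also remembers its most recent prediction in `self.prediction`, which is the kind of state a class deployment can hold and a function deployment cannot.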


### List Deployments

In [9]:
serve.list_deployments()

{'hello': Deployment(name=hello,version=None,route_prefix=/hello),
 'SimpleModel': Deployment(name=SimpleModel,version=None,route_prefix=/SimpleModel)}

In [10]:
serve.shutdown()

(ServeController pid=3302) 2022-04-19 11:27:58,981	INFO deployment_state.py:1236 -- Removing 1 replicas from deployment 'hello'. component=serve deployment=hello
(ServeController pid=3302) 2022-04-19 11:27:58,984	INFO deployment_state.py:1236 -- Removing 1 replicas from deployment 'SimpleModel'. component=serve deployment=SimpleModel


## Exercise (Optional) - Try adding more examples

Here are some things you can try:

1. Add a function, deploy, and send requests.
2. Add a class, deploy, and send requests.