# Ray Serve - Inference Graphs APIs

© 2019-2022, Anyscale. All Rights Reserved

📖 [Back to Table of Contents](./ex_00_tutorial_overview.ipynb)<br>
➡ [Next notebook]() <br>
⬅️ [Previous notebook](./ex_03_model_composition.ipynb) <br>


### Learning Objective:
In this introductory tutorial, you will:

* learn about the motivations for inference graph APIs
* understand the main concepts
* use the inference graph APIs to create a single deployment
* score an inference graph end to end

### Motivation, Concepts and Features

Machine learning serving systems are becoming more complex. They often consist of an ensemble of models arranged in a deployment pattern, with only a single deployment producing the final prediction or outcome. For example,
an image or video content pipeline may run a classification model followed by a tagging model, a fraud detection pipeline may apply multiple policies, and a recommendation system may use multi-stage ranking.

<img src="images/deployment_patterns.png" width="50%" height="25%">

It makes sense, then, to express a multi-stage model as an inference graph, where each node in the graph can be a single model deployment that performs a particular prediction given input from
an upstream node and generates a result for a downstream node to consume.

One attractive feature of inference graphs is the ability to build, test, and deploy an entire graph on your local machine and then deploy it to a staging or production server, all using Python APIs. A second is that, once in production, a deployment graph can be modified or scaled dynamically without touching the underlying Python code.

Each node within an inference graph serves a purpose as a functional unit of inference, and you can programmatically stitch together many types of functional nodes as the building blocks of your DAG. Let's examine the types that make up those building blocks, along with some terms:

**Deployment**: This is an end-to-end, scalable, and upgradeable group of actors managed by Ray Serve.

**DeploymentNode**: Each node within the inference graph is a unit of execution, created by calling `.bind()` on a Ray Serve decorated class or function. In short, any function or class decorated with `@serve.deployment` is a candidate for a node within an inference graph.

**InputNode**: As the name suggests, this is a special node that represents the input passed to a graph at runtime, that is, at inference time. There can be only a single `InputNode` per deployment graph.

**Deployment Graph**: This is a composite of deployment nodes bound together to define an entire inference graph, which can be deployed behind an HTTP endpoint.
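
To make the idea of binding concrete, here is a minimal plain-Python sketch of a lazily built graph. It does not use Ray: `ToyInput`, `ToyNode`, `bind`, and `resolve` are hypothetical stand-ins invented for illustration, not Ray Serve APIs. The point is only that binding records what to call and with which upstream nodes, and nothing runs until an input is supplied.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


class ToyInput:
    """Placeholder for the value supplied at execution time (loosely like an InputNode)."""


@dataclass
class ToyNode:
    """Records a function and its arguments without calling it (loosely like a DeploymentNode)."""
    func: Callable[..., Any]
    args: Tuple[Any, ...] = ()

    def execute(self, user_input: Any) -> Any:
        # Resolve upstream nodes first, then call this node's function.
        resolved = [resolve(arg, user_input) for arg in self.args]
        return self.func(*resolved)


def resolve(arg: Any, user_input: Any) -> Any:
    """Replace placeholders and upstream nodes with concrete values."""
    if isinstance(arg, ToyInput):
        return user_input
    if isinstance(arg, ToyNode):
        return arg.execute(user_input)
    return arg


def bind(func: Callable[..., Any], *args: Any) -> ToyNode:
    """Build a node; no computation happens here."""
    return ToyNode(func, args)


def double(x):
    return x * 2


def add(a, b):
    return a + b


graph_input = ToyInput()
doubled = bind(double, graph_input)   # node that depends on the input
graph = bind(add, doubled, 10)        # node that depends on another node

print(graph.execute(5))  # 5 * 2 + 10 = 20
```

The same graph object can be executed repeatedly with different inputs, which is the property that lets a deployment graph be defined once and then served.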

An example of an entire inference graph, with all its nodes:

<img src="images/deployment_graph_example.png" width="50%" height="25%">


### Five simple steps to build an inference graph

Let's build the above inference graph using the [Serve inference graph APIs](https://docs.ray.io/en/latest/serve/package-ref.html). We will build the following nodes in our directed acyclic graph (DAG):
 * InputNode
 * Preprocessor
 * AvgProcessor
 * Combiner
 * Model
 * Aggregator
 
 

### Step 1: Build processor nodes. 

Note that any function or class decorated with `@serve.deployment` can be converted into a node in the DAG. So let's define those functions. We declare them
`async` so that each deployment can handle many requests or invocations concurrently.

As noted above, you convert a function or class deployment into a node by calling `.bind(...)` on it; this results in an instance of a `DeploymentNode`, which is
of type `DAGNode`, our basic building block.
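
As a rough illustration of why the processor functions are declared `async`, the following plain-`asyncio` sketch runs two coroutines concurrently. It does not use Ray: `plain_preprocessor` and `plain_avg_preprocessor` are hypothetical stand-ins that mirror the tutorial's processors without the `@serve.deployment` decorator. Because both coroutines sleep at the same time, the pair should finish in roughly the time of the slower one rather than the sum of the two delays.

```python
import asyncio
import time


async def plain_preprocessor(input_data: str) -> int:
    await asyncio.sleep(0.1)   # simulated work
    return int(input_data)


async def plain_avg_preprocessor(input_data) -> float:
    await asyncio.sleep(0.15)  # simulated work
    return sum(input_data) / len(input_data)


async def run_both():
    start = time.perf_counter()
    # gather() schedules both coroutines and waits for both results.
    results = await asyncio.gather(
        plain_preprocessor("5"),
        plain_avg_preprocessor([1, 2]),
    )
    return results, time.perf_counter() - start


# In a notebook, where an event loop is already running,
# use `results, elapsed = await run_both()` instead.
results, elapsed = asyncio.run(run_both())
print(results, f"{elapsed:.2f}s")
```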



In [2]:
import time
import asyncio
import requests
import starlette

import ray
from ray import serve
from ray.serve.dag import InputNode
from ray.serve.drivers import DAGDriver
from ray.serve.http_adapters import json_request

In [3]:
@serve.deployment
async def preprocessor(input_data: str):
    """Simple feature processing that converts str to int"""
    await asyncio.sleep(0.1) # Manual delay for blocking computation
    return int(input_data)

@serve.deployment
async def avg_preprocessor(input_data):
    """Simple feature processing that returns average of input list as float."""
    await asyncio.sleep(0.15) # Manual delay for blocking computation
    return sum(input_data) / len(input_data)

### Step 2: Model nodes

After we get the preprocessed inputs, we’re ready to combine them into the request object we want to send to a model instantiated with initial weights. This means we need:

1. A model instance in the graph, instantiated with initial weights
2. A Combiner that references Model nodes for its runtime implementation by receiving them as init args in `.bind()`
3. The ability for the Combiner to receive and merge preprocessed inputs for the same user input, even though they might be produced asynchronously and received out of order.
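
The third requirement can be illustrated with plain `asyncio`, independent of Ray: `asyncio.gather` returns results in the order the awaitables were passed in, even when they complete in a different order. The helper names below (`finish_after`, `tracked`) are hypothetical and exist only for this sketch.

```python
import asyncio


async def finish_after(label: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return label


async def main():
    completion_order = []

    async def tracked(label: str, delay: float) -> str:
        result = await finish_after(label, delay)
        completion_order.append(result)  # record when each one actually finishes
        return result

    # "slow" is submitted first but finishes last.
    gathered = await asyncio.gather(tracked("slow", 0.05), tracked("fast", 0.01))
    return gathered, completion_order


# In a notebook, use `gathered, completion_order = await main()` instead.
gathered, completion_order = asyncio.run(main())
print(gathered)          # submission order
print(completion_order)  # completion order
```

Here `gathered` keeps the submission order while `completion_order` reflects which coroutine finished first, so a combiner built on `gather` can pair results with their inputs without extra bookkeeping.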

In [4]:
@serve.deployment
class Model:
    def __init__(self, weight: int):
        self.weight = weight

    async def forward(self, input: int):
        await asyncio.sleep(0.3) # Manual delay for blocking computation
        return f"({self.weight} * {input})"


### Step 3: Build the Combiner with aggregation based on user input and operation

Now we have the backbone of our DAG set up: splitting and preprocessing user inputs, aggregating them into new request data, and sending that data to the downstream model. Let’s add a bit more dynamic flavor to demonstrate that a deployment graph is fully Python programmable, by introducing control flow based on user input.

In [5]:
@serve.deployment
class Combiner:
    def __init__(self, m: Model):
        self.m = m

    async def run(self, req_part_1, req_part_2, operation):
        # Merge model input from two preprocessors
        req = f"({req_part_1} + {req_part_2})"

        # Submit to model for inference
        r1_ref = self.m.forward.remote(req)

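        # Async gathering of model forward results for same request data.
        # Note: the recorded output in Step 5 prints an ObjectRef instead of the
        # model's string, which suggests that in this Ray version the handle call
        # must itself be awaited first, e.g.
        # `r1_ref = await self.m.forward.remote(req)` (untested assumption).
        rst = await asyncio.gather(r1_ref)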

        # Control flow that determines runtime behavior based on user input
        if operation == "sum":
            return f"sum({rst})"
        else:
            return f"max({rst})"

### Step 4: Build our InputNode and driver deployment to handle HTTP ingress

Now we’ve built the entire Serve DAG, with its topology, argument binding, and user input. It’s time to add the last piece for Serve: a driver deployment to expose and configure HTTP. We configure it to start with two replicas in case the ingress deployment becomes the bottleneck of the DAG.

Serve provides a default `DAGDriver` implementation that accepts HTTP requests and orchestrates the deployment graph execution. You can import it with `from ray.serve.drivers import DAGDriver`.

You can configure how the `DAGDriver` converts `HTTP` requests. By default, it passes in a `starlette.requests.Request` object that represents the whole request. You can also specify one of the [built-in adapters](https://docs.ray.io/en/latest/serve/http-servehandle.html#serve-http-adapters). In this example, we use the `json_request` adapter, which parses the `HTTP` body with a `JSON` parser.
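
To see what a JSON-parsed body looks like from the graph's point of view, here is a small sketch using only the standard-library `json` module. It is an assumption that the adapter hands the graph a value equivalent to `json.loads` of the request body; the payload is the one used later in this tutorial, and the variable names are illustrative.

```python
import json

# The request body a client would send, as JSON text.
payload = ["5", [1, 2], "sum"]
body = json.dumps(payload)

# What a JSON parser produces from that body: a Python list.
parsed = json.loads(body)

# Index access mirrors partial access of the graph input by position.
first_part = parsed[0]   # string to convert to int
second_part = parsed[1]  # list of numbers to average
operation = parsed[2]    # control-flow switch

print(body)
print(first_part, second_part, operation)
```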

In [6]:
# DAG building done within the context manager for the InputNode
with InputNode() as dag_input:
    # Partial access of user input by index
    preprocessed_1 = preprocessor.bind(dag_input[0])
    preprocessed_2 = avg_preprocessor.bind(dag_input[1])
    
    # Create a model Node 
    m1 = Model.bind(1)
    
    # Use other DeploymentNode in bind()
    combiner = Combiner.bind(m1)
    
    # Use output of function DeploymentNode in bind()
    dag = combiner.run.bind(
        preprocessed_1, preprocessed_2, dag_input[2]
    )
    
    # Each serve dag has a driver deployment as ingress that can be user provided.
    serve_dag = DAGDriver.options(route_prefix="/my-dag", num_replicas=2).bind(
        dag, http_adapter=json_request
    )

### Step 5: Test the full DAG in both Python and HTTP


In [7]:
dag_handle = serve.run(serve_dag)

2022-08-09 08:40:30,835	INFO worker.py:1481 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265.
(ServeController pid=99732) INFO 2022-08-09 08:40:34,658 controller 99732 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-7611d638b249e3daec94c13412c09f68c30b0b0ac5ce9dc2c8bea79e' on node '7611d638b249e3daec94c13412c09f68c30b0b0ac5ce9dc2c8bea79e' listening on '127.0.0.1:8000'
(HTTPProxyActor pid=99749) INFO:     Started server process [99749]
(ServeController pid=99732) INFO 2022-08-09 08:40:36,099 controller 99732 deployment_state.py:1232 - Adding 1 replicas to deployment 'preprocessor'.
(ServeController pid=99732) INFO 2022-08-09 08:40:36,116 controller 99732 deployment_state.py:1232 - Adding 1 replicas to deployment 'avg_preprocessor'.
(ServeController pid=99732) INFO 2022-08-09 08:40:36,127 controller 99732 deployment_state.py:1232 - Ad

### Use Python API

In [8]:
# Python handle
cur = time.time()
print(ray.get(dag_handle.predict.remote(["5", [1, 2], "sum"])))
print(f"Time spent: {round(time.time() - cur, 2)} secs.")

sum([ObjectRef(8a908affb261979c08cc7f48cf6fc65e41cbd9df0100000002000000)])
Time spent: 0.23 secs.
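
Note that the recorded output above prints an `ObjectRef` rather than the model's string, and the call returns faster than the 0.3 s model delay alone, so the model result appears not to have been resolved before formatting. As a point of comparison, here is a minimal Ray-free sketch of the same data flow using plain `asyncio`. The `plain_*` and `Plain*` names are hypothetical stand-ins for the deployments above, and the simulated delays are omitted. It shows the string one would expect once the model output is resolved, under the assumption that the graph otherwise behaves like ordinary function composition.

```python
import asyncio


async def plain_preprocessor(input_data: str) -> int:
    return int(input_data)


async def plain_avg_preprocessor(input_data) -> float:
    return sum(input_data) / len(input_data)


class PlainModel:
    def __init__(self, weight: int):
        self.weight = weight

    async def forward(self, input) -> str:
        return f"({self.weight} * {input})"


class PlainCombiner:
    def __init__(self, m: PlainModel):
        self.m = m

    async def run(self, req_part_1, req_part_2, operation) -> str:
        req = f"({req_part_1} + {req_part_2})"
        # gather() returns a list of resolved results, one per awaitable.
        rst = await asyncio.gather(self.m.forward(req))
        if operation == "sum":
            return f"sum({rst})"
        return f"max({rst})"


async def run_graph(user_input) -> str:
    part_1, part_2 = await asyncio.gather(
        plain_preprocessor(user_input[0]),
        plain_avg_preprocessor(user_input[1]),
    )
    return await PlainCombiner(PlainModel(1)).run(part_1, part_2, user_input[2])


# In a notebook, use `await run_graph([...])` instead of asyncio.run(...).
result = asyncio.run(run_graph(["5", [1, 2], "sum"]))
print(result)  # sum(['(1 * (5 + 1.5))'])
```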


### Use HTTP endpoint

In [9]:
# Http endpoint
cur = time.time()
print(requests.post("http://127.0.0.1:8000/my-dag", json=["5", [1, 2], "sum"]).text)
print(f"Time spent: {round(time.time() - cur, 2)} secs.")

"sum([ObjectRef(92344ed58307ae8a08cc7f48cf6fc65e41cbd9df0100000002000000)])"
Time spent: 0.25 secs.


(HTTPProxyActor pid=99749) INFO 2022-08-09 08:40:38,677 http_proxy 127.0.0.1 http_proxy.py:315 - POST /my-dag 307 3.6ms
(ServeReplica:Model pid=99759) INFO 2022-08-09 08:40:38,671 Model Model#VGiIin replica.py:482 - HANDLE forward OK 300.5ms
(ServeReplica:DAGDriver pid=99762) INFO 2022-08-09 08:40:38,676 DAGDriver DAGDriver#xkuLPD replica.py:482 - HANDLE __call__ OK 0.2ms


In [10]:
serve.shutdown()

(ServeReplica:Model pid=99759) INFO 2022-08-09 08:40:39,217 Model Model#VGiIin replica.py:482 - HANDLE forward OK 300.5ms
(ServeController pid=99732) INFO 2022-08-09 08:40:39,287 controller 99732 deployment_state.py:1257 - Removing 1 replicas from deployment 'preprocessor'.
(ServeController pid=99732) INFO 2022-08-09 08:40:39,290 controller 99732 deployment_state.py:1257 - Removing 1 replicas from deployment 'avg_preprocessor'.
(ServeController pid=99732) INFO 2022-08-09 08:40:39,293 controller 99732 deployment_state.py:1257 - Removing 1 replicas from deployment 'Model'.
(ServeController pid=99732) INFO 2022-08-09 08:40:39,295 controller 99732 deployment_state.py:1257 - Removing 1 replicas from deployment 'Combiner'.
(ServeController pid=99732) INFO 2022-08-09 08:40:39,301 controller 99732 deployment_state.py:1257 - Removing 2 replicas from deployment 'DAGDriver'.


### Exercise

Try at least three of these exercises:

1. Try some input with the `max` operation instead of `sum`
2. Try a larger list of input numbers
3. Create a notebook and walk through the [extended version](https://docs.ray.io/en/latest/serve/deployment-graph.html#full-end-to-end-example-code) of the example above
4. [Optional] Add a `mean` operator as preprocessor
5. [Optional] Run [Simple Inference Graph](extras/simple_inference_graph.ipynb) in the extras directory
6. [Optional] Run [Simple Model Composition Graph](extras/simple_model_composition_graph.ipynb) in the extras directory

### Homework Challenge
 1. Read the Ray Serve deployment graphs [blog](https://www.anyscale.com/blog/multi-model-composition-with-ray-serve-deployment-graphs)
 2. Convert our [Model Composition example](ex_03_model_composition.ipynb) into an inference graph.

### References

1. [Multi-model composition with Ray Serve deployment graphs](https://www.anyscale.com/blog/multi-model-composition-with-ray-serve-deployment-graphs)
2. [Ray Serve 101](https://www.anyscale.com/events/2022/05/05/ray-serve-101-deploying-your-first-ml-model-locally-and-as-a-managed-service): Deploying your first ML model locally and as a managed service
3. [Productionizing ML at scale with Ray Serve](https://www.anyscale.com/events/2022/04/14/productionizing-ml-at-scale-with-ray-serve) 
