
<a id='end-to-end-tutorial'></a>

[Open in colab](https://colab.research.google.com/github/anyscale/academy/blob/main/ray-serve/e2e/tutorial.ipynb)

# Model Deployment with Ray Serve: Introduction

This tutorial has a few goals in mind.

1. Introduce you to the landscape of ML serving tools and where Ray Serve fits.
2. Teach you how to deploy any Python-based model with Ray Serve and compose models into production-ready pipelines.
3. Show you the concrete steps required to deploy models behind an interactive REST endpoint.

If you have any questions about this tutorial, or any follow-up questions, please feel free to ask in the [Ray discussion forum](https://discuss.ray.io/c/ray-serve/6).

# 1 Landscape of ML Tools

Where does Ray Serve fit in the landscape of machine learning deployment tools?

Commonly, there's a spectrum of tools:
- People typically start with either framework-specific servers (TFServing, TorchServe) or web servers (Flask, FastAPI) as an easy way to deploy a single model.
- For more "production readiness", various custom tooling is added (Docker, K8s, Golang-based microservices).
- But a glued-together system is hard to maintain, so folks start looking for special-purpose deployment tools (KubeFlow, KServe, Triton, etc.) to manage and deploy many models in production.

Across this spectrum, our team observes that you have to trade off ease of development against production scalability. Ray Serve lets you easily develop locally and then transparently scale to production.

![Serve aims at both ease of development and ready for production.](serve-position.svg)


# 2 Model Serving with Ray Serve

Adapted from our [documentation](https://docs.ray.io/en/master/serve/index.html#rayserve)

By the end of this tutorial you will have learned how to deploy a machine
learning model locally via Ray Serve.

First, install Ray Serve and all of its dependencies by running the following
command in your terminal:

In [4]:
%pip install -qq "ray[serve]"

Note: you may need to restart the kernel to use updated packages.


For this tutorial, we’ll use [HuggingFace’s SummarizationPipeline](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.SummarizationPipeline)
to access a model that summarizes text.

In [6]:
%pip install -qq transformers

Note: you may need to restart the kernel to use updated packages.


## Example Model

Let’s first take a look at how the model works, without using Ray Serve.
This is the code for the model:

In [7]:
from transformers import pipeline


def summarize(text):
    # Load model
    summarizer = pipeline("summarization", model="t5-small")

    # Run inference
    summary_list = summarizer(text)

    # Post-process output to return only the summary text
    summary = summary_list[0]["summary_text"]

    return summary


article_text = (
    "HOUSTON -- Men have landed and walked on the moon. "
    "Two Americans, astronauts of Apollo 11, steered their fragile "
    "four-legged lunar module safely and smoothly to the historic landing "
    "yesterday at 4:17:40 P.M., Eastern daylight time. Neil A. Armstrong, the "
    "38-year-old commander, radioed to earth and the mission control room "
    "here: \"Houston, Tranquility Base here. The Eagle has landed.\" The "
    "first men to reach the moon -- Armstrong and his co-pilot, Col. Edwin E. "
    "Aldrin Jr. of the Air Force -- brought their ship to rest on a level, "
    "rock-strewn plain near the southwestern shore of the arid Sea of "
    "Tranquility. About six and a half hours later, Armstrong opened the "
    "landing craft\'s hatch, stepped slowly down the ladder and declared as "
    "he planted the first human footprint on the lunar crust: \"That\'s one "
    "small step for man, one giant leap for mankind.\" His first step on the "
    "moon came at 10:56:20 P.M., as a television camera outside the craft "
    "transmitted his every move to an awed and excited audience of hundreds "
    "of millions of people on earth.")

summary = summarize(article_text)
print(summary)


Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at t5-small and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


two astronauts steered their lunar module safely and smoothly to the historic landing . the first men to reach the moon brought their ship to rest on a level, rock-strewn plain near the arid sea of Tranquility . a television camera outside the craft transmitted his every move to an awed audience .


The Python file, called `local_model.py`, uses the `summarize` function to
generate summaries of text.

- The `summarizer` variable inside `summarize` points to a
  function that uses the [t5-small](https://huggingface.co/t5-small)
  model to summarize text.  
- When `summarizer` is called on a Python string, it returns the summarized
  text inside a list of dictionaries formatted as
  `[{"summary_text": "...", ...}, ...]`.  
- `summarize` then extracts the summarized text by indexing into the first
  element of that list and reading its `"summary_text"` key.  
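
The return shape described above can be explored without downloading the model. The following is a minimal sketch that uses a hand-written stand-in for the pipeline output; the `fake_output` value and the `extract_summary` helper are illustrative names and are not part of the tutorial's code:

```python
from typing import Dict, List


def extract_summary(summary_list: List[Dict[str, str]]) -> str:
    """Return the summary text from the first result of a pipeline-style output."""
    if not summary_list:
        raise ValueError("the pipeline returned no results")
    return summary_list[0]["summary_text"]


# Hand-written stand-in for what `summarizer(text)` returns: a list of dicts.
fake_output = [{"summary_text": "two astronauts landed on the moon ."}]
summary = extract_summary(fake_output)
print(summary)
```

Checking for an empty list before indexing is an optional safeguard added here; the tutorial's `summarize` function indexes directly.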


Keep in mind that the `SummarizationPipeline` is an example machine learning
model for this tutorial. You can follow along using arbitrary models in any
framework that has a Python API. Check out our tutorials on scikit-learn,
PyTorch, and TensorFlow for more info and examples:

- [Keras and Tensorflow Tutorial](https://docs.ray.io/en/latest/serve/tutorials/tensorflow.html)
- [PyTorch Tutorial](https://docs.ray.io/en/latest/serve/tutorials/pytorch.html)
- [Scikit-Learn Tutorial](https://docs.ray.io/en/latest/serve/tutorials/sklearn.html)
- [Batching Tutorial](https://docs.ray.io/en/latest/serve/tutorials/batch.html)
- [RLlib Tutorial](https://docs.ray.io/en/latest/serve/tutorials/rllib.html)


## Converting to Ray Serve Deployment

This tutorial’s goal is to deploy this model using Ray Serve, so it can be
scaled up and queried over HTTP. We’ll start by converting the above Python
function into a Ray Serve deployment that can be launched locally on a laptop.

First, we need to import `ray` and `ray.serve` to use Ray Serve features such as deployments, which provide HTTP access to our model.

In [8]:
import ray
from ray import serve

After these imports, we can include our model code from above.
We won’t call our `summarize` function just yet though!
We will soon add logic to handle HTTP requests, so the `summarize` function
can operate on article text sent via HTTP request.

In [10]:
from transformers import pipeline

def summarize(text):
    summarizer = pipeline("summarization", model="t5-small")
    summary_list = summarizer(text)
    summary = summary_list[0]["summary_text"]
    return summary


Ray Serve needs to run on top of a Ray cluster, so we create a local one.
See [Deploying Ray Serve](https://docs.ray.io/en/latest/serve/deployment.html) to learn more about starting a Ray Serve
instance and deploying to a Ray cluster.

>**Note**
>
>You can use Ray to perform data processing, hyperparameter-tuning, and distributed model training as well! Learn more at http://ray.io/

In [11]:
ray.init()

2022-01-28 14:39:28,285	INFO services.py:1412 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '127.0.0.1',
 'raylet_ip_address': '127.0.0.1',
 'redis_address': '127.0.0.1:6379',
 'object_store_address': '/tmp/ray/session_2022-01-28_14-39-24_412847_89101/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2022-01-28_14-39-24_412847_89101/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2022-01-28_14-39-24_412847_89101',
 'metrics_export_port': 64097,
 'gcs_address': '127.0.0.1:63389',
 'address': '127.0.0.1:6379',
 'node_id': '1a270d4109b030c049a391f40914cd03d9e33a0ff2b2abadc052ac5e'}

In [12]:
serve.start()

(ServeController pid=41296) 2022-01-28 14:40:03,882	INFO checkpoint_path.py:16 -- Using RayInternalKVStore for controller checkpoint and recovery.
(ServeController pid=41296) 2022-01-28 14:40:03,989	INFO http_state.py:101 -- Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:dPuHru:SERVE_PROXY_ACTOR-node:127.0.0.1-0' on node 'node:127.0.0.1-0' listening on '127.0.0.1:8000'
(HTTPProxyActor pid=41297) INFO:     Started server process [41297]
2022-01-28 14:40:04,456	INFO api.py:518 -- Started Serve instance in namespace 'ca9d03ec-9ba7-4937-b1a8-9d2e3e95ad99'.


<ray.serve.api.Client at 0x7fdc21e5a7f0>

Now that we have defined our `summarize` function, connected to a Ray
Cluster, and started the Ray Serve runtime, we can define a function that
accepts HTTP requests and routes them to the `summarize` function. We
define a function called `router` that takes in a Starlette `request`
object:

In [14]:
@serve.deployment
def router(request):
    txt = request.query_params["txt"]
    return summarize(txt)


- In line 1, we add the decorator `@serve.deployment`
  to the `router` function to turn the function into a Serve `Deployment`
  object.  
- In line 3, `router` uses the `"txt"` query parameter in the `request`
  to get the article text to summarize.  
- In line 4, it then passes this article text into the `summarize` function
  and returns the value.  
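
The handler logic described above can be exercised without starting Serve. The sketch below is a plain-Python approximation: it drops the `@serve.deployment` decorator, replaces the model with a hypothetical stub `summarize`, and uses a `SimpleNamespace` as a stand-in for the Starlette request, since `router` only reads `query_params`:

```python
from types import SimpleNamespace


def summarize(text: str) -> str:
    # Hypothetical stand-in for the real model call: keeps the first five words.
    return " ".join(text.split()[:5])


def router(request) -> str:
    # Same body as the deployed function, minus the @serve.deployment decorator.
    txt = request.query_params["txt"]
    return summarize(txt)


# Stand-in for a Starlette request; only `query_params` is used by `router`.
fake_request = SimpleNamespace(
    query_params={"txt": "Men have landed and walked on the moon."}
)
result = router(fake_request)
print(result)
```

A request without a `"txt"` entry raises `KeyError` in this sketch, which is one reason to consider explicit validation later on.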


>**Note**
>
>Lines 3 and 4 define our HTTP request schema. The HTTP requests sent to this
endpoint must have a `"txt"` query parameter that contains a string.
In general, you can accept HTTP data using query parameters or the
request body. Additionally, you can add other Serve deployments with
different names to create more endpoints that can accept different schemas.
For more complex validation, you can also use FastAPI (see
[FastAPI HTTP Deployments](https://docs.ray.io/en/latest/serve/http-servehandle.html#fastapi-http-deployments) for more info).

This routing function’s name doesn’t have to be `router`. Serve uses your function name to namespace the route. For example, the `router` function will be accessible via `http://localhost:8000/router?txt=your-data`. You can change the function name or explicitly provide a `route_prefix` to Ray Serve via `@serve.deployment(route_prefix="/...")`.

Since `@serve.deployment` makes `router` a `Deployment` object, it can be
deployed using `router.deploy()`:

In [15]:
router.deploy()

2022-01-28 14:44:59,844	INFO api.py:259 -- Updating deployment 'router'. component=serve deployment=router
(ServeController pid=41296) 2022-01-28 14:44:59,866	INFO deployment_state.py:919 -- Adding 1 replicas to deployment 'router'. component=serve deployment=router
2022-01-28 14:45:05,895	INFO api.py:272 -- Deployment 'router' is ready at `http://127.0.0.1:8000/router`. component=serve deployment=router


Once we deploy `router`, we can query the model over HTTP.

## Testing the Ray Serve Deployment

We can now test our model over HTTP. The structure of our HTTP query is:

`http://127.0.0.1:8000/[Deployment Name]?[Parameter Name-1]=[Parameter Value-1]&[Parameter Name-2]=[Parameter Value-2]&...&[Parameter Name-n]=[Parameter Value-n]`

Since the cluster is deployed locally in this tutorial, `127.0.0.1:8000`
refers to localhost on port 8000. The `[Deployment Name]` refers to
either the name of the function that we called `.deploy()` on (in our case,
this is `router`), or the `name` keyword parameter’s value in
`@serve.deployment` (see the paragraph after the `router` function definition
above for more info on changing the route).

Each `[Parameter Name]` refers to a field’s name in the
request’s `query_params` dictionary for our deployed function. In our
example, the only parameter we need to pass in is `txt`. This parameter is
referenced in the `txt = request.query_params["txt"]` line in the `router`
function. Each [Parameter Name] object has a corresponding [Parameter Value]
object. The `txt`’s [Parameter Value] is a string containing the article
text to summarize. We can chain together any number of the name-value pairs
using the `&` symbol in the request URL.
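
The URL structure described above can be assembled with Python's standard library, which also takes care of escaping spaces and punctuation in the parameter values. This is a minimal sketch; `build_query_url` is an illustrative helper, and the second parameter `lang` is hypothetical and only shows how pairs are chained with `&` (the `router` deployment reads only `txt`):

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(deployment_name: str, params: dict,
                    host: str = "127.0.0.1", port: int = 8000) -> str:
    """Build http://host:port/<deployment name>?<name-1>=<value-1>&..."""
    return f"http://{host}:{port}/{deployment_name}?{urlencode(params)}"


url = build_query_url("router", {"txt": "The Eagle has landed.", "lang": "en"})
print(url)

# Decoding the query string recovers the original name-value pairs.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```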

Now that the `summarize` function is deployed on Ray Serve, we can make HTTP
requests to it. Here’s a client script that requests a summary from the same
article as the original Python script:

In [16]:

import requests

article_text = (
    "HOUSTON -- Men have landed and walked on the moon. "
    "Two Americans, astronauts of Apollo 11, steered their fragile "
    "four-legged lunar module safely and smoothly to the historic landing "
    "yesterday at 4:17:40 P.M., Eastern daylight time. Neil A. Armstrong, the "
    "38-year-old commander, radioed to earth and the mission control room "
    "here: \"Houston, Tranquility Base here. The Eagle has landed.\" The "
    "first men to reach the moon -- Armstrong and his co-pilot, Col. Edwin E. "
    "Aldrin Jr. of the Air Force -- brought their ship to rest on a level, "
    "rock-strewn plain near the southwestern shore of the arid Sea of "
    "Tranquility. About six and a half hours later, Armstrong opened the "
    "landing craft\'s hatch, stepped slowly down the ladder and declared as "
    "he planted the first human footprint on the lunar crust: \"That\'s one "
    "small step for man, one giant leap for mankind.\" His first step on the "
    "moon came at 10:56:20 P.M., as a television camera outside the craft "
    "transmitted his every move to an awed and excited audience of hundreds "
    "of millions of people on earth.")

# Pass the text via `params` so that requests URL-encodes it for us.
response = requests.get("http://127.0.0.1:8000/router",
                        params={"txt": article_text}).text

response

(router pid=41299) Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at t5-small and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']
(router pid=41299) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(router pid=41299)   beam_id = beam_token_id // vocab_size


'two astronauts steered their lunar module safely and smoothly to the historic landing . the first men to reach the moon brought their ship to rest on a level, rock-strewn plain near the arid sea of Tranquility . a television camera outside the craft transmitted his every move to an awed audience .'

## Using Classes in the Ray Serve Deployment

Our application is still a bit inefficient though. In particular, the
`summarize` function loads the model on each call when it sets the
`summarizer` variable. However, the model never changes, so it would be more
efficient to define `summarizer` only once and keep its value in memory
instead of reloading it for each HTTP query.

We can achieve this by converting our `summarize` function into a class:

In [17]:
@serve.deployment
class Summarizer:
    def __init__(self):
        self.summarize = pipeline("summarization", model="t5-small")

    def __call__(self, request):
        txt = request.query_params["txt"]
        summary_list = self.summarize(txt)
        summary = summary_list[0]["summary_text"]
        return summary


Summarizer.deploy()


2022-01-28 14:53:05,646	INFO api.py:259 -- Updating deployment 'Summarizer'. component=serve deployment=Summarizer
(ServeController pid=41296) 2022-01-28 14:53:05,703	INFO deployment_state.py:919 -- Adding 1 replicas to deployment 'Summarizer'. component=serve deployment=Summarizer
(Summarizer pid=41292) Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at t5-small and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']
(Summarizer pid=41292) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2022-01-28 14:53:13,052	INFO api.py:272 -- Deployment 'Summarizer' is ready at `http://127.0.0.1:8000/Summarizer`. component=serve deployment=Summarizer


In this configuration, we can query the `Summarizer` class directly.
The `Summarizer` is initialized once (after calling `Summarizer.deploy()`).
Its `__init__` method loads and stores the model in
`self.summarize`. HTTP queries for the `Summarizer` class are routed to its
`__call__` method by default, which takes in the Starlette `request`
object. The `Summarizer` class can then take the request’s `txt` data and
call the `self.summarize` function on it without loading the model on each
query.
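
The difference between loading per call and loading once can be made visible with a counter. The sketch below is plain Python without Serve or the real model; `load_model` is a hypothetical stand-in for `pipeline("summarization", model="t5-small")` that only counts how often it is invoked:

```python
load_count = 0


def load_model():
    """Hypothetical stand-in for the pipeline constructor; counts loads."""
    global load_count
    load_count += 1
    return lambda text: text.upper()


def summarize_per_call(text):
    model = load_model()  # reloaded on every request
    return model(text)


class LoadOnce:
    def __init__(self):
        self.model = load_model()  # loaded once and kept in memory

    def __call__(self, text):
        return self.model(text)


for _ in range(3):
    summarize_per_call("hello")
per_call_loads = load_count

load_count = 0
handler = LoadOnce()
for _ in range(3):
    handler("hello")
class_loads = load_count

print(per_call_loads, class_loads)
```

Three requests trigger three loads in the function version and a single load in the class version, which is the saving the class-based deployment is after.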

>**Note**
>
>Instance variables can also store state. For example, to
>count the number of requests served, a `@serve.deployment` class can define
>a `self.counter` instance variable in its `__init__` function and set it
>to 0. When the class is queried, it can increment the `self.counter`
>variable inside of the function responding to the query. The `self.counter`
>will keep track of the number of requests served across requests.
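
The counter idea from the note can be sketched in plain Python. The class below is an illustration, not code from the tutorial: in Serve it would carry the `@serve.deployment` decorator, and the `SimpleNamespace` object stands in for the Starlette request. The returned dictionary is a made-up response shape:

```python
from types import SimpleNamespace


class CountingSummarizer:
    """Plain-Python version; in Serve this class would use @serve.deployment."""

    def __init__(self):
        self.counter = 0  # state kept across requests

    def __call__(self, request):
        self.counter += 1
        txt = request.query_params["txt"]
        return {"requests_served": self.counter, "length": len(txt)}


handler = CountingSummarizer()
fake_request = SimpleNamespace(query_params={"txt": "some article text"})
first = handler(fake_request)
second = handler(fake_request)
print(first["requests_served"], second["requests_served"])
```

Keep in mind that the state lives in the instance, so a deployment scaled to several replicas would hold one counter per replica rather than a single global count.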

HTTP queries for the Ray Serve class deployments follow a similar format to Ray
Serve function deployments. Here’s an example client script for the
`Summarizer` class. Notice that the only difference from the `router`’s
client script is that the URL uses the `Summarizer` path instead of
`router`.

In [18]:
# Pass the text via `params` so that requests URL-encodes it for us.
response = requests.get("http://127.0.0.1:8000/Summarizer",
                        params={"txt": article_text}).text

response

(Summarizer pid=41292)   beam_id = beam_token_id // vocab_size


'two astronauts steered their lunar module safely and smoothly to the historic landing . the first men to reach the moon brought their ship to rest on a level, rock-strewn plain near the arid sea of Tranquility . a television camera outside the craft transmitted his every move to an awed audience .'

## Adding Functionality with FastAPI

Now suppose we want to expose additional functionality in our model. In
particular, the summarization pipeline stored in `self.summarize` also accepts
`min_length` and `max_length` parameters. Although we could expose these
options as additional parameters in the URL, Ray Serve also allows us to add
more route options to the URL itself and handle each route separately.

Because this logic can get complex, Serve integrates with
[FastAPI](https://fastapi.tiangolo.com/). This allows us to define a Serve
deployment by adding the `@serve.ingress` decorator to a FastAPI app. For
more info about FastAPI with Serve, please see [FastAPI HTTP Deployments](https://docs.ray.io/en/latest/serve/http-servehandle.html#fastapi-http-deployments).

As an example of FastAPI, here’s a modified version of our `Summarizer` class
that accepts a JSON request body containing the article text and optional
`min_length` and `max_length` values for the summary:

In [19]:
summarizer = pipeline("summarization", model="t5-small")

Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at t5-small and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [32]:
import ray
from ray import serve
from fastapi import FastAPI
from transformers import pipeline
from pydantic import BaseModel, PositiveInt, constr
from typing import Optional

app = FastAPI()


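class Request(BaseModel):
    # Text must be non-empty after surrounding whitespace is stripped.
    text: constr(min_length=1, strip_whitespace=True)
    # Optional length bounds; default to None when omitted from the JSON body.
    min_length: Optional[PositiveInt] = None
    max_length: Optional[PositiveInt] = None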


@serve.deployment
@serve.ingress(app)
class Summarizer:
    def __init__(self):
        self.summarize = pipeline("summarization", model="t5-small")

    @app.post("/")
    def get_summary(self, payload: Request):
        summary_list = self.summarize(
            payload.text, 
            min_length=payload.min_length or 0, 
            max_length=payload.max_length or 256,
        )
        summary = summary_list[0]["summary_text"]
        return summary


Summarizer.deploy()


2022-01-28 16:19:43,564	INFO api.py:259 -- Updating deployment 'Summarizer'. component=serve deployment=Summarizer
(ServeController pid=41296) 2022-01-28 16:19:43,618	INFO deployment_state.py:881 -- Stopping 1 replicas of deployment 'Summarizer' with outdated versions. component=serve deployment=Summarizer
(ServeController pid=41296) 2022-01-28 16:19:45,789	INFO deployment_state.py:919 -- Adding 1 replicas to deployment 'Summarizer'. component=serve deployment=Summarizer
(Summarizer pid=41285) Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at t5-small and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']
(Summarizer pid=41285) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2022-01-28 16:19:52,480	INFO api.py:272 -- Deployment 'Summarizer' is ready at `http://127.0.0.1:8000/Sum

The class now exposes the same route, with the following features:

- It takes a JSON-formatted POST request.
- It validates that the input text is not empty.
- It accepts optional `min_length` and `max_length` values and validates them.

Notice that `Summarizer`’s methods no longer take in a Starlette `request`
object. Instead, `get_summary` receives the parsed and validated JSON body as
a `Request` object through FastAPI’s
[body parameter](https://fastapi.tiangolo.com/tutorial/body/) feature.
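
The validation rules for the request body can be mirrored on the client side before a request is sent. The sketch below is a standard-library approximation of those rules, not the pydantic model the deployment uses; `build_payload` is an illustrative helper name:

```python
import json
from typing import Optional


def build_payload(text: str, min_length: Optional[int] = None,
                  max_length: Optional[int] = None) -> dict:
    """Mirror the three rules: non-empty text, optional positive lengths."""
    text = text.strip()
    if not text:
        raise ValueError("text must not be empty")
    payload = {"text": text}
    for name, value in (("min_length", min_length), ("max_length", max_length)):
        if value is None:
            continue  # optional field omitted
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer")
        payload[name] = value
    return payload


payload = build_payload("  The Eagle has landed.  ", max_length=10)
body = json.dumps(payload)
print(body)
```

The server-side model remains the source of truth. A client-side check like this only catches obvious mistakes before a round trip.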

Since we still deploy our model locally, the full URL still uses the
localhost IP. This means the route comes after the
`http://127.0.0.1:8000` IP and port address.

In [33]:
response = requests.post("http://127.0.0.1:8000/Summarizer",
                        json={"text": article_text, "max_length": 10}).text

response

(Summarizer pid=41285)   beam_id = beam_token_id // vocab_size


'"two astronauts steered their lunar module"'

Congratulations! You just built and deployed a machine learning model on Ray
Serve! You should now have enough context to dive into the [Core API: Deployments](https://docs.ray.io/en/latest/serve/core-apis.html) to
get a deeper understanding of Ray Serve.

# 3 Deploying to Cloud Endpoint

Now the Ray Serve endpoint is ready for production! It has request validation, it uses memory efficiently, and you can transparently scale it out. But you can't just keep it running in a notebook for live serving.

Here are some practical recommendations for deployment:
- For a quick demo, use [ngrok](https://ngrok.com/) to expose your local port to the internet.
- For a quick proof of concept on GCP, use Google Cloud Run to run a containerized application.
- For production deployment, use [Kubernetes with the Ray operator](https://docs.ray.io/en/latest/serve/deployment.html#deploying-on-kubernetes) or a hosted provider like [Anyscale](https://www.anyscale.com).

### Cloud Run Example
You can take a look at the [example directory](https://github.com/anyscale/academy/tree/main/ray-serve/e2e/deploy-cloud-run) for a complete example of containerizing the application and deploying it to Google Cloud Run. It contains:
- `deploy.py` for the application.
- `test-query.py` for calling the application as a client.
- `requirements.txt` for Python dependencies.
- `Dockerfile` for describing how to containerize the application and make it runnable as docker containers.

### Anyscale Example
To deploy to the Anyscale hosted service, all you need to do is set the environment variable `RAY_ADDRESS="anyscale://your-cluster-name"` or use the [production services construct](https://docs.anyscale.com/user-guide/run-and-monitor/production-services). You do need an invite at the moment, though.
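
As a rough sketch of the environment-variable approach, the variable can also be set from Python before Ray is initialized. The cluster name below is the placeholder from the text, not a real cluster, and the commented-out calls assume the same `ray.init()` and `serve.start()` cells used earlier in this tutorial:

```python
import os

# Placeholder cluster name copied from the text; replace it with a real one.
# `setdefault` leaves an already exported RAY_ADDRESS untouched.
os.environ.setdefault("RAY_ADDRESS", "anyscale://your-cluster-name")

address = os.environ["RAY_ADDRESS"]
print(address)

# With the variable set, the tutorial's existing calls would follow unchanged:
#   ray.init()
#   serve.start()
```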


# 4 Bonus: Real World Model Composition

.. TODO


![