# Development to Production Workflow in Ray Serve

© 2019-2022, Anyscale. All Rights Reserved


📖 [Back to Table of Contents](./ex_00_tutorial_overview.ipynb)<br>
⬅️ [Previous notebook](./ex_03_deployment_graph.ipynb) <br>

### Learning Objective:
In this tutorial, you will learn how to:

 * operationalize your model into production
 * run Serve interactively from a terminal
 * use the `serve` command-line interface (CLI) to update deployment configs 
 * use the `serve` CLI to query and monitor model deployment status

Now that you have learned the fundamentals of Ray Serve and its deployment graph feature, let's look at how you can operate it in production. So far, the only API you've used is `serve.run(node)`, called from Python to interactively deploy a deployment graph to a local Ray cluster.

From here on, you will work mainly in a terminal rather than in a notebook, because the production workflow is driven by CLI commands.

You can open a terminal in a few ways:
- Click the "+" button in the file browser and select "Terminal" in the launcher tab. This will create a terminal within the [Jupyter environment](https://jupyterlab.readthedocs.io/en/stable/user/terminal.html). 
- Switch back to your Anyscale cluster page and use the embedded terminal.
- For a one-off command, you can prefix it with `!` and run it in a Python code cell in this notebook. This is not suitable for long-running commands that require `Ctrl+C` to exit, because it will block the execution of subsequent cells.

Once you have a terminal environment, try running `serve --help` to list all available commands.

In [4]:
!serve --help

Usage: serve [OPTIONS] COMMAND [ARGS]...

  CLI for managing Serve instances on a Ray cluster.

Options:
  --help  Show this message and exit.

Commands:
  build     Writes a Serve Deployment Graph's config file.
  config    Get the current config of the running Serve app.
  deploy    Deploy a Serve app from a YAML config file.
  run       Run a Serve app.
  shutdown  Deletes the Serve app.
  start     Start a detached Serve instance on the Ray cluster.
  status    Get the current status of the running Serve app.

Let's go over a typical development to production workflow. In particular, we will show:
- How to use `serve run` to interactively develop and test your program.
- How to use `serve build` to generate a YAML configuration file for production.
- How to use `serve deploy` to deploy your application to a remote Ray cluster, and use the same command to perform idempotent updates.
- How to use `serve status` and `serve config` to inspect the current state and target state of your Serve application, respectively. 

These cover a typical production workflow from running to monitoring your application. In more detail:

|Command|Effect|Stage|Called When|
|---|---|---|---|
|`run`|Interactively run a Serve app|Development|Each time you update the code|
| `build`|Generate a YAML configuration file|Development|Once or a few times before moving to production|
| `deploy`|Deploy the Serve app to a remote cluster|Production|Once for every production update|
| `status`|Inspect current state of the running Serve app|Both|Periodically or on demand|
| `config`|Retrieve the desired config for the running Serve app|Both|Periodically or on demand|

## `serve run`: Interactively run a Serve app

The first command is the CLI equivalent of the Python API `serve.run(node)`, which it calls under the hood. It is especially useful during development, because you can start the server in one terminal and query it from another side by side.

In [11]:
!serve run --help

Usage: serve run [OPTIONS] CONFIG_OR_IMPORT_PATH

  Runs the Serve app from the specified import path or YAML config. Any import
  path must lead to a FunctionNode or ClassNode object. By default, this will
  block and periodically log status. If you Ctrl-C the command, it will tear
  down the app.

Options:
  --runtime-env TEXT           Path to a local YAML file containing a
                               runtime_env definition. This will be passed to
                               ray.init() as the default for deployments.
  --runtime-env-json TEXT      JSON-serialized runtime_env dictionary. This
                               will be passed to ray.init() as the default for
                               deployments.
  --working-dir TEXT           Directory containing files that your job will
                               run in. Can be a local directory or a remote
                               URI to a .zip file (S3, GS, HTTP). This
                               overrides the 

Let's write our "Hello, world!" application.

In [15]:
%%writefile prod_examples/hello_world.py
from ray import serve
from starlette.requests import Request

@serve.deployment
def hello_world(request: Request):
    # Greet the caller by the `name` query parameter, defaulting to "Ray".
    return f"Hello, {request.query_params.get('name', 'Ray')}!"

app = hello_world.bind()

Overwriting prod_examples/hello_world.py


Next, go to a terminal and run `serve run prod_examples.hello_world:app`. You should see output similar to the following:

```console
2022-08-15 23:36:18,605	INFO scripts.py:294 -- Deploying from import path: "prod_examples.hello_world:app".
2022-08-15 23:36:20,327	INFO worker.py:1481 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265.
(ServeController pid=17067) INFO 2022-08-15 23:36:21,258 controller 17067 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-ad023f4d960fb94ad83734830acf137ece63aaa0d889d5d3f490160d' on node 'ad023f4d960fb94ad83734830acf137ece63aaa0d889d5d3f490160d' listening on '127.0.0.1:8000'
(ServeController pid=17067) INFO 2022-08-15 23:36:21,672 controller 17067 deployment_state.py:1232 - Adding 1 replicas to deployment 'hello_world'.
(HTTPProxyActor pid=17069) INFO:     Started server process [17069]
2022-08-15 23:36:22,691	SUCC scripts.py:307 -- Deployed successfully.
```

If we call the application using `curl http://localhost:8000\?name\=Summit`, logs will start to show up:
```console
(HTTPProxyActor pid=17323) INFO 2022-08-15 23:37:39,506 http_proxy 127.0.0.1 http_proxy.py:315 - POST / 200 4.2ms
(ServeReplica:hello_world pid=17327) INFO 2022-08-15 23:37:39,505 hello_world hello_world#BncTya replica.py:482 - HANDLE __call__ OK 0.3ms
(HTTPProxyActor pid=17323) INFO 2022-08-15 23:37:42,670 http_proxy 127.0.0.1 http_proxy.py:315 - POST / 200 2.0ms
(ServeReplica:hello_world pid=17327) INFO 2022-08-15 23:37:42,669 hello_world hello_world#BncTya replica.py:482 - HANDLE __call__ OK 0.1ms
```

`serve run` is a convenient, oftentimes preferred, tool for interactively developing your application.
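While `serve run` is active in one terminal, you can also query the app from Python instead of `curl`. The following is a minimal sketch using only the standard library; the helper names `hello_url` and `query_hello` are made up for this illustration, and the base URL assumes the default `127.0.0.1:8000` proxy address shown in the logs above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hello_url(name: str, base_url: str = "http://localhost:8000/") -> str:
    """Build the request URL, URL-encoding the `name` query parameter."""
    return f"{base_url}?{urlencode({'name': name})}"


def query_hello(name: str) -> str:
    """Send a GET request to the running app and return the response body."""
    with urlopen(hello_url(name), timeout=5) as response:
        return response.read().decode("utf-8")


# With `serve run prod_examples.hello_world:app` running in another terminal:
#   print(query_hello("Summit"))
```

Building the URL with `urlencode` avoids the shell escaping needed in the `curl` command (`\?` and `\=`).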

## `serve build`: Generate a config file for a Serve application

For the rest of this tutorial, we will use the following Serve application as a working example.

The application takes in requests containing a list of two values, a fruit name and an amount, and returns the total price for the batch of fruits.

The file is saved in `prod_examples/fruit.py` and rendered here.

In [22]:
%pfile prod_examples/fruit.py

Object `prod_examples/fruit.py` not found.


import ray
from ray import serve
from ray.serve.drivers import DAGDriver
from ray.serve.deployment_graph import InputNode
from ray.serve.handle import RayServeDeploymentHandle
from ray.serve.http_adapters import json_request

# These imports are used only for type hints:
from typing import Dict, List
from starlette.requests im

Serve supports configuring your application through a YAML file. It is important to know that *your code* is still the source of truth and describes the end-to-end dataflow. The YAML configuration, however, lets you apply updates to the desired state of the application.

Let's take a look at the YAML file built by the `serve build` command.

In [25]:
!serve build --help

Usage: serve build [OPTIONS] IMPORT_PATH

  Imports the ClassNode or FunctionNode at IMPORT_PATH and generates a
  structured config for it that can be used by `serve deploy` or the REST API.

Options:
  -d, --app-dir TEXT      Local directory to look for the IMPORT_PATH (will be
                          inserted into PYTHONPATH). Defaults to '.', meaning
                          that an object in ./main.py can be imported as
                          'main.object'. Not relevant if you're importing from
                          an installed module.
  -o, --output-path TEXT  Local path where the output config will be written
                          in YAML format. If not provided, the config will be
                          printed to STDOUT.
  --help                  Show this message and exit.

In [24]:
!serve build prod_examples.fruit:deployment_graph | head -n 20

# This file was generated using the `serve build` command on Ray v2.0.0rc1.

import_path: prod_examples.fruit:deployment_graph

runtime_env: {}

deployments:

- name: MangoStand
  num_replicas: 1
  route_prefix: null
  max_concurrent_queries: 100
  user_config:
    price: 3
  autoscaling_config: null
  graceful_shutdown_wait_loop_s: 2.0
  graceful_shutdown_timeout_s: 20.0
  health_check_period_s: 10.0
  health_check_timeout_s: 30.0
  ray_actor_options: null


Taking a closer look at the YAML file, you might have noticed a simple structure:

```yaml
import_path: ...
runtime_env: ...
deployments:
    - name: ...
      num_replicas: ...
      ...
    - name:
      ...
    ...
```

The file contains the following fields:

- An `import_path`, which is the path to your top-level bound Serve deployment (or the same path passed to `serve run`). The most minimal config file consists of only an `import_path`.
- A `runtime_env` that defines the environment that the application will run in. Note that the file specified by `import_path` must be available _within_ the `working_dir` directory of the `runtime_env` if it's specified.
- A list of `deployments`. This is optional and allows you to override the `@serve.deployment` settings specified in the deployment graph code. Each entry in this list must include the deployment `name`, which must match the one in the code. If this section is omitted, Serve launches all deployments in the graph with the settings specified in the code.


The generated file records the import path that was passed to `serve build`, and it has five entries in the `deployments` list, one for each deployment in the graph. Every entry contains a `name` setting along with other configuration options such as `num_replicas` or `user_config`.

Each individual entry in the `deployments` list is optional. In the example config file above, we could omit the `PearStand` entry, including its `name` and `user_config`, and the file would still be valid. When we deploy the file, the `PearStand` deployment will still be deployed, using the configuration set in the `@serve.deployment` decorator in the deployment graph's code.

Settings from `@serve.deployment` can be overridden with this Serve config YAML file. The order of priority is (from highest to lowest):

1. Config File
2. Deployment graph code (either through the `@serve.deployment` decorator or a `.set_options()` call)
3. Serve defaults

For example, if a deployment's `num_replicas` is specified in both the config file and the graph code, Serve will use the config file's value. If it's only specified in the code, Serve will use the code value. If it isn't specified anywhere, Serve will use its default (`num_replicas=1`).
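The priority order can be pictured as a layered lookup. The sketch below is only an illustration of the rule, not how Serve implements it; `resolve_options` and `SERVE_DEFAULTS` are hypothetical names, and the defaults dictionary contains only the one default stated above.

```python
from collections import ChainMap
from typing import Any, Dict

# Only the default mentioned in the text; real Serve has more settings.
SERVE_DEFAULTS: Dict[str, Any] = {"num_replicas": 1}


def resolve_options(
    config_file: Dict[str, Any],
    graph_code: Dict[str, Any],
    defaults: Dict[str, Any] = SERVE_DEFAULTS,
) -> Dict[str, Any]:
    """Return the effective settings; earlier mappings win on lookup."""
    return dict(ChainMap(config_file, graph_code, defaults))


# Config file beats graph code, which beats the Serve default.
print(resolve_options({"num_replicas": 3}, {"num_replicas": 2}))
print(resolve_options({}, {"num_replicas": 2}))
print(resolve_options({}, {}))
```

`ChainMap` searches its mappings from left to right, which mirrors the "highest to lowest" ordering of the list above.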

## `serve deploy`: Deploy the Serve app to a Ray cluster

With the configuration file at hand, we can deploy the application to the Ray cluster. This command allows you to create, update, or configure the Serve application on any cluster.

In [28]:
!serve deploy --help

Usage: serve deploy [OPTIONS] CONFIG_FILE_NAME

  Deploys deployment(s) from a YAML config file.

  This call is async; a successful response only indicates that the request
  was sent to the Ray cluster successfully. It does not mean the the
  deployments have been deployed/updated.

  Use `serve config` to fetch the current config and `serve status` to check
  the status of the deployments after deploying.

Options:
  -a, --address TEXT  Address to use to query the Ray dashboard agent
                      (defaults to http://localhost:52365). Can also be
                      specified using the RAY_AGENT_ADDRESS environment
                      variable.
  --help              Show this message and exit.

Let's try to deploy the application specified in the configuration YAML generated by `serve build`.

Note that in the YAML, the import path (e.g., `fruit:deployment_graph`) must be importable by Serve at runtime. When running locally, this might be in your current working directory.
However, when running on a cluster, you also need to make sure the path is importable.

You can achieve this either by building the code into the cluster's container image (see [Cluster Configuration](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/config.html#kuberay-config) for more details) or by using a `runtime_env` with a [remote URI](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#remote-uris) that hosts the code in remote storage.

As an example, we have [pushed a copy of the FruitStand deployment graph to GitHub](https://github.com/ray-project/test_dag/blob/40d61c141b9c37853a7014b8659fc7f23c1d04f6/fruit.py). We will be using that as the working directory to host the application.

In [30]:
!head prod_examples/config.yaml

# This file was generated using the `serve build` command on Ray v2.0.0rc1.

import_path: fruit:deployment_graph

runtime_env:
    working_dir: "https://github.com/ray-project/serve_config_examples/archive/HEAD.zip"

deployments:

- name: MangoStand


Let's call the command to deploy the application!

In [32]:
!serve deploy prod_examples/config.yaml

2022-08-16 00:34:33,990	SUCC scripts.py:180 --
Sent deploy request successfully!
 * Use `serve status` to check deployments' statuses.
 * Use `serve config` to see the running app's config.

In [35]:
!curl http://localhost:8000/ -d '["PEAR", 2]' 

8

It works! 

Now let's adjust the price of pears. This workflow mirrors how you would update configuration in production, for example to reload models, adjust weighted percentages, or change hyperparameters.

You can update your Serve applications once they’re in production by updating the settings in your config file and redeploying it using the `serve deploy` command. In the redeployed config file, you can add new deployment settings or remove old deployment settings. This works because `serve deploy` is *idempotent*: your Serve application’s config always matches (or honors) the latest config you deployed successfully, regardless of what config files you deployed before that.

Lightweight config updates modify running deployment replicas without tearing them down and restarting them, so there’s less downtime as the deployments update. For each deployment, modifying `num_replicas`, `autoscaling_config`, and/or `user_config` is considered a lightweight config update, and won’t tear down the replicas for that deployment.
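To make the rule concrete, here is a small sketch that compares two versions of a single deployment entry and reports whether the change touches only the three fields named above. The function names are hypothetical and this is not Serve's internal logic; it also treats a missing key and an explicit `null` as equal, which is an assumption of this example.

```python
from typing import Any, Dict, Set

# Fields the text above names as lightweight for a single deployment.
LIGHTWEIGHT_FIELDS = {"num_replicas", "autoscaling_config", "user_config"}


def changed_fields(old: Dict[str, Any], new: Dict[str, Any]) -> Set[str]:
    """Return the keys whose values differ between two deployment entries."""
    return {key for key in old.keys() | new.keys() if old.get(key) != new.get(key)}


def is_lightweight_update(old: Dict[str, Any], new: Dict[str, Any]) -> bool:
    """True if something changed and every changed field is lightweight."""
    changed = changed_fields(old, new)
    return bool(changed) and changed <= LIGHTWEIGHT_FIELDS


old_entry = {"name": "PearStand", "num_replicas": 1, "user_config": {"price": 4}}
new_entry = {**old_entry, "user_config": {"price": 2}}
print(changed_fields(old_entry, new_entry), is_lightweight_update(old_entry, new_entry))
```

A check like this could run before `serve deploy` in a release script to flag edits that are expected to restart replicas.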

**Let's change the price of pear from 4 to 2 in `prod_examples/config.yaml`**.
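Editing the file by hand is fine for this tutorial. If you would rather script the change, the sketch below updates the `user_config` of a named deployment in an already-loaded config dictionary. `set_user_config` is a hypothetical helper written for this example, and the commented usage assumes PyYAML is installed.

```python
from typing import Any, Dict


def set_user_config(config: Dict[str, Any], deployment: str, **updates: Any) -> Dict[str, Any]:
    """Update `user_config` for the named deployment entry, in place."""
    for entry in config.get("deployments", []):
        if entry.get("name") == deployment:
            # `user_config` may be missing or null in the YAML file.
            user_config = entry.get("user_config") or {}
            user_config.update(updates)
            entry["user_config"] = user_config
            return config
    raise KeyError(f"No deployment named {deployment!r} in config")


# Usage with PyYAML:
#   import yaml
#   with open("prod_examples/config.yaml") as f:
#       config = yaml.safe_load(f)
#   set_user_config(config, "PearStand", price=2)
#   with open("prod_examples/config.yaml", "w") as f:
#       yaml.safe_dump(config, f, sort_keys=False)
```

Note that rewriting the file with `yaml.safe_dump` drops comments, such as the generated header line.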

In [42]:
# Check the price has been updated.
import yaml

with open("prod_examples/config.yaml") as f:
    assert (
        yaml.safe_load(f)["deployments"][2]["user_config"]["price"] == 2
    ), "Make sure to update the user_config field of PearStand deployment"

Now let's call `serve deploy` again to update the configuration live in production. You can also observe the state via the Ray Dashboard.

In [43]:
!serve deploy prod_examples/config.yaml

2022-08-16 00:42:58,707	SUCC scripts.py:180 --
Sent deploy request successfully!
 * Use `serve status` to check deployments' statuses.
 * Use `serve config` to see the running app's config.

In [44]:
!curl http://localhost:8000/ -d '["PEAR", 2]' 

4

Our new configuration is in effect! 

## `serve status` and `serve config`: Inspect current state and target state

The Serve CLI also offers two commands to help you inspect your Serve application in production: `serve config` and `serve status`.

`serve config` gets the latest config file the Ray cluster received. This config file represents the Serve application's goal (or golden) state. The Ray cluster will constantly attempt to reach and maintain this state by deploying deployments, recovering failed replicas, and more.

For example, after deploying a config file (named `fruit_config.yaml` in this snippet):

```console
$ serve deploy fruit_config.yaml
...

$ serve config
import_path: fruit:deployment_graph

runtime_env: {}

deployments:

- name: MangoStand
  num_replicas: 2
  route_prefix: null
...
```

`serve status` gets your Serve application's current status. It's divided into two parts: the `app_status` and the `deployment_statuses`.

The `app_status` contains three fields:
* `status`: a Serve application has four possible statuses:
    * `"NOT_STARTED"`: no application has been deployed on this cluster.
    * `"DEPLOYING"`: the application is currently carrying out a `serve deploy` request. It is deploying new deployments or updating existing ones.
    * `"RUNNING"`: the application is at steady-state. It has finished executing any previous `serve deploy` requests, and it is attempting to maintain the goal state set by the latest `serve deploy` request.
    * `"DEPLOY_FAILED"`: the latest `serve deploy` request has failed.
* `message`: provides context on the current status.
* `deployment_timestamp`: a Unix timestamp of when Serve received the last `serve deploy` request. This is calculated using the `ServeController`'s local clock.

The `deployment_statuses` contains a list of dictionaries representing each deployment's status. Each dictionary has three fields:
* `name`: the deployment's name.
* `status`: a Serve deployment has three possible statuses:
    * `"UPDATING"`: the deployment is updating to meet the goal state set by a previous `deploy` request.
    * `"HEALTHY"`: the deployment is at the latest request's goal state.
    * `"UNHEALTHY"`: the deployment has either failed to update, or it has updated and has become unhealthy afterwards. This may be due to an error in the deployment's constructor, a crashed replica, or a general system or machine error.
* `message`: provides context on the current status.

You can use the `serve status` command to inspect your deployments after they are deployed and throughout their lifetime.
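Because the status has a regular structure, it is easy to check programmatically. The sketch below works on the status as a plain dictionary with the fields described above; the helper names are hypothetical. The commented usage assumes the `serve status` output parses as YAML and that PyYAML is installed.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def unhealthy_deployments(status: Dict[str, Any]) -> List[str]:
    """Names of deployments whose status is anything other than HEALTHY."""
    return [
        d["name"]
        for d in status.get("deployment_statuses", [])
        if d["status"] != "HEALTHY"
    ]


def is_ready(status: Dict[str, Any]) -> bool:
    """True when the app is RUNNING and every deployment is HEALTHY."""
    return status["app_status"]["status"] == "RUNNING" and not unhealthy_deployments(status)


def deployed_at(status: Dict[str, Any]) -> datetime:
    """Convert the Unix `deployment_timestamp` to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(status["app_status"]["deployment_timestamp"], tz=timezone.utc)


# Usage (assumes `serve` is on PATH and its output is valid YAML):
#   import subprocess, yaml
#   out = subprocess.run(["serve", "status"], capture_output=True, text=True, check=True).stdout
#   status = yaml.safe_load(out)
#   print(is_ready(status), unhealthy_deployments(status), deployed_at(status))
```

Since `serve deploy` is asynchronous, polling a check like `is_ready` after each deploy is one way to wait until the new config has taken effect.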

In [46]:
!serve status --help

Usage: serve status [OPTIONS]

  Prints status information about all deployments in the Serve app.

  Deployments may be:

  - HEALTHY: all replicas are acting normally and passing their health checks.

  - UNHEALTHY: at least one replica is not acting normally and may not be
  passing its health check.

  - UPDATING: the deployment is updating.

Options:
  -a, --address TEXT  Address to use to query the Ray dashboard agent
                      (defaults to http://localhost:52365). Can also be
                      specified using the RAY_AGENT_ADDRESS environment
                      variable.
  --help              Show this message and exit.

In [45]:
!serve status

app_status:
  status: RUNNING
  message: ''
  deployment_timestamp: 1660635778.705992
deployment_statuses:
- name: MangoStand
  status: HEALTHY
  message: ''
- name: OrangeStand
  status: HEALTHY
  message: ''
- name: PearStand
  status: HEALTHY
  message: ''
- name: FruitMarket
  status: HEALTHY
  message: ''
- name: DAGDriver
  status: HEALTHY
  message: ''


Try introducing an error, for example in a deployment's `user_config`, and run `serve status` again to see how the error message is reported here.

`serve status` shows the *current* state of the application. You can use `serve config` to observe the desired configuration for the application, as understood by Serve.

In [47]:
!serve config --help

Usage: serve config [OPTIONS]

  Get the current config of the running Serve app.

Options:
  -a, --address TEXT  Address to use to query the Ray dashboard agent
                      (defaults to http://localhost:52365). Can also be
                      specified using the RAY_AGENT_ADDRESS environment
                      variable.
  --help              Show this message and exit.

In [48]:
!serve config

import_path: fruit:deployment_graph
runtime_env:
  working_dir: https://github.com/ray-project/serve_config_examples/archive/HEAD.zip
deployments:
- name: MangoStand
  num_replicas: 1
  user_config:
    price: 3
- name: OrangeStand
  num_replicas: 1
  user_config:
    price: 2
- name: PearStand
  num_replicas: 1
  user_config:
    price: 2
- name: FruitMarket
  num_replicas: 2
- name: DAGDriver
  num_replicas: 1
  route_prefix: /


## Summary of CLI Commands

This tutorial demonstrated how to use Serve's CLI to take your application from development to production. In short:
* Use `serve run` to manually test and improve your deployment graph locally.
* Use `serve build` to create a Serve config file for your deployment graph.
    * Put your deployment graph's code in a remote repository and manually configure the `working_dir` or `py_modules` fields in your Serve config file's `runtime_env` to point to that repository.
* Use `serve deploy` to deploy your graph and its deployments to your Ray cluster. After the deployment is finished, you can start serving traffic from your cluster.
* Use `serve status` to track your Serve application's health and deployment progress.
* Use `serve config` to check the latest config that your Serve application received. This is its goal state.
* Make lightweight configuration updates (e.g. `num_replicas` or `user_config` changes) by modifying your Serve config file and redeploying it with `serve deploy`.

## Bonus: Monitoring with Prometheus Metrics and Grafana

You can use built-in Ray Serve metrics to get a closer look at your application's performance. This is particularly useful in production, where it helps you observe, inspect, and debug your application.

Ray Serve exposes important system metrics like the number of successful and
failed requests through the [Ray metrics monitoring infrastructure](https://docs.ray.io/en/master/ray-observability/ray-metrics.html#ray-metrics). By default, the metrics are exposed in Prometheus format on each node.


Different metrics are collected depending on whether deployments are called via a Python `ServeHandle` or via HTTP. Each metric in the list below is marked accordingly.


The following metrics are exposed by Ray Serve:


   - ``serve_deployment_request_counter`` [**]
     - The number of queries that have been processed in this replica.
   - ``serve_deployment_error_counter`` [**]
     - The number of exceptions that have occurred in the deployment.
   - ``serve_deployment_replica_starts`` [**]
     - The number of times this replica has been restarted due to failure.
   - ``serve_deployment_processing_latency_ms`` [**]
     - The latency for queries to be processed.
   - ``serve_replica_processing_queries`` [**]
     - The current number of queries being processed.
   - ``serve_num_http_requests`` [*]
     - The number of HTTP requests processed.
   - ``serve_num_http_error_requests`` [*]
     - The number of non-200 HTTP responses.
   - ``serve_num_router_requests`` [*]
     - The number of requests processed by the router.
   - ``serve_handle_request_counter`` [**]
     - The number of requests processed by this ServeHandle.
   - ``serve_deployment_queued_queries`` [*]
     - The number of queries for this deployment waiting to be assigned to a replica.
   - ``serve_num_deployment_http_error_requests`` [*]
     - The number of non-200 HTTP responses returned by each deployment.

    [*] - only available when using HTTP calls  
    [**] - only available when using Python `ServeHandle` calls


Let's interact with the system metrics exposed by Serve. You can do that by going to your Anyscale cluster page and clicking "Grafana". 

![metrics-cluster-page](./images/metrics-cluster-page.png)

Once you are in Grafana, you can see the pre-built system metrics dashboard in Anyscale.

![metrics-preset-0](./images/metrics-preset-0.png)

![metrics-preset-1](./images/metrics-preset-0.png)

These metrics cover machine utilization across all Ray nodes in your cluster.

Let's see some Serve metrics!

Click the "Explore" icon on the left side. You should be able to select a data source to interact with; choose "Cortex", which is the metrics aggregation service used here.

Let's define a query to plot the queries per second of each deployment:

![metrics-custom-explore](./images/metrics-custom-explore.png)

You can see the Fibonacci deployment was getting some traffic from our previous load tests! 
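The exact Grafana query from the screenshot is not reproduced here. Conceptually, queries per second is the rate of change of a counter such as `serve_deployment_request_counter`, which only ever increases. The sketch below shows that arithmetic for two counter snapshots; the function name and the sample numbers are made up for illustration, and counter resets (for example after a replica restart) are not handled.

```python
from typing import Dict


def queries_per_second(
    earlier: Dict[str, float],
    later: Dict[str, float],
    interval_s: float,
) -> Dict[str, float]:
    """Per-deployment request rate between two snapshots of a request counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return {
        name: (count - earlier.get(name, 0.0)) / interval_s
        for name, count in later.items()
    }


# Hypothetical counter values sampled 60 seconds apart.
print(queries_per_second({"Fibonacci": 100}, {"Fibonacci": 400}, interval_s=60))
```

A metrics backend computes the same kind of rate over a sliding time window rather than over two hand-picked samples.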

## Exercises

The CLI is built on top of [Serve's REST API](https://docs.ray.io/en/master/serve/rest_api.html). Can you try implementing your own `serve status` and `serve deploy` using this API?



Cheers, we are done! 