# Notebook 05. Using RLlib with Ray Serve to deploy a policy into production

© 2019-2022, Anyscale. All Rights Reserved

### Learning objectives
In this notebook, you will learn:

 * [How to deploy a trained policy into production using Ray Serve](#ray_serve)
 * [How to send requests to a deployed policy via HTTP](#ray_serve_requests)

In [2]:
# import required packages

import gym
from gym.spaces import Discrete, MultiDiscrete
import numpy as np
import requests
from requests import Request
import tree  # pip install dm_tree

import ray
from ray import serve
from ray.rllib.algorithms.crr import CRRConfig

print(f"gym: {gym.__version__}")
print(f"ray: {ray.__version__}")

# !ale-import-roms --import-from-pkg atari_py.atari_roms

gym: 0.21.0
ray: 3.0.0.dev0


### Using Ray Serve to deploy a trained policy into production <a class="anchor" id="ray_serve"></a>

<img src="images/rllib_and_ray_serve.png" width=800 />

This is a quick demo of how to take our CRR policy, already trained with offline RL, and deploy it using Ray Serve.
All you need to run the following code and produce a Serve deployment is one of the checkpoint files from the previous notebook. Let's take checkpoint number 

In [3]:
# Call `serve.start()` to start Ray Serve before creating any deployments.
serve.start()


@serve.deployment(route_prefix="/serve-deployment")
class ServeModel:
    def __init__(self, config, checkpoint) -> None:
        # Create new algo from scratch.
        self.algo = config.build()
        # Restore state of algo to an already trained one (using a checkpoint).
        self.algo.restore(checkpoint)

    async def __call__(self, request):
        json_input = await request.json()
        # Extract observation from input.
        obs = json_input["observation"]
        # Translate obs back to np.arrays.
        np_obs = np.array(obs)
        action = self.algo.compute_single_action(np_obs, explore=False)
        return {"action": action}


2022-07-28 12:32:41,219	INFO services.py:1477 -- View the Ray dashboard at http://127.0.0.1:8268
(ServeController pid=56782) INFO 2022-07-28 12:32:44,662 controller 56782 checkpoint_path.py:17 - Using RayInternalKVStore for controller checkpoint and recovery.
(ServeController pid=56782) INFO 2022-07-28 12:32:44,665 controller 56782 http_state.py:115 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:vdkxiM:SERVE_PROXY_ACTOR-node:127.0.0.1-0' on node 'node:127.0.0.1-0' listening on '127.0.0.1:8000'

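The request-handling logic of `ServeModel.__call__` can be tried out without Ray Serve or a checkpoint. The following is only a minimal sketch under stated assumptions: `FakeRequest`, `FakeAlgo`, and `LocalModel` are hypothetical stand-ins invented for illustration, not part of Ray or RLlib, and the fake policy rule is arbitrary.

```python
import asyncio

import numpy as np


class FakeRequest:
    """Hypothetical stand-in for the HTTP request object passed to `__call__`."""

    def __init__(self, payload):
        self._payload = payload

    async def json(self):
        # Mimics the awaitable `request.json()` used in `ServeModel`.
        return self._payload


class FakeAlgo:
    """Hypothetical stand-in for a restored algorithm (no checkpoint involved)."""

    def compute_single_action(self, obs, explore=False):
        # Arbitrary toy rule: push against the angular velocity, clipped to [-2, 2].
        return np.clip(-obs[2:3], -2.0, 2.0)


class LocalModel:
    """Same handling steps as `ServeModel`, minus the Serve decorator."""

    def __init__(self, algo):
        self.algo = algo

    async def __call__(self, request):
        json_input = await request.json()
        # Translate the observation list back into a numpy array.
        np_obs = np.array(json_input["observation"])
        action = self.algo.compute_single_action(np_obs, explore=False)
        # Convert to a plain list so the result is JSON-serializable as is.
        return {"action": action.tolist()}


model = LocalModel(FakeAlgo())
result = asyncio.run(model(FakeRequest({"observation": [0.78, -0.62, 0.5]})))
print(result)
```

Inside a notebook, which already runs an event loop, `await model(...)` directly instead of calling `asyncio.run`.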

In the following cell, you will be asked to provide the path to an already existing checkpoint file.

We have created Pendulum-v1 CRR checkpoints in the previous notebook and you should by now be able to navigate to one of these checkpoints using your notebook file browser on the left side:

<img src="images/copy_checkpoint_path.png" width=400>

Once you have copied the path, you can paste it into the `checkpoint = "..."` code below:

In [4]:
# In order to create a deployment from the tagged class above, simply call its `deploy()`
# method and pass this method the `ServeModel` constructor arguments:

# Create config to use. Same as before, but lighter:
# 1) No evaluation necessary (model only used for inference).
# 2) No `offline_data` settings necessary (model only used for inference).

config = CRRConfig().environment(env="Pendulum-v1")
config.framework("torch")

# Pick a solid checkpoint from the previous notebook's CRR experiment:
checkpoint = "results/CRR/CRR_Pendulum-v1_93a30_00000_0_2022-07-26_23-26-14/checkpoint_000028/checkpoint-28"

ServeModel.deploy(config, checkpoint)
    
# That's it: Deployment created!

(HTTPProxyActor pid=56783) INFO:     Started server process [56783]
(ServeController pid=56782) INFO 2022-07-28 12:32:46,454 controller 56782 deployment_state.py:1280 - Adding 1 replicas to deployment 'ServeModel'.
(ServeModel pid=56784) 2022-07-28 12:32:53,768	INFO algorithm.py:332 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(ServeModel pid=56784) 2022-07-28 12:33:01,384	INFO trainable.py:654 -- Restored on 127.0.0.1 from checkpoint: results/CRR/CRR_Pendulum-v1_93a30_00000_0_2022-07-26_23-26-14/checkpoint_000028
(ServeModel pid=56784) 2022-07-28 12:33:01,384	INFO trainable.py:663 -- Current state after restoring: {'_iteration': 28, '_timesteps_total': None, '_time_total': 372.1818161010742, '_episodes_total': 0}


### Using the deployment for serving actions <a class="anchor" id="ray_serve_requests"></a>

Now let's send action inference requests to the existing deployment.
We'll be using a test `Pendulum-v1` environment here, pretending we are some client that would like to query the server for good Pendulum actions.
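Observations travel to the deployment as JSON, so each numpy array has to become a plain list on the client and be rebuilt on the server. The snippet below is a small sketch of that round trip using only `json` and `numpy`; the observation values are made up for illustration, and the `float32` dtype is an assumption about the environment's observation type.

```python
import json

import numpy as np

# Made-up observation in Pendulum-v1's layout: [cos(theta), sin(theta), angular velocity].
obs = np.array([0.78, -0.62, 0.0], dtype=np.float32)

# Client side: numpy arrays are not JSON-serializable, so convert to a list first.
payload = json.dumps({"observation": obs.tolist()})

# Server side: parse the JSON body and rebuild the array (as `ServeModel.__call__` does).
received = np.array(json.loads(payload)["observation"])

print(received.shape, np.allclose(received, obs))
```

The rebuilt array defaults to `float64`, which is harmless here because the values themselves are unchanged by the round trip.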

In [6]:
# Create an environment so we can step through episodes using requested, served actions.
env = gym.make("Pendulum-v1")
# Get the initial observation.
obs = env.reset()

# Request 5 actions of an episode from the served policy and step through the env using the received actions.
for _ in range(5):
    # Convert numpy array to list (needed for http transfer).
    obs = obs.tolist()

    print(f"-> Sending observation {obs}")
    resp = requests.get(
        "http://localhost:8000/serve-deployment", json={"observation": obs}
    )
    # Convert the response to JSON.
    # The received JSON should include an "action" field (see our ServeModel class for details).
    response_json = resp.json()
    print(f"<- got {response_json}")

    # Convert to numpy array.
    action = np.array(response_json["action"])

    # Perform a step with the served action.
    obs, _, _, _ = env.step(action)


-> Sending observation [0.7845883369445801, -0.6200169920921326, 0.0003206760447937995]
<- got {'action': [1.449544906616211]}
-> Sending observation [0.7768633365631104, -0.6296692490577698, -0.24726034700870514]
<- got {'action': [1.591677188873291]}
-> Sending observation [0.7615043520927429, -0.6481598019599915, -0.48076072335243225]
<- got {'action': [1.6422221660614014]}
-> Sending observation [0.737663745880127, -0.6751682758331299, -0.7205472588539124]
<- got {'action': [1.6735374927520752]}
-> Sending observation [0.7038542032241821, -0.7103444337844849, -0.9758928418159485]
<- got {'action': [1.555694580078125]}


(HTTPProxyActor pid=56783) INFO 2022-07-28 12:33:26,716 http_proxy 127.0.0.1 http_proxy.py:316 - GET /in-game-recommendations 200 3.9ms
(HTTPProxyActor pid=56783) INFO 2022-07-28 12:33:26,724 http_proxy 127.0.0.1 http_proxy.py:316 - GET /in-game-recommendations 200 3.9ms
(HTTPProxyActor pid=56783) INFO 2022-07-28 12:33:26,733 http_proxy 127.0.0.1 http_proxy.py:316 - GET /in-game-recommendations 200 4.5ms
(HTTPProxyActor pid=56783) INFO 2022-07-28 12:33:26,741 http_proxy 127.0.0.1 http_proxy.py:316 - GET /in-game-recommendations 200 3.7ms
(HTTPProxyActor pid=56783) INFO 2022-07-28 12:33:26,749 http_proxy 127.0.0.1 http_proxy.py:316 - GET /in-game-recommendations 200 3.8ms
(ServeModel pid=56784) INFO 2022-07-28 12:33:26,714 ServeModel ServeModel#DlCdxY replica.py:467 - HANDLE __call__ OK 1.2ms
(ServeModel pid=56784) INFO 2022-07-28 12:33:26,723 ServeModel ServeModel#DlCdxY replica.py:467 - HANDLE _

### Summary

In this notebook, we have learnt:

* How to create a Ray Serve "deployment" using a custom `@serve.deployment`-tagged class
* How to query actions from this custom deployment

### Exercises

#### 1) Run a complete episode using our Policy deployment

Using the code in the cell above, run a complete episode through a Pendulum environment, then report the episode's total reward at the end.

In [None]:
# Create a new environment using gym.make():
# ...

# Get the initial observation via the env's `reset()` method.
# ...

total_reward = 0.0

# Loop through an entire episode using a while loop.
# while True:

    # Remember to convert all numpy arrays to lists before sending them (needed for http transfer).
    # obs = obs.tolist()

    # Send the action request using `resp = requests.get([address], json={"observation": [current observation]})`.
    # ...

    # Convert response to JSON and extract the "action" field, then convert the action to a numpy array.
    # ...

    # Perform an env step with the served action (using the env's `step([some action])` method).
    # obs, reward, done, info = env.step(...)
    
    # Add up reward.
    
    # Check done flag; if True -> break out of while loop.

    
print(f"Played one episode entirely using served actions; Total episode reward is {total_reward}.")

#### 2) What would happen if we had a "stateful" policy?

Think about the problem of having trained a neural network that uses one or more memory-capable (stateful) layers, like an LSTM layer.
Given the problem of distributed deployment (across several endpoints accessible through different addresses/ports), what do you think would be the best solution for keeping or passing around the layer's state (e.g. the LSTM's internal c- and h- states)?

* Should the server side (Ray Serve deployment) keep these tensors in between action requests?
* Or should the clients handle these states and pass them back with each new request?
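To make the second option concrete, here is a toy sketch in which the client carries the state and the handler keeps nothing between calls. It is one possible shape of that option, not a recommendation between the two: `toy_recurrent_policy`, `handle_request`, and the JSON field names `state` and `state_out` are hypothetical, and the moving-average "memory" merely stands in for real LSTM c- and h-states.

```python
import json

import numpy as np


def toy_recurrent_policy(obs, state):
    """Hypothetical stand-in for a stateful model: returns (action, next_state)."""
    # An exponential moving average plays the role of the network's memory.
    next_state = 0.9 * state + 0.1 * obs
    action = np.tanh(next_state.sum(keepdims=True))
    return action, next_state


def handle_request(body: str) -> str:
    """Stateless handler: all memory arrives in, and leaves with, the JSON payload."""
    data = json.loads(body)
    obs = np.array(data["observation"])
    state = np.array(data["state"])
    action, next_state = toy_recurrent_policy(obs, state)
    return json.dumps({"action": action.tolist(), "state_out": next_state.tolist()})


# Client side: start from a zero state and feed each returned state into the next request.
state = [0.0, 0.0, 0.0]
for obs in ([1.0, 0.0, 0.0], [0.5, 0.5, 0.0]):
    reply = json.loads(handle_request(json.dumps({"observation": obs, "state": state})))
    state = reply["state_out"]

print(state)
```

Because the handler holds no memory, any replica could answer any request in this sketch; the cost is a larger payload on every call.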

## References
* [Ray Serve: Scalable and Programmable Serving](https://docs.ray.io/en/latest/serve/index.html)

➡️ [Link to previous notebook](./ex_04_offline_rl_with_rllib.ipynb) <br>
➡️ [Link to next notebook](./ex_06_rllib_end_to_end_demo.ipynb) <br>

📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb)