# Notebook 05. Using RLlib with Ray Serve to deploy a policy into production

© 2019-2022, Anyscale. All Rights Reserved

### Learning objectives
In this notebook, you will learn:

 * [How to deploy a trained policy into production using Ray Serve](#ray_serve)
 * [How to send requests to a deployed policy via HTTP](#ray_serve_requests)

In [None]:
# import required packages

import gym
import numpy as np
import requests
from requests import Request
import tree  # pip install dm_tree

import ray
from ray import serve
from ray.rllib.algorithms.crr import CRRConfig

if ray.is_initialized():
    ray.shutdown()

print(f"gym: {gym.__version__}")
print(f"ray: {ray.__version__}")

# !ale-import-roms --import-from-pkg atari_py.atari_roms

### Using Ray Serve to deploy a trained policy into production <a class="anchor" id="ray_serve"></a>

<img src="images/rllib_and_ray_serve.png" width=800 />

This is a quick demo of how to take our already trained (offline RL) CRR policy and deploy it using Ray Serve.
All you need to run the following code and produce a Serve deployment is one of the checkpoint files from the previous notebook. Let's take checkpoint number 

In [None]:
@serve.deployment(route_prefix="/serve-deployment")  # , num_replicas=2)
class ServeModel:
    def __init__(self, config, checkpoint) -> None:
        # Create new algo from scratch.
        self.algo = config.build()
        # Restore the algo's state from an already trained one (using a checkpoint).
        self.algo.restore(checkpoint)

    async def __call__(self, request):
        json_input = await request.json()
        # Extract observation from input.
        obs = json_input["observation"]
        # Translate obs back to np.arrays.
        np_obs = np.array(obs)
        action = self.algo.compute_single_action(np_obs, explore=False)
        return {"action": action}
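The request/response contract of `__call__` can be checked without starting Ray Serve. The following is a minimal sketch, not part of the original notebook: `FakeRequest`, `StubAlgo`, and `Handler` are hypothetical stand-ins that mirror only the parsing logic of `ServeModel`. The real class wraps a restored CRR algorithm and converts the observation with `np.array`.

```python
import asyncio


class FakeRequest:
    """Hypothetical stand-in for the HTTP request object passed to `__call__`."""

    def __init__(self, payload):
        self._payload = payload

    async def json(self):
        # Mirrors the awaitable `request.json()` used in the deployment.
        return self._payload


class StubAlgo:
    """Hypothetical stand-in for a restored algorithm; always returns zero torque."""

    def compute_single_action(self, obs, explore=False):
        return [0.0]


class Handler:
    """Mirrors the request handling of `ServeModel`, minus Ray and numpy."""

    def __init__(self, algo):
        self.algo = algo

    async def __call__(self, request):
        json_input = await request.json()
        # Extract the observation from the JSON body.
        obs = json_input["observation"]
        action = self.algo.compute_single_action(obs, explore=False)
        return {"action": list(action)}


handler = Handler(StubAlgo())
response = asyncio.run(handler(FakeRequest({"observation": [1.0, 0.0, 0.0]})))
print(response)
```

The same pattern could be used to unit-test the handler logic before putting it behind a real deployment.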


In the following cell, you will be asked to provide the path to an already existing checkpoint file.

We have created Pendulum-v1 CRR checkpoints in the previous notebook and you should by now be able to navigate to one of these checkpoints using your notebook file browser on the left side:

<img src="images/copy_checkpoint_path.png" width=400>

Once you have copied the path, you can paste it into the `checkpoint = "..."` code below:

In [None]:
# In order to create a deployment from the tagged class above, simply `.bind()`
# the deployment with its constructor arguments:

# Create config to use. Same as before, but lighter:
# 1) No evaluation necessary (model only used for inference).
# 2) No `offline_data` settings necessary (model only used for inference).

config = CRRConfig().environment(env="Pendulum-v1").framework("torch").rollouts(num_rollout_workers=0)

# Pick a solid checkpoint from the previous notebook's CRR experiment:
#checkpoint = "results/CRR/CRR_Pendulum-v1_93a30_00000_0_2022-07-26_23-26-14/checkpoint_000028/checkpoint-28"
checkpoint = "/home/ray/Ray-Tutorial/ray-summit-2022-training/ray-rllib/offline_rl_data/pendulum_checkpoint/checkpoint-2"

serve_model = ServeModel.bind(config, checkpoint)
serve.run(serve_model)
    
# That's it: Deployment created!
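A mistyped checkpoint path would otherwise surface only inside the deployment's constructor, when `restore()` is called. A quick check before calling `.bind()` can give a clearer error. This is a sketch only: the helper name `require_checkpoint` is hypothetical, and the temporary file below merely stands in for a real checkpoint file.

```python
import tempfile
from pathlib import Path


def require_checkpoint(path):
    """Return `path` as a string if it points to an existing file, else raise."""
    candidate = Path(path).expanduser()
    if not candidate.is_file():
        raise FileNotFoundError(f"No checkpoint file at: {candidate}")
    return str(candidate)


# Demonstrate with a temporary file standing in for a checkpoint.
with tempfile.TemporaryDirectory() as tmp_dir:
    fake_checkpoint = Path(tmp_dir) / "checkpoint-2"
    fake_checkpoint.write_bytes(b"")
    found = require_checkpoint(fake_checkpoint)

    try:
        require_checkpoint(Path(tmp_dir) / "missing")
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True

print(found.endswith("checkpoint-2"), missing_detected)
```

In the notebook, the checked value could then be passed on, for example as `ServeModel.bind(config, require_checkpoint(checkpoint))`.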

### Using the deployment for serving actions <a class="anchor" id="ray_serve_requests"></a>

Now let's send action inference requests to the existing deployment.
We'll be using a test `Pendulum-v1` environment here, pretending we are some client that would like to query the server for good Pendulum actions.

In [None]:
# Create an environment so we can step through episodes using requested, served actions.
env = gym.make("Pendulum-v1")
# Get the initial observation.
obs = env.reset()

# Request 3 actions of an episode from the served policy and step through the env using the received actions.
for _ in range(3):
    # Convert numpy array to list (needed for http transfer).
    obs = obs.tolist()

    print(f"-> Sending observation {obs}")
    resp = requests.get(
        "http://localhost:8000/serve-deployment", json={"observation": obs}
    )
    # Convert the response to JSON.
    # The received JSON should include an "action" field (see our ServeModel class for details).
    response_json = resp.json()
    print(f"<- got {response_json}")

    # Convert to numpy array.
    action = np.array(response_json["action"])

    # Perform a step with the served action.
    obs, _, _, _ = env.step(action)
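The loop above relies on two conventions: the observation must be JSON-serializable before it is sent, and the reply must carry an `action` field. A small wrapper can make both explicit. This is a sketch under assumptions: `query_action` and `fake_send` are hypothetical names, and `send` stands in for whatever performs the HTTP call (for example, a function that wraps `requests.get(...)` and returns `resp.json()`).

```python
import json


def query_action(obs, send):
    """Send one observation and return the served action.

    `send` is any callable that takes a JSON string and returns the
    decoded reply as a dict (hypothetical interface for this sketch).
    """
    # Plain Python floats are JSON-serializable; numpy arrays and
    # float32 scalars are not, hence the explicit conversion.
    payload = {"observation": [float(x) for x in obs]}
    reply = send(json.dumps(payload))
    if "action" not in reply:
        raise KeyError(f"Reply has no 'action' field: {reply}")
    return reply["action"]


def fake_send(body):
    """Stand-in server: answers with a made-up action derived from the observation."""
    obs = json.loads(body)["observation"]
    return {"action": [-obs[2]]}


action = query_action([1.0, 0.0, 0.5], fake_send)
print(action)
```

Keeping the transport behind a callable makes it possible to test the client logic without a running deployment.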


### Summary

In this notebook, we have learnt:

* How to create a Ray Serve "deployment" using a custom `@serve.deployment`-tagged class
* How to query actions from this custom deployment

### Exercise 05

#### a) Run a complete episode using our Policy deployment

Using the code in the cell above, run a complete episode through a Pendulum environment, then report the episode's total reward at the end.

In [None]:
# Create a new environment using gym.make():
# ...

# Get the initial observation via the env's `reset()` method.
# ...

total_reward = 0.0

# Loop through an entire episode using a while loop.
# while True:

    # Remember to convert all np-arrays to lists prior to sending them (needed for http transfer).
    # obs = obs.tolist()

    # Send the action request using `resp = requests.get([address], json={"observation": [current observation]})`.
    # ...

    # Convert response to JSON and extract the "action" field, then convert the action to a numpy array.
    # ...

    # Perform an env step with the served action (using the env's `step([some action])` method).
    # obs, reward, done, info = env.step(...)
    
    # Add up reward.
    
    # Check done flag; if True -> break out of while loop.

    
print(f"Played one episode entirely using served actions; Total episode reward is {total_reward}.")

#### b) What would happen if we had a "stateful" policy?

Think about the problem of having trained a neural network that uses one or more memory-capable (stateful) layers, like an LSTM layer.
Given the problem of distributed deployment (across several endpoints accessible through different addresses/ports), what do you think would be the best solution for keeping or passing around the layer's state (e.g. the LSTM's internal c- and h- states)?

* Should the server side (Ray Serve deployment) keep these tensors in between action requests?
* Or should the clients handle these states and pass them back with each new request?
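One way to explore the trade-off is a toy simulation. The sketch below is an illustration under stated assumptions, not an RLlib or Ray Serve API: `policy_step` is a made-up recurrent policy whose state is a running sum of observations, and the two replica classes are hypothetical. Requests are routed round-robin across two replicas to mimic a distributed deployment.

```python
def policy_step(obs, state):
    """Toy recurrent policy (hypothetical): the state is a running sum of observations."""
    new_state = state + obs
    action = -new_state
    return action, new_state


class StatelessReplica:
    """The client sends the state with each request and receives the updated state."""

    def handle(self, payload):
        action, new_state = policy_step(payload["observation"], payload["state"])
        return {"action": action, "state": new_state}


class StatefulReplica:
    """The replica keeps the state in between requests."""

    def __init__(self):
        self.state = 0.0

    def handle(self, payload):
        action, self.state = policy_step(payload["observation"], self.state)
        return {"action": action}


observations = [1.0, 2.0, 3.0, 4.0]

# Client-held state: any replica can serve any request.
replicas = [StatelessReplica(), StatelessReplica()]
state = 0.0
client_held_actions = []
for i, obs in enumerate(observations):
    reply = replicas[i % 2].handle({"observation": obs, "state": state})
    state = reply["state"]
    client_held_actions.append(reply["action"])

# Server-held state: each replica only sees every other observation.
replicas = [StatefulReplica(), StatefulReplica()]
server_held_actions = [
    replicas[i % 2].handle({"observation": obs})["action"]
    for i, obs in enumerate(observations)
]

print(client_held_actions)
print(server_held_actions)
```

In this toy setup, the client-held variant reproduces the single-process result, while the server-held variant diverges because each replica accumulates only part of the episode. Server-held state would need requests from one episode to always reach the same replica; client-held state costs extra payload per request.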

### References
 * [Ray Serve: Scalable and Programmable Serving](https://docs.ray.io/en/latest/serve/index.html)

➡️ [Link to previous notebook](./ex_04_offline_rl_with_rllib.ipynb) <br>
➡️ [Link to next notebook](./ex_06_rllib_in_recsys_overview.ipynb) <br>

📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb)