# Exercise 04. (Take-home) Advanced Topic: Introduction to Offline RL and Serving your RLlib Model using Ray Serve API

© 2019-2022, Anyscale. All Rights Reserved

### Learning objectives
In this tutorial, you will learn:
 * [What's offline RL (aka "batch RL")?](#offline_rl)
 * [How to configure RLlib for offline RL](#offline_rl_with_rllib)

## What's offline RL (aka "batch RL")? <a class="anchor" id="offline_rl"></a>

So far, we have dealt with the so-called "online" RL setting, in which we had direct control over a live environment (or a simulator). We were able to send arbitrary actions to this environment and collect its responses (rewards and observations), thereby learning "as we go":

<img src="images/online_rl.png" width="80%"></img>

However, especially in real-life industry settings, we often don't have a simulator at hand and can only learn from previously recorded data.
In this case, we need to fall back to offline RL:

<img src="images/offline_rl.png" width="70%"></img>


**Note:** Due to the dynamic nature of adversarial multi-agent scenarios, we will cover the topic of
offline RL here only for the single-agent case.
Research on multi-agent offline RL is bleeding edge and not well explored by RLlib thus far (see references).

### Offline RL comes in two flavours:

#### Pure imitation learning

The agent tries to imitate, as closely as possible, the actions/behavior that it finds in the offline data.
This setup is nothing but supervised learning with a `-log(p)` loss function, where `p` is the probability that the policy assigns to the recorded action.
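
To make the `-log(p)` loss concrete, here is a minimal sketch for a policy with discrete actions. The function names and the example numbers are made up for illustration; this is not RLlib code, only the idea behind the loss.

```python
import math


def softmax(logits):
    """Turn raw policy scores into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def imitation_loss(batch_logits, recorded_actions):
    """Mean -log(p) of the recorded actions under the current policy."""
    losses = []
    for logits, action in zip(batch_logits, recorded_actions):
        probs = softmax(logits)
        losses.append(-math.log(probs[action]))
    return sum(losses) / len(losses)


# Hypothetical policy outputs for two observations with three discrete actions each.
batch_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# The actions that were recorded in the (hypothetical) offline data.
recorded_actions = [0, 2]

print(imitation_loss(batch_logits, recorded_actions))
```

The loss shrinks as the policy puts more probability on the recorded actions, which is exactly what "imitating the data" means here.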

#### Imitation learning plus improvement over the recorded behavior

The agent will partly imitate the offline, recorded behavior, but also try to improve over it, learning a policy that will
perform better in the actual environment. This is achieved by focusing on those actions within the distribution that seem more 
promising, e.g. via weighting based on the received rewards.
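
One way to picture this weighting is to scale each sample's `-log(p)` term by how promising the recorded action looks. The sketch below assumes we already have an "advantage" estimate per sample (how much better than average the action was); the helper names, the `temperature` and `max_weight` parameters, and the numbers are illustrative assumptions, not RLlib's implementation.

```python
import math


def exp_weight(advantage, temperature=1.0, max_weight=20.0):
    """Exponential weighting: better actions count more, capped for stability."""
    return min(math.exp(advantage / temperature), max_weight)


def binary_weight(advantage):
    """Binary weighting: only imitate actions that look better than average."""
    return 1.0 if advantage > 0.0 else 0.0


def weighted_imitation_loss(log_probs, advantages, weight_fn):
    """Mean of weight * -log(p) over all recorded samples."""
    terms = [weight_fn(adv) * -lp for lp, adv in zip(log_probs, advantages)]
    return sum(terms) / len(terms)


# Hypothetical log-probabilities of two recorded actions and their advantages.
log_probs = [-1.0, -1.0]
advantages = [1.0, -1.0]

print(weighted_imitation_loss(log_probs, advantages, binary_weight))
print(weighted_imitation_loss(log_probs, advantages, exp_weight))
```

With all weights set to 1.0, this reduces to the pure imitation loss from the previous section.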


### Example
In fact, modern offline RL algorithms are capable of learning to play e.g. the Pendulum environment very well, even when only behavioral data from a randomly acting agent is available! We'll explore this right now using RLlib's new CRR algorithm.


In [None]:
from IPython.display import Image
Image(url="images/pendulum.gif", width=300)

In [2]:
import os

# Learning a decent policy using offline RL requires specialized RL algorithms.
# Examples of offline RL algos are RLlib's "CRR", "MARWIL", or "CQL".
# For this example, we'll use the "Pendulum-v1" environment and have the "CRR"
# (critic regularized regression) algorithm learn how to solve this environment, purely from
# data recorded from a random/beginner agent.


# Import the config class of the algorithm we would like to train with: CRR.
from ray.rllib.algorithms.crr import CRRConfig

# Create a default CRR config:
config = CRRConfig()

# Set it up for the correct environment:
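# NOTE: We said above that we wouldn't really have an environment available, so why
# do we set one up here? The following only tells the algorithm which observation
# and action spaces to expect. Training itself reads from the offline data file
# (only the evaluation workers configured further below use the live environment).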
config.environment(env="Pendulum-v1")

#################################################
# This is the most important piece of code 
# in this notebook:
# It explains how to point your 
# algorithm to the correct offline data file
# (instead of a live-environment).
#################################################
config.offline_data(
    input_="dataset",
    input_config={
        "paths": os.path.join(os.getcwd(), "offline_rl_data/pendulum_replay_v1.1.0.zip"),
        "format": "json",
    },
    # The (continuous) actions in our input files are already normalized
    # (meaning between -1.0 and 1.0) -> We don't have to do anything with them prior to
    # computing losses.
    actions_in_input_normalized=True,
)

# RLlib's CRR is a very new algorithm (since 1.13) and only supports
# the PyTorch framework thus far. We'll provide a tf version in the near future.
config.framework("torch")

# Set up evaluation as follows:
config.evaluation(
    # Run evaluation once per `train()` call (by default, RLlib will evaluate 10 episodes).
    evaluation_interval=1,
    # Use a separate resource ("RLlib rollout worker")
    evaluation_num_workers=1,
    # Run evaluation parallel to training.
    evaluation_parallel_to_training=True,
    # Use a slightly different config for the evaluation:
    evaluation_config={
        # - Use a real environment (so we can fully trust the evaluation results, rewards, etc.).
        "input": "sampler",
        # - Switch off exploration for better (less stochastic) action computations.
        "explore": False,
    },
)


<ray.rllib.algorithms.crr.crr.CRRConfig at 0x7fbf3b52eee0>
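
The `actions_in_input_normalized=True` flag above says that the recorded actions already lie between -1.0 and 1.0. Pendulum-v1 itself expects torques between -2.0 and 2.0, so a linear mapping between the two ranges is needed at some point. Here is a minimal sketch of such a mapping; the helper names are hypothetical and this is not RLlib's own code.

```python
def unnormalize_action(action, low, high):
    """Map an action from [-1, 1] back to the environment's [low, high] range."""
    return low + (action + 1.0) * (high - low) / 2.0


def normalize_action(action, low, high):
    """Map an action from the environment's [low, high] range to [-1, 1]."""
    return 2.0 * (action - low) / (high - low) - 1.0


# Torque bounds of the Pendulum-v1 action space.
LOW, HIGH = -2.0, 2.0

print(unnormalize_action(0.5, LOW, HIGH))   # normalized -> environment torque
print(normalize_action(-2.0, LOW, HIGH))    # environment torque -> normalized
```

If your own recorded actions are stored in the environment's original range instead, the flag would need to be `False` so that they are not treated as already normalized.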

### Exercises

#### 1) Finish CRR configuration

Keep configuring our CRR algorithm by calling the config object's `training()` method and passing the following settings into that call:

```
gamma: 0.99
train_batch_size: 1024
target_network_update_freq: 1
tau: 0.0001
weight_type: "exp"
```

In [None]:
# Make the `training()` call on your config here in this cell:
config.training(
    gamma=0.99,
    # <- complete the other arguments to configure our CRR algo
)

<ray.rllib.algorithms.crr.crr.CRRConfig at 0x7fbf3b52eee0>
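
To get a feeling for the `tau` setting from the exercise above, here is a small sketch of a "soft" target network update. It assumes that `tau` plays its usual role as an interpolation (Polyak) coefficient between the target weights and the online weights; the function and the plain-list "weights" are illustrative only.

```python
def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction `tau` towards the online weight."""
    return [
        tau * online + (1.0 - tau) * target
        for target, online in zip(target_weights, online_weights)
    ]


# Hypothetical weights of a tiny network.
target = [0.0, 0.0]
online = [1.0, -1.0]

# With tau=0.0001, the target barely moves per update ...
print(soft_update(target, online, tau=0.0001))
# ... whereas tau=1.0 would copy the online weights outright.
print(soft_update(target, online, tau=1.0))
```

Under this assumption, a small `tau` combined with an update on every step gives a slowly but steadily moving target.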

#### 2) Use `tune.run()` to kick off the experiment

Similar to how we did it in the previous notebook, use `tune.run()` to kick off our offline RL learning experiment.
Let's see how fast CRR can learn to play Pendulum from scratch (from beginner's data)!

- As stopping criteria, use `timesteps_total=2000000` and `evaluation/episode_reward_mean=-300`.
- Also, make sure checkpoints are created every iteration (`checkpoint_freq=1`).
- Set the output directory (`local_dir` arg) to "results".
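
A dict of stopping criteria ends the trial as soon as any one of the listed metrics reaches its threshold, and nested result entries are addressed with a `/` in the key. The sketch below mimics that behavior in plain Python so you can see what the two criteria above mean; `flatten` and `should_stop` are hypothetical helpers, not Tune functions, and the result dicts are made up.

```python
def flatten(result, prefix=""):
    """Flatten nested result dicts into "outer/inner" keys."""
    flat = {}
    for key, value in result.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "/"))
        else:
            flat[name] = value
    return flat


def should_stop(result, stop):
    """Return True if any metric in `stop` has reached its threshold."""
    flat = flatten(result)
    return any(
        key in flat and flat[key] >= threshold
        for key, threshold in stop.items()
    )


stop = {
    "timesteps_total": 2000000,
    "evaluation/episode_reward_mean": -300,
}

# Hypothetical results early and late in training.
early = {"timesteps_total": 50000, "evaluation": {"episode_reward_mean": -1200.0}}
late = {"timesteps_total": 400000, "evaluation": {"episode_reward_mean": -250.0}}

print(should_stop(early, stop))
print(should_stop(late, stop))
```

Note that Pendulum rewards are negative, so "reaching" -300 means the evaluation reward has risen to -300 or above.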

In [9]:
# Perform the `tune.run()` call here:

from ray import tune

tune.run(
    "CRR",
    # config=...  <- check out the previous notebook on how to use tune.run() with an RLlib config object
    # ...
)



In [27]:
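# Solution: perform the `tune.run()` call with the settings listed above.
from ray import tune


tune.run(
    "CRR",
    config=config.to_dict(),
    # Stop as soon as either of the two criteria is reached.
    stop={
        "timesteps_total": 2000000,
        "evaluation/episode_reward_mean": -300,
    },
    # Create a checkpoint every iteration (and one at the very end).
    checkpoint_freq=1,
    checkpoint_at_end=True,
    # Write results and checkpoints into the "results" directory.
    local_dir="results",
    verbose=1,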
)


(CRR pid=40772) 2022-07-26 12:02:45,086 INFO algorithm.py:332 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(CRR pid=40772) Checking /Users/sven/Dropbox/Projects/ray-summit-2022-training/ray-rllib/offline_rl_data/pendulum_replay_v1.1.0.zip ...
(CRR pid=40772) fpath=/Users/sven/Dropbox/Projects/ray-summit-2022-training/ray-rllib/offline_rl_data/pendulum_replay_v1.1.0.zip ...
(CRR pid=40772) [dataset]: Run `pip install tqdm` to enable progress reporting.
(RolloutWorker pid=40791) DatasetReader 2 has 24875, samples.
(RolloutWorker pid=40793) DatasetReader 4 has 24875, samples.
(RolloutWorker pid=40792) DatasetReader 3 has 24875, samples.
(RolloutWorker pid=40790) DatasetReader 1 has 24875, samples.
(CRR pid=40772) 2022-07-26 12:03:10,608 INFO trainable.py:160 -- Trainable.setup took 25.543 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(CRR pid=40772)   torch.from_numpy(self.action_space.low).to(target_a_next),


KeyboardInterrupt: 

#### 3) Let's record our trained algorithm on a live Pendulum environment

In [4]:
import gym
from gym.wrappers import RecordVideo
from IPython.display import Video

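# Wrap a live Pendulum environment so that the episode is recorded into "crr_video".
env = RecordVideo(gym.make("Pendulum-v1"), "crr_video")
obs = env.reset()

# Build a fresh CRR algorithm from our config and load the trained state.
# NOTE: Adjust this path to a checkpoint created by your own `tune.run()` call.
crr = config.build()
crr.restore("results/CRR/CRR_Pendulum-v1_11bef_00000_0_2022-07-26_12-02-34/checkpoint_000018/checkpoint-18")

# Play a single episode with the trained policy.
while True:
    a = crr.compute_single_action(observation=obs)
    obs, reward, done, _ = env.step(a)
    # Is the episode `done`? -> Quit.
    if done:
        break

# Closing the env finalizes the video file.
env.close()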

Video("crr_video/rl-video-episode-0.mp4", width=500)

  logger.warn(
Checking /Users/sven/Dropbox/Projects/ray-summit-2022-training/ray-rllib/offline_rl_data/pendulum_replay_v1.1.0.zip ...
fpath=/Users/sven/Dropbox/Projects/ray-summit-2022-training/ray-rllib/offline_rl_data/pendulum_replay_v1.1.0.zip ...
2022-07-26 12:27:54,074 INFO services.py:1477 -- View the Ray dashboard at http://127.0.0.1:8266
[dataset]: Run `pip install tqdm` to enable progress reporting.
(RolloutWorker pid=41351) DatasetReader 3 has 24875, samples.
(RolloutWorker pid=41349) DatasetReader 1 has 24875, samples.
(RolloutWorker pid=41352) DatasetReader 4 has 24875, samples.
(RolloutWorker pid=41350) DatasetReader 2 has 24875, samples.
2022-07-26 12:28:14,442 INFO trainable.py:160 -- Trainable.setup took 23.308 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2022-07-26 12:28:14,463 INFO trainable.py:654 -- Restored on 127.0.0.1 from checkpoint: results/CRR/CRR_Pendulum-v1_11bef_00000_0_2022-07-26_12-02-34/checkpoint_000018
2022-07-26 12:28:14,464 INFO trainable.py:663 -- Current state after restoring: {'_iteration': 18, '_timesteps_total': None, '_time_total': 181.0962312221527, '_episodes_total': 0}


### References

* 