# Exercise 04. Introduction to Offline RL

© 2019-2022, Anyscale. All Rights Reserved <br>
📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb) <br>
➡️ [Next notebook](./ex_05_rllib_and_ray_serve.ipynb) <br>
⬅️  [Previous notebook](./ex_03_train_tune_rllib_model.ipynb) <br>

### Learning objectives
In this tutorial, you will learn:
 * [What's offline RL (aka "batch RL")?](#offline_rl)
 * [How to configure RLlib for offline RL](#offline_rl_with_rllib)

In [1]:
# Import required packages.

import gym
from gym.wrappers import RecordVideo
from IPython.display import Video
import os

import ray
# Import the config class of the algorithm we would like to train with: CRR.
from ray.rllib.algorithms.crr import CRRConfig
from ray import tune

print(f"gym: {gym.__version__}")
print(f"ray: {ray.__version__}")

# !ale-import-roms --import-from-pkg atari_py.atari_roms

gym: 0.21.0
ray: 3.0.0.dev0


## What's offline RL (aka "batch RL")? <a class="anchor" id="offline_rl"></a>

So far, we have dealt with a so-called "online" setting for RL, in which we had direct control over a live environment (or a simulator). We were able to send arbitrary actions to this simulator and collect its responses (rewards and observations), thereby learning "as we go". This setup is called "online" RL:

<img src="images/online_rl.png" width="80%"></img>

However, often and especially in real-life industry settings, we are faced with the problem of not having a simulator at hand.
In this case, we need to fall back to offline RL:

<img src="images/offline_rl.png" width="70%"></img>


**Note:** Due to the dynamic nature of adversarial multi-agent scenarios, we will cover the topic
of offline RL here only for the single-agent case.
Research on multi-agent offline RL is bleeding edge and not well explored by RLlib thus far (see references).

### Offline RL comes in two flavours:

#### 1) Pure imitation learning

The agent tries to imitate, as closely as possible, the actions/behavior that it finds in the offline data.
This setup is nothing but supervised learning with a `-log(p)` loss function, where `p` is the probability that the policy assigns to the recorded action.
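
To make the `-log(p)` loss concrete, here is a minimal sketch for a policy with two discrete actions. The discrete setting, the helper names (`softmax`, `behavior_cloning_loss`), and the toy numbers are assumptions made for illustration only; this is not RLlib code, and Pendulum itself uses continuous actions.

```python
import math


def softmax(logits):
    # Turn raw policy outputs into action probabilities (numerically stable).
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def behavior_cloning_loss(batch_logits, recorded_actions):
    # Mean negative log-probability of the actions found in the offline data.
    losses = []
    for logits, action in zip(batch_logits, recorded_actions):
        probs = softmax(logits)
        losses.append(-math.log(probs[action]))
    return sum(losses) / len(losses)


# Hypothetical policy outputs for two recorded timesteps.
policy_logits = [[2.0, 0.0], [0.0, 0.0]]
# Hypothetical actions that the legacy system took at those timesteps.
recorded_actions = [0, 1]

bc_loss = behavior_cloning_loss(policy_logits, recorded_actions)
print(f"behavior cloning loss: {bc_loss:.4f}")
```

The loss shrinks as the policy puts more probability on whatever the recorded behavior did, regardless of whether that behavior was good.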

#### 2) Imitation learning plus improvement over the recorded behavior

The agent will partly imitate the offline, recorded behavior, but also try to improve over it, learning a policy that will
perform better in the actual environment. This is achieved by focusing on those actions within the distribution that seem more 
promising, e.g. via weighting based on the received rewards.
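
One common way to "focus on the more promising actions" is to weight each sample's `-log(p)` term by an exponential of its estimated advantage. The sketch below assumes that form; the `temperature` and `max_weight` parameters and all numbers are hypothetical, and we do not claim that this matches RLlib's CRR loss term by term.

```python
import math


def weighted_imitation_loss(log_probs, advantages, temperature=1.0, max_weight=20.0):
    # Imitation loss in which better-than-average actions count more.
    total = 0.0
    for log_p, advantage in zip(log_probs, advantages):
        # Exponential weight, clipped so that single samples cannot dominate.
        weight = min(math.exp(advantage / temperature), max_weight)
        total += weight * -log_p
    return total / len(log_probs)


# Two recorded actions that the current policy considers equally likely ...
log_probs = [math.log(0.5), math.log(0.5)]
# ... but one turned out better than expected and the other one worse.
advantages = [1.0, -1.0]

loss = weighted_imitation_loss(log_probs, advantages)
weights = [min(math.exp(a), 20.0) for a in advantages]
print(f"weights: {weights}, loss: {loss:.4f}")
```

With all weights fixed to 1.0, this reduces to the pure imitation loss from flavour 1.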

### If we don't have a live environment, how do we know how well our trained policy will perform?

One of the challenges in offline RL is the evaluation of the trained policy. In online RL (when a simulator
is available), one can use the data collected for training to compute episode total rewards. Remember
that observations, actions, rewards, and done flags are all part of this training data. Alternatively,
one could run a separate worker (with the same trained policy) on a fresh evaluation-only environment.
In this latter case, we would also have the chance to switch off any exploratory behavior (e.g. stochasticity used
for better action entropy).
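
As a small illustration of the first option, episode total rewards can be recovered from a flat batch of rewards and done flags. The helper name `episode_returns` and the toy batch are hypothetical; this is a sketch and not how RLlib computes its metrics internally.

```python
def episode_returns(rewards, dones):
    # Sum up rewards per episode; a done flag closes the current episode.
    returns = []
    running = 0.0
    for reward, done in zip(rewards, dones):
        running += reward
        if done:
            returns.append(running)
            running = 0.0
    # A trailing, unfinished episode is deliberately ignored.
    return returns


# Hypothetical training batch containing two complete episodes.
rewards = [-1.0, -2.0, -0.5, -3.0, -1.0]
dones = [False, True, False, False, True]

returns = episode_returns(rewards, dones)
mean_return = sum(returns) / len(returns)
print(returns, mean_return)
```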

In offline RL, no such data from a live environment is available to us. There are two common ways of addressing this dilemma:

1) We deploy the learnt policy/ies into production, or maybe just a portion of our production system (similar to A/B testing), and see what happens.

2) We use a method called "off policy evaluation" (OPE) to compute an estimate of how the new policy would perform if we were to deploy it into a real environment. There are different OPE methods available in RLlib off-the-shelf. In one of the exercises below, we will ask you to look

3) The third option - which we will use here - is kind of cheating and only possible if you actually do have a simulator available (but you only want to use it for evaluation, not for training, because you want to see how cool offline RL is :) )

### Using an (offline) input file with an offline RL algorithm.

We will now pretend that we don't have a simulator for our problem (same recommender system problem as above) available, however, let's assume we possess a lot of pre-recorded, historic data from some legacy (non-RL) system.

Assuming that this legacy system wrote some data into a JSON file (we'll simply use the same JSON file that our SlateQ algo produced above), how can we use this historic data to do RL either way?



### An example offline RL experiment
Modern offline RL algorithms are capable of learning to perfectly play e.g. the Pendulum environment, when only behavioral data from a randomly acting agent is available! We'll explore this right now using RLlib's new CRR algorithm.

The Pendulum-v1 environment looks as follows:
- Continuous actions between -2.0 and 2.0 encode torques that will be applied to the hinge of a freely rotating pole.
- The observations are x- and y- positions as well as the angular velocity.
- The goal is to apply torques to the hinge such that the pendulum balances in an upright position.
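
The following sketch restates the bullet points above in code. The reward formula and its coefficients follow the gym documentation for Pendulum-v1 as we understand it and should be treated as an assumption; the helper names are made up for this example. The two rescaling helpers show what "normalized" actions (between -1.0 and 1.0) mean for this action range, which becomes relevant for the offline data files used further below.

```python
import math

MAX_TORQUE = 2.0  # Action bounds stated above: torques in [-2.0, 2.0].


def observation(theta, theta_dot):
    # x- and y-position of the pole tip on the unit circle, plus angular velocity.
    return [math.cos(theta), math.sin(theta), theta_dot]


def angle_normalize(theta):
    # Wrap an angle into [-pi, pi) so that 0.0 means "upright".
    return ((theta + math.pi) % (2 * math.pi)) - math.pi


def reward(theta, theta_dot, torque):
    # Assumed cost: squared angle error plus small penalties on speed and torque.
    torque = max(-MAX_TORQUE, min(MAX_TORQUE, torque))
    cost = angle_normalize(theta) ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2
    return -cost


def normalize_action(torque):
    # Rescale a torque from [-2.0, 2.0] to [-1.0, 1.0].
    return torque / MAX_TORQUE


def unnormalize_action(normalized):
    # Inverse mapping, from [-1.0, 1.0] back to a torque.
    return normalized * MAX_TORQUE


print(observation(0.0, 0.0))       # pendulum pointing straight up, at rest
print(reward(0.0, 0.0, 0.0))       # best possible per-step reward
print(reward(math.pi, 0.0, 0.0))   # hanging straight down, at rest
```

Under this assumed formula, rewards are never positive, which is why episode rewards for Pendulum are reported as negative numbers.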


We will now pretend that we don't have a simulator for our problem available (the Pendulum-v1 problem), however, let's assume we possess a lot of pre-recorded, historic data from two legacy (non-RL) systems:
- A **beginner system** that only knew how to get to a low episode reward.
- An **expert system** that was able to solve the Pendulum-v1 environment perfectly.


In [None]:
# Learning a decent policy using offline RL requires specialized RL algorithms.
# Examples of offline RL algos are RLlib's "CRR", "MARWIL", or "CQL".
# For this example, we'll use the "Pendulum-v1" environment and have the "CRR"
# (critic regularized regression) algorithm learn how to solve this environment, purely from
# pre-recorded data (the expert file by default, or the beginner file if you switch `paths` below).

# Create a default CRR config:
config = CRRConfig()

# Set it up for the correct environment:
# NOTE: We said above that we wouldn't really have an environment available (so how can
# we set one up here??).
# The following is only to tell the algorithm, which environment our offline data was actually taken from.
config.environment(env="Pendulum-v1")
# If you really really don't have an environment, set `env=None` here and additionally define your action- and
# observation spaces.
# config.environment(env=None, action_space=..., observation_space=...)

#################################################
# This is the most important piece of code 
# in this notebook:
# It explains how to point your 
# algorithm to the correct offline data file
# (instead of a live-environment).
#################################################
config.offline_data(
    input_="dataset",
    input_config={
        # If you feel daring, use the `pendulum_beginner.json` file instead of the expert one.
        # You may then need to train a little longer in order to get a decent policy.
        # But since you have the actual Pendulum environment available for evaluation, you can
        # stop learning as soon as a good episode reward (> -300.0) has been reached.
        "paths": os.path.join(os.getcwd(), "offline_rl_data/pendulum_expert.json"),
        "format": "json",
    },
    # The (continuous) actions in our input files are already normalized
    # (meaning between -1.0 and 1.0) -> We don't have to do anything with them prior to
    # computing losses.
    actions_in_input_normalized=True,
)

# RLlib's CRR is a very new algorithm (since 1.13) and only supports
# the PyTorch framework thus far. We'll provide a tf version in the near future.
config.framework("torch")

# Set up evaluation as follows:
config.evaluation(
    # Run evaluation once per `train()` call.
    evaluation_interval=1,
    # Use separate resources (RLlib rollout workers).
    evaluation_num_workers=2,

    # Run 20 episodes per evaluation (per iteration) -> 10 per eval worker (we have 2 eval workers).
    evaluation_duration=20,
    evaluation_duration_unit="episodes",

    # Use a slightly different config for the evaluation:
    evaluation_config={
        # - Use a real environment (so we can fully trust the evaluation results, rewards, etc.)
        "input": "sampler",
        # - Switch off exploration for better (less stochastic) action computations.
        "explore": False,
    },

    # Run evaluation alternatingly with training (not in parallel).
    evaluation_parallel_to_training=False,
)
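
If no simulator were available for the evaluation workers configured above, off-policy evaluation (option 2 in the list further up) would be the fallback. The sketch below shows ordinary per-episode importance sampling, one of the simplest OPE estimators. The function name, the tuple layout, and all numbers are hypothetical; RLlib ships its own estimators, and this is not their implementation.

```python
def importance_sampling_estimate(episodes, gamma=0.99):
    # Estimate the new policy's expected return from behavior-policy episodes.
    # Each timestep is a tuple: (reward, p_new, p_behavior), where the two
    # probabilities are those of the recorded action under each policy.
    estimates = []
    for episode in episodes:
        weight = 1.0
        discounted_return = 0.0
        for t, (reward, p_new, p_behavior) in enumerate(episode):
            # Re-weight by how much more (or less) likely the new policy
            # is to take the recorded action.
            weight *= p_new / p_behavior
            discounted_return += (gamma ** t) * reward
        estimates.append(weight * discounted_return)
    return sum(estimates) / len(estimates)


# Hypothetical recorded episodes.
episodes = [
    [(1.0, 0.5, 0.5), (1.0, 0.5, 0.5)],
    [(0.0, 0.25, 0.5), (2.0, 0.5, 0.25)],
]

value = importance_sampling_estimate(episodes, gamma=1.0)
print(f"estimated return of the new policy: {value}")
```

The product of probability ratios can become very large or very small for long episodes, which is why more elaborate estimators exist.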


### Summary

**TODO**

### Exercises

#### 1) Finish CRR configuration

Keep configuring our CRR algorithm by calling the config object's `training()` method and passing the following settings into that call:

```
gamma: 0.99
train_batch_size: 1024
target_network_update_freq: 1
tau: 0.0001
weight_type: "exp"
```

In [None]:
# Make the `training()` call on your config here in this cell:
config.training(
    gamma=0.99,
    # <- complete the other arguments to configure our CRR algo
)
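
For intuition on the `tau` setting: it is commonly used as the mixing factor of a soft ("Polyak") target network update, in which the target network moves a small step towards the online network at every update. The sketch below assumes that standard formula and uses plain lists of floats as stand-ins for network weights; it is not taken from RLlib's CRR code.

```python
def soft_update(target_weights, online_weights, tau):
    # Move each target weight a fraction `tau` towards its online counterpart.
    return [
        (1.0 - tau) * target + tau * online
        for target, online in zip(target_weights, online_weights)
    ]


# Hypothetical weights: the online network stays fixed in this toy example.
target = [0.0, 0.0]
online = [1.0, -1.0]

# With `target_network_update_freq=1`, one soft update happens per training step.
for _ in range(1000):
    target = soft_update(target, online, tau=0.0001)

print(target)
```

With a `tau` as small as 0.0001, the target weights have covered only a small part of the distance to the online weights after 1000 updates, which keeps the critic's learning targets stable.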

#### 2) Use `tune.run()` to kick off the experiment

Similar to how we did it in the previous notebook, use `tune.run()` to kick off our offline RL learning experiment.
Let's see how fast CRR can learn to play pendulum from scratch (from beginner's data)!

- As stopping criteria, use `timesteps_total=2000000` and `evaluation/episode_reward_mean=-300`.
- Also, make sure checkpoints are created every iteration (`checkpoint_freq=1`).
- Set the output directory (`local_dir` arg) to "results".
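
As we understand Tune's dict-style stopping criteria, a trial stops as soon as any one of the listed metrics reaches its threshold, and nested result keys are addressed with `/`. The sketch below imitates that behavior in plain Python. The helpers `flatten` and `should_stop` and the two result dicts are hypothetical and only meant to illustrate the semantics.

```python
def flatten(result, prefix=""):
    # Flatten nested result dicts into "outer/inner" keys.
    flat = {}
    for key, value in result.items():
        name = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def should_stop(result, stop_criteria):
    # Stop as soon as ANY metric has reached (>=) its threshold.
    flat = flatten(result)
    return any(
        flat.get(metric, float("-inf")) >= threshold
        for metric, threshold in stop_criteria.items()
    )


stop_criteria = {
    "timesteps_total": 2000000,
    "evaluation/episode_reward_mean": -300,
}

# Hypothetical results reported after two different training iterations.
early = {"timesteps_total": 50000, "evaluation": {"episode_reward_mean": -1200.0}}
solved = {"timesteps_total": 400000, "evaluation": {"episode_reward_mean": -250.0}}

print(should_stop(early, stop_criteria), should_stop(solved, stop_criteria))
```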

In [None]:
# Perform the `tune.run()` call here:
tune.run(
    "CRR",
    # config=...  <- check out the previous notebook on how to use tune.run() with an RLlib config object
    # ...
)

In [None]:
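#TODO: Remove
tune.run(
    "CRR",
    config=config.to_dict(),
    # Stop as soon as either criterion from the exercise text is reached.
    stop={
        "timesteps_total": 2000000,
        "evaluation/episode_reward_mean": -300,
    },
    checkpoint_freq=1,
    checkpoint_at_end=True,
    local_dir="results",
    verbose=1,
)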


#### 3) Let's record our trained algorithm on a live Pendulum environment

Analogous to how episode recording for CartPole was done in a previous notebook, we will now
restore a CRR Algorithm from one of the checkpoints created during the above `tune.run()` experiment (we will choose
a checkpoint that showed good mean rewards on the live evaluation environment).
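
Picking the checkpoint can be sketched as follows. The helper `best_checkpoint`, the trial directory, and the reward numbers are hypothetical; the `checkpoint_000038/checkpoint-38` naming pattern is copied from the path used in the next cell and may differ in other Ray versions.

```python
import os


def best_checkpoint(trial_dir, eval_rewards):
    # Return the checkpoint path of the iteration with the highest eval reward.
    best_iteration = max(eval_rewards, key=eval_rewards.get)
    return os.path.join(
        trial_dir,
        f"checkpoint_{best_iteration:06d}",
        f"checkpoint-{best_iteration}",
    )


# Hypothetical mean evaluation rewards, keyed by training iteration.
eval_rewards = {36: -410.2, 37: -305.7, 38: -187.4}

# Hypothetical trial directory; use the one that `tune.run()` created for you.
path = best_checkpoint("results/CRR/my_trial", eval_rewards)
print(path)
```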

In [None]:
# Build a brand new CRR Algorithm using our existing config.
crr = config.build()
# Override the new CRR's state by restoring from one of our checkpoints.
# Here, we use checkpoint 38, but you should simply pick the one that performed best on the
# evaluation track (using the live environment).
crr.restore("results/CRR/CRR_Pendulum-v1_e47f1_00000_0_2022-07-26_18-35-00/checkpoint_000038/checkpoint-38")

print("CRR Algorithm restored from checkpoint")

Using this restored algorithm, we will record a single episode as follows:

In [None]:
# Wrap a new Pendulum-v1 env with gym's `RecordVideo` wrapper.
env = RecordVideo(gym.make("Pendulum-v1"), "crr_video")
# Reset the env.
obs = env.reset()

# Run a single episode using actions computed by our trained CRR.
while True:
    action = crr.compute_single_action(observation=obs)
    obs, reward, done, _ = env.step(action)
    if done:
        break
        
env.close()

# Play the recorded video.
Video("crr_video/rl-video-episode-0.mp4", width=500)

### References

* [Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems (by Sergey Levine, Aviral Kumar, George Tucker, Justin Fu, 2020)](https://arxiv.org/abs/2005.01643)
* [Batch Reinforcement Learning (by Sascha Lange, Thomas Gabel, Martin Riedmiller, 2012)](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.229.787)

##### Early Work
* [Least-squares policy iteration (by Michail G. Lagoudakis, Ronald Parr, 2003)](http://www.jmlr.org/papers/v4/lagoudakis03a.html)
* [Tree-based batch mode reinforcement learning (by Damien Ernst, Pierre Geurts, Louis Wehenkel, 2005)](https://www.jmlr.org/papers/volume6/ernst05a/ernst05a.pdf)

