## Offline RL

In [9]:
# HIDDEN
import gym
import numpy as np

#### Is this realistic?

- So far we've built a simulation of user behavior
- In some applications, we may be able to build accurate simulations:
  - physics simulations (e.g. robots)
  - games
  - economic/financial simulations?
- However, for user behavior, this is hard

#### Is this realistic? 

- Ideally we would deploy RL live with real users, but this is often not practical
- Another possibility: learn from user data?
- We can do this with **offline reinforcement learning**

#### Offline RL

- What is offline RL?
- Recall our RL loop:

![](img/RL-loop-3.png)

#### Offline RL

- In offline RL we don't have an environment to interact with in a feedback loop:

![](img/offline-RL-loop.png)

This historical data was generated by some other, unknown policy or policies.

Notes:

It could be generated by real users, or by a different source (a random policy, or an RL agent!)

#### Challenge of offline RL

- Can't answer "what if" questions
- We can only see the results of actions attempted in the dataset

Notes:

Perhaps this makes us appreciate how valuable/awesome it is to actually have an env available, which we have had for all the rest of the course. It allows us to try anything with no cost except computational cost (assuming it's a simulator, not a real world environment). 

#### Types of offline RL

Two main categories:

1. No rewards available: try to _imitate_ historic policy

This boils down to supervised learning of observations -> actions
 
2. Rewards available: try to _improve upon_ historic policy

Use rewards to improve policy. This is what we want, ideally!
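
To make the first category concrete, here is a minimal sketch of imitation as supervised learning. It is an illustration only, not the RLlib implementation: the dataset is synthetic, and the "historic policy" (always pick the candidate with the larger observation value) is a hypothetical stand-in. The names `predict_probs`, `W` and `b` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "historic" data (assumption: 2 candidates, values in [0, 1]).
obs = rng.uniform(0.0, 1.0, size=(500, 2))
# Hypothetical historic policy: always recommend the larger-valued candidate.
actions = np.argmax(obs, axis=1)

def predict_probs(obs, W, b):
    """Softmax classifier mapping observations to action probabilities."""
    logits = obs @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Fit the classifier by gradient descent on the cross-entropy loss.
W = np.zeros((2, 2))
b = np.zeros(2)
one_hot = np.eye(2)[actions]
learning_rate = 1.0
for _ in range(2000):
    probs = predict_probs(obs, W, b)
    grad_logits = (probs - one_hot) / len(obs)
    W -= learning_rate * obs.T @ grad_logits
    b -= learning_rate * grad_logits.sum(axis=0)

# Fraction of dataset steps where the cloned policy matches the historic one.
accuracy = np.mean(np.argmax(predict_probs(obs, W, b), axis=1) == actions)
```

Note that no rewards appear anywhere in this sketch: the learned policy can at best match the historic one, which is why the second category is the more interesting case.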

#### Recommender dataset

- Let's explore an offline dataset that we can learn from.
- We'll need a bit of code to read all the JSON objects in the file:

Notes:

The file is in the format that RLlib learns from.

In [188]:
import json

json_dataset_file = "data/offline/recommender_offline.json"

rollouts = []
with open(json_dataset_file, "r") as f:
    for line in f:
        data = json.loads(line)
        rollouts.append(data)

In [164]:
len(rollouts)

50

We have 50 "rollouts" of data.

#### Recommender dataset

Each rollout is a dict containing info about a sequence of time steps:

In [166]:
from ray.rllib.utils.compression import unpack

obs = unpack(rollouts[0]["obs"])
obs.shape

(200, 2)

- We have 200 time steps' worth of data in each rollout
- Each row is an observation:

Notes:

This number 200 is set by the "rollout_fragment_length" algorithm config parameter.

In [167]:
obs[:3]

array([[0.6545137 , 0.29728338],
       [0.5238871 , 0.5144319 ],
       [0.6741674 , 0.10163702]], dtype=float32)

- Here are the first 3 observations
- We can see `num_candidates` was set to 2, since each observation has 2 entries

#### Recommender dataset

We can also look at the first 3 actions, rewards, dones:

In [168]:
rollouts[0]["actions"][:3]

[0, 0, 1]

In [169]:
rollouts[0]["rewards"][:3]

[0.6545137166976929, 0.3524414300918579, 0.05838315561413765]

In [170]:
rollouts[0]["dones"][:3]

[False, False, False]

Notes:

So, first the agent saw the observation [0.65, 0.297] from the previous slide, then it took action 0, got a reward of 0.65, and the episode was not done.

There is more information stored in the dataset than just the above, but these are the key points.
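
Before training, it can help to get a rough overview of the whole dataset. The sketch below assumes each rollout is a dict with plain-list `"actions"`, `"rewards"` and `"dones"` entries, as shown above. The helper name `summarize_rollouts` and the toy data are hypothetical, with values rounded or made up for illustration.

```python
from collections import Counter

def summarize_rollouts(rollouts):
    """Aggregate simple statistics over a list of rollout dicts."""
    action_counts = Counter()
    total_reward = 0.0
    num_steps = 0
    num_episode_ends = 0
    for rollout in rollouts:
        action_counts.update(rollout["actions"])
        total_reward += sum(rollout["rewards"])
        num_steps += len(rollout["rewards"])
        num_episode_ends += sum(bool(d) for d in rollout["dones"])
    return {
        "num_steps": num_steps,
        "action_counts": dict(action_counts),
        "mean_reward": total_reward / num_steps if num_steps else float("nan"),
        "num_episode_ends": num_episode_ends,
    }

# Toy data in the same shape as the real rollouts (values are illustrative).
toy_rollouts = [
    {"actions": [0, 0, 1], "rewards": [0.65, 0.35, 0.06], "dones": [False, False, False]},
    {"actions": [1, 0], "rewards": [0.5, 0.2], "dones": [False, True]},
]
summary = summarize_rollouts(toy_rollouts)
```

On the real data, one would call `summarize_rollouts(rollouts)`. A heavily skewed action distribution would suggest that some actions are rarely tried in the dataset, which limits what offline RL can learn about them.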

#### Offline RL training

- Lots of info on offline RL with RLlib [here](https://docs.ray.io/en/latest/rllib/rllib-offline.html)
- First we need our trainer config.

In [207]:
num_candidates = 2

offline_trainer_config = {
    # These should look familiar:
    "framework"             : "torch",
    "create_env_on_driver"  : True,
    "seed"                  : 0,
    "model"                 : {
        "fcnet_hiddens"     : [64, 64]
    },
    
    # These are new for offline RL:
    "input": [json_dataset_file],
    "observation_space": gym.spaces.Box(low=0, high=1, shape=(num_candidates,)),
    "action_space": gym.spaces.Discrete(num_candidates),
}

Notes:

- The config items at the top should look familiar. In the second half, things are a bit different:
  - We need to give it the path to the dataset file
  - Because there is no env, we need to manually specify the observation and action spaces
- We don't have an environment config because there is no environment!

#### Training

- For offline RL we can't use `PPO`: it is an on-policy algorithm that expects fresh experience collected by the current policy.
- We'll use the `MARWIL` algorithm that is included with RLlib.

In [194]:
from ray.rllib.agents.marwil import MARWILTrainer

In [195]:
trainer = MARWILTrainer(config=offline_trainer_config)

In [205]:
for i in range(10):
    out = trainer.train()

Ok, we apparently did some training. How do we evaluate without a simulator?

In [206]:
out

{'episode_reward_max': nan,
 'episode_reward_min': nan,
 'episode_reward_mean': nan,
 'episode_len_mean': nan,
 'episode_media': {},
 'episodes_this_iter': 0,
 'policy_reward_min': {},
 'policy_reward_max': {},
 'policy_reward_mean': {},
 'custom_metrics': {},
 'hist_stats': {'episode_reward': [], 'episode_lengths': []},
 'sampler_perf': {},
 'off_policy_estimator': {'is': {'V_prev': 16.231008984326884,
   'V_step_IS': 16.923409830694006,
   'V_gain_est': 1.0409941273679744},
  'wis': {'V_prev': 16.231008984326884,
   'V_step_WIS': 16.741032437886666,
   'V_gain_est': 1.0298169594045956}},
 'num_healthy_workers': 0,
 'timesteps_total': 3200,
 'timesteps_this_iter': 2000,
 'agent_timesteps_total': 3200,
 'timers': {'sample_time_ms': 1640.838,
  'sample_throughput': 1218.89,
  'learn_time_ms': 3.779,
  'learn_throughput': 529192.957},
 'info': {'learner': {'default_policy': {'learner_stats': {'allreduce_latency': 0.0,
     'policy_loss': 1.916463851928711,
     'total_loss': 54.555114746
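
The `off_policy_estimator` entry in the results above reports `is` and `wis` estimates, which stand for importance sampling and weighted importance sampling. The sketch below shows the basic idea of step-wise importance sampling for a single episode. It is a simplified illustration and may differ in detail from what RLlib computes internally. The function name `stepwise_is_estimate` and its arguments are hypothetical, and it assumes we know the probability that each policy assigned to every logged action.

```python
import numpy as np

def stepwise_is_estimate(rewards, behavior_probs, target_probs, gamma=0.99):
    """Estimate the return of a new (target) policy from one logged episode.

    behavior_probs[t]: probability the historic policy gave to the logged action.
    target_probs[t]:   probability the new policy gives to that same action.
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    # Cumulative importance weight up to and including step t.
    cumulative_ratios = np.cumprod(ratios)
    discounts = gamma ** np.arange(len(rewards))
    # Discounted return actually observed under the historic policy.
    v_behavior = float(np.sum(discounts * rewards))
    # Rewards re-weighted by how much more (or less) likely the new policy
    # was to produce the logged actions.
    v_target = float(np.sum(discounts * cumulative_ratios * rewards))
    return v_behavior, v_target

# Toy episode: the new policy favours the logged actions more than the old one.
v_prev, v_new = stepwise_is_estimate(
    rewards=[1.0, 1.0],
    behavior_probs=[0.5, 0.5],
    target_probs=[0.8, 0.6],
    gamma=1.0,
)
```

If the new policy assigns the same probabilities as the historic one, all weights are 1 and the two values coincide. The estimate can only re-weight actions that appear in the dataset, which is the "what if" limitation discussed earlier.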

#### Let's apply what we learned!