## Offline RL

In [9]:
# HIDDEN
import gym
import numpy as np

#### Is this realistic?

- So far we've built a simulation of user behavior
- In some applications, we may be able to build accurate simulations:
  - physics simulations (e.g. robots)
  - games
  - economic/financial simulations?
- However, for user behavior, this is hard

#### Is this realistic? 

- Best would be to deploy RL live, but not practical
- Another possibility: learn from user data?
- We can do this with **offline reinforcement learning**

#### Offline RL

- What is offline RL?
- Recall our RL loop:

![](img/RL-loop-3.png)

#### Offline RL

- In offline RL we don't have an environment to interact with in a feedback loop:

![](img/offline-RL-loop.png)

This historical data was generated by some other, unknown policy (or policies).

Notes:

The data could be generated by real users, or by a different source (a random policy, or an RL agent!).

#### Challenge of offline RL

- Can't answer "what if" questions
- We can only see the results of actions attempted in the dataset

Notes:

This should make us appreciate how valuable it is to actually have an env available, as we have had for all the rest of the course. An env allows us to try anything at no cost other than compute (assuming it's a simulator, not a real-world environment).
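To make the "what if" limitation concrete, here is a minimal sketch using a tiny, made-up log of (state, action, reward) triples. The states, actions, and the helper `logged_outcomes` are hypothetical and only serve as an illustration: we can look up what happened for pairs that were tried, but we get nothing for pairs that never appear in the data.

```python
from collections import Counter

# Hypothetical logged data: (state, action, reward) triples from an unknown policy.
logged = [
    ("s0", 0, 1.0),
    ("s0", 0, 0.8),
    ("s1", 1, 0.2),
    ("s1", 0, 0.5),
]

# Count how often each (state, action) pair appears in the log.
coverage = Counter((s, a) for s, a, _ in logged)

def logged_outcomes(state, action):
    """Return the rewards observed for this pair, or None if it was never tried."""
    rewards = [r for s, a, r in logged if s == state and a == action]
    return rewards or None

print(logged_outcomes("s0", 0))  # [1.0, 0.8]
print(logged_outcomes("s0", 1))  # None: this "what if" cannot be answered from the log
```

With an environment we could simply try action 1 in state `s0`; with only the log, that outcome is unknown.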

#### Types of offline RL

Two main categories:

1. No rewards available: try to _imitate_ the historic policy.
   This boils down to supervised learning of a mapping from observations to actions.

2. Rewards available: try to _improve upon_ the historic policy.
   Use the rewards to improve the policy. This is what we want, ideally!
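The first category (imitation) can be sketched as plain supervised learning. The example below is an illustration on synthetic data, not the method used later in this section: the "historic policy" (pick the index of the larger feature) and all variable names are assumptions made up for the sketch. It fits a logistic regression that maps observations to actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for logged data: 2-D observations and binary actions.
# Assumption: the (hypothetical) historic policy picks the index of the larger feature.
obs = rng.random((500, 2)).astype(np.float32)
actions = np.argmax(obs, axis=1)

# Logistic regression: P(action = 1 | obs) = sigmoid(obs @ w + b).
w = np.zeros(2)
b = 0.0
lr = 1.0
for _ in range(2000):
    logits = obs @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    grad = probs - actions  # gradient of the cross-entropy loss w.r.t. the logits
    w -= lr * (obs.T @ grad) / len(obs)
    b -= lr * grad.mean()

def cloned_policy(observation):
    """Imitated policy: choose action 1 if the model's logit is positive."""
    return int(observation @ w + b > 0)

# Fraction of logged actions that the cloned policy reproduces.
accuracy = np.mean([cloned_policy(o) == a for o, a in zip(obs, actions)])
print(f"agreement with historic policy: {accuracy:.2f}")
```

Note that the rewards never enter this computation: the cloned policy can at best match the historic policy, not improve on it.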

#### Recommender dataset

- Let's explore an offline dataset that we can learn from

In [118]:
# HIDDEN
offline_recsys_file = "data/out1/output-2022-07-14_09-56-03_worker-2_0.json"

import json
rollouts = []
with open(offline_recsys_file, "r") as f:
    for line in f.readlines():
        data = json.loads(line)
        rollouts.append(data)

In [121]:
len(rollouts)

50

In [125]:
rollouts[0].keys()

dict_keys(['type', 'obs', 'new_obs', 'actions', 'prev_actions', 'rewards', 'prev_rewards', 'dones', 'infos', 'eps_id', 'unroll_id', 'agent_index', 't', 'vf_preds', 'action_dist_inputs', 'action_prob', 'action_logp', 'advantages', 'value_targets'])

In [None]:
# HIDDEN
from ray.rllib.utils.compression import unpack, pack

In [133]:
unpack(rollouts[0]["obs"])[:3]

array([[0.6545137 , 0.29728338],
       [0.5238871 , 0.5144319 ],
       [0.6741674 , 0.10163702]], dtype=float32)

In [134]:
unpack(rollouts[0]["new_obs"])[:3]

array([[0.5238871 , 0.5144319 ],
       [0.6741674 , 0.10163702],
       [0.74463004, 0.06117886]], dtype=float32)

In [142]:
rollouts[0]["actions"][:3]

[0, 0, 1]

In [143]:
rollouts[0]["prev_actions"][:3]

[0, 0, 0]

In [136]:
rollouts[0]["rewards"][:3]

[0.6545137166976929, 0.3524414300918579, 0.05838315561413765]

In [138]:
rollouts[0]["dones"][:3]

[False, False, False]

In [141]:
rollouts[0]["action_prob"][:3]

[0.498020201921463, 0.4979872703552246, 0.5016213059425354]
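The fields above line up as one transition per time step. As a small sketch, using only the values printed above for the first three steps of `rollouts[0]` (the variable names `transitions` and `consistent` are introduced here for illustration), we can zip them into (obs, action, reward, new_obs, done) tuples and check that each `new_obs` is the `obs` of the following step:

```python
import numpy as np

# Values copied from the outputs above (first three steps of rollouts[0]).
obs = np.array([[0.6545137, 0.29728338],
                [0.5238871, 0.5144319],
                [0.6741674, 0.10163702]], dtype=np.float32)
new_obs = np.array([[0.5238871, 0.5144319],
                    [0.6741674, 0.10163702],
                    [0.74463004, 0.06117886]], dtype=np.float32)
actions = [0, 0, 1]
rewards = [0.6545137166976929, 0.3524414300918579, 0.05838315561413765]
dones = [False, False, False]

# One (obs, action, reward, new_obs, done) tuple per time step.
transitions = list(zip(obs, actions, rewards, new_obs, dones))

# Within an episode, the next observation of step t is the observation of step t+1.
consistent = all(
    np.array_equal(new_obs[t], obs[t + 1])
    for t in range(len(obs) - 1)
    if not dones[t]
)
print(consistent)  # True
```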

For details, see the RLlib documentation on working with offline data: https://docs.ray.io/en/latest/rllib/rllib-offline.html

In [23]:
from ray.rllib.agents.marwil import BCTrainer

In [24]:
from envs import BasicRecommender

In [50]:
obs = """BCJNGGhAwAYAAAAAAAA6wAYAgIAFlbUGAAAAAAAAjBJudW1weS5jb3JlLm51bWVyaWOUjAtfZnJvbWJ1ZmZlcpSTlCiWQAYAAAAAAABE4ZY+khK3PW97dz8Vpgw/8iMGP95UCz++jmE/XiYpPgcgrT6Icho/fObkPqq7FT/YvOg+8sRkP9dvFT/3EVU/+0/OPnZdAT5HdiY/qkPfPqCwWT+AUOc9ruhMPwNm2Dv6Fn8/NThUP3faUj9673I/Poz3Ph4cZT1IIBs/56YvPxP0Jj+v5Ms+//llPs5maD9FQPw9TJ0OPlqmET9uIos9EPeZPmnm5z7LBBo//P1VP2uqEj/K8fA+z05IPeFdOT0DFA8/adk7PeKZfD9IWN49HpUQPzgm2j32VV8/xKG8PhR/8T1xBTg8PmdNPsCJND/lYOY+SqRHPunAkT7r/I8+apY+P80LWj9iYEY/uo8vP8Bq1Tym3Ck/JtMMPy913z7Qg34/IMJUP9Biiz56y54+DGYeP1LGJz+G1wU+/DDGPl8E+z6j7O095c4zPz0Hbj/gXos+R8IRPtOhMz8K96c+Q7MXPq+47z7qtvg9LG6bPnqOEz1D/sU+Q+oqP4Zqaj9ccQU/Y1WfPnSbrj6MLZ8+kXCiPt2aWz4Gvzs/Hac2P3yuzj3TI08/K9pjPw9Zbj9gdjg/2h4TP6tl6D2VStI+KzxNPhYKVj9YNN8+V9nVPjq0RT81EBo/qstZP+/PCT8bGFE/BSbtPsI3Vj+XXm49KCgQP/T93j1xEA0+EhmgPgLU6D4sJUk/0vN6PxrxUT9Mb1g+BnA2P4rljT4Psm4+23HiPpCO9D4mOKM+F56MPi9aOT/4uAc+7Qk/PuIB4z7Tv28/CnVTP/YKKj4f300/4YDcPpzKPT5HpBs/uBIIP0Z2BT+usYA6OLzBPFmFIT+uU7I+12tyPzg9oz7QGIw+UyIKP3qnUz/H46Q+ePY4PS9mlD5pmf8+T95bP+X7cz/lM4w+KK50Py+tZT/moxo/gwo+Phvjdz9v5WU/Wb85PzREYz6Gtgs/j32wPm9aMz4ytTA/CTjWPlYvKj1Cqf0+aiUBP9oBuD05dx0/N8JdPwQvID9DGkI/nd2qPqnpBz5Tm3A/U09PPzjmZz/RzyE/lgWNPmjyID+6taU+4GzSPqWbrj4RWBU/DfUhPysiSz4VEcc9DJ06PzUVCD/BOHQ/kGtgP4CZYD8jrE8/ey05Pjqtsj5CoYk+P0xqP0S5Fj8G9yw/foIlPuBocj9k0p0+Eb4gPrUWUz4+hcw+FhlBPo6ePz/5pkE/oVluPqambT+r6Bs/xA0/P94Faz4l0uk+gk6rPrE/bD+poKw+5t87P6tfdj9h6VM/y59EPzmjnj5I+GI/AFkQP3jRqj7ANpk+zYzePfbgDD9GqWA+L9mhPt5DST+2YfA+8gEGP3utCz9T8SE/lPx/P+PzmT0RDlY+uF0ZPza5eD5m+Gw/pQ9MPjgSBj/tI1o/yhvYPjOY2D5JTu8+rrOAPeT8Nz5kQbM+2hniPl6e3j5DIg8/eMUqPQbBBT21iQ0/8F6PPE/wcD8SlFI/IlT3PnaJOD+42Gg/ntwKPy3ebT4Nfnc/aMl9P0lCkT6QR+M8sBMmP4ghYT80oCI/ZJPiPoIQSz8sAuo9r0AXP6z0tz4r00Y/zRDdPuYhYT56Kcs+GNtxP3oFJj8ZQUY+tKUVPqlOnz7ql8Q9f/L+PkSecz4rRYk9xPyiPjw2AT6ivWU/iA4MPuvUnD7t8Uo/jyDXPod0oT778Qw/SeCKPVa8FT/ukR8/rjunPigjLj9uFRI/BtEiP2x5uD7Flng/LQxEP1HIZD/oSXE/tB8DPkNiNT9LlcA+VHQxPpcj8z4fbR0/aNqMPd48dT88GfM9/rFqPxKe8z674SI/u01RPviCTz+yK2E+sV84P+185D79Ulw9lqpeP9XNjj66icE+JcsyPOVwqT5+Rto75WAZP2iNdz8
MpMs+WCocPfkhEz/HKU0+NEk1P3GfaT/5PQQ/r3eqPn5nkT5LtRI/QW3HPpt1XT97h/I99l62PthWQj9CXvk+onRSP/PJFT+Y8C0/YAIqP8DDHj3PXEE/4DPAPU+g8T7E9M89f4UJP6LbUj/gN9k9wxR/P+Gwoj6P/hM/M1vhPoeHLz0KAD0/SzekPnwtNj9Wxks/4aNhP4ZPrz2jtUY/pXxdPmnM1z2dzfU+Qr6PPkIlBT8Ducg9lIwFbnVtcHmUjAVkdHlwZZSTlIwCZjSUiYiHlFKUKEsDjAE8lE5OTkr/////Sv////9LAHSUYkvISwKGlIwBQ5R0lFKULgAAAAA=", "new_obs": "BCJNGGhAwAYAAAAAAAA6wAYAgIAFlbUGAAAAAAAAjBJudW1weS5jb3JlLm51bWVyaWOUjAtfZnJvbWJ1ZmZlcpSTlCiWQAYAAAAAAABve3c/FaYMP/IjBj/eVAs/vo5hP14mKT4HIK0+iHIaP3zm5D6quxU/2LzoPvLEZD/XbxU/9xFVP/tPzj52XQE+R3YmP6pD3z6gsFk/gFDnPa7oTD8DZtg7+hZ/PzU4VD932lI/eu9yPz6M9z4eHGU9SCAbP+emLz8T9CY/r+TLPv/5ZT7OZmg/RUD8PUydDj5aphE/biKLPRD3mT5p5uc+ywQaP/z9VT9rqhI/yvHwPs9OSD3hXTk9AxQPP2nZOz3imXw/SFjePR6VED84Jto99lVfP8ShvD4Uf/E9cQU4PD5nTT7AiTQ/5WDmPkqkRz7pwJE+6/yPPmqWPj/NC1o/YmBGP7qPLz/AatU8ptwpPybTDD8vdd8+0IN+PyDCVD/QYos+esuePgxmHj9Sxic/htcFPvwwxj5fBPs+o+ztPeXOMz89B24/4F6LPkfCET7ToTM/CvenPkOzFz6vuO8+6rb4PSxumz56jhM9Q/7FPkPqKj+Gamo/XHEFP2NVnz50m64+jC2fPpFwoj7dmls+Br87Px2nNj98rs490yNPPyvaYz8PWW4/YHY4P9oeEz+rZeg9lUrSPis8TT4WClY/WDTfPlfZ1T46tEU/NRAaP6rLWT/vzwk/GxhRPwUm7T7CN1Y/l15uPSgoED/0/d49cRANPhIZoD4C1Og+LCVJP9Lzej8a8VE/TG9YPgZwNj+K5Y0+D7JuPttx4j6QjvQ+JjijPheejD4vWjk/+LgHPu0JPz7iAeM+079vPwp1Uz/2Cio+H99NP+GA3D6cyj0+R6QbP7gSCD9GdgU/rrGAOji8wTxZhSE/rlOyPtdrcj84PaM+0BiMPlMiCj96p1M/x+OkPnj2OD0vZpQ+aZn/Pk/eWz/l+3M/5TOMPiiudD8vrWU/5qMaP4MKPj4b43c/b+VlP1m/OT80RGM+hrYLP499sD5vWjM+MrUwPwk41j5WLyo9Qqn9PmolAT/aAbg9OXcdPzfCXT8ELyA/QxpCP53dqj6p6Qc+U5twP1NPTz845mc/0c8hP5YFjT5o8iA/urWlPuBs0j4/t7494CXjPg31IT8rIks+FRHHPQydOj81FQg/wTh0P5BrYD+AmWA/I6xPP3stOT46rbI+QqGJPj9Maj9EuRY/BvcsP36CJT7gaHI/ZNKdPhG+ID61FlM+PoXMPhYZQT6Onj8/+aZBP6FZbj6mpm0/q+gbP8QNPz/eBWs+JdLpPoJOqz6xP2w/qaCsPubfOz+rX3Y/YelTP8ufRD85o54+SPhiPwBZED940ao+wDaZPs2M3j324Aw/RqlgPi/ZoT7eQ0k/tmHwPvIBBj97rQs/U/EhP5T8fz/j85k9EQ5WPrhdGT82uXg+ZvhsP6UPTD44EgY/7SNaP8ob2D4zmNg+SU7vPq6zgD3k/Dc+ZEGzPtoZ4j5ent4+QyIPP3jFKj0GwQU9tYkNP/BejzxP8HA/EpRSPyJU9z52iTg/uNhoP57cCj8t3m0+DX53P2jJfT9JQpE+kEfjPLATJj+IIWE/NKAiP2ST4j6C
EEs/LALqPa9AFz+s9Lc+K9NGP80Q3T7mIWE+einLPhjbcT96BSY/GUFGPrSlFT6pTp8+6pfEPX/y/j5EnnM+K0WJPcT8oj48NgE+or1lP4gODD7r1Jw+7fFKP48g1z6HdKE++/EMP0ngij1WvBU/7pEfP647pz4oIy4/bhUSPwbRIj9sebg+xZZ4Py0MRD9RyGQ/6ElxP7QfAz5DYjU/S5XAPlR0MT6XI/M+H20dP2jajD3ePHU/PBnzPf6xaj8SnvM+u+EiP7tNUT74gk8/sithPrFfOD/tfOQ+/VJcPZaqXj/VzY4+uonBPiXLMjzlcKk+fkbaO+VgGT9ojXc/DKTLPlgqHD35IRM/xylNPjRJNT9xn2k/+T0EP693qj5+Z5E+S7USP0Ftxz6bdV0/e4fyPfZetj7YVkI/Ql75PqJ0Uj/zyRU/mPAtP2ACKj/Awx49z1xBP+AzwD1PoPE+xPTPPX+FCT+i21I/4DfZPcMUfz/hsKI+j/4TPzNb4T6Hhy89CgA9P0s3pD58LTY/VsZLP+GjYT+GT689o7VGP6V8XT5pzNc9nc31PkK+jz5CJQU/A7nIPcDohD4uhlo/lIwFbnVtcHmUjAVkdHlwZZSTlIwCZjSUiYiHlFKUKEsDjAE8lE5OTkr/////Sv////9LAHSUYkvISwKGlIwBQ5R0lFKULgAAAAA="""

In [57]:
import ray.utils

In [62]:
unpack(obs).shape

(200, 2)

In [25]:
env_config = {
    "num_candidates" : 2,
    "alpha"          : 0.5,
    "seed"           : 42
}

In [63]:
env = BasicRecommender(env_config)

# trainer_config = {
#     "lr"                    : 0.001,
#     "model"                 : {
#         "fcnet_hiddens"     : [64, 64]
#     },
#     "env_config"            : env_config
# }

# Configuring the BCTrainer:
offline_rl_config = {
    "framework"             : "torch",
    "create_env_on_driver"  : True,
    "seed"                  : 0,
    "env_config" : env_config,

    # Specify your offline RL algo's historic (JSON) inputs:
    "input": [offline_recsys_file],
    "actions_in_input_normalized": True,
    # Note: For non-offline RL algos, this is set to "sampler" by default.
    #"input": "sampler",

    # Since we don't have an environment and the obs/action-spaces are not defined in the JSON file,
    # we need to provide these here manually.
    "env": None,  # default
    "observation_space": env.observation_space,
    "action_space": env.action_space,

    # Perform "off-policy estimation" (OPE) on train batches and report results.
    "input_evaluation": ["is", "wis"],
}
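The `"is"` and `"wis"` entries refer to importance sampling and weighted importance sampling. The sketch below shows the basic idea in a simplified, trajectory-wise form on made-up episodes. It is not RLlib's implementation, which may differ in detail (for example, by weighting per time step). The function names and the episode format are assumptions for illustration. The behavior probabilities play the role that a field like `action_prob` presumably plays in the logged data.

```python
import numpy as np

def episode_return(rewards, gamma=1.0):
    """Discounted sum of rewards for one episode."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

def is_and_wis(episodes, gamma=1.0):
    """Trajectory-wise IS and WIS estimates of the target policy's return.

    episodes: list of dicts with keys 'rewards', 'behavior_prob', 'target_prob',
    where the probabilities are those of the logged actions under each policy.
    """
    # Importance ratio of a trajectory: product over steps of target / behavior.
    ratios = np.array([
        np.prod(np.asarray(ep["target_prob"]) / np.asarray(ep["behavior_prob"]))
        for ep in episodes
    ])
    returns = np.array([episode_return(ep["rewards"], gamma) for ep in episodes])
    is_estimate = float(np.mean(ratios * returns))                   # ordinary IS
    wis_estimate = float(np.sum(ratios * returns) / np.sum(ratios))  # weighted IS
    return is_estimate, wis_estimate

# Two hypothetical episodes logged by a uniform-random behavior policy.
episodes = [
    {"rewards": [1.0, 0.0], "behavior_prob": [0.5, 0.5], "target_prob": [1.0, 1.0]},
    {"rewards": [0.0, 0.0], "behavior_prob": [0.5, 0.5], "target_prob": [0.0, 0.0]},
]
is_estimate, wis_estimate = is_and_wis(episodes)
print(is_estimate, wis_estimate)  # 2.0 1.0
```

In this toy case the ordinary estimate (2.0) exceeds the largest return seen in the data (1.0), while the weighted estimate stays within the range of observed returns. This is the usual trade-off: lower variance for the weighted version at the cost of some bias.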

In [64]:
# Create a behavior cloning (BC) Trainer from our config.
bc_trainer = BCTrainer(config=offline_rl_config)

In [66]:
out = bc_trainer.train()

Try MARWIL or CQL: these are "real" offline RL algorithms rather than imitation learning.

In [69]:
from ray.rllib.agents.marwil import MARWILTrainer

In [72]:
marwil_trainer = MARWILTrainer(config=offline_rl_config)

In [75]:
out = marwil_trainer.train()

In [74]:
# CQL does not support discrete actions, so we cannot use it here:
# from ray.rllib.agents.cql import CQLTrainer

There are tuned examples for CQL.

#### Let's apply what we learned!