Introduction to Offline Reinforcement Learning
---

Reinforcement learning algorithms primarily rely on an online learning approach, which poses a significant challenge to their widespread use. RL typically involves a continuous process of gathering experience by engaging with the environment using the latest policy, and then using this experience to improve the policy. **In many situations, this online interaction is neither practical nor safe because data collection is costly or risky, as in robotics or healthcare.**

**Even in cases where online interaction is viable, there is often a preference to leverage existing data**. This is particularly true in complex domains where substantial datasets are essential for effective generalization.

**Offline Reinforcement Learning (RL), also known as Batch Reinforcement Learning, is a variant of RL that effectively leverages large, previously collected datasets for large-scale real-world applications**. Because the dataset is static, offline RL performs no online interaction or exploration while the agent is being trained, which is also the most significant difference from online reinforcement learning methods.


### What is the problem to be solved?

The offline RL problem can be defined as a data-driven approach to the previously seen online RL methods. As before, the goal is to maximize the expected discounted reward:

$$
\begin{equation}
J (\pi) = \mathbb{E}_{\tau \sim D}  \left[ \sum_{t = 0}^{\infty} \gamma^t r (s_t, a_t) \right]
\tag{1}
\end{equation}
$$

But this is done directly on the dataset $D$:

$$
D = \{(s_0, a_0, r_0), (s_1, a_1, r_1), \ldots, (s_T, a_T, r_T)\} \quad \tag{Dataset}
$$

comprising state/action/reward values. This is in contrast to online RL, where the trajectories $\tau$ are collected through interaction with the environment. **Note that $D$ doesn't necessarily need to be related to the specific task at hand, but it should be sufficiently representative (i.e., containing high-reward regions of the problem). If $D$ is derived from data collected from unrelated tasks, we will need to design our own reward function, just as we did in the online RL setting.**
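As a minimal sketch of equation (1), the snippet below computes the discounted return of each trajectory stored in a dataset and averages them. The trajectory layout (a list of `(state, action, reward)` tuples) and the names `discounted_return` and `average_return` are illustrative assumptions, not part of the workshop library.

```python
from typing import List, Tuple

Transition = Tuple[int, int, float]  # (state, action, reward): assumed layout


def discounted_return(trajectory: List[Transition], gamma: float) -> float:
    """Sum of gamma^t * r_t over one trajectory."""
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(trajectory))


def average_return(dataset: List[List[Transition]], gamma: float) -> float:
    """Monte Carlo estimate of the expected discounted return over the logged trajectories."""
    return sum(discounted_return(traj, gamma) for traj in dataset) / len(dataset)


# Two made-up trajectories with a reward of 1 on the last step.
toy_dataset = [
    [(0, 1, 0.0), (1, 1, 0.0), (2, 1, 1.0)],
    [(0, 3, 0.0), (3, 1, 1.0)],
]
print(average_return(toy_dataset, gamma=0.5))
```

Note that averaging over the logged trajectories estimates the return of the policy that generated the data, not the return of a newly learned policy.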

The dataset can be collected by a suboptimal policy, a random policy, a human expert, or a mixture of them. The policy that generates the data is called the behavior policy (sometimes loosely called the expert policy), denoted as $\pi_b$.

As an example of a behavior policy ($\pi_b$), when training a robot to navigate a room, $\pi_b$ might involve rules like "avoid obstacles," "move forward," or "stop." It could be operated manually by a human who controls the robot based on sensor data, such as lidar for collision detection, or it could be a combination of suboptimal policies.

**The goal of offline RL is to derive an optimal or near-optimal policy directly from $D$, without requiring further interactions with the environment**.

Other important points:

- Offline RL differs from "imitation learning" (discussed later), which is its supervised learning counterpart. **In imitation learning, the policy closely mirrors the behavior policy, while offline RL aims for a superior policy, ideally near the optimal one.**

- Offline RL brings complexity because the offline policy differs in state-action distributions from the behavior policy, leading to **distributional shift challenges**. Similar challenges may arise in behavioral cloning, discussed later.

Some examples where offline RL could be highly beneficial:
    
**Decision Making in Health Care**: In healthcare, we can use Markov decision processes to model the diagnosis and treatment of patients. Actions are interventions like tests and treatments, while observations are patient symptoms and test results. Offline RL is safer and more practical than active RL, as treating patients directly with partially trained policies is risky.

**Learning Robotic Manipulation Skills**: In robotics, we can use active RL for skill learning, but generalizing skills across different environments is challenging. Offline RL allows us to reuse previously collected data from various skills to accelerate learning of new skills. For example, making soup with onions and carrots can build on past experiences with onions and meat or carrots and cucumbers, reducing the need for new data collection.

**Learning Goal-Directed Dialogue Policies**: Dialogue systems, like chatbots helping users make purchases, can be modeled as MDPs. Collecting data by interacting with real users can be costly. Offline data collection from humans is a more practical approach for training effective conversational agents.

**Autonomous Driving**: Training autonomous vehicles in real-world environments can be dangerous and costly. Offline RL can use data from past driving experiences to improve vehicle control and decision-making, making it safer and more efficient.

**Energy Management**: Optimizing energy consumption in buildings or industrial processes can be a critical task. Offline RL can analyze historical energy usage data to develop efficient control strategies, reducing energy costs and environmental impact.

**Finance**: Portfolio management and trading strategies often require learning from historical financial data. Offline RL can help develop and refine investment policies by utilizing past market data.

In summary, the process is clear:

**Phase A**: Collect a dataset, $D$, of state-action pairs through a behavior policy $\pi_b$: e.g. a robot moving randomly (or controlled by a human) in a given space, data collected from an autonomous vehicle, etc. The data doesn't need to come from an expert (typically the case in real situations), and during this phase we are in general not concerned with a specific task (i.e. rewards). In fact, the data could be collected from a robot doing a different task than the one we are interested in. We just want a set of allowed state-action pairs that is usable and representative for the task in mind.

**Phase B**: In this phase we want to solve a given task (so we need to design rewards) only through the provided initial data without any interaction with the environment but still be able to find an optimal or near-optimal policy.
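The two phases can be sketched on a toy problem: Phase A logs reward-free transitions with a random behavior policy, and Phase B attaches a task-specific reward to the logged data afterwards. The 1D chain environment and the helpers `collect_transitions` and `relabel_with_reward` are hypothetical names introduced for illustration only; they are not part of the workshop library.

```python
import random
from typing import Callable, List, Tuple

N_STATES = 5           # states 0..4 on a line (toy assumption)
ACTIONS = (-1, +1)     # move left / move right


def step(state: int, action: int) -> int:
    """Deterministic chain dynamics, clipped at the borders."""
    return min(max(state + action, 0), N_STATES - 1)


def collect_transitions(num_steps: int, seed: int = 0) -> List[Tuple[int, int, int]]:
    """Phase A: log (state, action, next_state) with a uniformly random behavior policy."""
    rng = random.Random(seed)
    state, data = 0, []
    for _ in range(num_steps):
        action = rng.choice(ACTIONS)
        next_state = step(state, action)
        data.append((state, action, next_state))
        state = next_state
    return data


def relabel_with_reward(data, reward_fn: Callable[[int, int, int], float]):
    """Phase B: add a task reward to every logged transition, without touching the environment."""
    return [(s, a, reward_fn(s, a, s2), s2) for s, a, s2 in data]


raw_data = collect_transitions(num_steps=200)
# Task chosen after data collection: reach the right end of the chain.
labeled = relabel_with_reward(raw_data, lambda s, a, s2: 1.0 if s2 == N_STATES - 1 else 0.0)
print(labeled[:3])
```

The same `raw_data` could be relabeled with a different reward function to target another task, which is the sense in which Phase A is task-agnostic.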

Today, we will explore different points in offline Reinforcement Learning and in particular:

    1. The technical challenges in offline RL.
    
    2. Differences between approaches like imitation learning and online vs. offline RL, 
       and what we can adapt from online methods for the offline setting.

       
    3. Standard data collection approaches and libraries used in the offline RL 
       community

For our exercises, we'll primarily use a 2D grid environment for two important reasons:

    1. It allows for quick training and data collection, giving you the 
       possibility to play around in the workshop!
 
    2. Simplifying the environment helps us focus on the fundamental concepts and 
       benefits of offline RL, which can be challenging in high-dimensional spaces.

However, keep in mind that both the provided library and the exercises can be adapted to handle more complex scenarios, so feel free to give them a try!

### References

[Levine et al. 2021 - Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems](https://arxiv.org/pdf/2005.01643.pdf)

[Prudencio et al. 2023 - A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems](https://arxiv.org/pdf/2203.01387.pdf)

## ToDo or not ToDo: 

- Add agenda!!
- Offline BCQ-CQL exercises TODO!!
- Exercise 6a --> improve it!! TODO!!

- Create an enum with: POLICY_FILE = "policy_final.pth"
- Move notebooks to tfl repo --> PUSH ASAP!


- Let's give a look to the structure of the code ...
- Give a look to BC with transformers
- How to measure safety when o.o.d data?
- Add titles to the state-action distribution histograms!
- Check all arguments in tianshou trainings like: update_per_epoch ...?
- Use the code for more complicated training. Use BCQ as expert like 25 and behavioral 
    cloning on the 25 trajectories.
- Fix all references

**NOTEBOOK 2 - PHASE A**: This could be a notebook itself

But let's explore first the much simpler case of imitation learning.

I should introduce behavioral cloning and give an exercise!! --> and show distributional shift in this situation too!!!

Maybe this should be here?:  [example](https://docs.google.com/presentation/d/1-cfO7MNcH6iyN4EwyFjI9F3cVu4hYn5ihCqmZHBeeSk/edit#slide=id.g28654ea4ec6_0_8)



Exercise 1 - 

Create behavior policy and minari dataset. Train IL algorithm. See o.o.d data, etc.

Imitation learning - behavioral cloning - Dagger?

Understand o.o.d in imitation learning:

 0 0 0 0 0 0 0 0
 
 0 0 0 0 0 0 0 0
 
 0 0 1 1 1 1 0 0
 
 0 0 1 1 1 1 0 0
 
 0 0 1 1 1 1 0 0
 
 0 0 1 1 1 1 0 0
 
 0 0 0 0 0 0 0 0
 
 0 0 0 0 0 0 0 0
 
random policy --> save data

Imitation learning will produce actions on the obstacles, so during evaluation the policy could enter the obstacle region, and what happens there could be totally unexpected, increasing the o.o.d. problem.
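A minimal sketch of the o.o.d. issue in imitation learning, assuming a tabular setting with integer states and actions: behavioral cloning is reduced to picking the most frequent action per visited state, and a state that never appears in the data has no cloned action at all. The helper names `fit_behavioral_cloning` and `act` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple


def fit_behavioral_cloning(dataset: List[Tuple[int, int]]) -> Dict[int, int]:
    """Tabular BC: for each state seen in the data, pick the most frequent action."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for state, action in dataset:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


def act(policy: Dict[int, int], state: int) -> Optional[int]:
    """Return the cloned action, or None when the state is out of distribution."""
    return policy.get(state)


# Toy data: states 0..3 were visited, state 4 (e.g. an obstacle cell) never was.
data = [(0, 1), (0, 1), (0, 3), (1, 3), (2, 3), (3, 1)]
bc_policy = fit_behavioral_cloning(data)
print(act(bc_policy, 0))  # most frequent action logged in state 0
print(act(bc_policy, 4))  # None: no data for this state
```

A neural network policy would not return `None` for an unseen state; it would output some action anyway, which is exactly why its behavior inside the obstacle region is unpredictable.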

Explain what happens with a plot.

NOTEBOOK III - Offline RL theory.

My notes: (based on 2106.08909 and reviews)

A - Distributional shift (same as in behavioral cloning, but even worse, as there will also be distributional shift 
even without o.o.d. actions) --> plots and pedagogical explanations (see the exercises later).

B - Exploration errors (Exercise ??): Levine's review does not seem to put much emphasis on this --> I think it could be 
important.

NOTEBOOK IV: Minari datasets (overview Minari plus D4RL) - generate minari dataset - combine datasets --etc

In [None]:
NOTEBOOK VI: More theory about algorithms ???

In summary, the process is simple: 

Phase A - Collect a dataset, $D$, of state-action pairs through a behavior policy (also known as expert policy) $\pi_b$: e.g. a robot moving randomly (or controlled by a human) in a given space, data collected from an autonomous vehicle, etc. The data doesn't need to come from an expert (typically the case in real situations), and during this phase we are in general not concerned with a specific task (i.e. rewards). In fact, the data could be collected from a robot doing a different task than the one we are interested in. We just want a set of allowed state-action pairs that is usable for the task in mind.

Phase B - In this phase we want to solve a given task (so we need to design rewards) only through the provided initial data without any interaction with the environment but still be able to find an optimal or near-optimal policy.

Let's see phase A and B with a simple [example](https://docs.google.com/presentation/d/1-cfO7MNcH6iyN4EwyFjI9F3cVu4hYn5ihCqmZHBeeSk/edit#slide=id.g28654ea4ec6_0_8)

Let's start with phase A.

We have created a set of 2D environments like the one shown before. In all of them the agent starts from position (0,0) and needs to reach position (grid_size-1, grid_size-1). There are two versions: V0 is free of obstacles and V1 has obstacles (see custom_envs_registration.py for more details). Below you can find the code that runs the environment.

In [None]:
#ToDo: Embed the animation in the jupyter cell.
# Press 'q' to quit the animation

ENV_NAME = CustomEnv.Grid_2D_8x8_discrete_V1 #"Grid_2D_5x5_discrete-V1"
NUM_STEPS = 1000
RENDER_MODE = RenderMode.RGB_ARRAY_LIST


register_grid_envs()
env = gym.make(ENV_NAME, render_mode=RENDER_MODE)

BEHAVIOR_POLICY =  BehaviorPolicyType.random

render_custom_policy_simple_grid(
    env_name=ENV_NAME,
    render_mode=RENDER_MODE,
    behavior_policy_name=BEHAVIOR_POLICY,
    num_steps=NUM_STEPS,
)

Exercise I: Create a suboptimal policy for the grid env. and visualize it

**hint**: in the 8x8 grid env. we have 64 states, where every state i=0,..,63 is mapped to (x,y) for x,y=0,..,7. So the suboptimal_policy should be a function that accepts a np.array of shape=(64,) and returns an integer 
between 0 and 3. 


ACTIONS

    0: (-1, 0), #UP
    1: (1, 0),  #DOWN
    2: (0, -1), #LEFT
    3: (0, 1)   #RIGHT
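To make the hint concrete, here is a small sketch of the index-to-coordinate mapping and of how the action table moves the agent. The row-major layout (`index = x * grid_size + y`) and the stay-in-place behavior at the borders are assumptions made for illustration; the workshop's own `env.to_xy` may use a different convention.

```python
from typing import Tuple

GRID_SIZE = 8
# Action table copied from the text: action id -> (dx, dy)
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}


def to_xy(index: int, grid_size: int = GRID_SIZE) -> Tuple[int, int]:
    """Assumed row-major layout: index = x * grid_size + y."""
    return divmod(index, grid_size)


def to_index(x: int, y: int, grid_size: int = GRID_SIZE) -> int:
    """Inverse of to_xy under the same assumed layout."""
    return x * grid_size + y


def move(index: int, action: int, grid_size: int = GRID_SIZE) -> int:
    """Apply an action and stay in place if it would leave the grid (assumed border rule)."""
    x, y = to_xy(index, grid_size)
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < grid_size and 0 <= ny < grid_size:
        return to_index(nx, ny, grid_size)
    return index


print(to_xy(63))   # the goal corner (grid_size-1, grid_size-1)
print(move(0, 1))  # one step DOWN from (0, 0)
```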


In [None]:
from examples.offline_RL_workshop.utils import one_hot_to_integer
import random

import numpy as np

state, _ = env.reset()
state_value = one_hot_to_integer(state)


def suboptimal_policy(state: np.ndarray, env: gym.Env) -> int:
    """Biased random walk: mostly DOWN, then mostly RIGHT once x == 7."""
    state_index = one_hot_to_integer(state)
    state_xy = env.to_xy(state_index)

    # Default: choose among LEFT (2), RIGHT (3) and DOWN (1), favoring DOWN.
    possible_directions = [2, 3, 1]
    weights = [1, 1, 3]

    # At x == 7: choose among UP (0), DOWN (1) and RIGHT (3), favoring RIGHT.
    if state_xy[0] == 7:
        possible_directions = [0, 1, 3]
        weights = [1, 1, 3]

    return random.choices(possible_directions, weights=weights)[0]
  
    
BEHAVIOR_POLICY =  suboptimal_policy
render_custom_policy_simple_grid(
    env_name=ENV_NAME,
    render_mode=RENDER_MODE,
    behavior_policy=BEHAVIOR_POLICY,
    num_steps=NUM_STEPS,
)

With the random policy and the obstacle in the environment we are able to mimic the example introduced before: the obstacle could be a fragile object or a wall and the agent moves randomly. So let's generate the dataset to be used in Phase B (i.e. RL training). 

To handle datasets for offline RL we will use the [minari](https://minari.farama.org/main/) library, which is becoming a standard in the field (previously D4RL, which is migrating to Minari). Minari is a Python API that hosts a collection of popular offline RL datasets used for benchmarking in academia. 

Note: Give a look to Minari library.

a - [Datasets](https://minari.farama.org/datasets/door/)

b - [basic usages](https://minari.farama.org/main/content/basic_usage/)

    Exercise II: Create your own minari dataset following the tutorial and your custom policies.
    Remember to use the DataCollectorV0 wrapper provided in the library. Load the dataset.


**Note: See later D4RL and MINARI dataset rationale and OPE   TODO !!!**

## Todo: Create two datasets solving different goals and combine them. 
Exercise: combine two minari datasets and show how two datasets intended for different tasks can be reused for a new one.

ToDo: Solve error with delete minari dataset ....

In [None]:
#Exercise: Create two datasets with the previous custom behavior policies and export to minari format.

##### Exercise HERE !!!! #####

In [None]:
#Let's create the data with our code (essentially Exercise II with a bit more structure). See how this works.

DATASET_CONFIG = {
    "env_name": ENV_NAME,
    "data_set_name": ENV_NAME + "_data",
    "num_steps": 3000,
    "behavior_policy": BehaviorPolicyType.behavior_suboptimal_2d_grid_discrete,
}

create_minari_datasets_from_dict(DATASET_CONFIG)


# Create Random dataset --> visualize distributions

# Comment about distribution shift -- pedagogically ...

In [None]:
# Load minari data: as we are going to use Tianshou we will load the minari data into a Tianshou ReplayBuffer

data = load_buffer_minari(DATASET_CONFIG["data_set_name"])

# Visualize state-action pair data distribution
state_action_count_data, _ = get_state_action_data_and_policy_grid_distributions(data, env, policy=None)
state_action_histogram(state_action_count_data)

#state_action_histogram(state_action_count_policy)
#compare_state_action_histograms(state_action_count_data, state_action_count_policy)
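For readers without the workshop helpers at hand, the idea behind the state-action histogram can be sketched with plain NumPy. The function `state_action_counts` below is a hypothetical stand-in, not the workshop's `get_state_action_data_and_policy_grid_distributions`, and it assumes states and actions are already integer indices.

```python
import numpy as np


def state_action_counts(states, actions, n_states: int, n_actions: int) -> np.ndarray:
    """Count how often each (state, action) pair appears in the data."""
    counts = np.zeros((n_states, n_actions), dtype=int)
    # np.add.at accumulates correctly even when the same pair appears several times.
    np.add.at(counts, (np.asarray(states), np.asarray(actions)), 1)
    return counts


# Made-up logged indices for an 8x8 grid with 4 actions.
states = [0, 0, 1, 8, 8, 8]
actions = [1, 3, 1, 1, 1, 3]
counts = state_action_counts(states, actions, n_states=64, n_actions=4)
coverage = (counts > 0).mean()  # fraction of (state, action) pairs with at least one sample
print(counts[8], coverage)
```

A low `coverage` value is a quick warning sign that most state-action pairs are unsupported by the data.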


Phase B:

RL policy: Cloning, Dagger:

look at distribution policy vs distribution data

Exercise III: Extract conclusions in teams....


offline SAC, DQN: 

Distribution shift plus ood (training - inference)

Exercise IV: Extract conclusions in teams....

Why is offline RL a difficult problem?


1 - **Not possible to improve exploration**: As the learning algorithm must rely entirely on the static
dataset $D$, there is no possibility of improving exploration: if $D$ does not contain transitions that illustrate high-reward regions for our task, it may be impossible to discover those regions. Exploring beyond the data could cause severe problems, as there could be a good reason why this data is not in our dataset: maybe there is an obstacle that could damage the robot or a fragile object that could be damaged by the robot!

Note that this is opposite to online RL where you explore by interacting with the environment. 

This is why the data collection phase is so important!!

2 - **Distributional shift**: the state-action distribution in $D$ does not accurately represent the state-action distribution of the trained policy. This challenges many existing machine learning methods, which assume that data is independent and identically distributed (i.i.d.). In standard supervised learning, we aim to train a model that performs well on data from the same distribution as the training data. In offline RL, our goal is to learn a policy that behaves differently (hopefully better) than what's seen in the dataset $D$. As a consequence (see later) RL algorithms will tend to generate actions not included in $D$, i.e. **out-of-distribution (o.o.d.) actions**. This could be dangerous, as during inference these actions could bring the system to unexplored states (i.e. not included in $D$).
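The effect of out-of-distribution actions on a value-based method can be sketched with a tiny tabular example. It is an illustration under stated assumptions, not a full algorithm: the dataset is made up, and the overly optimistic value `q[1, 1] = 10.0` simply stands in for whatever a function approximator might output for a pair it was never trained on.

```python
import numpy as np

n_states, n_actions = 3, 2
# Toy dataset of (state, action, reward, next_state); action 1 is never taken in state 1.
dataset = [(0, 0, 0.0, 1), (0, 1, 0.0, 1), (1, 0, 1.0, 2)]

# Boolean mask of the (state, action) pairs supported by the data.
seen = np.zeros((n_states, n_actions), dtype=bool)
for s, a, _, _ in dataset:
    seen[s, a] = True

# Pretend the approximator returns an arbitrary, overly optimistic value for the unseen pair.
q = np.zeros((n_states, n_actions))
q[1, 1] = 10.0

gamma = 0.9
s, a, r, s_next = dataset[0]
naive_target = r + gamma * q[s_next].max()       # max over all actions, including unseen ones
masked_q = np.where(seen[s_next], q[s_next], -np.inf)
in_support_target = r + gamma * masked_q.max()   # max only over actions present in D

print(naive_target, in_support_target)
```

The naive Bellman target propagates the unsupported value back into `q[0, 0]`, and nothing in the static dataset can ever correct it, whereas the in-support target ignores it.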

ToDo: See later some example about it. It could be a 2D grid where the data was collected considering an obstacle, but in inference we use the same grid without obstacles and start from an obstacle zone?

- Theory about ood/extrapolation errors. 

- Theory about typology of offline RL algorithms.

**Important observation**:
$
Q_{\theta^*}(s, a) \approx r(s, a) + \gamma \mathbb{E}_{s' \sim p(\cdot|s,a),\, a' \sim \pi_{\beta}(\cdot|s')} [Q_{\hat{\theta}}(s', a')]
$

This restricts everything to the dataset, but in this way you don't use multi-step dynamic programming (so in reality some tasks could fail). See Levine's IQL paper (https://arxiv.org/pdf/2110.06169.pdf).
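A minimal tabular sketch of this SARSA-style backup, assuming the dataset is stored as `(s, a, r, s', a')` tuples with `a' = None` on terminal transitions. The function name `sarsa_style_evaluation` and the learning-rate schedule are illustrative choices. Because the bootstrap always uses the next action that was actually logged, the result approximates the value of the behavior policy and never queries an unseen action.

```python
import numpy as np


def sarsa_style_evaluation(transitions, n_states, n_actions, gamma=0.9, sweeps=200, lr=0.5):
    """Tabular backup using only (s, a, r, s', a') tuples from the dataset.

    The bootstrap uses the dataset's own next action a', so no out-of-dataset
    action is ever queried. Terminal transitions use a' = None.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next, a_next in transitions:
            bootstrap = 0.0 if a_next is None else q[s_next, a_next]
            q[s, a] += lr * (r + gamma * bootstrap - q[s, a])
    return q


# One logged trajectory: 0 -> 1 -> 2 (terminal), with reward 1 on the last step.
transitions = [(0, 0, 0.0, 1, 0), (1, 0, 1.0, 2, None)]
q = sarsa_style_evaluation(transitions, n_states=3, n_actions=2)
print(q)
```

Entries for pairs that never appear in the data keep their initial value of zero, which shows both the safety of the approach and its limitation: no improvement beyond the logged behavior is attempted here.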


Three main categories:

- I - Policy constraint
- II - Policy Regularization
- III - Importance sampling


I: Policy constraint: 

**Non-implicit or Direct** (e.g. BCQ): Access to $\pi_\beta$. Problem when bad estimation: e.g. when we fit a unimodal policy into multimodal data. In that case, policy constraint methods can fail dramatically.

Another issue with policy constraints is that these methods can often be too pessimistic, which is undesirable. For instance, if we know that all actions in a certain state have zero reward, we should not care about constraining the policy in this state, since forcing the learned policy to be close to the behavior policy in this irrelevant state can inadvertently affect our neural network approximator.

**Implicit** (e.g. AWR): Work directly with data. 

Suppose you have a behavioral policy $\mu$ and you want to find a new, better one $\pi$. What you could do is maximize the difference in expected return:

$\eta(\pi) = J(\pi) - J(\mu)$ .

It can be shown that given two policies $\pi$ and $\mu$ the following general result holds:

$\eta(\pi) = \mathbb{E}_{s \sim d^{\pi}(s)} \mathbb{E}_{a \sim \pi(a|s)} [A^{\mu}(s, a)] = \mathbb{E}_{s \sim d^{\pi}(s)} \mathbb{E}_{a \sim \pi(a|s)} \left[ R^{\mu}_{s,a} - V^{\mu}(s) \right]
$ (See https://arxiv.org/pdf/1502.05477.pdf)

This leads to the constrained policy search problem:

$$
\begin{align*}
\underset{\pi}{\arg\max} \quad & \int_s d^{\mu}(s) \int_a \pi(a|s) \left( R^{\mu}_{s,a} - V^{\mu}(s) \right) \, da \, ds \tag{5}\\
\text{s.t.} \quad & \int_s d^{\mu}(s) \, D_{KL}(\pi(\cdot|s) \,\|\, \mu(\cdot|s)) \, ds \leq \epsilon
\end{align*}
$$

with Lagrangian:

$$
L(\pi, \beta) = \int_s d^{\mu}(s) \int_a \pi(a|s) \left( R^{\mu}_{s,a} - V^{\mu}(s) \right) \, da \, ds + \beta \left( \epsilon - \int_s d^{\mu}(s) \, D_{KL}(\pi(\cdot|s) \,\|\, \mu(\cdot|s)) \, ds \right)
$$

and solving for max of $\pi$:

$
\pi^*(a|s) = \frac{1}{{Z(s)}} \mu(a|s) \exp\left(\frac{1}{\beta} \left(R^{\mu}_{s,a} - V^{\mu}(s)\right)\right)
$
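The closed-form solution above can be evaluated directly in the discrete case. The sketch below computes $\pi^*(\cdot|s)$ for a single state from a behavior distribution and a vector of advantage estimates; the function name `advantage_weighted_policy` and the numbers are made up for illustration, and the behavior probabilities are assumed to be strictly positive.

```python
import numpy as np


def advantage_weighted_policy(mu: np.ndarray, advantages: np.ndarray, beta: float) -> np.ndarray:
    """pi*(a|s) proportional to mu(a|s) * exp(A(s, a) / beta), for a single state."""
    logits = np.log(mu) + advantages / beta
    logits -= logits.max()            # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()    # the sum plays the role of Z(s)


mu = np.array([0.25, 0.25, 0.25, 0.25])        # uniform behavior policy in one state
advantages = np.array([-1.0, 0.0, 0.5, 1.0])   # made-up advantage estimates
for beta in (10.0, 1.0, 0.1):
    print(beta, advantage_weighted_policy(mu, advantages, beta).round(3))
```

A large $\beta$ keeps the new policy close to $\mu$, while a small $\beta$ concentrates probability on the action with the highest advantage, which is how the temperature trades off improvement against staying near the data.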

Let $\rho_{\pi}$ be the (unnormalized) discounted visitation frequencies obtained by following $\pi$:

$\rho_{\pi}(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \ldots$

Then, for any two policies $\pi$ and $\tilde{\pi}$ (here, as in the TRPO paper, $\eta(\pi)$ denotes the expected discounted return of $\pi$):

$$
\begin{align*}
\eta(\tilde{\pi}) &= \eta(\pi) + \sum_{t=0}^{\infty} \sum_{s} P(s_t = s | \tilde{\pi}) \sum_{a} \tilde{\pi}(a|s) \gamma^t A^{\pi}(s, a) \\
&= \eta(\pi) + \sum_{s} \sum_{t=0}^{\infty} \gamma^t P(s_t = s | \tilde{\pi}) \sum_{a} \tilde{\pi}(a|s) A^{\pi}(s, a) \\
&= \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a|s) A^{\pi}(s, a).
\end{align*}
$$

where $R^{\mu}_{s,a}$ is the cumulative reward of a trajectory that starts in state $s$, takes action $a$, and follows the policy $\mu$ from that point on (note that $R^{\mu}_{s,a} - V^{\mu}(s)$ is the usual advantage function $Q(s,a) - V(s)$).
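The identity relating the returns of two policies through discounted visitation frequencies and advantages can be checked numerically on a small random MDP, taking $\eta(\pi)$ to be the expected discounted return of $\pi$. Everything in the sketch (MDP sizes, random seed, helper names) is an arbitrary choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 3, 0.9

P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)   # transition probabilities P[s, a, s']
r = rng.random((n_s, n_a))          # rewards r[s, a]
p0 = np.full(n_s, 1.0 / n_s)        # initial state distribution


def random_policy():
    pi = rng.random((n_s, n_a))
    return pi / pi.sum(axis=1, keepdims=True)


def state_transition(pi):
    """State-to-state transition matrix induced by the policy."""
    return np.einsum("sa,sat->st", pi, P)


def value(pi):
    """Exact V^pi from the linear Bellman equation."""
    r_pi = (pi * r).sum(axis=1)
    return np.linalg.solve(np.eye(n_s) - gamma * state_transition(pi), r_pi)


def expected_return(pi):
    return p0 @ value(pi)


def advantage(pi):
    v = value(pi)
    q = r + gamma * P @ v
    return q - v[:, None]


def discounted_visitation(pi):
    # rho = sum_t gamma^t P(s_t = s), i.e. the solution of (I - gamma * P_pi^T) rho = p0
    return np.linalg.solve(np.eye(n_s) - gamma * state_transition(pi).T, p0)


pi, pi_new = random_policy(), random_policy()
lhs = expected_return(pi_new)
rhs = expected_return(pi) + discounted_visitation(pi_new) @ (pi_new * advantage(pi)).sum(axis=1)
print(lhs, rhs)
```

Both numbers agree up to floating-point error. Note that the right-hand side needs the visitation frequencies of the new policy, which is precisely what is unavailable in the offline setting.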

Because the expectation is taken with respect to $d^{\pi}(s)$, $\eta(\pi)$ is hard to optimize directly. We can approximate it in the same way as in Trust Region Policy Optimization (TRPO), by sampling states from $d^{\mu}(s)$ instead. Remember that $\hat{\eta}(\pi)$ and $\eta(\pi)$ match to first order:

$\hat{\eta}(\pi) = \mathbb{E}_{s \sim d^{\mu}(s)} \mathbb{E}_{a \sim \pi(a|s)} [R^{\mu}_{s,a} - V^{\mu}(s)]
$


In exact policy iteration (for instance, Q-learning), you choose actions deterministically based on the advantage function, i.e. $\tilde{\pi}(s) = \arg\max_a A^{\pi}(s,a)$. This policy improvement technique works as follows:

- If there is at least one state-action pair with a positive advantage value and a nonzero state visitation probability, then exact policy iteration will improve the policy.
- If there are no state-action pairs with a positive advantage value and nonzero state visitation probability, the algorithm has converged to the optimal policy.




We can constrain the learned policy to stay close to the behavior policy through:

\begin{align*}
D_{KL}(\pi(a|s)||\pi_{\beta}(a|s)) \leq \epsilon
\end{align*}

and as shown in (ref. 1) we can bound $D_{KL}(d_{\pi}(s)||d_{\pi_{\beta}}(s))$ by $\delta$, which is $O\left(\frac{\epsilon}{{(1 - \gamma)}^2}\right)$
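For tabular policies the per-state KL constraint can be checked directly. The sketch below is an illustration with made-up numbers; `kl_divergence` is a hypothetical helper that assumes $\pi_{\beta}(a|s) > 0$ wherever $\pi(a|s) > 0$.

```python
import numpy as np


def kl_divergence(pi: np.ndarray, pi_beta: np.ndarray) -> np.ndarray:
    """Per-state D_KL(pi(.|s) || pi_beta(.|s)) for tabular policies of shape (n_states, n_actions).

    Assumes pi_beta > 0 wherever pi > 0; terms with pi = 0 contribute zero.
    """
    ratio = np.where(pi > 0, pi / pi_beta, 1.0)
    return (pi * np.log(ratio)).sum(axis=1)


pi_beta = np.array([[0.5, 0.5], [0.9, 0.1]])   # behavior policy in two states
pi = np.array([[0.5, 0.5], [0.5, 0.5]])        # candidate learned policy
epsilon = 0.1
kl = kl_divergence(pi, pi_beta)
print(kl, kl <= epsilon)
```

In the second state the candidate policy puts far more mass than the behavior policy on the rarely logged action, so the constraint is violated there.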


In a Bellman kind approach the distributional shift only affects actions during training but it could affect states during inference (ToDo: Give an example of this!!!)

Cons: the final performance may not be very good, as the behavior policy, and any policy that is close to it, may be much worse than the best policy that can be learned from the offline data. (This is not exactly what happens in BCQ.)

(Note: BCQ seems to be more about constraining the support than the DKL 'overlap'.)

Type of policy constraint approaches: 

**explicit f-divergence constraints**

**implicit f-divergence constraints**

**integral probability metric (IPM) constraints**

See this with examples!!! : The constraints can be enforced either as direct policy constraints on
the actor update, or via a policy penalty added to the reward function or target Q-value


1 - Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning, pages 1889–1897.

IQL (implicit constraint and regularization):



- Offline RL famous algorithms: Theory (BCQ, IQL, CQL, etc.)

- Apply RL famous algorithms to 2D grid: 
a - full analysis of OOD, 
b - balance in-distribution vs ood -- performance relationship.
c - Analyse different RL typologies to random and suboptimal policies.

- See Mujoco example and others (this should be done in parallel with the 2D grids) : 
  For instance: see suboptimal policy vs trained policy (IL. BCQ, CQL, etc.)

In [None]:
!pdfexport presentation.pdf

See Paine et al. [2020] for a discussion of the active area of
research on hyperparameter tuning for offline RL. We also discuss this further in Appendix C.