## Offline RL challenges

The idea of offline RL is simple: find the best possible policy from a given dataset. Unlike in imitation learning, the dataset doesn't need to be collected by experts, and it doesn't even need to come from the same task, although in that case it should still be representative of the problem at hand. Offline RL therefore covers many realistic scenarios and, as we saw before, can be applied to data coming from many different sources.


The key distinction between online RL and offline RL lies in their exploration capabilities. In online RL, we actively explore the state-action space, while offline RL operates solely within a fixed dataset (denoted as $D$). Going "out of distribution" beyond this dataset in offline RL can lead to severe issues.

<img src="Images/offline_RL.jpg" alt="offline_rl" style="height:400px;">

Online RL involves interactive exploration to discover the highest-reward regions by gathering environmental feedback. In contrast, offline RL imposes strict limitations on exploration beyond the dataset. This constraint results in algorithms overestimating unknown areas and attempting to navigate beyond the dataset, as illustrated in the figure below where the dataset doesn't fully represent high-reward regions.

<img src="Images/offline_RL_3.jpg" alt="offline_rl_1" style="height:400px;">

As shown on the right of the figure above, once you are out of distribution (o.o.d.) (states $s$ and $s'$ in red in the figure), it is hard to come back to $D$: you don't receive any feedback, so the o.o.d. errors propagate. As we will see, this is one of the main challenges of offline RL, and there are different techniques to mitigate this behavior.

The o.o.d. issues are not the only distributional shift effect in offline RL.
The learned policy typically operates within a subset of the original dataset distribution, which creates a distinct form of distributional shift (subset $D'$ in green in the figure below). Evaluating a policy that is substantially different from the behavior policy reduces the effective sample size (from $D$ to $D'$), which increases the variance of the estimates. In simpler terms, the few remaining data points may not accurately represent the true data distribution.

<img src="Images/offline_RL_2.jpg" alt="offline_rl_2" style="height:400px;">


Can we apply techniques from online RL, known for its effectiveness in solving complex problems, to offline RL? Yes, we can. We can adapt concepts from online RL, particularly off-policy RL algorithms.

In both offline and off-policy RL, we train a policy using a batch of data from a replay buffer. The key difference is that in online RL, this data is constantly updated with new data generated by an improving policy through exploration. In offline RL, we cannot change our replay buffer; we must use the same dataset throughout training. This fundamental distinction in data collection and utilization presents both challenges and opportunities in offline RL:



**Challenges**: ..... ?????


**Opportunities**: ..... ????

**EXERCISE I**: Distributional shift due to out of distribution points.

Let's take a look at the exercise in notebook 6a. We will apply a simple off-policy algorithm, Deep Q-Network (DQN), replacing the environment we typically use in online settings with a ReplayBuffer of state-action pairs that we will collect with a suboptimal policy, as we did in the previous notebooks.

**Important**: In contrast to imitation learning, we now need a reward function, since DQN is trained on the TD error, i.e.:

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] $$
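To make the TD update concrete, here is a minimal tabular sketch (not the DQN used in the notebook) that applies the update repeatedly to a fixed dataset. The toy transitions, the action names and the values of `alpha` and `gamma` are assumptions chosen only for illustration.

```python
from collections import defaultdict


def td_update(Q, transition, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning TD update in place and return the TD error."""
    s, a, r, s_next, done = transition
    # The max runs over ALL actions, including those never seen in the data.
    bootstrap = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    td_error = r + gamma * bootstrap - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical dataset of (s, a, r, s', done) tuples from a behavior policy.
dataset = [
    (0, "right", -0.1, 1, False),
    (1, "right", 0.0, 2, True),
]
actions = ["left", "right"]
Q = defaultdict(float)  # every Q-value starts at 0

# Offline training: sweep the same fixed data again and again.
for _ in range(50):
    for transition in dataset:
        td_update(Q, transition, actions)
```

Note that the bootstrap term reads `Q[(1, "left")]`, a pair that never appears in the dataset. Here it stays at its initial value, but with a function approximator its value would be arbitrary.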

We will see how easily the DQN goes out of distribution. 


In this particular example: 

A - We are close to the highest reward region.

B - I should include movements also in two rows -> see if still out of distribution!!


Why is DQN going out of distribution? The reason is more general than DQN: it is a pathology that affects many RL algorithms, in particular those following an approximate dynamic programming approach, where an iterative evaluation-improvement process is used to reach the optimal policy, i.e.:

$$
{\hat Q}^{\pi}_{k+1} \leftarrow \arg \min_Q \mathbb{E}_{(s,a,s')\sim D} \left[ \left( Q(s, a) - \left( r(s, a) + \gamma \mathbb{E}_{a' \sim\pi_k(a'|s')}[{\hat Q}^{\pi}_k(s', a')] \right) \right)^2 \right]  \tag{Evaluation}
$$

$$
\pi_{k+1} \leftarrow \arg \max_{\pi} \mathbb{E}_{s\sim D} \left[ \mathbb{E}_{a \sim\pi(a|s)} {\hat Q}^{\pi}_{k+1}(s, a) \right] \tag{Improvement}
$$

Observe that in an offline setup we restrict the states $s$ and $s'$ and the action $a$ to belong to the dataset $D$, but the action $a'$ sampled from the policy $\pi_k$ in the evaluation step could be outside of our dataset. If we don't have enough information around the state $s'$ (as is typically the case in offline RL), the Q-value estimates for actions not covered in $D$ are unreliable, and any action whose value is overestimated will be favored by $\pi_k(a'|s')$; this error then propagates through the evaluation iterations. In online RL such an overestimation is corrected through exploration, but this is not an option in offline methods. This is perhaps the most important distributional shift issue that offline RL suffers from, as already mentioned in the introduction.
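One way to see this effect on a small problem is to count how often the greedy bootstrap action $a'$ at $s'$ is a pair $(s', a')$ that never appears in $D$. The sketch below does this for a hand-written Q-table; the dataset, the Q-values and the function name `ood_bootstrap_fraction` are hypothetical and only serve as an illustration.

```python
def ood_bootstrap_fraction(dataset, Q, actions):
    """Fraction of transitions whose greedy next action is not in the data at s'."""
    seen = {(s, a) for s, a, _, _ in dataset}
    ood = 0
    for _, _, _, s_next in dataset:
        greedy = max(actions, key=lambda b: Q[(s_next, b)])
        if (s_next, greedy) not in seen:
            ood += 1
    return ood / len(dataset)


# Hypothetical (s, a, r, s') transitions: the behavior policy always goes "right".
dataset = [(0, "right", -0.1, 1), (1, "right", -0.1, 2), (2, "right", 0.0, 0)]
actions = ["left", "right"]

# Hypothetical Q estimates: the unseen pair (2, "left") is spuriously overestimated.
Q = {
    (0, "left"): -1.0, (0, "right"): -0.2,
    (1, "left"): -1.0, (1, "right"): -0.1,
    (2, "left"): 0.5, (2, "right"): 0.0,
}

fraction = ood_bootstrap_fraction(dataset, Q, actions)
```

In this toy case one of the three bootstrap targets is built from an action that the data says nothing about, so its error cannot be corrected by further training on $D$.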

Note: DNN function approximation errors could also produce an overestimation of some actions and bring the system to out-of-distribution data:

Q1. This should also happen in online RL. It happens in the imitation learning problem. Need more thinking!!

Q2. Also see: An example in page 8 of https://arxiv.org/pdf/2106.08909.pdf.

**EXERCISE II**: Distributional shift due to effective sample size reduction ($D\rightarrow D'$).

After each evaluation-improvement iteration, the policy seeks to maximize its performance within the dataset, which can inadvertently reduce the effective sample size. However, this isn't always desirable because there may be data points within the dataset, albeit infrequent or under-represented, that hold critical information. Ensuring that such data is appropriately considered during inference is a vital challenge in offline reinforcement learning.

In a second notebook exercise, we'll create two datasets using different behavior policies: one suboptimal and the other optimal, resembling human expertise. Since human expertise data is scarce, the combined dataset contains only a few episodes for that policy, resulting in underrepresentation of high-reward states. The challenge will be to accurately capture these infrequent yet crucial points for off-policy algorithms.
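A rough way to quantify the reduction from $D$ to $D'$ is Kish's effective sample size, $(\sum_i w_i)^2 / \sum_i w_i^2$, computed on importance weights $w_i = \pi(a_i|s_i)/\pi_\beta(a_i|s_i)$. This is only one possible proxy and is not necessarily what the notebook uses; the single-state setup and all probabilities below are assumptions for illustration.

```python
def effective_sample_size(weights):
    """Kish's effective sample size: (sum w)^2 / sum w^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Hypothetical single-state example: 95 logged suboptimal actions, 5 expert ones.
pi_beta = {"suboptimal": 0.95, "expert": 0.05}  # behavior policy
pi_new = {"suboptimal": 0.10, "expert": 0.90}   # learned policy prefers the expert action
logged_actions = ["suboptimal"] * 95 + ["expert"] * 5

# Importance weight of each logged sample under the learned policy.
weights = [pi_new[a] / pi_beta[a] for a in logged_actions]
ess = effective_sample_size(weights)
```

Although the dataset has 100 samples, `ess` comes out at roughly 6: the estimate for the learned policy rests almost entirely on the five expert samples.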

Let's take a look at the [notebook](http://localhost:8888/notebooks/examples/offline_RL_workshop/notebooks/Offpolicy_example_2.ipynb) !!

**EXERCISE III**: Another desirable property of offline RL, as discussed in the Minari section, is the ability to reuse pieces of trajectories from different datasets in order to create an optimal one for the task in question!

### How to solve the distributional shift problem

There are different approaches, but the main idea is to find a balance where the policy distribution is not too far away from the behavioral one but is still able to outperform it. This means creating some distributional shift (in order to improve the policy) but without going out of distribution, while keeping the effective sample size large enough to remain representative at inference time. This is not an easy task at all, and it is an active field of research in the RL community.


In order to achieve the previous goal, we can classify the offline RL algorithms into three main categories:

I - **Policy constraint**

II - **Policy Regularization**

III - **Importance sampling**

### I: **Policy constraint:** 

#### **Non-implicit or Direct** 

(e.g. BCQ - see later): We have access to $\pi_\beta$. For instance, it could be a suboptimal classical (i.e. non-RL) policy, or it could be computed through behavioral cloning on a given dataset. Note that even if we have the data, obtaining a cloned policy that represents it is not a simple task in complex (high-dimensional) environments.

As we already have $\pi_\beta$, we can constrain the learned policy to stay close to the behavioral one through:

\begin{align*}
D_{KL}(\pi(a|s)||\pi_{\beta}(a|s)) \leq \epsilon
\end{align*}

and, as shown in (ref.1), we can bound $D_{KL}(d_{\pi}(s)||d_{\pi_{\beta}}(s))$ by $\delta$, which is $O\left(\frac{\epsilon}{{(1 - \gamma)}^2}\right)$. Here $d_{\pi}(s)$ is the state visitation frequency induced by the policy $\pi$. In summary, if $\pi$ and $\pi_{\beta}$ are close enough, the state distributions $d_{\pi}(s)$ and $d_{\pi_{\beta}}(s)$ will also be similar, and so the states that we visit during data collection will be similar to the ones we will encounter at inference.
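For discrete actions the constraint can be checked directly at each state. The sketch below computes $D_{KL}(\pi(\cdot|s)||\pi_\beta(\cdot|s))$ for one state; the probability vectors and the value of `epsilon` are made up for illustration.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as probability lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical action distributions at a single state s.
pi_beta = [0.5, 0.5, 0.0]   # the behavior policy never takes the third action
pi_close = [0.6, 0.4, 0.0]  # small shift within the data support
pi_far = [0.1, 0.1, 0.8]    # most mass on an action absent from the data

epsilon = 0.05
kl_close = kl_divergence(pi_close, pi_beta)
kl_far = kl_divergence(pi_far, pi_beta)
close_is_feasible = kl_close <= epsilon
far_is_feasible = kl_far <= epsilon
```

Here `pi_close` satisfies the constraint, while `pi_far` has infinite divergence because it places probability on an action that $\pi_\beta$ never takes.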

Basically we will use this constraint in actor-critic algorithms, i.e.:

$$
{\hat Q}^{\pi}_{k+1} \leftarrow \arg \min_Q \mathbb{E}_{(s,a,s')\sim D} \left[ \left( Q(s, a) - \left( r(s, a) + \gamma \mathbb{E}_{a' \sim\pi_k(a'|s')}[{\hat Q}^{\pi}_k(s', a')] \right) \right)^2 \right]
$$

$$
\pi_{k+1} \leftarrow \arg \max_{\pi} \mathbb{E}_{s\sim D} \left[ \mathbb{E}_{a \sim\pi(a|s)} {\hat Q}^{\pi}_{k+1}(s, a) \right] \\
\text{s.t. } D(\pi, \pi_{\beta}) \leq \epsilon.
$$


We could also add the constraint as a penalty in both the evaluation and improvement steps (what is the difference?):

$$
{\hat Q}^{\pi}_{k+1} \leftarrow \arg \min_Q \mathbb{E}_{(s,a,s')\sim D} \left[ \left( Q(s, a) - \left( r(s, a) + \gamma \mathbb{E}_{a' \sim\pi_k(a'|s')}[{\hat Q}^{\pi}_k(s', a')] -\alpha\gamma D(\pi_k(\cdot|s'), \pi_\beta(\cdot|s')) \right) \right)^2 \right]
$$

$$
\pi_{k+1} \leftarrow \arg \max_{\pi} \mathbb{E}_{s\sim D} \left[ \mathbb{E}_{a \sim\pi(a|s)} {\hat Q}^{\pi}_{k+1}(s, a) -\alpha\gamma D(\pi(\cdot|s), \pi_\beta(\cdot|s)) \right]
$$


To overcome these issues, another approach is to constrain the policy only to the support of the behavior policy, i.e. to the region of the action space where $\pi_\beta$ is defined, as seen in the figure below.

<img src="Images/policy_constraint_vs_support.png" alt="offline_rl_4" width=500cm>

ToDo: Give an example!!
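As a minimal sketch of a support constraint for discrete actions (loosely in the spirit of discrete-action BCQ, not a faithful implementation of it), we can restrict the greedy choice to actions that the behavior policy takes with non-negligible probability. The function name, the threshold and the numbers below are hypothetical.

```python
def support_constrained_greedy(q_values, behavior_probs, threshold=0.05):
    """Index of the best action among those with pi_beta(a|s) >= threshold."""
    allowed = [a for a, p in enumerate(behavior_probs) if p >= threshold]
    if not allowed:
        raise ValueError("no action lies in the support of the behavior policy")
    return max(allowed, key=lambda a: q_values[a])


# Hypothetical values at one state: action 2 looks best but is never seen in D.
q_values = [1.0, 2.0, 5.0]
behavior_probs = [0.7, 0.3, 0.0]

unconstrained = max(range(len(q_values)), key=lambda a: q_values[a])
constrained = support_constrained_greedy(q_values, behavior_probs)
```

The unconstrained greedy choice picks the unseen action 2, whereas the support-constrained choice picks action 1. Unlike a distribution-matching constraint, the support constraint does not pull the policy toward the most frequent action 0; it only forbids actions outside the data.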


#### **Implicit** 

(e.g. AWR - see later): We don't need $\pi_\beta$ and we can work directly with our data $D$.

Suppose you have a behavioral policy $\mu$ and you want to find a better one, $\pi$. What you could do is maximize the difference in expected return:

$\eta(\pi) = J(\pi) - J(\mu)$ .

It can be shown that given two policies $\pi$ and $\mu$ the following general result holds:

$\eta(\pi) = \mathbb{E}_{s \sim d^{\pi}(s)} \mathbb{E}_{a \sim \pi(a|s)} [A^{\mu}(s, a)] = \mathbb{E}_{s \sim d^{\pi}(s)} \mathbb{E}_{a \sim \pi(a|s)} \left[ R^{\mu}_{s,a} - V^{\mu}(s) \right]
$ (See https://arxiv.org/pdf/1502.05477.pdf)

The resulting constrained policy search problem, with the state distribution taken from $\mu$, is:

$$
\begin{aligned}
&\underset{\pi}{\text{arg max}} \int_s d^{\mu}(s) \int_a \pi(a|s) \left( R^{\mu}_{s,a} - V^{\mu}(s) \right) \, da \, ds \\
&\text{s.t.} \int_s d^{\mu}(s) \, D_{KL}(\pi(\cdot|s) || \mu(\cdot|s)) \, ds \leq \epsilon
\end{aligned}
\tag{5}
$$

The corresponding Lagrangian, with multiplier $\beta$, is:

$$
L(\pi, \beta) = \int_s d^{\mu}(s) \int_a \pi(a|s) \left( R^{\mu}_{s,a} - V^{\mu}(s) \right) \, da \, ds + \beta \left( \epsilon - \int_s d^{\mu}(s) \, D_{KL}(\pi(\cdot|s) || \mu(\cdot|s)) \, ds \right)
$$

and solving for the maximizing $\pi$:

$
\pi^*(a|s) = \frac{1}{{Z(s)}} \mu(a|s) \exp\left(\frac{1}{\beta} \left(R^{\mu}_{s,a} - V^{\mu}(s)\right)\right)
$
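For a single state with discrete actions, this closed-form solution is easy to evaluate numerically. The sketch below reweights $\mu(a|s)$ by the exponentiated advantage and normalizes by $Z(s)$; the probabilities, advantages and values of $\beta$ are hypothetical.

```python
import math


def advantage_weighted_policy(mu_probs, advantages, beta=1.0):
    """pi*(a|s) = mu(a|s) * exp(A(s, a) / beta) / Z(s) for one state."""
    unnormalized = [m * math.exp(adv / beta) for m, adv in zip(mu_probs, advantages)]
    z = sum(unnormalized)  # partition function Z(s)
    return [u / z for u in unnormalized]


# Hypothetical behavior probabilities and advantages R - V at one state.
mu_probs = [0.5, 0.3, 0.2, 0.0]
advantages = [-1.0, 0.0, 1.0, 10.0]

pi_sharp = advantage_weighted_policy(mu_probs, advantages, beta=0.5)
pi_soft = advantage_weighted_policy(mu_probs, advantages, beta=100.0)
```

A small $\beta$ concentrates the policy on the high-advantage actions that $\mu$ actually takes, while a large $\beta$ keeps $\pi^*$ close to $\mu$. The last action has a large advantage but zero probability under $\mu$, so it stays at zero: this is how the constraint is enforced implicitly.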

Let $\rho_{\pi}$ be the (unnormalized) discounted visitation frequencies:

$\rho_{\pi}(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \ldots$
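For a small Markov chain the discounted visitation frequencies can be computed by propagating the state distribution forward and accumulating the discounted terms. The two-state chain below, the truncation horizon and the function name are assumptions for illustration.

```python
def discounted_visitation(p0, transition, gamma=0.9, horizon=500):
    """rho(s) = sum_t gamma^t P(s_t = s), truncated after `horizon` steps."""
    n = len(p0)
    rho = [0.0] * n
    dist = list(p0)  # P(s_t = s), starting at t = 0
    discount = 1.0
    for _ in range(horizon):
        for s in range(n):
            rho[s] += discount * dist[s]
        # One step forward: P(s_{t+1} = j) = sum_i P(s_t = i) * transition[i][j]
        dist = [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]
        discount *= gamma
    return rho


# Hypothetical 2-state chain induced by some fixed policy pi.
p0 = [1.0, 0.0]
transition = [[0.5, 0.5],
              [0.0, 1.0]]
rho = discounted_visitation(p0, transition)
```

Since the frequencies are unnormalized, they sum to $1/(1-\gamma)$, which is 10 here; multiplying by $(1-\gamma)$ turns them into a proper distribution.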




Following the TRPO paper (where $\eta(\pi)$ denotes the expected discounted return of $\pi$, i.e. $J(\pi)$ above), the return of a new policy $\tilde{\pi}$ can be written in terms of its advantage over $\pi$:

$$
\begin{aligned}
\eta(\tilde{\pi}) &= \eta(\pi) + \sum_{t=0}^{\infty} \sum_{s} P(s_t = s | \tilde{\pi}) \sum_{a} \tilde{\pi}(a|s) \gamma^t A^{\pi}(s, a) \\
&= \eta(\pi) + \sum_{s} \sum_{t=0}^{\infty} \gamma^t P(s_t = s | \tilde{\pi}) \sum_{a} \tilde{\pi}(a|s) A^{\pi}(s, a) \\
&= \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a|s) A^{\pi}(s, a).
\end{aligned}
$$

Here $R^{\mu}_{s,a}$ is the cumulative reward of a trajectory that starts in state $s$, takes action $a$ and follows the policy $\mu$ from that point on (note that $R^{\mu}_{s,a} - V^{\mu}(s)$ is the usual advantage function $Q(s,a) - V(s)$).

Because the expectation is taken with respect to $d^{\pi}(s)$, which depends on the policy being optimized, this objective is hard to optimize directly. We can approximate it in the same way as in Trust Region Policy Optimization (TRPO), replacing $d^{\pi}(s)$ by $d^{\mu}(s)$:

$\hat{\eta}(\pi) = \mathbb{E}_{s \sim d^{\mu}(s)} \mathbb{E}_{a \sim \pi(a|s)} [R^{\mu}_{s,a} - V^{\mu}(s)]
$

Remember that $\hat{\eta}(\pi)$ and ${\eta}(\pi)$ match to first order around $\pi = \mu$.

In exact policy iteration (e.g. Q-learning), you choose actions deterministically based on the advantage function, i.e. $\tilde{\pi}(s) = \arg\max_a A^{\pi}(s,a)$. This policy improvement technique works as follows:

If there's at least one state-action pair with a positive advantage value and a nonzero state visitation probability, then using exact policy iteration will improve the policy.

If there are no state-action pairs with a positive advantage value and nonzero state visitation probability, it means the algorithm has converged to the optimal policy.







In a Bellman-style approach, the distributional shift only affects actions during training, but it could affect states during inference (ToDo: Give an example of this!!!)

Cons: the final performance may not be very good, as the behavior policy, and any policy that is close to it, may be much worse than the best policy that can be learned from the offline data. (This is not entirely what happens in BCQ)









Problem when the behavior policy is badly estimated: e.g. when we fit a unimodal policy to multimodal data. In that case, policy constraint methods can fail dramatically.



Cons: These methods can often be too pessimistic, which is undesirable. For instance, if we know that all actions in a certain state have zero reward, we should not care about constraining the policy in that state, since forcing the learned policy to be close to the behavior policy in this irrelevant state can inadvertently affect our neural network approximator. By being too pessimistic, we effectively limit how good a policy we can learn from our dataset.

## ToDo: specify when it is better to use BCQ - CQL - IQL - AWAC - etc.

Exercise I: try a DQN (off-policy) approach

(ToDo -> give participants the possibility to change the reward without hard-coding!!!??)

a - Create environment with obstacle in the middle. Start (0,0) - Target (7,7)

b - reward = -0.1 everywhere, 0 at the target (as the agent cannot go over the obstacles - Again the obstacle could be a fence to protect the area from the agent)

c - Collect data

d - Train DQN

e - Look at o.o.d. without the obstacle --> should find the optimal trajectory over obstacle?

Explanation: $Q(s,a)$ -> ~ $Q(s',a')$ but $a'$ not in dataset as produced by policy --> tendency to overestimate $Q$ on unseen region --> explain this much better!!

Why is offline RL a difficult problem?


1 - **Not possible to improve exploration**: As the learning algorithm must rely entirely on the static dataset $D$, there is no possibility of improving exploration: if $D$ does not contain transitions that illustrate high-reward regions for our task, it may be impossible to discover those regions. If we explore beyond our data we could have severe problems, as there could be a good reason why this data is not in our dataset: maybe there is an obstacle that could damage the robot, or a fragile object that could be damaged by the robot!

Note that this is the opposite of online RL, where you explore by interacting with the environment.

This is why the data collection phase is so important!!

2 - **Distributional shift**: the state-action distribution in $D$ does not accurately represent the state-action distribution induced by the trained policy. This challenges many existing machine learning methods, which assume that data is independent and identically distributed (i.i.d.). In standard supervised learning, we aim to train a model that performs well on data from the same distribution as the training data. In offline RL, our goal is to learn a policy that behaves differently (hopefully better) than what is seen in the dataset $D$. As a consequence (see later), RL algorithms will tend to generate actions not included in $D$, i.e. **out-of-distribution actions**. This could be dangerous, as during inference these actions could bring the system to unexplored states (i.e. not included in $D$).