In [None]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_rl
%set_random_seed 12

In [None]:
%presentation_style

In [None]:
%load_latex_macros

<img src="_static/images/aai-institute-cover.svg" alt="Snow" style="width:100%;">
<div class="md-slide title"> Offline RL Part I </div>

# Offline RL challenges

The idea of offline RL is simple: find the best possible policy from a given, fixed dataset. Unlike in imitation learning, the dataset does not need to be collected by experts, and it does not even need to come from the same task, although in that case it should still be representative of the problem at hand. Offline RL therefore covers most realistic scenarios, and it can exploit large amounts of data coming from different sources, as we saw before.


The key distinction between online RL and offline RL lies in their exploration capabilities. In online RL, we actively explore the state-action space, while offline RL operates solely within a fixed dataset (denoted as $D$). Going "out of distribution" beyond this dataset in offline RL can lead to severe issues.

<img src="_static/images/offline_RL.jpg" alt="offline_rl" style="height:400px;">

Online RL involves interactive exploration to discover the highest-reward regions by gathering environmental feedback. In contrast, offline RL imposes strict limitations on exploration beyond the dataset. This constraint results in algorithms overestimating unknown areas and attempting to navigate beyond the dataset, as illustrated in the figure below where the dataset doesn't fully represent high-reward regions.

<img src="_static/images/offline_RL_3.jpg" alt="offline_rl_1" style="height:400px;">

As shown on the right of the figure above, once you are out of distribution (o.o.d.) (states $s$ and $s'$ in red), there is no feedback to guide you back to $D$, so the o.o.d. errors propagate. As we will see, this is one of the main challenges of offline RL, and there are different techniques to mitigate it. As we saw before, imitation learning has no way to correct this misbehavior either, unless an expert provides feedback, as in the DAgger example.

**The o.o.d. issues are not the only distributional shift effect in offline RL**.
Once computed, the optimal policy typically operates within a subset of the original dataset distribution, which creates a distinct form of distributional shift (the subset $D'$, in green in the figure below). Evaluating a policy substantially different from the behavior policy reduces the effective sample size (from $D$ to $D'$), which increases the variance of the estimates. In simpler terms, the few data points that remain relevant may not accurately represent the true data distribution.

<img src="_static/images/offline_RL_2.jpg" alt="offline_rl_2" style="height:400px;">
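
To make the shrinking from $D$ to $D'$ concrete, here is a minimal sketch. It assumes a single-state, two-action problem and quantifies the effective sample size with Kish's estimator applied to importance weights $\pi(a)/\beta(a)$. The estimator, the dataset and all names below are illustrative assumptions, not something taken from the workshop code.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


# Hypothetical dataset D: 1000 samples of a two-action problem, collected by a
# behavior policy that picks each action with probability 0.5.
behavior_prob = {0: 0.5, 1: 0.5}
dataset_actions = [0] * 500 + [1] * 500


def importance_weights(target_prob):
    """Weights pi(a) / beta(a) for every action stored in the dataset."""
    return [target_prob[a] / behavior_prob[a] for a in dataset_actions]


# A target policy equal to the behavior policy uses the whole dataset ...
ess_same = effective_sample_size(importance_weights({0: 0.5, 1: 0.5}))
# ... while a deterministic target policy only "sees" the matching half.
ess_greedy = effective_sample_size(importance_weights({0: 1.0, 1: 0.0}))

print(ess_same, ess_greedy)
```

In this toy setting the deterministic policy is evaluated with only half of the samples, so any estimate built on them is noisier than one built on the full dataset.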


Can we apply techniques from online RL, known for its effectiveness in solving complex problems, to offline RL? Yes, we can. We can adapt concepts from online RL, particularly off-policy RL algorithms.

In both offline and off-policy RL, we train a policy using a batch of data from a replay buffer. The key difference is that in online RL, this data is constantly updated with new data generated by an improving policy through exploration. In offline RL, we cannot change our replay buffer; we must use the same dataset throughout training. This fundamental distinction in data collection and utilization presents both challenges and opportunities in offline RL.

**EXERCISE 1 in nb_95**:  Let's check how efficient online algorithms are with a finite amount of data.


Let's have a look at the exercise in notebook 6a. We will apply a simple off-policy algorithm, Deep Q-Network (DQN), replacing the environment we typically use in online settings with a ReplayBuffer of state-action-reward tuples collected with two suboptimal policies, as would be the case in realistic scenarios (i.e. the data could come from many different sources).

**Important**: In contrast to imitation learning, we now need a reward function, because DQN computes the TD error, which depends on the reward.

Remember: 

$$ Q^\pi(s, a) = \mathbb{E}_\pi \left[ r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots \mid s_0 = s, a_0 = a \right]
$$

<img src="_static/images/q_value_iterative.png" alt="offline_rl" style="height:200px;">
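
As a small numerical reminder, the sum inside the expectation can be computed for a single trajectory as sketched below. The reward sequence and the value of $\gamma$ are made up for illustration; $Q^\pi(s, a)$ would be the average of this quantity over many trajectories that start in $s$ with action $a$ and then follow $\pi$.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one trajectory."""
    g = 0.0
    # Work backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical rewards r_0, r_1, r_2 observed after taking a_0 = a in s_0 = s.
rewards = [1.0, 0.0, 2.0]
single_return = discounted_return(rewards, gamma=0.5)
print(single_return)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```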


DQN is a simple method to reach the optimal policy by iteratively computing:

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] $$
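
The update can be tried out on a fixed dataset with the sketch below. Note that this is tabular Q-learning, not DQN: the neural network is replaced by a plain table so that the example stays self-contained. The toy chain environment, the transition format `(s, a, r, s_next, done)` and the hyperparameters are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def q_learning_on_dataset(dataset, n_actions, alpha=0.5, gamma=0.9,
                          n_updates=2000, seed=0):
    """Apply the Q-learning update to transitions sampled from a fixed dataset."""
    rng = random.Random(seed)
    q = defaultdict(float)  # Q(s, a), initialised to 0 for every pair
    for _ in range(n_updates):
        s, a, r, s_next, done = rng.choice(dataset)
        # No bootstrapping from terminal states.
        bootstrap = 0.0 if done else max(q[(s_next, b)] for b in range(n_actions))
        td_error = r + gamma * bootstrap - q[(s, a)]
        q[(s, a)] += alpha * td_error
    return q


# Hypothetical chain 0 -> 1 -> 2 (terminal). Action 1 moves right, action 0
# stays in place; only reaching the terminal state gives a reward.
dataset = [
    (0, 0, 0.0, 0, False),
    (0, 1, 0.0, 1, False),
    (1, 0, 0.0, 1, False),
    (1, 1, 1.0, 2, True),
]

q = q_learning_on_dataset(dataset, n_actions=2)
print({key: round(value, 3) for key, value in sorted(q.items())})
```

Here the dataset covers every state-action pair, so the table converges to the optimal values. The interesting failures appear when some pairs are missing from the data.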


Let's go to the notebook exercise!

Summary of results:

1 - **Without enough data, online algorithms will in general struggle to find a good policy, as they rely heavily on exploration.**

2 - **Why is DQN going out of distribution?** The reason is more general than DQN: it is a pathology that affects many RL algorithms, in particular those following an approximate dynamic programming approach, where an iterative evaluation-improvement process is used to reach the optimal policy, i.e.:

$$
{\hat Q}^{\pi}_{k+1} \leftarrow \arg \min_Q \mathbb{E}_{(s,a,s')\sim D} \left[ \left( Q(s, a) - \left( r(s, a) + \gamma \mathbb{E}_{a' \sim\pi_k(a'|s')}[{\hat Q}^{\pi}_k(s', a')] \right) \right)^2 \right]  \tag{Evaluation}
$$

$$
\pi_{k+1} \leftarrow \arg \max_{\pi} \mathbb{E}_{s\sim D} \left[ \mathbb{E}_{a \sim\pi(a|s)} {\hat Q}^{\pi}_{k+1}(s, a) \right] \tag{Improvement}
$$
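
The two steps above can be sketched in a tabular setting as follows. This is only an illustration under simplifying assumptions: the Q-function is a table (so the squared-error minimiser for each $(s, a)$ is just the mean of its targets), the policy is deterministic (so the inner expectation over $a'$ reduces to a single lookup), and the toy chain dataset and all names are hypothetical.

```python
def evaluate(dataset, policy, q_prev, gamma):
    """(Evaluation): regress Q(s, a) onto r + gamma * Q_k(s', pi_k(s'))."""
    targets = {}
    for s, a, r, s_next, done in dataset:
        bootstrap = 0.0 if done else q_prev.get((s_next, policy[s_next]), 0.0)
        targets.setdefault((s, a), []).append(r + gamma * bootstrap)
    q_next = dict(q_prev)
    # For a table, the least-squares solution is the mean target per (s, a).
    for key, values in targets.items():
        q_next[key] = sum(values) / len(values)
    return q_next


def improve(states, actions, q):
    """(Improvement): a deterministic policy that is greedy with respect to Q."""
    return {s: max(actions, key=lambda a: q.get((s, a), 0.0)) for s in states}


# Hypothetical chain 0 -> 1 -> 2 (terminal); action 1 moves right, 0 stays.
dataset = [
    (0, 0, 0.0, 0, False),
    (0, 1, 0.0, 1, False),
    (1, 0, 0.0, 1, False),
    (1, 1, 1.0, 2, True),
]
states, actions, gamma = [0, 1], [0, 1], 0.9

q = {}
policy = {s: 0 for s in states}  # arbitrary initial policy
for _ in range(50):
    q = evaluate(dataset, policy, q, gamma)
    policy = improve(states, actions, q)

print(policy, q)
```

Note that `improve` maximises over all actions, whether or not they ever appear in the dataset for that state.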

As we can see from the (Evaluation) step, the only quantity that can be o.o.d. is the action $a'$, since all the other values $s$, $a$ and $s'$ are taken from the dataset $D$.


<img src="_static/images/94_dqn_ood_case.png" alt="offline_rl" style="height:40%">


However, if the o.o.d. action $a'=a'_4$ in the figure above is the one that maximizes $Q(s',a')$, we get a policy $\pi(s') = \arg\max_a Q(s',a)$ that goes o.o.d. with probability one, i.e., into the state $s''$, which we shouldn't visit. The same holds for a non-deterministic policy, which will have a non-zero probability of jumping to the state $s''$. If this happens during inference, we will need to compute $\pi(s'') = \arg\max_a Q(s'',a)$ in the state $s''$, but since $Q(s'',a)$ is a Q-value that never appeared during training, the behavior will be unpredictable: the policy has no clue about what to do or how to bring us back to in-distribution states. Note that this is similar to the situation we observed in the imitation learning approach. The figure below shows how severe the overestimation of the Q-values is for the off-policy Soft Actor Critic (SAC) algorithm on the half-cheetah environment:


<img src="_static/images/94_offpolicy_Q_values_overestimation.png" alt="offline_rl" style="height:30%">


In online RL, such an overestimation would be corrected through exploration, but this is not an option in offline methods. **This is perhaps the most important distributional shift issue that offline RL suffers from, as already mentioned in the introduction.**
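
The mechanism can be reproduced with a tiny tabular sketch. Everything here is an assumption made for illustration: the three-action toy problem, the state names, and in particular the spuriously high initial value of the never-seen pair, which stands in for whatever value a function approximator happens to output outside the data.

```python
GAMMA = 0.9
ACTIONS = [0, 1, 2]

# Hypothetical dataset: action 2 is never taken in state "s1".
dataset = [
    ("s0", 0, 0.0, "s1", False),
    ("s1", 0, 1.0, "end", True),
    ("s1", 1, 0.5, "end", True),
]

q = {(s, a): 0.0 for s in ("s0", "s1") for a in ACTIONS}
# Stand-in for an approximation error: no transition in the dataset can
# ever correct this entry, because ("s1", 2) is never observed.
q[("s1", 2)] = 5.0

for _ in range(100):
    for s, a, r, s_next, done in dataset:
        bootstrap = 0.0 if done else max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] = r + GAMMA * bootstrap

# The greedy policy in "s1" selects the action that was never seen ...
greedy_s1 = max(ACTIONS, key=lambda a: q[("s1", a)])
# ... and the error has propagated backwards into Q("s0", 0).
in_data_value = GAMMA * max(q[("s1", a)] for a in (0, 1))

print(greedy_s1, q[("s0", 0)], in_data_value)
```

The value of `("s0", 0)` ends up several times larger than what the actions present in the data can deliver, and no amount of additional training on the same dataset fixes it.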

**Note 1: We could run the evaluation step with the behavior policy directly (for instance, by obtaining it through BC on the dataset) to learn $Q^\beta(s,a)$, and then derive a policy with the standard policy improvement step. However, this has been shown to be far from optimal: if the dataset contains a fair amount of suboptimal data, $Q^\beta(s,a)$ will be far from optimal, and so will the policy derived from it.**
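
A toy illustration of Note 1, under stated assumptions: $Q^\beta$ is estimated here with an every-visit Monte Carlo average over the logged trajectories (a simple substitute for the TD-based evaluation step), and the trajectories themselves are hypothetical.

```python
from collections import defaultdict


def monte_carlo_q(trajectories, gamma):
    """Every-visit Monte Carlo estimate of Q^beta from logged trajectories."""
    returns = defaultdict(list)
    for trajectory in trajectories:
        g = 0.0
        # Walk backwards so g is the discounted return from each step onwards.
        for s, a, r in reversed(trajectory):
            g = r + gamma * g
            returns[(s, a)].append(g)
    return {key: sum(vals) / len(vals) for key, vals in returns.items()}


# Hypothetical logged data, as lists of (state, action, reward) steps.
# In state 1 the behavior policy mostly picks the bad action 0.
trajectories = (
    [[(0, 1, 0.0), (1, 1, 1.0)]]          # good continuation, seen once
    + [[(0, 1, 0.0), (1, 0, 0.0)]] * 4    # bad continuation, seen four times
    + [[(0, 0, 0.5)]]                     # safe but mediocre shortcut
)

q_beta = monte_carlo_q(trajectories, gamma=0.9)
# One step of policy improvement on top of Q^beta.
greedy = {s: max((0, 1), key=lambda a: q_beta[(s, a)]) for s in (0, 1)}

print(q_beta, greedy)
```

Because $Q^\beta(0, 1)$ averages over the mostly bad behavior in state 1, the improved policy prefers the shortcut in state 0 (return 0.5), even though the data contains the pieces of a better plan: action 1 in state 0 followed by action 1 in state 1 (return 0.9).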

Note 2: The function approximation errors of the DNN could also produce an overestimation of some actions, bringing the system to out-of-distribution data. (TODO: explain this point better; there is a whole paper about it.)