In [2]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_rl
%set_random_seed 12

In [3]:
%presentation_style

In [4]:
%load_latex_macros


$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$
$\newcommand{\amax}{{\text{argmax}}}$
$\newcommand{\P}{{\mathbb{P}}}$
$\newcommand{\E}{{\mathbb{E}}}$
$\newcommand{\R}{{\mathbb{R}}}$
$\newcommand{\Z}{{\mathbb{Z}}}$
$\newcommand{\N}{{\mathbb{N}}}$
$\newcommand{\C}{{\mathbb{C}}}$
$\newcommand{\abs}[1]{{ \left| #1 \right| }}$
$\newcommand{\simpl}[1]{{\Delta^{#1} }}$


<img src="_static/images/aai-institute-cover.svg" alt="Snow" style="width:100%;">
<div class="md-slide title"> Addressing distributional shift </div>

# Addressing Distributional Shift in Offline RL

## Overview

There are various approaches, but the core idea is to strike a balance: **the learned policy should stay reasonably close to the behavior policy while still improving on its performance**. This means **introducing enough distributional shift to improve the policy without going out of distribution**, while keeping the effective sample size large enough to be representative at inference time. In particular, **we don't want to exclude state-action pairs that appear infrequently in the data (e.g. because expert data are scarce) but could lead to higher-reward trajectories, especially since these states are likely to appear during inference**. Achieving this balance is challenging and remains a highly active area of research in the RL community.

To attain the aforementioned goal, offline RL algorithms can be classified into three primary categories:

**I - Policy constraint**

**II - Policy Regularization**

**III - Importance sampling**

### I - Policy constraint

#### a) Non-implicit or Direct

We have access to the behavior policy, $\bf \pi_\beta$. For instance, it could be a suboptimal classical (i.e. non-RL) policy, or it could be obtained from a given dataset through behavioral cloning.

Since $\pi_\beta$ is available, we can constrain the learned policy to stay close to the behavior policy through:

\begin{equation}
D_{KL}(\pi(.|s)||\pi_{\beta}(.|s)) \leq \epsilon
\label{dk_1}
\end{equation}


<img src="_static/images/96_KL_divergence.png" alt="KL divergence" width=700cm>
<div class="slide title"> Fig.1: DKL divergence </div>

with the Kullback-Leibler divergence, $D_{KL}$, defined as:

$$
D_{KL}(\pi(.|s)||\pi_{\beta}(.|s)) = \sum_a \pi(a|s) \log \frac{\pi(a|s)}{\pi_{\beta}(a|s)}
\label{dkl_2}
$$
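As a quick numerical illustration of the definition above, the following minimal sketch computes $D_{KL}$ for discrete action distributions at a single state. The function name `kl_divergence` and the three example distributions are illustrative assumptions, not part of the workshop code.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same action set."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Actions with p(a) = 0 contribute nothing; eps guards against log(0) in q.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

# Hypothetical policies over three actions at some state s.
pi_beta = np.array([0.7, 0.2, 0.1])     # behavior policy
pi_close = np.array([0.6, 0.3, 0.1])    # small distributional shift
pi_far = np.array([0.05, 0.05, 0.9])    # large distributional shift

kl_close = kl_divergence(pi_close, pi_beta)
kl_far = kl_divergence(pi_far, pi_beta)
print(f"small shift: {kl_close:.3f}, large shift: {kl_far:.3f}")
```

A constraint $D_{KL} \leq \epsilon$ with a small $\epsilon$ would accept `pi_close` and reject `pi_far`.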

As shown in ref.1, if Eq. \ref{dk_1} is satisfied, then the difference between the state visitation frequencies (specifically, the $\gamma$-discounted state distributions) $d_{\pi}(s)$ and $d_{\pi_{\beta}}(s)$, induced by $\pi(a|s)$ and $\pi_{\beta}(a|s)$ respectively, is bounded by a value $\delta$ that is $O\left(\frac{\epsilon}{{(1 - \gamma)}^2}\right)$. In other words, as depicted in the figure below, minimizing the $D_{KL}$ divergence favors case (a) and penalizes situations like case (b).


<img src="_static/images/96_policy_constraint_DKL.png" alt="KL divergence" width=200%>
<div class="slide title"> Fig.2: DKL divergence's effect on out-of-distribution data </div>

In summary, if the state distributions $d_{\pi}(s)$ and $d_{\pi_{\beta}}(s)$ are close enough around a given state $s$, the set of states visited during data collection will be similar to the set we encounter during inference. The $D_{KL}$ divergence suppresses undesired o.o.d. actions as in case (b) but still encourages the o.o.d. actions of case (a), which are essential for improving on the behavior policy $\pi_\beta(a|s)$. It's important to remember that we expect the distributional shift of case (b) because, typically, the support of state-action pairs of the optimal policy, $\pi(a|s)$, will be different from (and usually smaller than) that of the behavior policy $\pi_\beta(a|s)$. Also, as mentioned earlier, actions like $a_4$ could lead our agent to highly rewarding regions, as seen in the figure below.

<img src="_static/images/96_critical_action_states.png" alt="KL divergence" width=50%>
<div class="slide title"> Fig.3: Critical actions may appear infrequently in the collected data but are crucial for finding the optimal policy. </div>

As we will see shortly, to identify the support state-action pairs of the optimal policy, we will leverage some of the methods introduced earlier in the online RL segment of the workshop.

Policy constraint methods generally apply the $D_{KL}$ constraint within an actor-critic-like scheme to determine the optimal policy, i.e.:


$$
\begin{equation}
{\hat Q}^{\pi}_{k+1} \gets \arg \min_Q \mathbb{E}_{s,a \sim \mathcal{D}} \Big[\big(Q(s,a) - \mathcal{B}^{\pi}_k Q(s,a)\big)^2\Big] \quad \text{ with } \quad \mathcal{B}^{\pi}Q(s,a) = r(s,a) + {\gamma}\mathbb{E}_{s' \sim D, a' \sim \pi}\left[Q(s',a')\right]
\tag{Evaluation}
\end{equation}
$$

$$
\begin{equation}
\pi_{k+1} \leftarrow \arg \max_{\pi} \mathbb{E}_{s\sim D} \left[ \mathbb{E}_{a \sim\pi(a|s)} {\hat Q}^{\pi}_{k+1}(s, a) \right]
\tag{Improvement}
\end{equation}
$$

$$
\begin{equation}
D_{KL}(\pi(.|s)||\pi_{\beta}(.|s)) \leq \epsilon.
\tag{Constraint}
\end{equation}
$$

We could enforce this constraint through a Lagrange multiplier, or, as is sometimes done, absorb it as a penalty into the evaluation and improvement steps:

$$
{\hat Q}^{\pi}_{k+1} \leftarrow \arg \min_Q \mathbb{E}_{(s,a,s')\sim D} \left[\left( Q(s, a) - \left( r(s, a) + \gamma \mathbb{E}_{a' \sim\pi_k(a'|s')}[{\hat Q}^{\pi}_k(s', a')] -\alpha\gamma D_{KL}(\pi_k(\cdot|s') || \pi_\beta(\cdot|s')) \right) \right)^2 \right]
$$

$$
\pi_{k+1} \leftarrow \arg \max_{\pi} \mathbb{E}_{s\sim D} \left[ \mathbb{E}_{a \sim\pi(a|s)} \left[ {\hat Q}^{\pi}_{k+1}(s, a) \right] -\alpha\gamma D_{KL}(\pi(\cdot|s) || \pi_\beta(\cdot|s)) \right]
$$

The two formulations are not equivalent, but they produce similar results, and the penalty form is much easier to implement from a technical point of view. Strictly speaking, however, the explicit constraint is the more rigorous one.
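To make the penalty form concrete, here is a minimal sketch of the penalized evaluation target $r + \gamma \mathbb{E}_{a' \sim \pi_k}[Q(s',a')] - \alpha\gamma D_{KL}(\pi_k(\cdot|s') || \pi_\beta(\cdot|s'))$ for a single transition with discrete actions. The helper names and the numbers are hypothetical and only serve as an illustration.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for strictly positive discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def penalized_target(r, gamma, alpha, q_next, pi_k_next, pi_beta_next):
    """Bellman target at s' with the KL penalty absorbed into it."""
    expected_q = float(np.dot(pi_k_next, q_next))           # E_{a'~pi_k}[Q(s',a')]
    penalty = alpha * gamma * kl(pi_k_next, pi_beta_next)   # cost of leaving pi_beta
    return r + gamma * expected_q - penalty

# Hypothetical next-state quantities for a two-action problem.
q_next = np.array([1.0, 5.0])
pi_beta_next = np.array([0.9, 0.1])     # the data rarely contain action 1
pi_greedy_next = np.array([0.1, 0.9])   # the learned policy prefers action 1

target_on_data = penalized_target(0.0, 0.99, 1.0, q_next, pi_beta_next, pi_beta_next)
target_shifted = penalized_target(0.0, 0.99, 1.0, q_next, pi_greedy_next, pi_beta_next)
target_unpenalized = penalized_target(0.0, 0.99, 0.0, q_next, pi_greedy_next, pi_beta_next)
print(target_on_data, target_shifted, target_unpenalized)
```

The further $\pi_k$ moves away from $\pi_\beta$ at $s'$, the more the target is reduced relative to the unpenalized backup.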

This approach works quite well, but what happens if we need to deviate considerably from the behavior policy, as can happen in realistic situations where the data collected is far from optimal? In those cases, the $D_{KL}$ constraint could be too conservative.


Let's first analyze the simple example in the figure below:

<img src="_static/images/96_support_policy_constraint.png" alt="offline_rl_4" width=150%>

In this case, there are no out-of-distribution (o.o.d.) data, since we are restricted to a one-dimensional grid where all the states are available. Note that this is similar to fig.2(a), and in particular to the critical actions in fig.3, where some actions are much more likely than others. If we constrain through the $D_{KL}$ divergence, i.e. we force the policy probability distributions to be close, we end up imitating the bad behavior of $\pi_\beta(a|s)$ and cannot recover the optimal policy, as seen in panel (c). A smarter choice is to constrain only on the support of the behavior policy, as seen in the figure below.

<img src="_static/images/policy_constraint_vs_support.png" alt="offline_rl_4" width=500cm>
<div class="slide title"> Fig.4: distributional vs. support policy constraint </div>

In this way, o.o.d. actions like the ones in fig.2(b) are avoided, while actions that are in-distribution or close to it, as in fig.2(a) and fig.3, are allowed without any constraint on their probabilities. This implies that the learned probabilities are determined almost entirely by the evaluation-improvement process. By doing so, we can recover the optimal policy shown in panel (d) of the example above.
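The difference between the two kinds of constraint can be sketched numerically at a single state. The example below is an illustration under stated assumptions: the KL-constrained policy is represented by the standard closed-form solution of the KL-penalized objective, $\pi \propto \pi_\beta \exp(Q/\alpha)$, and the support constraint by a greedy choice among actions whose behavior probability exceeds a hypothetical threshold `min_prob`. All numbers are made up.

```python
import numpy as np

# Hypothetical single-state example: action 2 is rare but good, action 3 never appears.
pi_beta = np.array([0.90, 0.09, 0.01, 0.00])
q = np.array([0.0, 1.0, 5.0, 10.0])   # Q(s, a); the value of action 3 is an o.o.d. guess

def kl_penalized_policy(pi_beta, q, alpha):
    """Maximizer of E_pi[Q] - alpha * KL(pi || pi_beta): pi ~ pi_beta * exp(Q / alpha)."""
    w = pi_beta * np.exp((q - q.max()) / alpha)   # shift by max(Q) for numerical stability
    return w / w.sum()

def support_constrained_policy(pi_beta, q, min_prob):
    """Greedy policy restricted to actions with pi_beta(a|s) >= min_prob."""
    masked_q = np.where(pi_beta >= min_prob, q, -np.inf)
    pi = np.zeros_like(q)
    pi[np.argmax(masked_q)] = 1.0
    return pi

pi_kl = kl_penalized_policy(pi_beta, q, alpha=2.0)
pi_support = support_constrained_policy(pi_beta, q, min_prob=0.005)
print(np.round(pi_kl, 3), pi_support)
```

With these numbers the KL-regularized policy still puts most of its mass on the frequent but poor action 0, whereas the support-constrained policy picks the rare, high-value action 2. Both assign zero probability to the unseen action 3.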

#### b) Implicit 

**We don't need $\pi_\beta$, and we can work directly with our data $D$**. This situation often arises due to the lack of data or in complex, high-dimensional spaces where matching a policy to the real data distribution can be extremely challenging.

In this approach, nevertheless, we assume that we have a behavioral policy $\pi_\beta$ (which will be integrated out later), and our goal is to find a better one, $\pi$. One strategy is to maximize the difference reward:

$$
\begin{equation}
\eta(\pi) = J(\pi) - J(\pi_\beta) \quad \text{with} \quad J (\pi) = \mathbb{E}_{\tau \sim \pi}  \left[ \sum_{t = 0}^{\infty} \gamma^t r (s_t, a_t) \right]
\tag{2}
\end{equation}
$$

i.e., given the cumulative reward of our behavioral policy, $\pi_\beta$, try to increase the cumulative reward of the learned policy, $\pi$, as much as possible.

It can be shown that (2) can be written in a form similar to the one used in the derivation of Trust Region Policy Optimization (TRPO), to which we add the constraint that $\pi$ stays close to $\pi_\beta$:

$$
\begin{equation}
\eta(\pi) = \mathbb{E}_{s \sim d^{\pi}(s)} \mathbb{E}_{a \sim \pi(a|s)} [A^{\pi_\beta}(s, a)] \quad \text{s.t.} \quad D_{KL}(\pi(\cdot|s) || \pi_\beta(\cdot|s) ) \leq \epsilon
\tag{3}
\end{equation}
$$

The math to arrive at eq.3 is somewhat involved (see ref.1), but we can gain an intuitive understanding of its purpose. Eq.3 can be qualitatively understood as depicted in the figure below.


<img src="_static/images/96_difference_reward.png" alt="offline_rl_4" width=200%>
<div class="slide title"> Figure 5</div>



**In summary, eq.3 amounts to finding a policy $\pi(a|s)$ that generates state-action pairs $(s_0,a_0)$ (constrained to be close to the dataset distribution through the $D_{KL}$ divergence) that lead to trajectories in our dataset with maximum reward. In other words, by solving eq.3, we aim to compute a policy that closely approximates the optimal one based on a given dataset.**

Now we need to solve eq.3. It can be formulated as a constrained optimization problem in a Lagrangian formalism:

$$
\begin{equation}
L(\pi, \lambda) =  \mathbb{E}_{s \sim d^{\pi}(s)} \mathbb{E}_{a \sim \pi(a|s)} [A^{\pi_\beta}(s, a)] + \lambda \left( \epsilon -  D_{KL}(\pi(\cdot|s) || \pi_\beta(\cdot|s)) \right)
\tag{4}
\end{equation}
$$

and this can be maximized easily. After some algebra, we find that the optimal policy, $\pi^*(a|s)$, is given by:

$$
\pi^*(a|s) = \frac{1}{Z(s)} \pi_\beta(a|s) \exp\left(\frac{1}{\lambda} A^{\pi_\beta}(s, a)\right).
\tag{5}
$$
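The role of the multiplier $\lambda$ in this closed-form solution can be checked with a small numerical sketch. The behavior probabilities and advantages below are made-up numbers, and `optimal_constrained_policy` is a hypothetical helper name.

```python
import numpy as np

def optimal_constrained_policy(pi_beta, advantages, lam):
    """pi*(a|s) = pi_beta(a|s) * exp(A(s,a) / lam) / Z(s), computed in log space."""
    logits = np.log(pi_beta) + advantages / lam
    logits -= logits.max()            # avoids overflow for small lam
    w = np.exp(logits)
    return w / w.sum()                # dividing by the sum plays the role of 1/Z(s)

pi_beta = np.array([0.5, 0.3, 0.2])        # strictly positive behavior probabilities
advantages = np.array([-1.0, 0.0, 2.0])    # A^{pi_beta}(s, a)

pi_conservative = optimal_constrained_policy(pi_beta, advantages, lam=1e6)
pi_moderate = optimal_constrained_policy(pi_beta, advantages, lam=1.0)
pi_aggressive = optimal_constrained_policy(pi_beta, advantages, lam=0.1)
print(np.round(pi_conservative, 3), np.round(pi_moderate, 3), np.round(pi_aggressive, 3))
```

A large $\lambda$ (strong constraint) essentially reproduces $\pi_\beta$, while a small $\lambda$ concentrates the policy on the action with the highest advantage.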

This is what we expected from our previous intuition about eq.3. Eq.5 says that, up to the normalization constant $\frac{1}{Z(s)}$ and for a given state $s$, the optimal policy $\pi^*(a|s)$ assigns to an action $a$ **a probability proportional to the probability that this action belongs to the dataset, $\pi_\beta(a|s)$, times a factor $\exp\left(\frac{1}{\lambda} A^{\pi_\beta}(s, a)\right)$ that grows exponentially with the advantage $A^{\pi_\beta}(s, a)$, which in turn reflects the cumulative reward collected in the dataset starting from $(s,a)$ (see fig.5)**.

So far, this is a theoretical result. To compute $\pi^*(a|s)$ in practice, we approximate it with a DNN, denoted as $\pi_\theta$. To ensure that our DNN remains close to the theoretical solution $\pi^*$, we require $\pi_\theta$ and $\pi^*$ to be close distributions on the dataset, i.e. the probabilities assigned by $\pi_\theta$ to actions $a$ in state $s$ should be similar to those assigned by $\pi^*$:


$$
\begin{aligned}
\pi_\theta (a|s) &= \arg\min_{\pi_\theta} \mathbb{E}_{s \sim d^{\pi_\beta}(s)} \left[ D_{KL}(\pi^*(\cdot|s) \, \Vert \, \pi_\theta(\cdot|s)) \right] \\
&= \arg\max_{\pi_\theta} \mathbb{E}_{s\sim d^{\pi_\beta}(s)}\mathbb{E}_{a\sim\pi_\beta(a|s)} \left[ \frac{1}{Z(s)} \log \pi_\theta(a|s) \exp\left(\frac{1}{\lambda} A^{\pi_\beta}(s, a)\right) \right]
\end{aligned}
\tag{6}
$$

In Eq.6, we used the definition of the $D_{KL}$ divergence. Additionally, we used the forward $D_{KL}$ divergence, as is commonly done in stochastic variational inference approaches. This trick allows us to obtain an expectation value using the behavioral policy, which we can now approximate by sampling points from our collected dataset, **eliminating the need for the behavioral policy**. This is represented as:

$$
\pi_\theta (a|s) =
\arg\max_{\pi_\theta} \sum_{(s,a)\in D} \left[ \frac{1}{Z(s)} \log \pi_\theta(a|s) \exp\left(\frac{1}{\lambda} A^{D}(s, a)\right) \right]
\label{AWR}
\tag{7}
$$
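The sample-based objective above is a weighted maximum-likelihood problem, which can be sketched with a tabular softmax policy. This is a minimal illustration under several assumptions: the normalizer $Z(s)$ is dropped, the weights are clipped at a hypothetical `max_weight` for stability, and the toy dataset (a single state, two actions) and advantages are invented.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def awr_weights(advantages, lam, max_weight=20.0):
    """exp(A / lam), clipped for stability (the clipping is an assumption)."""
    return np.minimum(np.exp(advantages / lam), max_weight)

def awr_step(logits, states, actions, weights, lr):
    """One gradient-ascent step on mean_i w_i * log pi_theta(a_i | s_i)."""
    probs = np.exp(log_softmax(logits))
    grad = np.zeros_like(logits)
    for s, a, w in zip(states, actions, weights):
        grad[s] -= w * probs[s]   # d log pi(a|s) / d logits[s] = onehot(a) - pi(.|s)
        grad[s, a] += w
    return logits + lr * grad / len(states)

# Toy dataset: the behavior policy picks action 0 three times out of four,
# but the rare action 1 has a much higher advantage.
states = np.array([0, 0, 0, 0])
actions = np.array([0, 0, 0, 1])
advantages = np.array([-1.0, -1.0, -1.0, 2.0])

weights = awr_weights(advantages, lam=1.0)
logits = np.zeros((1, 2))         # table parameterizing pi_theta(a|s)
for _ in range(500):
    logits = awr_step(logits, states, actions, weights, lr=0.5)

pi_theta = np.exp(log_softmax(logits))[0]
print(np.round(pi_theta, 3))
```

Plain behavioral cloning (all weights equal to one) would assign probability 0.25 to action 1 here. The advantage weighting shifts most of the probability mass to it without ever querying $\pi_\beta$.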

Policy constraint methods are powerful, but they can often be too pessimistic, which is undesirable. For example, if we know that a certain state has all actions yielding zero reward, we should not constrain the policy in this state, as doing so could inadvertently affect our neural network approximator while forcing the learned policy to mimic the behavior policy in this irrelevant state. This overly pessimistic approach limits the quality of the policy we can learn from our dataset. This limitation becomes apparent when we constrain the learned policy through its probabilities or support. An alternative approach to avoid out-of-distribution (o.o.d) actions without directly constraining the policies is to control o.o.d behavior from a Q-function perspective, which we will explore in the next section.

Furthermore, since these methods rely on function approximation, they can run into problems when fitting a unimodal policy to multimodal data. In such cases, policy constraint methods may fail dramatically. In practice this is often less of a concern, since we usually work with some form of deep neural network (DNN), which is expressive enough to handle these complexities.

### II - Policy Regularization

Policy Regularization is an alternative approach to ensuring the robustness of learned value functions, specifically Q-functions. **This approach involves regularizing the value function directly, aiming to prevent overestimation, especially for actions that fall outside the distribution seen during training**.

It is versatile: it applies to different RL methods, including actor-critic and Q-learning methods, and, unlike the previous methods, it doesn't require an explicit model of the behavior policy.

Perhaps one of the most famous examples is the CQL (Conservative Q-Learning) algorithm, which adds the following term as a Q-value regularizer:

\begin{equation}
\mathcal{C}_{CQL_0}(\mathcal{D}, \theta) = \mathbb{E}_{s\sim \mathcal{D}, a\sim \mu(a|s)}[Q_{\theta}(s, a)] \tag{8}
\end{equation}


<img src="_static/images/96_CQL_1.png" alt="offline_rl_4" width=200%>
<div class="slide title"> Fig.6: Policy regularization approach </div>

As seen in fig.6, policy regularization methods introduce a minimal modification of the evaluation-improvement process:

$$
\hat{Q}^{k+1}_{\text{CQL}} \gets \arg\min_\theta \left[ \color{red} {\alpha\mathbb{E}_{s \sim \mathcal{D}, a \sim \mu}[Q_\theta(s,a)] } + \frac{1}{2} \mathbb{E}_{s,a \sim \mathcal{D}} \Big[\big(Q_\theta(s,a) - \mathcal{B}^{\pi}Q_\theta(s,a)\big)^2\Big] \right]. \tag{9}
$$

$$
\pi_{k+1} \leftarrow \arg \max_{\pi} \mathbb{E}_{s\sim D} \left[ \mathbb{E}_{a \sim\pi(a|s)} Q_\theta^{\hat{\pi}_{k+1}}(s, a) \right]
$$

The main idea is to choose a distribution $\mu(a|s)$ that seeks out the actions $a$ that maximize the values of our DNN $Q_\theta$, while at the same time we minimize the Q-function in the $\theta$ parameter space. This effect is particularly important for o.o.d. actions, which, as we saw before, are the ones whose values are generally overestimated. In-distribution actions, in contrast, are still estimated correctly through the standard evaluation term, which only involves data from $D$.

This is rigorously demonstrated in the CQL paper, where they establish that the solution to eq.9 provides a lower bound for Q(s,a). The policy $\mu$ doesn't necessarily have to be proportional to $\pi(a|s)$, but it should aim to maximize Q(s,a) consistently. We will explore various options for this later on.

## Short review of some popular offline RL algorithms

### Introduction

In this notebook, we will explore several key algorithms that aim to address distributional shift issues within offline reinforcement learning. It's worth noting that the field of offline RL is evolving rapidly, and this list is by no means exhaustive. Many of the concepts and strategies employed by these algorithms find applications and improvements in various other approaches.

A common approach followed by many algorithms in offline RL involves an actor-critic methodology. Within this framework, there is an iterative process of evaluation and improvement, characterized by:

$$
\begin{equation}
{\hat Q}^{\pi}_{k+1} \gets \arg \min_Q \mathbb{E}_{s,a \sim \mathcal{D}} \Big[\big(Q(s,a) - \mathcal{B}^{\pi}_k Q(s,a)\big)^2\Big].
\tag{Evaluation}
\end{equation}
$$

$$
\begin{equation}
\mathcal{B}^{\pi}Q = r + {\gamma}\mathbb{E}_{s' \sim D, a' \sim \pi}Q(s',a') 
\tag{Bellman backup op.}
\end{equation}
$$


$$
\begin{equation}
\pi_{k+1} \leftarrow \arg \max_{\pi} \mathbb{E}_{s\sim D} \left[ \mathbb{E}_{a \sim\pi(a|s)} {\hat Q}^{\pi}_{k+1}(s, a) \right] \tag{Improvement}
\end{equation}
$$



So the main idea is to modify the evaluation and improvement steps in order to mitigate the distributional shift problem.

### Batch Constrained deep Q-learning (BCQ) algorithm

The main idea is pictured in the figure below.

<img src="_static/images/97_BCQ_algo_1.png" alt="offline_rl_4" width=200%>
<div class="slide title"> Fig.7: BCQ approach to offline RL </div>


In these methods, the policies  $\pi$ and $\pi_\beta$ are not constrained by the $D_{KL}$ divergence, but we still ensure that $\pi(s)$ generates similar actions to $\pi_\beta(s)$ through a generative model, in this case, a VAE. Therefore, this method falls under the direct policy constraint approach discussed earlier.

$$
\pi(s) = \arg\max_{a_i} Q_\theta(s, a_i), \quad \{a_i \sim G_\omega(s)\}_{i=1}^n
\tag{10}
$$

The BCQ algorithm uses a clipped Double Deep Q-Learning (clipped-DDQ) approach to compute the Q-values:

$$
L(\theta_i, D) = \mathbb{E}_{ s,a,r,s' \sim D} \left[ \left( Q_{\theta_i}(s,a) - y(r,s') \right)^2 \right]
$$

with

$$
y(r,s') = r + \gamma \min_{i=1,2} Q_{\theta_i, \text{targ}} (s', a'(s'))
$$
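The target $y(r,s')$ can be sketched in a few lines for a batch of transitions. The `done` flag, which switches off bootstrapping at terminal states, is an assumption added here and does not appear in the formula above. The numbers are arbitrary.

```python
import numpy as np

def clipped_double_q_target(r, gamma, q1_next, q2_next, done):
    """y = r + gamma * min(Q1_targ(s', a'), Q2_targ(s', a')), no bootstrap if done."""
    min_q = np.minimum(q1_next, q2_next)   # pessimistic of the two target critics
    return r + gamma * (1.0 - done) * min_q

# Hypothetical batch of two transitions.
r = np.array([1.0, 0.0])
q1_next = np.array([10.0, 3.0])   # Q_{theta_1, targ}(s', a'(s'))
q2_next = np.array([8.0, 5.0])    # Q_{theta_2, targ}(s', a'(s'))
done = np.array([0.0, 1.0])       # the second transition is terminal

y = clipped_double_q_target(r, 0.9, q1_next, q2_next, done)
print(y)
```

Both critics $Q_{\theta_1}$ and $Q_{\theta_2}$ are then regressed towards the same target `y`.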

The minimum is taken to avoid the overestimation of Q-values, an issue that also occurs in these kinds of methods in online RL. In offline RL, as we saw, o.o.d. actions are the ones that typically produce such overestimations. So, clipped-DDQ also introduces a control on this issue at the Q-value level, a similar effect that policy regularization methods aim to achieve with a lower bound on Q-values.


**A few technical details**:

The actions in eq.10 are perturbed with some clipped noise $\epsilon$, as this also helps to avoid overestimation of Q-values:

$$
a \rightarrow \text{clip} [a + \text{clip}(\epsilon, -c, c), a_{low}, a_{high}]
$$

Perturbing actions with high Q-values introduces some uncertainty, so that neighboring regions of lower reward are also taken into account, which dampens overestimation effects.

Finally, as running a VAE during training can be computationally expensive, the algorithm introduces a perturbation model $\xi_\phi(s, a_i, \Phi)$, which outputs an adjustment to an action $a$ in the range $[-\Phi, \Phi]$. Therefore, eq.10 becomes:

$$
\pi(s) = \arg\max_{a_i + \xi_\phi(s, a_i, \Phi)} Q_\theta(s, a_i + \xi_\phi(s, a_i, \Phi)), \quad \{a_i \sim G_\omega(s)\}_{i=1}^n
$$


Note that if $\Phi=0$ and $n=1$, the policy resembles behavioral cloning.
At the opposite extreme, if $\Phi \rightarrow a_{max} - a_{min}$ and $n \rightarrow \infty$, the algorithm approaches Q-learning, as the policy begins to greedily maximize the value function over the entire action space.
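The action-selection rule, including the two limits just mentioned, can be sketched for a one-dimensional action space. Everything here is a toy stand-in: `generator` replaces the trained VAE decoder $G_\omega$, `perturbation` replaces the learned model $\xi_\phi$, and `q_value` replaces $Q_\theta$. None of them corresponds to the Tianshou implementation.

```python
import numpy as np

def generator(state, n, rng):
    """Hypothetical stand-in for G_omega: samples actions close to the behavior data."""
    return rng.normal(loc=0.2 * state, scale=0.05, size=n)

def perturbation(state, actions, phi):
    """Hypothetical stand-in for xi_phi: a bounded correction in [-phi, phi]."""
    return phi * np.tanh(1.0 - actions)

def q_value(state, actions):
    """Toy Q-function that peaks at action = 1."""
    return -(actions - 1.0) ** 2

def bcq_action(state, n, phi, rng, a_low=-1.0, a_high=1.0):
    candidates = generator(state, n, rng)                       # a_i ~ G_omega(s)
    candidates = candidates + perturbation(state, candidates, phi)
    candidates = np.clip(candidates, a_low, a_high)
    return float(candidates[np.argmax(q_value(state, candidates))])

# phi = 0 and n = 1: the policy simply replays what the generator proposes.
a_cloned = bcq_action(1.0, n=1, phi=0.0, rng=np.random.default_rng(0))
# phi > 0 and n > 1: candidates are nudged and the best one under Q is selected.
a_bcq = bcq_action(1.0, n=10, phi=0.3, rng=np.random.default_rng(0))
print(a_cloned, a_bcq)
```

In this toy setting the perturbed, Q-selected action moves towards the maximum of the Q-function while staying within $\Phi$ of the generated candidates.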

**Pros**: As it learns how to generate new actions not included in the dataset, it is suitable for small datasets and for unbalanced sets where a few underrepresented actions could be important for the task to be solved.

**Cons**: As BCQ generates actions from a VAE, if the dataset used to train it underrepresents some important actions, the VAE may not be able to generate meaningful actions around the corresponding states, which makes the discovery of new or unconventional actions difficult. This is one of the limitations of constrained policy approaches!

Let's take a look at the Tianshou BCQ policy.

### Conservative Q-Learning (CQL) algorithm

CQL follows a pessimistic approach by considering a lower bound of the Q-value. In the paper they show that the solution of:

$\hat{Q}^{k+1}_{\text{CQL}} \gets \hbox{argmin}_Q \left[ \color{red} {\alpha\big(\mathbb{E}_{s \sim \mathcal{D}, a \sim \mu}[Q(s,a)] - \mathbb{E}_{s,a \sim \mathcal{D}}[Q(s,a)]\big)} + \frac{1}{2} \mathbb{E}_{s,a \sim \mathcal{D}} \Big[\big(Q(s,a) - \mathcal{B}^{\pi}Q(s,a)\big)^2\Big] \right].$

for $\mu = \pi$ is a lower bound for the Q value.
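A tabular sketch of this objective may help to see how the red term acts. The function below evaluates the loss for a Q-table. The table, the dataset and the choice of $\mu$ as a softmax over the current Q-values are illustrative assumptions, and the Bellman targets are passed in as fixed numbers.

```python
import numpy as np

def cql_loss(q, q_target, states, actions, mu, alpha):
    """alpha * (E_{s~D, a~mu}[Q] - E_{(s,a)~D}[Q]) + 0.5 * mean squared Bellman error."""
    q_data = q[states, actions]
    push_down = np.mean(np.sum(mu[states] * q[states], axis=1))  # actions favored by mu
    push_up = np.mean(q_data)                                    # actions seen in the data
    bellman = 0.5 * np.mean((q_data - q_target) ** 2)
    return alpha * (push_down - push_up) + bellman

# One state, three actions; the dataset only contains actions 0 and 1.
states = np.array([0, 0])
actions = np.array([0, 1])
q_target = np.array([1.0, 1.0])           # fixed Bellman targets for the two samples

q_over = np.array([[1.0, 1.0, 10.0]])     # o.o.d. action 2 is heavily overestimated
mu = np.exp(q_over) / np.exp(q_over).sum(axis=1, keepdims=True)

q_fixed = np.array([[1.0, 1.0, 0.0]])     # same table with the o.o.d. value lowered

loss_over = cql_loss(q_over, q_target, states, actions, mu, alpha=1.0)
loss_fixed = cql_loss(q_fixed, q_target, states, actions, mu, alpha=1.0)
print(loss_over, loss_fixed)
```

The Bellman error is identical for both tables, since it only involves dataset actions. The entire difference in loss comes from the regularizer, which penalizes the overestimated o.o.d. value.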

The nice thing about this method is that it can be added to any actor-critic method in a few lines of code.

CQL focuses on **conservative value estimation** to provide lower bounds on the expected return of a policy. It aims to reduce overestimation bias and to ensure that the policy remains within a safe region of the state-action space. It achieves safe exploration by constructing action sets that cover a broader range of state-action pairs. This makes it well suited for scenarios where safety is a top priority, as it **reduces the risk of catastrophic actions**.

Note that BCQ may be better at discovering novel actions and at using the collected data efficiently, but it may not guarantee complete safety.

### Implicit Q-Learning (IQL) algorithm

This is another clever idea to avoid going out of distribution. Let's revisit the evaluation-improvement scheme, assuming that we only operate with state-action pairs from the dataset in a SARSA-style approach, i.e.:

$$
{\hat Q}_{k+1} \leftarrow \arg \min_Q \mathbb{E}_{(s,a,s',a')\sim D} \left[\left( Q(s, a) - \left( r(s, a) + \gamma{\hat Q}_k(s', a') \right) \right)^2 \right]
$$

$$
\pi_{k+1} \leftarrow \arg \max_{\pi} \mathbb{E}_{s\sim D} \left[ \mathbb{E}_{a \sim\pi(a|s)} {\hat Q}_{k+1}(s, a)  \right]
$$

This is indeed a valid approach. It's important to note that running the evaluation-improvement loop makes sense only once. During evaluation, we compute the $Q$-values of the behavior policy and derive the optimal policy based on those $Q$-values in the improvement step. Further iterations would be futile since we are limited to the fixed dataset.

However, this idea often falls short in finding an optimal policy for many real-world problems. Intuitively, if our data is suboptimal, the Q-values derived from that data will also be suboptimal.

The core principle of IQL is to utilize a pessimistic Q-value lower bound during evaluation, similar to policy regularization, while also ensuring consistency with in-distribution data. This strategy enables a multi-step process, facilitating multiple evaluation-improvement iterations. With each iteration, a new estimate for Q(s,a) is derived, encouraging a deeper exploration of the Q-functions and enabling the capture of broader correlations.

<img src="_static/images/96_one_step_vs_multiple_steps.png" alt="offline_rl_4" width=80%>
<div class="slide title"> Fig.8: one vs multiple step approaches.  </div>


These are the main steps involved in the IQL approach:

$$L_V(\psi) = E_{(s,a)\sim D}[L_2^{\tau}(Q_{\hat{\theta}}(s, a) - V_{\psi}(s))]$$

$$L_Q(\theta) = E_{(s,a,s') \sim D}\left[(r(s, a) + \gamma V_{\psi}(s') - Q_{\theta}(s, a))^2\right]$$

and for the policy improvement step, it uses an advantage weighted regression:

$$L_\pi(\phi) = E_{(s,a)\sim D} \left[\exp(\beta(Q_{\hat{\theta}}(s, a) - V_{\psi}(s))) \log \pi_{\phi}(a|s)\right]
$$

similar to eq.7. Here $L_2^{\tau}(u) = |\tau - \mathbb{1}(u < 0)| \, u^2$ is the expectile regression loss, and the lower bound used is the 'expectile' shown in the figure below.


<img src="_static/images/96_expectile.png" alt="offline_rl_4" width=80%>
<div class="slide title"> Fig.9: Expectile of a two-dimensional random variable.  </div>
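To get a feeling for the expectile, the following sketch fits a scalar value $v$ to a handful of numbers by gradient descent on the asymmetric squared loss $L_2^{\tau}(u) = |\tau - \mathbb{1}(u < 0)| \, u^2$ with $u = x - v$. The samples play the role of $Q_{\hat\theta}(s,a)$ for the dataset actions at one state and are made up. The helper names, the learning rate and the number of steps are arbitrary choices.

```python
import numpy as np

def expectile_loss(u, tau):
    """L_2^tau(u) = |tau - 1(u < 0)| * u^2, applied elementwise."""
    weight = np.where(u < 0, 1.0 - tau, tau)
    return weight * u ** 2

def fit_expectile(samples, tau, lr=0.1, steps=2000):
    """Minimizes mean(expectile_loss(samples - v, tau)) over the scalar v."""
    v = float(np.mean(samples))
    for _ in range(steps):
        u = samples - v
        weight = np.where(u < 0, 1.0 - tau, tau)
        grad = -2.0 * np.mean(weight * u)   # derivative of the mean loss w.r.t. v
        v -= lr * grad
    return v

# Hypothetical Q-values of the actions observed in the dataset at one state.
q_samples = np.array([0.0, 1.0, 2.0, 10.0])

v_mean = fit_expectile(q_samples, tau=0.5)   # symmetric loss: recovers the mean
v_high = fit_expectile(q_samples, tau=0.9)   # upper expectile: pulled towards the best action
print(v_mean, v_high)
```

For $\tau = 0.5$ the loss is symmetric and $v$ is the mean. As $\tau$ approaches 1, $v$ moves towards the largest value among the sampled actions without ever evaluating actions outside the dataset.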



### Q-Transformer

ToDo: Add theory and show results on blog.

## References

[Schulman et al. 2017 - Trust Region Policy Optimization](https://arxiv.org/pdf/1502.05477.pdf)

[Kumar et al. 2020 - Conservative Q-Learning for Offline Reinforcement Learning](https://arxiv.org/pdf/2006.04779.pdf)

[Levine et al. 2021 - Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems](https://arxiv.org/pdf/2005.01643.pdf)

[Peng et al. 2019 - Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning](https://arxiv.org/abs/1910.00177)

[Nair et al. 2020 - AWAC: Accelerating Online Reinforcement Learning with Offline Datasets](https://arxiv.org/abs/2006.09359)


## ToDo LIST: 

1 - Is IQL added to tianshou??

2 - Maybe we could give an exercise about DKL ....

ToDo: see Bootstrapping Error Accumulation Reduction (BEAR) and why is not the state of the art ?

ToDo: Include the simple exercise in figure above 3?

Page 22 Levine: example of support constrain in discrete space...and discussion there.

( TODO: See discussion in page 20 - Levine).
