In [None]:
%%capture

%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_rl

In [None]:
%presentation_style

In [None]:
%%capture

%set_random_seed 12

In [None]:
%load_latex_macros


$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$
$\newcommand{\amax}{{\text{argmax}}}$
$\newcommand{\P}{{\mathbb{P}}}$
$\newcommand{\E}{{\mathbb{E}}}$
$\newcommand{\R}{{\mathbb{R}}}$
$\newcommand{\Z}{{\mathbb{Z}}}$
$\newcommand{\N}{{\mathbb{N}}}$
$\newcommand{\C}{{\mathbb{C}}}$
$\newcommand{\abs}[1]{{ \left| #1 \right| }}$
$\newcommand{\simpl}[1]{{\Delta^{#1} }}$


# Review of some famous offline RL algorithms

In this notebook, we will explore several key algorithms that aim to address distributional shift issues within offline reinforcement learning. It's worth noting that the field of offline RL is evolving rapidly, and this list is by no means exhaustive. Many of the concepts and strategies employed by these algorithms find applications and improvements in various other approaches.

A common approach followed by many algorithms in offline RL involves an actor-critic methodology. Within this framework, there is an iterative process of evaluation and improvement, characterized by:

$$
\begin{equation}
\hat{Q}^{\pi}_{k+1} \gets \arg \min_Q \mathbb{E}_{s,a \sim \mathcal{D}} \Big[\big(Q(s,a) - \mathcal{B}^{\pi_k} \hat{Q}^{\pi}_{k}(s,a)\big)^2\Big],
\tag{Evaluation}
\end{equation}
$$

where the Bellman backup operator is

$$
\mathcal{B}^{\pi}Q(s,a) = r + \gamma\,\mathbb{E}_{s' \sim \mathcal{D}, a' \sim \pi}Q(s',a')
$$

and

$$
\begin{equation}
\pi_{k+1} \leftarrow \arg \max_{\pi} \mathbb{E}_{s\sim \mathcal{D}} \left[ \mathbb{E}_{a \sim\pi(a|s)} \hat{Q}^{\pi}_{k+1}(s, a) \right] \tag{Improvement}
\end{equation}
$$

The main idea of the algorithms below is to modify the Evaluation and/or Improvement steps in order to mitigate the distributional shift problem.
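The evaluation/improvement loop above can be sketched in a tabular setting. This is only an illustrative sketch: the tiny dataset, the function names `evaluate` and `improve`, and the use of a greedy improvement step are assumptions made for the example, not part of any specific algorithm discussed here. In the tabular case, the least-squares minimizer of the Evaluation step is simply the average backup target for each observed $(s, a)$ pair.

```python
import numpy as np

def evaluate(Q, pi, dataset, gamma=0.9, sweeps=50):
    """Approximate the Evaluation step on a fixed dataset.

    Each sweep regresses Q(s, a) onto r + gamma * E_{a' ~ pi} Q(s', a'),
    using only transitions that appear in the dataset.
    """
    for _ in range(sweeps):
        target_sum = np.zeros_like(Q)
        counts = np.zeros_like(Q)
        for s, a, r, s_next in dataset:
            target_sum[s, a] += r + gamma * pi[s_next] @ Q[s_next]
            counts[s, a] += 1
        seen = counts > 0
        Q = Q.copy()
        Q[seen] = target_sum[seen] / counts[seen]  # unseen pairs keep their old value
    return Q

def improve(Q):
    """Improvement step: a greedy (one-hot) policy with respect to Q."""
    pi = np.zeros_like(Q)
    pi[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0
    return pi

# Hypothetical dataset of (s, a, r, s') transitions; the pair (s=1, a=1) is never observed.
dataset = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 0.5, 1)]
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from a uniform policy

for k in range(5):
    Q = evaluate(Q, pi, dataset)
    pi = improve(Q)

print(pi)
```

Note that $Q(1, 1)$ is never updated because that pair does not appear in the data, so it keeps its initial value. With an optimistic initialization (or a function approximator that extrapolates upward), the greedy step would select exactly this unsupported action, which is the overestimation problem the algorithms below try to control.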

Batch Constrained deep Q-learning (BCQ) algorithm
---

The BCQ algorithm addresses the distributional shift problem, and in particular the issue mentioned before that arises during the Q-value evaluation step, whose target is the Bellman backup:

$$
\mathcal{B}^{\pi}Q(s,a) = r + \gamma\,\mathbb{E}_{s' \sim \mathcal{D}, a' \sim \pi}Q(s',a'), \tag{1}
$$

where the Q-values of out-of-distribution actions $a'$ may be overestimated.

BCQ proposes to tackle this problem by training a Variational Autoencoder (VAE) $G_\omega(s')$ on the given dataset in order to generate plausible actions $a'$ for the Q-value target, i.e. actions that have a high likelihood under the probability distribution describing our dataset, given the state $s'$ (this is why BCQ belongs to the direct policy-constraint algorithms). The idea is to generate $n$ candidate actions for $s'$ and select the one that maximizes the Q-value.

To be general enough, BCQ perturbs the generated actions with a perturbation model $\xi_\phi(s, a_i, \Phi)$ that learns how to adjust each of the $n$ actions within the range $[-\Phi, \Phi]$ and is trained to maximize the Q-function (this also makes the algorithm more efficient, as fewer VAE samples are needed). In other words, BCQ proposes the following for the improvement step (the details of the evaluation step are not so relevant here, but you are welcome to explore the original paper):

$$ 
\begin{equation}
\begin{aligned}
\pi(s) = \arg\max_{a_i + \xi_\phi(s, a_i, \Phi)} Q_\theta(s, a_i + \xi_\phi(s, a_i, \Phi)), 
\\ \{a_i \sim G_\omega(s)\}_{i=1}^n 
\end{aligned}
\end{equation}
$$

$$
\phi \leftarrow \underset{\phi}{\arg\max} \sum_{a \sim G_{\omega}(s)} Q_{\theta_1}(s, a + \xi_\phi(s, a, \Phi))
$$

Note that if $\Phi=0$ and $n=1$, the policy will resemble behavioral cloning.
At the opposite extreme, if $\Phi \rightarrow a_{max} - a_{min}$ and $n \rightarrow \infty$, the algorithm approaches Q-learning, as the policy begins to greedily maximize the value function over the entire action space.
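The action-selection rule above can be sketched as follows. This is a minimal illustration under stated assumptions: `generator`, `perturb` and `q_fn` are hypothetical stand-ins for the trained VAE $G_\omega$, the perturbation model $\xi_\phi$ and the critic $Q_\theta$, and the perturbation is restricted to $[-\Phi, \Phi]$ by simple clipping rather than by the network parametrization used in the paper.

```python
import numpy as np

def bcq_select_action(state, generator, perturb, q_fn,
                      n=10, phi=0.05, a_min=-1.0, a_max=1.0):
    """Sample n candidate actions, perturb them within [-phi, phi], return the best one."""
    # Candidates a_i ~ G_omega(s), stacked into shape (n, action_dim).
    candidates = np.stack([generator(state) for _ in range(n)])
    # Perturbation xi_phi(s, a_i, Phi), limited to [-phi, phi] (clipping is an assumption).
    xi = np.clip(perturb(state, candidates), -phi, phi)
    perturbed = np.clip(candidates + xi, a_min, a_max)
    # Pick the perturbed candidate with the highest Q-value.
    q_values = q_fn(state, perturbed)
    return perturbed[int(np.argmax(q_values))]

# Toy stand-ins (all hypothetical): the "dataset" actions cluster around 0.2,
# while the toy critic prefers actions close to 0.8.
rng = np.random.default_rng(0)
generator = lambda s: rng.normal(loc=0.2, scale=0.05, size=1)
perturb = lambda s, a: np.ones_like(a)          # always pushes upward, then gets clipped
q_fn = lambda s, a: -(a[:, 0] - 0.8) ** 2

state = np.zeros(3)
action = bcq_select_action(state, generator, perturb, q_fn, n=10, phi=0.05)
print(action)
```

With `phi=0.0` and `n=1` the function simply returns the generator's sample, which corresponds to the behavioral cloning limit mentioned above. Increasing `phi` and `n` lets the selected action drift further from the data toward whatever the critic prefers.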

**Pros**: Because it learns to generate new actions that are not included in the dataset,
it is suitable for small datasets and for unbalanced datasets where a few underrepresented actions
could be important for the task to be solved. 

**Cons**: Because BCQ generates actions from a VAE, if the dataset used to train it underrepresents some important actions, the VAE may not be able to generate meaningful actions around the corresponding states, so discovering new or unconventional actions can be hard. This is one of the limitations of constrained-policy approaches!

Conservative Q-Learning (CQL) algorithm
---

CQL follows a pessimistic approach by considering a lower bound of the Q-value. The paper shows that the solution of:

$\hat{Q}^{k+1}_{\text{CQL}} \gets \hbox{argmin}_Q \left[ \color{red} {\alpha\big(\mathbb{E}_{s \sim \mathcal{D}, a \sim \mu}[Q(s,a)] - \mathbb{E}_{s,a \sim \mathcal{D}}[Q(s,a)]\big)} + \frac{1}{2} \mathbb{E}_{s,a \sim \mathcal{D}} \Big[\big(Q(s,a) - \mathcal{B}^{\pi}Q(s,a)\big)^2\Big] \right].$

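for $\mu = \pi$ yields a lower bound on the expected value of the policy, $\mathbb{E}_{a \sim \pi}[Q^{\pi}(s,a)]$, provided $\alpha$ is large enough. The regularizer (in red) pushes down the Q-values of actions sampled from $\mu$ and pushes up the Q-values of actions in the dataset.

A nice property of this method is that it can be added to any actor-critic method with a few lines of code.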
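To illustrate how little code the regularizer needs, here is a minimal sketch of the CQL objective above for a batch with discrete actions. The function name `cql_loss`, its argument layout, and the assumption that the backup targets $\mathcal{B}^{\pi}Q(s,a)$ are precomputed and passed in as `td_targets` are choices made for this example; practical implementations typically work with function approximators and automatic differentiation.

```python
import numpy as np

def cql_loss(q_values, actions, td_targets, mu_probs, alpha=1.0):
    """CQL objective for one batch (discrete actions).

    q_values:   (B, A) array with Q(s, .) for each state in the batch
    actions:    (B,)   integer actions taken in the dataset
    td_targets: (B,)   precomputed backup targets B^pi Q(s, a)
    mu_probs:   (B, A) action probabilities mu(a | s)
    """
    idx = np.arange(len(actions))
    q_data = q_values[idx, actions]                      # Q(s, a) for (s, a) ~ D
    # Conservative term: E_{s~D, a~mu}[Q(s, a)] - E_{s,a~D}[Q(s, a)]
    penalty = (mu_probs * q_values).sum(axis=1).mean() - q_data.mean()
    # Standard Bellman error term: 1/2 * E[(Q - B^pi Q)^2]
    bellman = 0.5 * np.mean((q_data - td_targets) ** 2)
    return alpha * penalty + bellman

# Hypothetical batch of two states with two actions each.
q_values = np.array([[1.0, 3.0], [2.0, 0.0]])
actions = np.array([0, 0])
td_targets = np.array([1.0, 2.0])
mu_probs = np.array([[0.0, 1.0], [1.0, 0.0]])  # mu prefers the highest-Q action in each state

print(cql_loss(q_values, actions, td_targets, mu_probs, alpha=2.0))
```

Compared with a plain Bellman-error loss, the only addition is the `penalty` line, which is why the method is easy to bolt onto an existing critic update.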

CQL focuses on **conservative value estimation**, providing lower bounds on the expected return of a policy. It aims to reduce overestimation bias and to keep the policy within the region of the state-action space that is supported by the data. Rather than restricting the policy to generated actions as BCQ does, it penalizes the Q-values of actions that are poorly covered by the dataset. This makes it well suited for scenarios where safety is a top priority, as it **reduces the risk of catastrophic actions**.

Note that BCQ may be better at discovering novel actions and at using the collected data more efficiently, but it may not guarantee complete safety!

Implicit Q-Learning (IQL) algorithm
---
In this case, another interesting lower bound on the Q-value is introduced to make the estimate more pessimistic, as in point 4. See the [paper](https://openreview.net/pdf?id=68n2s9ZJWF8) for more details.
