In [None]:
%%capture

%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_rl

In [None]:
%presentation_style

In [None]:
%%capture

%set_random_seed 12

In [None]:
%load_latex_macros


$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$
$\newcommand{\amax}{{\text{argmax}}}$
$\newcommand{\P}{{\mathbb{P}}}$
$\newcommand{\E}{{\mathbb{E}}}$
$\newcommand{\R}{{\mathbb{R}}}$
$\newcommand{\Z}{{\mathbb{Z}}}$
$\newcommand{\N}{{\mathbb{N}}}$
$\newcommand{\C}{{\mathbb{C}}}$
$\newcommand{\abs}[1]{{ \left| #1 \right| }}$
$\newcommand{\simpl}[1]{{\Delta^{#1} }}$


# Approaches in Offline RL to Address Distributional Shift

There are various approaches, but the core idea is to strike a balance: the learned policy should stay reasonably close to the behavior policy while still improving on it. This means introducing some distributional shift to improve the policy without going out of distribution, so that the effective sample size remains large enough to be representative at inference time. Achieving this balance is challenging and remains a highly active area of research in the RL community.

To attain the aforementioned goal, offline RL algorithms can be classified into three primary categories:

**I - Policy constraint**

**II - Policy Regularization**

**III - Importance sampling**

### I: **Policy constraint:** 

**I-a) Non-implicit or direct: we have access to the behavior policy**, $\bf \pi_\beta$. For instance, it could be a suboptimal classical (i.e. non-RL) policy, or one obtained through behavioral cloning on a given dataset.

As we already have $\pi_\beta$, we can constrain the learned policy to stay close to the behavior policy through:

\begin{align*}
D_{KL}(\pi(a|s)||\pi_{\beta}(a|s)) \leq \epsilon
\end{align*}

and, as shown in (ref. 1), this bounds $D_{KL}(d_{\pi}(s)||d_{\pi_{\beta}}(s))$ by some $\delta$ that is $O\left(\frac{\epsilon}{{(1 - \gamma)}^2}\right)$. Here $d_{\pi}(s)$ is the state visitation frequency induced by the policy $\pi$. In summary, keeping $\pi$ close to $\pi_\beta$ guarantees that $d_{\pi}(s)$ and $d_{\pi_{\beta}}(s)$ are also close, so the states visited during data collection will be similar to the ones encountered at inference time.
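As a minimal sketch of the constraint above, the following computes the per-state KL divergence between two tabular policies and checks it against a threshold. The policies, the number of states and actions, and the value of `epsilon` are made-up values for illustration only.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions stored along the last axis."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

# Hypothetical tabular policies: 3 states (rows), 4 actions (columns).
pi_beta = np.array([[0.70, 0.10, 0.10, 0.10],
                    [0.25, 0.25, 0.25, 0.25],
                    [0.10, 0.10, 0.40, 0.40]])
pi = np.array([[0.60, 0.20, 0.10, 0.10],
               [0.10, 0.10, 0.10, 0.70],
               [0.10, 0.10, 0.40, 0.40]])

epsilon = 0.1  # assumed constraint threshold
kl_per_state = kl_divergence(pi, pi_beta)
constraint_satisfied = kl_per_state <= epsilon

for s, (kl, ok) in enumerate(zip(kl_per_state, constraint_satisfied)):
    print(f"state {s}: KL = {kl:.4f}, within epsilon: {ok}")
```

In this toy example only the second state violates the constraint, because there the learned policy concentrates on an action that the behavior policy takes only a quarter of the time.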

Methods of this kind typically impose the constraint within an actor-critic scheme, i.e.:

$$
{\hat Q}^{\pi}_{k+1} \leftarrow \arg \min_Q \mathbb{E}_{(s,a,s')\sim D} \left[ \left( Q(s, a) - \left( r(s, a) + \gamma \mathbb{E}_{a' \sim\pi_k(a'|s')}[{\hat Q}^{\pi}_k(s', a')] \right) \right)^2 \right]
$$

$$
\begin{aligned}
\pi_{k+1} \leftarrow \arg \max_{\pi} \; & \mathbb{E}_{s\sim D} \left[ \mathbb{E}_{a \sim\pi(a|s)} \left[ {\hat Q}^{\pi}_{k+1}(s, a) \right] \right] \\
& \text{s.t. } D(\pi, \pi_{\beta}) \leq \epsilon.
\end{aligned}
$$


We could also add the constraint as a penalty in both the evaluation and improvement steps (what is the difference?):

$$
{\hat Q}^{\pi}_{k+1} \leftarrow \arg \min_Q \mathbb{E}_{(s,a,s')\sim D} \left[ \left( Q(s, a) - \left( r(s, a) + \gamma \mathbb{E}_{a' \sim\pi_k(a'|s')}[{\hat Q}^{\pi}_k(s', a')] -\alpha\gamma D(\pi_k(\cdot|s'), \pi_\beta(\cdot|s')) \right) \right)^2 \right]
$$

$$
\pi_{k+1} \leftarrow \arg \max_{\pi} \mathbb{E}_{s\sim D} \left[ \mathbb{E}_{a \sim\pi(a|s)} \left[ {\hat Q}^{\pi}_{k+1}(s, a) \right] -\alpha\gamma D(\pi(\cdot|s), \pi_\beta(\cdot|s)) \right]
$$
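To make the penalized evaluation step more concrete, here is a small tabular sketch of the regression target used for the Q-function, with the divergence taken to be the KL divergence. The Q-table, the policies, the transitions and the values of `gamma` and `alpha` are all hypothetical; the point is only to show that the target is lowered in next states where $\pi_k$ disagrees with $\pi_\beta$.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) along the last axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def bellman_target(r, s_next, Q, pi_k, pi_beta, gamma=0.99, alpha=0.0):
    """r + gamma * E_{a'~pi_k}[Q(s', a')] - alpha * gamma * KL(pi_k(.|s') || pi_beta(.|s'))."""
    expected_q = np.sum(pi_k[s_next] * Q[s_next], axis=-1)
    penalty = alpha * kl(pi_k[s_next], pi_beta[s_next])
    return r + gamma * (expected_q - penalty)

# Hypothetical problem with 2 states and 2 actions.
Q = np.array([[1.0, 2.0],
              [0.5, 3.0]])
pi_beta = np.array([[0.5, 0.5],
                    [0.9, 0.1]])
pi_k = np.array([[0.5, 0.5],    # agrees with pi_beta in state 0
                 [0.1, 0.9]])   # deviates strongly in state 1

# Two transitions from the dataset: rewards and next-state indices.
r = np.array([0.0, 1.0])
s_next = np.array([0, 1])

plain = bellman_target(r, s_next, Q, pi_k, pi_beta, alpha=0.0)
penalized = bellman_target(r, s_next, Q, pi_k, pi_beta, alpha=1.0)
print("unpenalized targets:", plain)
print("penalized targets:  ", penalized)
```

With the penalty, the value of deviating from the behavior policy is propagated backwards through the Q-function, rather than being enforced only at the current state during policy improvement.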


However, in some situations we will need to deviate considerably from the behavior policy to find optimal actions and the $D_{KL}$ constraint could be too conservative.

To overcome this issue, another approach is to constrain the policies only in their support, i.e. in the region of action space where they assign non-zero probability, as seen in the figure below.

<img src="_static/imagespolicy_constraint_vs_support.png" alt="offline_rl_4" width=500cm>

ToDo: Give an example of support matching!! --> see 2023 review.

**I-b) Implicit: We don't need $\pi_\beta$, and we can work directly with our data $D$**. This is often the case in practice: with little data, or in complex high-dimensional spaces, cloning a policy that matches the real data distribution can be extremely hard.

In this approach we assume that the data was generated by some behavior policy $\mu$ (which will be integrated out later) and we want to find a better policy $\pi$. One way to do this is to maximize the difference in expected return:

$
\begin{equation}
\eta(\pi) = J(\pi) - J(\mu) \quad \hbox{with} \quad J (\pi) = \mathbb{E}_{\tau \sim \pi}  \left[ \sum_{t = 0}^{\infty} \gamma^t r (s_t, a_t) \right] 
\tag{1}
\end{equation}
$

It can be shown that (1) can be rewritten as follows (similar to the derivation of Trust Region Policy Optimization (TRPO)), to which we add a constraint that keeps $\pi$ close to $\mu$:

$
\begin{equation}
\eta(\pi) = \mathbb{E}_{s \sim d^{\pi}(s)} \mathbb{E}_{a \sim \pi(a|s)} [A^{\mu}(s, a)] \\ \text{s.t.} \quad D(\pi(\cdot|s) || \mu(\cdot|s) ) \leq \epsilon
\tag{2}
\end{equation}
$

This makes sense intuitively: by maximizing (2) we look for the state-action pairs $(s,a)$ generated by $\pi$ that have maximum advantage $A^\mu(s,a)$, i.e. those that lead to the trajectories with maximum cumulative reward in the dataset. In other words, the best trajectories in our dataset! However, we need to keep the $(s,a)$ pairs close to the dataset, which is the reason for the divergence constraint.

At this point (2) can be written as a constrained optimization problem using a Lagrangian:

$
\begin{equation}
L(\pi, \lambda) =  \mathbb{E}_{s \sim d^{\pi}(s)} \mathbb{E}_{a \sim \pi(a|s)} [A^{\mu}(s, a)] + \lambda \left( \epsilon -  D(\pi(\cdot|s) || \mu(\cdot|s)) \right)
\tag{3}
\end{equation}
$

This can be maximized over $\pi$ in closed form; after some algebra we find that:

$
\pi^*(a|s) = \frac{1}{Z(s)} \mu(a|s) \exp\left(\frac{1}{\lambda} A^{\mu}(s, a)\right) \tag{4}
$
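The effect of the temperature $\lambda$ in (4) can be illustrated numerically for a single state. The sketch below assumes a discrete action space with $\mu(a|s) > 0$ for every action; the behavior probabilities and advantages are made-up numbers.

```python
import numpy as np

def advantage_weighted_policy(mu, advantages, lam):
    """pi*(a|s) proportional to mu(a|s) * exp(A(s,a) / lam), for one state.

    Assumes mu > 0 everywhere; the computation is done in log space for stability.
    """
    logits = np.log(mu) + advantages / lam
    logits = logits - logits.max()          # avoid overflow in exp
    unnormalized = np.exp(logits)
    return unnormalized / unnormalized.sum()  # division by Z(s)

# Hypothetical behavior policy and advantages for 3 actions in one state.
mu = np.array([0.5, 0.3, 0.2])
advantages = np.array([-1.0, 0.0, 2.0])

policies = {lam: advantage_weighted_policy(mu, advantages, lam)
            for lam in (100.0, 1.0, 0.1)}
for lam, pi_star in policies.items():
    print(f"lambda = {lam:6.1f} -> pi* = {np.round(pi_star, 3)}")
```

A large $\lambda$ (strong constraint) leaves $\pi^*$ almost equal to $\mu$, while a small $\lambda$ concentrates nearly all probability on the action with the highest advantage.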

So $\pi^*$ is the optimal policy. We can now approximate it with a deep neural network $\pi_\theta$, again by requiring that $\pi_\theta$ and $\pi^*$ be close distributions on the dataset, i.e.


$$
\begin{equation}
\begin{aligned}
\pi_\theta &= \arg\min_{\pi_\theta} \mathbb{E}_{s \sim d^{\mu}(s)} \left[ D(\pi^*(\cdot|s) \, \Vert \, \pi_\theta(\cdot|s)) \right] \\
&= \arg\max_{\pi_\theta} \mathbb{E}_{s\sim d^{\mu}(s)}\mathbb{E}_{a\sim\mu(a|s)} \left[ \log \pi_\theta(a|s) \exp\left(\frac{1}{\lambda} A^{\mu}(s, a)\right) \right]
\end{aligned}
\end{equation}
$$


Again the intuition is simple: pairs $(s,a)$ with high advantage, i.e. those likely to lead to high-return trajectories in the dataset, will be preferred by $\pi_\theta(a | s)$.
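The objective above is a weighted maximum-likelihood problem, which the following sketch illustrates for a single state with a tabular policy. The dataset, the temperature `lam` and the weight clipping value `max_weight` are assumptions made for this example; clipping is a common practical safeguard, not part of the derivation.

```python
import numpy as np

def awr_weights(advantages, lam, max_weight=20.0):
    """exp(A / lam), clipped for numerical stability (clip value is an assumption)."""
    return np.minimum(np.exp(np.asarray(advantages) / lam), max_weight)

def awr_loss(log_probs, advantages, lam):
    """Negative advantage-weighted log-likelihood averaged over dataset samples."""
    return -np.mean(awr_weights(advantages, lam) * log_probs)

# Hypothetical dataset for one state: actions taken by mu and their advantages.
actions = np.array([0, 0, 1, 1, 2])
advantages = np.array([-1.0, -0.5, 0.0, 0.5, 2.0])
lam = 1.0
n_actions = 3

# Behavioral cloning: empirical action frequencies (all weights equal to 1).
pi_bc = np.bincount(actions, minlength=n_actions) / len(actions)

# Tabular maximizer of the weighted log-likelihood: weight-normalized frequencies.
w = awr_weights(advantages, lam)
pi_awr = np.bincount(actions, weights=w, minlength=n_actions) / w.sum()

loss_bc = awr_loss(np.log(pi_bc[actions]), advantages, lam)
loss_awr = awr_loss(np.log(pi_awr[actions]), advantages, lam)
print("behavioral cloning policy:", np.round(pi_bc, 3), "loss:", round(loss_bc, 3))
print("advantage-weighted policy:", np.round(pi_awr, 3), "loss:", round(loss_awr, 3))
```

With a neural network policy the same loss would be minimized by gradient descent on $\theta$, using the log-probabilities that the network assigns to the dataset actions.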



In (2) we are computing the expected value of the advantage of the policy $\mu$, $A^\mu(s,a)$, i.e. the mean cumulative reward obtained by following $\mu$ after starting from a state-action pair sampled from $\pi$, relative to the value $V^\mu(s)$.



It can be shown that given two policies $\pi$ and $\mu$ the following general result holds:

$\eta(\pi) = \mathbb{E}_{s \sim d^{\pi}(s)} \mathbb{E}_{a \sim \pi(a|s)} [A^{\mu}(s, a)] = \mathbb{E}_{s \sim d^{\pi}(s)} \mathbb{E}_{a \sim \pi(a|s)} \left[ R^{\mu}_{s,a} - V^{\mu}(s) \right]
$

Approximating $d^{\pi}(s)$ by $d^{\mu}(s)$, so that states can be sampled from the dataset, gives the constrained problem:

$
\begin{align*}
\underset{\pi}{\text{arg max}} \int d^{\mu}(s) \int \pi(a|s) \left( R^{\mu}_{s,a} - V^{\mu}(s) \right) \, da \, ds \tag{5}\\
\text{s.t.} \int d^{\mu}(s) \, D_{KL}(\pi(\cdot|s) || \mu(\cdot|s)) \, ds \leq \epsilon
\end{align*}
$

This derivation comes from the off-policy AWR algorithm, but there are slightly different implementations, such as AWAC, which uses an off-policy Q-function $Q^\pi$ to estimate the advantage.

Policy constraint methods are powerful, but they can often be too pessimistic, which is undesirable. For instance, if we know that all actions in a certain state have zero reward, we should not care about constraining the policy in that state, since forcing the learned policy to stay close to the behavior policy in this irrelevant state can inadvertently affect our neural network approximator. By being too pessimistic we effectively limit how good a policy we can learn from our dataset.

Also, since these methods rely on function approximation, issues can arise, for instance, when we fit a unimodal policy to multimodal data. In that case, policy constraint methods can fail dramatically.
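The unimodal versus multimodal issue can be seen in a toy one-dimensional example. The sketch below assumes a synthetic dataset in which the behavior policy picks actions around $-1$ or $+1$ in a given state, and fits a single Gaussian to it by maximum likelihood; the modes, spreads and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic bimodal behavior data: actions cluster around -1 and +1.
actions = np.concatenate([rng.normal(-1.0, 0.1, size=500),
                          rng.normal(+1.0, 0.1, size=500)])

# Maximum-likelihood fit of a unimodal (Gaussian) policy.
mean, std = actions.mean(), actions.std()

# Fraction of dataset actions that lie near the fitted mean.
frac_near_mean = np.mean(np.abs(actions - mean) < 0.3)

print(f"fitted Gaussian: mean = {mean:.3f}, std = {std:.3f}")
print(f"fraction of dataset actions within 0.3 of the mean: {frac_near_mean:.3f}")
```

The fitted policy places its most likely action near zero, a region where the dataset contains no actions at all, so a constraint towards this cloned policy pulls the learned policy out of distribution instead of keeping it in.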

### II: **Policy Regularization**

Policy Regularization is an alternative approach to ensuring the robustness of learned value functions, specifically Q-functions. **This approach involves regularizing the value function directly, aiming to prevent overestimation, especially for actions that fall outside the distribution seen during training**.

It's versatile, applicable to different RL methods, including actor-critic and Q-learning methods, and doesn't necessitate explicit behavior policy modeling (similar to the implicit constraint methods).


Perhaps one of the most famous examples is the CQL (Conservative Q-Learning) algorithm, which introduces the following penalty term as a Q-value regularizer:

\begin{equation}
\mathcal{C}_{CQL_0}(\mathcal{D}, \phi) = \mathbb{E}_{s\sim \mathcal{D}, a\sim \mu(a|s)}[Q_{\phi}(s, a)] \tag{1}
\end{equation}

Note that if we choose $Q_\phi$ to minimize (1), we minimize the Q-values at all states in the dataset. Suppose now that we choose the distribution $\mu$ adversarially, so that it maximizes (1): the net effect is that the penalty pushes down on high Q-values. This is shown rigorously in the CQL paper, where the authors find that the solution of:

$\hat{Q}^{k+1}_{\text{CQL}} \gets \hbox{argmin}_Q \left[ \color{red} {\alpha\mathbb{E}_{s \sim \mathcal{D}, a \sim \mu}[Q(s,a)] } + \frac{1}{2} \mathbb{E}_{s,a \sim \mathcal{D}} \Big[\big(Q(s,a) - \mathcal{B}^{\pi}Q(s,a)\big)^2\Big] \right]. \tag{2}$

produces a lower bound for $Q(s,a)$. There are different choices of $\mu$: one option is to choose $\mu$ as $\pi$, but there are others. These are technical details (see ... for more details).


In summary, CQL employs a conservative penalty that pushes down on high Q-values by choosing the distribution $\mu(a|s)$ adversarially. This promotes cautious Q-value estimation, particularly for out-of-distribution actions. The chosen $\mu(a|s)$ and the penalty weight $\alpha$ are critical factors in this process, leading to a provably conservative Q-learning or actor-critic algorithm.
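To see how the penalty in (2) lowers Q-values, consider a single state with a tabular Q-function and a Bellman target that is held fixed. This is a simplified sketch, not the full algorithm: the expectation over the dataset is written with an empirical action distribution `data_dist`, and `target`, `mu`, `data_dist` and `alpha` are made-up values. Under these assumptions, setting the gradient to zero gives $Q(a) = \text{target}(a) - \alpha \, \mu(a) / \text{data\_dist}(a)$.

```python
import numpy as np

def cql_objective(Q, target, mu, data_dist, alpha):
    """alpha * E_{a~mu}[Q(a)] + 0.5 * E_{a~data}[(Q(a) - target(a))^2] for one state."""
    penalty = alpha * np.sum(mu * Q)
    bellman_error = 0.5 * np.sum(data_dist * (Q - target) ** 2)
    return penalty + bellman_error

def cql_gradient(Q, target, mu, data_dist, alpha):
    """Gradient of cql_objective with respect to the tabular Q-values."""
    return alpha * mu + data_dist * (Q - target)

# Hypothetical single-state problem with 3 actions.
target = np.array([1.0, 1.0, 1.0])         # fixed Bellman targets
data_dist = np.array([0.60, 0.35, 0.05])   # empirical action frequencies in the dataset
mu = np.full(3, 1.0 / 3.0)                 # distribution used in the penalty
alpha = 0.1

# Plain gradient descent on the objective, starting from the targets.
Q = target.copy()
for _ in range(2000):
    Q -= 1.0 * cql_gradient(Q, target, mu, data_dist, alpha)

Q_closed_form = target - alpha * mu / data_dist
print("gradient descent:", np.round(Q, 4))
print("closed form:     ", np.round(Q_closed_form, 4))
```

All Q-values end up below their targets, and the action that is rarest in the dataset is pushed down the most, which is the conservative behavior described above.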

### References

[Schulman et al. 2015 - Trust Region Policy Optimization](https://arxiv.org/pdf/1502.05477.pdf)

[Kumar et al. 2020 - Conservative Q-Learning for Offline Reinforcement Learning](https://arxiv.org/pdf/2006.04779.pdf)

[Levine et al. 2020 - Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems](https://arxiv.org/pdf/2005.01643.pdf)

[Peng et al. 2019 - Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning](https://arxiv.org/abs/1910.00177)

[Nair et al. 2020 - AWAC: Accelerating Online Reinforcement Learning with Offline Datasets](https://arxiv.org/abs/2006.09359)
