In [None]:
from dataclasses import dataclass
from functools import partial
from typing import Callable

import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np
from celluloid import Camera
from IPython.display import HTML, display

In [None]:
%%capture

%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_rl

In [None]:
%presentation_style

In [None]:
%%capture

%set_random_seed 12

In [None]:
%load_latex_macros

<img src="_static/images/aai-institute-cover.png" alt="Snow" style="width:100%;">
<div class="md-slide title">Intro to Reinforcement Learning</div>

# Basic Concepts of Reinforcement Learning

## Part 1: Introduction

In the last part, we saw interaction with environments from a control perspective. 
For the next two days, we will view the task of controlling an environment from 
a _learning perspective_.

Before diving into the notation and details of RL, let us briefly contrast the overall approach
with supervised (or self-supervised) learning. 


### Bird's-Eye View of Supervised Learning

A supervised learning project can roughly be broken down into the following steps:

1. Train a model on a dataset to fulfill some objective
2. Evaluate the model on a separate test set

But is this really the full picture in a realistic project? While for more academic situations it might be taken as given,
in a real scenario somebody has to compile the datasets. In fact, this is often the hardest and most important part
of the whole project! Let us extend the list a bit:

0. Obtain train and test datasets
1. Train a model on a dataset
2. Evaluate the model on a separate test set

Now we are getting closer to a realistic project. But there is still something missing - is it enough
to just test the model on a test set? What if the model is not good enough in some cases?

While sometimes the model details can be changed to improve the performance, often 
changes in the dataset are a much more impactful measure.

For example, a product owner might notice that a face recognition system is not working well for 
people with glasses. After a quick glance, one discovers that the training dataset contains very
few pictures of people with glasses. So the appropriate reaction is to collect more such pictures and retrain.

The extended list now looks like this:

0. Obtain train and test datasets
1. Train a model on a dataset to fulfill some objective
2. Evaluate the model on a separate test set
3. Check on which samples the model underperforms
4. Analyze the results of step 3, and go back to step 0 if necessary

How could you go about extending the dataset with samples on which the current model version
underperforms? Let's slightly extend the previous example and imagine that the face-recognition
system can also output the labels "not a face" and "has glasses".

One could now crawl some large database of images, and use the current version of the model
for an initial filtering of it, to mainly obtain faces. Now a human would go over the filtered
images, and if necessary, reassign labels. Especially the images where the model has made a mistake
are of particular value for the next training round.

This means that the current version of the model itself is used in the acquisition of new data, 
in step 4. In addition, there is an external system (a human) which is responsible for
switching labels if necessary.

Thus, in the next training run the model is receiving something akin to __negative rewards__
from the external system for getting the predictions wrong - new samples where it failed to have
a small value of the training loss are added to the training data.

It is also in a way __positively rewarded__ for samples where it performed well, since comparatively few
new samples of this type will be added to the dataset, and the overall loss due to them will not 
increase by much.

There are many ways to compare reinforcement learning to other learning paradigms, but
in my opinion, the main difference is the following:

__In RL, data acquisition is automated and part of the algorithm!__

This means that the manually performed steps 4 and 0 in supervised learning (SL)
are being handled by the RL algorithm itself.

<img src="_static/images/rl-abstract.png" alt="Decision Process" style="width:100%;">

Other important differences are:

- RL is generally concerned with sequential decision-making
- Contrary to SL, RL can make use of **suboptimal information**
- The training signal (typically the return, i.e., the discounted sum of rewards) in RL may be significantly delayed after an action
- Evaluating an RL policy is generally harder than evaluating an SL model
- Transfer learning in RL is often difficult or impossible

### Relativizing the Differences

Of course, SL is also widely applied to modeling sequences. Suboptimal information
can be taken into account in SL in some situations, e.g., when the goal is not to predict
a perfect label/token from the input, but to correctly model a probability distribution
over outputs (which then contains "suboptimal" ones). We will come back to that point later
when we talk about the *Decision Transformer* algorithm.

The lines get even more blurry when comparing SL to offline RL. In offline RL, there
is no interaction with the environment, thus the crucial part of data acquisition 
(called exploration in RL contexts) is missing.

While the other differences remain, offline RL *is* a certain kind of supervised 
learning. Thus, it is no wonder that many standard SL techniques prove useful there.

In recent developments, large foundational models are increasingly being used for tasks
that were previously considered to be only solvable through RL, even tackling the
difficulties of transfer learning across different environments.

These include the [PaLM-E](https://palm-e.github.io/) and [Gato](https://www.deepmind.com/publications/a-generalist-agent) 
models, as well as the new and open dataset and models from 
the [Open-X](https://robotics-transformer-x.github.io/) project.

## Reinforcement Learning Notation

We will introduce standard RL Notation that is used in many publications and books, with a slight
tilt towards software implementations. We will also focus on RL with function approximation,
meaning that value functions and policies contain learnable parameters, and are typically
represented by neural networks. This is contrary to tabular RL, where one has finite sets of states and actions,
and the value functions and policies can be represented by tables.

### Decision Processes

A Markov Decision Process (MDP) is a common starting point of any RL-related work. 
It is defined by a tuple $(S, A, P, r)$, where:

- $S$ is the set of states in the environment.
- $A$ is the set of actions available to the agent.
- $P(s' | s, a)$ is the transition probability, representing the probability of transitioning from state $s$ to state $s'$ after taking action $a$.
- $r(s, a)$ is the reward function, which gives the immediate reward for taking action $a$ in state $s$.

The "Markov" property means that transition probabilities and rewards only depend on the current state and action, 
and not on the history of previous states and actions.

Note that "almost anything" can be made Markovian by redefining what "state" means.
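To make the tuple concrete, here is a minimal sketch of a tabular MDP in plain Python. The two-state example and the names `S`, `A`, `P`, `r` and `step` are illustrative assumptions, not part of `gymnasium` or any other library.

```python
import random

# Hypothetical two-state MDP: states and actions are plain strings.
S = ["low", "high"]
A = ["wait", "work"]

# P[(s, a)] maps each next state s' to its probability P(s' | s, a).
P = {
    ("high", "work"): {"high": 0.7, "low": 0.3},
    ("high", "wait"): {"high": 1.0},
    ("low", "work"): {"low": 1.0},
    ("low", "wait"): {"high": 0.5, "low": 0.5},
}

# r[(s, a)] is the immediate reward for taking action a in state s.
r = {
    ("high", "work"): 1.0,
    ("high", "wait"): 0.0,
    ("low", "work"): -1.0,
    ("low", "wait"): 0.0,
}


def step(s: str, a: str, rng: random.Random) -> tuple[str, float]:
    """Sample s' ~ P(. | s, a) and return (s', r(s, a)).

    Only the current state and action enter: this is the Markov property.
    """
    next_states = list(P[(s, a)])
    probs = [P[(s, a)][s_next] for s_next in next_states]
    s_next = rng.choices(next_states, weights=probs, k=1)[0]
    return s_next, r[(s, a)]


rng = random.Random(0)
s_next, reward = step("high", "work", rng)
```

Note that `step` never looks at earlier states or actions. If the dynamics depended on them, one would have to enlarge the state until they no longer do.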

### Policy

A policy $\pi$ is a mapping from states to probabilities of selecting each possible action. 
It defines the agent's behavior in the environment. 
Formally, for each state $s$, $\pi(a|s)$ represents the probability of taking action $a$ in state $s$.
The final goal of RL is to find an optimal policy $\pi^*$ that maximizes the expected return:

\begin{equation}
\pi^* = \arg \max_{\pi} \mathbb{E}_{\pi} \left[ \sum_{t=0}^{H} \gamma^t r(s_t, a_t) \right] =: \arg \max_{\pi} J(\pi)
\end{equation}

Where:
- $H$ is the horizon, i.e., the (maximal) number of steps in a trajectory.
- $\gamma$ is the discount factor that balances immediate rewards against future rewards.
- $\mathbb{E}_{\pi}$ is the expectation over trajectories sampled from the policy $\pi$.
- $J(\pi)$ is the optimization objective
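The expectation in $J(\pi)$ can be approximated by averaging the discounted returns of sampled trajectories. The following sketch assumes a made-up single-state problem: at each step the policy picks a "risky" action with probability `q_risky` (reward 2, episode ends with probability 0.8) or a "safe" one (reward 1, episode ends with probability 0.5). All names and numbers are hypothetical.

```python
import random


def rollout_return(q_risky: float, gamma: float, horizon: int, rng: random.Random) -> float:
    """Sample one trajectory and return its discounted return sum_t gamma^t r_t."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        risky = rng.random() < q_risky        # a_t ~ pi(. | s_t)
        reward = 2.0 if risky else 1.0        # r(s_t, a_t)
        p_terminate = 0.8 if risky else 0.5
        total += discount * reward
        discount *= gamma
        if rng.random() < p_terminate:
            break
    return total


def estimate_objective(
    q_risky: float,
    gamma: float = 0.9,
    horizon: int = 200,
    n_episodes: int = 20_000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of J(pi): the mean return over sampled episodes."""
    rng = random.Random(seed)
    returns = [rollout_return(q_risky, gamma, horizon, rng) for _ in range(n_episodes)]
    return sum(returns) / n_episodes


j_hat = estimate_objective(q_risky=0.3)
```

For this toy problem the objective is also available in closed form, $J = m / (1 - \gamma \sigma)$, with $m$ the expected reward per step and $\sigma$ the per-step survival probability, so the estimate can be checked against it.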

For parameterized policies (e.g. in deep RL), the policy is typically represented by a trainable function with parameters $\theta$.
The goal is then to find the optimal parameters $\theta^*$ that maximize the expected return:

\begin{equation}
\theta^* = \arg \max_{\theta} \mathbb{E}_{\pi_{\theta}} \left[ \sum_{t=0}^{H} \gamma^t r(s_t, a_t) \right] =: \arg \max_{\theta} J(\theta)
\end{equation}

Many RL algorithms are based on the idea of optimizing the objective $J(\theta)$ by using gradient-based methods.
These algorithms are often called policy-based, policy gradient methods, or policy-search. For them, it is of central importance
to be able to approximate the gradient of $J(\theta)$ with respect to $\theta$. We will go into more detail about this later.

Note that taking the gradient of the above objective is not trivial! One of the central results of RL is the _Policy Gradient Theorem_ that expresses the gradient in a tractable form. We will see it later.

**Question:** What do you think would be a suitable distribution $\pi_{\theta}$ to use in continuous action spaces? What about discrete action spaces?

**Answer:**

### Partial Observability

In many real-world problems, the agent does not have access to the full state of the environment.
Think of observing the screen of a video game, or navigating a car by observing camera feeds.
Instead, it receives observations $o$ from the environment, which are typically noisy and incomplete. 
Mathematically, observations are generated by a function $O(s)$, which maps states to observations. 

Such situations are called Partially Observable Markov Decision Processes (POMDPs). They are the
standard for most real-world applications of RL (although, somewhat unfortunately, not for 
the majority of academic research).
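As a small illustration of an observation function, consider a state made of position and velocity where only the position is observed. The function `observe` and the two example states are hypothetical.

```python
def observe(state: tuple[float, float]) -> float:
    """Hypothetical observation function O(s): only the position is visible."""
    position, _velocity = state
    return position


moving_right = (0.0, 1.5)
moving_left = (0.0, -2.0)

# Two different states produce the same observation, so a policy that only
# sees the current observation cannot tell them apart.
same_observation = observe(moving_right) == observe(moving_left)
```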

### Environments in Software

The starting point for most online RL Projects (with the possible exception of RL for large language models (LLMs))
is an `environment`. The standard interface for environments is the `gymnasium` API. You already saw examples
yesterday at the control workshop.

**Question**: What is the state of a gymnasium environment? What does it mean for this environment to be fully observable? Think for example about an environment that provides camera-feeds of a car driving in the city. What would be an effectively fully observed environment for driving a car?

**Answer**:

In [None]:
%%capture
# playing around with environments
cartpole_env = gym.make("CartPole-v1", render_mode="rgb_array")
breakout_env = gym.make("ALE/Breakout-v5", render_mode="rgb_array")
halfcheetah_env = gym.make("HalfCheetah-v4", render_mode="rgb_array")

In [None]:
cartpole_env.observation_space

In [None]:
breakout_env.observation_space

**Question:** which environment is fully observable, which one partially observable? How could you
make the partially observable one fully observable?

**Answer:** 

Other important quantities are:

- $\tau$ is commonly used to denote trajectories, i.e., the sequence of states, actions, and rewards that the agent experiences.
- $H$ is the horizon, i.e., the (maximal) number of steps in a trajectory.
- $R(t)$ is the discounted return from step $t$ on, i.e., $R(t) = \sum_{i=t}^{H} \gamma^{i-t} r(s_i, a_i)$.
  Here, $\gamma$ is the discount factor, which balances immediate rewards against future rewards.

**Question:** What are the horizons of the environments above? Hint - use the environment's `__repr__` (i.e. just print it).
    Feel free to look at the source code of `gymnasium`.
    

**Answer:**

In [None]:
cartpole_env

### Exercise - Basic Environment Interaction

Write an agent that samples random actions from the environment. Then
visualize the trajectories by plotting the frames, rewards, and returns.

In [None]:

@dataclass
class TrajEntry:
    step: int
    obs: np.ndarray
    action: np.ndarray
    reward: float
    next_obs: np.ndarray
    
    frame: np.ndarray | None = None


TTrajectory = list[TrajEntry]

def get_trajectory_animation(
    traj: TTrajectory,
    entry_plotter: Callable[[TrajEntry], None],
    title_extractor: Callable[[TrajEntry], str] | str = "Trajectory",
    dpi=150,
    figsize=(3, 3),
    display_frame_count=True,
):
    fig = plt.figure(dpi=dpi, figsize=figsize)
    camera = Camera(fig)
    for id_frame, entry in enumerate(traj):
        title = title_extractor if isinstance(title_extractor, str) else title_extractor(entry)
        entry_plotter(entry)
        ax = plt.gca()
        ax.text(-0.05, 1.05, title, transform=ax.transAxes)
        if display_frame_count:
            ax.text(
                .75,
                1.05,
                f"frame: {id_frame}",
                fontsize=8,
                bbox=dict(facecolor="gray", fill=True, linewidth=1, boxstyle="round"),
                transform=ax.transAxes
            )
        camera.snap()
    animation = camera.animate().to_html5_video()
    plt.close()
    display(HTML(animation))

    

In [None]:
def plot_image_with_text_boxes(image: np.ndarray, texts: list[str], fontsize=4):
    plt.imshow(image)
    d_between_boxes = 0.08
    for i, text in enumerate(texts):
        y_pos = -0.01- i * d_between_boxes
        plt.text(
            .75,
            y_pos,
            text,
            fontsize=fontsize,
            bbox=dict(facecolor="white", fill=True, linewidth=1, boxstyle="round"),
            transform=plt.gca().transAxes
        )

def plot_traj_entry_with_reward_and_return(entry: TrajEntry, returns: list[float]):
    cur_return = returns[entry.step]
    plot_image_with_text_boxes(entry.frame, [f"reward: {entry.reward:.2f}", f"return: {cur_return:.2f}"])


## Solution:


In [None]:
def compute_returns(rewards: list[float], gamma: float = 1):
    """
    Computes returns to go from a list of rewards that are assumed to come from a single episode
    """
    returns_reverted = [rewards[-1]]
    for rew in rewards[-2::-1]:
        returns_reverted.append(rew + gamma * returns_reverted[-1])
    return list(reversed(returns_reverted))

In [None]:

traj = []
rewards = []
obs, info = cartpole_env.reset()
for step_num in range(50):
    action = cartpole_env.action_space.sample()
    next_obs, reward, terminated, truncated, info = cartpole_env.step(action)
    rewards.append(reward)
    entry = TrajEntry(step=step_num, obs=obs, action=action, reward=reward, next_obs=next_obs)
    entry.frame = cartpole_env.render()
    traj.append(entry)
    # stop at the end of the episode instead of stepping a finished environment
    if terminated or truncated:
        break
    obs = next_obs

In [None]:
traj_returns = compute_returns(rewards)
entry_plotter = partial(plot_traj_entry_with_reward_and_return, returns=traj_returns)

In [None]:
get_trajectory_animation(traj, entry_plotter)

## Skeleton of an Online RL algorithm

- fill in the blanks

## Policy Evaluation

Before we dive into how a policy that solves a decision process may be found, we shall talk about how
policies can be evaluated. The evaluation is a central building block of many policy improvement algorithms,
so hang in there even if for the moment it might seem too dry!



Already with the random policy we saw one aspect that makes RL harder than SL: it is difficult, or even impossible,
to attribute the return to a specific action, since the rewards may be delayed. This is called the **credit assignment problem**.

As we will see below, evaluating a policy is also far from trivial. In fact, it is often the hardest part of RL.
First, let us introduce some notation.

### Value Function

The value function $V_\pi(s_t)$ (with some abuse of notation) represents the expected return (cumulative future rewards) 
an agent can obtain starting from a state $s_t$ at time $t$ and then following the policy $\pi$.

\begin{equation}
V_\pi(s_t) = \mathbb{E}_{\pi} \left[ \sum_{i=t}^{H} \gamma^{i-t} r(s_i, a_i) \right]
\end{equation}

Where:
- $\pi(a|s)$ is the probability of taking action $a$ in state $s$ under the policy.
- $\gamma$ is the discount factor that balances immediate rewards against future rewards.

We will be interested in the infinite horizon case, i.e., $H = \infty$. 
In this case, one needs only one value function $V_{\pi}(s)$, since the remaining horizon (and return) 
do not depend on the current time step $t$.

Thus, we can drop the index $t$ and rewrite the value function as:

\begin{equation}
V_{\pi}(s) = \mathbb{E}_{\pi} \left[ \sum_{i=0}^{\infty} \gamma^i r(s_i, a_i) \right]
\end{equation}

**Question:** Do the environments above have an infinite or a finite horizon? What about environments with a terminal state,
will the formalism for infinite horizons still work?

**Answer:**

The RL objective can be rewritten in terms of the value function as:

\begin{equation}
J(\pi) = \mathbb{E}_{s_0 \sim \rho_0} \left[ V_{\pi}(s_0) \right]
\end{equation}

Where $s_0$ is the initial state and $\rho_0$ is the initial state distribution.

**Question:** Which method defines the initial state distribution for `gymnasium` environments?

**Answer:**

### Q-function

Another useful quantity is the Q-function, 
denoted as $Q_\pi(s, a)$. It represents the expected return an agent can obtain from taking action $a$ in state $s$ and then following the policy $\pi$. 
It is defined as:

\begin{equation}
Q_\pi(s, a) = r(s, a) + \gamma \ \mathbb{E}_{s' \sim P(\cdot | s, a)} \mathbb{E}_\pi \left[ r(s', a_0) + \sum_{i=1}^{\infty} \gamma^{i} r(s_i, a_i) \right]
\end{equation}

Where $a_0 \sim \pi(\cdot | s')$ is the action taken in the next state $s'$ by following the policy $\pi$.

The Q-function is closely related to the value function:

\begin{equation}
Q_\pi(s, a) = r(s, a) + \gamma \ \mathbb{E}_{s' \sim P(\cdot | s, a)} \left[ V_\pi(s') \right]
\end{equation}


### Advantage Function

Finally, one more ubiquitous quantity in  RL is the advantage function, denoted as $A_\pi(s, a)$. It is defined as:

\begin{equation}
A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)
\end{equation}

The advantage function represents the advantage of taking action $a$ in state $s$ over just following the policy $\pi$
from that state.
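For a single state with discrete actions, the three quantities can be written down in a few lines. The Q-values and policy probabilities below are made up for illustration. The sketch uses $V_\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[Q_\pi(s, a)]$, which follows directly from the definitions above.

```python
# Hypothetical Q-values and policy probabilities for a single state s.
q_values = {"left": 1.0, "stay": 2.0, "right": 4.0}
policy = {"left": 0.2, "stay": 0.5, "right": 0.3}

# V_pi(s) = E_{a ~ pi(.|s)} [Q_pi(s, a)]
v = sum(policy[a] * q_values[a] for a in q_values)

# A_pi(s, a) = Q_pi(s, a) - V_pi(s)
advantages = {a: q_values[a] - v for a in q_values}

# Averaged over the policy's own action distribution, the advantage is zero.
mean_advantage = sum(policy[a] * advantages[a] for a in q_values)
```

A positive advantage (here for `"right"`) marks an action that is better than what the policy does on average in that state.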

**Question:** Which property does the advantage function have for an optimal policy?

**Answer:**

### Estimating Value Functions

The equations for $Q$ and $V$ are of telescopic nature, i.e., they can be rewritten recursively:

\begin{equation}
V_\pi(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) \right]  + \gamma \ \mathbb{E}_{a \sim \pi(s), s' \sim P(\cdot | s, a)} \left[ V_\pi(s') \right]
\end{equation}

and

\begin{equation}
Q_\pi(s, a) = r(s, a) + \gamma \ \mathbb{E}_{a' \sim \pi(s'), s' \sim P(\cdot | s, a)} \left[ Q_\pi(s', a') \right]
\end{equation}
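When $P$ and $r$ are known and the state and action sets are small, the recursive equation for $V_\pi$ can be solved by applying its right-hand side repeatedly until nothing changes. The sketch below does this for a hypothetical 2-state, 2-action MDP. All numbers and the name `bellman_backup` are assumptions made for illustration.

```python
GAMMA = 0.9

# Hypothetical MDP, indexed as P[s][a][s'] and r[s][a].
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
r = [
    [1.0, 0.0],
    [0.0, 2.0],
]
# pi[s][a] is the probability of action a in state s.
pi = [
    [0.5, 0.5],
    [0.2, 0.8],
]


def expected_next_value(s: int, a: int, v: list[float]) -> float:
    """E_{s' ~ P(. | s, a)} [v(s')]."""
    return sum(P[s][a][s2] * v[s2] for s2 in range(2))


def bellman_backup(v: list[float]) -> list[float]:
    """Apply the right-hand side of the recursive equation for V_pi once."""
    return [
        sum(pi[s][a] * (r[s][a] + GAMMA * expected_next_value(s, a, v)) for a in range(2))
        for s in range(2)
    ]


v = [0.0, 0.0]
for _ in range(500):
    v = bellman_backup(v)

# Q_pi follows from V_pi via Q(s, a) = r(s, a) + gamma * E_{s'}[V(s')].
q = [[r[s][a] + GAMMA * expected_next_value(s, a, v) for a in range(2)] for s in range(2)]
```

Because $\gamma < 1$, each backup shrinks the distance to the true $V_\pi$ by at least a factor of $\gamma$, so the loop converges regardless of the starting values.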


When a value function is learned, the above equalities no longer hold exactly. Instead, one can view the right-hand side as an 
"improved estimation" of the value function. It is "improved", because it has used a sample from the environment,
thereby incorporating "ground truth" information.

For example, given a learned value function $V_\theta$ (we are going to worry about how to learn it later)
and a sample $(s, a, r(s,a), s')$ from the environment, one can get an estimate of $V_\pi(s)$ that is less biased than $V_\theta(s)$, let's call it
$V_{\theta, 1}(s)$, through a single-sample estimate of the above expectation:

\begin{equation}
V_{\theta, 1}(s) = r(s, a) + \gamma \ V_\theta(s')
\end{equation}

This is called a one-step estimate, since it only uses one sample from the environment. The 
resulting $V_{\theta, 1}(s)$ is less biased than $V_\theta(s)$ by itself, since part of it comes from an observed reward,
but it has higher variance, because the expectation is replaced by a single sample.

**Question:** What would be a fully unbiased estimate of $V(s)$ given a full trajectory? What about a biased estimate with lowest possible variance?
How can one balance bias and variance in the estimate?

**Answer:**

### TD-Lambda and Generalized Advantage Estimation

Above we have seen the one-step estimate of the value function. It is also possible to use a multi-step estimate:

\begin{equation}
V_{\theta, n}(s) = r(s, a) + \gamma \ r(s_1, a_1) + \gamma^2 \ r(s_2, a_2) + \dots + \gamma^n \ V_\theta(s_n).
\end{equation}


For $n = \infty$, this is the Monte Carlo estimate, that does not use $V_\theta$ at all (although for truncated trajectories, 
$V_\theta$ would still be used for the last state).
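The multi-step estimate is easy to compute from a recorded trajectory. In the sketch below, `rewards[i]` stands for $r(s_i, a_i)$ and `values[i]` for $V_\theta(s_i)$, with one extra entry for the state after the last step. The function name and the numbers are hypothetical.

```python
def n_step_estimate(
    rewards: list[float], values: list[float], t: int, n: int, gamma: float
) -> float:
    """n-step estimate of the value of s_t.

    values must have len(rewards) + 1 entries. If fewer than n steps remain,
    the estimate bootstraps from the value of the final state instead.
    """
    n = min(n, len(rewards) - t)
    discounted_rewards = sum(gamma**k * rewards[t + k] for k in range(n))
    return discounted_rewards + gamma**n * values[t + n]


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 2.0]

one_step = n_step_estimate(rewards, values, t=0, n=1, gamma=0.5)    # 1 + 0.5 * 0.5
full = n_step_estimate(rewards, values, t=0, n=3, gamma=0.5)        # 1 + 0.5 + 0.25 + 0.125 * 2
truncated = n_step_estimate(rewards, values, t=0, n=10, gamma=0.5)  # same as n=3
```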

![TD-Lambda](_static/images/50_td_lambda.png)

Since for any $n$ one has a valid estimate of $V(s)$, one can combine them into a weighted average:

\begin{equation}
V_{\theta, \lambda}(s) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} V_{\theta, n}(s)
\end{equation}

Without too much effort, one can derive a recursive formula for conveniently computing $V_{\theta, \lambda}(s)$
for a trajectory $(s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T, s_{T+1})$ by traversing it backwards:

\begin{equation}
V_{\theta, \lambda}(s_t) = r_t + \gamma \left[ (1 - \lambda) \ V_\theta(s_{t+1}) + \lambda \ V_{\theta, \lambda}(s_{t+1}) \right]
\end{equation}

and

\begin{equation}
V_{\theta, \lambda}(s_T) = r_T + \gamma \ V_\theta(s_{T+1}).
\end{equation}

The parameter $\lambda$ controls the trade-off between bias and variance, and this approach for estimating values is called TD-Lambda.
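The sketch below computes TD-Lambda targets for a truncated trajectory in two ways: by the backward pass $G_t = r_t + \gamma \left[ (1-\lambda) V_\theta(s_{t+1}) + \lambda G_{t+1} \right]$, and directly from the weighted average of $n$-step estimates (where all estimates that would run past the end of the trajectory coincide, so their weights add up to a single power of $\lambda$). The function names and the example numbers are assumptions.

```python
def n_step_estimate(rewards, values, t, n, gamma):
    """n-step estimate for s_t, bootstrapping from values[t + n]."""
    n = min(n, len(rewards) - t)
    return sum(gamma**k * rewards[t + k] for k in range(n)) + gamma**n * values[t + n]


def lambda_returns(rewards, values, gamma, lam):
    """Backward pass: G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1})."""
    targets = [0.0] * len(rewards)
    next_target = values[-1]  # bootstrap from the value of the final state
    for t in reversed(range(len(rewards))):
        targets[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_target)
        next_target = targets[t]
    return targets


def lambda_return_from_definition(rewards, values, t, gamma, lam):
    """Weighted average of n-step estimates, written out for a finite trajectory."""
    remaining = len(rewards) - t
    weighted = sum(
        (1 - lam) * lam ** (n - 1) * n_step_estimate(rewards, values, t, n, gamma)
        for n in range(1, remaining)
    )
    # All n >= remaining give the same estimate; their weights sum to lam^(remaining - 1).
    tail = lam ** (remaining - 1) * n_step_estimate(rewards, values, t, remaining, gamma)
    return weighted + tail


rewards = [1.0, 0.0, 2.0, 1.0]
values = [0.5, 1.0, -0.5, 2.0, 0.3]  # V_theta for s_0, ..., s_4
targets = lambda_returns(rewards, values, gamma=0.9, lam=0.8)
```

Setting `lam=0` recovers the one-step estimate, and `lam=1` the Monte Carlo estimate with a bootstrap at the truncation point.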

The exact same approach can be used for estimating the Advantage function (or the Q-function), which is called Generalized Advantage Estimation (GAE):

\begin{equation}
A_{\theta, \lambda}(s, a) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} A_{\theta, n}(s, a)
\end{equation}

which for a trajectory $(s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T, s_{T+1})$ can be computed recursively as:

\begin{equation}
A_{\theta, \lambda}(s_t, a_t) = r_t + \gamma \ V_\theta(s_{t+1}) - V_\theta(s_t) + \gamma \lambda \ A_{\theta, \lambda}(s_{t+1}, a_{t+1})
\end{equation}

with

\begin{equation}
A_{\theta, \lambda}(s_T, a_T) = r_T + \gamma V_\theta(s_{T+1}) - V_\theta(s_T).
\end{equation}

From the above considerations, a somewhat rarely mentioned identity can be directly derived:

For $(s, a)$ being part of a trajectory, and with the advantage and value functions being estimated from the trajectory as above, the following holds:

\begin{equation}
A_{\theta, \lambda}(s, a) = V_{\theta, \lambda}(s) - V_\theta(s).
\end{equation}

Thus, TD-Lambda and GAE are equivalent methods for estimating the value or the advantage function on a trajectory.
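The equivalence can be checked numerically. The sketch below computes GAE through the usual backward pass over one-step TD residuals, $A_t = \left( r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t) \right) + \gamma \lambda A_{t+1}$, and compares it with the TD-Lambda target of the same state minus its learned value. Function names and numbers are hypothetical.

```python
def lambda_returns(rewards, values, gamma, lam):
    """TD-Lambda targets via G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1})."""
    targets = [0.0] * len(rewards)
    next_target = values[-1]
    for t in reversed(range(len(rewards))):
        targets[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_target)
        next_target = targets[t]
    return targets


def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates via A_t = residual_t + gamma * lam * A_{t+1}."""
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0  # no advantage is accumulated beyond the truncation point
    for t in reversed(range(len(rewards))):
        residual = rewards[t] + gamma * values[t + 1] - values[t]
        advantages[t] = residual + gamma * lam * next_advantage
        next_advantage = advantages[t]
    return advantages


rewards = [1.0, 0.0, 2.0, 1.0]
values = [0.5, 1.0, -0.5, 2.0, 0.3]  # V_theta for s_0, ..., s_4
advantages = gae(rewards, values, gamma=0.9, lam=0.8)
targets = lambda_returns(rewards, values, gamma=0.9, lam=0.8)

# Largest deviation between A_lambda(s_t, a_t) and V_lambda(s_t) - V_theta(s_t).
max_gap = max(abs(advantages[t] - (targets[t] - values[t])) for t in range(len(rewards)))
```

In an implementation it is therefore enough to compute one of the two and obtain the other by adding or subtracting $V_\theta$.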

## Learning Value Functions

Now that we have seen how to estimate value functions, we can turn to the question of how to learn them.
In the following, we will focus on the value function $V_\theta(s)$, but the same considerations apply to the Q-function $Q_\theta(s, a)$.

The sample-improved values that we have seen above can be viewed as **targets** for the value function. Thus, a straightforward learning approach would be to perform **supervised regression** by minimizing
 some error (often the mean-squared-error (MSE)) between the current value function and the target.
 
The difference between the current value function and the target is called the Bellman error, or TD-error:

\begin{equation}
\delta_1 = V_\theta(s) - \left[ r(s, a) + \gamma \ V_\theta(s') \right]
\end{equation}

or, for TD-Lambda / GAE:

\begin{equation}
\delta_\lambda = V_\theta(s) - V_{\theta, \lambda}(s)
\end{equation}
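In the tabular case, minimizing the squared one-step error by stochastic gradient descent (treating the target as a constant) gives the classic TD(0) update. The sketch below applies it to a hypothetical two-state process in which the policy is already folded into the dynamics. All names and numbers are assumptions made for illustration.

```python
import random

GAMMA = 0.9


def env_step(s: int, rng: random.Random) -> tuple[float, int]:
    """Toy process: state 0 pays reward 1 and moves to 1; state 1 pays 0 and moves at random."""
    if s == 0:
        return 1.0, 1
    return 0.0, (0 if rng.random() < 0.5 else 1)


def td0(n_steps: int = 50_000, lr: float = 0.02, seed: int = 0) -> list[float]:
    """Learn a tabular value function from single transitions."""
    rng = random.Random(seed)
    v = [0.0, 0.0]
    s = 0
    for _ in range(n_steps):
        reward, s_next = env_step(s, rng)
        target = reward + GAMMA * v[s_next]  # one-step target, treated as a constant
        delta = v[s] - target                # TD error, same sign convention as above
        v[s] -= lr * delta                   # gradient step on 0.5 * delta**2 w.r.t. v[s]
        s = s_next
    return v


v_learned = td0()
```

For this process the true values can be worked out by hand from the recursive equations ($V(0) \approx 3.79$, $V(1) \approx 3.10$), and the learned table fluctuates around them. With a neural network the update looks the same, but the target moves together with the parameters, which is one source of instability.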


**Question:** Given some policy, an environment, and a value function represented by a neural network, how would you train the latter? Are there any possible problems you can think of?

**Answer:**

**Question:** For learning a Q-function represented by a neural network, in practice one usually uses the 1-step target. Why do you think this is the case?

**Answer:**

## Policy Improvement

## From Value Functions to Actions - Value Iteration


So far, we have only considered how to compute value functions. Now we will turn to the question of how to use them to select good actions.

A simple way to improve on the actions performed by some policy $\pi$ is to estimate the Q-function, and then greedily select the action with the highest Q-value:

\begin{equation}
\pi'(s) := \arg \max_{a \in A} Q_{\pi,\theta}(s, a)
\end{equation}

Note that $\pi'$ is a deterministic policy. 
Unsurprisingly, if $Q_{\pi,\theta}$ were the true $Q_\pi$, this would lead to $\pi'$ being better than $\pi$ 
(due to the *policy improvement theorem*). In the function-approximation case, we can just hope that the estimate of $Q$ is good enough.

Now the Q-function of this improved policy $\pi'$ can be estimated by:

\begin{equation}
Q_{\pi',\theta}(s, a) = r(s, a) + \gamma \max_{a'} Q_{\pi,\theta}(s', a')
\end{equation}

If the Q-function were already optimal (i.e., if it belonged to the best possible policy), let us call it $Q^*$, then it would satisfy the Bellman optimality equation:

\begin{equation}
Q^*(s, a) = r(s, a) + \gamma\mathbb{E}_{s'} [\max_{a'} Q^*(s', a')]
\end{equation}


The core idea of **value iteration** in the style of deep Q-learning is to use the improved Q-function as a target for the current Q-function.
Thus, one wants to minimize the (mean of squares of the) 1-step Bellman error for the Q-function:

\begin{equation}
\delta_1 = Q_{\pi,\theta}(s, a) - \left[ r(s, a) + \gamma \max_{a'} Q_{\pi,\theta}(s', a') \right]
\end{equation}

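When this error vanishes in expectation over $s'$ for all state-action pairs, the Q-function satisfies the Bellman optimality equation and is therefore optimal.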
The policy at any point is then given by simply selecting the action with the highest Q-value: $\pi(s) := \arg \max_{a \in A} Q_{\pi,\theta}(s, a)$.
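In the tabular case this idea is the well-known Q-learning update. The sketch below runs it on a hypothetical three-state chain (the rightmost state is terminal and entering it pays reward 1), with data collected by a purely random behaviour policy. Names and numbers are assumptions.

```python
import random

GAMMA, N_STATES, TERMINAL = 0.9, 3, 2


def env_step(s: int, a: int) -> tuple[float, int, bool]:
    """Deterministic chain: action 1 moves right, action 0 moves left (or stays at 0)."""
    s_next = s + 1 if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == TERMINAL else 0.0
    return reward, s_next, s_next == TERMINAL


def q_learning(n_episodes: int = 2000, lr: float = 0.5, seed: int = 0) -> list[list[float]]:
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(n_episodes):
        s, done = 0, False
        while not done:
            a = rng.randrange(2)  # random behaviour policy, used for exploration only
            reward, s_next, done = env_step(s, a)
            bootstrap = 0.0 if done else max(q[s_next])  # no future value after termination
            delta = q[s][a] - (reward + GAMMA * bootstrap)
            q[s][a] -= lr * delta
            s = s_next
    return q


q = q_learning()
# Greedy policy for the non-terminal states.
greedy_policy = [max(range(2), key=lambda a: q[s][a]) for s in range(TERMINAL)]
```

The learned table approaches $Q^*$ even though the data comes from a random policy, because the target contains a $\max$ over next actions instead of the action that was actually taken next.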

**Question:** How would you design a Q-based deep RL algorithm based on the above idea? For which kind of environments would it work, and which problems could arise?

**Answer:**

## Policy Iteration and the Policy Gradient Theorem

While purely value-based methods can give rise to powerful deep RL algorithms, they are difficult to use 
in environments with continuous action spaces. Therefore, we now go back to the original RL objective, 
and turn to policy-based methods. At the center of many policy-based methods is a
fundamental result called the **policy gradient theorem**:

\begin{equation}
\nabla_{\theta} J(\theta) \propto \mathbb{E}_{\pi_\theta} \left[ \left( Q_\pi(s, a) - b(s) \right) \nabla_{\theta} \log \pi_\theta(a|s) \right]
\end{equation}
  
Where $b(s)$ is an arbitrary state-dependent **baseline** function.
The most common choice for $b(s)$ is the value function $V(s)$, which leads to the policy gradient of the form:

\begin{equation}
\nabla_{\theta} J(\theta) \propto \mathbb{E}_{\pi} \left[ A_\pi(s, a) \nabla_{\theta} \log \pi_\theta(a|s) \right]
\end{equation}

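Note that $\mathbb{E}_{a \sim \pi(\cdot|s)} \left[ A_\pi(s, a) \right] = 0$ for every state $s$, since $V_\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ Q_\pi(s, a) \right]$.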

Different RL algorithms are obtained by choosing different methods for estimating advantages $A(s, a)$.
For example, not using any learned function and estimating $A(s, a)$ by the Monte Carlo return $R(s, a)$
leads to the REINFORCE algorithm.
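For a one-step problem (a bandit, where $Q(s, a)$ is just the reward of action $a$) with a softmax policy, the expectation in the policy gradient theorem can be evaluated exactly and compared with a finite-difference gradient of $J(\theta)$. The sketch below does that. All names and numbers are hypothetical, and in an actual algorithm the expectation would be replaced by an average over sampled actions.

```python
import math


def softmax(theta: list[float]) -> list[float]:
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def objective(theta: list[float], rewards: list[float]) -> float:
    """J(theta) = E_{a ~ pi_theta}[r(a)] for a one-step problem."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def grad_log_pi(theta: list[float], a: int) -> list[float]:
    """Gradient of the log-softmax: d/d theta_k log pi(a) = 1[k == a] - pi(k)."""
    pi = softmax(theta)
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(theta))]


def policy_gradient(theta: list[float], rewards: list[float], baseline: float = 0.0) -> list[float]:
    """Exact expectation E_{a ~ pi}[(r(a) - b) * grad log pi(a)]."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, p in enumerate(pi):
        score = grad_log_pi(theta, a)
        for k in range(len(theta)):
            grad[k] += p * (rewards[a] - baseline) * score[k]
    return grad


def finite_difference_gradient(theta: list[float], rewards: list[float], eps: float = 1e-6) -> list[float]:
    grad = []
    for k in range(len(theta)):
        up = [t + (eps if i == k else 0.0) for i, t in enumerate(theta)]
        down = [t - (eps if i == k else 0.0) for i, t in enumerate(theta)]
        grad.append((objective(up, rewards) - objective(down, rewards)) / (2 * eps))
    return grad


theta = [0.1, -0.3, 0.5]
rewards = [0.0, 1.0, 0.5]
pg = policy_gradient(theta, rewards)
pg_with_baseline = policy_gradient(theta, rewards, baseline=0.7)
fd = finite_difference_gradient(theta, rewards)
```

The baseline leaves the expected gradient unchanged. Its benefit only shows up once the expectation is estimated from samples, where it reduces the variance.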

Using a learned value function $V_\theta(s)$
 for estimating $A(s, a)$ leads to many successful
**Actor-Critic algorithms**, such as A2C, A3C, and PPO. There, the *actor* is the policy $\pi_\theta(a|s)$,
and the *critic* is typically a parameterized value function $V_\theta(s)$. So, the previous section on
policy evaluation comes in handy for policy-gradient methods as well.

In the most straightforward implementation of this policy search scheme, the advantages $A_\pi$ are
estimated on the fly, i.e. in an on-policy fashion.

Note that the critic is only ever evaluated on states sampled from the environment with the current policy, contrary to
the value-iteration approach (where $\max$ and $\arg \max$ have to be computed). Moreover, the critic
is only used during training and is typically discarded during inference.

**Question**: How would you design a deep Actor-Critic RL algorithm? How would you perform updates of the actor and critic networks?
 What are important hyperparameters?
 
**Answer**:

## Q-Methods for Continuous Control - DDPG and its Variants

The Q-based methods outlined above don't lend themselves directly to continuous action spaces: the $\max$ and $\arg \max$
operations are too expensive and unreliable there.

Fortunately, the deterministic policy gradients (DPG) approach and its deep-learning successors give an elegant way
out of this. The core idea is the following: instead of maximizing the Q-function over the action space, one can
learn a deterministic policy $\mu_\theta(s)$ that takes the role of the $\arg \max$ operation. The Q-function is then
updated by minimizing the (mean square of the) 1-step Bellman error:

\begin{equation}
\delta_1 = Q_{\pi,\theta}(s, a) - \left[ r(s, a) + \gamma \ Q_{\pi,\theta}(s', \mu_\theta(s')) \right]
\end{equation}

The policy is updated by using the **deterministic policy gradient theorem**:

\begin{equation}
\nabla_{\theta} J(\theta) \propto \mathbb{E}_{\pi_\theta} \left[ \nabla_{\theta} \mu_\theta(s) \ \nabla_{a} Q_\pi(s, a) \big|_{a = \mu_\theta(s)} \right]
\end{equation}

For sampling actions from the policy during training, usually some exploration noise is added to the output of the policy network.

Note that since the Q-values are now estimated directly from a single sample (and not from a trajectory, as described above), 
the DPG-family of algorithms is amenable to off-policy learning.
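The policy update can be illustrated on a one-dimensional toy problem where everything is available in closed form. The sketch assumes a known critic $Q(s, a) = -(a - 2s)^2$ (so the best action is $a = 2s$) and a linear policy $\mu_\theta(s) = \theta s$. In DDPG and its variants both are neural networks, the critic is learned, and the gradients come from automatic differentiation. All names here are hypothetical.

```python
def critic_grad_a(s: float, a: float) -> float:
    """d/da of the assumed critic Q(s, a) = -(a - 2 s)^2."""
    return -2.0 * (a - 2.0 * s)


def policy(theta: float, s: float) -> float:
    """Deterministic linear policy mu_theta(s) = theta * s."""
    return theta * s


def dpg_gradient(theta: float, states: list[float]) -> float:
    """Average of grad_theta mu_theta(s) * grad_a Q(s, a), evaluated at a = mu_theta(s)."""
    total = 0.0
    for s in states:
        grad_theta_mu = s  # d/d theta of theta * s
        total += grad_theta_mu * critic_grad_a(s, policy(theta, s))
    return total / len(states)


states = [-1.0, 0.5, 2.0]  # stand-in for a batch of states from a replay buffer
theta = 0.0
for _ in range(200):
    theta += 0.05 * dpg_gradient(theta, states)  # gradient ascent on J
```

The parameter converges to $\theta = 2$, i.e. the policy ends up outputting the maximizer of the critic without ever computing an $\arg \max$ over actions.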

**Question:** This is a good moment to pause and look back at the algorithm schemes that we saw above. How would
you summarize the differences between these approaches? What are the advantages and disadvantages of each? What
could be turning knobs for improving their stability and efficiency?

## Other Important Topics in Algorithm Design

Two major recurring (and highly related) schemes in derivations of RL algorithms are

1. Regularization and conservative updates
2. Entropy maximization, aka soft learning

Regularization techniques build on the idea that changes to policies (and other quantities) should happen slowly,
and entropy maximization roughly means that policies should be as stochastic as possible while still
leading to high rewards.

A common technique for incorporating regularization is to include a penalty for deviating
too far from the previous policy directly in the RL objective. Algorithms like PPO, TRPO as well as REPS, AWR and follow-ups
are derived this way. This approach is also often used for RL from human feedback (RLHF) in language modelling.
 
A nice aspect of this form is that an analytic, non-parametric solution of the modified RL objective can often be derived explicitly.

With the modified objective being

\begin{equation}
J(\pi') = \mathbb{E}_{\pi'} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) - \alpha \mathcal{D}_{KL}(\pi'(\cdot|s_t) || \pi(\cdot|s_t)) \right]
\end{equation}

the solution is given by (using the policy gradient theorem and the value function as state-dependent baseline):

\begin{equation}
\pi'(a|s) \propto \pi(a|s) \exp \left( \frac{1}{\alpha} A_\pi(s, a) \right),
\end{equation}

which is a weighted version of the original policy.
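For discrete actions in a single state, this reweighting takes only a few lines. The probabilities and advantages below are made up, and `reweight_policy` is a hypothetical helper name.

```python
import math


def reweight_policy(pi: list[float], advantages: list[float], alpha: float) -> list[float]:
    """Return pi'(a) proportional to pi(a) * exp(A(a) / alpha), normalized to sum to 1."""
    weights = [p * math.exp(adv / alpha) for p, adv in zip(pi, advantages)]
    z = sum(weights)
    return [w / z for w in weights]


pi = [0.2, 0.5, 0.3]
advantages = [-1.4, -0.4, 1.6]

moderate = reweight_policy(pi, advantages, alpha=1.0)
conservative = reweight_policy(pi, advantages, alpha=1e6)  # strong penalty: stays close to pi
aggressive = reweight_policy(pi, advantages, alpha=0.1)    # weak penalty: close to greedy
```

The parameter $\alpha$ thus interpolates between keeping the old policy (large $\alpha$) and jumping to the action with the highest advantage (small $\alpha$).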

Entropy maximizing algorithms like Soft Actor Critic (SAC) are often naturally derived from the "RL as inference" perspective,
despite the original derivation of SAC following a different route. The objective considered in SAC is


\begin{equation}
J(\pi) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t)) \right]
\end{equation}

where $\mathcal{H}(\pi(\cdot|s_t))$ is the entropy of the policy distribution $\pi(\cdot|s_t)$.


Maximising the entropy of the policy adds noise to the exploration, which is often beneficial for learning. The soft family of
actor critic algorithms often tends to be more sample efficient and stable than their non-soft counterparts. An initial difficulty
of these algorithms was that $\alpha$ was a hyperparameter that had to be tuned carefully. However, follow-up work
has introduced methods for automatically tuning $\alpha$, significantly simplifying the use of these algorithms.
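The effect of the entropy term can be seen in a one-step simplification with fixed Q-values for a single state (this is not the full SAC algorithm). In that setting, the policy maximizing $\mathbb{E}_{a \sim \pi}[Q(a)] + \alpha \mathcal{H}(\pi)$ is the Boltzmann distribution $\pi(a) \propto \exp(Q(a) / \alpha)$. The function names and numbers below are assumptions.

```python
import math


def entropy(pi: list[float]) -> float:
    """Shannon entropy of a discrete distribution (0 * log 0 is taken as 0)."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)


def soft_objective(pi: list[float], q: list[float], alpha: float) -> float:
    """Expected Q-value plus alpha times the policy entropy."""
    return sum(p * qa for p, qa in zip(pi, q)) + alpha * entropy(pi)


def boltzmann_policy(q: list[float], alpha: float) -> list[float]:
    """pi(a) proportional to exp(Q(a) / alpha), computed in a numerically stable way."""
    m = max(q)
    weights = [math.exp((qa - m) / alpha) for qa in q]
    z = sum(weights)
    return [w / z for w in weights]


q = [1.0, 2.0, 4.0]
alpha = 0.5

soft_optimal = boltzmann_policy(q, alpha)
greedy = [0.0, 0.0, 1.0]
uniform = [1 / 3, 1 / 3, 1 / 3]

best_value = soft_objective(soft_optimal, q, alpha)
```

The greedy policy has zero entropy and the uniform one ignores the Q-values. The Boltzmann policy sits in between, and a larger $\alpha$ moves it towards uniform.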

## Exploration

So far, we have mainly talked about policy evaluation and policy improvement. Going back to our original comparison to
supervised learning, this merely addresses the parts

1. Train a model on a dataset to fulfill some objective
2. Evaluate the model on a separate test set

However, the most important part about how to obtain a suitable dataset and to steer the data acquisition process
is still missing. This is the part of **exploration**, and is arguably the hardest and most important aspect of an RL
algorithm.

**Question:** What are some possible ways of exploration that you can think of? How does exploration differ for on-policy and off-policy algorithms?

**Answer:**

Exploration in RL is a very active area of research, and many modifications of existing RL algorithms have been proposed
to handle exploration in a variety of settings. We will not go into a deep dive here, but instead have a small live 
discussion about this topic.

<img src="_static/images/aai-institute-cover.png" alt="Snow" style="width:100%;">
<div class="md-slide title">Thank you for the attention!</div>