# Advantage Actor-Critic on LunarLanderContinuous-v3

## Introduction

This project explores a well-known deep reinforcement learning algorithm, Advantage Actor-Critic (A2C). We first cover the algorithm's underlying theory, then implement it in PyTorch, and finally train and evaluate the agent in OpenAI's LunarLanderContinuous-v3 environment.

## Part 1: Theoretical Background

### Policy Gradient

Let $J(\theta)$ denote the expectation of our cumulative reward over the entire episode with a finite time horizon $T$ given policy $\pi_\theta$:

$$
J(\theta) = \mathbb E\left[ \sum_{t=1}^{T-1} \gamma ^{t-1}r(s_t,a_t) \mid \pi_\theta, s_1 \right]
$$

for some discount factor $\gamma$. Our objective is to find the optimal set of parameters $\theta^*$ such that $\theta^*=\arg\max_\theta J(\theta)$.
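As a small illustration, the discounted sum inside the expectation can be computed for one recorded episode as follows. This is only a sketch: the function name `discounted_return` and the example rewards are hypothetical and not part of the project code.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^(t-1) * r_t for a list of rewards r_1, r_2, ..."""
    total = 0.0
    # Accumulate backwards so each earlier step applies one more factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical three-step episode with reward 1 at every step.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example_return)  # 1 + 0.5 + 0.25 = 1.75
```

Averaging this quantity over many episodes sampled with $\pi_\theta$ gives a Monte Carlo estimate of $J(\theta)$.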

Let $D^{\pi_\theta}(\tau)$ denote the probability distribution of a trajectory $\tau=(s_1, a_1, \cdots, s_{T-1}, a_{T-1}, s_T)$. That is,

$$
D^{\pi_\theta}(\tau) = \prod_{i=1}^{T-1} \pi_\theta(a_i\mid s_i) P(s_{i+1}\mid s_i, a_i).
$$

Then we have

$$
J(\theta) = \mathbb E_{\tau\sim D^{\pi_\theta}}[R(\tau)]
$$

where $R(\tau)=\sum_{t=1}^{T-1}\gamma^{t-1}r(s_t, a_t)$ is the discounted total reward along the trajectory $\tau$, and $r(s_t, a_t)$ denotes the expected reward given $s_t$ and $a_t$.

Under finite state and action spaces, the distribution $D^{\pi_\theta}(\tau)$ has finite support. Assuming that our policy $\pi_\theta$ is differentiable with respect to $\theta$, we have

$$
\begin{align*}
\nabla_\theta J(\theta) 
&= \nabla_\theta \sum_{\tau:D^{\pi_\theta}(\tau)>0}D^{\pi_\theta}(\tau) R(\tau)
\\
&= \sum_{\tau:D^{\pi_\theta}(\tau)>0}D^{\pi_\theta}(\tau) \frac{\nabla_\theta D^{\pi_\theta}(\tau)}{D^{\pi_\theta}(\tau)}R(\tau)
\\
&= \sum_{\tau:D^{\pi_\theta}(\tau)>0}D^{\pi_\theta}(\tau) \nabla_\theta \log D^{\pi_\theta}(\tau) R(\tau)
\\
&= \mathbb E_{\tau\sim D^{\pi_\theta}}\left[ \nabla_\theta \log D^{\pi_\theta}(\tau) R(\tau) \right].
\end{align*}
$$
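The log-derivative identity used above can be checked numerically on a toy problem. In the sketch below, a single categorical outcome $x$ with softmax probabilities stands in for the trajectory $\tau$, and an arbitrary table `f` stands in for $R(\tau)$; these stand-ins and all names are assumptions made purely for illustration.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def objective(theta, f):
    """J(theta) = sum_x p_theta(x) f(x)."""
    p = softmax(theta)
    return sum(px * fx for px, fx in zip(p, f))


def score_function_gradient(theta, f):
    """E_x[ f(x) * grad_theta log p_theta(x) ], computed exactly."""
    p = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for x in range(n):
        for k in range(n):
            # For a softmax, d log p(x) / d theta_k = 1[k == x] - p_k.
            dlogp = (1.0 if k == x else 0.0) - p[k]
            grad[k] += p[x] * f[x] * dlogp
    return grad


def finite_difference_gradient(theta, f, eps=1e-6):
    """Central-difference approximation of grad_theta J(theta)."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((objective(up, f) - objective(down, f)) / (2 * eps))
    return grad


theta = [0.3, -0.2, 0.1]  # hypothetical parameters
f = [1.0, -2.0, 0.5]      # hypothetical "returns" for each outcome
analytic = score_function_gradient(theta, f)
numeric = finite_difference_gradient(theta, f)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
print(max_gap)  # should be tiny if the identity holds
```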

For any given trajectory, we then have

$$
\begin{align*}
\nabla_\theta \log D^{\pi_\theta}(\tau) 
&= \nabla_\theta \log \left( \prod_{t=1}^{T-1} \pi_\theta(a_t\mid s_t) P(s_{t+1} \mid s_t, a_t) \right)
\\
&= \sum_{t=1}^{T-1} \nabla_\theta \log \pi_\theta(a_t\mid s_t) + \sum_{t=1}^{T-1}\nabla_\theta \log P(s_{t+1}\mid s_t, a_t)
\\
&= \sum_{t=1}^{T-1} \nabla_\theta \log \pi_\theta(a_t\mid s_t)
\end{align*}
$$

where the second sum vanishes because the transition function $P(s_{t+1} \mid s_t, a_t)$ does not depend on $\theta$.

Substituting this into the expression for $\nabla_\theta J(\theta)$, we have

$$
\begin{align*}
\nabla_\theta J(\theta) 
&= \mathbb E_\tau \left[ R(\tau) \nabla_\theta \log(D^{\pi_\theta}(\tau)) \right]
\\
&= \mathbb E_\tau \left[ R(\tau) \sum_{t=1}^{T-1} \nabla_\theta \log (\pi_\theta(a_t\mid s_t)) \right].
\end{align*}
$$

Equivalently, as the Policy Gradient Theorem ([Sutton, McAllester, et al. (1999)]()) states,

$$
\nabla_\theta J(\theta) = \mathbb E_{\pi_\theta} \left[Q^{\pi_\theta}(s,a) \nabla_\theta \log \pi_\theta(a\mid s) \right].
$$


Because we want to maximise $J(\theta)$, the parameters are updated by gradient ascent:

$$
\begin{align*}
\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)
\end{align*}
$$

where $\alpha$ is the learning rate.
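To make the update concrete, here is a minimal sketch of sample-based gradient ascent on a toy problem: a two-armed bandit, where each episode is a single action and the return is that arm's reward. The environment, the softmax parameterisation and every name below are assumptions chosen for illustration; they are not the LunarLander setup used later.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


random.seed(0)
arm_rewards = [0.0, 1.0]  # hypothetical rewards: arm 1 is the better arm
theta = [0.0, 0.0]        # one logit per arm
alpha = 0.1               # learning rate

for _ in range(2000):
    probs = softmax(theta)
    # Sample one single-step "trajectory" from the current policy.
    action = random.choices([0, 1], weights=probs)[0]
    ret = arm_rewards[action]
    for k in range(2):
        # grad_theta_k log pi(action) for a softmax policy.
        grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
        # Gradient ascent step on the one-sample estimate R * grad log pi.
        theta[k] += alpha * ret * grad_log_pi

final_probs = softmax(theta)
print(final_probs)  # the probability of arm 1 should be close to 1
```

Each step uses a single sampled return, which is exactly the kind of noisy estimate that motivates the variance-reduction techniques discussed next.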

### Actor-Critic

The vanilla policy gradient method suffers from high variance, meaning that the updates to the policy can be noisy and unstable. This high variance arises because the updates rely on estimating the total future reward, which can vary significantly from one episode to another.

Actor-critic methods address this issue by using two separate networks. As their names suggest, the actor network decides which action to take, and the critic network evaluates the actions taken by the actor by estimating a value function.

We approximate the action-value function $Q^{\pi_\theta}(s,a)$ by a function approximator $Q_w(s,a)$ (a neural network with parameters $w$). During training, we update the two networks simultaneously: the critic's estimates drive the actor's policy updates, and the actor's behaviour generates the data on which the critic is trained.

### Advantage Actor-Critic

A further way to reduce variance is to subtract a "baseline" from the term that multiplies $\nabla_\theta \log \pi_\theta(a\mid s)$ in the policy gradient. We now show that doing so leaves the expectation of the gradient unchanged.

For any function of the state $b(s)$ we have 

$$
\begin{align*}
\mathbb E_{\pi_\theta}\left[ b(s)\nabla_\theta \log \pi_\theta(a\mid s) \right]
&= \int_{s\in\mathcal S}D^{\pi_\theta}(s)\int_{a\in\mathcal A} \pi_\theta(a\mid s)\, b(s)\nabla_\theta \log \pi_\theta(a \mid s)\,da\,ds
\\
&= \int_{s\in\mathcal S}D^{\pi_\theta}(s)\int_{a\in\mathcal A} b(s)\nabla_\theta \pi_\theta(a \mid s)\,da\,ds
\\
&= \int_{s\in\mathcal S}D^{\pi_\theta}(s)\, b(s)\nabla_\theta\int_{a\in\mathcal A}\pi_\theta(a\mid s)\,da\,ds
\\
&= \int_{s\in\mathcal S}D^{\pi_\theta}(s)\, b(s)\nabla_\theta 1\,ds
\\
&= 0
\end{align*}
$$

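where $\mathcal S$ is the state space, $\mathcal A$ is the action space, and $D^{\pi_\theta}(s)$ here denotes the state distribution induced by $\pi_\theta$. Since this expectation is zero, subtracting $b(s)$ from $Q^{\pi_\theta}(s,a)$ in the policy gradient does not change its expectation.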
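The derivation shows that the expectation is preserved; the effect on variance can be seen in a small exact calculation. The sketch below assumes a single state, three actions, a softmax policy and made-up action values `q`, and uses the state value as the baseline. All names and numbers are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


theta = [0.0, 0.5, -0.5]  # hypothetical policy logits for one state
q = [10.0, 11.0, 12.0]    # hypothetical action values Q(s, a)
pi = softmax(theta)
v = sum(p * qa for p, qa in zip(pi, q))  # V(s) = sum_a pi(a|s) Q(s, a)


def grad_log_pi(action, k):
    """d log pi(action) / d theta_k for a softmax policy."""
    return (1.0 if k == action else 0.0) - pi[k]


def mean_and_variance(baseline, k):
    """Exact mean and variance of (Q(a) - b) * grad_k log pi(a) for a ~ pi."""
    samples = [(q[a] - baseline) * grad_log_pi(a, k) for a in range(len(pi))]
    mean = sum(p * g for p, g in zip(pi, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(pi, samples))
    return mean, var


# E[ b(s) * grad log pi ] with b(s) = V(s): every component should vanish.
baseline_term = [
    sum(pi[a] * v * grad_log_pi(a, k) for a in range(3)) for k in range(3)
]
stats_plain = [mean_and_variance(0.0, k) for k in range(3)]
stats_baselined = [mean_and_variance(v, k) for k in range(3)]

print(baseline_term)
for plain, based in zip(stats_plain, stats_baselined):
    print("mean:", plain[0], based[0], "variance:", plain[1], based[1])
```

In this example the means agree for every parameter while the variances with the baseline are much smaller, because the baseline removes the large common offset in the action values.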

Although it differs from the optimal baseline suggested by [Williams, 1992](), we often use the state-value function $V^{\pi_\theta}(s)$ as the baseline. Choosing $b(s)=V^{\pi_\theta}(s)$, we define the advantage function $A^{\pi_\theta}$ such that

$$
\begin{align*}
A^{\pi_\theta}(s,a) 
&= Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)
\\
&= r(s,a) + \gamma\,\mathbb E_{s'\sim P(\cdot\mid s,a)}\left[ V^{\pi_\theta}(s') \right] - V^{\pi_\theta}(s)
\end{align*}
$$

so that, with $s'\sim P(\cdot\mid s,a)$ denoting the next state, we have

$$
\begin{align*}
\nabla_\theta J(\theta)
&= \mathbb E_{\pi_\theta} \left[ A^{\pi_\theta}(s,a) \nabla_\theta \log \pi_\theta(a\mid s) \right]
\\
&= \mathbb E_{\pi_\theta} \left[ \left( r(s,a) + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s) \right) \nabla_\theta \log \pi_\theta(a\mid s)\right]
\\
&\approx \mathbb E_{\pi_\theta} \left[ \left( r(s,a) + \gamma V_w(s') - V_w(s) \right) \nabla_\theta \log \pi_\theta(a\mid s) \right]
\end{align*}
$$

where, in the last line, we approximate the value function $V^{\pi_\theta}$ by a neural network $V_w$ with parameters $w$.
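The bracketed term is a one-step temporal-difference estimate of the advantage. Below is a minimal sketch of how it might be computed for a batch of transitions from plain Python floats. The function name `td_advantages`, the discount factor `gamma` (the $\gamma$ from the definition of $J(\theta)$) and the choice to drop the bootstrap term on terminal transitions are assumptions of this sketch, not necessarily what the implementation in Part 2 does.

```python
def td_advantages(rewards, values, next_values, dones, gamma):
    """One-step advantage estimates r + gamma * V(s') - V(s).

    values[i] and next_values[i] are the critic's estimates V_w(s) and
    V_w(s') for transition i. dones[i] marks terminal transitions.
    """
    advantages = []
    for r, v, v_next, done in zip(rewards, values, next_values, dones):
        # Assumption: no bootstrapping from the state after a terminal step.
        bootstrap = 0.0 if done else gamma * v_next
        advantages.append(r + bootstrap - v)
    return advantages


# Two hypothetical transitions; the second one ends the episode.
adv = td_advantages(
    rewards=[1.0, 0.0],
    values=[0.5, 1.0],
    next_values=[1.0, 2.0],
    dones=[False, True],
    gamma=0.9,
)
print(adv)  # roughly [1.4, -1.0]
```

A positive estimate means the action turned out better than the critic expected, so the actor update raises its log-probability; a negative estimate lowers it.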

![a2c-algorithm](../../figures/a2c.png)

#### Continuous action space

## Part 2: Implementation

### Load libraries

### Define actor and critic networks

### Define A2C agent

### Define train function

### Train agent

### Visualise training results

### Define test function

### Test agent