# Choosing the right algorithm

> This is a summary of the lecture "A complete reinforcement learning system (capstone)" from Coursera.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Coursera, Reinforcement_Learning]
- image: 

## Expected SARSA

### The Bellman equation for action-values

$$ q_{\pi}(s, a) = \sum\limits_{s', r} p(s', r \vert s, a) \big( r + \gamma \sum\limits_{a'} \pi(a' \vert s') q_{\pi}(s', a')\big) $$

$$ \text{SARSA: } \quad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big) \\
S_{t+1} \sim p(s', r \vert s, a) \\
A_{t+1} \sim \pi(a' \vert s') $$

### The expected SARSA algorithm

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big(R_{t+1} + \gamma \sum\limits_{a'} \pi(a' \vert S_{t+1}) Q(S_{t+1}, a') - Q(S_t, A_t)\big) $$
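The update above can be sketched for a tabular action-value function. This is a minimal illustration, not the course's reference code: the names `epsilon_greedy_probs` and `expected_sarsa_update` are hypothetical, `Q` is assumed to be a nested list indexed as `Q[state][action]`, and the target policy is assumed to be $\epsilon$-greedy with respect to `Q`.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return the epsilon-greedy action probabilities for one state."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])  # first maximiser on ties
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon, terminal=False):
    """Apply one Expected SARSA update to Q[s][a] in place and return the TD error."""
    if terminal:
        expected_next = 0.0  # the value of a terminal state is zero
    else:
        probs = epsilon_greedy_probs(Q[s_next], epsilon)
        # Expectation over next actions instead of a single sampled A_{t+1}.
        expected_next = sum(p * q for p, q in zip(probs, Q[s_next]))
    td_error = r + gamma * expected_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Illustrative numbers only: two states, two actions.
Q = [[0.0, 0.0], [1.0, 3.0]]
delta = expected_sarsa_update(Q, s=0, a=1, r=1.0, s_next=1,
                              alpha=0.5, gamma=0.9, epsilon=0.2)
```

With these numbers the expected next value is $0.1 \cdot 1 + 0.9 \cdot 3 = 2.8$, so the TD error is $1 + 0.9 \cdot 2.8 = 3.52$ and `Q[0][1]` moves halfway towards it.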

## Q-learning

### The Q-learning algorithm

- Q-learning (off-policy TD control) for estimating $\pi \approx \pi_*$

$\begin{aligned} 
&\text{Algorithm parameters: step size } \alpha \in (0, 1], \text{ small } \epsilon > 0 \\
&\text{Initialize } Q(s, a), \text{ for all } s \in \mathcal{S}^+, a \in \mathcal{A}(s), \text{ arbitrarily except that } Q(terminal, \cdot) = 0 \\
\newline
&\text{Loop for each episode:} \\
&\quad \text{Initialize } S \\
&\quad \text{Loop for each step of episode:} \\
&\qquad \text{Choose } A \text{ from } S \text{ using policy derived from } Q \text{ (e.g., } \epsilon\text{-greedy)} \\
&\qquad \text{Take action } A, \text{ observe } R, S' \\
&\qquad Q(S, A) \leftarrow Q(S, A) + \alpha[R + \gamma \max_a Q(S', a) - Q(S, A)] \\
&\qquad S \leftarrow S' \\
&\quad \text{until } S \text{ is terminal} \\
\end{aligned}$
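The pseudocode above can be sketched in plain Python. The environment here is an assumption made only for illustration: a hypothetical five-state corridor in which every move costs $-1$ and the rightmost state is terminal. The function names (`step`, `epsilon_greedy`, `q_learning`) and the hyperparameter values are likewise illustrative choices.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (assumed toy corridor)
LEFT, RIGHT = 0, 1


def step(state, action):
    """Hypothetical corridor dynamics: each move yields reward -1."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    return -1.0, next_state


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q(terminal, .) stays 0
    for _ in range(episodes):
        s = 0                                    # Initialize S
        while s != N_STATES - 1:                 # until S is terminal
            a = epsilon_greedy(Q[s], epsilon, rng)
            r, s_next = step(s, a)
            # Off-policy target: bootstrap from the greedy value of S'.
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q


Q = q_learning()
greedy_policy = [max((LEFT, RIGHT), key=lambda a: Q[s][a])
                 for s in range(N_STATES - 1)]
```

In this corridor the greedy policy extracted from `Q` should choose `RIGHT` in every non-terminal state, since that is the shortest path to the terminal state.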

### Revisiting Bellman equations

$$ \text{SARSA: } Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big) \\
q_{\pi}(s, a) = \sum\limits_{s', r}p(s', r \vert s, a) \big( r + \gamma \sum\limits_{a'}\pi(a' \vert s') q_{\pi}(s', a') \big) $$

$$ \text{Q-learning: } Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( R_{t+1} + \gamma \max \limits_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \big) \\
q_{*}(s, a) = \sum\limits_{s', r} p(s', r \vert s, a) \big( r+ \gamma \max\limits_{a'} q_{*}(s', a') \big)$$

### Connections to Dynamic Programming

SARSA $ \sim $ Policy Iteration

Q-learning $ \sim $ Value Iteration

## Average Reward: A New Way of Formulating Control Problems

### The average reward objective

$$ r(\pi) \doteq \lim\limits_{h \to \infty} \frac{1}{h} \sum_{t=1}^h \mathbb{E}[R_t \vert S_0, A_{0:t-1} \sim \pi ] $$

$$ r(\pi) = \sum_s \mu_{\pi}(s) \sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a) r $$
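The second form of $r(\pi)$ can be evaluated directly once the stationary distribution $\mu_{\pi}$ is known. The sketch below assumes a made-up two-state, two-action MDP with expected rewards `R[s][a]` (so the sum over $s', r$ collapses to an expected reward), and approximates $\mu_{\pi}$ by repeatedly applying the state-transition matrix. All names and numbers are hypothetical.

```python
# Hypothetical MDP, numbers chosen only for illustration.
P = [  # P[s][a][s2] = probability of moving from s to s2 under action a
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
R = [[1.0, 0.0], [2.0, 3.0]]     # R[s][a] = expected immediate reward
pi = [[0.5, 0.5], [0.5, 0.5]]    # pi[s][a] = probability of action a in s


def state_transition_matrix(P, pi):
    """Average the action-conditional transitions under the policy."""
    n = len(P)
    return [[sum(pi[s][a] * P[s][a][s2] for a in range(len(pi[s])))
             for s2 in range(n)] for s in range(n)]


def stationary_distribution(P_pi, iterations=200):
    """Approximate mu_pi by power iteration: mu <- mu P_pi."""
    n = len(P_pi)
    mu = [1.0 / n] * n
    for _ in range(iterations):
        mu = [sum(mu[s] * P_pi[s][s2] for s in range(n)) for s2 in range(n)]
    return mu


def average_reward(P, R, pi):
    """r(pi) = sum_s mu(s) sum_a pi(a|s) R[s][a]."""
    mu = stationary_distribution(state_transition_matrix(P, pi))
    return sum(mu[s] * sum(pi[s][a] * R[s][a] for a in range(len(pi[s])))
               for s in range(len(mu)))


mu = stationary_distribution(state_transition_matrix(P, pi))
r_pi = average_reward(P, R, pi)   # approximately 1.7 for these numbers
```

For these numbers the stationary distribution is $\mu_{\pi} = (0.4, 0.6)$, and the expected one-step rewards are $0.5$ and $2.5$, which gives $r(\pi) = 0.4 \cdot 0.5 + 0.6 \cdot 2.5 = 1.7$.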

### Value Functions for Average reward

$$ G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \dots $$

$$ q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \vert S_t = s, A_t = a ] $$

$$ q_{\pi}(s, a) = \sum\limits_{s', r} p(s', r \vert s, a) \big(r - r(\pi) + \sum\limits_{a'}\pi(a' \vert s') q_{\pi}(s', a') \big) $$

### Differential semi-gradient SARSA for estimating $\hat{q} \approx q_{*}$

$\begin{aligned}
&\text{Input: a differentiable action-value function parameterization } \hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R} \\
&\text{Algorithm parameters: step sizes } \alpha, \beta > 0 \\
&\text{Initialize value-function weights } w \in \mathbb{R}^d \text{ arbitrarily (e.g., } w=0 \text{)} \\
&\text{Initialize average reward estimate } \bar{R} \in \mathbb{R} \text{ arbitrarily (e.g., } \bar{R} = 0 \text{)} \\
\newline
&\text{Initialize state } S, \text{ and action } A \\
&\text{Loop for each step: } \\
&\quad \text{Take action } A, \text{ observe } R, S' \\
&\quad \text{Choose } A' \text{ as a function of } \hat{q}(S', \cdot, w) \text{ (e.g., } \epsilon\text{-greedy)} \\
&\quad \delta \leftarrow R - \bar{R} + \hat{q}(S', A', w) - \hat{q}(S, A, w) \\
&\quad \bar{R} \leftarrow \bar{R} + \beta \delta \\
&\quad w \leftarrow w + \alpha \delta \nabla \hat{q}(S, A, w) \\
&\quad S \leftarrow S' \\
&\quad A \leftarrow A' \\
\end{aligned}$
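One iteration of the loop above can be sketched as follows, assuming a linear parameterization $\hat{q}(s, a, w) = w^\top x(s, a)$ so that $\nabla \hat{q}(S, A, w)$ is simply the feature vector. The function names and the one-hot features in the example are assumptions made for illustration.

```python
def q_hat(w, features):
    """Linear action-value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, features))


def differential_sarsa_step(w, avg_reward, x, r, x_next, alpha, beta):
    """One differential semi-gradient SARSA update with a linear q_hat.

    x and x_next are the feature vectors of (S, A) and (S', A').
    Returns the new weights, the new average-reward estimate, and delta.
    """
    delta = r - avg_reward + q_hat(w, x_next) - q_hat(w, x)
    new_avg_reward = avg_reward + beta * delta
    # For a linear q_hat, the gradient with respect to w is the feature vector x.
    new_w = [wi + alpha * delta * xi for wi, xi in zip(w, x)]
    return new_w, new_avg_reward, delta


# Illustrative one-hot features over four (state, action) pairs.
w = [0.0, 1.0, 0.0, 2.0]
x = [1, 0, 0, 0]        # features of (S, A)
x_next = [0, 0, 0, 1]   # features of (S', A')
w, avg_reward, delta = differential_sarsa_step(
    w, avg_reward=0.5, x=x, r=1.0, x_next=x_next, alpha=0.1, beta=0.01)
```

Here $\delta = 1 - 0.5 + 2 - 0 = 2.5$, so the average-reward estimate moves to $0.525$ and only the weight of the visited pair changes (to $0.25$). Note that there is no discount factor: the average-reward estimate $\bar{R}$ takes its place in the TD error.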

## Actor-Critic Algorithm

### Approximating the Action Value in the policy update

Using one-step bootstrapping, the action value $q_{\pi}(S_t, A_t)$ is approximated by $R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, w)$:

$ \begin{aligned} \theta_{t+1} &\doteq \theta_t + \alpha \nabla \ln \pi(A_t \vert S_t, \theta_t) q_{\pi}(S_t, A_t) \\
 &\approx \theta_t + \alpha \nabla \ln \pi(A_t \vert S_t, \theta_t)[ R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, w)] \end{aligned} $
 
![ac](image/actor_critic.png)

### Subtracting the current state's value estimate

$ \theta_{t+1} \doteq \theta_t + \alpha \nabla \ln \pi(A_t \vert S_t, \theta_t) [ R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)] $

The bracketed term $R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)$ is the TD error $\delta_t$. Subtracting $\hat{v}(S_t, w)$ does not change the expected update.

### Adding a baseline

$\mathbb{E}_{\pi}[ \nabla \ln \pi(A_t \vert S_t, \theta_t) [R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)] \vert S_t = s ] \\
= \mathbb{E}_{\pi}[ \nabla \ln \pi(A_t \vert S_t, \theta_t) [ R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, w)] \vert S_t = s] - \mathbb{E}_{\pi} [ \nabla \ln \pi (A_t \vert S_t, \theta_t) \hat{v}(S_t, w) \vert S_t = s] $

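The baseline term $\mathbb{E}_{\pi}[ \nabla \ln \pi(A_t \vert S_t, \theta_t) \hat{v}(S_t, w) \vert S_t = s]$ is 0, because $\hat{v}(S_t, w)$ does not depend on the action and $\sum_a \nabla \pi(a \vert s, \theta) = \nabla 1 = 0$. Subtracting the baseline therefore leaves the expected update unchanged, but it reduces the variance of the update, which tends to speed up learning.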
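The claim that the baseline term vanishes can be checked numerically. The sketch below assumes a softmax policy over action preferences, for which $\partial \ln \pi(a) / \partial \theta_b = \mathbb{1}[a = b] - \pi(b)$. The preference values and the state-value estimate are arbitrary illustrative numbers.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(prefs, action):
    """Gradient of ln pi(action) with respect to each preference."""
    probs = softmax(prefs)
    return [(1.0 if b == action else 0.0) - probs[b] for b in range(len(prefs))]


theta = [0.3, -1.2, 2.0]   # arbitrary action preferences for one state
v_hat_s = 7.5              # arbitrary state-value estimate (the baseline)

probs = softmax(theta)
baseline_term = [0.0] * len(theta)
for a, p in enumerate(probs):
    grad = grad_log_softmax(theta, a)
    for b in range(len(theta)):
        # Expectation over actions of grad ln pi(a) * v_hat(s).
        baseline_term[b] += p * grad[b] * v_hat_s
```

Every component of `baseline_term` is zero up to floating-point rounding, regardless of the value chosen for `v_hat_s`.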

### How the actor and the critic interact

![](image/actor_critic2.png)

$ \theta_{t+1} \doteq \theta_t + \alpha \nabla \ln \pi (A_t \vert S_t, \theta_t) \delta_t $

### Actor-Critic (continuing), for estimating $\pi_{\theta} \approx \pi_{*}$

$\begin{aligned}
&\text{Input: a differentiable policy parameterization } \pi(a \vert s, \theta) \\
&\text{Input: a differentiable state-value function parameterization } \hat{v}(s, w) \\
&\text{Initialize } \bar{R} \in \mathbb{R} \text{ to } 0 \\
&\text{Initialize state-value weights } w \in \mathbb{R}^d \text{ and policy parameter } \theta \in \mathbb{R}^{d'} \text{ (e.g., to } 0) \\
&\text{Algorithm parameters: } \alpha^w > 0, \alpha^{\theta} > 0, \alpha^{\bar{R}} > 0 \\
&\text{Initialize } S \in \mathcal{S} \\
&\text{Loop forever (for each time step):} \\
&\quad A \sim \pi(\cdot \vert S, \theta) \\
&\quad \text{Take action } A, \text{ observe } S', R \\
&\quad \delta \leftarrow R - \bar{R} + \hat{v}(S', w) - \hat{v}(S, w) \\
&\quad \bar{R} \leftarrow \bar{R} + \alpha^{\bar{R}} \delta \\
&\quad w \leftarrow w + \alpha^w \delta \nabla \hat{v}(S, w) \\
&\quad \theta \leftarrow \theta + \alpha^{\theta} \delta \nabla \ln \pi(A \vert S, \theta) \\
&\quad S \leftarrow S'
\end{aligned}$
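A single iteration of this loop can be sketched for the tabular case. The sketch assumes a softmax policy over per-state action preferences `theta[s][a]` and a table of state values `w[s]`, so that $\nabla \hat{v}(S, w)$ is a one-hot vector and $\nabla \ln \pi(A \vert S, \theta)$ has the closed form $\mathbb{1}[a = A] - \pi(a \vert S)$. The function name and the step sizes are illustrative.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, w, avg_reward, s, a, r, s_next,
                      alpha_theta, alpha_w, alpha_r):
    """One average-reward actor-critic update (tabular, softmax policy).

    Updates theta and w in place; returns the new average-reward
    estimate and the TD error delta.
    """
    delta = r - avg_reward + w[s_next] - w[s]
    avg_reward += alpha_r * delta
    probs = softmax(theta[s])            # policy at S before the actor update
    w[s] += alpha_w * delta              # critic: gradient of v_hat is one-hot
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_theta * delta * grad_log_pi   # actor
    return avg_reward, delta


# Illustrative numbers: two states, two actions.
theta = [[0.0, 0.0], [0.0, 0.0]]
w = [0.0, 1.0]
avg_reward, delta = actor_critic_step(
    theta, w, avg_reward=0.0, s=0, a=1, r=2.0, s_next=1,
    alpha_theta=0.2, alpha_w=0.5, alpha_r=0.1)
```

With these numbers $\delta = 2 - 0 + 1 - 0 = 3$. The positive TD error raises the preference for the action that was taken and lowers the other one by the same amount, while the critic raises its value estimate for state 0.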