## A3C 
- Asynchronous Advantage Actor-Critic
- Actor: the policy
- Critic: the value function
- Advantage = G - V(s), where G is the return

## Policy Gradient Review

\begin{equation*}
\pi(a|s,\theta_p) = \text{NeuralNet}(\text{input}: s, \text{weights}: \theta_p)\\
V(s,\theta_v) = \text{NeuralNet}(\text{input}: s, \text{weights}: \theta_v)
\end{equation*}

- For the policy loss, work backwards from the policy gradient (choose a loss whose gradient is the policy-gradient estimate); for the value loss, use the squared error
\begin{equation*}
L_p = -(G - V(s,\theta_v))\log\pi(a|s,\theta_p)\\
L_v = (G - V(s,\theta_v))^2
\end{equation*}
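As a small numeric sketch of the two losses above, assuming a discrete action space and a single sample. The function name `a3c_losses` and the example numbers are illustrative and not from the source; the advantage is treated as a constant in the policy loss.

```python
import math

def a3c_losses(G, v, action_probs, action):
    """Per-sample policy and value losses.

    G: sampled return, v: critic estimate V(s),
    action_probs: actor output pi(.|s), action: index of the action taken.
    """
    advantage = G - v  # treated as a constant when differentiating the policy loss
    policy_loss = -advantage * math.log(action_probs[action])
    value_loss = (G - v) ** 2
    return policy_loss, value_loss

policy_loss, value_loss = a3c_losses(G=1.0, v=0.5, action_probs=[0.25, 0.75], action=1)
```

A positive advantage makes the policy loss decrease as the probability of the chosen action increases, which is the behavior the policy gradient asks for.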


- Pseudocode
\begin{equation*}
\theta_p = \theta_p - \text{learning rate} \times \partial L_p/\partial\theta_p \\
\theta_v = \theta_v - \text{learning rate} \times \partial L_v/\partial\theta_v
\end{equation*}
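The update rule can be tried on a toy critic, assuming V(s) is a single constant parameter so that the gradient of the squared error is easy to write by hand. The helper `sgd_step` and the toy setup are hypothetical, chosen only to show the update converging.

```python
def sgd_step(theta, grad, learning_rate):
    """One gradient descent step: theta <- theta - learning_rate * grad."""
    return [t - learning_rate * g for t, g in zip(theta, grad)]

# Toy critic: V(s) = theta_v[0], so L_v = (G - theta_v[0])^2
# and dL_v/dtheta_v = -2 * (G - theta_v[0]).
G = 1.0
theta_v = [0.0]
for _ in range(100):
    grad = [-2.0 * (G - theta_v[0])]
    theta_v = sgd_step(theta_v, grad, learning_rate=0.1)
```

After enough steps the parameter approaches the target return G, the minimizer of the squared error.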


- N-step return
- Instead of the one-step TD(0) target, use the N-step return (shown here for N = 3)
\begin{equation*}
V(s)= r + \gamma r' + \gamma^2 r'' + \gamma^3 V(s''')
\end{equation*}
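A minimal sketch of the N-step target, assuming the N rewards have already been collected and the critic supplies the bootstrap value for the state reached after them. The function name and the numbers are illustrative.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Compute r + gamma*r' + ... + gamma^N * V(s_N) by working backwards."""
    G = bootstrap_value
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Three rewards followed by a critic estimate of 10.0 for the state reached afterwards.
G3 = n_step_return(rewards=[1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.9)
```

Working backwards avoids computing powers of gamma explicitly: each step multiplies the running total by gamma once.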


- Entropy Regularization
- Definition of entropy
\begin{equation*}
H = -\sum_{k=1}^n \pi_k \log\pi_k
\end{equation*}

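- New loss (C > 0: regularization constant); the entropy is subtracted so that minimizing the loss favors higher-entropy, more exploratory policies
\begin{equation*}
L_p' = L_p - CH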
\end{equation*}
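To make the entropy definition concrete, here is a short sketch for a discrete policy. The function name and the two example distributions are assumptions for illustration.

```python
import math

def entropy(probs):
    """H = -sum_k pi_k * log(pi_k); terms with pi_k = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximal entropy for 4 actions: log(4)
peaked = entropy([0.97, 0.01, 0.01, 0.01])   # nearly deterministic policy
```

The uniform policy has the highest entropy, so an entropy term in the objective discourages the policy from collapsing onto one action too early.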


![](https://cn.bing.com/th?id=OIP.TlKyrDc2rVlD0tNkkderjAHaGs&pid=Api&rs=1&p=0)

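- A3C achieves training stability with a different mechanism than experience replay: several agents run in parallel, each in its own copy of the environment, and update shared parameters asynchronously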

## Review of MDPs

An MDP is a collection of five things:  

- states
- actions
- rewards
- state transition probabilities
- discount factor (gamma)

### Markov Property

$p[s(t+1),r(t+1) \mid s(t),a(t),\dots,s(1),a(1)] = p[s(t+1),r(t+1) \mid s(t),a(t)]$

### Discount Factor

$G(t) = \sum_{\tau = 0}^{\infty}\gamma^{\tau}R(t+\tau+1)$
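For a finite episode the return can be computed for every time step in one backward pass. This is a sketch under the assumption that `rewards[t]` stores the reward received after step t, i.e. R(t+1) in the formula above; the function name is illustrative.

```python
def discounted_returns(rewards, gamma):
    """returns[t] = sum over k of gamma^k * rewards[t + k]."""
    returns = [0.0] * len(rewards)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```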

### State value and state action value

$V_{\pi}(s) = E_{\pi}[G(t)|S_{t}=s]$  
$Q_{\pi}(s,a) = E_{\pi}[G(t)|S_{t}=s,A_{t}=a]$

$\pi(s) = \arg\max_{a} Q(s,a)$

## Dynamic Programming

$V_{\pi}(s) = \sum_{a}\pi(a|s)\sum_{s'}\sum_{r}p(s',r|s,a)\,(r+\gamma V_{\pi}(s'))$  

![](https://lilianweng.github.io/lil-log/assets/images/TD_MC_DP_backups.png)

### Iterative Policy Evaluation
- Prediction problem: given a policy, find its value function
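A minimal sketch of iterative policy evaluation, which applies the Bellman equation repeatedly until the values stop changing. The model format (`P[s][a]` as a list of `(prob, next_state, reward)` tuples) and the tiny two-state MDP are assumptions made for this example.

```python
def policy_evaluation(P, policy, gamma, tol=1e-10):
    """Sweep all states, replacing V(s) by its Bellman expectation backup."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                policy[s][a] * prob * (r + gamma * V[s_next])
                for a in P[s]
                for prob, s_next, r in P[s][a]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# State 0: action 0 stays in place (reward 0), action 1 moves to terminal state 1 (reward 1).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {},  # terminal state: no actions, value stays 0
}
policy = {0: {0: 0.5, 1: 0.5}, 1: {}}  # uniform random policy
V = policy_evaluation(P, policy, gamma=0.9)
```

For this MDP the fixed point can be checked by hand: V(0) = 0.5 * 0.9 * V(0) + 0.5, so V(0) = 0.5 / 0.55.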

### Policy Iteration

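```python
# Sketch only: `converged`, `policy`, `evaluate_policy` and `greedy_policy` are placeholders.
while not converged:
    V = evaluate_policy(policy)        # step 1: policy evaluation of the current policy
    new_policy = greedy_policy(V)      # step 2: policy improvement (take the argmax of Q(s,a))
    converged = new_policy == policy   # stop once the policy no longer changes
    policy = new_policy
```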
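The two steps can be made runnable on a small tabular problem. This is a sketch under assumptions: the model format (`P[s][a]` as a list of `(prob, next_state, reward)` tuples), the helper `q_value`, and the three-state chain are all invented for the example.

```python
def q_value(P, V, s, a, gamma):
    """Q(s,a) computed from the model and the current value estimates."""
    return sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[s][a])

def policy_iteration(P, gamma, tol=1e-10):
    # Start from an arbitrary deterministic policy: the first listed action.
    policy = {s: next(iter(P[s])) for s in P if P[s]}
    V = {s: 0.0 for s in P}
    while True:
        # Step 1: policy evaluation of the current policy.
        while True:
            delta = 0.0
            for s in policy:
                v_new = q_value(P, V, s, policy[s], gamma)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Step 2: policy improvement (argmax over Q(s,a)).
        new_policy = {
            s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in policy
        }
        if new_policy == policy:
            return policy, V
        policy = new_policy

# Chain 0 <-> 1 -> 2 (terminal); only the move from 1 into the terminal state pays reward 1.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {},
}
policy, V = policy_iteration(P, gamma=0.9)
```

The loop stops when an improvement step leaves the policy unchanged, at which point the policy is greedy with respect to its own value function.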

### Value Iteration

- The max over actions is the same idea Q-learning uses, but here it is applied with a known model

$V_{k+1}(s) = \max_{a}\sum_{s'}\sum_{r}p(s',r|s,a)\,(r+\gamma V_{k}(s'))$  
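A sketch of value iteration on a small tabular problem, where each sweep backs up the max over actions using the values from the previous sweep. The model format (`P[s][a]` as a list of `(prob, next_state, reward)` tuples) and the three-state chain are assumptions for illustration; values are updated in place, which is a common variant.

```python
def value_iteration(P, gamma, tol=1e-10):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:  # terminal state: no actions, value stays 0
                continue
            v_new = max(
                sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# Chain 0 <-> 1 -> 2 (terminal); only the move from 1 into the terminal state pays reward 1.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {},
}
V = value_iteration(P, gamma=0.9)
```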

### Dynamic Programming Summary

- DP is often not practical
- The state space may be very large, and every sweep visits all states
- It does not learn from experience; it requires the full model p(s',r|s,a)
- MC and TD learning need no model of the environment

## Monte Carlo Review

- unlike DP, MC is all about learning from experience

$V(s) = E[G(t)|S(t)=s]\approx \frac{1}{N}\sum_{i=1}^{N}G_{i,s}$
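The sample average can be sketched as follows, assuming every-visit Monte Carlo and episodes stored as lists of `(state, reward)` pairs where the reward is the one received after leaving that state. The data layout and the function name are assumptions for this example.

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma):
    """Average the sampled returns observed from each state."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so each return is built incrementally.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 0.0)],
]
V = mc_value_estimate(episodes, gamma=1.0)
```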

### Monte Carlo Control

1. Initialize a random policy
2. While not converged:
    - a. Play an episode and calculate the returns for each state
    - b. Improve the policy based on the current Q(s,a) (take the argmax)
    

## TD 

- MC: sample returns based on an episode
- TD: estimate returns based on current value function estimate

TD(0)  

$V(S_{t}) \leftarrow V(S_{t})+\alpha[r+\gamma V(S_{t+1})-V(S_{t})]$  
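A single TD(0) update on a tabular value function might look like this. The function name, the `done` flag for terminal transitions, and the numbers are assumptions for illustration.

```python
def td0_update(V, s, r, s_next, alpha, gamma, done=False):
    """Move V(s) a fraction alpha towards the bootstrapped target r + gamma * V(s_next)."""
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V

V = {"A": 0.0, "B": 1.0}
td0_update(V, s="A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
```

Unlike the Monte Carlo estimate, the update uses only one transition and the current estimate of the next state's value, so it can be applied before the episode ends.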

### TD Control

#### SARSA

$Q(s,a) \leftarrow Q(s,a) + \alpha [r+\gamma Q(s',a') - Q(s,a)]$  
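where $a'$ is the action actually taken in $s'$ under the current policy (e.g. $\epsilon$-greedy with respect to $Q(s',\cdot)$), which makes SARSA on-policy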
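A sketch of one SARSA step on a tabular Q function, assuming the next action is chosen ε-greedily. The helper names and the two-state example are hypothetical; `epsilon=0.0` is used only to make the example deterministic.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Update Q(s,a) towards r + gamma * Q(s', a') for the action a' actually chosen."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

actions = ["left", "right"]
Q = {(s, a): 0.0 for s in ("A", "B") for a in actions}
Q[("B", "right")] = 1.0

a_next = epsilon_greedy(Q, "B", actions, epsilon=0.0)  # purely greedy for this example
sarsa_update(Q, "A", "right", 0.0, "B", a_next, alpha=0.5, gamma=0.9)
```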

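#### Q-Learning
- Off-policy: the target uses the greedy action in $s'$, regardless of the action the agent actually takes next

$Q(s,a) \leftarrow Q(s,a) + \alpha [r+\gamma \max_{a'}Q(s',a') - Q(s,a)]$  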
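The corresponding tabular update, as a sketch with a hypothetical function name and a two-state example. The only difference from SARSA is that the target takes the max over next actions instead of the value of the action actually chosen.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Update Q(s,a) towards r + gamma * max over a' of Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

actions = ["left", "right"]
Q = {(s, a): 0.0 for s in ("A", "B") for a in actions}
Q[("B", "left")] = 0.2
Q[("B", "right")] = 1.0

q_learning_update(Q, "A", "right", 0.0, "B", actions, alpha=0.5, gamma=0.9)
```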
