# A2C versus DQN
- comparison of A2C to DQN on simple control tasks

### problem statement

The goal of RL is to maximize the expected state value $\mathbb{E}_\pi[V(s)]$.

Given a parametrized policy $\pi_\theta$, the criterion is: $J(\theta) = \mathbb{E}_\pi[V(s)]$

The expectation is taken with respect to both the environment and the policy. Let $d(s)$ be the probability of occupying state $s$ under $\pi_\theta$. Expanding the state value as $V(s) = \sum_a \pi_\theta(a|s) Q(s,a)$ gives a sum over states and actions, which can be estimated by sampling:

$\mathbb{E}_\pi[V(s)] = \sum_s d(s) \sum_a \pi_\theta(a|s) Q(s,a)$

The goal is therefore to find a $\theta$ that maximizes $J(\theta)$:

$\arg\max_\theta J(\theta) = \arg\max_\theta \mathbb{E}_\pi[V(s)] = \arg\max_\theta \sum_s d(s) \sum_a \pi_\theta(a|s) Q(s,a)$

One way to do this is gradient ascent on $J(\theta)$.
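
As a concrete illustration of the objective above, here is a minimal sketch that evaluates $J(\theta)$ for a tabular softmax policy. The two-state, two-action values for `d` and `Q` are made up for illustration, and they are held fixed here even though in a real problem both depend on the policy. The helper names `softmax` and `objective` are hypothetical and not part of `utils`.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, numerically stabilized."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def objective(theta, d, Q):
    """J(theta) = sum_s d(s) sum_a pi_theta(a|s) Q(s,a) for tabular logits theta."""
    pi = softmax(theta)                      # shape (n_states, n_actions)
    return float(np.sum(d * np.sum(pi * Q, axis=1)))

# hypothetical toy values (not taken from any experiment)
d = np.array([0.5, 0.5])                     # state occupancy d(s)
Q = np.array([[1.0, 0.0],
              [0.0, 2.0]])                   # action values Q(s,a)

theta_uniform = np.zeros((2, 2))             # uniform policy in both states
theta_better = np.array([[5.0, 0.0],
                         [0.0, 5.0]])        # favors the higher-valued action

J_uniform = objective(theta_uniform, d, Q)
J_better = objective(theta_better, d, Q)
```

Shifting probability mass toward the higher-valued action in each state raises `J`, which is what gradient ascent on $\theta$ is meant to achieve.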


### policy gradient

Importantly, the state visitation probability $d(s)$ depends on the policy, so the full expression for $\nabla_\theta J(\theta)$ involves the gradient of the state distribution $d(s)$:

$\nabla_\theta J(\theta) = \nabla_\theta \sum_s d(s) \sum_a \pi_\theta(a|s) Q(s,a)$

Fortunately, the policy gradient theorem lets us avoid computing this gradient:

$\nabla_\theta J(\theta) \propto \sum_s d(s) \sum_a Q(s,a) \nabla_\theta \pi_\theta(a|s)$

Multiplying the summand by $1 = \pi_\theta(a|s) / \pi_\theta(a|s)$ lets us write this sum as an expectation under the policy, which we can estimate by sampling:

$\nabla_\theta J(\theta) \propto \sum_s d(s) \sum_a \frac{\pi_\theta(a|s)}{\pi_\theta(a|s)} Q(s,a) \nabla_\theta \pi_\theta(a|s)$

$= \mathbb{E}_\pi \left[ \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} Q(s,a) \right]$

$= \mathbb{E}_\pi \left[ Q(s,a) \nabla_\theta \ln \pi_\theta(a|s) \right]$
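
The identity between the sum over actions and the expectation of $Q \nabla_\theta \ln \pi_\theta$ can be checked numerically. The sketch below does this for a single state with a softmax policy over three actions. The logits `theta` and values `Q` are arbitrary illustrative numbers, and the single-state setting is a simplifying assumption that leaves out $d(s)$.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

theta = np.array([0.2, -0.5, 1.0])     # hypothetical logits for 3 actions
Q = np.array([1.0, 3.0, -2.0])         # hypothetical action values
pi = softmax(theta)

# Jacobian of softmax: dpi[a, k] = d pi(a) / d theta_k = pi(a) * (1[a == k] - pi(k))
dpi = np.diag(pi) - np.outer(pi, pi)

# gradient of the log-policy: dlogpi[a, k] = 1[a == k] - pi(k)
dlogpi = np.eye(3) - pi[None, :]

# sum form: sum_a Q(a) grad pi(a)
grad_sum = Q @ dpi

# expectation form: E_{a ~ pi}[Q(a) grad ln pi(a)], computed exactly
grad_exp = (pi * Q) @ dlogpi

# Monte Carlo estimate of the same expectation from sampled actions
rng = np.random.default_rng(0)
actions = rng.choice(3, size=20000, p=pi)
grad_mc = (Q[actions, None] * dlogpi[actions]).mean(axis=0)
```

`grad_sum` and `grad_exp` agree up to floating-point error, and `grad_mc` approaches them as the number of sampled actions grows. This is what makes the expectation form usable when only sampled experience is available.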



#### REINFORCE:

$\nabla_\theta J(\theta) \propto \mathbb{E}_\pi [ Q(s,a) \nabla_\theta \ln \pi_\theta(a|s) ]$

Estimate $Q(s,a)$ using the episodic return $G_t$:

$\mathbb{E}_\pi [ G_t \nabla_\theta \ln \pi_\theta(a_t|s_t) ]$

To reduce variance, the return $G_t$ can be replaced by the advantage function $A(s,a)$, which subtracts a state-value baseline:

$A(s,a) = Q(s,a) - V(s)$

When implementing this, $Q(s,a)$ is replaced by the return $G_t$, and $V(s)$ is a parametrized value approximator.
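
A minimal sketch of the quantities that weight $\nabla_\theta \ln \pi_\theta(a_t|s_t)$ in REINFORCE, with and without a baseline. The discount factor `gamma` is an assumption (the notes above do not mention discounting; set `gamma=1.0` for plain summed rewards), and `discounted_returns` and `reinforce_weights` are hypothetical helper names, not functions from `utils`.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one finished episode."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def reinforce_weights(rewards, values=None, gamma=0.99):
    """Per-step weights: G_t, or G_t - V(s_t) when baseline values are given."""
    G = discounted_returns(rewards, gamma)
    if values is None:
        return G
    return G - np.asarray(values, dtype=float)

rewards = [1.0, 1.0, 1.0]                    # rewards from one toy episode
G = discounted_returns(rewards, gamma=0.5)   # [1.75, 1.5, 1.0]
adv = reinforce_weights(rewards, values=[1.0, 1.0, 1.0], gamma=0.5)  # [0.75, 0.5, 0.0]
```

Because the returns are accumulated backwards from the last step, the whole episode must be finished before any update can be made.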

#### A2C:
Uses the TD error (computed online, one step at a time) as the advantage estimate, instead of Monte Carlo estimates (which require a full episode).
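
For comparison with the Monte Carlo return, here is a sketch of the one-step TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, which needs only a single transition. The discount `gamma`, the `dones` mask that zeroes the bootstrap term at episode end, and the function name `td_errors` are assumptions for illustration and are not taken from `utils`.

```python
import numpy as np

def td_errors(rewards, values, next_values, dones, gamma=0.99):
    """delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    # no bootstrapping from the state after a terminal transition
    not_done = 1.0 - np.asarray(dones, dtype=float)
    return rewards + gamma * next_values * not_done - values

# two toy transitions; the second one ends the episode
delta = td_errors(
    rewards=[1.0, 1.0],
    values=[0.5, 0.5],
    next_values=[0.5, 0.7],
    dones=[False, True],
    gamma=0.9,
)
```

In an actor-critic update, `delta` would take the place of $G_t$ (or $G_t - V(s_t)$) as the weight on $\nabla_\theta \ln \pi_\theta(a_t|s_t)$.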

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import gym
from utils import *
from collections import namedtuple

import torch as tr

%load_ext autoreload
%autoreload 2

In [16]:
# setup
def experiment(nseeds,neps,kw):
  """
  interaction logic
  a     <- agent(s,h)
  s',r  <- env(a)

  returns an (nseeds, neps) array of summed episode rewards
  """
  metric = np.zeros((nseeds,neps))
  # loop over seeds
  for s in range(nseeds):
    np.random.seed(s)
    tr.manual_seed(s)
    # setup
    agent = kw['agent']
    task = Task(max_ep_len=kw['max_ep_len'])
    # buffer mode comes from kw so that 'episodic' and 'online' runs differ
    buffer = Buffer(kw['buff_mode'],kw['buff_size'])
    for e in range(neps):
        ## score on greedy policy
        episode = task.play_ep(agent.softmax_policy_fn(1.0))
        metric[s,e] = np.sum(unpack_expL(episode)['reward'])
        ## train on softmax
#         episode = task.play_ep(agent.softmax_policy_fn(0.85))
#         buffer.record(episode)
#         expLoD = buffer.sample(kw['batch_size'])
#         # update
#         exp = unpack_expL(expLoD)
#         agent.train(exp)
  return metric



In [22]:
# run experiments
ns,ni=1,1
kwargs = {'buff_size':500,'batch_size':128,'max_ep_len':100,'buff_mode':'episodic'}
# REINFORCE
kwargs['buff_mode']='episodic'
kwargs['agent']=PolValNet()
m_episodic = experiment(ns,ni,kwargs)




In [None]:
# sample from distant past
kwargs['buff_mode']='online'
m_online = experiment(ns,ni,kwargs)

### visualize sum of episode rewards

In [None]:
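def plt_metric(metric):
  # mean and standard error across seeds (axis 0)
  M = metric.mean(0)
  S = metric.std(0)/np.sqrt(metric.shape[0])
  ax = plt.gca()
  ax.plot(M,zorder=99)
  # shaded band: mean +/- standard error
  ax.fill_between(np.arange(len(M)),M-S,M+S,alpha=.3)
  # individual seeds as thin black lines
  for smet in metric:
    ax.plot(smet,c='k',lw=.01)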

In [None]:
for metric in [m_online,m_episodic]:
  plt_metric(metric)
ax = plt.gca()
ax.set_ylim(-10,kwargs['max_ep_len'])

In [None]:
metric.shape

In [None]:
metric

#### resources
- [Mnih et al., 2015](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
- [blog](https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f)
- [Lilian Weng blog](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#a3c)
- [A3C paper](https://arxiv.org/pdf/1602.01783.pdf)
- [TF implementation](https://github.com/dennybritz/reinforcement-learning/tree/master/PolicyGradient)