# Deep Q-learning

The goal of a DQN agent is to maximize the future discounted return at each timestep $t$, namely

$$ R_t = \sum_{t'=t}^T \gamma^{t'-t}r_{t'} $$

assuming the episode ends at timestep $T$. The optimal action-value function $Q^*(s,a)$ is the maximum expected discounted return achievable after taking action $a$ in state $s$, i.e. the return obtained when following an optimal policy $\pi^*$ thereafter. It satisfies a recursive relationship called the Bellman optimality equation, Eq. $(1)$, where $\mathcal{S}$ is the distribution over next states $s'$ given a state $s_t$ and action $a_t$:
$$
Q^*(s,a) := \max_{\pi}\mathbb{E}_{\pi} \Big[R_t ~\big|~ s_t=s, ~a_t=a\Big] \implies Q^*(s,a) = \mathbb{E}_{s'\sim \mathcal{S}}\Big[r + \gamma \max_{a'} Q^*(s',a') ~\big|~ s, a\Big] \qquad (1)
$$  
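
The recursion in Eq. $(1)$ mirrors the one satisfied by the return itself, $R_t = r_t + \gamma R_{t+1}$. As a minimal numerical sketch, the snippet below computes $R_t$ for every timestep of a short episode with that backward recursion. The helper name `discounted_returns` and the reward values are made up for illustration and are not part of the assignment code.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Hypothetical helper: R_t = sum_{t'>=t} gamma^(t'-t) * r_t' for every t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk the episode backwards so that R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [0.0, 0.0, 1.0]          # made-up episode: reward only at the last step
returns = discounted_returns(rewards, gamma=0.5)
print(returns)
```

With these numbers the single reward of 1 at the last step is worth 0.5 one step earlier and 0.25 at the start of the episode.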

Generally, we can estimate this optimal Q-function by updating the Q-value function in an iterative fashion as
$$ Q_{i+1}(s,a) = \mathbb{E}_{s'\sim \mathcal{S}}\Big[r + \gamma \max_{a'} Q_i(s',a') ~\big|~ s, a\Big] \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad (2)$$
which converges to $Q^*$ as the iteration index $i$ goes to infinity. In DQN we use a function approximator to represent the Q-value function. Therefore, instead of assigning values as in Eq. $(2)$, we solve a regression problem, as detailed in Section 2. Also, instead of imposing Eq. $(2)$ on all $(s,a)$ pairs, we sample the pairs from a *replay buffer*. At every iteration the buffer receives new transitions, obtained by executing in the environment the actions chosen by an "epsilon-greedy" sampling procedure, also described in Section 2.
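
Before moving to function approximation, it may help to see the iteration in Eq. $(2)$ in its tabular form. The sketch below runs it on a hypothetical three-state chain with deterministic transitions, so the expectation over $s'$ reduces to a lookup. The environment, its rewards, and all variable names are assumptions made for this example only.

```python
import numpy as np

# Hypothetical 3-state chain: state 2 is terminal, action 1 moves right, action 0 stays.
n_states, n_actions, gamma = 3, 2, 0.9
next_state = np.array([[0, 1], [1, 2], [2, 2]])            # next_state[s, a]
reward = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])    # reward[s, a]
terminal = np.array([False, False, True])

Q = np.zeros((n_states, n_actions))
for i in range(100):
    # Eq. (2): Q_{i+1}(s,a) = r + gamma * max_a' Q_i(s',a').
    max_next = Q.max(axis=1)[next_state]      # shape (n_states, n_actions), a copy
    max_next[terminal[next_state]] = 0.0      # no bootstrapping from terminal states
    Q = reward + gamma * max_next

print(Q)
```

On this toy problem the table stops changing after a few sweeps: moving right from state 1 is worth 1, and every step further from the goal is discounted by another factor of $\gamma$.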

In this assignment you will be asked to implement three parts:
- Define a Neural Network class that will be used as the Q-function approximator.
- Implement the epsilon-greedy sampling procedure.
- Implement the Q-learning loss function.

Then you will be able to test your algorithm in two environments: a simple grid-world and a more complex Atari game called Pong.

In [1]:
# import helpers, gym environments, and other needed dependencies
from collections import deque
import time
import numpy as np
import pickle
import os.path as osp
import click
import gym

from simpledqn.replay_buffer import ReplayBuffer
import logger
from simpledqn.wrappers import NoopResetEnv, EpisodicLifeEnv
from simpledqn import gridworld_env
from simpledqn.main import assert_allclose, preprocess_obs_gridworld, preprocess_obs_ram, LinearSchedule

nprs = np.random.RandomState
rng = nprs(42)

## 1. Constructing a Neural Network
Build a NN with **3 linear layers** (use 256 units for every hidden layer) and a **ReLU** non-linearity at the output of each layer except the last.

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.autograd as autograd

class NN_linear(nn.Module):
    def __init__(self, obs_size, act_size):
        super(NN_linear, self).__init__()
        self.Linear = nn.Linear(obs_size, act_size)

    def forward(self, obs):
        out = self.Linear(obs)
        return out
    
class NN(nn.Module):
    def __init__(self, obs_size, act_size):
        super(NN, self).__init__()
        "*** YOUR CODE HERE ***"
        
    def forward(self, obs):
        "*** YOUR CODE HERE ***"
        out = None
        return out

## 2. Training the Q-function approximators

The function $Q(s,a; \theta)$ is trained to approximate $Q^*(s,a)$ over time using a loss function defined as:
$$ \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\big[(y-Q(s,a;\theta))^2\big], \qquad\text{ where }\quad y= \begin{cases}
r+\gamma\max_{a'}Q(s',a';\theta') \qquad\text{ if non-terminal transition}\\
r \qquad\qquad\qquad\qquad\qquad\text{ for terminal transition}
\end{cases} \qquad\qquad (3)
$$
where the network $Q(s,a;\theta')$ is called the target network, and its parameters $\theta'$ are updated (i.e. set to the current value of $\theta$) at a fixed interval.
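
As a minimal sketch of the target $y$ in Eq. $(3)$, the NumPy snippet below computes it for a small batch, using the termination flag to switch off the bootstrap term. The function name `td_targets` and the sample numbers are assumptions for illustration; `q_next` stands in for the target-network outputs $Q(s',\cdot\,;\theta')$.

```python
import numpy as np

def td_targets(rewards, q_next, dones, gamma):
    """Hypothetical helper: y = r + gamma * max_a' Q(s',a') if non-terminal, else y = r."""
    max_q_next = q_next.max(axis=1)                  # shape (N,)
    return rewards + gamma * (1.0 - dones) * max_q_next

rewards = np.array([0.0, 1.0])
q_next = np.array([[0.5, 2.0], [3.0, 1.0]])   # made-up values, shape (N, |A|)
dones = np.array([0.0, 1.0])                  # the second transition is terminal
y = td_targets(rewards, q_next, dones, gamma=0.99)
print(y)
```

The first target bootstraps from the best next-state value, while the second one is just the reward because the episode ended there.
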
DQN is inherently off-policy, which means that we can update the agent towards the goal behavior using data sampled from an arbitrary behavior policy. Therefore, all sampled $(s,a,s',r)$ tuples are stored in a replay buffer $\mathcal{D}$.
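
The `ReplayBuffer` class imported from `simpledqn` is not shown in this notebook. To make the idea concrete, here is a hedged stand-in that assumes a fixed-capacity FIFO store with uniform sampling without replacement; the class name `SimpleReplayBuffer` and its internals are hypothetical and may differ from the real implementation.

```python
import random
from collections import deque
import numpy as np

class SimpleReplayBuffer:
    """Hypothetical stand-in for a replay buffer: FIFO storage, uniform sampling."""

    def __init__(self, max_size, seed=0):
        self._storage = deque(maxlen=max_size)   # oldest transitions are dropped first
        self._rng = random.Random(seed)

    def add(self, obs, act, rew, next_obs, done):
        self._storage.append((obs, act, rew, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, then regroup the fields into arrays.
        batch = self._rng.sample(list(self._storage), batch_size)
        obs, act, rew, next_obs, done = map(np.array, zip(*batch))
        return obs, act, rew, next_obs, done

buffer = SimpleReplayBuffer(max_size=3)
for t in range(5):
    buffer.add(np.array([t, t]), t % 2, float(t), np.array([t + 1, t + 1]), 0.0)
obs, act, rew, next_obs, done = buffer.sample(batch_size=2)
```

Because the capacity is 3, only the three most recent transitions remain available for sampling after five insertions.
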
The approximator $Q(s,a; \theta)$ is updated by minimizing the loss described in Eq. $(3)$. In between updates, we add new tuples $(s,a,s',r)$ to the replay buffer by taking actions in the environment with an **epsilon-greedy** procedure:

**for** $t$ from 1 to T do:
* with probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \arg\max_a Q(s_t, a; \theta)$
* execute action $a_t$ in environment and observe reward $r_t$, next state $s_{t+1}$ and episode termination signal $d_t$
* store transition $(s_t, a_t, r_t, s_{t+1}, d_t)$ in $\mathcal{D}$.

**end**
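
The action-selection step of the loop above can be sketched on a plain vector of Q-values, independently of any network. The function name `epsilon_greedy`, the Q-values, and the seed are assumptions for this example; in the assignment the Q-values would come from the Q-network instead.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.rand() < epsilon:
        return int(rng.randint(0, len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.RandomState(0)
q_values = np.array([0.1, 0.7, 0.2])          # made-up Q(s, .) for three actions

greedy = epsilon_greedy(q_values, epsilon=0.0, rng=rng)   # epsilon = 0 is purely greedy
actions = [epsilon_greedy(q_values, epsilon=0.3, rng=rng) for _ in range(1000)]
greedy_fraction = np.mean(np.array(actions) == np.argmax(q_values))
print(greedy, greedy_fraction)
```

With $\epsilon = 0.3$ and three actions, the greedy action is expected about 80% of the time: 70% from the greedy branch plus a third of the random draws.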

*In the DQN class below, do the following:*
- complete the **epsilon greedy** action sampling
- write the full `compute_q_learning_loss` function
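
The class below anneals $\epsilon$ with `LinearSchedule` from `simpledqn.main`, whose source is not shown here. Assuming it interpolates linearly from `initial_p` to `final_p` over `schedule_timesteps` steps and stays constant afterwards, its behavior can be sketched as follows. The function name `linear_epsilon` is hypothetical.

```python
def linear_epsilon(step, schedule_timesteps, initial_p, final_p):
    """Hypothetical stand-in for LinearSchedule.value(step): linear decay, then constant."""
    fraction = min(float(step) / schedule_timesteps, 1.0)
    return initial_p + fraction * (final_p - initial_p)

# GridWorld settings used later in this notebook: fraction_eps * max_steps = 0.1 * 100000.
eps_early = linear_epsilon(999, schedule_timesteps=10000, initial_p=1.0, final_p=0.05)
eps_late = linear_epsilon(50000, schedule_timesteps=10000, initial_p=1.0, final_p=0.05)
print(eps_early, eps_late)
```

Under this assumption $\epsilon$ is about 0.905 at step 999, which agrees with the `Epsilon` column of the GridWorld training log further down, and it settles at `final_eps = 0.05` once the schedule is exhausted.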

In [6]:
class DQN(object):
    def __init__(self, env, obs_dim, act_dim, obs_preprocessor, replay_buffer, NN, 
                 opt_batch_size, discount, initial_step, max_steps, learning_start_itr, target_q_update_freq,
                 train_q_freq, log_freq, final_eps, initial_eps, fraction_eps, render):
        self._env = env
        self._obs_dim = obs_dim
        self._act_dim = act_dim
        self._obs_preprocessor = obs_preprocessor
        self._replay_buffer = replay_buffer
        self._initial_step = initial_step
        self._max_steps = max_steps
        self._target_q_update_freq = target_q_update_freq
        self._learning_start_itr = learning_start_itr
        self._train_q_freq = train_q_freq
        self._log_freq = log_freq
        self._opt_batch_size = opt_batch_size
        self._discount = discount
        self._render = render

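        self._q = NN(self._obs_dim, self._act_dim)  # Q function whose parameters are optimized
        self._qt = NN(self._obs_dim, self._act_dim)  # target Q, periodically overwritten with the parameters of Q
        # Freeze the target network: setting requires_grad on the module itself has no effect.
        for param in self._qt.parameters():
            param.requires_grad = False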
        
        self.optimizer = optim.Adam(self._q.parameters(), lr=0.0001)

        self.exploration = LinearSchedule(  # gives value of eps across iterations
            schedule_timesteps=int(fraction_eps * max_steps),
            initial_p=initial_eps,
            final_p=final_eps)

    def eps_greedy(self, obs, epsilon):
        # Check Q function, do argmax.
        rnd = rng.rand()
        if rnd > epsilon:
            obs = self._obs_preprocessor(obs)
            "*** YOUR CODE HERE ***"
            # compute q_values of obs
            q_values = np.ones(self._act_dim)
            # return the greedy action
            return 0
        else:
            return rng.randint(0, self._act_dim)

    def compute_q_learning_loss(self, l_obs, l_act, l_rew, l_next_obs, l_done):
        """
        :param l_obs: A np.array holding a list of observations. Should be of shape N * |S|.
        :param l_act: A np.array variable holding a list of actions. Should be of shape N.
        :param l_rew: A np.array variable holding a list of rewards. Should be of shape N.
        :param l_next_obs: A np.array variable holding a list of observations at the next time step. Should be of
        shape N * |S|.
        :param l_done: A np.array variable holding a list of binary values (indicating whether episode ended after this
        time step). Should be of shape N.
        :return: A PyTorch Variable holding a scalar loss.
        """

        "*** YOUR CODE HERE ***"
        # wrap the observations into Variables
        l_next_obs_var = autograd.Variable(torch.Tensor(l_next_obs), requires_grad=False)
        l_obs_var = autograd.Variable(torch.Tensor(l_obs), requires_grad=False)

        # compute Q values of the next_obs based on the target Q network self._qt, and convert back to numpy
        qt_next = 0
        
        # compute the target for the MSELoss (you can do it entirely in numpy). Use self._discount
        target = 0

        # wrap into a Variable
        target = autograd.Variable(torch.Tensor(target), requires_grad=False)
        # compute the Q values of the current obs with self._q and select the one corresponding to the action taken
        q_sel = autograd.Variable(torch.Tensor(0), requires_grad=False)

        # form the MSELoss and compute it
        loss = autograd.Variable(torch.Tensor([0]), requires_grad=False)
        return loss


    def train_q(self, l_obs, l_act, l_rew, l_next_obs, l_done):
        """Update Q-value function by sampling from the replay buffer."""
        self._q.zero_grad()
        
        l_obs = self._obs_preprocessor(l_obs)
        l_next_obs = self._obs_preprocessor(l_next_obs)
        
        loss = self.compute_q_learning_loss(
            l_obs, l_act, l_rew, l_next_obs, l_done)
        
        loss.backward()
        self.optimizer.step()
        
        return loss.data

    def _update_target_q(self):
        """Update the target Q-value function by copying the current Q-value function weights."""
        # state_dict() also covers buffers, unlike named_parameters()
        self._qt.load_state_dict(self._q.state_dict())

    def train(self):
        obs = self._env.reset()

        episode_rewards = []
        n_episodes = 0
        l_episode_return = deque([], maxlen=10)
        l_discounted_episode_return = deque([], maxlen=10)
        l_tq_squared_error = deque(maxlen=50)
        log_itr = -1
        for itr in range(self._initial_step, self._max_steps):
            act = self.eps_greedy(obs[np.newaxis, :],
                                  self.exploration.value(itr))
            next_obs, rew, done, _ = self._env.step(act)
            if self._render:
                self._env.render()
            self._replay_buffer.add(obs, act, rew, next_obs, float(done))

            episode_rewards.append(rew)

            if done:
                obs = self._env.reset()
                episode_return = np.sum(episode_rewards)
                discounted_episode_return = np.sum(
                    episode_rewards * self._discount ** np.arange(len(episode_rewards)))
                l_episode_return.append(episode_return)
                l_discounted_episode_return.append(discounted_episode_return)
                episode_rewards = []
                n_episodes += 1
            else:
                obs = next_obs

            if itr % self._target_q_update_freq == 0 and itr > self._learning_start_itr:
                self._update_target_q()

            if itr % self._train_q_freq == 0 and itr > self._learning_start_itr:
                # Sample from replay buffer.
                l_obs, l_act, l_rew, l_obs_prime, l_done = self._replay_buffer.sample(
                    self._opt_batch_size)
                # Train Q value function with sampled data.
                td_squared_error = self.train_q(
                    l_obs, l_act, l_rew, l_obs_prime, l_done)
                l_tq_squared_error.append(td_squared_error)

            if (itr + 1) % self._log_freq == 0 and len(l_episode_return) > 5:
                log_itr += 1
                logger.logkv('Iteration', log_itr)
                logger.logkv('Steps', itr)
                logger.logkv('Epsilon', self.exploration.value(itr))
                logger.logkv('Episodes', n_episodes)
                logger.logkv('AverageReturn', np.mean(l_episode_return))
                logger.logkv('AverageDiscountedReturn',
                             np.mean(l_discounted_episode_return))
                logger.logkv('TDError^2', np.mean(l_tq_squared_error))
                logger.dumpkvs()
#                 self._q.dump(logger.get_dir() + '/weights.pkl')

    def test(self, epsilon):
        try:
            self._q.set_params(self._q.load(logger.get_dir() + '/weights.pkl'))
        except Exception as e:
            print(e)
        obs = self._env.reset()
        while True:
            act = self.eps_greedy(obs[np.newaxis, :], epsilon)
            obs_prime, rew, done, _ = self._env.step(act)
            self._env.render()
            if done:
                obs = self._env.reset()
                print('Done!')
                time.sleep(1)
            else:
                obs = obs_prime

## 3. Test the algorithm on grid world
Now let's train an agent on a simple GridWorld to test our algorithm!

In [7]:
env = gym.make('GridWorld-v0')
test_dir = "data/local/dqn_gridworld_test"
log_dir = "data/local/dqn_gridworld2"
logger.session(log_dir).__enter__()
env.seed(42)

# Initialize the replay buffer that we will use.
replay_buffer = ReplayBuffer(max_size=10000)

# Initialize DQN training procedure.
dqn_gridworld = DQN(
    env=env,
    obs_dim=env.observation_space.n,
    act_dim=env.action_space.n,
    NN=NN_linear,
    obs_preprocessor=preprocess_obs_gridworld,
    replay_buffer=replay_buffer,
    opt_batch_size=64,
    # DQN gamma parameter
    discount=0.99,
    # Training procedure length
    initial_step=0,
    max_steps=100000,
    learning_start_itr=1000,
    # Frequency of copying the actual Q to the target Q
    target_q_update_freq=100,
    # Frequency of updating the Q-value function
    train_q_freq=4,
    # Exploration parameters
    initial_eps=1.0,
    final_eps=0.05,
    fraction_eps=0.1,
    # Logging
    log_freq=1000,
    render=False,
)

from simpledqn.main import test_loss
test_loss(dqn_gridworld, test_dir)

[2018-04-12 18:09:00,587] Making new env: GridWorld-v0


[ 2.21811724]
Test for compute_q_learning_loss passed!


If you passed the previous test, let's train the full policy!

In [5]:
dqn_gridworld.train()

--------------------------------------
| Iteration               | 0        |
| Steps                   | 999      |
| Epsilon                 | 0.90509  |
| Episodes                | 63       |
| AverageReturn           | 0        |
| AverageDiscountedReturn | 0        |
| TDError^2               | nan      |
--------------------------------------




--------------------------------------
| Iteration               | 1        |
| Steps                   | 1999     |
| Epsilon                 | 0.8101   |
| Episodes                | 124      |
| AverageReturn           | 0.3      |
| AverageDiscountedReturn | 0.26066  |
| TDError^2               | 0.080576 |
--------------------------------------
--------------------------------------
| Iteration               | 2        |
| Steps                   | 2999     |
| Epsilon                 | 0.7151   |
| Episodes                | 195      |
| AverageReturn           | 0.2      |
| AverageDiscountedReturn | 0.18456  |
| TDError^2               | 0.063006 |
--------------------------------------
--------------------------------------
| Iteration               | 3        |
| Steps                   | 3999     |
| Epsilon                 | 0.6201   |
| Episodes                | 277      |
| AverageReturn           | 0.1      |
| AverageDiscountedReturn | 0.089534 |
| TDError^2              

In [6]:
# visualize learned policy
dqn_gridworld.test(epsilon=0.0)

'NN_linear' object has no attribute 'set_params'
  (Right)
S[F]FF
FFFH
FFFF
HFFG
  (Down)
SFFF
F[F]FH
FFFF
HFFG
  (Down)
SFFF
FFFH
F[F]FF
HFFG
  (Right)
SFFF
FFFH
FF[F]F
HFFG
  (Right)
SFFF
FFFH
FFF[F]
HFFG
  (Down)
SFFF
FFFH
FFFF
HFF[G]
Done!

KeyboardInterrupt: 

In [7]:
env.close()

## 4. Test algorithm on Pong
Now we can train for longer on a substantially more complex environment: Pong from the Atari suite. To speed up training, instead of playing from pixels we will play directly from the RAM state.

In [8]:
env = EpisodicLifeEnv(NoopResetEnv(gym.make('Pong-ram-v0')))
log_dir = "data/local/dqn_pong"

logger.session(log_dir).__enter__()
env.seed(42)

# Initialize the replay buffer that we will use.
replay_buffer = ReplayBuffer(max_size=10000)

# Initialize DQN training procedure.
dqn_pong = DQN(
    env=env,
    obs_dim=env.observation_space.shape[0],
    act_dim=env.action_space.n,
    NN=NN,
    obs_preprocessor=preprocess_obs_ram,
    replay_buffer=replay_buffer,
    opt_batch_size=64,
    # DQN gamma parameter
    discount=0.99,
    # Training procedure length
    initial_step=1000000,
    max_steps=10000000,
    learning_start_itr=100000,
    # Frequency of copying the actual Q to the target Q
    target_q_update_freq=1000,
    # Frequency of updating the Q-value function
    train_q_freq=4,
    # Exploration parameters
    initial_eps=1.0,
    final_eps=0.05,
    fraction_eps=0.1,
    # Logging
    log_freq=10000,
    render=False,
)

[2018-04-04 14:44:05,649] Making new env: Pong-ram-v0


In [9]:
dqn_pong.train()

--------------------------------------
| Iteration               | 0        |
| Steps                   | 1.01e+06 |
| Epsilon                 | 0.05     |
| Episodes                | 8        |
| AverageReturn           | -20.625  |
| AverageDiscountedReturn | -1.2601  |
| TDError^2               | 0.012777 |
--------------------------------------
---------------------------------------
| Iteration               | 1         |
| Steps                   | 1.02e+06  |
| Epsilon                 | 0.05      |
| Episodes                | 16        |
| AverageReturn           | -20.6     |
| AverageDiscountedReturn | -1.0162   |
| TDError^2               | 0.0055535 |
---------------------------------------
---------------------------------------
| Iteration               | 2         |
| Steps                   | 1.03e+06  |
| Epsilon                 | 0.05      |
| Episodes                | 25        |
| AverageReturn           | -20.6     |
| AverageDiscountedReturn | -0.89998  |
| TDError

KeyboardInterrupt: 

## Visualization
To visualize your learning curves, run the `viskit` tool from a terminal:
`python viskit/frontend.py path/to/log_dir`
where `path/to/log_dir` defaults to `data/local/exp_name`; for Pong, for example, `exp_name` is `dqn_pong`.
For this visualization to work, the homework directory must be on your `$PYTHONPATH`. You should then see something like this in your browser:
![title](simpledqn/pong_learning.png)