# DQN 

__Deep Q-learning:__ approximates the optimal Q function (the maximum expected total reward obtainable from a state-action pair over all successive steps) with a neural network, and then derives the policy by acting greedily with respect to that approximation.



__Key PyTorch packages:__

   - `torch.nn` - neural networks
   - `torch.optim` - optimization
   - `torch.autograd` - automatic differentiation
   - `torchvision` - utilities for vision tasks (separate package)

In [12]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F 
# import torchvision.transforms as T

import random  # needed by ReplayMemory.sample()
from collections import namedtuple

In [13]:
# if gpu is to be used 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Replay Memory 

Replay memory stores the transitions (s, a, r, s') that the agent observes so the data can be reused later. Sampling from it at random decorrelates the transitions that make up a training batch, which has been shown to greatly stabilize and improve the DQN training process.

For this we need two classes: 

- `Transition` - a named tuple representing a single transition in our environment. It essentially maps (state, action) pairs to their (next_state, reward) result.

- `ReplayMemory` - a cyclic buffer of bounded size that holds the transitions observed recently. It also implements a `.sample()` method for selecting a random batch of transitions for training.


In [14]:
Transition = namedtuple('Transition', ('state', 
                                       'action', 
                                       'next_state', 
                                       'reward'))

In [15]:
class ReplayMemory(object): 
    
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0
        
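    def push(self, *args):
        """Saves a transition, overwriting the oldest one once the buffer is full."""
        if len(self.memory) < self.capacity:
            # Still growing: add a slot that is filled on the next line.
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        # Advance the write index, wrapping around at capacity.
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        """Returns batch_size distinct transitions chosen uniformly at random."""
        return random.sample(self.memory, batch_size)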
    
    def __len__(self):
        return len(self.memory)
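
To see the cyclic behaviour described above in action, here is a minimal sketch. It is not the class from the previous cell: `DequeReplayMemory` is a hypothetical variant that gets the same bounded, overwrite-the-oldest behaviour from `collections.deque(maxlen=...)`, and the integer states and constant reward are made-up toy data.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))


class DequeReplayMemory:
    """Hypothetical variant of ReplayMemory backed by a bounded deque."""

    def __init__(self, capacity):
        # maxlen makes the deque drop its oldest item once it is full
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


memory = DequeReplayMemory(capacity=3)
for step in range(5):
    # toy transition: state, action, next_state, reward
    memory.push(step, 0, step + 1, 1.0)

# Only the three most recent transitions (states 2, 3, 4) are still stored
stored_states = sorted(t.state for t in memory.memory)

batch = memory.sample(2)
# zip(*batch) turns a list of Transitions into one Transition of tuples,
# which is a convenient layout for building batched tensors later
batch_by_field = Transition(*zip(*batch))
```

Pushing five transitions into a buffer of capacity three leaves only the last three, and `sample(2)` draws two distinct transitions from them.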

### DQN Algorithm 

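__Environment:__ stochastic, so in general the quantities below are expectations over the environment's random transitions. For simplicity, the equations in this section are written for a single sampled transition, without the expectation.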

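__Goal:__ train a policy that tries to maximize the discounted, cumulative reward $R_{t_0}=\sum_{t=t_0}^{\infty}\gamma^{t-t_0}r_t$, where $R_{t_0}$ is the return and $\gamma$ is the discount factor, a constant between 0 and 1 that makes rewards in the far future count less than immediate ones.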
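
As a small worked example of the return, the sketch below sums a finite reward sequence, which assumes the episode ends after the last reward so that the infinite sum is truncated. The function name `discounted_return` and the reward values are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], with k = 0 for the first reward."""
    total = 0.0
    # Iterate backwards so each step applies R_t = r_t + gamma * R_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
ret = discounted_return(rewards, gamma=0.5)
print(ret)  # 1.0 + 0.5 * 1.0 + 0.25 * 1.0 = 1.75
```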

The main idea behind Q-learning is that if we had a function $Q^*: \text{State} \times \text{Action} \to \mathbb{R}$ that could tell us what our return would be if we were to take a given action in a given state, then we could easily construct a policy that maximizes our rewards:

$\pi^*(s)=\arg\max_a Q^*(s,a)$

But we don't know everything about the world, so we don't have access to $Q^*$. Instead, since neural networks are universal function approximators, we train our network to approximate $Q^*$.
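
The greedy policy $\pi^*(s)=\arg\max_a Q^*(s,a)$ can be illustrated with a tiny lookup table standing in for $Q^*$. Everything here is hypothetical: the state names, action names, and Q-values are invented for the example, and in DQN the table would be replaced by the network's outputs.

```python
# Hypothetical Q-values for two states and two actions
q_table = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 1.5, "right": -0.3},
}


def greedy_policy(q_table, state):
    """Return the action with the highest Q-value in the given state."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


policy = {state: greedy_policy(q_table, state) for state in q_table}
print(policy)  # {'s0': 'right', 's1': 'left'}
```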


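__Update rule:__ the Q function of any policy $\pi$ obeys the Bellman equation:

$Q^{\pi}(s,a)=r+\gamma Q^{\pi}(s',\pi(s'))$

__TD Error:__ $\delta$, the difference between the two sides of the update rule's equality, with the greedy choice $\pi(s')=\arg\max_{a'}Q(s',a')$ substituted for the policy:

$\delta=Q(s,a)-(r+\gamma\max_{a'}Q(s',a'))$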
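
The TD error can be checked by hand on a single transition. The sketch below uses a hypothetical two-state Q table and an arbitrary $\gamma = 0.5$. It follows the sign convention of the formula above: the current estimate minus the bootstrapped target.

```python
def td_error(q_table, state, action, reward, next_state, gamma):
    """delta = Q(s, a) - (r + gamma * max over a' of Q(s', a'))."""
    target = reward + gamma * max(q_table[next_state].values())
    return q_table[state][action] - target


# Hypothetical Q-values; in DQN these would come from the network
q_table = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 2.0, "right": 0.5},
}

delta = td_error(q_table, "s0", "right", reward=1.0, next_state="s1", gamma=0.5)
print(delta)  # 1.0 - (1.0 + 0.5 * 2.0) = -1.0
```

A negative $\delta$ means the current estimate $Q(s,a)$ is below its target, so training should push it up.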

__Loss:__ <font color='red'>WIP</font>

To minimize this error we use ... (e.g. Huber loss) - WIP
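
Since the loss is still a work in progress here, the following is only one possible sketch of the Huber option mentioned above, not a settled choice. It uses `F.smooth_l1_loss`, which with its default `beta=1.0` coincides with the Huber loss at threshold 1: quadratic for small errors and linear for large ones, so outlier transitions do not dominate the gradient. The tensors are made-up values standing in for a batch of $Q(s,a)$ predictions and their targets $r+\gamma\max_{a'}Q(s',a')$.

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of predicted Q(s, a) values and their TD targets
q_values = torch.tensor([1.0, 0.5, 3.0])
targets = torch.tensor([1.5, 0.5, 0.0])

# Per-element loss for d = q_values - targets (default beta=1.0):
#   0.5 * d**2   if |d| < 1
#   |d| - 0.5    otherwise
# and the default reduction averages over the batch.
loss = F.smooth_l1_loss(q_values, targets)

# Here d = [-0.5, 0.0, 3.0], giving (0.125 + 0.0 + 2.5) / 3 = 0.875
print(loss.item())
```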

__Q-network__

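Our model will be a convolutional neural network that takes the state as input. Given the convolutional layers in `forward` below, the input is presumably an image-like tensor, such as the difference between the current and previous screen frames, rather than a difference of Q values. It has one output for every possible action, each an estimate of $Q(s, a)$ for that action, so a single forward pass scores all actions at once.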

In [19]:
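# Method of the network's nn.Module class; the layers conv1..conv3, bn1..bn3
# and head are assumed to be defined in its __init__.
# Called on a batch of states; returns one Q-value per action for each state.
def forward(self, x):
    # Three blocks of convolution (conv) -> batch norm (bn) -> ReLU
    x = F.relu(self.bn1(self.conv1(x)))
    x = F.relu(self.bn2(self.conv2(x)))
    x = F.relu(self.bn3(self.conv3(x)))
    # Flatten each sample's feature maps before the output head
    return self.head(x.view(x.size(0), -1))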
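
For context, here is a minimal sketch of a complete module that this `forward` could belong to. The notebook does not show the layer definitions, so everything in `__init__` is an assumption: the channel counts, kernel size 5, stride 2, the 3-channel 40x90 input, and the constructor arguments `h`, `w` and `n_actions` are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DQN(nn.Module):
    """Sketch of a convolutional Q-network; all layer sizes are assumptions."""

    def __init__(self, h, w, n_actions):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
        self.bn3 = nn.BatchNorm2d(32)

        def conv_out(size, kernel_size=5, stride=2):
            # Output length of an unpadded convolution along one dimension
            return (size - (kernel_size - 1) - 1) // stride + 1

        conv_h = conv_out(conv_out(conv_out(h)))
        conv_w = conv_out(conv_out(conv_out(w)))
        # One output per action: the estimated Q(s, a)
        self.head = nn.Linear(conv_h * conv_w * 32, n_actions)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        return self.head(x.view(x.size(0), -1))


net = DQN(h=40, w=90, n_actions=2)
states = torch.zeros(4, 3, 40, 90)       # dummy batch of 4 image-like states
q_values = net(states)                   # shape: (batch, n_actions)
greedy_actions = q_values.argmax(dim=1)  # index of the best action per state
```

The linear head needs to know the flattened feature size in advance, which is why the sketch computes it from the input height and width instead of hard-coding it.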