# DQN 

__Deep Q-learning:__ uses a neural network to approximate the optimal Q function, the maximum expected total reward obtainable from a state-action pair over all successive steps, and then derives the optimal policy from it.  



__Key PyTorch modules:__
    
   - `torch.nn` - neural networks
   - `torch.optim` - optimization 
   - `torch.autograd` - automatic differentiation 
   - `torchvision` - utilities for vision tasks (separate package)

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F 
# import torchvision.transforms as T

import random  # needed by ReplayMemory.sample below
from collections import namedtuple

In [3]:
# if gpu is to be used 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Replay Memory 

Stores the transitions $(s, a, r, s')$ that the agent observes so that we can reuse the data later. Sampling from this data at random decorrelates the transitions that make up a batch, which has been shown to greatly stabilize and improve the DQN training process. 

For this we need two classes: 

- `Transition` - a named tuple representing a single transition in our environment. It essentially maps (state, action) pairs to their (next_state, reward) result. 

- `ReplayMemory` - a cyclic buffer of bounded size that holds the most recently observed transitions. It also implements a `.sample()` method for selecting a random batch of transitions for training. 


In [4]:
Transition = namedtuple('Transition', ('state', 
                                       'action', 
                                       'next_state', 
                                       'reward'))

In [1]:
class ReplayMemory(object): 
    
    def __init__(self, capacity):
        """Components of replay memory"""
        self.capacity = capacity
        self.memory = []
        self.position = 0
        
    def push(self, *args):
        """Saves a transition"""
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        self.position = (self.position+1) % self.capacity 
        
    def sample(self, batch_size):
        """Randomly sample transitions to decorrelate data"""
        return random.sample(self.memory, batch_size)
    
    def __len__(self):
        return len(self.memory)
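
As a usage sketch, the same bounded, cyclic behaviour can also be obtained from `collections.deque` with `maxlen`. The class name `DequeReplayMemory` and the toy integer transitions are illustrative assumptions, not part of the notebook above.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))


class DequeReplayMemory:
    """Hypothetical deque-based variant of ReplayMemory."""

    def __init__(self, capacity):
        # A deque with maxlen drops its oldest item once it is full
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        """Saves a transition"""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        """Randomly sample transitions to decorrelate data"""
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


memory = DequeReplayMemory(capacity=3)
for step in range(5):
    # toy transition: state=step, action=0, next_state=step+1, reward=1.0
    memory.push(step, 0, step + 1, 1.0)

batch = memory.sample(2)
# Transpose a list of Transitions into one Transition of tuples
batch = Transition(*zip(*batch))
```

After five pushes into a buffer of capacity 3, only the three most recent transitions remain, and `batch.state`, `batch.action`, and so on each hold one tuple per field.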

### DQN Algorithm 

__Environment:__ stochastic environment, thus the expectations are over stochastic transitions of our environment

__Input:__ current state vector of the agent 

__Output:__ a Q-value for _every possible_ action in the current state, produced in a single forward pass. 

__Tools__

- __Experience replay & replay buffer:__ store each experience tuple in the replay buffer as the agent interacts with the environment, then sample a small batch of tuples from it in order to learn. This process enables learning from the same experience multiple times, including rare events. 
- __Fixed Q-Targets:__ The problem that fixed Q-targets address is that both the _labels and the predicted values_ are functions of the same weights, so all the Q-values are intrinsically tied together through the function parameters. Computing the targets from a separate copy of the weights that is held fixed for a while breaks the correlation between the target and the parameters that we are changing.
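
A minimal sketch of the fixed Q-targets idea, assuming two copies of the same network. The names `policy_net` and `target_net` and the layer sizes are hypothetical; the manual weight change stands in for a real training step.

```python
import torch
import torch.nn as nn

policy_net = nn.Linear(4, 2)   # hypothetical: 4 state features, 2 actions
target_net = nn.Linear(4, 2)

# Start with identical weights (load_state_dict copies the values)
target_net.load_state_dict(policy_net.state_dict())

# Stand-in for a training step that changes only the policy network
with torch.no_grad():
    policy_net.weight.add_(1.0)

# The target network is unaffected, so its Q-targets stay fixed
weights_differ = not torch.equal(policy_net.weight, target_net.weight)

# Every so often, copy the trained weights into the target network
target_net.load_state_dict(policy_net.state_dict())
weights_match = torch.equal(policy_net.weight, target_net.weight)
```

How often to synchronize the two networks is a tuning choice that the notes above do not specify.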

__Goal:__ train a policy that tries to maximize the discounted, cumulative reward, where $R_{t_0}$ is the return: 

$R_{t_0}=\sum_{t=t_0}^{\infty}γ^{t−t_0}r_t$.
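
To make the sum concrete, here is a small sketch that computes the return for a finite list of rewards. The helper name `discounted_return` and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^(t - t_0) * r_t over a finite reward sequence."""
    total = 0.0
    # Work backwards so each step applies one extra factor of gamma
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# 1 + 0.9 * 1 + 0.81 * 1
ret = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
```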

The idea behind Q-learning is that if we had a function: 

$Q^∗: State×Action→ℝ$ 

that could tell us what our return would be when we take a certain action in a given state, then we could construct a policy that maximizes our rewards:

$π^∗(s)=argmax_a Q^∗(s,a)$
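
As an illustration of the argmax, assuming we already had a table of Q-values for a batch of two states and three actions (the numbers are made up):

```python
import torch

# Rows are states, columns are actions
q_values = torch.tensor([[0.1, 0.7, 0.2],
                         [0.5, 0.3, 0.9]])

# Greedy policy: pick the action with the largest Q-value in each state
greedy_actions = q_values.argmax(dim=1)
```

Here the greedy policy picks action 1 in the first state and action 2 in the second.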

But because we don’t know everything about the world, we don’t have access to $Q^*$. Instead, our network, as a universal function approximator, is trained to approximate $Q^*$.

The approximate action-value function: $\hat{q}(S,A,w) \approx q_π(S,A)$

__Linear Action-Value Function Approximation__

Represent the state and action pairs using a feature vector: 

$x(S,A) = (x_1(S,A), ... , x_n(S,A))^T$

This lets us represent the action-value function as a linear combination of features: 

$\hat{q}(S,A,w) = x(S,A)^Tw = \sum_{j=1}^{n}x_j(S,A)w_j$

Equivalently, writing the features as $f_j(s,a) = x_j(s,a)$:

$\hat{q}(S,A,w) = w_1f_1(s,a) + w_2f_2(s,a) +...+ w_nf_n(s,a)$ 

__Goal:__ Minimize the mean-squared error between the true action-value function $q_π(S,A)$ and its approximation $\hat{q}(S,A,w)$. Because $\hat{q}$ is a differentiable function of the parameter vector $w$, we can do this with stochastic gradient descent on:

$J(w) = \mathbb{E}_π [(q_π(S,A) - \hat{q}(S,A,w))^2]$

__Update rule:__ every Q-value function for some policy obeys the Bellman equation:

$Q^π(s,a)= r + γQ^π(s′,π(s′))$ 

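Because we don't know the true action-value function $q_π(S,A)$, we have to substitute a target for this value. Q-learning uses the TD target $R_{t+1} + γ \max_{a'} Q(S_{t+1},a')$, which maximizes over the next action. Using the action actually taken instead, $R_{t+1} + γQ(S_{t+1},A_{t+1})$, gives the on-policy (SARSA) target.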

__Linear VFA weight update__ 

The weight update has the form step-size * prediction_error * feature_value (for a linear $\hat{q}$, the gradient $\nabla_w\hat{q}(S,A,w)$ is just the feature vector $x(S,A)$), which is mathematically expressed as: 

$\Delta w=α[(q_π(S, A)-\hat{q}(S,A,w))\nabla_w\hat{q}(S,A,w)]$ 

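Substituting the TD target for $q_π(S,A)$ yields the following. With the next action $A_{t+1}$ actually taken (the SARSA form):

$\Delta w = α(R_{t+1} + γ\hat{q}(S_{t+1},A_{t+1},w) - \hat{q}(S_t,A_t,w))\nabla_w\hat{q}(S_t,A_t,w) $

With the Q-learning target, which maximizes over the next action, each weight of the linear approximator is updated as:

$w_m \leftarrow w_m + α[r + γ \max_{a'} Q(s',a') - Q(s,a)]f_m(s,a)$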
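
A worked sketch of one such Q-learning update for a linear approximator. The feature vectors, the reward, and the choice of two next actions are invented for illustration; the step size matches the $α = 0.05$ used in these notes.

```python
import torch

alpha, gamma = 0.05, 0.9
w = torch.zeros(3)                        # weight vector, starts at zero

x_sa = torch.tensor([1.0, 0.0, 2.0])      # features f(s, a) of the pair visited
x_next = torch.tensor([[0.0, 1.0, 0.0],   # features f(s', a') for action 0
                       [1.0, 1.0, 1.0]])  # features f(s', a') for action 1
r = 1.0                                   # observed reward

q_sa = torch.dot(x_sa, w)                 # Q(s, a) = x(s, a)^T w
max_q_next = (x_next @ w).max()           # max over a' of Q(s', a')
td_error = r + gamma * max_q_next - q_sa

# For a linear q-hat the gradient w.r.t. w is the feature vector itself
w = w + alpha * td_error * x_sa
```

With zero initial weights both Q-values are 0, so the TD error equals the reward and the weights move by $α \cdot r \cdot f(s,a)$.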



__Loss:__ <font color='red'>WIP</font>

- $α = 0.05$, the learning rate
- linear VFA update $\Delta w$ 
- SGD with PyTorch: `torch.optim.SGD` - implements SGD with the arguments `params`, `lr`, `momentum`, `dampening`, `weight_decay`, and `nesterov`

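___Example of a PyTorch optimization step:___

```
import torch
import torch.nn as nn

# Placeholder model, loss, and data so the snippet runs on its own
model = nn.Linear(4, 2)
loss_fn = nn.MSELoss()
inputs, target = torch.randn(8, 4), torch.randn(8, 2)

optimizer = torch.optim.SGD(model.parameters(), 
                            lr=0.1, momentum=0.9)
optimizer.zero_grad()                       # clear old gradients
loss_fn(model(inputs), target).backward()   # compute new gradients
optimizer.step()                            # apply the SGD update
```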
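
Since the loss is still marked WIP above, the following is only one possible sketch: it turns the Q-learning TD error into the mean-squared error $J(w)$ from earlier and takes one SGD step. The network `q_net`, the batch sizes, and the random data are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
q_net = nn.Linear(4, 2)            # hypothetical: 4 state features, 2 actions
optimizer = torch.optim.SGD(q_net.parameters(), lr=0.05)
gamma = 0.9

# Random stand-in for a batch sampled from the replay memory
states = torch.randn(8, 4)
actions = torch.randint(0, 2, (8, 1))
rewards = torch.randn(8)
next_states = torch.randn(8, 4)

# Q(s, a) for the actions that were actually taken
q_sa = q_net(states).gather(1, actions).squeeze(1)

# TD target r + gamma * max_a' Q(s', a'), treated as a fixed label
with torch.no_grad():
    td_target = rewards + gamma * q_net(next_states).max(dim=1).values

loss = F.mse_loss(q_sa, td_target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Other loss functions can be substituted for `F.mse_loss`; the squared error is used here only because it matches $J(w)$ above.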

__Q-network__

Our model will be a neural network that takes in the current state vector and has one output for every possible action, each an estimate of that action's Q-value. 

Reference for below code: https://towardsdatascience.com/building-neural-network-using-pytorch-84f6e75f9a

__Description of the code below__

`self.hidden = nn.Linear(30, 20)`

The module automatically creates the weight and bias tensors which we'll use in the forward method. You can access the weight and bias tensors once the network (net) is created with net.hidden.weight and net.hidden.bias.

`self.sigmoid = nn.Sigmoid()`

`self.softmax = nn.Softmax(dim=1)`

These define the operations for the sigmoid activation and the softmax output. Setting `dim=1` in `nn.Softmax(dim=1)` calculates softmax across the columns, so each row sums to 1.

`def forward(self, x):`

PyTorch networks created with `nn.Module` must have a `forward` method defined. It takes in a tensor `x` and passes it through the operations you defined in the `__init__` method.

In [6]:
# approximate q-values of actions for current state  
from torch import nn 

class Network(nn.Module): # class to track nn architecture
    def __init__(self):
        super().__init__()

        # Inputs to hidden layer: linear transformation 
        # 30 inputs; 20 outputs
        self.hidden = nn.Linear(30,20) 
        
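        # Output layer: linear transformation - 20 inputs; 1 output  
        # For a DQN the output size should equal the number of actions
        # (one Q-value per action); 1 is kept here as a placeholder
        self.output = nn.Linear(20,1)
        
        # Define sigmoid activation and softmax output 
        # (softmax is defined but not applied in forward, which returns raw values)
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)
        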
    def forward(self, x):
        # Pass the input tensor through each of our operations
        x = self.hidden(x)
        x = self.sigmoid(x)
        x = self.output(x)
        
        return x
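
A hedged sketch of how the network could look with one output per action, together with a forward pass on a batch of states. The class name `QNetwork`, the constructor parameters, and the choice of 4 actions are assumptions for illustration.

```python
import torch
from torch import nn


class QNetwork(nn.Module):
    """Hypothetical variant of Network with one output per action."""

    def __init__(self, n_inputs=30, n_hidden=20, n_actions=4):
        super().__init__()
        self.hidden = nn.Linear(n_inputs, n_hidden)
        self.sigmoid = nn.Sigmoid()
        self.output = nn.Linear(n_hidden, n_actions)

    def forward(self, x):
        # Raw Q-values: no softmax, since they are not probabilities
        return self.output(self.sigmoid(self.hidden(x)))


net = QNetwork()
states = torch.randn(5, 30)     # batch of 5 state vectors
q_values = net(states)          # one row per state, one column per action

# Weight and bias tensors are created automatically by nn.Linear
hidden_weight_shape = tuple(net.hidden.weight.shape)
```

`nn.Linear` stores its weights as (out_features, in_features), so the hidden layer's weight tensor has shape (20, 30).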