# Introduction

### Background
In Reinforcement Learning (RL) we study agents whose goal is to interact with the environment in a way that maximizes cumulative reward over time (the return). RL problems can be formally defined as Markov Decision Processes. This involves specifying:
- the set of states and actions available to the agent in an environment;
- the one-step dynamics of that environment, i.e. the probability that action *a* in state *s* at time *t* will result in state *s'* at time *t+1*;
- the reward that the environment will dispense after the agent transitions from a state *s* to a state *s'* due to action *a*.

To maximise return, the agent must learn an optimal policy, which tells it, for each state it encounters, which action will bring about the highest return. There are multiple methods for solving RL problems. In model-free approaches, the agent does not know the environment's one-step dynamics or reward function. Instead, it uses its experience interacting with the environment to gradually build up an estimate of an action-value (Q) table specifying, for each state *s*, the expected return of taking each action *a*.

Q-learning is one of several algorithms used for updating a Q-table from an agent's experience. It is a Temporal Difference method, meaning updates happen after each time-step, rather than at the end of an episode. Succinctly, in Q-learning the value of a state-action pair is updated based on the value of the best possible action available at the next time-step, as per the following equation:

```Python
new_Q(St, At) = old_Q(St, At) + learning_rate * (R(t) + discount * max_a(old_Q(St+1, a)) - old_Q(St, At))
```

Given enough interaction episodes, if the Q-table converges to the true action-value function, then retrieving the optimal policy is a matter of looking up, for each state, which action maximises return.
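
The update rule and the policy look-up described above can be sketched in plain Python. This is a minimal illustration rather than the implementation used later: the table contents, the hyperparameter values, the `done` handling and the helper names `q_learning_update` and `greedy_policy` are all assumptions made for the example.

```python
# Minimal tabular Q-learning sketch. The table contents and the
# hyperparameter values are illustrative assumptions, not from the text.

def q_learning_update(Q, state, action, reward, next_state, done,
                      learning_rate=0.5, discount=0.9):
    """Apply one Q-learning update in place and return the new estimate."""
    # Value of the best action in the next state (0 if the episode ended).
    best_next = 0.0 if done else max(Q[next_state])
    td_target = reward + discount * best_next
    Q[state][action] += learning_rate * (td_target - Q[state][action])
    return Q[state][action]


def greedy_policy(Q):
    """For each state, pick the action with the highest estimated return."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


# A tiny hand-filled table: two states (rows), two actions (columns).
Q = [[0.0, 0.0], [0.0, 1.0]]
new_value = q_learning_update(Q, state=0, action=1, reward=1.0,
                              next_state=1, done=False)
policy = greedy_policy(Q)
```

Here the target is `1.0 + 0.9 * 1.0`, so the estimate for state 0, action 1 moves halfway from 0 towards 1.9.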

### Deep Q-Networks
Powerful as they may be, tabular Q-learning approaches are limited in that they are unable to generalise: Q values are estimated separately for individual *experienced* state-action pairs and they imply nothing for previously unobserved state-action pairs. Moreover, they are unsuited for continuous spaces as that would entail building infinitely large Q-tables. Instead of a table, it would be desirable to estimate a continuous function, which is exactly what function approximation methods aim to do.  

Deep Q-Networks are a neural-network based function approximation approach, where we aim to learn the parameters *w* of a Q'(s,a,w) function such that Q'(s,a,w) ~= Q(s,a) where Q is the "true" action-value function. 

# Algorithm

## The QNetwork class

We begin by defining a QNetwork class, which instantiates a neural network composed of fully-connected layers. The depth is fixed at two hidden layers, while the width of each hidden layer is specified as an argument.
```Python
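import torch.nn as nn
import torch.nn.functional as F


class QNetwork(nn.Module):
    """Maps a state to one estimated value per action."""

    def __init__(self, state_size, action_size, fc1_units=64, fc2_units=64):
        super(QNetwork, self).__init__()
        # Note: bn1 is created here but never applied in forward().
        self.bn1 = nn.BatchNorm1d(num_features=state_size)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        # No activation on the output layer: action values are unbounded.
        return self.fc3(x)
```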
QNetwork objects are neural networks that store and learn the weight parameters necessary for estimating the value of each action available to the agent when given as input a particular state. Accordingly, their input dimensions match the state size, and their output dimensions match the action size. A forward propagation method is defined such that if an instance of a QNetwork is passed a state, it returns the current estimate of value for each action in that state.
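
As a rough illustration of these input and output dimensions, the sketch below passes a batch of random states through a compact stand-in for the QNetwork above. The class name `TinyQNetwork`, the sizes and the batch of random states are assumptions made for the example only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyQNetwork(nn.Module):
    """Compact stand-in for the QNetwork above (hypothetical name)."""

    def __init__(self, state_size, action_size, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


# Assumed sizes, for illustration only.
state_size, action_size, batch = 8, 4, 5
net = TinyQNetwork(state_size, action_size)

states = torch.rand(batch, state_size)         # float32 by default
with torch.no_grad():                          # no gradients needed to act
    action_values = net(states)                # shape: (batch, action_size)
greedy_actions = action_values.argmax(dim=1)   # one action index per state
```

Each row of `action_values` holds one estimate per action, so taking the arg-max along dimension 1 yields the greedy action for every state in the batch.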

## The Agent class

We define an Agent class whose only arguments are the state size and action size:
```Python
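class Agent():

    def __init__(self, state_size, action_size):
        # device, lr, buffer_size and batch_size are expected to be
        # defined at module level; they are not arguments of Agent.
        self.state_size = state_size
        self.action_size = action_size

        # Two networks with identical architecture: local is trained,
        # target provides the learning targets.
        self.qnetwork_local = QNetwork(state_size, action_size).to(device)
        self.qnetwork_target = QNetwork(state_size, action_size).to(device)

        # Counts time-steps since the last learning update.
        self.t_step = 0

        # Only the local network's parameters are optimised.
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=lr)

        self.memory = ReplayBuffer(action_size, buffer_size, batch_size)
```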
When initialised with a particular state and action size, Agent passes those sizes to QNetwork and instantiates two QNetworks as attributes: a ```qnetwork_local```, which learns to approximate the true action-value function, and a ```qnetwork_target```, which sets the fixed targets for the local QNetwork to learn.

Agent is also endowed with a ```t_step``` attribute which we will use as a time-step counter for keeping track of how many time-steps have passed since the last time Agent updated its knowledge. In the ```optimizer``` attribute we specify which optimisation method we will use during gradient descent and weight updating.  
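
A minimal sketch of such a counter, assuming it is implemented with modulo arithmetic. The interval `UPDATE_EVERY = 4` and the twelve-step loop are assumed values chosen for illustration.

```python
# Sketch of the time-step counter. UPDATE_EVERY = 4 is an assumed value;
# the text does not state which setting is used.
UPDATE_EVERY = 4

t_step = 0
learning_steps = []              # time-steps at which learning would trigger

for time_step in range(1, 13):   # twelve environment steps
    t_step = (t_step + 1) % UPDATE_EVERY
    if t_step == 0:              # wraps to 0 once every UPDATE_EVERY steps
        learning_steps.append(time_step)
```

With this setting the counter wraps at time-steps 4, 8 and 12, so learning would be attempted on every fourth step.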

More interestingly, to enable Experience Replay, we endow the agent with a ```memory``` attribute, which is an instance of the ```ReplayBuffer``` class:
```Python
from collections import deque, namedtuple


class ReplayBuffer:

    def __init__(self, action_size, buffer_size, batch_size):
        self.action_size = action_size
        # A deque with maxlen drops its oldest item once it is full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        # Factory for named experience tuples.
        self.experience = namedtuple("Experience", field_names=['state', 'action', 'reward', 'next_state', 'done'])
```


A ReplayBuffer object such as ```memory``` requires the action size and two additional arguments to be initialised:
- ```buffer_size```, an integer specifying memory capacity, i.e. how many experiences memory can store.
- ```batch_size```, an integer specifying how many experiences the agent should 'recall' from memory each time it performs a learning step. We will get into this shortly.

ReplayBuffer has one other attribute, ```experience```, which we define with ```namedtuple```, giving a name both to the tuple type and to each of the fields we expect an experience to have.  
The convenient thing about ```namedtuple``` is that it returns a new tuple class. If we call that class *a*, then calling *a* with the appropriate number of values creates a tuple *b*:
```Python
from collections import namedtuple

a = namedtuple("Experience", field_names=['state', 'action', 'reward', 'next_state', 'done'])
b = a(1, 2, 0, 4, False)
```
and we can then access the elements of b by their field names. For example,
```Python
b.done
```
returns False. This will be convenient later on, when we recall ```experiences``` from ```memory``` for learning.

Having specified what Agent *has* - its attributes and initialisation procedure - we now turn to what Agent *does* - the methods available to it.  

```Python
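    def step(self, state, action, reward, next_state, done):
        # Store the latest transition in replay memory.
        self.memory.add(state, action, reward, next_state, done)

        # Counter wraps to 0 every update_every time-steps.
        self.t_step = (self.t_step + 1) % update_every

        if self.t_step == 0:
            # Learn only once memory holds more than one batch of experiences.
            if len(self.memory) > batch_size:
                experiences = self.memory.sample()
                self.learn(experiences, gamma)
```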


At every time-step, we will call ```step```, which first calls ```.add```, a method of ```memory```'s class, ReplayBuffer:
```Python
    def add(self, state, action, reward, next_state, done):
        # Wrap the transition in a named tuple, then store it.
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)
```

Calling ```memory.add``` on a (state, action, reward, next_state, done) tuple passes those values to ```memory```'s attribute ```experience```, which organises them as a named tuple. The result is then appended to ```memory```'s own ```memory``` attribute, which is the actual container: a deque with maximum length (capacity) ```buffer_size```.
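
The capacity behaviour of the deque can be seen in a stripped-down sketch. The class name `TinyReplayBuffer` and the capacity of 3 are assumptions for the example. The `__len__` method is also an assumption: `step` calls `len(self.memory)`, which suggests ReplayBuffer defines something similar, but that method is not shown in this write-up.

```python
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"])


class TinyReplayBuffer:
    """Stripped-down buffer for illustration (hypothetical class)."""

    def __init__(self, buffer_size):
        self.memory = deque(maxlen=buffer_size)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def __len__(self):
        # Assumed helper: lets callers write len(buffer), as step does.
        return len(self.memory)


buffer = TinyReplayBuffer(buffer_size=3)   # tiny capacity, for illustration
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

stored_states = [e.state for e in buffer.memory]
```

After five additions to a buffer of capacity three, only the three most recent experiences remain: the deque silently discards the oldest ones.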

Method ```step``` then updates the agent's ```t_step``` counter, resetting it to 0 every time ```update_every``` time-steps have passed. If the counter has just been reset *and* there are enough (more than ```batch_size```) experience tuples stored in memory, the agent calls ReplayBuffer's method ```.sample```:
```Python
    def sample(self):
        # Draw batch_size experiences uniformly at random, without replacement.
        experiences = random.sample(self.memory, k=self.batch_size)

        # Stack each field across the batch: one row per experience.
        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        # Actions are indices, so they are cast to long rather than float.
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        # Booleans go through uint8 so that True/False become 1.0/0.0.
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)
```
The function of ```.sample``` is to retrieve from memory a random selection of ```k = self.batch_size``` experiences to learn from, then rearrange the elements of each experience (state, action, reward, next_state and done booleans) into a convenient form - that is, to return a mini-batch that we can feed into a neural network. This is important because so far our data is organised per experience, which is convenient for retrieving every element of a single experience, whereas in subsequent steps we will want to retrieve elements of a particular type *across all the experiences in a mini-batch*.

To achieve this, ```.sample``` takes advantage of the namedtuple data structure. Let's take states as an example. We use a list comprehension over ```e.state``` for e in ```experiences``` to retrieve *only* the state component of each of the k experience tuples stored in the variable ```experiences```. We stack the resulting list into a 2-D numpy array with one row per experience by passing it to ```np.vstack```, then turn this into a pytorch tensor with ```torch.from_numpy```. Finally, we cast it with ```.float()``` and send it to ```device``` (the GPU, if one is being used) so that it is in the appropriate format to feed into a neural network for forward propagation. This process is repeated for each of the components of an experience tuple, except that actions are cast with ```.long()``` because they are later used as indices. The return value is therefore a tuple of 5 torch tensors.
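
The regrouping from "one tuple per experience" to "one collection per field" can be illustrated without NumPy or PyTorch. The sketch below uses `zip(*batch)` in place of the list comprehensions and `np.vstack` calls above. The hand-made experiences, the seed and the batch size of 3 are assumptions for the example.

```python
import random
from collections import namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"])

# Hand-made experiences with scalar states, purely for illustration.
memory = [Experience(s, s % 2, float(s), s + 1, s == 4) for s in range(5)]

random.seed(0)                          # reproducible draw
batch = random.sample(memory, k=3)      # uniform, without replacement

# zip(*batch) turns a list of 5-field experiences into 5 per-field tuples.
states, actions, rewards, next_states, dones = zip(*batch)
```

Each of the five resulting tuples has one entry per sampled experience, and entries at the same position still belong to the same experience.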
           
Having retrieved and re-organised experiences conveniently, ```step``` passes experiences into ```learn```:

```Python
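    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones = experiences

        # Best next-state value according to the target network, shape (batch, 1).
        Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        # Terminal transitions (dones == 1) keep only the reward.
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))
        # Local network's value for the action actually taken.
        Q_expected = self.qnetwork_local(states).gather(1, actions)

        loss = F.mse_loss(Q_expected, Q_targets)

        # One gradient-descent step on the local network.
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Update the target network from the local one (soft_update is not shown here).
        self.soft_update(self.qnetwork_local, self.qnetwork_target, tau)
```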
First, ```learn``` unpacks the ```experiences``` tuple into five tensors, one per field (float tensors, except ```actions```, which is a long tensor).

Since we don't know the optimal target values, we use in their stead the approximated target values ```Y = reward + gamma * (maximum value across actions for next state)```. Here, the maximum value across actions for the next state is stored in ```Q_targets_next```: we pass the ```next_states``` tensor to the target network, obtain values for every possible action in each next state, and take the maximum across actions. We then calculate Y as ```Q_targets``` according to the formula above. Notice that if an episode terminates at t + 1 there is no next state, and therefore Y should equal the reward. An elegant trick to implement this 'exception' is to invert the tensor ```dones``` (booleans stored as 0s and 1s) by subtracting it from 1 (True --> False because 1-1=0, False --> True because 1-0=1), then multiplying the result by the term ```Q_targets_next```. For transitions that ended the episode (that is, done = True), 1-dones=0, therefore ```Q_targets = rewards + (gamma * Q_targets_next * 0) <=> Q_targets = rewards + 0```. For transitions that did not end the episode, ```Q_targets = rewards + (gamma * Q_targets_next * 1) <=> Q_targets = rewards + gamma * Q_targets_next```.
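
The tensor operations in `learn` can be traced on a tiny hand-made batch. In this sketch the network outputs are replaced by fixed tensors, and the value `gamma = 0.9` and all numbers are assumptions chosen so the arithmetic is easy to follow.

```python
import torch

gamma = 0.9  # assumed discount factor, for illustration

# Stand-in for self.qnetwork_target(next_states): 3 transitions, 2 actions.
next_action_values = torch.tensor([[1.0, 2.0],
                                   [0.5, 0.1],
                                   [3.0, 4.0]])
rewards = torch.tensor([[1.0], [0.0], [2.0]])
dones = torch.tensor([[0.0], [0.0], [1.0]])   # last transition is terminal

# max(1) returns (values, indices); [0] keeps the values, and
# unsqueeze(1) restores the column shape (batch, 1).
Q_targets_next = next_action_values.max(1)[0].unsqueeze(1)
Q_targets = rewards + gamma * Q_targets_next * (1 - dones)

# Stand-in for self.qnetwork_local(states); gather picks, per row,
# the value of the action that was actually taken.
local_action_values = torch.tensor([[0.2, 0.7],
                                    [0.9, 0.3],
                                    [0.4, 0.6]])
actions = torch.tensor([[1], [0], [1]])       # long tensor of action indices
Q_expected = local_action_values.gather(1, actions)
```

The third transition is terminal, so its target is just its reward (2.0), while the first two add the discounted best next value (giving 2.8 and 0.45). `Q_expected` and `Q_targets` end up with the same `(batch, 1)` shape, which is what the mean-squared-error loss compares.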

The accompanying notebook begins with these imports:
```Python
import numpy as np

import random
from collections import namedtuple, deque
import matplotlib.pyplot as plt

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
```


Note that a tensor created from a numpy array of Python floats has dtype ```torch.float64```, whereas the weights of a QNetwork are ```torch.float32``` by default. Passing such a tensor to the QNetwork defined earlier without casting it fails:
```Python
a = torch.from_numpy(np.array([0.1,1.2,3.1]))
a
# tensor([0.1000, 1.2000, 3.1000], dtype=torch.float64)

net = QNetwork(state_size=3, action_size=2)
net(a)
# RuntimeError: Expected object of scalar type Double but got scalar type Float for argument #2 'mat2' in call to _th_mm
```
This is why ```.sample``` casts its tensors with ```.float()```.
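
As a hedged illustration of how the error above can be avoided, the sketch below sets up the same float64/float32 mismatch with a single `nn.Linear` layer (a stand-in for QNetwork, chosen here for brevity) and then casts the input with `.float()` before the forward pass. The flag `mismatch_raised` is a hypothetical name used only to record whether the uncast call failed.

```python
import numpy as np
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)    # stand-in for the network; float32 weights
a = torch.from_numpy(np.array([0.1, 1.2, 3.1]))   # float64, inherited from NumPy

try:
    layer(a)               # float64 input against float32 weights
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

out = layer(a.float())     # cast to float32 first, then forward
```

The exact wording of the error depends on the PyTorch version, but the remedy is the same: make the input dtype match the dtype of the layer's weights.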