# Introduction

### Background
In Reinforcement Learning (RL) we study agents whose goal is to interact with an environment in a way that maximises the cumulative reward collected over time (the return). RL problems can be formally defined as Markov Decision Processes (MDPs). This involves specifying:
- the set of states and actions available to the agent in an environment;
- the one-step dynamics of that environment, i.e. the probability that action *a* in state *s* at time *t* will result in state *s'* at time *t+1*;
- the reward that the environment will dispense after the agent transitions from a state *s* to a state *s'* due to action *a*.
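
To make these three ingredients concrete, here is a minimal sketch of a toy MDP written as plain Python data. Everything in it is hypothetical and chosen only for illustration: the state names, the action names, the probabilities and the rewards do not come from any environment discussed here.

```python
# Hypothetical two-state MDP, for illustration only.
states = ["cold", "warm"]
actions = ["wait", "heat"]

# One-step dynamics and rewards:
# dynamics[(s, a)] lists the possible (probability, next_state, reward) outcomes.
dynamics = {
    ("cold", "wait"): [(1.0, "cold", 0.0)],
    ("cold", "heat"): [(0.8, "warm", 1.0), (0.2, "cold", 0.0)],
    ("warm", "wait"): [(0.5, "warm", 1.0), (0.5, "cold", 0.0)],
    ("warm", "heat"): [(1.0, "warm", 0.5)],
}


def expected_reward(state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _, r in dynamics[(state, action)])


# Sanity check: outcome probabilities for every (state, action) pair sum to 1.
for outcomes in dynamics.values():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```

A model-free agent, as discussed next, never gets to read a table like `dynamics`; it only observes the sampled next state and reward after each action.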

To maximise return, the agent must learn an optimal policy, which tells it, for each state it experiences, which action will bring about the highest expected return. There are multiple methods for solving RL problems. In model-free approaches, the agent does not know the environment's one-step dynamics or reward function. Instead, it uses its experience of interacting with the environment to gradually build up an estimate of an action-value (Q) table, which specifies for each state *s* the expected return of taking each action *a*. 

Q-learning is one of several algorithms for updating a Q-table from an agent's experience. It is a Temporal Difference method, meaning updates happen after each time-step rather than at the end of an episode. Succinctly, in Q-learning the value of a state-action pair is updated based on the value of the best possible action available at the next time-step, as per the following equation:

```Python
new_Q(St, At) = old_Q(St, At) + learning_rate * (R(t) + discount * max_a Q(St+1, a) - old_Q(St, At))
```

Given enough interaction episodes, if the Q-table converges to the true action-value function, then retrieving the optimal policy is simply a matter of looking up, for each state, which action maximises the expected return.
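
The update rule and the policy look-up can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used later: the function names, the list-of-lists Q-table and the default hyperparameters are assumptions, and the `done` flag (which drops the bootstrap term at the end of an episode) is an addition that does not appear in the equation above.

```python
def q_learning_update(Q, state, action, reward, next_state, done,
                      learning_rate=0.1, discount=0.99):
    """Apply one tabular Q-learning update in place.

    Q is a list of rows, one per state, each holding one value per action.
    """
    # Value of the best action in the next state (zero if the episode ended).
    best_next = 0.0 if done else max(Q[next_state])
    td_target = reward + discount * best_next
    Q[state][action] += learning_rate * (td_target - Q[state][action])


def greedy_policy(Q):
    """For each state, return the index of the action with the highest Q-value."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


# Two states, two actions, all estimates initialised to zero.
Q = [[0.0, 0.0], [0.0, 0.0]]
q_learning_update(Q, state=0, action=1, reward=1.0, next_state=1, done=False,
                  learning_rate=0.5, discount=0.9)
policy = greedy_policy(Q)
```

After this single update only `Q[0][1]` has changed, so the greedy policy prefers action 1 in state 0 and falls back to the first action in state 1, where all estimates are still tied.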

### Deep Q-Networks
Powerful as they may be, tabular Q-learning approaches are limited in that they are unable to generalise: Q values are estimated separately for individual *experienced* state-action pairs and they imply nothing for previously unobserved state-action pairs. Moreover, they are unsuited for continuous spaces as that would entail building infinitely large Q-tables. Instead of a table, it would be desirable to estimate a continuous function, which is exactly what function approximation methods aim to do.  

Deep Q-Networks are a neural-network-based function approximation approach, in which we aim to learn the parameters *w* of a function Q'(s,a,w) such that Q'(s,a,w) ≈ Q(s,a), where Q is the "true" action-value function. 
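
To see why a parameterised Q'(s,a,w) generalises where a table cannot, consider the sketch below. It deliberately swaps the neural network for a much simpler linear approximator (one weight vector per action), and all names, feature values and the learning rate are illustrative assumptions. A single update on one state also shifts the estimate for a nearby state that was never visited.

```python
def q_hat(state, action, w):
    """Linear approximation of Q(s, a): dot product of state features and w[action]."""
    return sum(f * wi for f, wi in zip(state, w[action]))


def semi_gradient_step(state, action, target, w, lr=0.1):
    """Nudge w[action] so that q_hat(state, action, w) moves towards `target`."""
    error = target - q_hat(state, action, w)
    # For a linear model, the gradient with respect to w[action] is the state itself.
    w[action] = [wi + lr * error * f for wi, f in zip(w[action], state)]


# Two actions, two state features, all weights initialised to zero.
w = [[0.0, 0.0], [0.0, 0.0]]
semi_gradient_step([1.0, 0.0], action=0, target=1.0, w=w)

seen = q_hat([1.0, 0.0], 0, w)    # estimate for the state we just learned from
unseen = q_hat([0.9, 0.1], 0, w)  # estimate for a similar state we never visited
```

Both `seen` and `unseen` are now non-zero, whereas a tabular method would have left the unvisited state's entry untouched.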

# Algorithm

## The QNetwork class

We begin by defining a QNetwork class, which instantiates a neural network of three fully-connected layers. The depth is fixed by the class definition, while the widths of the two hidden layers are specified as arguments (```fc1_units``` and ```fc2_units```). 
```Python
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    
    def __init__(self, state_size, action_size, fc1_units=64, fc2_units=64):
        super(QNetwork, self).__init__()
        # NOTE: bn1 is registered here but never applied in forward() below
        self.bn1 = nn.BatchNorm1d(num_features=state_size)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)
        
    def forward(self, state):
        # Two ReLU hidden layers, then a linear output: one Q-value per action
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```
QNetwork objects are neural networks that store and learn the weight parameters necessary for estimating the value of each action available to the agent when given as input a particular state. Accordingly, their input dimensions match the state size, and their output dimensions match the action size. A forward propagation method is defined such that if an instance of a QNetwork is passed a state, it returns the current estimate of value for each action in that state.

## The Agent class

We define an Agent class whose only arguments are the state size and action size:
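```Python
import torch.optim as optim

# device, lr, buffer_size and batch_size are expected to be defined at module level
class Agent():
    
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        
        # Local network is trained; target network supplies the fixed targets
        self.qnetwork_local = QNetwork(state_size, action_size).to(device)
        self.qnetwork_target = QNetwork(state_size, action_size).to(device)
        
        # Only the local network's parameters are optimised
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=lr)
        
        # Replay memory for Experience Replay
        self.memory = ReplayBuffer(action_size, buffer_size, batch_size)
        
        # Time-step counter used to decide when to learn
        self.t_step = 0
```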
When initialised with a particular state and action size, the Agent class passes these two arguments to QNetwork and stores two separate QNetwork instances as attributes: a ```qnetwork_local```, which learns to approximate the true action-value function, and a ```qnetwork_target```, which sets the fixed targets for the local QNetwork to learn.  

To enable Experience Replay, we endow the agent with a ```memory``` attribute, which is an instance of the ```ReplayBuffer``` class. We will explain how this class works in greater detail below, but suffice it to say for now that it is a container with capacity ```buffer_size``` for storing the agent's experiences for later learning.  
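
As a rough mental model until then, a fixed-capacity experience container can be sketched with the standard library alone. This is not the ```ReplayBuffer``` class used by Agent: the name `SimpleReplayBuffer`, the `seed` parameter and the use of plain Python values instead of tensors are assumptions made for illustration, and the `action_size` argument of the real class is omitted.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class SimpleReplayBuffer:
    """Fixed-capacity store of experience tuples with uniform random sampling."""

    def __init__(self, buffer_size, batch_size, seed=0):
        # A deque with maxlen silently discards the oldest item once it is full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniformly sample batch_size distinct experiences.
        return self.rng.sample(list(self.memory), k=self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = SimpleReplayBuffer(buffer_size=3, batch_size=2)
for t in range(5):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample()
```

Because the capacity is 3, only the three most recent experiences survive the five calls to `add`, and `sample` draws its batch of two from those.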

Agent is also endowed with a ```t_step``` attribute which we will use as a time-step counter. This allows the Agent to keep track of how many time-steps have passed since the last time it updated its knowledge. Finally, Agent has an ```optimizer``` attribute where we simply specify which particular form of optimisation method we will use during gradient descent and weight updating.

Having specified what Agent *has* - its attributes and initialisation procedure - we now turn to what Agent *does* - the methods available to it.  

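```Python
    def step(self, state, action, reward, next_state, done):
        # Store the latest experience tuple in replay memory
        self.memory.add(state, action, reward, next_state, done)
        
        # Advance the counter, wrapping back to 0 every update_every time-steps
        self.t_step = (self.t_step + 1) % update_every
        
        # Learn only on wrap-around, and only once memory holds more than one batch
        if self.t_step == 0:
            if len(self.memory) > batch_size:
                experiences = self.memory.sample()
                self.learn(experiences, gamma)
```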
Besides its arguments, ```step``` relies on three variables defined prior to starting the training loop:
- ```update_every``` is an integer specifying how often, in time-steps, the agent should perform a learning update.
- ```batch_size``` is an integer specifying how many experiences the agent should 'recall' when performing a learning update.
- ```gamma``` is the discount factor, which ```step``` simply passes on to ```learn```.

At every time-step, we will call ```step```. This method first saves the agent's current experience to its memory container as a tuple of ```state```, ```action```, ```reward```, ```next_state``` and the boolean ```done```. It then updates the agent's ```t_step``` counter, resetting it to 0 every time ```update_every``` time-steps have passed. If the counter has just been reset *and* there are more than ```batch_size``` experience tuples stored in memory, the agent calls the ReplayBuffer method ```.sample``` to recall a number of experiences equal to ```batch_size``` and stores them in the variable ```experiences```. We don't need to pass the argument ```batch_size``` to ```memory.sample()``` because when we instantiated Agent we initialised its ```memory``` attribute with this argument baked in. Recall from above that in Agent:
```Python
        self.memory = ReplayBuffer(action_size, buffer_size, batch_size)
```
Finally, ```step``` calls the method ```learn```, passing it the sampled ```experiences``` that Agent should learn from, together with the discount factor ```gamma```.
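
The interplay between the wrap-around counter and the memory-size check is easy to trace by hand. The sketch below reproduces only that scheduling logic in plain Python. The function name `learning_steps` and the parameter values are hypothetical, and it assumes the buffer's capacity is never reached, so the number of stored experiences simply equals the number of steps taken.

```python
def learning_steps(total_steps, update_every, batch_size):
    """Return the (1-based) time-steps at which a learning update would run."""
    t_step = 0
    stored = 0  # stands in for len(self.memory)
    triggered = []
    for step_number in range(1, total_steps + 1):
        stored += 1                              # mirrors self.memory.add(...)
        t_step = (t_step + 1) % update_every     # wraps to 0 every update_every steps
        if t_step == 0 and stored > batch_size:  # both conditions must hold
            triggered.append(step_number)
    return triggered


schedule = learning_steps(total_steps=20, update_every=4, batch_size=8)
print(schedule)  # [12, 16, 20]
```

With these values the counter wraps at steps 4 and 8 as well, but no update runs there because memory does not yet hold more than 8 experiences.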

To update its current estimates of action values and the fixed targets it's approximating them to, Agent relies on the ```learn``` method:
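```Python
    def learn(self, experiences, gamma):
        # Each element is expected to be a batched tensor, one row per experience
        states, actions, rewards, next_states, dones = experiences
        
        # Highest target-network Q-value for each next state; detach() blocks gradients
        Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        # TD target; (1 - dones) removes the bootstrap term for terminal transitions
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))
        # Local-network Q-value of the action actually taken (actions: int64 indices)
        Q_expected = self.qnetwork_local(states).gather(1, actions)
        
        loss = F.mse_loss(Q_expected, Q_targets)
        
        # One gradient-descent step on the local network
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Update the target network from the local one (tau is defined outside the class)
        self.soft_update(self.qnetwork_local, self.qnetwork_target, tau)
```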
First, we unpack the ```experiences``` tuples