In [3]:
import numpy as np
import random
from collections import namedtuple, deque

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F

# A Deep Q-Network Implementation

## The QNetwork class
Create the QNetwork class, a fully connected (FC) network with three linear layers. It has i) parameters for state size and action size, ii) parameters for the number of units in each hidden layer, and iii) a method for forward propagation.

The line
```Python
super(QNetwork, self).__init__()
```

is there because QNetwork is a subclass of the parent class nn.Module. Calling the parent constructor first sets up the machinery nn.Module uses to register layers such as *fc1* as submodules, so that their weights show up in .parameters() and move with .to(device). QNetwork then adds its own attributes on top. Note that QNetwork is not a subclass of Agent (which we will define shortly); instead, an Agent holds two QNetwork instances as attributes.

In [5]:
class QNetwork(nn.Module):
    
    def __init__(self, state_size, action_size, seed, fc1_units=64, fc2_units=64):
        '''Initialise parameters and build the Q-network model.
        Arguments
        =========
            state_size (int): size of state space
            action_size (int): size of action space
            seed (int): random seed for replicability
            '''
        super(QNetwork, self).__init__()
        self.seed = torch.manual_seed(seed) # Sets the seed for generating random numbers
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)
        
    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

Set up the replay buffer size, minibatch size, discount factor gamma, soft update factor tau for the target network weights, learning rate, and how often (in time-steps) the network learns. Then check whether a GPU with CUDA is available and, if so, get it ready to use.

In [4]:
buffer_size = int(1e5)
batch_size = 64
gamma = 0.99
tau = 1e-3
lr = 5e-4
update_every = 4

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## The Agent class

### Agent attributes
To endow our Agent with its QNetwork 'brain', we set attributes qnetwork_local and qnetwork_target to be instances of the class QNetwork in the following lines:
```Python
        self.qnetwork_local = QNetwork(state_size, action_size, seed).to(device)
        self.qnetwork_target = QNetwork(state_size, action_size, seed).to(device)
```
Because we're using **Fixed Q Targets**, we actually set up two Q-networks: the target network, whose slowly changing weights produce the targets we regress towards, and the local network, which holds the current parameters of our approximation to the Q value function and is the one trained by gradient descent.

Here, we initialise the Agent's memory attribute, which is a collection of *experiences*, each of which is a tuple of form (state, action, reward, next_state, done).
```Python
        self.memory = ReplayBuffer(action_size, buffer_size, batch_size, seed)
```
And after it, we add a time-step counter.
### Agent methods: step
The step function,
```Python
        def step(self, state, action, reward, next_state, done):
```
takes in an experience tuple and adds it to the self.memory attribute of the agent, as seen above. Then, it advances the time-step counter and checks whether the number of time-steps after which to update (*update_every*) has been reached:
```Python
        self.t_step = (self.t_step + 1) % update_every
```

If it has (the counter has wrapped around to 0), *and* more than *batch_size* samples have been stored in memory, it draws a minibatch of *batch_size* experiences at random from memory and passes it to the .learn method.
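
To make the modulo counter concrete, here is a small standalone sketch in plain Python (no agent or environment involved) that records on which time-steps the counter wraps around to 0. The helper name `learning_steps` is made up for this illustration; `update_every` mirrors the setting defined at the top.

```python
def learning_steps(n_steps, update_every=4):
    """Return the 1-based time-steps at which the counter wraps to 0."""
    t_step = 0
    triggered = []
    for step in range(1, n_steps + 1):
        t_step = (t_step + 1) % update_every  # same update as in Agent.step
        if t_step == 0:
            triggered.append(step)
    return triggered

print(learning_steps(12))  # [4, 8, 12]
```

So with update_every = 4 the agent stores every experience but only attempts a learning step on every fourth one.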
### Agent methods: act
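*act* is the method for selecting an epsilon-greedy action given the current policy.  
First, we take the state and convert it from numpy to a pytorch tensor, unsqueeze it at position 0 (*returns a new tensor with a dimension of size one inserted at the specified position*), which turns the single state into a batch of one, and send it to the same device where our QNetworks reside.

We then set the local QNetwork model to eval() mode. This pytorch method switches layers such as dropout and batch normalisation to their inference behaviour. The reason for this is that in the next line,
```Python
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
```
we are accessing the QNetwork's final output values, by passing in a state and computing forward propagation through the QNetwork's current weights. Dropout and batch normalisation are used for improving training and generalization, so when selecting an action we want their inference behaviour instead. (Our QNetwork contains neither, so the call has no effect here, but it is a good habit.) torch.no_grad() additionally skips gradient tracking, since we are not going to backpropagate through this forward pass. Afterwards we switch the network back to train() mode. The code that follows is boilerplate for an epsilon-greedy choice: with probability 1 - eps we take the action with the highest value, otherwise a uniformly random action.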
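
The epsilon-greedy rule itself can be sketched without any network. The snippet below is a plain-Python illustration, not the Agent's code: the function name `epsilon_greedy`, the `rng` argument and the action values are all invented for the example.

```python
import random

def epsilon_greedy(action_values, eps, rng=random):
    """Pick the greedy action with probability 1 - eps, else a random one."""
    if rng.random() > eps:
        # Greedy: index of the largest action value
        return max(range(len(action_values)), key=lambda a: action_values[a])
    # Explore: uniformly random action index
    return rng.randrange(len(action_values))

rng = random.Random(0)            # separate, seeded generator for repeatability
q_values = [0.1, 0.7, 0.3, -0.2]  # made-up action values for one state

greedy = [epsilon_greedy(q_values, eps=0.0, rng=rng) for _ in range(5)]
explore = [epsilon_greedy(q_values, eps=1.0, rng=rng) for _ in range(5)]
print(greedy)   # always the index of the largest value
print(explore)  # random indices between 0 and 3
```

With eps = 1.0 the comparison can never succeed, so every action is random; as eps shrinks, the greedy branch is taken more and more often.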

### Agent methods: learn
Here we update value parameters given a batch of experience tuples.  
First, we *detach* the target QNetwork's output from the computational graph, ensuring no gradient is backpropped along this network. Then we obtain the maximum predicted Q values for the next states from the target model:
```Python
        Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        Q_expected = self.qnetwork_local(states).gather(1, actions)
```
On the second line, we use the Q-learning formula to compute Q-targets (our approximation of the true function) for the current states that we're learning from: the reward plus the discounted maximum Q value of the next state, where the factor (1 - dones) drops the second term for terminal transitions. We next obtain the expected Q values from the local model by forward propping *states* through it and using gather to pick out, for each row of the batch, the Q value of the action that was actually taken (gather(1, actions) indexes along dimension 1 with the stored action indices; it is not related to zip).  
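
As a rough illustration of what these three lines compute, here is the same arithmetic on a tiny made-up batch, written with plain Python lists instead of tensors. All numbers are invented, and the list comprehensions are only analogues of the tensor operations named in the comments.

```python
gamma = 0.99

# A made-up batch of three transitions
rewards = [1.0, 0.0, -1.0]
dones = [0.0, 0.0, 1.0]   # the third transition ended its episode
next_q = [[0.5, 2.0],     # target-network outputs for each next state
          [1.0, 0.0],
          [3.0, 4.0]]
local_q = [[0.2, 0.9],    # local-network outputs for each current state
           [0.4, 0.1],
           [0.0, -0.5]]
actions = [1, 0, 1]       # action indices that were actually taken

# Analogue of max(1)[0]: largest next-state value in each row
q_targets_next = [max(row) for row in next_q]

# rewards + gamma * Q_targets_next * (1 - dones)
q_targets = [r + gamma * q * (1 - d)
             for r, q, d in zip(rewards, q_targets_next, dones)]

# Analogue of gather(1, actions): in each row, keep the value at the taken action
q_expected = [row[a] for row, a in zip(local_q, actions)]

# Mean squared error between expected values and targets
loss = sum((e - t) ** 2 for e, t in zip(q_expected, q_targets)) / len(q_targets)

print(q_targets)
print(q_expected)
print(loss)
```

Note how the third target is just the reward: because that transition is terminal, (1 - dones) is 0 and the bootstrapped term vanishes.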
Next, we compute the loss (the mean squared error between *Q_expected* and *Q_targets*), run backprop and update the local weights:
```Python
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
```
Finally, we update the target QNetwork. Notice this is a different update from that executed on the local QNetwork. On the local network, we're doing a 'real' learned update with gradient descent. On the target network, we implement Fixed Q Targets through the *soft_update* function, which we'll look at next.
```Python
        self.soft_update(self.qnetwork_local, self.qnetwork_target, tau)
```
### Agent methods: soft update
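Here the donkey has gotten close enough, so we move the carrot a little bit. The soft update goes by the formula
```
θ_target = τ*θ_local + (1 - τ)*θ_target
```
using tau as initialised at the top, so each call moves every target weight a fraction τ of the way towards the corresponding local weight. The trailing _ in copy_ marks it as an in-place operation: it overwrites the target parameter's data rather than returning a new tensor.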
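
To get a feel for how slowly the target follows the local network, the sketch below applies the same formula to plain Python floats standing in for weights. The function here is a simplified stand-in written for this example, not the Agent method, and the weight values are made up.

```python
def soft_update(local_params, target_params, tau):
    """Blend each target value towards its local counterpart, in place."""
    for i, (local, target) in enumerate(zip(local_params, target_params)):
        target_params[i] = tau * local + (1.0 - tau) * target

tau = 1e-3
local = [1.0, -2.0]   # stand-ins for the local network's weights
target = [0.0, 0.0]   # stand-ins for the target network's weights

soft_update(local, target, tau)
after_one = list(target)

for _ in range(999):
    soft_update(local, target, tau)

# With fixed local weights, the gap shrinks by a factor (1 - tau) per update
remaining_gap = local[0] - target[0]  # the initial gap was 1.0
print(after_one, remaining_gap)
```

If the local weights stayed fixed, roughly a third of the original gap would still remain after 1000 soft updates with tau = 1e-3, which is what keeps the targets stable.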

In [6]:
class Agent():
    
    def __init__(self, state_size, action_size, seed):
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(seed)
        
        # Q-network 'brain'
        self.qnetwork_local = QNetwork(state_size, action_size, seed).to(device)
        self.qnetwork_target = QNetwork(state_size, action_size, seed).to(device)
        
        # Optimizer - notice only used on local qnetwork
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=lr)
        
        # Replay memory
        self.memory = ReplayBuffer(action_size, buffer_size, batch_size, seed)
        
        # Timestep counter
        self.t_step = 0
        
    def step(self, state, action, reward, next_state, done):
        # Save the current experience in replay memory
        self.memory.add(state, action, reward, next_state, done)
        
        # Learn after every 'update_every' steps
        self.t_step = (self.t_step + 1) % update_every
        
        # if 'update_every' time steps have been reached (modulo division above),
        if self.t_step == 0:
                # and if there are enough samples in memory:
            if len(self.memory) > batch_size:
                experiences = self.memory.sample()
                self.learn(experiences, gamma)
                
    def act(self, state, eps=0.):
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.qnetwork_local.eval() # pytorch method for setting a network to evaluation (stop dropout and batchnorm)
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()
        
        if random.random() > eps:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_size))
    
    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones = experiences
        
        Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))
        
        Q_expected = self.qnetwork_local(states).gather(1, actions)
        
        loss = F.mse_loss(Q_expected, Q_targets)
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        self.soft_update(self.qnetwork_local, self.qnetwork_target, tau)
    
    def soft_update(self, local_model, target_model, tau):
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)

## The ReplayBuffer class
Here we define ReplayBuffer which is an object with the ability to carry experiences up to a certain capacity (configured at the top, *buffer_size*).  
The attribute of ReplayBuffer which stores experiences is, appropriately, *self.memory*. We package experiences as named tuples in the self.experience attribute:
```Python
        self.experience = namedtuple("Experience", field_names=['state', 'action', 'reward', 'next_state', 'done'])
```
Calling self.experience on data of the right format organises that data by returning a tuple of name 'Experience', with the fields defined and named above. 
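
A short standalone example of this packaging, with made-up values for a single transition:

```python
from collections import namedtuple

Experience = namedtuple("Experience", field_names=['state', 'action', 'reward', 'next_state', 'done'])

# Made-up values for a single transition
e = Experience(state=[0.0, 1.0], action=2, reward=-0.5, next_state=[0.1, 0.9], done=False)

print(e.action)  # 2
print(e[2])      # -0.5

# A named tuple still unpacks like a plain tuple
state, action, reward, next_state, done = e
```

Fields can be read by name or by position, which is what lets the sample method later pull out e.state, e.action and so on.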

### ReplayBuffer methods: add
The .add method has the signature
```Python
        def add(self, state, action, reward, next_state, done):
```
and passes its arguments to self.experience for packaging. This experience is then appended to memory, up to the capacity *buffer_size*. Because *memory* is a deque created with maxlen=*buffer_size*, once it reaches capacity the rule is "new in, oldest out". We can imagine elements getting added left to right: when a new element is appended at the right of a full deque, the oldest element falls off at the left end, and everything shifts one step left.
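
The "new in, oldest out" behaviour is easy to check on a deque with a tiny capacity (3 here, chosen only for illustration):

```python
from collections import deque

memory = deque(maxlen=3)  # tiny capacity, just for illustration
for item in [1, 2, 3, 4, 5]:
    memory.append(item)   # appended at the right end

print(list(memory))  # [3, 4, 5]
print(len(memory))   # 3
```

The first two items were silently discarded from the left as the last two arrived on the right.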

### ReplayBuffer methods: sample
From *memory* we sample *batch_size* experiences at random, stack each field into a numpy array with one row per experience, convert these arrays to pytorch tensors, and send them to the device (GPU, if available) where we're doing computation.  
.sample returns a tuple of five tensors, one per field:
```Python
        return (states, actions, rewards, next_states, dones)
```
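
The regrouping step, from "one tuple per experience" to "one collection per field", can be sketched without numpy or pytorch. In this plain-Python illustration the transitions are made up and zip(*batch) plays the role that the per-field np.vstack calls play in the class.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ['state', 'action', 'reward', 'next_state', 'done'])

memory = deque(maxlen=100)
for i in range(10):
    # Made-up transitions: state i, action i % 2, reward 1.0, next state i + 1
    memory.append(Experience([float(i)], i % 2, 1.0, [float(i + 1)], i == 9))

rng = random.Random(0)                 # seeded generator for repeatability
batch = rng.sample(list(memory), k=4)  # four distinct experiences, no repeats

# Regroup from "one tuple per experience" to "one sequence per field"
states, actions, rewards, next_states, dones = zip(*batch)

print(len(states), len(actions))  # 4 4
```

Row j of every field still belongs to the same experience, which is what allows the learn method to combine rewards, next states and actions element by element.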

In [8]:
class ReplayBuffer:
    
    def __init__(self, action_size, buffer_size, batch_size, seed):
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=['state', 'action', 'reward', 'next_state', 'done'])
        self.seed = random.seed(seed)
    
    def add(self, state, action, reward, next_state, done):
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)
    
    def sample(self):
        experiences = random.sample(self.memory, k = self.batch_size)
        
        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)
  
        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.memory)

## Training
Finally, we will build a function for controlling the agent-environment interaction, calling the functions and classes we have defined above appropriately. The algorithm created in this notebook can be used with environments that have continuous observation space and discrete action space.

### DQN function

In [None]:
def dqn(n_episodes = 2000, max_t=1000, eps_start=1.0, eps_decay=0.995, eps_end=0.01):
    scores = []
    scores_window = deque(maxlen=100)
    eps = eps_start
    
    for i_episode in range(1, n_episodes+1):
        state = env.reset()
        score = 0
        
        for t in range(max_t):
            action = agent.act(state, eps)
            next_state, reward, done, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break

        scores_window.append(score)
        scores.append(score)
        eps = max(eps_end, eps_decay*eps)  # decay epsilon, but never below eps_end

        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end='')
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
        if np.mean(scores_window)>=200.0:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
            torch.save(agent.qnetwork_local.state_dict(), 'checkpoint.pth')
            break
    return scores
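
To see how long exploration lasts under the default arguments of dqn, the following standalone sketch counts the episodes until epsilon hits its floor. It assumes epsilon is multiplied by eps_decay after every episode and clipped at eps_end; the helper name `episodes_until_floor` is invented for this example.

```python
import math

def episodes_until_floor(eps_start=1.0, eps_decay=0.995, eps_end=0.01):
    """Count the episodes needed for epsilon to reach its floor eps_end."""
    eps = eps_start
    episodes = 0
    while eps > eps_end:
        eps = max(eps_end, eps_decay * eps)  # multiplicative decay with a floor
        episodes += 1
    return episodes

n = episodes_until_floor()
# Closed form: the smallest n with eps_start * eps_decay**n <= eps_end
closed_form = math.ceil(math.log(0.01 / 1.0) / math.log(0.995))
print(n, closed_form)
```

Under these assumptions the agent keeps exploring with gradually decreasing probability for a bit over 900 episodes, and acts almost greedily (eps = 0.01) for the rest of the run.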

In [None]:
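import gym
import matplotlib.pyplot as plt

# Uses the classic gym API (env.seed, 4-tuple from env.step), i.e. gym < 0.26
env = gym.make('LunarLander-v2')
env.seed(0)
# dqn() expects a global `agent`; size it from the environment's spaces
agent = Agent(state_size=env.observation_space.shape[0], action_size=env.action_space.n, seed=0)
scores = dqn()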

# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()