# Deep Q-learning

Deep reinforcement learning has achieved superhuman performance in a range of settings, such as chess, Atari 2600 games, and StarCraft, as well as in robotics. The main reason for this success, compared to traditional reinforcement learning, is the introduction of deep neural networks. The first successful combination of deep neural networks with a reinforcement learning algorithm was by Mnih et al., whose agent reached superhuman performance in classic Atari games. Their approach was named Deep Q-Networks (DQN) and is very similar to the Q-learning algorithm we implemented in the previous workshop, combined with deep neural networks. This is the algorithm we will implement in this notebook.

In the DQN approach, a neural network replaces the Q-table we used in the previous notebook. The main reason for this is the neural network's ability to generalize across similar states. In a continuous environment, a Q-table would quickly grow to tens of millions of unique states, even though many of them are so similar that the same action should be applied. This is a problem both because of the huge table we would need and because we would need a tremendous amount of training data to update all the states.
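
To get a feel for the numbers, here is a rough sketch of how a Q-table grows when each state variable is discretized into a fixed number of bins. The helper `q_table_entries` and the bin counts are illustrative assumptions; the 4 state variables and 2 actions match CartPole.

```python
def q_table_entries(bins_per_dimension, state_dimensions, action_count):
    """Number of Q-values in a table over a uniformly discretized state space."""
    return (bins_per_dimension ** state_dimensions) * action_count


# Assumed setup: 4 state variables and 2 actions, as in CartPole
for bins in (10, 50, 100):
    print("bins per dimension: {} -> table entries: {}".format(
        bins, q_table_entries(bins, 4, 2)))
```

Even a moderately fine grid gives a table with millions of entries, and each entry has to be visited many times before its value is reliable.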

Neural networks address this issue, but they come with their own challenges. They can be unstable during training and are prone to catastrophic forgetting and divergence. The success of Mnih et al. was due to several ingenious tricks that stabilized the network during training, and we will go through these during the workshop.

In the previous notebook we created a lot of methods, which resulted in a lot of parameters being passed from method to method. This can quickly become convoluted, so in this notebook we use classes instead. The QNetwork class defined later in this notebook is an example. When an instance of the class is created, the init method is run, and the different parameters are set and stored in that instance. The parameters can be accessed within the class using the "self.<name_of_parameter>" syntax, e.g. self.learning_rate.
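
As a minimal sketch of the pattern described above: the class name `TrainingConfig` and its `describe` method are made up for illustration and are not part of the workshop code.

```python
class TrainingConfig:
    """Hypothetical example class; not part of the workshop code."""

    def __init__(self, parameters):
        # __init__ runs once, when the instance is created
        self.learning_rate = parameters["learning_rate"]
        self.gamma = parameters["gamma"]

    def describe(self):
        # Values stored on self are available in every method of the instance
        return "learning_rate={}, gamma={}".format(self.learning_rate, self.gamma)


config = TrainingConfig({"learning_rate": 0.01, "gamma": 0.95})
print(config.describe())
```

Because the values live on the instance, methods such as `describe` do not need them passed in as arguments.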

### Create Q-Networks
There are several frameworks which can be used to create neural networks, such as TensorFlow from Google, PyTorch from Facebook, or MXNet from Apache. In this workshop we will be using TensorFlow, with a high-level API called Keras on top. Keras greatly simplifies the effort needed to create a neural network, and it is often recommended as a good deep learning framework for those who are just starting out.

Neural networks are partially inspired by the structure of the human brain and consist of a network of interconnected neurons. The neural network consists of an input layer, a set of hidden layers, and an output layer, as shown in the figure below.

<img src="images/Neural_network__achitecture.svg" alt="drawing" width="400" height="200"/>
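
To make the layer structure concrete, here is a small NumPy sketch of a forward pass through such a network. It is only an illustration under assumed values: the layer sizes (4 inputs, two hidden layers of 24 neurons, 2 outputs), the random weights, and the helper names `relu` and `forward` are not taken from the workshop code.

```python
import numpy as np


def relu(x):
    # ReLU activation: negative values are set to zero
    return np.maximum(x, 0.0)


def forward(observation, weights, biases):
    """Pass an observation through the hidden layers and a linear output layer."""
    activation = observation
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = relu(activation @ W + b)
    # The output layer is linear: one value per action
    return activation @ weights[-1] + biases[-1]


rng = np.random.default_rng(0)
layer_sizes = [4, 24, 24, 2]  # assumed: input, hidden 1, hidden 2, output
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

q_values = forward(np.array([0.01, -0.02, 0.03, 0.04]), weights, biases)
print(q_values.shape)
```

Keras builds and trains the same kind of structure for us, so in the task below we only have to declare the layers.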



#### Task
Now we will see how easy it is to create a neural network using Keras. We are going to create a neural network with an input layer, two hidden layers, and one output layer. The design criteria are as follows: <br>

1. The model should be fully connected 
<br>
<br>
2. The input layer and first hidden layer are created together, and should:
    - Be of type Dense
    - Use the ReLU activation function
    - Have input size equal to the observation size of the environment
    - Have hidden layer size equal to the parameter hidden_layer_1 
<br>
<br>
3. The second hidden layer should:
    - Be of type Dense
    - Use the ReLU activation function 
    - The hidden layer should have size equal to parameter hidden_layer_2
    - Tip1: This layer is similar to previous, but without the input_dim parameter
    - Tip2: Remember to add the dense layer using model.add()
<br>  
<br>
4. The output layer should:
    - Be of type Dense
    - The activation function should be linear
    - Should have size equal to action vector, i.e. action_size
<br>
<br>
5. The final part is to compile the network. Before we do this we must define some parameters
    - Loss metric should be Mean-squared-error (MSE)
    - Optimizer should be Adaptive Moment Estimation (Adam)
    - Learning rate should be equal to learning_rate
    - Learning rate decay should be equal to learning_rate_decay
<br>
As you can see below, steps 1, 2 and 5 are already done, so your task is to complete steps 3 and 4.

**Ps!** <br>
At the end of the workshop you can experiment with different structures and parameters, however in the first walkthrough, you will follow the defined steps.

In [None]:
import warnings
warnings.filterwarnings("ignore")
should_assert = True

In [None]:
from tensorflow.python.keras import Sequential
from tensorflow.python.keras.layers import Dense
from tensorflow.python.keras.metrics import MSE
from tensorflow.python.keras.optimizers import Adam


class QNetwork:

    def __init__(self, env, parameters):
        self.observations_size = env.observation_space.shape[0]
        self.action_size = env.action_space.n
        self.learning_rate = parameters["learning_rate"]
        self.learning_rate_decay = parameters["learning_rate_decay"]
        self.loss_metric = parameters["loss_metric"]
        self.hidden_layer_1 = parameters["hidden_layer_1"]
        self.hidden_layer_2 = parameters["hidden_layer_2"]
        
    def build_q_network(self):
        
        model = Sequential()
        model.add(Dense(self.hidden_layer_1, input_dim=self.observations_size, activation='relu'))
        
        "Input code below"
        model.add(Dense(self.hidden_layer_2, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        
        "Input code above"
        
        model.compile(loss=self.loss_metric, optimizer=Adam(lr=self.learning_rate, decay=self.learning_rate_decay))
        
        return model

In [None]:
import gym  # needed here because gym.make is called before the later import cell

if should_assert:
    parameters = {"loss_metric" : "mse",
                "learning_rate" : 0.01,
                "learning_rate_decay": 0.01,
                "hidden_layer_1": 24,
                "hidden_layer_2": 24}

    env = gym.make("CartPole-v0")
    q_network = QNetwork(env, parameters)
    model = q_network.build_q_network()
    config_network = model.get_config()
    assert(config_network.get("name")[0:10] == "sequential")

    # Layers
    config_layers = config_network.get("layers")
    assert(len(config_layers) == 3), "You should have exactly 3 Dense layers"

    # First layer
    assert(config_layers[0].get("class_name") == "Dense"),"Incorrect layertype in first layer"
    assert(config_layers[0].get("config").get("batch_input_shape") == (None, q_network.observations_size))
    assert(config_layers[0].get("config").get("units") == 24), "Incorrect number of neurons in first layer"
    assert(config_layers[0].get("config").get("activation") == "relu"), \
    "Activation function for first layer should be relu"

    # Second layer
    assert(config_layers[1].get("class_name") == "Dense"),"Incorrect layertype in second layer"
    assert(config_layers[1].get("config").get("units") == 24), "Incorrect number of neurons in second layer"
    assert(config_layers[1].get("config").get("activation") == "relu"), \
    "Activation function for second layer should be relu"

    # Third layer
    assert(config_layers[2].get("class_name") == "Dense"),"Incorrect layertype in third layer"
    assert(config_layers[2].get("config").get("units") == q_network.action_size), "Incorrect number of neurons in third layer"
    assert(config_layers[2].get("config").get("activation") == "linear"),\
    "Activation function for third layer should be linear"


    config_optimizer = model.optimizer.get_config()
    assert(0.0099 < config_optimizer.get("lr") <= 0.01),"Learning rate should be 0.01"
    assert(0.0099 < config_optimizer.get("decay") <= 0.01),"Learning rate decay should be 0.01"
    assert(model.loss == "mse"), "Loss metric should be mse"
    
    print("Superb, you implemented the layers correctly. PS, Ignore the warning above!")
    print("Information about your model is listed below. PS. The input layer is not shown,")
    print("although you should be able to calculate it by looking at <Param #> of the first hidden layer.")
    model.summary()
    

### Create an Experience Buffer

For a neural network to train well, we want the data to be I.I.D. (Independent and Identically Distributed). In supervised learning, where for instance the network is fed an image and predicts either cat or dog, this is achieved by randomly sampling the training data from the full data set. As a result:
- Batches have similar data distributions.
- Samples in each batch are independent of each other.

In reinforcement learning, where our agent collects data by moving from state to state, recently sampled data will be highly correlated. As a result, we would feed our network data points that closely resemble each other. This is a recipe for disaster when working with neural networks, as it leads to overfitting: the network fits the most recent, similar samples instead of learning values that generalize across states.

In addition, the data distribution of the initial states will be different from that of later stages, i.e. it does not satisfy the I.I.D. criterion. This is bad news!

Luckily, Mnih et al. have a solution. Instead of training only on the most recently collected samples, we store all experiences in an **Experience Buffer** and sample randomly from it.
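
The two ingredients of such a buffer, a bounded store and uniform random sampling, can be sketched with the standard library alone. The buffer size of 5 and the integer "experiences" are placeholders chosen for illustration only.

```python
import random
from collections import deque

# A bounded buffer: once it is full, the oldest entry is dropped automatically
buffer = deque(maxlen=5)
for step in range(8):
    buffer.append(step)  # placeholder for a (state, action, reward, ...) tuple

print(list(buffer))  # only the 5 most recent entries remain

# Uniform sampling without replacement breaks up the time ordering
random.seed(0)
batch = random.sample(list(buffer), 3)
print(batch)
```

Sampling from the whole buffer mixes old and new experiences in each batch, which is what brings the training data closer to the I.I.D. assumption.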




In [None]:
import numpy as np
import random
from collections import deque, namedtuple

class ExperienceReplay:
    
    def __init__(self, parameters):
        self.buffer_size = parameters["buffer_size"]
        self.batch_size = parameters["batch_size"]
        self.experience_buffer = deque(maxlen=self.buffer_size)
        
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "done", "new_state"])
    
    def add_experience(self, state, action, reward, done, new_state):   
        e = self.experience(state, action, reward, done, new_state)
        self.experience_buffer.append(e)
    
    def get_batch(self):
        
        if len(self.experience_buffer) < self.batch_size:
            experiences =  self.experience_buffer
        else:
            experiences = random.sample(self.experience_buffer, self.batch_size)
        
        states = np.vstack([e.state for e in experiences if e is not None])
        actions = np.vstack([e.action for e in experiences if e is not None])
        rewards = np.vstack([e.reward for e in experiences if e is not None])
        new_states = np.vstack([e.new_state for e in experiences if e is not None])
        dones = np.vstack([e.done for e in experiences if e is not None])
        
        return (states, actions, rewards, dones, new_states)
    
    def warm_up(self, env):
        for _ in range(10):
            state = env.reset()
            done = False
            
            while not done:
                action = env.action_space.sample()
                new_state, reward, done, _ = env.step(action)
                self.add_experience(state, action, reward, done, new_state)
                state = new_state

## Create Agent

### Target network
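In supervised learning the labels are also stationary, e.g. an image of a cat will always have the label cat. This is not the case in reinforcement learning.

As we saw in the previous notebook, the Q-value of the next state is calculated under the assumption that the Q-table is correct. However, since we begin each training session with a blank table and update the values as we go along, the highest Q-value of the next state, which is used as the target in the Q-learning equation below, is itself an estimate that keeps changing during training. The same holds when the Q-table is replaced by a neural network, which is why the agent below keeps a separate target network for computing these values.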

\begin{equation}
Q(s_{t},a_{t})^{new} \leftarrow Q(s_{t}, a_{t}) + \alpha \ \big[r_{t} + \gamma \ \max_{a} \  Q(s_{t+1},a) - Q(s_{t},a_{t})\big]
\end{equation}
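
The target part of this equation, $r_{t} + \gamma \max_{a} Q(s_{t+1},a)$, can be worked through on a tiny batch. The numbers below are made up for illustration; the `(1 - dones)` factor mirrors what the `learn` method of the agent does, so that terminal transitions get no bootstrapped future value.

```python
import numpy as np

gamma = 0.95
rewards = np.array([[1.0], [1.0]])
dones = np.array([[0], [1]])          # the second transition ended the episode
q_next = np.array([[2.0, 3.0],        # made-up Q-values for the next states,
                   [5.0, 4.0]])       # one column per action

# Highest Q-value of each next state, as a column vector
max_next = np.amax(q_next, axis=1).reshape(-1, 1)

# Reward plus discounted future value, masked out for terminal transitions
targets = rewards + gamma * max_next * (1 - dones)
print(targets)
```

The first row gets 1 + 0.95 * 3 = 3.85, while the second row is terminal and keeps only its reward of 1.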

In [None]:
import os
import datetime
import gym

class Agent:    

    def __init__(self, env, parameters):
        self.env = env
        self.local_network = QNetwork(env, parameters).build_q_network()
        self.target_network = QNetwork(env, parameters).build_q_network()
        self.experience_replay = ExperienceReplay(parameters)
        
        self.epsilon = parameters["epsilon_init"]
        self.epsilon_decay = parameters["epsilon_decay"]
        self.epsilon_minimum = parameters["epsilon_minimum"]
        self.tau = parameters["tau"]
        self.gamma = parameters["gamma"]
        self.epochs = parameters["epochs"]
    
    def learn(self):
        states, actions, rewards, dones, next_states = self.experience_replay.get_batch()
        
        # Get Q-values for next state
        Q_target = self.target_network.predict(next_states)

        # Apply Q-learning algorithm to calculate the actual Q-value for state
        Q_calc = rewards + (self.gamma * np.amax(Q_target, axis=1).reshape(-1, 1) * (1 - dones))
        
        # Calculate the predicted Q-value for the action taken in the state using local network
        Q_local = self.local_network.predict(states)
        
        # Replace the Q-value of the chosen action with the calculated target, i.e. Q_calc
        for row, col_id in enumerate(actions):
            # np.asscalar was removed from NumPy; .item() extracts the action index
            Q_local[row, col_id.item()] = Q_calc[row, 0]
        
        # Train the local network: inputs are the states, targets are the updated Q-values
        self.local_network.fit(states, Q_local, epochs=self.epochs, verbose=0)
    
    def update_target_network(self):
        local_weights = self.local_network.get_weights()
        target_weights = self.target_network.get_weights()

        for i in range(len(local_weights)):
            target_weights[i] = self.tau * local_weights[i] + (1 - self.tau) * target_weights[i]
        self.target_network.set_weights(target_weights)
    
    def update_epsilon(self):
        if self.epsilon >= self.epsilon_minimum:
            self.epsilon *= self.epsilon_decay
        
    def select_action(self, state):
    
        if self.epsilon > np.random.uniform():
            action = self.env.action_space.sample()
        else:
            action = np.argmax(self.local_network.predict(np.array([state])))

        return action
    
    def step(self, env, state):
        
        action  = self.select_action(state)
        new_state, reward, done, _ = env.step(action)

        return action, reward, done, new_state
    
    def save(self):
        save_dir = os.path.join(os.getcwd(), self.env.spec.id + "_" + datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))
        os.makedirs(save_dir)
        # Save both networks inside the newly created directory
        self.target_network.save(os.path.join(save_dir, "target_network.h5"))
        self.local_network.save(os.path.join(save_dir, "local_network.h5"))
        print("Weights saved successfully")
        
    def load(self, path):
        self.target_network.load_weights(path + "/target_network.h5")
        self.local_network.load_weights(path + "/local_network.h5")
        print("Weights loaded successfully")
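
The `update_target_network` method above blends the local weights into the target weights using `tau`. A minimal NumPy sketch of that blending, with made-up weight arrays and a hypothetical helper name `soft_update`, shows how slowly the target follows the local network:

```python
import numpy as np


def soft_update(local_weights, target_weights, tau):
    """Move each target weight array a fraction tau towards the local one."""
    return [tau * lw + (1 - tau) * tw
            for lw, tw in zip(local_weights, target_weights)]


# Made-up weights: the local network is all ones, the target starts at zero
local = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]
tau = 0.05

for _ in range(3):
    target = soft_update(local, target, tau)

print(target[1])  # each value equals 1 - (1 - tau) ** 3
```

With a small `tau` the targets used in the Q-learning update change only gradually, which is the stabilizing effect the target network is meant to provide.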

### Train

In [None]:
def train(agent, env, iterations, episodes):
    
    total_reward = 0
    total_reward_list, iterations_list = [], []
    agent.experience_replay.warm_up(env)
    
    for episode in range(episodes):
        
        state = env.reset()
        total_reward=0
        
        
        if (episode != 0): 
            agent.update_epsilon()
    
        for iteration in range(iterations):
            
            action, reward, done, new_state = agent.step(env, state)
            agent.experience_replay.add_experience(state, action, reward, done, new_state)
            
            state = new_state
            
            agent.learn()
            agent.update_target_network()
            total_reward += reward
            
            if done: 
                break
        
        total_reward_list.append(total_reward)
        iterations_list.append(iteration+1)
        
        if episode % 10 == 0:
            print("Episode: {} | Average iterations: {} | Average total reward: {} | Epsilon: {} " \
                  .format(episode, np.mean(iterations_list), np.mean(total_reward_list), agent.epsilon))
            total_reward_list.clear()
            iterations_list.clear()

### Play

In [None]:
def play(agent, env):
    
    done = False
    agent.epsilon = 0
    total_reward = 0
    state = env.reset()
    
    while not done:
        action, reward, done, new_state = agent.step(env, state)
        state = new_state
        
        total_reward += reward
    
    print("Total Reward: {}".format(total_reward))

## Use agent

In [None]:
def get_hyperparameters(env):
    
    env_id = env.spec.id

    if env_id == "CartPole-v0":
        print("Hyperparameters for {} chosen!".format("CartPole-v0"))
        parameters = {
            "tau" : 0.05,
            "gamma" : 0.95,
            "epsilon_init" : 1,
            "epsilon_decay" : 0.97,
            "epsilon_minimum": 0.01,
            "buffer_size" : 10000,
            "batch_size" : 32,
            "epochs": 1,
            "loss_metric" : "mse",
            "learning_rate" : 0.01,
            "learning_rate_decay": 0.01,
            "hidden_layer_1": 24,
            "hidden_layer_2": 24}
        
    elif env_id == "CartPole-v1":
        print("Hyperparameters for {} chosen!".format("CartPole-v1"))
        parameters = {
            "tau" : 0.05,
            "gamma" : 0.99,
            "epsilon_init" : 1,
            "epsilon_decay" : 0.999,
            "epsilon_minimum": 0.01,
            "buffer_size" : 2000,
            "batch_size" : 64,
            "epochs": 1,
            "loss_metric" : "mse",
            "learning_rate" : 0.01,
            "learning_rate_decay": 0.01,
            "hidden_layer_1": 24,
            "hidden_layer_2": 24}
        
    elif env_id == "LunarLander-v2":
        print("Hyperparameters for {} chosen!".format("LunarLander-v2"))
        parameters = {
            "tau" : 0.05,
            "gamma" : 0.99,
            "epsilon_init" : 1,
            "epsilon_decay" : 0.999,
            "epsilon_minimum": 0.01,
            "buffer_size" : 2500,
            "batch_size" : 64,
            "epochs": 1,
            "loss_metric" : "mse",
            "learning_rate" : 0.01,
            "learning_rate_decay": 0.01,
            "hidden_layer_1": 24,
            "hidden_layer_2": 24}
    else:
        print("Standard hyperparameters {} chosen!".format(env.spec.id))
        parameters = {
            "tau" : 0.05,
            "gamma" : 0.95,
            "epsilon_init" : 1,
            "epsilon_decay" : 0.97,
            "epsilon_minimum": 0.01,
            "buffer_size" : 10000,
            "batch_size" : 32,
            "epochs": 1,
            "loss_metric" : "mse",
            "learning_rate" : 0.01,
            "learning_rate_decay": 0.01,
            "hidden_layer_1": 24,
            "hidden_layer_2": 24}
    
    return parameters
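
To see what the `epsilon_decay` values above mean in practice, the following sketch counts how many calls to `update_epsilon` it takes before epsilon drops below `epsilon_minimum`. The helper name `episodes_until_minimum` is hypothetical; the loop follows the same rule as `Agent.update_epsilon`.

```python
def episodes_until_minimum(epsilon_init, epsilon_decay, epsilon_minimum):
    """Count the decay steps needed before epsilon falls below the minimum."""
    epsilon, updates = epsilon_init, 0
    # Same rule as Agent.update_epsilon: decay while epsilon >= minimum
    while epsilon >= epsilon_minimum:
        epsilon *= epsilon_decay
        updates += 1
    return updates


for decay in (0.97, 0.999):
    print("decay {}: {} updates".format(
        decay, episodes_until_minimum(1, decay, 0.01)))
```

Since epsilon is updated once per episode, a decay of 0.999 needs several thousand episodes to reach the minimum. With the 3000 episodes used in the next cell, epsilon would still be around 0.05 at the end of training.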

In [None]:
import gym

environment = "CartPole-v1"

env = gym.make(environment)

episodes = 3000
iterations = 500
parameters = get_hyperparameters(env)

dqn_agent = Agent(env, parameters)
train(dqn_agent, env, iterations, episodes)

## Run - Record - Show : One simulation

In [None]:
env = gym.make("CartPole-v1")
monitor = gym.wrappers.Monitor(env, directory="videos", force=True)

#Why is three simulations needed for saving?!
for _ in range(3):
    play(dqn_agent, monitor)

monitor.close()
env.close()

#HTML("""
#<video width="640" height="480" controls>
#  <source src="{}" type="video/mp4">
#</video>
#""".format("./videos/"+list(filter(lambda s: s.endswith(".mp4"), os.listdir("./videos/")))[-1]))

### If we want to plot
from matplotlib import pyplot as plt <br>
from IPython.display import display, clear_output

fig = plt.figure()
ax = fig.add_subplot(111)
plt.ion()

fig.show()
fig.canvas.draw()
reward_list, episode_list, iteration_list = [], [], []

episode_list.append(episode)
reward_list.append(total_reward)

if episode % 15 == 0:
    ax.clear()
    ax.plot(episode_list, reward_list)
    fig.canvas.draw()