# Deep Q-learning
##### Authors: Eirik Fagtun Kjærnli and Fabian Dietrichson [Accenture]

Deep Reinforcement learning has resulted in superhuman performance in a range of scenarios, such as Chess, Atari 2600, and Starcraft, but also in robotics. The main reason for this success compared to traditional Reinforcement Learning is the introduction of deep neural networks. The first successful combination of deep neural networks with a Reinforcement learning algorithm was by Mnih et al., which reached superhuman performance in classic Atari games. Their approach was coined Deep Q-Networks (DQN), and it combines the Q-learning algorithm we used in the previous notebook with deep neural networks.

In the DQN approach, a neural network replaces the Q-table which we used in the previous notebook. The main reason is the neural network's ability to generalize across similar states. In a continuous environment, a Q-table would quickly grow to tens of millions of unique states, even though many of these are so similar that the same action should be applied. This is a problem both because of the huge table we would need and because we would need a tremendous amount of training data to update all the states.
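
To get a feel for these numbers, the sketch below counts the entries of a Q-table when a continuous observation is discretized into bins. It assumes four state variables and two actions (as in CartPole); the bin counts and the helper name `q_table_entries` are made up for illustration.

```python
# Rough size of a Q-table after discretizing a continuous observation.
# Assumptions: 4 state variables and 2 actions (as in CartPole),
# and a hypothetical number of bins per state variable.
def q_table_entries(num_state_variables, bins_per_variable, num_actions):
    num_states = bins_per_variable ** num_state_variables
    return num_states * num_actions


for bins in (10, 50, 100):
    entries = q_table_entries(num_state_variables=4, bins_per_variable=bins, num_actions=2)
    print("{:>3} bins per variable -> {:,} Q-values".format(bins, entries))
```

Even a moderately fine grid pushes the table into the millions of entries, and each entry must be visited several times before its value is reliable.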

Neural networks address this issue, but come with their own challenges. They struggle with instabilities during training, in addition to catastrophic forgetting and divergence. The success of Mnih et al. was due to several ingenious tricks which stabilized the network during training, and we will go through these during the workshop.

In the previous notebook, we created a lot of methods, which resulted in many parameters being passed from method to method. This can quickly become convoluted, and in this notebook we instead use classes. An example can be seen below. When the class is created the init method is run, and the different parameters are set and stored in this instance of the class. The parameters can be called within the class using "self.<name_of_parameter>" syntax, e.g. self.learning_rate.
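
A minimal sketch of such a class is shown here. The class `ExampleAgent` and its parameters are hypothetical and only illustrate how values are stored on, and read from, an instance through `self`.

```python
class ExampleAgent:
    """Hypothetical class used only to illustrate `self`; not part of the workshop code."""

    def __init__(self, parameters):
        # Runs once when the instance is created and stores the values on the instance
        self.learning_rate = parameters["learning_rate"]
        self.gamma = parameters["gamma"]

    def describe(self):
        # Any method of the class can read the stored values through `self`
        return "learning_rate={}, gamma={}".format(self.learning_rate, self.gamma)


example_agent = ExampleAgent({"learning_rate": 0.01, "gamma": 0.99})
print(example_agent.describe())
```

Because the values live on the instance, they no longer have to be passed from method to method.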

## Import packages and create support methods for workshop
Before you can go on, the cell below must be run. These are methods used to verify your work, in addition to the support function that will be used throughout the notebook. 

### Task 1
Import the packages needed for this workshop: mark the cell below and press CTRL + Enter.

### Task 2
Create the necessary support methods by running the second cell below. Mark it and press CTRL + Enter

In [None]:
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [None]:
import warnings
import logging
import os
import datetime
import gym
import io
import base64
import glob
import numpy as np
import random
from copy import deepcopy
from statistics import mean
from IPython.display import HTML
from collections import deque, namedtuple

from tensorflow.python.keras import Sequential
from tensorflow.python.keras.layers import Dense
from tensorflow.python.keras.metrics import MSE
from tensorflow.python.keras.optimizers import adam_v2
from tensorflow.python.keras.initializers import glorot_uniform

from IPython import display as ipythondisplay

from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

if not os.path.exists("videos"):
    os.makedirs("videos")
    
warnings.filterwarnings("ignore")
logging.getLogger('tensorflow').disabled = True
%precision 3
should_assert = True

In [None]:
def generate_blank_weights(layer_dims):
    
    # Xavier/Glorot Initialization
    new_weights = np.random.randn(layer_dims[0], layer_dims[1])*np.sqrt(1/layer_dims[0])
    new_bias = np.zeros((layer_dims[1]))
    
    return [new_weights, new_bias]

def create_dummybuffer():
    mock_env = gym.make("CartPole-v1")
    parameters = {"buffer_size": 6,
                  "batch_size": 3}
    DummyBuffer = ExperienceReplay(mock_env, parameters)
    
    return DummyBuffer

def get_dummy_parameters_and_env():
        
        parameters = {
            "tau" : 0.4,
            "gamma" : 1,
            "epsilon_init" : 1,
            "epsilon_decay" : 1,
            "epsilon_minimum": 0.01,
            "buffer_size" : 2000,
            "batch_size" : 64,
            "epochs": 1,
            "loss_metric" : "mse",
            "learning_rate" : 0.01,
            "learning_rate_decay": 0.01,
            "hidden_layer_1": 5,
            "hidden_layer_2": 5}
        
        dummy_env = gym.make("CartPole-v0")
        return parameters, dummy_env 
    
def clear_video_folder():
    video_path = "videos"
    for item in os.listdir(video_path):
        os.remove(os.path.join(video_path, item))

def play(agent):
    
    old_epsilon = agent.epsilon
    
    done = False
    agent.epsilon = 0
    total_reward = 0
    state = agent.env.reset()
    
    while not done:
        action, reward, done, new_state = agent.step(state)
        state = new_state
        
        total_reward += reward
    print("Total Reward: {}".format(total_reward))    
    
    agent.epsilon = old_epsilon
    
def generate_video(agent):
    
    if os.listdir("./videos/"):
        clear_video_folder()

    monitor = gym.wrappers.Monitor(agent.env, directory="videos", force=True)
    original_env = agent.env
    agent.env = monitor
    
    old_epsilon = agent.epsilon
    
    for _ in range(15):
        total_reward = 0
        done = False
        agent.epsilon = 0
        total_reward = 0
        state = agent.env.reset()
        while not done:
            action, reward, done, new_state = agent.step(state)
            state = new_state

            total_reward += reward
    
    print("Total Reward: {}".format(total_reward))    
    agent.epsilon = old_epsilon
    agent.env = original_env
    monitor.close()

def show_video():
  mp4list = glob.glob('videos/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

## Create Q-Networks
There are several frameworks which can be used to create neural networks, such as TensorFlow from Google, PyTorch from Facebook, or MXNet from Apache. In this notebook, we will be using TensorFlow, with a high-level API called Keras on top. Keras greatly simplifies the effort needed to create and train a neural network, and it is often recommended as a starting point for those who are new to neural networks.

Neural networks are partially inspired by the structure of the human brain and consist of a network of interconnected neurons. The neural network consists of an input layer, a set of hidden layers, and an output layer, as shown in the figure below.

<img src="https://github.com/acntech/reinforcement-learning-workshop/blob/master/Workshop/Images/Neural_network__achitecture.svg?raw=1" alt="drawing" width="400" height="200"/>



### Task
Now we will experience how easy it is to create a neural network using Keras. We are going to create a neural network with an input layer, two hidden layers, and one output layer. We have completed steps 1, 2, and 5, and your task is to complete steps 3 and 4 according to the design criteria below: <br>

1. The model should be fully connected 
<br>
<br>
2. The input layer and first hidden layer are created together, and should:
    - Be of type Dense
    - Use the ReLU activation function
    - Have input size equal to the observation size of the environment
    - Have hidden layer size equal to parameter hidden_layer_1 
<br>
<br>
3. The second hidden layer should:
    - Be of type Dense
    - Use the ReLU activation function 
    - Have size equal to parameter hidden_layer_2
    - Tip 1: This layer is similar to the previous one, but without the input_dim parameter
    - Tip 2: Remember to add the dense layer using model.add()
<br>
<br>
4. The output layer should:
    - Be of type Dense
    - Use a linear activation function
    - Have size equal to the size of the action vector, i.e. action_size
<br>
<br>
5. The final part is to compile the network. Before we do this we must define some parameters:
    - Loss metric should be mean squared error (MSE)
    - The optimizer should be Adaptive Moment Estimation (Adam)
    - Learning rate should be equal to learning_rate
    - Learning rate decay should be equal to learning_rate_decay
<br>

**Ps!** <br>
At the end of the workshop, you can experiment with different structures and parameters, however, in the first walkthrough, you will follow the defined steps.
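
As a sanity check on the network size, the sketch below counts the trainable parameters of a fully connected network without using Keras. It assumes 4 observations and 2 actions (CartPole) and two hidden layers of 24 neurons, as in the assertion cell further down; the helper `dense_parameter_count` is made up for this illustration.

```python
def dense_parameter_count(input_size, units):
    # One weight per (input, unit) pair, plus one bias per unit
    return input_size * units + units


# Assumed layer sizes: 4 observations, two hidden layers of 24 neurons, 2 actions
layer_sizes = [4, 24, 24, 2]
counts = [dense_parameter_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print("Parameters per layer:", counts)
print("Total parameters:", sum(counts))
```

These per-layer counts are what you would expect to find in the Param # column of a Keras model summary for a network with these sizes.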

In [None]:
class QNetwork:

    def __init__(self, env, parameters):
        self.observations_size = env.observation_space.shape[0]
        self.action_size = env.action_space.n
        self.learning_rate = parameters["learning_rate"]
        self.learning_rate_decay = parameters["learning_rate_decay"]
        self.loss_metric = parameters["loss_metric"]
        self.hidden_layer_1 = parameters["hidden_layer_1"]
        self.hidden_layer_2 = parameters["hidden_layer_2"]    
        
    def build_q_network(self):
        
        model = Sequential()
        model.add(Dense(self.hidden_layer_1, input_dim=self.observations_size, activation='relu'))
        
        "Input code below"
        
        "Input code above"
        
        model.compile(loss=self.loss_metric, optimizer=adam_v2.Adam(lr=self.learning_rate, decay=self.learning_rate_decay))
    
        return model
    

In [None]:
# Do not edit - Assertion cell #
if(should_assert):
    mock_parameters = {"loss_metric" : "mse",
                "learning_rate" : 0.01,
                "learning_rate_decay": 0.01,
                "hidden_layer_1": 24,
                "hidden_layer_2": 24}

    mock_env = gym.make("CartPole-v0")
    mock_q_network = QNetwork(mock_env, mock_parameters)
    mock_model = mock_q_network.build_q_network()
    config_network = mock_model.get_config()
    assert(config_network.get("name")[0:10] == "sequential")

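    # Layers (newer Keras versions also list the InputLayer in the config, so it is filtered out)
    config_layers = [layer for layer in config_network.get("layers") if layer.get("class_name") != "InputLayer"]
    assert(len(config_layers) == 3), f"You should have exactly 3 dense layers, you've got {len(config_layers)}"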

    # First layer
    assert(config_layers[0].get("class_name") == "Dense"),"Incorrect layertype in first layer"
    assert(config_layers[0].get("config").get("batch_input_shape") == (None, mock_q_network.observations_size))
    assert(config_layers[0].get("config").get("units") == 24), "Incorrect number of neurons in first layer"
    assert(config_layers[0].get("config").get("activation") == "relu"), \
    "Activation function for first layer should be relu"


    # Second layer
    assert(config_layers[1].get("class_name") == "Dense"),"Incorrect layertype in second layer"
    assert(config_layers[1].get("config").get("units") == 24), "Incorrect number of neurons in second layer"
    assert(config_layers[1].get("config").get("activation") == "relu"), \
    "Activation function for second layer should be relu"

    # Third layer
    assert(config_layers[2].get("class_name") == "Dense"),"Incorrect layertype in third layer"
    assert(config_layers[2].get("config").get("units") == mock_q_network.action_size), \
    "Incorrect number of neurons in third layer"
    assert(config_layers[2].get("config").get("activation") == "linear"),\
    "Activation function for third layer should be linear"

    config_optimizer = mock_model.optimizer.get_config()
    print(config_optimizer)
    assert(0.0099 < config_optimizer.get("learning_rate") <= 0.01),"Learning rate should be 0.01"
    assert(0.0099 < config_optimizer.get("decay") <= 0.01),"Learning rate decay should be 0.01"
    assert(mock_model.loss == "mse"), "Loss metric should be mse"
    
    print("Superb, you implemented the layers correctly! Information about your model is shown below.\n")
    print("PS. The input layer is not shown, although you should be able to calculate it by looking at <Param #>")
    print(" of the first hidden layer.\n")
    mock_model.summary()
    


## Create an Experience Buffer

For a neural network to perform optimally we want the data to be I.I.D (Independent and Identically Distributed). In supervised learning, where for instance the network is fed an image and it will predict either cat or dog, this is achieved by randomly sampling the training data from the full data set. As a result:
- Batches have approximately the same data distribution.
- Samples in each batch are independent of each other.

In Reinforcement learning, where our agent samples data by moving from state to state, recently sampled data points will be highly correlated with each other. As a result, we will feed our network data points that closely resemble each other, which is a recipe for disaster when working with a neural network. Furthermore, the data distribution of the initial states will be different from that of later stages, i.e. the data does not satisfy the I.I.D. criterion. This is bad news!

Luckily, Mnih et al. have a solution. Instead of training on just the most recently collected samples, we store the state-transitions in an **Experience Buffer** and sample randomly from this buffer.
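
The idea can be sketched with the standard library alone. In the example below, integers stand in for state-transitions, and the capacity and batch size are arbitrary values chosen for illustration.

```python
import random
from collections import deque

# A buffer with a fixed capacity; the oldest entries are dropped when it is full
buffer = deque(maxlen=5)
for step in range(8):  # integers stand in for state-transitions
    buffer.append(step)

print(list(buffer))  # only the 5 most recent transitions remain

# Sampling without replacement breaks up the time ordering of the transitions
batch = random.sample(list(buffer), 3)
print(batch)
```

Each training batch then mixes transitions from different points in time instead of consecutive, highly correlated steps.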


## Tasks

This section is divided into multiple tasks. You will first complete task 1, and then run the first assertion cell for task 1. Then you will continue to task 2 and run assertion cell 2 etc.

### Task 1 - Add experience

The first method we are going to create is a method which adds an experience, i.e. a state-transition. This is stored as a tuple which contains the following fields: state, action, reward, done, new_state.

To store a state-transition we are going to use a specific type of tuple, namely a namedtuple. You can read more about it [here](https://docs.python.org/2/library/collections.html#collections.namedtuple). We have created a namedtuple called experience_tuple which you are going to use. The experience_tuple has the following fields where you can store information: state, action, reward, done, new_state. To create a state-transition do the following:
- self.experience_tuple(state=your_state, action=your_action, reward=your_reward, done=your_done, new_state=your_new_state)

To add a namedtuple to the buffer use the append method of the experience_buffer:
- self.experience_buffer.append(your_experience_tuple)
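
If namedtuples are new to you, the short sketch below shows how they behave. The type `Transition` is a hypothetical stand-in with the same fields as experience_tuple, and the values are arbitrary.

```python
from collections import namedtuple

# Hypothetical tuple type for illustration; the workshop code uses self.experience_tuple
Transition = namedtuple("Transition", field_names=["state", "action", "reward", "done", "new_state"])

transition = Transition(state=0, action=1, reward=1.0, done=False, new_state=1)

print(transition.reward)  # access by field name
print(transition[2])      # access by position also works, as for a regular tuple
print(transition)         # the printed form includes the field names
```

Accessing fields by name makes the code that consumes the buffer easier to read than indexing into plain tuples.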


### Task 2 - Get batch

The next method we are going to create is a method which returns a batch of experience_tuples, which is used during training. The method should set the variable *experiences* equal to either
- all the samples in the experience_buffer IF it contains fewer samples than the batch size
- or ELSE *batch_size* randomly sampled experience_tuples from the experience_buffer. To randomly sample use the random.sample() method. You can read more about the method [here](https://docs.python.org/2/library/random.html).

The last part of the method is created by us, and it creates vectors of all elements of each field. This is done to vectorize the training and gives a significant reduction in training time.
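
The stacking step can be illustrated on its own. The sketch below uses three made-up transitions with 4-dimensional states and shows the array shapes that `np.vstack` produces; the namedtuple defined here mirrors experience_tuple but is created only for this example.

```python
import numpy as np
from collections import namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "done", "new_state"])

# Three made-up transitions with 4-dimensional states (the values are arbitrary)
experiences = [
    Experience(np.array([0.0, 0.1, 0.2, 0.3]), 1, 1.0, False, np.array([0.1, 0.2, 0.3, 0.4])),
    Experience(np.array([0.1, 0.2, 0.3, 0.4]), 0, 1.0, False, np.array([0.2, 0.3, 0.4, 0.5])),
    Experience(np.array([0.2, 0.3, 0.4, 0.5]), 1, 1.0, True, np.array([0.3, 0.4, 0.5, 0.6])),
]

# Each field becomes one array with a row per transition
states = np.vstack([e.state for e in experiences])
actions = np.vstack([e.action for e in experiences])
dones = np.vstack([e.done for e in experiences])

print(states.shape, actions.shape, dones.shape)
```

With one row per transition, the network can predict Q-values for the whole batch in a single call instead of looping over the samples.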

### Task 3 - Warm-up

The Experience Buffer is a collection of all the experiences the agent has collected during its interaction with the environment. As a result, the experience buffer is empty at the beginning of the first episode. This is not an issue, as you implemented the necessary logic in the previous task to handle the case when there are fewer experience_tuples in the buffer than the batch size. However, this does slow down training as we train with few and highly correlated samples. 

To counteract this issue, we introduce a method called warm_up. The warm_up method completes a certain number of episodes taking random actions, thereby reducing the correlation between samples in the initial training phase.

We have created the high-level structure of this method, but you will create the code which completes one episode. The method should do the following:
1. Use a while-loop which runs as long as the done variable is not True
2. Choose a random action - Use env.action_space.sample()
3. Use the random action to make a step in the environment - Use env.step(random_action).
    - env.step() returns the following variables, in order: new_state, reward, done, info
    - We will not use the info variable, which is a dictionary with diagnostic information
4. Add the experience to the experience buffer - Use the add_experience method we created earlier
5. Remember to update the current state to the new_state, e.g. state = new_state

Remember to test your method by running the assertion cell for task 3.

In [None]:
class ExperienceReplay:
    
    def __init__(self, env, parameters):
        self.env = env
        self.buffer_size = parameters["buffer_size"]
        self.batch_size = parameters["batch_size"]
        self.experience_buffer = deque(maxlen=self.buffer_size)
        
        self.experience_tuple = namedtuple("Experience", 
                                           field_names=["state", "action", "reward", "done", "new_state"])
    
    def add_experience(self, state, action, reward, done, new_state):
        
        "Input code below"

        "Input code above"
        
    def get_batch(self):
        
        "Input code below"
        
        "Input code above"
        
        states = np.vstack([e.state for e in experiences if e is not None])
        actions = np.vstack([e.action for e in experiences if e is not None])
        rewards = np.vstack([e.reward for e in experiences if e is not None])
        new_states = np.vstack([e.new_state for e in experiences if e is not None])
        dones = np.vstack([e.done for e in experiences if e is not None])
        
        return (states, actions, rewards, dones, new_states)
    
    def warm_up(self):
        for _ in range(10):
            state = self.env.reset()
            done = False
            
            "Input code below"

            "Input code above"

### Assertion - Task 1

In [None]:
# Do not edit - Assertion cell #
if should_assert:
    DummyBuffer = create_dummybuffer()

    for i in range(6):
        DummyBuffer.add_experience(i, i%2, i, i%2, i+1)

    dummy_experience_buffer = DummyBuffer.experience_buffer.copy()
    assert(len(dummy_experience_buffer) == 6),\
    "Length of your experience buffer is wrong, should be 6, was {}".format(len(dummy_experience_buffer))

    for e in range(6):
        dummy_experience = DummyBuffer.experience_tuple(state=e, action=e%2, reward=e, done=e%2, new_state=e+1)
        stored_experience = dummy_experience_buffer.popleft()
        assert(dummy_experience == stored_experience), \
        "The values were incorrectly stored, was {}, should have been {}".format(stored_experience, dummy_experience)

    print("You implemented the add_experience method correctly, great job!")

You implemented the add_experience method correctly, great job!


### Assertion - Task 2

In [None]:
# Do not edit - Assertion cell #
if should_assert:
    DummyBuffer = create_dummybuffer()
    for i in range(2):
        DummyBuffer.add_experience(i, i%2, i, i%2, i+1)

    assert(len(DummyBuffer.get_batch()[0]) == 2),\
    "Should return 2 samples, when there are less experience_tuples than the batch_size. Your method returned {} samples"\
    .format(len(DummyBuffer.get_batch()[0]))

    DummyBuffer.add_experience(3, 3%2, 3, 3%2, 3+1)
    dummy_batch = DummyBuffer.get_batch()
    assert(len(dummy_batch[0]) == 3),\
    "Should return 3 samples, when there are 3 experience_tuples in the experience_buffer. Your method returned {} samples"\
    .format(len(DummyBuffer.get_batch()[0]))


    DummyBuffer.add_experience(4, 4%2, 4, 4%2, 4+1)
    dummy_batch = DummyBuffer.get_batch()
    assert(len(dummy_batch[0]) == 3),\
    "Should return 3 samples, when there are more experience_tuples than the batch_size. Your method returned {} samples"\
    .format(len(DummyBuffer.get_batch()[0]))

    assert(len(np.unique(dummy_batch[0][0:3])) == 3), "The method did not return the values correctly, states should be all unique values. You returned {}".format(dummy_batch[0])
    assert(len(np.unique(dummy_batch[-1][0:3])) == 3), "The method did not return the values correctly, new_states should be all unique values. You returned {}".format(dummy_batch[-1])

    print("Great, you implemented the get_batch() method correctly")

Great, you implemented the get_batch() method correctly


### Assertion Task 3

In [None]:
# Do not edit - Assertion cell #
if should_assert:
    parameters = {"buffer_size": 1000,
                  "batch_size": 1}
    mock_env1 = gym.make("CartPole-v1")
    mock_experience_replay = ExperienceReplay(mock_env1, parameters)
    mock_env1.reset()
    mock_env2 = deepcopy(mock_env1)

    mock_experience_replay.warm_up()
    mock_state = mock_env2.reset()

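    while 0 < len(mock_experience_replay.experience_buffer):
        # Pop the oldest experience so that the loop ends once the buffer is empty,
        # and replay its action in the copied environment
        mock_experience = mock_experience_replay.experience_buffer.popleft()
        mock_new_state, mock_reward, mock_done, _ = mock_env2.step(mock_experience.action)

        
        if mock_done:
            mock_state = mock_env2.reset()
        else:
            mock_state = mock_new_state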
            
    print("Your warm-up method seems to function properly, good job!")



## Create Agent

### Target network

In the supervised learning example, where we predict cat or dog, the labels are stationary, e.g. an image of a cat will always have the label cat. This is not the case in Reinforcement learning.

As we experienced in the previous notebook, the Q-value of the next state is calculated based on the assumption that the Q-table is correct. However, since we begin each training session with a blank table and update the values as we go along, the Q-values of the next state are just estimates. As a result, we are chasing a moving target, which degrades the training of the neural network and may lead to divergence.

Mnih et al. solved this by using a target network to choose the best next action and predict the Q-values of the next state $S_{t+1}$, and a local network to predict the Q-values of state $S_{t}$. The two networks have an identical structure. To update the target network, the equation below is used, where $\theta$ represents the weights of each network.

$$
\theta_{target} = \tau \cdot \theta_{local} + (1-\tau) \cdot \theta_{target}
$$
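
To see what this update does numerically, the sketch below applies it to a single scalar weight. The local weight is held fixed purely for illustration, and `soft_update` is a hypothetical helper rather than part of the Agent class.

```python
def soft_update(local_weight, target_weight, tau):
    # theta_target = tau * theta_local + (1 - tau) * theta_target
    return tau * local_weight + (1 - tau) * target_weight


tau = 0.05            # same value as "tau" in the CartPole hyperparameters
local_weight = 1.0    # held fixed here purely for illustration
target_weight = 0.0

for update in range(1, 4):
    target_weight = soft_update(local_weight, target_weight, tau)
    print(update, round(target_weight, 6))
```

With a small tau, the target weight moves only a fraction of the way towards the local weight at every update, so the target network changes slowly and the targets used in training stay relatively stable.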

### Task 1 - Target network
Implement the target network update in the "update_target_network" method below. We have already created the shell, where we loop through each layer of the network. Your task:
- Update the weights of each layer by using the equation above. Remember to use the self.tau parameter.
- Remember to run the assertion cell


### Task 2 - Q-learning algorithm
Go through the Q-learning algorithm, and compare it to the Q-learning algorithm we used in the previous notebook. Make sure you understand the implementation before moving on.
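
To make the target calculation concrete, the sketch below evaluates it for a batch of two transitions. The rewards, done flags, and next-state Q-values are made-up numbers; only the formula mirrors the update_local_network method.

```python
import numpy as np

gamma = 0.99
rewards = np.array([[1.0], [1.0]])
dones = np.array([[0], [1]])        # the second transition ended the episode
Q_next = np.array([[0.5, 2.0],      # made-up target-network outputs, one row per transition
                   [3.0, 1.0]])

# Best Q-value of the next state, kept as a column so it lines up with the rewards
max_next = np.amax(Q_next, axis=1).reshape(-1, 1)

# The (1 - dones) factor removes the future value for terminal transitions
Q_calc = rewards + gamma * max_next * (1 - dones)
print(Q_calc)
```

For the terminal transition the target is just the reward, since there is no next state to collect future rewards from.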


In [None]:
class Agent:    

    def __init__(self, env, parameters):
        self.env = env
        self.q_network = QNetwork(env, parameters)
        self.local_network = self.q_network.build_q_network()
        self.target_network = self.q_network.build_q_network()
        self.experience_replay = ExperienceReplay(env, parameters)
        
        self.epsilon = parameters["epsilon_init"]
        self.epsilon_decay = parameters["epsilon_decay"]
        self.epsilon_minimum = parameters["epsilon_minimum"]
        self.tau = parameters["tau"]
        self.gamma = parameters["gamma"]
        self.epochs = parameters["epochs"]

    def update_local_network(self):
        states, actions, rewards, dones, next_states = self.experience_replay.get_batch()
        
        # Get Q-values for the next state, Q(next_state), using the target network
        Q_target = self.target_network.predict(next_states)

        # Apply the Q-learning update, using the Q-values of the next state, to calculate the target Q-value for Q(state)
        Q_calc = rewards + (self.gamma * np.amax(Q_target, axis=1).reshape(-1, 1) * (1 - dones))
        
        # Calculate Q-value Q(state) we predicted earlier using the local network
        Q_local = self.local_network.predict(states)
        
        # Update Q_values with "correct" Q-values calculated using the Q-learning algorithm      
        for row, col_id in enumerate(actions):
            # np.asscalar has been removed from NumPy; .item() returns the same Python scalar
            Q_local[row, col_id.item()] = Q_calc[row]
        
        # Train network by minimizing the difference between Q_local and modified Q_local
        self.local_network.fit(states, Q_local, epochs=self.epochs, verbose=0)
    
    def update_target_network(self):
        local_weights = self.local_network.get_weights()
        target_weights = self.target_network.get_weights()

        for layer in range(len(local_weights)):
            
            "Input code below"
        
            "Input code above"
        
        self.target_network.set_weights(target_weights)
    
    def update_epsilon(self):
        if self.epsilon >= self.epsilon_minimum:
            self.epsilon *= self.epsilon_decay
        
    def select_action(self, state):
    
        if self.epsilon > np.random.uniform():
            action = self.env.action_space.sample()
        else:
            action = np.argmax(self.local_network.predict(np.array([state])))

        return action
    
    def step(self, state):
        
        action  = self.select_action(state)
        new_state, reward, done, _ = self.env.step(action)

        return action, reward, done, new_state
    
    def save(self):
        save_dir = os.path.join(os.getcwd(), 
                                self.env.spec.id +"_"+ datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))
        os.makedirs(save_dir)
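        # Save both networks inside the newly created directory
        self.target_network.save(os.path.join(save_dir, "target_network.h5"))
        self.local_network.save(os.path.join(save_dir, "local_network.h5"))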
        print("Weights saved successfully")
        
    def load(self):
        uri = "freeze_weights/" + self.env.spec.id + "/"
        self.target_network.load_weights(uri + "target_network.h5")
        self.local_network.load_weights(uri + "local_network.h5")
        print("Weights loaded successfully for environment: {}".format(self.env.spec.id))
        
    def freeze_network(self, freeze_layers):
        # Freezes first freeze_layers of the network and resets the other if freeze_layers is not zero.
        network_size = len(self.local_network.layers)
        assert(freeze_layers <= network_size),\
        "Tried to freeze more layers than there are layers in local network!"
        
        self.load()
        
        if freeze_layers == 0:
            print("No layers frozen, using loaded weights")
        elif freeze_layers == network_size:
            for layer in range(network_size):
                self.local_network.layers[layer].trainable = False 
            print("Frozen all layers, and using loaded weights")
        else:             
            for layer in range(network_size):
                if layer < freeze_layers:
                    self.local_network.layers[layer].trainable = False 
                else:
                    new_weights = generate_blank_weights(self.local_network.layers[layer].get_weights()[0].shape)
                    self.local_network.layers[layer].set_weights(new_weights)
                    self.target_network.layers[layer].set_weights(new_weights)    
            
            self.local_network.compile(loss=self.q_network.loss_metric, 
                          optimizer=adam_v2.Adam(lr=self.q_network.learning_rate, 
                                         decay=self.q_network.learning_rate_decay))
            print("The network's first {} layers were successfully frozen".format(freeze_layers))
            

In [None]:
# Do not edit - Assertion cell #
if should_assert:
    dummy_parameters, dummy_env = get_dummy_parameters_and_env()
    dummy_dqn = Agent(dummy_env, dummy_parameters)

    dummy_local_weights = deepcopy(dummy_dqn.local_network.get_weights())
    dummy_target_weights = deepcopy(dummy_dqn.target_network.get_weights())

    dummy_dqn.update_target_network()
    for i in range(len(dummy_target_weights)):
        np.testing.assert_array_equal(
            dummy_dqn.target_network.get_weights()[i], 
            dummy_parameters.get("tau") * dummy_local_weights[i] + (1 - dummy_parameters.get("tau")) * dummy_target_weights[i],
            err_msg="\nThe target network was not implemented correctly. Layer {} was incorrect\n".format(i))

    print("Great job, you implemented the update_target_network-method correctly!")


Great job, you implemented the update_target_network-method correctly!


## Train

The training method is more or less identical to the method we used in the previous notebook, with some modifications due to the introduction of a neural network to represent the Q-table.


### Task
Make sure you understand the method before moving on!
- Do you understand why the episodic reward is a better metric of the neural network's performance than the loss-metric when doing Reinforcement learning?


In [None]:
def train(agent, iterations, episodes):
    
    total_reward = 0
    total_reward_list, iterations_list = [], []
    agent.experience_replay.warm_up()
    
    for episode in range(episodes):
        
        state = agent.env.reset()
        total_reward=0
        
        if (episode != 0): 
            agent.update_epsilon()
    
        for iteration in range(iterations):
            
            action, reward, done, new_state = agent.step(state)
            agent.experience_replay.add_experience(state, action, reward, done, new_state)
            
            state = new_state
            
            agent.update_local_network()
            agent.update_target_network()
            total_reward += reward
            
            if done: 
                break
        
        total_reward_list.append(total_reward)
        iterations_list.append(iteration+1)
        
        if episode % 5 == 0 and episode != 0:
            print("Episode: {0:d}-{1:d} | Avg. iterations: {2:0.2f}  | Max total reward: {3:0.2f} | Avg. total reward: {4:0.2f} | Epsilon: {5:0.4f}" \
                  .format(episode-10, episode, mean(iterations_list), max(total_reward_list), mean(total_reward_list), agent.epsilon))
            total_reward_list.clear()
            iterations_list.clear()
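
The training loop stores every transition through `agent.experience_replay.add_experience(...)`. The real class is defined elsewhere in the notebook; the following is only a rough sketch of what such a buffer could look like, assuming a fixed-size `deque` and uniform random sampling. The class name `SketchReplayBuffer` and the method `sample_batch` are hypothetical.

```python
import random
from collections import deque


class SketchReplayBuffer:
    """Hypothetical stand-in for the notebook's experience replay class."""

    def __init__(self, buffer_size, batch_size):
        # A deque with maxlen silently drops the oldest entry once it is full
        self.buffer = deque(maxlen=buffer_size)
        self.batch_size = batch_size

    def add_experience(self, state, action, reward, done, new_state):
        self.buffer.append((state, action, reward, done, new_state))

    def sample_batch(self):
        # Uniform sampling without replacement mixes old and new transitions
        return random.sample(list(self.buffer), self.batch_size)


# Toy usage: integers stand in for states
buffer = SketchReplayBuffer(buffer_size=3, batch_size=2)
for step in range(5):
    buffer.add_experience(step, 0, 1.0, False, step + 1)

batch = buffer.sample_batch()
print(len(buffer.buffer), len(batch))
```

Sampling at random from stored transitions, rather than training on consecutive steps, reduces the correlation between the samples in a batch.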

## Hyperparameters

The hyperparameters are the customizable parameters which are not optimized by the agent, but are set by us.

The number of parameters passed to the agent significantly increases when we begin using neural networks. To simplify the process, we define them in a dictionary and simply pass the dictionary to the agent. We have chosen the hyperparameters for some environments, and they should not be changed the first time you go through the notebook.

PS. You may change them as you please when you have passed through the notebook once.

In [None]:
def get_hyperparameters(env):
    
    env_id = env.spec.id
        
    if env_id == "CartPole-v1":
        print("Hyperparameters for {} chosen!".format("CartPole-v1"))
        parameters = {
            "tau" : 0.05,
            "gamma" : 0.99,
            "epsilon_init" : 1,
            "epsilon_decay" : 0.95,
            "epsilon_minimum": 0.01,
            "buffer_size" : 2000,
            "batch_size" : 64,
            "epochs": 1,
            "loss_metric" : "mse",
            "learning_rate" : 0.01,
            "learning_rate_decay": 0.01,
            "hidden_layer_1": 24,
            "hidden_layer_2": 24}
        
    else:
        print("Standard hyperparameters chosen!")
        parameters = {
            "tau" : 0.05,
            "gamma" : 0.95,
            "epsilon_init" : 1,
            "epsilon_decay" : 0.97,
            "epsilon_minimum": 0.01,
            "buffer_size" : 10000,
            "batch_size" : 32,
            "epochs": 1,
            "loss_metric" : "mse",
            "learning_rate" : 0.01,
            "learning_rate_decay": 0.01,
            "hidden_layer_1": 24,
            "hidden_layer_2": 24}
    
    return parameters
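
To get a feel for how `epsilon_init`, `epsilon_decay` and `epsilon_minimum` interact, here is a small sketch. It assumes that `update_epsilon()` multiplies epsilon by the decay factor once per episode and never lets it drop below the minimum; that is an assumption about the `Agent` class, and the function name `epsilon_schedule` is hypothetical.

```python
def epsilon_schedule(epsilon_init, epsilon_decay, epsilon_minimum, episodes):
    """Epsilon used in each episode under multiplicative decay with a floor."""
    epsilon = epsilon_init
    schedule = [epsilon]
    for _ in range(episodes - 1):
        # Assumed update rule: decay, but never below the minimum
        epsilon = max(epsilon_minimum, epsilon * epsilon_decay)
        schedule.append(epsilon)
    return schedule


# Values taken from the CartPole-v1 hyperparameters above
schedule = epsilon_schedule(1, 0.95, 0.01, episodes=100)

# Index of the first episode that runs with the minimum epsilon
first_at_minimum = schedule.index(0.01)
print(first_at_minimum)
```

Under this assumed rule, a decay of 0.95 takes roughly 90 episodes to reach the floor of 0.01, so a larger decay factor means a longer exploration phase.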

## Environment
For this walkthrough we are going to use an environment called CartPole-v1.

<img src="https://github.com/acntech/reinforcement-learning-workshop/blob/master/Workshop/Images/CartPole-v1.gif?raw=1" alt="drawing" width="400" height="200"/>

The pendulum starts upright, and the goal is to prevent it from falling over by applying a horizontal force of either +1 or -1 to the cart. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 12 degrees from vertical, the cart moves more than 2.4 units from the center, or the agent manages to stay "alive" for 500 iterations.
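
Because every timestep yields a reward of +1, the total reward of an episode equals the number of timesteps the agent survived, capped at 500. The sketch below works out that sum, and also the discounted version using `gamma = 0.99` from the CartPole-v1 hyperparameters. The function name `discounted_return` is hypothetical.

```python
def discounted_return(steps, gamma):
    """Sum of gamma**t for t = 0 .. steps - 1, i.e. a +1 reward per step."""
    return sum(gamma ** t for t in range(steps))


# Undiscounted: one point per timestep survived
best_total_reward = discounted_return(500, 1.0)

# Discounted with gamma = 0.99: later rewards count for less
best_discounted = discounted_return(500, 0.99)

print(best_total_reward, round(best_discounted, 2))
```

The discounted sum stays just below 100 even for a perfect 500-step episode, which gives a rough sense of scale for the Q-values the network is asked to predict.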

### Task
- Create an agent by running the cell below

In [None]:
environment = "CartPole-v1"
env = gym.make(environment)
parameters = get_hyperparameters(env)

dqn_agent = Agent(env, parameters)

Hyperparameters for CartPole-v1 chosen!


## Play
To run a single episode using our agent we use a method we created for you called play(), which takes an Agent-object as input. The logic in the play method is identical to the method we used in the previous notebook.

### Task
- Test the performance of our agent before any training

In [None]:
play(dqn_agent)

Total Reward: 9.0


## Visualize performance
We can also visualize the performance of the agent by using another method we created for you, called generate_video(), which takes an Agent-object as input. 

Since the workshop is hosted on a headless server, i.e. an EC2 instance on AWS, we need a workaround to visualize the performance. The generate_video() method will therefore first generate an .mp4-file, which we will then play using show_video().

### Task
- Visualize the performance of the agent by running the cell below

In [None]:
generate_video(dqn_agent)
show_video()

Total Reward: 8.0


## Train Agent

We will now train our agent for 100 episodes, with at most 500 iterations per episode. This will take some time, so be patient.

In [None]:
episodes = 100
iterations = 500

train(dqn_agent, iterations, episodes)

Episode: -5-5 | Avg. iterations: 64.33  | Max total reward: 80.00 | Avg. total reward: 64.33 | Epsilon: 0.1748
Episode: 0-10 | Avg. iterations: 65.20  | Max total reward: 97.00 | Avg. total reward: 65.20 | Epsilon: 0.1353
Episode: 5-15 | Avg. iterations: 67.40  | Max total reward: 75.00 | Avg. total reward: 67.40 | Epsilon: 0.1047
Episode: 10-20 | Avg. iterations: 90.20  | Max total reward: 172.00 | Avg. total reward: 90.20 | Epsilon: 0.0810
Episode: 15-25 | Avg. iterations: 81.80  | Max total reward: 100.00 | Avg. total reward: 81.80 | Epsilon: 0.0627
Episode: 20-30 | Avg. iterations: 96.40  | Max total reward: 125.00 | Avg. total reward: 96.40 | Epsilon: 0.0485
Episode: 25-35 | Avg. iterations: 86.20  | Max total reward: 110.00 | Avg. total reward: 86.20 | Epsilon: 0.0375
Episode: 30-40 | Avg. iterations: 104.00  | Max total reward: 127.00 | Avg. total reward: 104.00 | Epsilon: 0.0290
Episode: 35-45 | Avg. iterations: 109.80  | Max total reward: 180.00 | Avg. total reward: 109.80 | E

## Visualize performance
Now let's visualize the performance of our trained agent!

In [None]:
generate_video(dqn_agent)
show_video()

Total Reward: 47.0


## And we are through!

You have implemented the DQN-algorithm, congratulations!

Try optimizing the hyperparameters to improve the training results. Suggestions:
- Number of layers
- Number of neurons in each layer
- Learning rate
- Epsilon strategy

If you would like to try a different environment, try the "Acrobot-v1" environment.

Remember to turn off the assertions by running the cell below!

In [None]:
should_assert = False