# Deep Reinforcement Learning
## Deep Q-Learning / Deep Q-Network (DQN)
(by: [Nicolaj Stache](mailto:Nicolaj.Stache@hs-heilbronn.de), and [Pascal Graf](mailto:pascal.graf@hs-heilbronn.de), both: Heilbronn University, Germany, June 2022) 

In this notebook, we solve reinforcement learning problems using DQN with networks built in TensorFlow.

The learning environments shown in this notebook have been created by **[OpenAI](https://openai.com/)**. OpenAI provides a library of different reinforcement learning environments called **[Gym](https://www.gymlibrary.ml/)**.

<hr>

## Table of Contents:
### 1. [Imports](#imports)

### 2. [Environment](#environment)

### 3. [Training](#training)

### 4. [Evaluation](#evaluation)
<hr>

## 1. Imports <a class="anchor" id="imports"></a>

### Install OpenAI Gym
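For this notebook to run, open a new terminal, activate your conda environment, and install OpenAI Gym by typing `pip install -U gym`. You might also need to install PyGame, which is used to render the environments, by typing `pip install pygame`. Note that the code in this notebook relies on the Gym API in which `env.reset()` returns only the observation and `env.step()` returns four values; Gym 0.26 and later changed both signatures.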

In [6]:
import gym
import numpy as np
import os
from collections import namedtuple, deque
from datetime import datetime
from tqdm import trange
from tensorflow.keras.layers import Dense, Input, Lambda, Subtract, Add
# from tensorflow_addons.layers import NoisyDense # --> first pip install tensorflow_addons
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import Model
from tensorflow import keras
import tensorflow as tf

## 2. Environment <a class="anchor" id="environment"></a>
The environment we're going to solve is called "Cart Pole v1". 

<hr>

**CartPole-v1**

*A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.*
<hr>


First however, let's see how many environments are available to us. After finishing the Cart Pole environment feel free to try any other environment (*Acrobot-v1* or *LunarLander-v2* are recommended). Some environments might require additional packages like *mujoco* or *box2d* which can be installed using pip.

In [7]:
for key, val in gym.envs.registry.items():
    print(key)

CartPole-v0
CartPole-v1
MountainCar-v0
MountainCarContinuous-v0
Pendulum-v1
Acrobot-v1
LunarLander-v2
LunarLanderContinuous-v2
BipedalWalker-v3
BipedalWalkerHardcore-v3
CarRacing-v1
CarRacingDomainRandomize-v1
CarRacingDiscrete-v1
CarRacingDomainRandomizeDiscrete-v1
Blackjack-v1
FrozenLake-v1
FrozenLake8x8-v1
CliffWalking-v0
Taxi-v3
Reacher-v2
Reacher-v4
Pusher-v2
Pusher-v4
InvertedPendulum-v2
InvertedPendulum-v4
InvertedDoublePendulum-v2
InvertedDoublePendulum-v4
HalfCheetah-v2
HalfCheetah-v3
HalfCheetah-v4
Hopper-v2
Hopper-v3
Hopper-v4
Swimmer-v2
Swimmer-v3
Swimmer-v4
Walker2d-v2
Walker2d-v3
Walker2d-v4
Ant-v2
Ant-v3
Ant-v4
Humanoid-v2
Humanoid-v3
Humanoid-v4
HumanoidStandup-v2
HumanoidStandup-v4


### Open the Environment and start interacting

Now take a look at today's environment along with its so-called action space and state/observation space. The meaning of each value in the observation space can be looked up in the documentation.

In [8]:
# Load the OpenAI Gym Environment
env = gym.make('CartPole-v1')


# Get action space and observation space sizes
print('''
----------------------------------------------
Action Space: {}
Observation Space: {}
----------------------------------------------
'''.format(env.action_space, env.observation_space.shape))

# Get a sample observation and action
print('''
----------------------------------------------
Sample Action: {}
Sample Observation: 
     Cart Position: {} 
     Cart Velocity: {} 
     Pole Angle: {} 
     Pole Angular Velocity: {} 
----------------------------------------------
'''.format(env.action_space.sample(), *env.reset()))


----------------------------------------------
Action Space: Discrete(2)
Observation Space: (4,)
----------------------------------------------


----------------------------------------------
Sample Action: 1
Sample Observation: 
     Cart Position: 0.03574042767286301 
     Cart Velocity: -0.04664144292473793 
     Pole Angle: 0.02296554483473301 
     Pole Angular Velocity: -0.03243443742394447 
----------------------------------------------



The environment can also be rendered. Here, random actions are executed which will lead to a fast termination of the episode.

In [9]:
env = gym.make('CartPole-v1')
# Play some test episodes
for i_episode in range(20):
    observation = env.reset()
    episode_reward = 0
    for t in range(100):
        env.render()
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        episode_reward += reward
        if done:
            print("Episode finished after {} timesteps with a reward of {}".format(t+1, episode_reward))
            break
env.close()

Episode finished after 11 timesteps with a reward of 11.0
Episode finished after 19 timesteps with a reward of 19.0
Episode finished after 21 timesteps with a reward of 21.0
Episode finished after 38 timesteps with a reward of 38.0
Episode finished after 36 timesteps with a reward of 36.0
Episode finished after 13 timesteps with a reward of 13.0
Episode finished after 16 timesteps with a reward of 16.0
Episode finished after 20 timesteps with a reward of 20.0
Episode finished after 9 timesteps with a reward of 9.0
Episode finished after 30 timesteps with a reward of 30.0
Episode finished after 25 timesteps with a reward of 25.0
Episode finished after 15 timesteps with a reward of 15.0
Episode finished after 12 timesteps with a reward of 12.0
Episode finished after 14 timesteps with a reward of 14.0
Episode finished after 36 timesteps with a reward of 36.0
Episode finished after 23 timesteps with a reward of 23.0
Episode finished after 15 timesteps with a reward of 15.0
Episode finished

## 3. Training <a class="anchor" id="training"></a>

In this part of the notebook a simple Deep Q-Learning algorithm is implemented and trained on the previously explored Gym environment. 

### Deep Q-Learning / Deep Q-Network (DQN) Pseudocode
<hr>

1. Initialize:
    - Network $Q(s,a)$ and target network $\hat{Q}(s,a)$ with random weights.
    - $\epsilon \leftarrow 1.0$
    - Empty Replay Buffer
2. With probability $\epsilon$, select a random action $a$, otherwise, $a=\arg\max_a Q(s,a)$.
3. Execute action $a$ and observe the reward $r$ and the next state $s'$.
4. Store transition $(s,a,r,s')$ in the replay buffer.
5. Sample a random mini-batch of transitions from the replay buffer.
6. For every transition in the batch calculate the target $t=r$ if the episode has ended at this step, or $t=r+\gamma \max_{a'}\hat{Q}(s',a')$ otherwise.
7. Train the network $Q$ utilizing SGD with the mean squared error.
8. Every $N$ steps, copy weights from $Q$ to $\hat{Q}$.
9. Repeat from step 2 until converged.

<hr>
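
Step 6 of the pseudocode can be illustrated with a small NumPy sketch. The mini-batch below is made up for illustration, and `next_q` is only a stand-in for the target network output $\hat{Q}(s',a')$; it is not part of the notebook's agent.

```python
import numpy as np

GAMMA = 0.99  # discount factor, chosen here only for illustration

# Hypothetical mini-batch of three transitions in a two-action environment.
rewards = np.array([1.0, 1.0, 1.0])
dones = np.array([False, True, False])

# Stand-in for the target network output Q_hat(s', a'), shape (batch, actions).
next_q = np.array([[0.5, 2.0],
                   [3.0, 1.0],
                   [4.0, 4.5]])

# t = r for terminal transitions, otherwise t = r + gamma * max_a' Q_hat(s', a').
targets = np.where(dones, rewards, rewards + GAMMA * next_q.max(axis=1))
print(targets)  # approximately [2.98, 1.0, 5.455]
```

The second transition is terminal, so its target is the plain reward and the value estimate of the next state is ignored.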

### DQN Components

To implement DQN we need several components, each of which is implemented independently in an object-oriented way.

- **DQN Agent:** Class containing the neural networks, chooses actions given a state, manages the learning process given a batch from the replay buffer.
- **Replay Buffer:** Stores transitions $(s,a,r,s')$ and yields batches for training.
- **Tensorboard Logger:** Tracks the training process by logging different parameters at each training step.
- **Training Loop:** Just a function performing the interaction between the previously listed classes.

### > DQN Agent

**TODO:** 
- Build a dense network.
- Compile the built model.
- Implement the DQN update function.
- Implement an Epsilon-Greedy Policy.
- **After successful training:** 
Implement some of the discussed DQN improvements.

In [18]:
class DQNAgent:
    """
    DQN Agent Class
    A Deep Q-Learning Agent capable of acting in a discrete action space Gym environment utilizing
    an epsilon-greedy policy and a dense neural network.            
    """
    def __init__(self,
                 observation_shape: tuple,
                 action_shape: int,
                 sync_steps: int = 1000,
                 gamma: float = 0.95,
                 epsilon: float = 1.0,
                 epsilon_decay: float = 0.995,
                 epsilon_min: float = 0.01,
                 learning_rate: float = 1e-3,
                 double: bool = False,
                 dueling: bool = False,
                 noisy: bool = False,
                 units: int = 32,
                 batch_size: int = 32
                 ):
        # Store the given parameters as class variables
        self.observation_shape = observation_shape
        self.action_shape = action_shape
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        self.learning_rate = learning_rate
        self.sync_steps = sync_steps
        self.n_step = 0
        self.double = double
        self.dueling = dueling
        self.noisy = noisy
        self.units = units
        self.batch_size = batch_size
        
        # Build the network model
        self.model= self._build_model(observation_shape, action_shape)
        
        # Create a target model by cloning the existing network
        self.target_model = keras.models.clone_model(self.model)
        self.target_model.set_weights(self.model.get_weights())
        
        # TODO: Compile the model using the Adam optimizer and the MSE loss with the given learning rate.
        self.model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))

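    def _build_model(self,
                     observation_shape: tuple,
                     action_shape: int) -> keras.Model:
        # TODO: 
            # Build a Dense Neural Network with two hidden layers.
            # Each layer should have "self.units" neurons and "relu" activation function.
            # The input size corresponds to the observation shape.
            # The output size corresponds to the action shape.
        # One possible solution (plain dense network; self.dueling and self.noisy are not used here):
        inputs = Input(shape=observation_shape)
        hidden = Dense(self.units, activation="relu")(inputs)
        hidden = Dense(self.units, activation="relu")(hidden)
        # Linear output layer with one Q-value per action.
        outputs = Dense(action_shape, activation="linear")(hidden)
        model = Model(inputs=inputs, outputs=outputs)
        # TODO: 
            # After you've successfully trained the vanilla DQN algorithm (or your waiting time is too long)
            # try to replace the normal Dense layers in your network with Noisy Dense layers. This will replace the 
            # epsilon-greedy action-selection method.
        return model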

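    def act(self,
            state: np.ndarray) -> int:
        ## Epsilon-Greedy Policy ##
        
        # Random action
        if np.random.random() <= self.epsilon and not self.noisy:
            # TODO: return a random integer between 0 and "self.action_shape".
            # One possible solution: sample uniformly from {0, ..., action_shape - 1}.
            return np.random.randint(self.action_shape)
        
        # Greedy action
        # TODO: Given the current state, return the greedy action.
        # One possible solution: add a batch dimension and take the argmax over the predicted Q-values.
        q_values = self.model(np.expand_dims(state, axis=0)).numpy()
        action = np.argmax(q_values, axis=1)
        return action[0]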

    def learn(self,
              replay_batch: dict) -> float:        
        state_batch = replay_batch["state_batch"]
        action_batch = replay_batch["action_batch"]
        reward_batch = replay_batch["reward_batch"]
        next_state_batch = replay_batch["next_state_batch"]
        done_batch = replay_batch["done_batch"]
        
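        # TODO: Set the target to the immediate reward
        target_batch = reward_batch
        # TODO:
            # If the state is not terminal, calculate the Q-Value target utilizing the next state and the current reward.
            # t = r + gamma * max_a' Q_hat(s', a') (see pseudo code above)
        # One possible solution (vanilla DQN): bootstrap from the target network for non-terminal states only.
        next_q_batch = self.target_model(next_state_batch).numpy()
        target_batch = np.where(done_batch,
                                reward_batch,
                                reward_batch + self.gamma * np.max(next_q_batch, axis=1))
        # TODO:
            # After you've successfully trained the vanilla DQN algorithm (or your waiting time is too long)
            # try to deploy some of the improvement techniques that have been proposed.
            # Search for "Double DQN" and try to implement the simple changes in the training process.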
            
        # Set the Q value of the chosen action to the target.
        q_batch = self.model(state_batch).numpy()
        q_batch[np.arange(self.batch_size), action_batch.astype(int)] = target_batch

        # Train the network on the training batch.
        value_loss = self.model.train_on_batch(state_batch, q_batch)
        
        # Decrease Epsilon to approach a more deterministic policy.
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
        
        # Every n steps synchronise the target model with the q-model.
        self.n_step += 1
        if self.n_step >= self.sync_steps:
            self.sync_models()

        return value_loss

    def sync_models(self):
        # Synchronise the target model with the q-model.
        self.target_model.set_weights(self.model.get_weights())
        self.n_step = 0
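
The TODO in `learn` points to Double DQN as a possible improvement. The sketch below contrasts the two target calculations on made-up numbers. The arrays `online_next_q` and `target_next_q` are hypothetical stand-ins for the outputs of the online network and the target network on the next states, so this is an illustration of the idea rather than the notebook's implementation.

```python
import numpy as np

GAMMA = 0.99  # discount factor, chosen here only for illustration

# Hypothetical batch of two transitions in a two-action environment.
rewards = np.array([1.0, 1.0])
dones = np.array([False, True])

# Stand-ins for network outputs on the next states, shape (batch, actions).
online_next_q = np.array([[1.0, 3.0], [2.0, 0.5]])  # online network Q
target_next_q = np.array([[5.0, 2.0], [1.0, 4.0]])  # target network Q_hat

# Vanilla DQN: the target network both selects and evaluates the next action.
vanilla_targets = np.where(dones, rewards,
                           rewards + GAMMA * target_next_q.max(axis=1))

# Double DQN: the online network selects the action,
# the target network evaluates the selected action.
best_actions = online_next_q.argmax(axis=1)
evaluated_q = target_next_q[np.arange(len(best_actions)), best_actions]
double_targets = np.where(dones, rewards, rewards + GAMMA * evaluated_q)

print(vanilla_targets)  # approximately [5.95, 1.0]
print(double_targets)   # approximately [2.98, 1.0]
```

Decoupling action selection from action evaluation is meant to reduce the overestimation that the plain maximum introduces, which is why the Double DQN target is never larger than the vanilla one in this example.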

### > Replay Buffer

The replay buffer is filled with experiences from the agent's interaction with the environment. On demand it returns a random batch of samples from the buffer.

**TODO:** Read and try to understand the following code.

In [19]:
class ReplayBuffer:
    def __init__(self,
                 capacity: int):
        self.buffer = deque(maxlen=capacity)
        self.collected_samples = 0

    def __len__(self):
        return len(self.buffer)
      
    def append(self, s, a, r, next_s, done):
        self.buffer.append({"state": s, "action":a, "reward":r, "next_state":next_s, "done":done})
        self.collected_samples += 1

    def sample(self,
               batch_size: int) -> dict:
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)
        replay_batch = [self.buffer[idx] for idx in indices]
        replay_batch = self.batch_to_arrays(replay_batch)
        self.collected_samples = 0
        return replay_batch
    
    def batch_to_arrays(self, replay_batch: list) -> dict:
        state_batch = np.zeros((len(replay_batch), *replay_batch[0]["state"].shape))
        action_batch = np.zeros(len(replay_batch))
        reward_batch = np.zeros(len(replay_batch))
        next_state_batch = np.zeros((len(replay_batch), *replay_batch[0]["state"].shape))
        done_batch = np.zeros(len(replay_batch), dtype=bool)
        
        for idx, sample in enumerate(replay_batch):
            state_batch[idx] = sample["state"]
            action_batch[idx] = sample["action"]
            reward_batch[idx] = sample["reward"]
            next_state_batch[idx] = sample["next_state"]
            done_batch[idx] = sample["done"]
            
        return {"state_batch": state_batch, "action_batch":action_batch, 
                "reward_batch": reward_batch, "next_state_batch": next_state_batch, 
                "done_batch": done_batch}
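
Two properties of the buffer above are worth seeing in isolation: a `deque` with `maxlen` silently drops the oldest entry once it is full, and a batch is drawn without replacement. The following is a minimal standalone sketch with a tiny capacity and made-up transitions; it uses `random.sample` from the standard library instead of `np.random.choice` to stay dependency-free.

```python
import random
from collections import deque

# Tiny capacity, for illustration only.
buffer = deque(maxlen=3)

# Append five hypothetical transitions; the state is just the step index here.
for step in range(5):
    buffer.append({"state": step, "action": 0, "reward": 1.0,
                   "next_state": step + 1, "done": False})

# Only the three most recent transitions survive.
stored_states = [transition["state"] for transition in buffer]
print(stored_states)  # [2, 3, 4]

# Draw a batch of two distinct transitions uniformly at random.
batch = random.sample(list(buffer), k=2)
print([transition["state"] for transition in batch])
```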

### > Tensorboard Logger
The TensorBoard logger uses TensorFlow's summary writer to log scalar values to an event file. To plot this event file, start a TensorBoard server by typing `tensorboard --logdir summaries` in a console and open the displayed address in your web browser.

**TODO:** Read and try to understand the following code.

In [20]:
class Logger:
    def __init__(self, log_dir="./summaries"):
        self.log_dir = log_dir
        self.writer = tf.summary.create_file_writer(log_dir)
        self.running_avg_dict = {}

    def log_scalar(self, tag, value, step):
        with self.writer.as_default():
            tf.summary.scalar(tag, value, step)

    def log_running_average(self, tag, value, run_avg_len=20):
        if tag not in self.running_avg_dict:
            self.running_avg_dict[tag] = deque(maxlen=run_avg_len)
        self.running_avg_dict[tag].append(value)

    def get_running_average(self, tag):
        return np.mean(self.running_avg_dict[tag])
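
The running average kept by the logger is a mean over a sliding window of the most recent values. A minimal standalone sketch of that behaviour, with a window length of 3 and made-up rewards (the logger above defaults to 20):

```python
from collections import deque

# Sliding window that keeps only the three most recent values.
window = deque(maxlen=3)
averages = []

for reward in [10.0, 20.0, 30.0, 40.0]:
    window.append(reward)
    averages.append(sum(window) / len(window))

print(averages)  # [10.0, 15.0, 20.0, 30.0]
```

Until the window is full, the average is taken over fewer values, so the first few running averages of a training run are comparatively noisy.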

### Training Parameters
**TODO:** Try different training parameters and see how they influence the convergence speed.

In [21]:
# Maximum Episodes Played
EPISODES = 5000
# Mini-batch size sampled from the replay buffer
BATCH_SIZE = 32
# Discount Factor
GAMMA = 0.99
# Neurons per Dense layer
UNITS = 32
# Epsilon decay factor of the epsilon-greedy policy (applied once per training step)
EPSILON_DECAY = 0.999
# Model Synchronization Steps
SYNC_STEP = 500
# Adam learning Rate
LEARNING_RATE = 1e-3

# Replay Buffer capacity and minimum size before training
BUFFER_CAPACITY = 100000
TRAINING_START_SIZE = 10000

DUELING = False # Don't change this parameter at first.
DOUBLE = False # Don't change this parameter at first.
NOISY = False # Don't change this parameter at first.
MAX_STEPS = 500 # Don't change this parameter at all.
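
To get a feeling for `EPSILON_DECAY`, it helps to estimate how many training steps it takes until epsilon reaches its minimum. The helper below is a hypothetical illustration, not part of the notebook; it assumes the multiplicative decay used in `DQNAgent.learn`, a start value of 1.0 and the default `epsilon_min` of 0.01.

```python
import math

def steps_until_min(decay: float,
                    epsilon: float = 1.0,
                    epsilon_min: float = 0.01) -> int:
    """Smallest n with epsilon * decay**n <= epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon) / math.log(decay))

# Decay factor from the training parameters: roughly 4600 training steps.
print(steps_until_min(0.999))
# Default decay factor of the agent class: roughly 920 training steps.
print(steps_until_min(0.995))
```

A decay factor closer to 1 therefore keeps the agent exploring for much longer.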

### > Training Loop

**TODO:** To monitor the training progress open a command line in your working directory and enter `tensorboard --logdir summaries`. 

In [None]:
# Load the OpenAI Gym Environment.
env = gym.make('CartPole-v1')
  
# Create a DQN Agent instance.
dqn_agent = DQNAgent(observation_shape=env.observation_space.shape, 
                     action_shape=env.action_space.n,
                     sync_steps=SYNC_STEP,
                     gamma=GAMMA,
                     units=UNITS,
                     epsilon_decay=EPSILON_DECAY,
                     learning_rate=LEARNING_RATE,
                     double=DOUBLE,
                     dueling=DUELING,
                     noisy=NOISY,
                     batch_size=BATCH_SIZE)

# Create a Tensorboard Logger instance.
logging_name = datetime.strftime(datetime.now(), '%y%m%d_%H%M%S_DQN')
logger = Logger(os.path.join("summaries", logging_name))

# Initialize a replay buffer instance.
replay_buffer = ReplayBuffer(BUFFER_CAPACITY)


# Run episodes in the environment
training_step = 0
# Save the best reward (averaged over 20 episodes)
best_avg_reward = -1000

t = trange(EPISODES, desc='', leave=True)   
for e in t:
    # Reset the environment and obtain the first observation
    state = env.reset()

    # Keep track of the summed reward
    episode_reward = 0

    for time_step in range(MAX_STEPS):
        # Choose an action.
        action = dqn_agent.act(state)
        # Perform the action and observe the next state and reward
        next_state, reward, done, _ = env.step(action)
        episode_reward += np.sum(reward)

        # Append the current environment info to the replay buffer.
        replay_buffer.append(state, action, reward, next_state, done)
        # make next_state the new current state for the next frame.
        state = next_state

        # If enough samples have been collected acquire memories
        # from the replay buffer and use them for training.
        if len(replay_buffer) >= TRAINING_START_SIZE and time_step % 4 == 0:
            samples = replay_buffer.sample(BATCH_SIZE)
            value_loss = dqn_agent.learn(samples)
            logger.log_scalar("Misc/ValueLoss", value_loss, training_step)
            training_step += 1
        # If the episode is over break the loop and start with the next episode.
        if done:
            break

    # Print relevant information.
    t.set_description("episode: {}/{}, length: {:5.2f}, reward: {:5.2f}, training steps: {}"
          .format(e + 1, EPISODES, time_step, episode_reward, training_step))
    t.refresh()
    # Log the relevant parameters to the tensorboard.       
    logger.log_scalar("Performance/EpisodeLength", time_step, e)
    logger.log_scalar("Performance/Reward", episode_reward, e)
    logger.log_running_average("Reward", episode_reward)
    logger.log_running_average("EpisodeLength", time_step)
    logger.log_scalar("Misc/Epsilon", dqn_agent.epsilon, e)
    logger.log_scalar("Misc/BufferLength", len(replay_buffer),e)
    
    # Store the best model.
    if logger.get_running_average("Reward") > best_avg_reward:
        best_avg_reward = logger.get_running_average("Reward")
        dqn_agent.model.save(os.path.join("summaries", logging_name, logging_name + "_model.h5"))
        
# NOTE: When the training has converged or the training takes too long it can be interrupted
# at any time. The progress in form of a neural network model will still be saved.

### Testing
Let's test the network we trained on the environment.

In [25]:
# Load the Gym environment.
env = gym.make('CartPole-v1')

# Create a DQN Agent instance.
dqn_agent = DQNAgent(observation_shape=env.observation_space.shape, 
                     action_shape=env.action_space.n,
                     epsilon=0.0)

# TODO: Load your saved model
dqn_agent.model = keras.models.load_model(os.path.join("summaries", logging_name, logging_name + "_model.h5"))

# Run episodes in the environment
for e in range(10):
    # Reset the environment and obtain the first observation
    state = env.reset()

    # Keep track of the summed reward
    episode_reward = 0

    for time_step in range(MAX_STEPS):
        env.render()
        # Choose an action.
        action = dqn_agent.act(state)
        # Perform the action and observe the next state and reward
        next_state, reward, done, _ = env.step(action)
        episode_reward += np.sum(reward)
        # make next_state the new current state for the next frame.
        state = next_state

        # If the episode is over break the loop and start with the next episode.
        if done:
            break
                
    print("episode: {}/{}, length: {:5.2f}, reward: {:5.2f}"
          .format(e + 1, 10, time_step, episode_reward))
env.close()

episode: 1/10, length: 278.00, reward: 279.00
episode: 2/10, length: 269.00, reward: 270.00
episode: 3/10, length: 214.00, reward: 215.00
episode: 4/10, length: 348.00, reward: 349.00
episode: 5/10, length: 303.00, reward: 304.00
episode: 6/10, length: 216.00, reward: 217.00
episode: 7/10, length: 332.00, reward: 333.00
episode: 8/10, length: 252.00, reward: 253.00
episode: 9/10, length: 219.00, reward: 220.00
episode: 10/10, length: 228.00, reward: 229.00
