## Rainbow IQN
The purpose of this notebook is to build an implementation of [DeepMind's Rainbow DQN](https://arxiv.org/abs/1710.02298) algorithm, replacing its C51 distributional component with the more recent IQN distributional approach.

Rainbow is a DQN implementation with the following six modifications:

- **[Double DQN](https://arxiv.org/abs/1509.06461)**: Addresses the overestimation of Q-values by decoupling selection and evaluation of the action.
- **[Dueling DQN](https://arxiv.org/abs/1511.06581)**: Separates the representation of state values from the advantages of each action, so the network can learn how valuable a state is without having to learn the effect of every action in it.
- **N-step Learning**: Uses a multi-step approach for updating Q-values, which helps reward information propagate faster.
- **[Prioritised Experience Replay (PER)](https://arxiv.org/abs/1511.05952)**: Improves sample efficiency by replaying more important transitions more frequently.
- **[Noisy Nets for Exploration](https://arxiv.org/abs/1706.10295)**: Integrates parameterised noise into network weights to facilitate exploration, as an alternative to epsilon-greedy exploration.
- **[Categorical DQN](https://arxiv.org/abs/1707.06887)**: Instead of learning only the expected state-action value, Categorical DQN (C51) models the full distribution of returns as probabilities over a fixed range of discretised bins.
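
As a rough illustration of the N-step idea from the list above, the sketch below folds a short window of rewards into a single discounted return. The function name `n_step_return` and the sample rewards are made up for illustration only.

```python
def n_step_return(rewards, gamma=0.99):
    """Discounted sum r_0 + gamma * r_1 + ... + gamma^(n-1) * r_(n-1)."""
    total = 0.0
    # Accumulate from the last reward backwards so each step applies one discount
    for reward in reversed(rewards):
        total = total * gamma + reward
    return total

# Hypothetical 3-step window of rewards
window = [1.0, 0.0, 2.0]
three_step = n_step_return(window, gamma=0.5)
print(three_step)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

In an n-step update, this return takes the place of the single-step reward, and the bootstrapped value of the state reached after `n` steps is discounted by `gamma ** n`.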

We replace C51 with the more flexible [Implicit Quantile Networks](https://arxiv.org/abs/1806.06923) (IQN) distributional model. Like C51, IQN models the distribution of returns rather than only its mean. But instead of representing that distribution as probabilities over a fixed set of discrete bins, IQN learns the quantile function itself: the network takes a quantile fraction τ in [0, 1] as an extra input and outputs the return value at that quantile for each action. For more information on IQN, see my tutorial: [IQN Tutorial](https://www.kaggle.com/code/auxeno/implicit-quantile-networks-iqn-rl). Tutorials on the other five Rainbow modifications can be found on my Kaggle page.
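
To make the quantile idea concrete, here is a minimal sketch of the quantile Huber loss for a single scalar error, written in plain Python rather than PyTorch. The function name `quantile_huber_loss` and the example numbers are illustrative assumptions.

```python
def quantile_huber_loss(td_error, tau, kappa=1.0):
    """Huber loss on td_error, weighted asymmetrically by the quantile fraction tau."""
    abs_error = abs(td_error)
    if abs_error <= kappa:
        huber = 0.5 * td_error ** 2
    else:
        huber = kappa * (abs_error - 0.5 * kappa)
    # Weight is (1 - tau) when the target lies below the prediction, tau otherwise
    indicator = 1.0 if td_error < 0 else 0.0
    return abs(tau - indicator) * huber

# td_error = target - prediction
under = quantile_huber_loss(td_error=0.5, tau=0.9)   # prediction too low
over = quantile_huber_loss(td_error=-0.5, tau=0.9)   # prediction too high
print(under, over)
```

For τ = 0.9, an underestimate is penalised about nine times more heavily than an overestimate of the same size, which pushes the prediction towards the 0.9 quantile of the return distribution.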

In [3]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
import gymnasium as gym
from collections import deque
import random
import time

### Noisy Linear Layer ###
class NoisyLinear(nn.Module):
    "Noisy networks linear layer."
    def __init__(self, in_features, out_features, noise_std=0.4):
        super().__init__()
        
        self.in_features = in_features
        self.out_features = out_features
        self.std_init = noise_std
        
        # Learnable parameters for mean and standard deviation of weights and biases
        self.weight_mean = nn.Parameter(torch.Tensor(out_features, in_features))
        self.weight_std  = nn.Parameter(torch.Tensor(out_features, in_features))
        self.bias_mean   = nn.Parameter(torch.Tensor(out_features))
        self.bias_std    = nn.Parameter(torch.Tensor(out_features))

        # Buffers for noise
        self.register_buffer('weight_noise', torch.Tensor(out_features, in_features))
        self.register_buffer('bias_noise', torch.Tensor(out_features))
        
        self.initialize_parameters()
        self.reset_noise()

    def forward(self, x):
        weights = self.weight_mean + self.weight_std * self.weight_noise if self.training else self.weight_mean
        biases  = self.bias_mean   + self.bias_std   * self.bias_noise   if self.training else self.bias_mean
        return torch.nn.functional.linear(x, weights, biases)
    
    def initialize_parameters(self):
        # Initialise parameters using uniform distribution and std_init for scaling
        initialization_range = 1. / np.sqrt(self.in_features)
        
        nn.init.uniform_(self.weight_mean, -initialization_range, initialization_range)
        nn.init.uniform_(self.bias_mean, -initialization_range, initialization_range)
        nn.init.constant_(self.weight_std, self.std_init / np.sqrt(self.in_features))
        nn.init.constant_(self.bias_std, self.std_init / np.sqrt(self.out_features))
    
    def reset_noise(self):
        # Regenerates noise for weights and biases
        input_noise  = self.generate_noise(self.in_features)
        output_noise = self.generate_noise(self.out_features)
        
        self.weight_noise.copy_(output_noise.outer(input_noise))
        self.bias_noise.copy_(self.generate_noise(self.out_features))
    
    @staticmethod
    def generate_noise(size):
        # Generates scaled noise, transformed with sign of noise * square root of its abs value
        noise = torch.randn(size)
        return noise.sign() * torch.sqrt(torch.abs(noise))
    

### IQN Q-Network ###
class IQNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64, num_eval_quantiles=16, cosine_embedding_dim=16, noisy=False):
        super().__init__()
        self.num_actions = action_dim
        self.hidden_dim = hidden_dim
        self.num_eval_quantiles = num_eval_quantiles     
        self.cosine_embedding_dim = cosine_embedding_dim
        
        self.input_layer  = nn.Linear(state_dim, hidden_dim)
        self.hidden_layer = nn.Linear(hidden_dim, hidden_dim)
        self.tau_embedding_layer = nn.Linear(cosine_embedding_dim, hidden_dim)
        
        if noisy:
            self.value_layer = NoisyLinear(hidden_dim, 1)
            self.advantage_layer = NoisyLinear(hidden_dim, self.num_actions)
        else:
            self.value_layer = nn.Linear(hidden_dim, 1)
            self.advantage_layer = nn.Linear(hidden_dim, self.num_actions)
        
    def forward(self, state, taus=None):
        if taus is None:
            taus = self.generate_taus(batch_size=state.shape[0], uniform=True).to(state.device)
        
        assert state.shape[0] == taus.shape[0], "Use same batch sizes for state and tau tensors."
        batch_size, n_quantiles = taus.shape
        
        # State encoding
        state_enc = F.relu(self.input_layer(state))
        state_enc = F.relu(self.hidden_layer(state_enc))
        
        # Tau encoding
        pi_i = torch.pi * torch.arange(self.cosine_embedding_dim, device=state.device)
        cos_pi_i_tau = torch.cos(taus.unsqueeze(-1) * pi_i)
        tau_enc = F.relu(self.tau_embedding_layer(cos_pi_i_tau.view(batch_size * n_quantiles, self.cosine_embedding_dim)))
        
        # Combine encodings with Hadamard product, shape (batch, n_quantiles, hidden).
        # tau_enc rows are ordered (batch, quantile), so the view must keep that order.
        combined_enc = state_enc.unsqueeze(1) * tau_enc.view(batch_size, n_quantiles, self.hidden_dim)
        combined_enc = combined_enc.view(batch_size * n_quantiles, self.hidden_dim)
        
        # Dueling
        value = self.value_layer(combined_enc)
        advantages = self.advantage_layer(combined_enc)
        q_values = value + advantages - advantages.mean(dim=1, keepdim=True)
        
        # Reshape to (batch, num_actions, n_quantiles)
        return q_values.view(batch_size, n_quantiles, self.num_actions).transpose(1, 2)
    
    def generate_taus(self, batch_size=1, uniform=False):
        if uniform:
            return torch.linspace(0, 1, self.num_eval_quantiles + 2)[1:-1].expand((batch_size, self.num_eval_quantiles))
        return torch.rand((batch_size, self.num_eval_quantiles))
    
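    def reset_noise(self):
        # Only NoisyLinear layers carry noise; plain nn.Linear layers are skipped
        for layer in (self.value_layer, self.advantage_layer):
            if isinstance(layer, NoisyLinear):
                layer.reset_noise()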

        
### Segment Trees ####
import operator

class SegmentTree(object):
    def __init__(self, capacity, operation, neutral_element):
        assert capacity > 0 and capacity & (capacity - 1) == 0, "Capacity must be a power of 2."
        self._capacity = capacity
        self._value = [neutral_element for _ in range(2 * capacity)]
        self._operation = operation

    def _reduce_helper(self, start, end, node, node_start, node_end):
        if start == node_start and end == node_end:
            return self._value[node]
        mid = (node_start + node_end) // 2
        if end <= mid:
            return self._reduce_helper(start, end, 2 * node, node_start, mid)
        else:
            if mid + 1 <= start:
                return self._reduce_helper(start, end, 2 * node + 1, mid + 1, node_end)
            else:
                return self._operation(
                    self._reduce_helper(start, mid, 2 * node, node_start, mid),
                    self._reduce_helper(mid + 1, end, 2 * node + 1, mid + 1, node_end)
                )
            
    def reduce(self, start=0, end=None):
        if end is None:
            end = self._capacity
        if end < 0:
            end += self._capacity
        end -= 1
        return self._reduce_helper(start, end, 1, 0, self._capacity - 1)

    def __setitem__(self, idx, val):
        idx += self._capacity
        self._value[idx] = val
        idx //= 2
        while idx >= 1:
            self._value[idx] = self._operation( self._value[2 * idx], self._value[2 * idx + 1])
            idx //= 2

    def __getitem__(self, idx):
        assert 0 <= idx < self._capacity
        return self._value[self._capacity + idx]

class SumSegmentTree(SegmentTree):
    def __init__(self, capacity):
        super().__init__(capacity=capacity, operation=operator.add, neutral_element=0.0)

    def sum(self, start=0, end=None):
        return super().reduce(start, end)

    def find_prefixsum_idx(self, prefixsum):
        assert 0 <= prefixsum <= self.sum() + 1e-5
        idx = 1
        while idx < self._capacity:  # while non-leaf
            if self._value[2 * idx] > prefixsum:
                idx = 2 * idx
            else:
                prefixsum -= self._value[2 * idx]
                idx = 2 * idx + 1
        return idx - self._capacity

class MinSegmentTree(SegmentTree):
    def __init__(self, capacity):
        super().__init__(capacity=capacity, operation=min, neutral_element=float('inf'))

    def min(self, start=0, end=None):
        return super().reduce(start, end)
    
    
### Priority Replay Buffer ###
import numpy as np
import random

class PriorityBuffer:
    def __init__(self, size, alpha, beta):
        self.max_size = size
        self.alpha = alpha
        self.beta = beta
        
        # Replay buffer storage
        self.storage = []
        self.next_idx = 0
        
        # Segment trees to manage priority
        tree_size = 1
        while tree_size <= size:
            tree_size *= 2
        self.sum_tree = SumSegmentTree(tree_size)
        self.min_tree = MinSegmentTree(tree_size)
        self.max_priority = 1.
        
    def append(self, experience):
        "Add an experience tuple to replay buffer."
        self.sum_tree[self.next_idx] = self.max_priority ** self.alpha
        self.min_tree[self.next_idx] = self.max_priority ** self.alpha
        
        if self.next_idx == len(self.storage):
            self.storage.append(experience)
        else:
            self.storage[self.next_idx] = experience
        self.next_idx = (self.next_idx + 1) % self.max_size
        
    def sample(self, batch_size):
        "Sample from replay buffer."
        indices = self.select_indices_by_priority(batch_size)
        weights = self.calculate_importance_weights(indices)
        experiences = self.collect_experiences(indices)
        return experiences, indices, weights
        
    def select_indices_by_priority(self, batch_size):
        "Select indices weighted by their relative priority."
        indices = []
        total_priority = self.sum_tree.sum(0, len(self.storage) - 1)
        segment_size = total_priority / batch_size
        for segment_idx in range(batch_size):
            mass = random.random() * segment_size + segment_idx * segment_size
            idx = self.sum_tree.find_prefixsum_idx(mass)
            indices.append(idx)
        return indices
    
    def calculate_importance_weights(self, indices):
        "Calculates importance weights which we scale losses for each sample by."
        weights = []
        min_priority = self.min_tree.min() / self.sum_tree.sum()
        max_weight = (min_priority * len(self.storage)) ** (-self.beta)
        for idx in indices:
            priority = self.sum_tree[idx] / self.sum_tree.sum()
            weight = (priority * len(self.storage)) ** (-self.beta)
            weights.append(weight / max_weight)
        weights = np.array(weights)
        return weights
    
    def collect_experiences(self, indices):
        "Unpacks experiences into numpy arrays."
        states, actions, rewards, next_states, dones = [], [], [], [], []
        for idx in indices:
            experience = self.storage[idx]
            state, action, reward, next_state, done = experience
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            next_states.append(next_state)
            dones.append(done)
        return torch.tensor(np.array(states), dtype=torch.float32), \
               torch.tensor(np.array(actions), dtype=torch.int64), \
               torch.tensor(np.array(rewards), dtype=torch.float32), \
               torch.tensor(np.array(next_states), dtype=torch.float32), \
               torch.tensor(np.array(dones), dtype=torch.float32)
            
    def update_priorities(self, indices, priorities):
        "Updates priorities at provided indices based on observed td-error."
        assert len(indices) == len(priorities), "Priorities and indices have different lengths."
        for idx, priority in zip(indices, priorities):
            assert priority > 0, "Provided priority must be > 0."
            assert 0 <= idx < len(self.storage), "Provided index out of bounds."
            self.sum_tree[idx] = priority ** self.alpha
            self.min_tree[idx] = priority ** self.alpha
            self.max_priority = max(self.max_priority, priority)
    
    def __len__(self):
        return len(self.storage)
    
    
### Experience Replay Buffer ###
class ReplayBuffer:
    def __init__(self, capacity, num_steps=1, gamma=0.99, prioritized=False, alpha=0.6, beta=0.4):
        if prioritized:
            self.buffer = PriorityBuffer(capacity, alpha, beta)
        else:
            self.buffer = deque(maxlen=capacity)
        self.prioritized = prioritized
        self.num_steps = num_steps
        self.gamma = gamma
        self.n_step_buffer = deque(maxlen=num_steps)
        
    def push(self, transition):
        "Pushes transition to buffer and handles n-step logic if required."
        assert len(transition) == 6, "Use new Gym step API: (s, a, r, s', ter, tru)"
        if self.num_steps == 1:
            state, action, reward, next_state, terminated, _ = transition
            self.buffer.append((state, action, reward, next_state, terminated))
        else:
            self.n_step_buffer.append(transition)
            
            # Calculate n-step reward
            _, _, _, final_state, final_termination, final_truncation = transition
            n_step_reward = 0.
            for _, _, reward, _, _, _ in reversed(self.n_step_buffer):
                n_step_reward = n_step_reward * self.gamma + reward
            state, action, _, _, _, _ = self.n_step_buffer[0]

            # If n-step buffer is full, append to main buffer
            if len(self.n_step_buffer) == self.num_steps:
                self.buffer.append((state, action, n_step_reward, final_state, final_termination))
            
            # If done, clear n-step buffer
            if final_termination or final_truncation:
                self.n_step_buffer.clear()
        
    def sample(self, batch_size):
        "Samples a batch of experiences for learner to learn from."
        if self.prioritized:
            experiences, indices, weights = self.buffer.sample(batch_size)
            states, actions, rewards, next_states, dones = experiences
            return (states, actions, rewards, next_states, dones), indices, weights
        else:
            states, actions, rewards, next_states, dones = zip(*random.sample(self.buffer, batch_size))
            states      = torch.tensor(np.stack(states),      dtype=torch.float32)
            actions     = torch.tensor(actions,               dtype=torch.int64  )
            rewards     = torch.tensor(rewards,               dtype=torch.float32)
            next_states = torch.tensor(np.stack(next_states), dtype=torch.float32)
            dones       = torch.tensor(dones,                 dtype=torch.float32)
            return states, actions, rewards, next_states, dones
        
    def update_priorities(self, indices, priorities):
        assert self.prioritized, "Operation not available for non-prioritised buffers."
        self.buffer.update_priorities(indices, priorities)
        
    def __len__(self):
        return len(self.buffer)
    
    
### Linear Scheduler ###
class LinearScheduler:
    "Used to create variables whose values are linearly annealed over time."
    def __init__(self, start, end, total_duration, fraction=1.):
        self.start = start
        self.end = end
        self.total_duration = total_duration
        self.duration = int(total_duration * fraction)
        self.step = 0
        
    def get(self):
        "Gets current value without incrementing step counter."
        if self.step < self.duration:
            current_value = self.start + (self.end - self.start) * (self.step / self.duration)
        else:
            current_value = self.end
        return current_value

    def __call__(self):
        "Gets current value and increments step counter."
        current_value = self.get()
        self.step += 1
        return current_value    

    
### IQN Agent Class ###
class RIQN:
    def __init__(self, config):
        self.device = config['device']
        self.env = gym.make(config['env_name'])
        state_dim = np.prod(self.env.observation_space.shape)
        action_dim = self.env.action_space.n
        self.online_network = IQNetwork(state_dim, 
                                        action_dim, 
                                        config['hidden_dim'], 
                                        config['num_eval_quantiles'], 
                                        config['cosine_embedding_dim'],
                                        config['noisy']).to(self.device)
        self.target_network = IQNetwork(state_dim, 
                                        action_dim, 
                                        config['hidden_dim'], 
                                        config['num_eval_quantiles'], 
                                        config['cosine_embedding_dim'],
                                        config['noisy']).to(self.device)
        self.update_target_network(1.)
        self.optimizer = torch.optim.AdamW(self.online_network.parameters(), lr=config['lr'])
        self.buffer = ReplayBuffer(config['buffer_capacity'], config['num_steps'], config['gamma'], 
                                   config['per'], config['per_alpha'], config['per_beta'])
        self.epsilon = LinearScheduler(config['eps_start'], config['eps_final'], 
                                       config['total_steps'], config['eps_fraction'])
        self.config = config
        
    def update_target_network(self, tau):
        "Updates the parameters of the target network, tau controls how fully the weights are copied."
        for target_param, online_param in zip(self.target_network.parameters(), self.online_network.parameters()):
            target_param.data.copy_(tau * online_param.data + (1. - tau) * target_param.data)
                
    def select_action(self, state, epsilon):
        "Epsilon-greedy action selection; epsilon is ignored when noisy layers handle exploration."
        if random.random() < epsilon and not self.config['noisy']:
            return self.env.action_space.sample()
        state_tensor = torch.tensor(state, dtype=torch.float32, device=self.device).unsqueeze(0)
        # No gradients are needed when acting
        with torch.no_grad():
            return self.online_network(state_tensor).mean(-1).argmax().item()
    
    def learn(self):
        # Load batch and create tensors
        if self.config['per']:
            experiences, indices, weights = self.buffer.sample(self.config['batch_size'])
            states, actions, rewards, next_states, dones = experiences
            weights = torch.tensor(weights, dtype=torch.float32, device=self.device).view(-1, 1, 1)
        else:
            states, actions, rewards, next_states, dones = self.buffer.sample(self.config['batch_size'])
        states = states.to(self.device)
        actions = actions.to(self.device).view(-1, 1, 1)
        rewards = rewards.to(self.device).view(-1, 1, 1)
        next_states = next_states.to(self.device)
        dones = dones.to(self.device).view(-1, 1, 1)
        
        # Generate taus
        taus = self.online_network.generate_taus(batch_size=self.config['batch_size'], uniform=False).to(self.config['device'])
        
        # Get number of quantiles from config
        n_quantiles = self.config['num_eval_quantiles']
        
        # Predicted Q-value quantiles for current state
        current_state_q_values = self.online_network(states, taus)
        
        # Gather Q-value quantiles of actions actually taken
        current_action_q_values = torch.gather(current_state_q_values, dim=1, index=actions.expand(-1, -1, n_quantiles))
        
        # Compute targets
        with torch.no_grad():
            # Get best actions in next state with double DQN then gather Q-values with these actions
            next_state_q_values     = self.target_network(next_states, taus)
            next_state_best_actions = torch.argmax(self.online_network(next_states).mean(dim=2), dim=1, keepdim=True).unsqueeze(-1)
            next_state_max_q_values = torch.gather(next_state_q_values, dim=1, index=next_state_best_actions.expand(-1, -1, n_quantiles))
            
            # Bellman equation to compute target Q-values for not done states
            target_q_values = rewards + self.config['gamma'] ** self.config['num_steps'] * next_state_max_q_values * (1 - dones)
        
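        # Calculate TD error and quantile Huber loss. Errors are taken element-wise between matching taus,
        # a simplification of the IQN paper, which pairs every online tau with every separately sampled target tau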
        kappa = self.config['kappa']
        td_error = target_q_values - current_action_q_values
        huber_loss = torch.where(td_error.abs() <= kappa, 
                                 0.5 * td_error.pow(2), 
                                 kappa * (td_error.abs() - 0.5 * kappa))
        quantile_loss = torch.abs(taus.unsqueeze(1) - (td_error < 0).float()) * huber_loss
        
        # Scale loss and update priorities if using PER
        if self.config['per']:
            quantile_loss = quantile_loss * weights
            # Small constant keeps priorities strictly positive, as update_priorities requires
            new_priorities = td_error.detach().abs().mean(dim=-1).squeeze(1).cpu().numpy() + 1e-6
            self.buffer.update_priorities(indices, new_priorities)
        
        loss = quantile_loss.mean()
        
        # Backward pass
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(self.online_network.parameters(), self.config['grad_norm_clip'])
        self.optimizer.step()
        
        if self.config['noisy']:
            self.online_network.reset_noise()
            self.target_network.reset_noise()
        
    def train(self):
        "Trains agent for a given number of steps according to given configuration."
        print("Training RIQN agent\n")
            
        # Logging information
        logs = {'episode_count': 0, 'episodic_reward': 0., 'episode_rewards': [], 'start_time': time.time()}
        
        # Reset episode
        state, _ = self.env.reset()
        
        # Main training loop
        for step in range(1, self.config['total_steps'] + 1):
            # Get action and execute in environment
            action = self.select_action(state, self.epsilon())
            next_state, reward, terminated, truncated, _ = self.env.step(action)
            
            # Update logs
            logs['episodic_reward'] += reward
            
            # Push experience to buffer
            self.buffer.push((state, action, reward, next_state, terminated, truncated))

            if terminated or truncated:
                state, _ = self.env.reset()
                
                # Update logs
                logs['episode_count'] += 1
                logs['episode_rewards'].append(logs['episodic_reward'])
                logs['episodic_reward'] = 0.
            else:
                state = next_state
            
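            # Perform learning step every `train_frequency` steps once enough experience is collected
            if (len(self.buffer) >= self.config['batch_size'] and step > self.config['learning_starts']
                    and step % self.config['train_frequency'] == 0):
                self.learn()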
            
            # Update target network
            if step % self.config['target_update'] == 0:
                self.update_target_network(self.config['tau'])
                
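            # If a target reward is set and the mean of the last 20 episode rewards reaches it, end training
            target_reward = self.config['target_reward']
            if target_reward is not None and len(logs['episode_rewards']) > 0 \
                    and np.mean(logs['episode_rewards'][-20:]) >= target_reward:
                break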
            
            # Print training info if verbose
            if self.config['verbose'] and step % 100 == 0 and len(logs['episode_rewards']) > 0:
                print(f"\r--- {100 * step / self.config['total_steps']:.1f}%" 
                      f"\t Step: {step:,}"
                      f"\t Mean Reward: {np.mean(logs['episode_rewards'][-20:]):.2f}"
                      f"\t Epsilon: {(1-self.config['noisy']) * self.epsilon.get():.2f}"
                      f"\t Episode: {logs['episode_count']:,}"
                      f"\t Duration: {time.time() - logs['start_time']:,.1f}s  ---", end='')
                if step % 10000 == 0:
                    print()
                    
        # Training ended
        print("\n\nTraining done")
        logs['end_time'] = time.time()
        logs['duration'] = logs['end_time'] - logs['start_time']
        return logs
    
        
### IQN Configuration ###
riqn_config = {
    'env_name'            : 'LunarLander-v2',  # Gym environment to use
    'device'              :         'cpu',  # Device used for learning
    'total_steps'         :        1000000,  # Total training steps
    'hidden_dim'          :            64,  # Number of neurons in Q-network hidden layer
    'batch_size'          :            64,  # Number of experience tuples sampled per learning update
    'noisy'               :          True,  # Noisy networks tweak
    'per'                 :          True,  # Use prioritised experience replay
    'per_alpha'           :           0.6,  # Priority sampling weight (0 for maximum sampling entropy)
    'per_beta'            :           0.4,  # Importance sampling correction to reduce bias from frequently sampled experiences
    'buffer_capacity'     :        100000,  # Maximum length of replay buffer
    'target_update'       :            20,  # How often to perform target network weight synchronisations
    'tau'                 :           0.5,  # When copying online network weights to target network, what weight is given to online network weights
    'eps_start'           :           0.8,  # Initial epsilon to use
    'eps_final'           :          0.05,  # Lowest possible epsilon value
    'eps_fraction'        :           0.6,  # Fraction of entire training period over which the exploration rate is reduced
    'learning_starts'     :           512,  # Step to begin learning at
    'train_frequency'     :             1,  # Performs a learning update every `train_frequency` steps
    'lr'                  :          5e-4,  # Learning rate
    'grad_norm_clip'      :            10,  # Global grad norm clipping
    'gamma'               :          0.99,  # Discount factor
    'num_steps'           :             3,  # Multistep reward steps
    'kappa'               :           1.0,  # Huber loss kappa
    'num_eval_quantiles'  :            16,  # Number of quantile fractions (taus) sampled per forward pass; used here for N, N' and K from the IQN paper
    'cosine_embedding_dim':            16,  # n in the IQN paper, dimensionality of the cosine embedding of each tau
    'target_reward'       :           195,  # If set to a number, training will stop when mean reward for recent episodes exceeds this
    'verbose'             :          True,  # Prints steps and rewards in output
}

In [4]:
agent = RIQN(riqn_config)
logs = agent.train()

Training RIQN agent



--- 1.0%	 Step: 10,000	 Mean Reward: -269.66	 Epsilon: 0.00	 Episode: 23	 Duration: 209.9s  ---

--- 2.0%	 Step: 20,000	 Mean Reward: -103.68	 Epsilon: 0.00	 Episode: 34	 Duration: 426.2s  ---

--- 3.0%	 Step: 30,000	 Mean Reward: -111.00	 Epsilon: 0.00	 Episode: 47	 Duration: 641.5s  ---

--- 4.0%	 Step: 40,000	 Mean Reward: -109.47	 Epsilon: 0.00	 Episode: 58	 Duration: 872.1s  ---

--- 5.0%	 Step: 50,000	 Mean Reward: -102.97	 Epsilon: 0.00	 Episode: 70	 Duration: 1,093.4s  ---

--- 6.0%	 Step: 60,000	 Mean Reward: -76.22	 Epsilon: 0.00	 Episode: 83	 Duration: 1,302.1s  ---

--- 7.0%	 Step: 70,000	 Mean Reward: -47.50	 Epsilon: 0.00	 Episode: 93	 Duration: 1,516.7s  ---

--- 8.0%	 Step: 80,000	 Mean Reward: -11.23	 Epsilon: 0.00	 Episode: 104	 Duration: 1,723.4s  ---

--- 9.0%	 Step: 90,000	 Mean Reward: 2.45	 Epsilon: 0.00	 Episode: 116	 Duration: 1,922.2s  ---

--- 10.0%	 Step: 100,000	 Mean Reward: 32.17	 Epsilon: 0.00	 Episode: 126	 Duration: 2,146.0s  