# DDDQN  (Double Dueling Deep Q Learning with Prioritized Experience Replay)  Doom🕹️
In this notebook we'll implement an agent <b>that plays Doom by using a Dueling Double Deep Q learning architecture with Prioritized Experience Replay.</b> <br>

Here is our agent playing Doom after 3 hours of training on **CPU**. Keep in mind that the agent needs about 2 days of training on **GPU** to reach an optimal score. We'll train the most important architectures from beginning to end (PPO with transfer):

<img src="https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/projects/doomdeathmatc.gif" alt="Doom Deathmatch"/>

But we can see that our agent **understands that he needs to kill enemies before being able to move forward (if he moves forward without killing enemies he will be killed before getting the vest).**

# This is a notebook from [Deep Reinforcement Learning Course with Tensorflow](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)
<img src="https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/DRLC%20Environments.png" alt="Deep Reinforcement Course"/>
<br>
<p>  Deep Reinforcement Learning Course is a free series of articles and video tutorials 🆕 about Deep Reinforcement Learning, where <b>we'll learn the main algorithms (Q-learning, Deep Q Nets, Dueling Deep Q Nets, Policy Gradients, A2C, Proximal Policy Optimization…), and how to implement them with Tensorflow.</b>
<br><br>
    
📜The articles explain the architectures from the big picture to the mathematical details behind them.
<br>
📹 The videos explain how to build the agents with Tensorflow </b></p>
<br>
This course will give you a **solid foundation for understanding and implementing the future state of the art algorithms**. And, you'll build a strong professional portfolio by creating **agents that learn to play awesome environments**: Doom© 👹, Space invaders 👾, Outrun, Sonic the Hedgehog©, Michael Jackson’s Moonwalker, agents that will be able to navigate in 3D environments with DeepMindLab (Quake) and able to walk with Mujoco. 
<br><br>
</p> 

## 📚 The complete [Syllabus HERE](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)


## Any questions 👨‍💻
<p> If you have any questions, feel free to ask me: </p>
<p> 📧: <a href="mailto:hello@simoninithomas.com">hello@simoninithomas.com</a>  </p>
<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>
<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>
<p> Twitter: <a href="https://twitter.com/ThomasSimonini">@ThomasSimonini</a> </p>
<p> Don't forget to <b> follow me on <a href="https://twitter.com/ThomasSimonini">twitter</a>, <a href="https://github.com/simoninithomas/Deep_reinforcement_learning_Course">github</a> and <a href="https://medium.com/@thomassimonini">Medium</a> to be alerted of the new articles that I publish </b></p>
    
## How to help  🙌
3 ways:
- **Clap our articles and like our videos a lot**: Clapping in Medium means that you really like our articles. And the more claps we have, the more our article is shared. Liking our videos helps them to be much more visible to the deep learning community.
- **Share and speak about our articles and videos**: By sharing our articles and videos you help us to spread the word. 
- **Improve our notebooks**: if you found a bug or **a better implementation** you can send a pull request.
<br>

## Important note 🤔
<b> You can run it on your computer but it's better to run it on GPU-based services</b>. Personally, I use Microsoft Azure and their Deep Learning Virtual Machine (they offer $170)
https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning
<br>
⚠️ I don't have any business relations with them. I just loved their excellent customer service.

If you have trouble using Microsoft Azure, follow the explanations in this excellent article (skip the last part about fast.ai): https://medium.com/@manikantayadunanda/setting-up-deeplearning-machine-and-fast-ai-on-azure-a22eb6bd6429

## Prerequisites 🏗️
Before diving into the notebook **you need to understand**:
- The foundations of Reinforcement learning (MC, TD, Rewards hypothesis...) [Article](https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419)
- Q-learning [Article](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe)
- Deep Q-Learning [Article](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)
- Improvements in Deep Q-learning [Article]()
- You can follow this notebook using my [video tutorial](https://www.youtube.com/embed/-Ynjw0Vl3i4?showinfo=0)

In [1]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/-Ynjw0Vl3i4?showinfo=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')



## Step 1: Import the libraries 📚

In [2]:
import tensorflow as tf      # Deep Learning library
import numpy as np           # Handle matrices
from vizdoom import *        # Doom Environment

import random                # Handling random number generation
import time                  # Handling time calculation
from skimage import transform# Help us to preprocess the frames

from collections import deque# Ordered collection with ends
import matplotlib.pyplot as plt # Display graphs

import pickle

import warnings # This ignores all the warning messages that are normally printed during training because of skimage
warnings.filterwarnings('ignore') 

## Step 2: Create our environment 🎮
- Now that we imported the libraries/dependencies, we will create our environment.
- Doom environment takes:
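    - A `configuration file` that **handles all the options** (size of the frame, possible actions...)
    - A `scenario file` that **generates the correct scenario** (in our case basic **but you're invited to try other scenarios**).
- Note: The deadly corridor scenario has 7 possible actions (turn left, turn right, move left, move right, shoot (attack)...). Each action is passed as a list with a 1 for the pressed button, such as `[[0,0,0,0,1]...]`, so the actions are already one-hot encoded and we don't need to encode them ourselves (thanks to <a href="https://stackoverflow.com/users/2237916/silgon">silgon</a> for figuring this out). The code below builds 3 such actions with `np.identity(3)`, so it expects a configuration with 3 available buttons.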
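
As a minimal sketch of what "already one-hot" means, the snippet below builds the action list the same way the code in this step does, with `np.identity`. The value `n_buttons = 3` is an assumption taken from the `np.identity(3, ...)` call in this notebook; use the number of buttons declared in your own configuration file.

```python
import numpy as np

# Assumption: 3 buttons, matching np.identity(3, dtype=int) in this notebook
n_buttons = 3

# Each row of the identity matrix presses exactly one button
possible_actions = np.identity(n_buttons, dtype=int).tolist()

for action in possible_actions:
    print(action)
```

Each printed list can be passed to the game as one action, which is why no separate encoding step is needed.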

### Our environment
<img src="https://simoninithomas.github.io/Deep_reinforcement_learning_Course/assets/img/video%20projects/deadlycorridor.png" style="max-width:500px;" alt="Vizdoom deadly corridor"/>

The purpose of this scenario is to teach the agent to navigate towards his fundamental goal (the vest) and make sure he survives at the same time.

- Map is a corridor with shooting monsters on both sides (6 monsters in total). 
- A green vest is placed at the opposite end of the corridor. 
- **Reward is proportional (negative or positive) to change of the distance between the player and the vest.** 
- If player ignores monsters on the sides and runs straight for the vest he will be killed somewhere along the way. 
- To ensure this behavior doom_skill = 5 (config) is needed.

<br>
REWARDS:

- +dX for getting closer to the vest. -dX for getting further from the vest.
- death penalty = 100

In [3]:
"""
Here we create our environment
"""
def create_environment():
    game = DoomGame()
    
    # Load the correct configuration
    game.load_config("defend_the_center.cfg")
    
    # Load the correct scenario (here the defend_the_center scenario)
    game.set_doom_scenario_path("defend_the_center.wad")
    game.set_sound_enabled(False)
    game.set_screen_resolution(ScreenResolution.RES_640X480)
    game.set_window_visible(False)
    
    game.init()

    # Here we create a one-hot encoded version of our actions (3 possible actions)
    # possible_actions = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    possible_actions = np.identity(3,dtype=int).tolist()
    
    return game, possible_actions

In [4]:
game, possible_actions = create_environment()

## Step 3: Define the preprocessing functions ⚙️
### preprocess_frame
Preprocessing is an important step, <b>because we want to reduce the complexity of our states to reduce the computation time needed for training.</b>
<br><br>
Our steps:
- Grayscale each of our frames (because <b> color does not add important information </b>). But this is already done by the config file.
- Crop the screen (in our case we remove the roof because it contains no information)
- We normalize pixel values
- Finally we resize the preprocessed frame
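
The sketch below walks through these steps on a synthetic frame using NumPy only. The frame size (480x640), the crop margins, and the nearest-neighbour resize are illustrative assumptions; the notebook itself resizes with `skimage.transform.resize`.

```python
import numpy as np

def preprocess_sketch(frame, out_size=64):
    # Crop: drop a few rows at the top/bottom and columns at the sides
    # (hypothetical margins, chosen only for illustration)
    cropped = frame[15:-5, 20:-20]

    # Normalize pixel values from [0, 255] to [0, 1]
    normalized = cropped.astype(np.float32) / 255.0

    # Nearest-neighbour resize: keep out_size evenly spaced rows and columns
    rows = np.linspace(0, normalized.shape[0] - 1, out_size).astype(int)
    cols = np.linspace(0, normalized.shape[1] - 1, out_size).astype(int)
    return normalized[np.ix_(rows, cols)]

# Synthetic grayscale frame standing in for a screen buffer
frame = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)
small = preprocess_sketch(frame)
print(small.shape, small.dtype)
```

A smaller, normalized frame means fewer inputs for the convolutional layers, which is where the reduction in training time comes from.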

In [None]:
"""
    preprocess_frame:
    Take a frame.
    Resize it.
        __________________
        |                 |
        |                 |
        |                 |
        |                 |
        |_________________|
        
        to
        _____________
        |            |
        |            |
        |            |
        |____________|
    Normalize it.
    
    return preprocessed_frame
    
    """
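def preprocess_frame(frame):
    # Crop the screen (remove the part that contains no information)
    # [Up: Down, Left: Right]
    # Cropping is currently disabled; uncomment the next line to enable it
    # cropped_frame = frame[15:-5,20:-20]
    
    # Normalize pixel values
    # Explicit normalization is disabled: by default skimage's resize already
    # converts integer images to floats in [0, 1]
    # normalized_frame = cropped_frame/255.0
    
    # Resize
    preprocessed_frame = transform.resize(frame, [64, 64])
    
    return preprocessed_frame # 64x64 frame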

### stack_frames
👏 This part was made possible thanks to the help of <a href="https://github.com/Miffyli">Anssi</a><br>

As explained in this really <a href="https://danieltakeshi.github.io/2016/11/25/frame-skipping-and-preprocessing-for-deep-q-networks-on-atari-2600-games/">  good article </a> we stack frames.

Stacking frames is really important because it helps us to **give a sense of motion to our Neural Network.**

- First we preprocess the frame
- Then we append the frame to the deque that automatically **removes the oldest frame**
- Finally we **build the stacked state**

This is how the stack works:
- For the first frame, we feed 4 frames
- At each timestep, **we add the new frame to deque and then we stack them to form a new stacked frame**
- And so on
<img src="https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/DQN/Space%20Invaders/assets/stack_frames.png" alt="stack">
- If we're done, **we create a new stack with 4 new frames (because we are in a new episode)**.
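
To make the behaviour concrete, here is a small sketch using tiny 2x2 "frames" instead of 64x64 ones. The helper name `push` and the frame values are hypothetical and only serve the illustration.

```python
from collections import deque
import numpy as np

stack_size = 4
frames = deque(maxlen=stack_size)

def push(frames, frame, is_new_episode):
    if is_new_episode:
        # New episode: fill the whole stack with copies of the first frame
        frames.clear()
        for _ in range(stack_size):
            frames.append(frame)
    else:
        # Otherwise the deque drops the oldest frame automatically
        frames.append(frame)
    # Frames are stacked along the last axis: (height, width, stack_size)
    return np.stack(list(frames), axis=2)

f0 = np.full((2, 2), 0.0)   # first frame of the episode
f1 = np.full((2, 2), 1.0)   # next frame

state = push(frames, f0, True)
next_state = push(frames, f1, False)

print(state.shape)
print(next_state[0, 0])     # the same pixel across the 4 stacked frames
```

After the second call, the newest frame occupies the last channel and the three older copies of the first frame fill the rest.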

In [None]:
stack_size = 4 # We stack 4 frames

# Initialize deque with zero-images, one array for each image
stacked_frames  =  deque([np.zeros((64,64), dtype=int) for i in range(stack_size)], maxlen=stack_size) 

def stack_frames(stacked_frames, state, is_new_episode):
    # Preprocess frame
    frame = preprocess_frame(state)
    
    if is_new_episode:
        # Clear our stacked_frames
        stacked_frames = deque([np.zeros((64,64), dtype=int) for i in range(stack_size)], maxlen=stack_size)
        
        # Because we're in a new episode, copy the same frame 4x
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        
        # Stack the frames
        stacked_state = np.stack(stacked_frames, axis=2)

    else:
        # Append frame to deque, automatically removes the oldest frame
        stacked_frames.append(frame)

        # Build the stacked state (the last dimension indexes the different frames)
        stacked_state = np.stack(stacked_frames, axis=2) 
    
    return stacked_state, stacked_frames

# Added by Karim
# We want to adapt the computation of our reward with this function
def shape_reward(r_t, misc, prev_misc):
    """
    Reward shaping:
        +1 when the kill count increases,
        -0.1 when ammo is used,
        -0.1 when health is lost.
    misc and prev_misc hold the game variables [KILLCOUNT, AMMO, HEALTH].
    """
    # Check any kill count
    if (misc[0] > prev_misc[0]):
        r_t = r_t + 1

    if (misc[1] < prev_misc[1]): # Used ammo
        r_t = r_t - 0.1

    if (misc[2] < prev_misc[2]): # Lost health
        r_t = r_t - 0.1

    return r_t

## Step 4: Set up our hyperparameters ⚗️
In this part we'll set up our different hyperparameters. But when you implement a neural network by yourself you will **not implement all the hyperparameters at once but progressively**.

- First, you begin by defining the neural networks hyperparameters when you implement the model.
- Then, you'll add the training hyperparameters when you implement the training algorithm.
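
To get a feel for the exploration hyperparameters set in this step (`explore_start`, `explore_stop`, `decay_rate`), the sketch below assumes a common exponential schedule for the epsilon-greedy strategy. The function name `explore_probability` and the exact formula are assumptions, since the training loop is not part of this section.

```python
import math

explore_start = 1.0     # exploration probability at start
explore_stop = 0.01     # minimum exploration probability
decay_rate = 0.00005    # exponential decay rate

def explore_probability(decay_step):
    # Assumed schedule: decays exponentially from explore_start toward explore_stop
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * decay_step)

for step in (0, 10000, 100000):
    print(step, round(explore_probability(step), 4))
```

With such a small `decay_rate`, the agent keeps exploring for a long time, which matters in a scenario where random actions rarely reach the goal.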

In [None]:
### MODEL HYPERPARAMETERS
state_size = [64,64,4]      # Our input is a stack of 4 frames hence 64x64x4 (Width, height, channels) 
action_size = game.get_available_buttons_size()              # Number of possible actions (one per available button)
learning_rate =  0.00025      # Alpha (aka learning rate)

### TRAINING HYPERPARAMETERS
total_episodes = 12000         # Total episodes for training
max_steps = 8000               # Max possible steps in an episode
batch_size = 64                # Number of experiences per minibatch

# FIXED Q TARGETS HYPERPARAMETERS 
max_tau = 10000 # Tau is the number of steps (C) between two updates of our target network

# EXPLORATION HYPERPARAMETERS for epsilon greedy strategy
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.00005            # exponential decay rate for exploration prob

# Q LEARNING hyperparameters
gamma = 0.95               # Discounting rate

### MEMORY HYPERPARAMETERS
## If you have a GPU, change to 1 million
pretrain_length = 300000   # Number of experiences stored in the Memory when initialized for the first time
memory_size = 300000      # Number of experiences the Memory can keep

### MODIFY THIS TO FALSE IF YOU JUST WANT TO SEE THE TRAINED AGENT
training = True

## TURN THIS TO TRUE IF YOU WANT TO RENDER THE ENVIRONMENT
episode_render = False

## Step 5: Create our Dueling Double Deep Q-learning Neural Network model (aka DDDQN) 🧠
<img src="https://cdn-images-1.medium.com/max/1500/1*FkHqwA2eSGixdS-3dvVoMA.png" alt="Dueling Double Deep Q Learning Model" />
This is our Dueling Double Deep Q-learning model:
- We take a stack of 4 frames as input
- It passes through 3 convolutional layers
- Then it is flattened
- Then it is passed through 2 streams
    - One that calculates V(s)
    - The other that calculates A(s,a)
- Finally an aggregating layer combines the two streams
- It outputs a Q value for each action
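
The aggregating layer can be sketched in NumPy as follows. The numbers are made up for illustration; the formula, Q(s,a) = V(s) + (A(s,a) - mean over a' of A(s,a')), is the one used in the model code of this step.

```python
import numpy as np

# Illustrative outputs of the two streams for a batch of 2 states and 3 actions
value = np.array([[2.0], [0.5]])                 # V(s), shape (batch, 1)
advantage = np.array([[1.0, 0.0, -1.0],          # A(s, a), shape (batch, actions)
                      [0.3, 0.3, 0.0]])

# Subtract the mean advantage of each state, then add the state value
q_values = value + (advantage - advantage.mean(axis=1, keepdims=True))
print(q_values)
```

Because the mean advantage is subtracted, the average Q value of each state equals V(s), which keeps the two streams identifiable.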

In [None]:
class DDDQNNet:
    def __init__(self, state_size, action_size, learning_rate, name):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.name = name
        
        
        # We use tf.variable_scope here to know which network we're using (DQN or target_net)
        # it will be useful when we update our w- parameters (by copying the DQN parameters)
        with tf.variable_scope(self.name):
            
            # We create the placeholders
            # *state_size unpacks the elements of state_size, so it is as if we wrote
            # [None, 64, 64, 4]
            self.inputs_ = tf.placeholder(tf.float32, [None, *state_size], name="inputs")
            
            # Importance-sampling weights used by Prioritized Experience Replay
            self.ISWeights_ = tf.placeholder(tf.float32, [None,1], name='IS_weights')
            
            self.actions_ = tf.placeholder(tf.float32, [None, action_size], name="actions_")
            
            # Remember that target_Q is R(s,a) + gamma * max_a' Qhat(s', a')
            self.target_Q = tf.placeholder(tf.float32, [None], name="target")
            
            """
            First convnet:
            CNN
            ELU
            """
            # Input is 64x64x4
            self.conv1 = tf.layers.conv2d(inputs = self.inputs_,
                                         filters = 32,
                                         kernel_size = [8,8],
                                         strides = [4,4],
                                         padding = "VALID",
                                         kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                         name = "conv1")
            
            self.conv1_out = tf.nn.elu(self.conv1, name="conv1_out")
            
            
            """
            Second convnet:
            CNN
            ELU
            """
            self.conv2 = tf.layers.conv2d(inputs = self.conv1_out,
                                 filters = 64,
                                 kernel_size = [4,4],
                                 strides = [2,2],
                                 padding = "VALID",
                                 kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                 name = "conv2")

            self.conv2_out = tf.nn.elu(self.conv2, name="conv2_out")
            
            
            """
            Third convnet:
            CNN
            ELU
            """
            self.conv3 = tf.layers.conv2d(inputs = self.conv2_out,
                                 filters = 64,
                                 kernel_size = [3,3],
                                 strides = [1,1],
                                 padding = "VALID",
                                kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                 name = "conv3")

            self.conv3_out = tf.nn.elu(self.conv3, name="conv3_out")
            
            
            self.flatten = tf.layers.flatten(self.conv3_out)
            
            
            ## Here we separate into two streams
            # The one that calculates V(s)
            self.value_fc = tf.layers.dense(inputs = self.flatten,
                                  units = 512,
                                  activation = tf.nn.elu,
                                       kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                name="value_fc")
            
            self.value = tf.layers.dense(inputs = self.value_fc,
                                        units = 1,
                                        activation = None,
                                        kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                name="value")
            
            # The one that calculates A(s,a)
            self.advantage_fc = tf.layers.dense(inputs = self.flatten,
                                  units = 512,
                                  activation = tf.nn.elu,
                                       kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                name="advantage_fc")
            
            self.advantage = tf.layers.dense(inputs = self.advantage_fc,
                                        units = self.action_size,
                                        activation = None,
                                        kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                name="advantages")
            
            # Aggregating layer
            # Q(s,a) = V(s) + (A(s,a) - 1/|A| * sum A(s,a'))
            self.output = self.value + tf.subtract(self.advantage, tf.reduce_mean(self.advantage, axis=1, keepdims=True))
              
            # Q is our predicted Q value.
            self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_), axis=1)
            
            # The loss is modified because of PER 
            self.absolute_errors = tf.abs(self.target_Q - self.Q)# for updating Sumtree
            
            self.loss = tf.reduce_mean(self.ISWeights_ * tf.squared_difference(self.target_Q, self.Q))
            
            self.optimizer = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)

In [None]:
# Reset the graph
tf.reset_default_graph()

# Instantiate the DQNetwork
DQNetwork = DDDQNNet(state_size, action_size, learning_rate, name="DQNetwork")

# Instantiate the target network
TargetNetwork = DDDQNNet(state_size, action_size, learning_rate, name="TargetNetwork")

## Step 6: Prioritized Experience Replay 🔁
Now that we have created our Neural Network, **we need to implement the Prioritized Experience Replay method.** <br>

As explained in the article, **we can't use a simple array to do that because sampling from it would not be efficient, so we use a binary tree data type (in a binary tree each node has no more than 2 children).** More precisely, we use a sumtree, which is a binary tree where each parent node is the sum of its child nodes.
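
As a small worked example, the sketch below stores a sumtree with four leaves in a flat list and walks down from the root to find the leaf that covers a given value. The priorities (1, 2, 3, 4) and the helper name `retrieve` are illustrative assumptions.

```python
# Leaf priorities: 1, 2, 3, 4 (illustrative values)
# Array layout: [root, left parent, right parent, leaf, leaf, leaf, leaf]
tree = [10, 3, 7, 1, 2, 3, 4]

def retrieve(tree, v):
    """Walk down from the root to the leaf whose cumulative range contains v."""
    index = 0
    while True:
        left = 2 * index + 1
        right = left + 1
        if left >= len(tree):      # no children: index is a leaf
            return index
        if v <= tree[left]:
            index = left           # v falls in the left subtree
        else:
            v -= tree[left]        # skip the left subtree's total priority
            index = right

leaf = retrieve(tree, 5.5)
print(leaf, tree[leaf])
```

Each lookup visits one node per level, so it costs O(log n) instead of scanning every priority in an array.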

If you don't know what a binary tree is, check this awesome video: https://www.youtube.com/watch?v=oSWTXtMglKE


This SumTree implementation was taken from Morvan Zhou's Chinese course about Reinforcement Learning.

To summarize:
- **Step 1**: We construct a SumTree, which is a binary sum tree whose leaves contain the priorities, and a data array whose indexes point to the indexes of the leaves.
    <img src="https://cdn-images-1.medium.com/max/1200/1*Go9DNr7YY-wMGdIQ7HQduQ.png" alt="SumTree"/>
    <br><br>
    - **def __init__**: Initialize our SumTree data object with all nodes = 0 and data (data array) with all = 0.
    - **def add**: add our priority score in the sumtree leaf and experience (S, A, R, S', Done) in data.
    - **def update**: we update the leaf priority score and propagate through tree.
    - **def get_leaf**: retrieve priority score, index and experience associated with a leaf.
    - **def total_priority**: get the root node value to calculate the total priority score of our replay buffer.
<br><br>
- **Step 2**: We create a Memory object that will contain our sumtree and data.
    - **def __init__**: generates our sumtree and data by instantiating the SumTree object.
    - **def store**: we store a new experience in our tree. Each new experience will **have priority = max_priority** (this priority will then be corrected during training, when we calculate the TD error and hence the priority score).
    - **def sample**:
         - First, to sample a minibatch of size k, the range [0, priority_total] is divided into k ranges.
         - Then a value is uniformly sampled from each range.
         - We search the sumtree and retrieve the experiences whose priority scores correspond to the sampled values.
         - Then, we calculate IS weights for each minibatch element
    - **def batch_update**: update the priorities on the tree
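
The sampling steps can be sketched in plain Python as follows. The priorities, the minibatch size `k`, and the exponent `per_b` are illustrative assumptions, and the weights are computed for every stored priority rather than only for the sampled experiences, to keep the example short.

```python
import random

priorities = [1.0, 2.0, 3.0, 4.0]   # illustrative leaf priorities
total_priority = sum(priorities)
k = 2                               # minibatch size
per_b = 0.4                         # importance-sampling exponent

# 1) Split [0, total_priority] into k equal ranges and draw one value per range
segment = total_priority / k
values = [random.uniform(segment * i, segment * (i + 1)) for i in range(k)]

# 2) Importance-sampling weights, normalized by the largest possible weight
probabilities = [p / total_priority for p in priorities]
max_weight = (k * min(probabilities)) ** (-per_b)
is_weights = [((k * p) ** (-per_b)) / max_weight for p in probabilities]

print(values)
print([round(w, 3) for w in is_weights])
```

The experience with the lowest priority gets weight 1, and experiences that are sampled more often get smaller weights, which compensates for the bias introduced by prioritized sampling.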

In [None]:
class SumTree(object):
    """
    This SumTree code is a modified version of Morvan Zhou's: 
    https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/5.2_Prioritized_Replay_DQN/RL_brain.py
    """
    data_pointer = 0
    
    """
    Here we initialize the tree with all nodes = 0, and initialize the data with all values = 0
    """
    def __init__(self, capacity):
        self.capacity = capacity # Number of leaf nodes (final nodes) that contains experiences
        
        # Generate the tree with all nodes values = 0
        # To understand this calculation (2 * capacity - 1) look at the schema above
        # Remember we are in a binary tree (each node has at most 2 children), so the tree has
        # capacity leaf nodes plus capacity - 1 parent nodes
        # Parent nodes = capacity - 1
        # Leaf nodes = capacity
        self.tree = np.zeros(2 * capacity - 1)
        
        """ tree:
            0
           / \
          0   0
         / \ / \
        0  0 0  0  [Size: capacity] this row holds the priority scores (aka pi)
        """
        
        # Contains the experiences (so the size of data is capacity)
        self.data = np.zeros(capacity, dtype=object)
    
    
    """
    Here we add our priority score in the sumtree leaf and add the experience in data
    """
    def add(self, priority, data):
        # Look at what index we want to put the experience
        tree_index = self.data_pointer + self.capacity - 1
        
        """ tree:
            0
           / \
          0   0
         / \ / \
tree_index  0 0  0  We fill the leaves from left to right
        """
        
        # Update data frame
        self.data[self.data_pointer] = data
        
        # Update the leaf
        self.update (tree_index, priority)
        
        # Add 1 to data_pointer
        self.data_pointer += 1
        
        if self.data_pointer >= self.capacity:  # If we're above the capacity, we go back to the first index (we overwrite)
            self.data_pointer = 0
            
    
    """
    Update the leaf priority score and propagate the change through tree
    """
    def update(self, tree_index, priority):
        # Change = new priority score - former priority score
        change = priority - self.tree[tree_index]
        self.tree[tree_index] = priority
        
        # then propagate the change through tree
        while tree_index != 0:    # this method is faster than the recursive loop in the reference code
            
            """
            Here we want to access the line above
            THE NUMBERS IN THIS TREE ARE THE INDEXES NOT THE PRIORITY VALUES
            
                0
               / \
              1   2
             / \ / \
            3  4 5  [6] 
            
            If we are in leaf at index 6, we updated the priority score
            We need then to update index 2 node
            So tree_index = (tree_index - 1) // 2
            tree_index = (6-1)//2
            tree_index = 2 (because // rounds the result down)
            """
            tree_index = (tree_index - 1) // 2
            self.tree[tree_index] += change
    
    
    """
    Here we get the leaf_index, priority value of that leaf and experience associated with that index
    """
    def get_leaf(self, v):
        """
        Tree structure and array storage:
        Tree index:
             0         -> storing priority sum
            / \
          1     2
         / \   / \
        3   4 5   6    -> storing priority for experiences
        Array type for storing:
        [0,1,2,3,4,5,6]
        """
        parent_index = 0
        
        while True: # the while loop is faster than the method in the reference code
            left_child_index = 2 * parent_index + 1
            right_child_index = left_child_index + 1
            
            # If we reach bottom, end the search
            if left_child_index >= len(self.tree):
                leaf_index = parent_index
                break
            
            else: # downward search, always search for a higher priority node
                
                if v <= self.tree[left_child_index]:
                    parent_index = left_child_index
                    
                else:
                    v -= self.tree[left_child_index]
                    parent_index = right_child_index
            
        data_index = leaf_index - self.capacity + 1

        return leaf_index, self.tree[leaf_index], self.data[data_index]
    
    @property
    def total_priority(self):
        return self.tree[0] # Returns the root node

Here we don't use a deque anymore.
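
A minimal sketch of the alternative, a fixed-size array with a write pointer that wraps around, is shown below. The class name `RingBuffer` is hypothetical; the SumTree above applies the same idea with `data_pointer`.

```python
class RingBuffer:
    """Fixed-size storage that overwrites the oldest entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [None] * capacity
        self.data_pointer = 0

    def add(self, experience):
        # Write at the pointer, then move it forward and wrap around at the end
        self.data[self.data_pointer] = experience
        self.data_pointer = (self.data_pointer + 1) % self.capacity

buffer = RingBuffer(capacity=3)
for experience in ["e0", "e1", "e2", "e3"]:
    buffer.add(experience)

print(buffer.data)
```

Unlike a deque, the entries that are not overwritten keep their index, so the mapping between a leaf of the tree and its experience stays valid.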

In [None]:
class Memory(object):  # stored as ( s, a, r, s_ ) in SumTree
    """
    This Memory code is a modified version; the original code is from:
    https://github.com/jaara/AI-blog/blob/master/Seaquest-DDQN-PER.py
    """
    PER_e = 0.01  # Hyperparameter that we use to avoid some experiences having 0 probability of being sampled
    PER_a = 0.6  # Hyperparameter that we use to make a tradeoff between sampling only experiences with high priority and sampling randomly
    PER_b = 0.4  # importance-sampling exponent, increasing from its initial value to 1
    
    PER_b_increment_per_sampling = 0.001
    
    absolute_error_upper = 1.  # clipped abs error

    def __init__(self, capacity):
        # Making the tree 
        """
        Remember that our tree is composed of a sum tree that contains the priority scores at its leaves
        and also a data array.
        We don't use a deque because it would mean that at each timestep our experiences change index by one.
        We prefer to use a simple array and to overwrite when the memory is full.
        """
        self.tree = SumTree(capacity)
        
    """
    Store a new experience in our tree
    Each new experience has a score of max_priority (it will then be improved when we use this exp to train our DDQN)
    """
    def store(self, experience):
        # Find the max priority
        max_priority = np.max(self.tree.tree[-self.tree.capacity:])
        
        # If the max priority = 0 we can't put priority = 0 since this exp will never have a chance to be selected
        # So we use a minimum priority
        if max_priority == 0:
            max_priority = self.absolute_error_upper
        
        self.tree.add(max_priority, experience)   # set the max p for new p

        
    """
    - First, to sample a minibatch of size k, the range [0, priority_total] is divided into k ranges.
    - Then a value is uniformly sampled from each range
    - We search the sumtree and retrieve the experiences whose priority scores correspond to the sampled values.
    - Then, we calculate IS weights for each minibatch element
    """
    def sample(self, n):
        # Create a sample array that will contain the minibatch
        memory_b = []
        
        b_idx, b_ISWeights = np.empty((n,), dtype=np.int32), np.empty((n, 1), dtype=np.float32)
        
        # Calculate the priority segment
        # Here, as explained in the paper, we divide the Range[0, ptotal] into n ranges
        priority_segment = self.tree.total_priority / n       # priority segment
    
        # Here we increase PER_b each time we sample a new minibatch
        self.PER_b = np.min([1., self.PER_b + self.PER_b_increment_per_sampling])  # max = 1
        
        # Calculating the max_weight
        p_min = np.min(self.tree.tree[-self.tree.capacity:]) / self.tree.total_priority
        max_weight = (p_min * n) ** (-self.PER_b)
        
        for i in range(n):
            """
            A value is uniformly sampled from each range
            """
            a, b = priority_segment * i, priority_segment * (i + 1)
            value = np.random.uniform(a, b)
            
            """
            The experience that corresponds to each value is retrieved
            """
            index, priority, data = self.tree.get_leaf(value)
            
            #P(j)
            sampling_probabilities = priority / self.tree.total_priority
            
            #  IS = (1/N * 1/P(i))**b /max wi == (N*P(i))**-b  /max wi
            b_ISWeights[i, 0] = np.power(n * sampling_probabilities, -self.PER_b)/ max_weight
                                   
            b_idx[i]= index
            
            experience = [data]
            
            memory_b.append(experience)
        
        return b_idx, memory_b, b_ISWeights
    
    """
    Update the priorities on the tree
    """
    def batch_update(self, tree_idx, abs_errors):
        abs_errors += self.PER_e  # add a small constant so that no priority is 0
        clipped_errors = np.minimum(abs_errors, self.absolute_error_upper)
        ps = np.power(clipped_errors, self.PER_a)

        for ti, p in zip(tree_idx, ps):
            self.tree.update(ti, p)

Here we'll **deal with the empty memory problem**: we pre-populate our memory by taking random actions and storing the experience.

In [None]:
# Instantiate memory
memory = Memory(memory_size)

# Start a new episode
game.new_episode()
reward = 0

game_state = game.get_state()
misc = game_state.game_variables  # [KILLCOUNT, AMMO, HEALTH]
prev_misc = misc

for i in range(pretrain_length):
    # If it's the first step
    if i == 0:
        # First we need a state
        state = game.get_state().screen_buffer
        state, stacked_frames = stack_frames(stacked_frames, state, True)  
    # Random action
    action = random.choice(possible_actions)
    # Added by Karim: read the game variables before acting
    game_state = game.get_state()
    misc = game_state.game_variables  # [KILLCOUNT, AMMO, HEALTH]
    game.set_action(action)
    # Frame skip of 4: we repeat the same action for 4 tics (we also stack 4 frames)
    game.advance_action(4)
    reward = game.get_last_reward()
    # Added by Karim: shape the reward from the change in game variables
    reward = shape_reward(reward, misc, prev_misc)
    
    # Look if the episode is finished
    done = game.is_episode_finished()

    # If we're dead
    if done:
        # We finished the episode
        next_state = np.zeros(state.shape)
        
        # Add experience to memory
        #experience = np.hstack((state, [action, reward], next_state, done))
        
        experience = state, action, reward, next_state, done
        memory.store(experience)
        reward = 0
        # Start a new episode
        game.new_episode()
        
        # First we need a state
        state = game.get_state().screen_buffer
        
        # Stack the frames
        state, stacked_frames = stack_frames(stacked_frames, state, True)
        
    else:
        # Get the next state
        next_state = game.get_state().screen_buffer
        next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)
        
        # Add experience to memory
        experience = state, action, reward, next_state, done
        memory.store(experience)
        
        # Our state is now the next_state
        state = next_state
    if i % 30000 == 0:
        print("processed " + str(i) + " items")
    # print("MISC: " + str(misc) + "\tPrev Misc: " + str(prev_misc))
    prev_misc = misc    

processed 0 items
processed 30000 items
processed 60000 items
processed 90000 items
processed 120000 items
processed 150000 items
processed 180000 items
processed 210000 items
processed 240000 items
processed 270000 items


## Step 7: Set up Tensorboard 📊
For more information about TensorBoard, please watch this <a href="https://www.youtube.com/embed/eBbEDRsCmv4">excellent 30-minute tutorial</a> <br><br>
To launch TensorBoard from the notebook's directory (the writer below logs to a relative path): `tensorboard --logdir=./tensorboard/dddqn/1`

In [None]:
# Setup TensorBoard Writer
writer = tf.summary.FileWriter(r".\tensorboard\dddqn\1")

## Losses
tf.summary.scalar("Loss", DQNetwork.loss)

write_op = tf.summary.merge_all()

## Step 8: Train our Agent 🏃‍♂️

Our algorithm:
<br>
* Initialize the weights $w$ of the DQN
* Initialize the target network weights $w^- \leftarrow w$
* Initialize the environment
* Initialize decay_step (used to reduce $\epsilon$)
<br><br>
* **For** each episode up to max_episode **do** 
    * Make new episode
    * Set step to 0
    * Observe the first state $s_0$
    <br><br>
    * **While** step < max_steps **do**:
        * Increase decay_step
        * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
        * Execute action $a_t$ in the simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
        * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
        
        * Sample a mini-batch $<s, a, r, s'>$ from $D$ (weighted by priority)
        * Set target $\hat{Q} = r$ if the episode ends at $s'$, otherwise set $\hat{Q} = r + \gamma Q(s', \mathrm{argmax}_{a'} Q(s', a', w), w^-)$
        * Make a gradient descent step with loss $(\hat{Q} - Q(s, a))^2$
        * Every C steps, reset: $w^- \leftarrow w$
    * **endwhile**
    <br><br>
* **endfor**
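
The target rule in the list above can be sketched in NumPy as follows. The function name `double_dqn_targets` and the toy arrays are hypothetical; `q_online_next` stands for $Q(s', \cdot, w)$ and `q_target_next` for $Q(s', \cdot, w^-)$. It is a vectorized illustration of the rule, not the training code of this notebook.

```python
import numpy as np


def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma):
    """Compute Double DQN targets for a mini-batch.

    rewards:        shape (batch,)
    dones:          shape (batch,), True where s' is terminal
    q_online_next:  shape (batch, n_actions), Q(s', ., w)
    q_target_next:  shape (batch, n_actions), Q(s', ., w-)
    """
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    q_online_next = np.asarray(q_online_next, dtype=float)
    q_target_next = np.asarray(q_target_next, dtype=float)

    # The online network selects the action a' ...
    best_actions = np.argmax(q_online_next, axis=1)
    # ... and the target network evaluates it.
    next_values = q_target_next[np.arange(len(rewards)), best_actions]
    # Terminal transitions keep only the reward.
    return rewards + gamma * next_values * (~dones)


targets = double_dqn_targets(
    rewards=[1.0, -1.0],
    dones=[False, True],
    q_online_next=[[0.2, 0.9], [0.5, 0.1]],
    q_target_next=[[2.0, 4.0], [3.0, 1.0]],
    gamma=0.5,
)
# First transition: 1.0 + 0.5 * 4.0 = 3.0; second is terminal, so the target is -1.0
print(targets)
```

Note that for the first transition the target network's own maximum (4.0) happens to coincide with the action chosen by the online network; when the two networks disagree, the target uses the online network's choice, which is what reduces the overestimation of plain DQN.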

    

In [None]:
"""
This function implements this step of the algorithm:
With probability ϵ select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a)
"""
def predict_action(explore_start, explore_stop, decay_rate, decay_step, state, actions):
    ## EPSILON GREEDY STRATEGY
    # Choose action a from state s using epsilon greedy.
    ## First we draw a random number in [0, 1)
    exp_exp_tradeoff = np.random.rand()

    # Here we use an improved version of the epsilon greedy strategy from the Q-learning notebook:
    # epsilon decays exponentially from explore_start towards explore_stop
    explore_probability = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)
    
    if explore_probability > exp_exp_tradeoff:
        # Take a random action (exploration)
        action = random.choice(actions)
        
    else:
        # Get action from Q-network (exploitation)
        # Estimate the Q-values for this state
        Qs = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: state.reshape((1, *state.shape))})
        
        # Take the biggest Q value (= the best action)
        choice = np.argmax(Qs)
        action = actions[int(choice)]
                
    return action, explore_probability
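
To get a feel for the decay formula in `predict_action`, here is a small sketch that evaluates it at a few decay steps. The default values of `explore_start`, `explore_stop`, and `decay_rate` below are assumptions for illustration; the notebook's own hyperparameters may differ.

```python
import math


def explore_probability(decay_step, explore_start=1.0, explore_stop=0.01, decay_rate=0.00005):
    """Exponentially decayed epsilon, same formula as in predict_action.

    The default hyperparameters are assumed values, not taken from the notebook.
    """
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * decay_step)


# Epsilon starts at explore_start and approaches explore_stop from above.
for step in (0, 10_000, 100_000, 1_000_000):
    print(step, round(explore_probability(step), 4))
```

Because `decay_step` is incremented at every environment step rather than once per episode, long episodes reduce epsilon faster than short ones.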

In [None]:
# This function copies one set of variables to another.
# Here we use it to copy the parameters of DQNetwork to TargetNetwork.
# Thanks to Arthur Juliani for the very good implementation: https://github.com/awjuliani
def update_target_graph():
    
    # Get the parameters of our DQNetwork
    from_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "DQNetwork")
    
    # Get the parameters of our TargetNetwork
    to_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "TargetNetwork")

    op_holder = []
    
    # Build one assign op per variable pair (TargetNetwork <- DQNetwork)
    for from_var, to_var in zip(from_vars, to_vars):
        op_holder.append(to_var.assign(from_var))
    return op_holder
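
As a framework-free illustration of what these assign ops do and when the training loop runs them, here is a small sketch in plain Python. The dictionaries stand in for the two networks' variables; `hard_update`, `sync_steps`, and the toy values are hypothetical names introduced for this example. The schedule mirrors the loop's `tau > max_tau` test.

```python
def hard_update(online_params, target_params):
    """Copy every online parameter into the target set (w- <- w)."""
    for name, value in online_params.items():
        # Copy the values so later online updates do not leak into the target.
        target_params[name] = list(value)


def sync_steps(total_steps, max_tau):
    """Return the steps at which the target network would be synced."""
    tau = 0
    synced_at = []
    for step in range(1, total_steps + 1):
        tau += 1
        if tau > max_tau:
            synced_at.append(step)
            tau = 0
    return synced_at


online = {"conv1/kernel": [0.1, 0.2], "fc/bias": [0.0]}
target = {"conv1/kernel": [9.9, 9.9], "fc/bias": [9.9]}

hard_update(online, target)
online["fc/bias"][0] = 1.0  # a later gradient step on the online network

print(target)
print(sync_steps(total_steps=10, max_tau=3))
```

With a strict `>` comparison the sync happens every `max_tau + 1` steps (steps 4 and 8 in this toy run), and between syncs the target parameters stay frozen while the online ones keep changing.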

In [None]:
# Saver will help us to save our model
saver = tf.train.Saver()
GAME = 0
t = 0
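max_life = 0    # Maximum episode life (Proxy for agent performance)
life = 0
loss = 0.0      # Last training loss; set up front so it can be logged even if an episode ends before the first training step
stats_store = []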

# Buffer to compute rolling statistics 
tot_reward_buffer, life_buffer, ammo_buffer, kills_buffer, mavg_score, var_score, mavg_ammo_left, mavg_kill_counts, mavg_tot_rewards = [], [] , [], [], [], [], [], [], [] 
losses_buffer, epsilon_buffer = [], []

game_state = game.get_state()
misc = game_state.game_variables  # [KILLCOUNT, AMMO, HEALTH]
prev_misc = misc


if training:
    with tf.Session() as sess:
        # Initialize the variables
        sess.run(tf.global_variables_initializer())
        
        # Initialize the decay step (used to reduce epsilon)
        decay_step = 0
        
        # Set tau = 0
        tau = 0

        # Init the game
        game.init()
        
        # Update the parameters of our TargetNetwork with DQN_weights
        update_target = update_target_graph()
        sess.run(update_target)
        
        for episode in range(total_episodes):
            # Set step to 0
            step = 0
            # set the reward
            reward = 0
            
            # Initialize the rewards of the episode
            episode_rewards = []
            
            # Make a new episode and observe the first state
            game.new_episode(r".\ep_rec_last\ep" + str(episode) + "_rec.lmp")
            
            state = game.get_state().screen_buffer
            if episode == 0:
                misc = game.get_state().game_variables  # [KILLCOUNT, AMMO, HEALTH]
                prev_misc = misc
            
            # Remember that stack frame function also call our preprocess function.
            state, stacked_frames = stack_frames(stacked_frames, state, True)
            
            
            while step < max_steps:
                step += 1
                
                # Increase the C step
                tau += 1
                
                # Increase decay_step
                decay_step += 1
                
                # With probability ϵ select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a)
                action, explore_probability = predict_action(explore_start, explore_stop, decay_rate, decay_step, state, possible_actions)

                # Do the action
                # Added by Karim: read the game variables before acting
                misc = game.get_state().game_variables  # [KILLCOUNT, AMMO, HEALTH]
                game.set_action(action)
                # Skip rate of 4: as we stack 4 frames, we ask to take the same action four times
                game.advance_action(4)
                # Original version:
                # reward = game.make_action(action)
                reward = game.get_last_reward()
                # Look if the episode is finished
                done = game.is_episode_finished()
                # Added by Karim: shape the raw reward with the current and previous game variables
                reward = shape_reward(reward, misc, prev_misc)
                # Add the reward to total reward
                episode_rewards.append(reward)
                
                # If the game is finished
                if done:
                    # Add by Karim
                    if life > max_life:
                        max_life = life
                    GAME += 1
                    life_buffer.append(life)
                    ammo_buffer.append(misc[1])
                    kills_buffer.append(misc[0])
                    life = 0
                    print("Episode Finish ", misc)
                    
                    # the episode ends so no next state
                    # 120, 140 => 64, 64
                    # np.int was removed in NumPy 1.24; the builtin int is equivalent
                    next_state = np.zeros((120,140), dtype=int)
                    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)

                    # Set step = max_steps to end the episode
                    step = max_steps

                    # Get the total reward of the episode
                    total_reward = np.sum(episode_rewards)
                    tot_reward_buffer.append(total_reward)
                    
                    losses_buffer.append(loss)
                    epsilon_buffer.append(explore_probability)
                    
                    print('Episode: {}'.format(episode),
                          'Total reward: {}'.format(total_reward),
                          'Training loss: {:.4f}'.format(loss),
                          'Epsilon: {:.4f}'.format(explore_probability),
                          'Life: {}'.format(max_life)
                         )
                    
                    # Set reward to 0
                    reward = 0
                    # Add experience to memory
                    experience = state, action, reward, next_state, done
                    memory.store(experience)

                else:
                    # Get the next state
                    next_state = game.get_state().screen_buffer
                    
                    # Stack the frame of the next_state
                    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)
                    

                    # Add experience to memory
                    experience = state, action, reward, next_state, done
                    memory.store(experience)
                    
                    # st+1 is now our current state
                    state = next_state
                    life += 1

                # print("Misc: " + str(misc) + "\tPrev Misc: " + str(prev_misc))
                # Add by Karim
                prev_misc = misc

                ### LEARNING PART            
                # Obtain random mini-batch from memory
                tree_idx, batch, ISWeights_mb = memory.sample(batch_size)
                
                states_mb = np.array([each[0][0] for each in batch], ndmin=3)
                actions_mb = np.array([each[0][1] for each in batch])
                rewards_mb = np.array([each[0][2] for each in batch]) 
                next_states_mb = np.array([each[0][3] for each in batch], ndmin=3)
                dones_mb = np.array([each[0][4] for each in batch])

                target_Qs_batch = []

                
                ### DOUBLE DQN Logic
                # Use DQNetwork to select the action to take at next_state (a') (action with the highest Q-value)
                # Use TargetNetwork to calculate the Q-value Q(s', a')
                
                # Get the DQNetwork Q-values for next_state
                q_next_state = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: next_states_mb})
                
                # Get the TargetNetwork Q-values for all actions at next_state
                q_target_next_state = sess.run(TargetNetwork.output, feed_dict = {TargetNetwork.inputs_: next_states_mb})
                
                
                # Set Q_target = r if the episode ends at s', otherwise set Q_target = r + gamma * Qtarget(s', a')
                for i in range(0, len(batch)):
                    terminal = dones_mb[i]
                    
                    # a' is the action that DQNetwork rates highest at next_state
                    best_action = np.argmax(q_next_state[i])

                    # In a terminal state the target is just the reward
                    if terminal:
                        target_Qs_batch.append(rewards_mb[i])
                        
                    else:
                        # Evaluate a' with TargetNetwork
                        target = rewards_mb[i] + gamma * q_target_next_state[i][best_action]
                        target_Qs_batch.append(target)
                        

                targets_mb = np.array(target_Qs_batch)

                
                _, loss, absolute_errors = sess.run([DQNetwork.optimizer, DQNetwork.loss, DQNetwork.absolute_errors],
                                    feed_dict={DQNetwork.inputs_: states_mb,
                                               DQNetwork.target_Q: targets_mb,
                                               DQNetwork.actions_: actions_mb,
                                              DQNetwork.ISWeights_: ISWeights_mb})
              
                
                
                # Update priority
                memory.batch_update(tree_idx, absolute_errors)
                
                
                # Write TF Summaries
                summary = sess.run(write_op, feed_dict={DQNetwork.inputs_: states_mb,
                                                   DQNetwork.target_Q: targets_mb,
                                                   DQNetwork.actions_: actions_mb,
                                              DQNetwork.ISWeights_: ISWeights_mb})
                writer.add_summary(summary, episode)
                writer.flush()
                
                if tau > max_tau:
                    # Sync TargetNetwork with DQNetwork, reusing the assign ops built before the loop
                    # (calling update_target_graph() here would add new ops to the graph at every sync)
                    sess.run(update_target)
                    tau = 0
                    print("Model updated")

            # Save model every 200 episodes
            if episode % 200 == 0:
                save_path = saver.save(sess, "./models/model.ckpt")
                print("Model Saved at " + str(episode))
                
            # Save stats every 50 episodes (added by Karim)
            if episode % 50 == 0:
                print("Stats Saved at " + str(episode))
                mavg_tot_rewards.append(np.mean(np.array(tot_reward_buffer)))
                mavg_score.append(np.mean(np.array(life_buffer)))
                var_score.append(np.var(np.array(life_buffer)))
                mavg_ammo_left.append(np.mean(np.array(ammo_buffer)))
                mavg_kill_counts.append(np.mean(np.array(kills_buffer)))
                with open(r".\ddqn_pr_steps_stats.txt", "w") as stats_file:
                    stats_file.write('Game: ' + str(GAME) + '\n')
                    stats_file.write('Max Score: ' + str(max_life) + '\n')
                    stats_file.write('mavg_score: ' + str(mavg_score) + '\n')
                    stats_file.write('var_score: ' + str(var_score) + '\n')
                    stats_file.write('mavg_ammo_left: ' + str(mavg_ammo_left) + '\n')
                    stats_file.write('mavg_kill_counts: ' + str(mavg_kill_counts) + '\n')
                    stats_file.write('mavg_rewards: ' + str(mavg_tot_rewards) + "\n")
                # list.append returns None, so build the snapshot first, then store and pickle it
                stats = {'game': GAME, 'max_score': max_life, 'mavg_score': mavg_score,
                         'var_score': var_score, 'mavg_ammo_left': mavg_ammo_left,
                         'mavg_kill_counts': mavg_kill_counts,
                         'mavg_tot_rewards': mavg_tot_rewards,
                         'life_buffer': life_buffer, 'ammo_buffer': ammo_buffer,
                         'kills_buffer': kills_buffer, 'tot_reward_buffer': tot_reward_buffer,
                         'losses': losses_buffer, 'epsilon': epsilon_buffer}
                stats_store.append(stats)
                with open(r".\ddqn_pr_steps_stats" + str(episode) + ".pickle", 'wb') as handle:
                    pickle.dump(stats, handle, protocol=pickle.HIGHEST_PROTOCOL)
                                        
    # Training is over: pickle the raw buffers
    # (list.append returns None, so build the dict first, then store and pickle it)
    buffers = {'life_buffer': life_buffer, 'ammo_buffer': ammo_buffer,
               'kills_buffer': kills_buffer, 'tot_reward_buffer': tot_reward_buffer,
               'losses': losses_buffer, 'epsilon': epsilon_buffer}
    stats_store.append(buffers)
    with open(r".\buffer_dic_data.pickle", 'wb') as handle:
        pickle.dump(buffers, handle, protocol=pickle.HIGHEST_PROTOCOL)

Episode Finish  [ 0. 14. 24.]
Episode: 0 Total reward: -2.5 Training loss: 0.0175 Epsilon: 0.9967 Life: 65
Model Saved at 0
Stats Saved at 0
Episode Finish  [ 3.  6. 24.]
Episode: 1 Total reward: 2.3 Training loss: 0.0045 Epsilon: 0.9917 Life: 102
Episode Finish  [ 1. 12. 24.]
Episode: 2 Total reward: -0.6999999999999998 Training loss: 0.0150 Epsilon: 0.9877 Life: 102
Episode Finish  [ 0. 15. 24.]
Episode: 3 Total reward: -2.4 Training loss: 0.0021 Epsilon: 0.9845 Life: 102
Episode Finish  [ 1. 10. 36.]
Episode: 4 Total reward: -1.0 Training loss: 0.0084 Epsilon: 0.9806 Life: 102
Episode Finish  [ 1. 16. 24.]
Episode: 5 Total reward: -0.30000000000000004 Training loss: 0.0052 Epsilon: 0.9774 Life: 102
Episode Finish  [ 1. 16. 24.]
Episode: 6 Total reward: -0.30000000000000004 Training loss: 0.0034 Epsilon: 0.9743 Life: 102
Episode Finish  [ 0. 15. 24.]
Episode: 7 Total reward: -2.4 Training loss: 0.0041 Epsilon: 0.9711 Life: 102
Episode Finish  [ 2.  8. 10.]
Episode: 8 Total reward: 0.

Episode Finish  [2. 8. 4.]
Episode: 71 Total reward: 0.6000000000000001 Training loss: 0.0042 Epsilon: 0.7566 Life: 110
Episode Finish  [2. 8. 6.]
Episode: 72 Total reward: 0.7000000000000002 Training loss: 0.0179 Epsilon: 0.7527 Life: 110
Episode Finish  [ 1.  8. 24.]
Episode: 73 Total reward: -1.0999999999999999 Training loss: 0.0095 Epsilon: 0.7497 Life: 110
Episode Finish  [ 2. 10. 24.]
Episode: 74 Total reward: 1.1000000000000005 Training loss: 0.0133 Epsilon: 0.7465 Life: 110
Episode Finish  [ 0. 10.  8.]
Episode: 75 Total reward: -3.1 Training loss: 0.0135 Epsilon: 0.7434 Life: 110
Episode Finish  [ 2. 14. 20.]
Episode: 76 Total reward: 1.4999999999999996 Training loss: 0.0149 Epsilon: 0.7406 Life: 110
Episode Finish  [ 4.  6. 10.]
Episode: 77 Total reward: 4.5 Training loss: 0.0219 Epsilon: 0.7361 Life: 123
Episode Finish  [ 3.  3. 10.]
Episode: 78 Total reward: 2.3000000000000003 Training loss: 0.0220 Epsilon: 0.7325 Life: 123
Episode Finish  [ 0. 14. 24.]
Episode: 79 Total re

Episode Finish  [ 1. 10.  8.]
Episode: 141 Total reward: -1.0999999999999996 Training loss: 0.0333 Epsilon: 0.5720 Life: 123
Episode Finish  [2. 6. 4.]
Episode: 142 Total reward: 0.49999999999999956 Training loss: 0.0239 Epsilon: 0.5693 Life: 123
Episode Finish  [ 1. 12. 28.]
Episode: 143 Total reward: -0.9 Training loss: 0.0228 Epsilon: 0.5677 Life: 123
Episode Finish  [ 1. 11. 24.]
Episode: 144 Total reward: -0.8 Training loss: 0.0204 Epsilon: 0.5654 Life: 123
Episode Finish  [4. 9. 8.]
Episode: 145 Total reward: 4.5 Training loss: 0.0309 Epsilon: 0.5623 Life: 123
Episode Finish  [ 1. 15. 24.]
Episode: 146 Total reward: -0.3999999999999999 Training loss: 0.0261 Epsilon: 0.5608 Life: 123
Episode Finish  [ 2. 11. 24.]
Episode: 147 Total reward: 1.2000000000000002 Training loss: 0.0252 Epsilon: 0.5588 Life: 123
Episode Finish  [2. 3. 8.]
Episode: 148 Total reward: 0.09999999999999987 Training loss: 0.0351 Epsilon: 0.5561 Life: 123
Episode Finish  [ 0. 14.  8.]
Episode: 149 Total reward:

Episode Finish  [ 1. 14. 24.]
Episode: 210 Total reward: -0.3999999999999997 Training loss: 0.0389 Epsilon: 0.4378 Life: 123
Episode Finish  [ 2. 12. 24.]
Episode: 211 Total reward: 1.3000000000000003 Training loss: 0.0371 Epsilon: 0.4363 Life: 123
Episode Finish  [2. 8. 4.]
Episode: 212 Total reward: 1.7999999999999998 Training loss: 0.0378 Epsilon: 0.4346 Life: 123
Episode Finish  [ 1. 13.  4.]
Episode: 213 Total reward: -0.6000000000000001 Training loss: 0.0306 Epsilon: 0.4334 Life: 123
Episode Finish  [ 2. 10.  6.]
Episode: 214 Total reward: 0.8999999999999999 Training loss: 0.0140 Epsilon: 0.4315 Life: 123
Episode Finish  [ 4.  3. 28.]
Episode: 215 Total reward: 4.200000000000001 Training loss: 0.0364 Epsilon: 0.4292 Life: 123
Episode Finish  [ 1.  8. 24.]
Episode: 216 Total reward: -1.0999999999999996 Training loss: 0.0170 Epsilon: 0.4275 Life: 123
Episode Finish  [ 2. 11. 24.]
Episode: 217 Total reward: 1.2000000000000002 Training loss: 0.0324 Epsilon: 0.4262 Life: 123
Episode F

Episode Finish  [ 2. 12.  8.]
Episode: 280 Total reward: 1.2000000000000002 Training loss: 0.0255 Epsilon: 0.3290 Life: 123
Episode Finish  [4. 3. 4.]
Episode: 281 Total reward: 4.2 Training loss: 0.0234 Epsilon: 0.3272 Life: 123
Episode Finish  [ 3.  9. 24.]
Episode: 282 Total reward: 3.0 Training loss: 0.0245 Epsilon: 0.3259 Life: 123
Episode Finish  [ 4.  0. 16.]
Episode: 283 Total reward: 4.0 Training loss: 0.0259 Epsilon: 0.3242 Life: 123
Episode Finish  [ 2. 12. 12.]
Episode: 284 Total reward: 1.2000000000000002 Training loss: 0.0125 Epsilon: 0.3230 Life: 123
Episode Finish  [ 4. 11. 24.]
Episode: 285 Total reward: 5.2 Training loss: 0.0182 Epsilon: 0.3220 Life: 123
Episode Finish  [ 1. 12. 24.]
Episode: 286 Total reward: -0.7 Training loss: 0.0188 Epsilon: 0.3210 Life: 123
Episode Finish  [ 1. 13. 24.]
Episode: 287 Total reward: -0.5999999999999998 Training loss: 0.0284 Epsilon: 0.3201 Life: 123
Episode Finish  [ 1. 15. 24.]
Episode: 288 Total reward: -0.3999999999999998 Trainin

Episode Finish  [ 1. 11. 24.]
Episode: 350 Total reward: -0.8 Training loss: 0.0265 Epsilon: 0.2514 Life: 123
Stats Saved at 350
Episode Finish  [ 2.  7. 24.]
Episode: 351 Total reward: 0.8000000000000003 Training loss: 0.0114 Epsilon: 0.2505 Life: 123
Episode Finish  [ 2.  9. 28.]
Episode: 352 Total reward: 1.0 Training loss: 0.0338 Epsilon: 0.2496 Life: 123
Episode Finish  [ 1. 14.  8.]
Episode: 353 Total reward: -0.6 Training loss: 0.0127 Epsilon: 0.2488 Life: 123
Episode Finish  [ 1. 13. 24.]
Episode: 354 Total reward: -0.6000000000000001 Training loss: 0.0207 Epsilon: 0.2481 Life: 123
Episode Finish  [ 2. 13.  4.]
Episode: 355 Total reward: 1.2999999999999998 Training loss: 0.0185 Epsilon: 0.2473 Life: 123
Episode Finish  [ 0. 13. 24.]
Episode: 356 Total reward: -2.6 Training loss: 0.0189 Epsilon: 0.2465 Life: 123
Episode Finish  [ 2. 12. 24.]
Episode: 357 Total reward: 1.4000000000000004 Training loss: 0.0133 Epsilon: 0.2457 Life: 123
Episode Finish  [4. 8. 8.]
Episode: 358 Total

Episode Finish  [ 2. 11. 24.]
Episode: 421 Total reward: 1.1 Training loss: 0.0192 Epsilon: 0.1821 Life: 209
Episode Finish  [ 5. 11.  6.]
Episode: 422 Total reward: 7.000000000000001 Training loss: 0.0237 Epsilon: 0.1812 Life: 209
Episode Finish  [ 8.  0. 28.]
Episode: 423 Total reward: 11.999999999999998 Training loss: 0.0209 Epsilon: 0.1797 Life: 209
Episode Finish  [ 2. 13. 24.]
Episode: 424 Total reward: 1.2999999999999998 Training loss: 0.0184 Epsilon: 0.1790 Life: 209
Episode Finish  [8. 4. 2.]
Episode: 425 Total reward: 12.000000000000002 Training loss: 0.0254 Epsilon: 0.1777 Life: 209
Episode Finish  [ 6.  7. 28.]
Episode: 426 Total reward: 8.600000000000001 Training loss: 0.0249 Epsilon: 0.1767 Life: 209
Episode Finish  [4. 7. 8.]
Episode: 427 Total reward: 4.6000000000000005 Training loss: 0.0214 Epsilon: 0.1758 Life: 209
Episode Finish  [ 2. 13. 24.]
Episode: 428 Total reward: 1.4 Training loss: 0.0211 Epsilon: 0.1752 Life: 209
Episode Finish  [ 3. 12.  8.]
Episode: 429 Tot

Episode Finish  [ 1. 12.  8.]
Episode: 492 Total reward: -0.7 Training loss: 0.0283 Epsilon: 0.1279 Life: 209
Episode Finish  [5. 6. 4.]
Episode: 493 Total reward: 6.4 Training loss: 0.0169 Epsilon: 0.1272 Life: 209
Episode Finish  [4. 4. 8.]
Episode: 494 Total reward: 4.199999999999999 Training loss: 0.0213 Epsilon: 0.1266 Life: 209
Episode Finish  [ 1. 18. 24.]
Episode: 495 Total reward: -0.09999999999999998 Training loss: 0.0238 Epsilon: 0.1263 Life: 209
Episode Finish  [ 6. 12.  8.]
Episode: 496 Total reward: 9.1 Training loss: 0.0199 Epsilon: 0.1256 Life: 209
Episode Finish  [ 2. 12. 24.]
Episode: 497 Total reward: 1.2999999999999998 Training loss: 0.0250 Epsilon: 0.1251 Life: 209
Episode Finish  [4. 9. 8.]
Episode: 498 Total reward: 4.7 Training loss: 0.0183 Epsilon: 0.1245 Life: 209
Episode Finish  [10.  5.  4.]
Episode: 499 Total reward: 16.0 Training loss: 0.0187 Epsilon: 0.1236 Life: 209
Episode Finish  [11.  1.  8.]
Episode: 500 Total reward: 18.200000000000003 Training loss

Episode Finish  [7. 5. 4.]
Episode: 562 Total reward: 10.200000000000001 Training loss: 0.0165 Epsilon: 0.0877 Life: 238
Episode Finish  [ 6. 11. 14.]
Episode: 563 Total reward: 8.8 Training loss: 0.0281 Epsilon: 0.0873 Life: 238
Episode Finish  [ 3. 14. 24.]
Episode: 564 Total reward: 3.5 Training loss: 0.0184 Epsilon: 0.0870 Life: 238
Episode Finish  [ 9.  5. 10.]
Episode: 565 Total reward: 14.5 Training loss: 0.0263 Epsilon: 0.0863 Life: 238
Episode Finish  [10.  4. 12.]
Episode: 566 Total reward: 16.4 Training loss: 0.0212 Epsilon: 0.0857 Life: 238
Episode Finish  [4. 9. 8.]
Episode: 567 Total reward: 4.8 Training loss: 0.0301 Epsilon: 0.0852 Life: 238
Episode Finish  [12.  0.  2.]
Episode: 568 Total reward: 19.900000000000002 Training loss: 0.0218 Epsilon: 0.0845 Life: 238
Episode Finish  [10.  7.  4.]
Episode: 569 Total reward: 16.2 Training loss: 0.0295 Epsilon: 0.0838 Life: 238
Episode Finish  [7. 4. 2.]
Episode: 570 Total reward: 11.100000000000001 Training loss: 0.0370 Epsilo

Episode Finish  [9. 1. 8.]
Episode: 634 Total reward: 13.899999999999999 Training loss: 0.0277 Epsilon: 0.0581 Life: 238
Episode Finish  [8. 2. 8.]
Episode: 635 Total reward: 12.0 Training loss: 0.0262 Epsilon: 0.0578 Life: 238
Episode Finish  [ 5. 15. 24.]
Episode: 636 Total reward: 7.6 Training loss: 0.0225 Epsilon: 0.0576 Life: 238
Episode Finish  [ 8.  7. 10.]
Episode: 637 Total reward: 12.7 Training loss: 0.0152 Epsilon: 0.0573 Life: 238
Episode Finish  [ 9.  0. 24.]
Episode: 638 Total reward: 14.1 Training loss: 0.0179 Epsilon: 0.0569 Life: 238
Episode Finish  [ 8.  1. 14.]
Episode: 639 Total reward: 11.7 Training loss: 0.0187 Epsilon: 0.0565 Life: 238
Episode Finish  [10.  0.  8.]
Episode: 640 Total reward: 16.0 Training loss: 0.0243 Epsilon: 0.0561 Life: 238
Episode Finish  [ 4. 13. 10.]
Episode: 641 Total reward: 5.3 Training loss: 0.0288 Epsilon: 0.0559 Life: 238
Episode Finish  [ 4. 17. 24.]
Episode: 642 Total reward: 5.9 Training loss: 0.0166 Epsilon: 0.0557 Life: 238
Episo

Episode Finish  [9. 0. 6.]
Episode: 705 Total reward: 13.6 Training loss: 0.0165 Epsilon: 0.0407 Life: 238
Episode Finish  [14.  0. 14.]
Episode: 706 Total reward: 24.0 Training loss: 0.0169 Epsilon: 0.0404 Life: 238
Episode Finish  [9. 0. 4.]
Episode: 707 Total reward: 13.899999999999999 Training loss: 0.0329 Epsilon: 0.0401 Life: 238
Episode Finish  [ 8. 12. 12.]
Episode: 708 Total reward: 13.200000000000001 Training loss: 0.0226 Epsilon: 0.0399 Life: 238
Model updated
Episode Finish  [ 6. 10.  6.]
Episode: 709 Total reward: 8.9 Training loss: 0.0255 Epsilon: 0.0398 Life: 238
Episode Finish  [ 3. 17. 24.]
Episode: 710 Total reward: 4.800000000000001 Training loss: 0.0323 Epsilon: 0.0397 Life: 238
Episode Finish  [ 5. 16.  4.]
Episode: 711 Total reward: 7.500000000000002 Training loss: 0.0266 Epsilon: 0.0395 Life: 238
Episode Finish  [10.  2.  4.]
Episode: 712 Total reward: 15.9 Training loss: 0.0308 Epsilon: 0.0393 Life: 238
Episode Finish  [ 5.  9. 10.]
Episode: 713 Total reward: 6.

Episode Finish  [10.  3. 16.]
Episode: 778 Total reward: 16.4 Training loss: 0.0282 Epsilon: 0.0278 Life: 238
Episode Finish  [13.  0.  8.]
Episode: 779 Total reward: 21.7 Training loss: 0.0198 Epsilon: 0.0276 Life: 238
Episode Finish  [11.  0. 24.]
Episode: 780 Total reward: 18.1 Training loss: 0.0201 Epsilon: 0.0275 Life: 238
Episode Finish  [ 5. 10. 24.]
Episode: 781 Total reward: 7.000000000000002 Training loss: 0.0238 Epsilon: 0.0274 Life: 238
Episode Finish  [11.  0.  4.]
Episode: 782 Total reward: 17.799999999999997 Training loss: 0.0202 Epsilon: 0.0272 Life: 238
Episode Finish  [ 7.  8. 20.]
Episode: 783 Total reward: 10.9 Training loss: 0.0270 Epsilon: 0.0271 Life: 238
Episode Finish  [8. 0. 2.]
Episode: 784 Total reward: 11.700000000000003 Training loss: 0.0171 Epsilon: 0.0270 Life: 238
Episode Finish  [14.  0. 24.]
Episode: 785 Total reward: 24.1 Training loss: 0.0297 Epsilon: 0.0268 Life: 238
Episode Finish  [12.  0.  2.]
Episode: 786 Total reward: 19.7 Training loss: 0.026

Episode Finish  [ 4. 10.  8.]
Episode: 850 Total reward: 4.800000000000001 Training loss: 0.0323 Epsilon: 0.0203 Life: 269
Stats Saved at 850
Episode Finish  [ 8.  0. 12.]
Episode: 851 Total reward: 11.600000000000001 Training loss: 0.0235 Epsilon: 0.0202 Life: 269
Episode Finish  [ 6. 13. 24.]
Episode: 852 Total reward: 9.4 Training loss: 0.0255 Epsilon: 0.0202 Life: 269
Episode Finish  [ 8.  0. 24.]
Episode: 853 Total reward: 12.100000000000001 Training loss: 0.0262 Epsilon: 0.0201 Life: 269
Episode Finish  [11.  0. 16.]
Episode: 854 Total reward: 18.0 Training loss: 0.0246 Epsilon: 0.0200 Life: 269
Episode Finish  [13.  0. 24.]
Episode: 855 Total reward: 22.1 Training loss: 0.0217 Epsilon: 0.0200 Life: 269
Episode Finish  [ 4. 14. 24.]
Episode: 856 Total reward: 5.5 Training loss: 0.0315 Epsilon: 0.0199 Life: 269
Episode Finish  [ 7. 10. 24.]
Episode: 857 Total reward: 11.1 Training loss: 0.0253 Epsilon: 0.0199 Life: 269
Episode Finish  [ 8.  4. 20.]
Episode: 858 Total reward: 12.29

Episode Finish  [10.  0.  8.]
Episode: 921 Total reward: 15.600000000000001 Training loss: 0.0334 Epsilon: 0.0160 Life: 269
Episode Finish  [ 9.  3. 10.]
Episode: 922 Total reward: 14.100000000000001 Training loss: 0.0232 Epsilon: 0.0159 Life: 269
Episode Finish  [6. 0. 8.]
Episode: 923 Total reward: 7.800000000000001 Training loss: 0.0266 Epsilon: 0.0159 Life: 269
Episode Finish  [14.  0. 24.]
Episode: 924 Total reward: 24.0 Training loss: 0.0265 Epsilon: 0.0158 Life: 269
Episode Finish  [13.  0. 16.]
Episode: 925 Total reward: 22.0 Training loss: 0.0304 Epsilon: 0.0157 Life: 269
Episode Finish  [10.  0. 48.]
Episode: 926 Total reward: 15.899999999999999 Training loss: 0.0230 Epsilon: 0.0157 Life: 269
Episode Finish  [13.  0. 34.]
Episode: 927 Total reward: 22.0 Training loss: 0.0265 Epsilon: 0.0156 Life: 269
Episode Finish  [10.  0. 24.]
Episode: 928 Total reward: 15.9 Training loss: 0.0187 Epsilon: 0.0156 Life: 269
Episode Finish  [ 9.  4. 10.]
Episode: 929 Total reward: 13.89999999

Episode Finish  [12.  0. 24.]
Episode: 993 Total reward: 20.099999999999998 Training loss: 0.0190 Epsilon: 0.0132 Life: 269
Episode Finish  [ 5. 12.  8.]
Episode: 994 Total reward: 7.1 Training loss: 0.0310 Epsilon: 0.0132 Life: 269
Episode Finish  [14.  0. 24.]
Episode: 995 Total reward: 24.1 Training loss: 0.0236 Epsilon: 0.0132 Life: 269
Episode Finish  [14.  0.  4.]
Episode: 996 Total reward: 23.799999999999997 Training loss: 0.0226 Epsilon: 0.0131 Life: 269
Episode Finish  [10.  2. 24.]
Episode: 997 Total reward: 16.3 Training loss: 0.0263 Epsilon: 0.0131 Life: 269
Episode Finish  [12.  3. 24.]
Episode: 998 Total reward: 20.4 Training loss: 0.0231 Epsilon: 0.0131 Life: 269
Episode Finish  [10.  0. 10.]
Episode: 999 Total reward: 15.899999999999999 Training loss: 0.0319 Epsilon: 0.0131 Life: 269
Episode Finish  [5. 6. 8.]
Episode: 1000 Total reward: 6.399999999999999 Training loss: 0.0227 Epsilon: 0.0130 Life: 269
Model Saved at 1000
Stats Saved at 1000
Episode Finish  [ 8.  9. 24.

Episode Finish  [9. 4. 4.]
Episode: 1064 Total reward: 14.2 Training loss: 0.0199 Epsilon: 0.0118 Life: 269
Episode Finish  [10.  4. 24.]
Episode: 1065 Total reward: 16.4 Training loss: 0.0264 Epsilon: 0.0118 Life: 269
Episode Finish  [ 9. 13. 24.]
Episode: 1066 Total reward: 16.4 Training loss: 0.0232 Epsilon: 0.0118 Life: 269
Episode Finish  [11.  4. 24.]
Episode: 1067 Total reward: 18.5 Training loss: 0.0154 Epsilon: 0.0118 Life: 269
Episode Finish  [8. 6. 8.]
Episode: 1068 Total reward: 12.399999999999999 Training loss: 0.0260 Epsilon: 0.0117 Life: 269
Episode Finish  [ 4. 17. 24.]
Episode: 1069 Total reward: 5.8 Training loss: 0.0300 Epsilon: 0.0117 Life: 269
Episode Finish  [ 7. 10. 24.]
Episode: 1070 Total reward: 11.1 Training loss: 0.0261 Epsilon: 0.0117 Life: 269
Episode Finish  [ 5. 11.  8.]
Episode: 1071 Total reward: 6.700000000000001 Training loss: 0.0255 Epsilon: 0.0117 Life: 269
Episode Finish  [10.  0. 14.]
Episode: 1072 Total reward: 16.0 Training loss: 0.0242 Epsilon

Episode Finish  [11.  1. 24.]
Episode: 1135 Total reward: 18.2 Training loss: 0.0268 Epsilon: 0.0110 Life: 269
Episode Finish  [12.  2. 22.]
Episode: 1136 Total reward: 20.0 Training loss: 0.0205 Epsilon: 0.0110 Life: 269
Episode Finish  [12.  1.  6.]
Episode: 1137 Total reward: 20.0 Training loss: 0.0380 Epsilon: 0.0110 Life: 269
Episode Finish  [12.  4.  8.]
Episode: 1138 Total reward: 20.200000000000003 Training loss: 0.0264 Epsilon: 0.0110 Life: 269
Episode Finish  [14.  0. 28.]
Episode: 1139 Total reward: 24.0 Training loss: 0.0238 Epsilon: 0.0110 Life: 269
Episode Finish  [ 6. 18. 24.]
Episode: 1140 Total reward: 9.9 Training loss: 0.0151 Epsilon: 0.0110 Life: 269
Episode Finish  [ 8. 13. 32.]
Episode: 1141 Total reward: 13.5 Training loss: 0.0238 Epsilon: 0.0110 Life: 269
Episode Finish  [ 9.  4. 16.]
Episode: 1142 Total reward: 14.5 Training loss: 0.0402 Epsilon: 0.0110 Life: 269
Episode Finish  [ 8.  6. 16.]
Episode: 1143 Total reward: 12.7 Training loss: 0.0202 Epsilon: 0.011

Episode Finish  [ 8.  5. 16.]
Episode: 1206 Total reward: 12.3 Training loss: 0.0295 Epsilon: 0.0106 Life: 300
Episode Finish  [10.  7.  4.]
Episode: 1207 Total reward: 16.5 Training loss: 0.0298 Epsilon: 0.0106 Life: 300
Episode Finish  [7. 9. 4.]
Episode: 1208 Total reward: 10.7 Training loss: 0.0198 Epsilon: 0.0106 Life: 300
Episode Finish  [11.  3. 10.]
Episode: 1209 Total reward: 18.299999999999997 Training loss: 0.0296 Epsilon: 0.0106 Life: 300
Episode Finish  [ 6. 10. 12.]
Episode: 1210 Total reward: 8.7 Training loss: 0.0250 Epsilon: 0.0106 Life: 300
Episode Finish  [ 8. 10.  6.]
Episode: 1211 Total reward: 12.6 Training loss: 0.0223 Epsilon: 0.0106 Life: 300
Episode Finish  [ 5. 11.  8.]
Episode: 1212 Total reward: 6.800000000000001 Training loss: 0.0201 Epsilon: 0.0106 Life: 300
Episode Finish  [ 9.  0. 24.]
Episode: 1213 Total reward: 14.100000000000001 Training loss: 0.0225 Epsilon: 0.0106 Life: 300
Episode Finish  [15.  0. 24.]
Episode: 1214 Total reward: 26.1 Training los

Episode Finish  [10.  0. 20.]
Episode: 1277 Total reward: 16.099999999999998 Training loss: 0.0401 Epsilon: 0.0103 Life: 300
Episode Finish  [11.  0. 16.]
Episode: 1278 Total reward: 17.7 Training loss: 0.0185 Epsilon: 0.0103 Life: 300
Model updated
Episode Finish  [10.  0.  4.]
Episode: 1279 Total reward: 15.800000000000002 Training loss: 0.0246 Epsilon: 0.0103 Life: 300
Episode Finish  [10.  0. 24.]
Episode: 1280 Total reward: 16.1 Training loss: 0.0263 Epsilon: 0.0103 Life: 300
Episode Finish  [11.  0. 32.]
Episode: 1281 Total reward: 17.8 Training loss: 0.0200 Epsilon: 0.0103 Life: 300
Episode Finish  [11.  1. 24.]
Episode: 1282 Total reward: 18.200000000000003 Training loss: 0.0384 Epsilon: 0.0103 Life: 300
Episode Finish  [10.  2.  4.]
Episode: 1283 Total reward: 16.0 Training loss: 0.0206 Epsilon: 0.0103 Life: 300
Episode Finish  [12.  0. 24.]
Episode: 1284 Total reward: 20.1 Training loss: 0.0335 Epsilon: 0.0103 Life: 300
Episode Finish  [14.  0. 24.]
Episode: 1285 Total reward

Episode Finish  [ 8.  7. 32.]
Episode: 1348 Total reward: 12.500000000000002 Training loss: 0.0254 Epsilon: 0.0102 Life: 300
Episode Finish  [11.  0.  8.]
Episode: 1349 Total reward: 18.099999999999998 Training loss: 0.0298 Epsilon: 0.0102 Life: 300
Episode Finish  [15.  0. 24.]
Episode: 1350 Total reward: 26.1 Training loss: 0.0291 Epsilon: 0.0102 Life: 300
Stats Saved at 1350
Episode Finish  [14.  0.  4.]
Episode: 1351 Total reward: 23.700000000000003 Training loss: 0.0338 Epsilon: 0.0102 Life: 300
Episode Finish  [13.  0. 24.]
Episode: 1352 Total reward: 22.1 Training loss: 0.0315 Epsilon: 0.0102 Life: 300
Episode Finish  [14.  0. 24.]
Episode: 1353 Total reward: 24.1 Training loss: 0.0238 Epsilon: 0.0102 Life: 300
Episode Finish  [7. 6. 8.]
Episode: 1354 Total reward: 10.4 Training loss: 0.0229 Epsilon: 0.0102 Life: 300
Episode Finish  [15.  0.  8.]
Episode: 1355 Total reward: 25.7 Training loss: 0.0206 Epsilon: 0.0102 Life: 300
Episode Finish  [ 6. 13.  2.]
Episode: 1356 Total rew

Episode Finish  [7. 8. 8.]
Episode: 1420 Total reward: 10.6 Training loss: 0.0247 Epsilon: 0.0101 Life: 300
Episode Finish  [13.  0. 24.]
Episode: 1421 Total reward: 22.1 Training loss: 0.0281 Epsilon: 0.0101 Life: 300
Episode Finish  [13.  0. 24.]
Episode: 1422 Total reward: 22.1 Training loss: 0.0234 Epsilon: 0.0101 Life: 300
Episode Finish  [ 9.  8. 28.]
Episode: 1423 Total reward: 14.7 Training loss: 0.0189 Epsilon: 0.0101 Life: 300
Episode Finish  [13.  0. 24.]
Episode: 1424 Total reward: 22.2 Training loss: 0.0244 Epsilon: 0.0101 Life: 300
Episode Finish  [11.  0. 36.]
Episode: 1425 Total reward: 18.099999999999998 Training loss: 0.0224 Epsilon: 0.0101 Life: 300
Episode Finish  [ 7.  1. 12.]
Episode: 1426 Total reward: 9.7 Training loss: 0.0293 Epsilon: 0.0101 Life: 300
Episode Finish  [11.  0. 10.]
Episode: 1427 Total reward: 18.0 Training loss: 0.0377 Epsilon: 0.0101 Life: 300
Episode Finish  [12.  0. 24.]
Episode: 1428 Total reward: 20.1 Training loss: 0.0297 Epsilon: 0.0101 L

Episode Finish  [11.  0. 18.]
Episode: 1492 Total reward: 18.0 Training loss: 0.0239 Epsilon: 0.0101 Life: 300
Episode Finish  [ 9.  1. 18.]
Episode: 1493 Total reward: 14.100000000000001 Training loss: 0.0210 Epsilon: 0.0101 Life: 300
Episode Finish  [15.  0. 18.]
Episode: 1494 Total reward: 26.1 Training loss: 0.0211 Epsilon: 0.0101 Life: 300
Episode Finish  [16.  0. 24.]
Episode: 1495 Total reward: 28.1 Training loss: 0.0184 Epsilon: 0.0101 Life: 300
Episode Finish  [13.  0. 24.]
Episode: 1496 Total reward: 22.099999999999998 Training loss: 0.0267 Epsilon: 0.0101 Life: 300
Episode Finish  [9. 0. 2.]
Episode: 1497 Total reward: 13.5 Training loss: 0.0202 Epsilon: 0.0101 Life: 300
Episode Finish  [13.  2. 12.]
Episode: 1498 Total reward: 21.799999999999997 Training loss: 0.0227 Epsilon: 0.0101 Life: 300
Episode Finish  [ 8.  5. 10.]
Episode: 1499 Total reward: 12.500000000000002 Training loss: 0.0220 Epsilon: 0.0101 Life: 300
Episode Finish  [11.  0. 24.]
Episode: 1500 Total reward: 1

Episode Finish  [12.  0. 24.]
Episode: 1563 Total reward: 20.1 Training loss: 0.0199 Epsilon: 0.0100 Life: 300
Episode Finish  [11.  0.  8.]
Episode: 1564 Total reward: 17.5 Training loss: 0.0234 Epsilon: 0.0100 Life: 300
Episode Finish  [10.  0.  6.]
Episode: 1565 Total reward: 15.899999999999999 Training loss: 0.0240 Epsilon: 0.0100 Life: 300
Episode Finish  [14.  0.  4.]
Episode: 1566 Total reward: 23.999999999999996 Training loss: 0.0224 Epsilon: 0.0100 Life: 300
Episode Finish  [11.  0. 16.]
Episode: 1567 Total reward: 17.9 Training loss: 0.0203 Epsilon: 0.0100 Life: 300
Episode Finish  [11.  0.  4.]
Episode: 1568 Total reward: 17.9 Training loss: 0.0253 Epsilon: 0.0100 Life: 300
Episode Finish  [ 8.  0. 10.]
Episode: 1569 Total reward: 12.0 Training loss: 0.0210 Epsilon: 0.0100 Life: 300
Episode Finish  [12.  0. 24.]
Episode: 1570 Total reward: 20.1 Training loss: 0.0264 Epsilon: 0.0100 Life: 300
Model updated
Episode Finish  [14.  0. 24.]
Episode: 1571 Total reward: 24.099999999

Episode Finish  [11.  0. 16.]
Episode: 1635 Total reward: 17.7 Training loss: 0.0314 Epsilon: 0.0100 Life: 300
Episode Finish  [11.  0. 28.]
Episode: 1636 Total reward: 18.1 Training loss: 0.0252 Epsilon: 0.0100 Life: 300
Episode Finish  [11.  0. 12.]
Episode: 1637 Total reward: 17.8 Training loss: 0.0260 Epsilon: 0.0100 Life: 300
Episode Finish  [14.  0. 24.]
Episode: 1638 Total reward: 24.1 Training loss: 0.0238 Epsilon: 0.0100 Life: 300
Episode Finish  [10.  0. 50.]
Episode: 1639 Total reward: 16.0 Training loss: 0.0265 Epsilon: 0.0100 Life: 300
Episode Finish  [15.  0.  6.]
Episode: 1640 Total reward: 26.0 Training loss: 0.0186 Epsilon: 0.0100 Life: 300
Episode Finish  [10.  0.  6.]
Episode: 1641 Total reward: 15.400000000000002 Training loss: 0.0312 Epsilon: 0.0100 Life: 300
Episode Finish  [17.  0. 20.]
Episode: 1642 Total reward: 29.7 Training loss: 0.0190 Epsilon: 0.0100 Life: 300
Episode Finish  [15.  0.  6.]
Episode: 1643 Total reward: 25.800000000000004 Training loss: 0.0227

Episode Finish  [15.  0. 18.]
Episode: 1706 Total reward: 25.7 Training loss: 0.0234 Epsilon: 0.0100 Life: 300
Episode Finish  [12.  0. 24.]
Episode: 1707 Total reward: 20.1 Training loss: 0.0297 Epsilon: 0.0100 Life: 300
Episode Finish  [12.  0.  8.]
Episode: 1708 Total reward: 20.1 Training loss: 0.0198 Epsilon: 0.0100 Life: 300
Episode Finish  [11.  0. 14.]
Episode: 1709 Total reward: 17.799999999999997 Training loss: 0.0249 Epsilon: 0.0100 Life: 300
Episode Finish  [10.  1. 12.]
Episode: 1710 Total reward: 16.2 Training loss: 0.0179 Epsilon: 0.0100 Life: 300
Episode Finish  [ 7. 15. 24.]
Episode: 1711 Total reward: 11.6 Training loss: 0.0239 Epsilon: 0.0100 Life: 300
Episode Finish  [9. 3. 2.]
Episode: 1712 Total reward: 14.200000000000003 Training loss: 0.0200 Epsilon: 0.0100 Life: 300
Episode Finish  [11.  0.  6.]
Episode: 1713 Total reward: 17.9 Training loss: 0.0181 Epsilon: 0.0100 Life: 300
Episode Finish  [10.  0. 10.]
Episode: 1714 Total reward: 16.0 Training loss: 0.0297 Ep

Episode Finish  [12.  9.  4.]
Episode: 1777 Total reward: 20.700000000000003 Training loss: 0.0377 Epsilon: 0.0100 Life: 300
Episode Finish  [16.  0. 24.]
Episode: 1778 Total reward: 28.1 Training loss: 0.0226 Epsilon: 0.0100 Life: 300
Episode Finish  [12.  2.  4.]
Episode: 1779 Total reward: 20.099999999999998 Training loss: 0.0238 Epsilon: 0.0100 Life: 300
Episode Finish  [14.  3. 24.]
Episode: 1780 Total reward: 25.4 Training loss: 0.0246 Epsilon: 0.0100 Life: 300
Episode Finish  [9. 4. 4.]
Episode: 1781 Total reward: 15.4 Training loss: 0.0260 Epsilon: 0.0100 Life: 300
Episode Finish  [13.  1. 24.]
Episode: 1782 Total reward: 22.199999999999996 Training loss: 0.0191 Epsilon: 0.0100 Life: 300
Episode Finish  [12.  7. 24.]
Episode: 1783 Total reward: 20.8 Training loss: 0.0221 Epsilon: 0.0100 Life: 300
Episode Finish  [ 6. 13. 24.]
Episode: 1784 Total reward: 9.4 Training loss: 0.0251 Epsilon: 0.0100 Life: 300
Episode Finish  [13.  3. 24.]
Episode: 1785 Total reward: 22.4 Training lo

Episode Finish  [10.  0. 24.]
Episode: 1849 Total reward: 16.1 Training loss: 0.0221 Epsilon: 0.0100 Life: 300
Episode Finish  [ 9.  4. 24.]
Episode: 1850 Total reward: 14.5 Training loss: 0.0188 Epsilon: 0.0100 Life: 300
Stats Saved at 1850
Episode Finish  [10.  1.  4.]
Episode: 1851 Total reward: 15.899999999999999 Training loss: 0.0188 Epsilon: 0.0100 Life: 300
Episode Finish  [ 8.  3. 20.]
Episode: 1852 Total reward: 12.2 Training loss: 0.0271 Epsilon: 0.0100 Life: 300
Episode Finish  [12.  0.  4.]
Episode: 1853 Total reward: 19.799999999999997 Training loss: 0.0173 Epsilon: 0.0100 Life: 300
Episode Finish  [14.  0.  6.]
Episode: 1854 Total reward: 23.9 Training loss: 0.0215 Epsilon: 0.0100 Life: 300
Episode Finish  [12.  0. 24.]
Episode: 1855 Total reward: 20.099999999999998 Training loss: 0.0211 Epsilon: 0.0100 Life: 300
Episode Finish  [12.  0. 10.]
Episode: 1856 Total reward: 19.400000000000006 Training loss: 0.0288 Epsilon: 0.0100 Life: 300
Episode Finish  [13.  0. 24.]
Episod

Episode Finish  [16.  0.  8.]
Episode: 1920 Total reward: 27.800000000000004 Training loss: 0.0194 Epsilon: 0.0100 Life: 300
Episode Finish  [16.  0. 10.]
Episode: 1921 Total reward: 28.000000000000004 Training loss: 0.0335 Epsilon: 0.0100 Life: 300
Episode Finish  [14.  0. 24.]
Episode: 1922 Total reward: 24.1 Training loss: 0.0334 Epsilon: 0.0100 Life: 300
Episode Finish  [9. 0. 4.]
Episode: 1923 Total reward: 14.100000000000001 Training loss: 0.0208 Epsilon: 0.0100 Life: 300
Episode Finish  [13.  0.  8.]
Episode: 1924 Total reward: 21.8 Training loss: 0.0270 Epsilon: 0.0100 Life: 300
Episode Finish  [13.  2.  4.]
Episode: 1925 Total reward: 22.1 Training loss: 0.0226 Epsilon: 0.0100 Life: 300
Episode Finish  [ 6.  6. 24.]
Episode: 1926 Total reward: 8.700000000000001 Training loss: 0.0293 Epsilon: 0.0100 Life: 300
Episode Finish  [10.  0.  2.]
Episode: 1927 Total reward: 15.399999999999999 Training loss: 0.0284 Epsilon: 0.0100 Life: 300
Episode Finish  [13.  0.  6.]
Episode: 1928 To

Episode Finish  [14.  0. 24.]
Episode: 1991 Total reward: 24.2 Training loss: 0.0202 Epsilon: 0.0100 Life: 300
Episode Finish  [14.  0.  4.]
Episode: 1992 Total reward: 24.1 Training loss: 0.0347 Epsilon: 0.0100 Life: 300
Episode Finish  [12.  3. 24.]
Episode: 1993 Total reward: 20.4 Training loss: 0.0211 Epsilon: 0.0100 Life: 300
Episode Finish  [12.  0.  4.]
Episode: 1994 Total reward: 19.9 Training loss: 0.0218 Epsilon: 0.0100 Life: 300
Episode Finish  [13.  0.  4.]
Episode: 1995 Total reward: 22.0 Training loss: 0.0300 Epsilon: 0.0100 Life: 300
Episode Finish  [15.  0. 12.]
Episode: 1996 Total reward: 25.7 Training loss: 0.0184 Epsilon: 0.0100 Life: 300
Episode Finish  [16.  0. 24.]
Episode: 1997 Total reward: 28.099999999999998 Training loss: 0.0249 Epsilon: 0.0100 Life: 300
Episode Finish  [11.  7. 12.]
Episode: 1998 Total reward: 18.8 Training loss: 0.0213 Epsilon: 0.0100 Life: 300
Episode Finish  [10.  6.  4.]
Episode: 1999 Total reward: 16.400000000000002 Training loss: 0.0264

Episode Finish  [14.  0. 14.]
Episode: 2063 Total reward: 24.0 Training loss: 0.0294 Epsilon: 0.0100 Life: 300
Episode Finish  [14.  0. 26.]
Episode: 2064 Total reward: 23.9 Training loss: 0.0256 Epsilon: 0.0100 Life: 300
Episode Finish  [12.  0. 12.]
Episode: 2065 Total reward: 19.7 Training loss: 0.0290 Epsilon: 0.0100 Life: 300
Episode Finish  [12.  0.  8.]
Episode: 2066 Total reward: 19.8 Training loss: 0.0329 Epsilon: 0.0100 Life: 300
Episode Finish  [ 7. 11. 10.]
Episode: 2067 Total reward: 11.1 Training loss: 0.0149 Epsilon: 0.0100 Life: 300
Episode Finish  [13.  0. 32.]
Episode: 2068 Total reward: 22.0 Training loss: 0.0211 Epsilon: 0.0100 Life: 300
Episode Finish  [ 8. 12. 16.]
Episode: 2069 Total reward: 13.100000000000001 Training loss: 0.0197 Epsilon: 0.0100 Life: 300
Episode Finish  [14.  0.  2.]
Episode: 2070 Total reward: 23.299999999999997 Training loss: 0.0253 Epsilon: 0.0100 Life: 300
Episode Finish  [15.  0. 20.]
Episode: 2071 Total reward: 26.1 Training loss: 0.0288

Episode Finish  [11.  0.  2.]
Episode: 2135 Total reward: 17.3 Training loss: 0.0217 Epsilon: 0.0100 Life: 300
Episode Finish  [ 8.  9. 12.]
Episode: 2136 Total reward: 13.0 Training loss: 0.0232 Epsilon: 0.0100 Life: 300
Episode Finish  [11.  0.  2.]
Episode: 2137 Total reward: 17.999999999999996 Training loss: 0.0218 Epsilon: 0.0100 Life: 300
Episode Finish  [11.  0.  4.]
Episode: 2138 Total reward: 17.799999999999997 Training loss: 0.0167 Epsilon: 0.0100 Life: 300
Episode Finish  [13.  0.  4.]
Episode: 2139 Total reward: 22.0 Training loss: 0.0235 Epsilon: 0.0100 Life: 300
Episode Finish  [15.  0. 10.]
Episode: 2140 Total reward: 25.999999999999996 Training loss: 0.0155 Epsilon: 0.0100 Life: 300
Episode Finish  [ 7.  9. 14.]
Episode: 2141 Total reward: 10.799999999999999 Training loss: 0.0146 Epsilon: 0.0100 Life: 300
Episode Finish  [14.  0.  2.]
Episode: 2142 Total reward: 23.9 Training loss: 0.0211 Epsilon: 0.0100 Life: 300
Episode Finish  [ 8.  0. 14.]
Episode: 2143 Total reward

Episode Finish  [16.  0. 28.]
Episode: 2206 Total reward: 27.9 Training loss: 0.0208 Epsilon: 0.0100 Life: 524
Episode Finish  [14.  0. 24.]
Episode: 2207 Total reward: 24.1 Training loss: 0.0265 Epsilon: 0.0100 Life: 524
Episode Finish  [11.  0.  8.]
Episode: 2208 Total reward: 18.0 Training loss: 0.0152 Epsilon: 0.0100 Life: 524
Episode Finish  [16.  0. 24.]
Episode: 2209 Total reward: 27.8 Training loss: 0.0275 Epsilon: 0.0100 Life: 524
Episode Finish  [12.  0. 40.]
Episode: 2210 Total reward: 20.1 Training loss: 0.0251 Epsilon: 0.0100 Life: 524
Episode Finish  [10.  0. 18.]
Episode: 2211 Total reward: 15.7 Training loss: 0.0371 Epsilon: 0.0100 Life: 524
Episode Finish  [10.  0.  4.]
Episode: 2212 Total reward: 15.799999999999999 Training loss: 0.0219 Epsilon: 0.0100 Life: 524
Episode Finish  [ 9. 11. 24.]
Episode: 2213 Total reward: 15.1 Training loss: 0.0244 Epsilon: 0.0100 Life: 524
Episode Finish  [13.  0.  8.]
Episode: 2214 Total reward: 21.799999999999997 Training loss: 0.0227

Episode Finish  [15.  0. 44.]
Episode: 2277 Total reward: 26.2 Training loss: 0.0191 Epsilon: 0.0100 Life: 524
Episode Finish  [ 8.  1. 16.]
Episode: 2278 Total reward: 12.100000000000001 Training loss: 0.0283 Epsilon: 0.0100 Life: 524
Episode Finish  [16.  0.  4.]
Episode: 2279 Total reward: 27.900000000000006 Training loss: 0.0205 Epsilon: 0.0100 Life: 524
Episode Finish  [13.  0. 10.]
Episode: 2280 Total reward: 21.900000000000002 Training loss: 0.0326 Epsilon: 0.0100 Life: 524
Episode Finish  [11.  3. 40.]
Episode: 2281 Total reward: 18.4 Training loss: 0.0266 Epsilon: 0.0100 Life: 524
Episode Finish  [12.  1. 24.]
Episode: 2282 Total reward: 20.2 Training loss: 0.0229 Epsilon: 0.0100 Life: 524
Episode Finish  [ 9.  8. 28.]
Episode: 2283 Total reward: 15.0 Training loss: 0.0328 Epsilon: 0.0100 Life: 524
Episode Finish  [15.  0. 10.]
Episode: 2284 Total reward: 26.0 Training loss: 0.0292 Epsilon: 0.0100 Life: 524
Episode Finish  [12.  0. 24.]
Episode: 2285 Total reward: 20.099999999

Episode Finish  [13.  0.  6.]
Episode: 2348 Total reward: 21.599999999999998 Training loss: 0.0195 Epsilon: 0.0100 Life: 524
Episode Finish  [10.  0.  8.]
Episode: 2349 Total reward: 15.599999999999998 Training loss: 0.0153 Epsilon: 0.0100 Life: 524
Episode Finish  [13.  0. 24.]
Episode: 2350 Total reward: 22.099999999999998 Training loss: 0.0167 Epsilon: 0.0100 Life: 524
Stats Saved at 2350
Episode Finish  [9. 0. 6.]
Episode: 2351 Total reward: 13.9 Training loss: 0.0169 Epsilon: 0.0100 Life: 524
Episode Finish  [14.  0. 10.]
Episode: 2352 Total reward: 24.0 Training loss: 0.0194 Epsilon: 0.0100 Life: 524
Episode Finish  [12.  0.  8.]
Episode: 2353 Total reward: 20.0 Training loss: 0.0265 Epsilon: 0.0100 Life: 524
Episode Finish  [11.  0.  2.]
Episode: 2354 Total reward: 17.200000000000003 Training loss: 0.0201 Epsilon: 0.0100 Life: 524
Episode Finish  [12.  0.  6.]
Episode: 2355 Total reward: 19.7 Training loss: 0.0174 Epsilon: 0.0100 Life: 524
Episode Finish  [14.  0. 24.]
Episode: 
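
The per-episode summary lines above follow a fixed layout, so they can be turned into numbers for plotting or averaging. The snippet below is a minimal sketch, not part of the original notebook: `LOG_LINE`, `parse_training_log` and `mean_reward` are hypothetical names, and the pattern assumes the exact `Episode: ... Total reward: ... Training loss: ... Epsilon: ... Life: ...` format printed in the log. Lines that do not match the full pattern (for example the cut-off ones) are skipped.

```python
import re

# Assumed layout of one complete summary line, taken from the log output above.
LOG_LINE = re.compile(
    r"Episode: (?P<episode>\d+) "
    r"Total reward: (?P<reward>-?\d+(?:\.\d+)?) "
    r"Training loss: (?P<loss>\d+\.\d+) "
    r"Epsilon: (?P<epsilon>\d+\.\d+) "
    r"Life: (?P<life>\d+)$"
)


def parse_training_log(text):
    """Return one dict per complete summary line; other lines are ignored."""
    rows = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            rows.append({
                "episode": int(match["episode"]),
                "reward": float(match["reward"]),
                "loss": float(match["loss"]),
                "epsilon": float(match["epsilon"]),
                "life": int(match["life"]),
            })
    return rows


def mean_reward(rows):
    """Average total reward over the parsed episodes (NaN if there are none)."""
    if not rows:
        return float("nan")
    return sum(row["reward"] for row in rows) / len(rows)


# Three lines copied from the log above; the last one is cut off and is skipped.
sample = """Episode Finish  [12.  0. 24.]
Episode: 993 Total reward: 20.099999999999998 Training loss: 0.0190 Epsilon: 0.0132 Life: 269
Episode Finish  [ 5. 12.  8.]
Episode: 994 Total reward: 7.1 Training loss: 0.0310 Epsilon: 0.0132 Life: 269
Episode: 1072 Total reward: 16.0 Training loss: 0.0242 Epsilon"""

rows = parse_training_log(sample)
print(len(rows), mean_reward(rows))
```

Averaging over a window of episodes gives a smoother picture of progress than reading individual rewards, which swing widely from one episode to the next.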

## Step 9: Watch our Agent play 👀
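Now that our agent is trained, we can test it. We restore the saved checkpoint and let the agent play 10 episodes, keeping a small exploration probability (0.01).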

In [None]:
with tf.Session() as sess:
    
    game = DoomGame()
    
    # Load the correct configuration (TESTING)
    game.load_config("defend_the_center.cfg")
    
    # Load the correct scenario (in our case the defend_the_center scenario)
    game.set_doom_scenario_path("defend_the_center.wad")
    
    game.init()
    
    # Load the model (the game is already initialized above, so no second init() is needed)
    saver.restore(sess, "./models/model.ckpt")
    
    for i in range(10):
        
        game.new_episode()
        state = game.get_state().screen_buffer
        state, stacked_frames = stack_frames(stacked_frames, state, True)
    
        while not game.is_episode_finished():
            ## EPSILON GREEDY STRATEGY
            # Choose action a from state s using epsilon greedy.
            # First draw a random number in [0, 1)
            exp_exp_tradeoff = np.random.rand()

            # Keep a small, fixed exploration probability at test time
            explore_probability = 0.01
    
            if (explore_probability > exp_exp_tradeoff):
                # Make a random action (exploration)
                action = random.choice(possible_actions)
        
            else:
                # Get action from Q-network (exploitation)
                # Estimate the Q-values for the current state
                Qs = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: state.reshape((1, *state.shape))})
        
                # Take the biggest Q value (= the best action)
                choice = np.argmax(Qs)
                action = possible_actions[int(choice)]
            
            game.make_action(action)
            done = game.is_episode_finished()
        
            if done:
                break  
                
            else:
                next_state = game.get_state().screen_buffer
                next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)
                state = next_state
        
        score = game.get_total_reward()
        print("Score: ", score)
    
    game.close()
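
The epsilon-greedy choice inside the loop can be pulled out into a small helper, which makes it easy to check in isolation. This is a minimal sketch rather than the notebook's own code: `select_action` and its `rng` parameter are hypothetical names, `q_values` stands in for the output of `sess.run(DQNetwork.output, ...)`, and the three one-hot actions are only an illustrative `possible_actions` list.

```python
import random


def select_action(q_values, possible_actions, explore_probability=0.01, rng=random):
    """Pick a random action with probability explore_probability, else the greedy one."""
    # Exploration: same test as `explore_probability > exp_exp_tradeoff` in the loop.
    if explore_probability > rng.random():
        return rng.choice(possible_actions)
    # Exploitation: index of the largest Q-value (the pure-Python analogue of np.argmax).
    best = max(range(len(q_values)), key=lambda i: q_values[i])
    return possible_actions[best]


# Illustrative one-hot action list (an assumption; the real list is built earlier).
possible_actions = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# With no exploration the helper always returns the action with the highest Q-value.
greedy = select_action([0.1, 0.9, 0.3], possible_actions, explore_probability=0.0)
print(greedy)
```

Setting `explore_probability` to 0 gives a purely greedy evaluation; the small non-zero value used in the test loop leaves the agent an occasional random action.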