## **OpenAI Gym, PyBullet and PyBulletGym Installation**
[Click here to see Gym documentation](https://gym.openai.com/docs/)

[Click here to see PyBullet documentation](https://docs.google.com/document/d/10sXEhzFRSnvFcl3XxNGhnD4N2SedqwdAvK3dsihxVUA)

[Click here to see PyBulletGym page](https://github.com/benelot/pybullet-gym)



Note that this assignment was done on a remote server.

**Before we start, update the apt package lists on the machine.**

In [1]:
import sys
print(sys.version)

3.7.4 (default, Aug 13 2019, 20:35:49) 
[GCC 7.3.0]


In [2]:
# !apt-get update

Most Python package requirements are already fulfilled on Colab. To run Gym, install prerequisites such as xvfb, OpenGL, and other python-dev packages with the following commands.

In [3]:
# !pip install gym
# !apt-get install python-opengl -y
# !apt install xvfb -y

To render the environment, you can use pyvirtualdisplay, so install it as well:

In [4]:
# !pip install pyvirtualdisplay
# !pip install pyglet

In [5]:
# !pip install pybullet==2.5.9

In [6]:
# !git clone https://github.com/benelot/pybullet-gym.git # should already be there in my Google Drive

## **Update the source code**
In pybulletgym/envs/mujoco/envs/pendulum/inverted_pendulum_env.py, line 32, change

done = not np.isfinite(state).all() or np.abs(state[1]) > .2

to

done = abs(state[0][0]) > 2.4 or abs(state[0][1]) > 0.27

**Restart runtime and run the following cells.**

In [7]:
# cd /content/pybullet-gym/ # use the address below instead

In [8]:
# cd './pybullet-gym'

In [9]:
# !pip install -e .

# Tensorflow version: 1.13.1

In [10]:
# !pip install tensorflow==1.13.1

Import everything.

In [11]:
from __future__ import division
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) # error only

import pybulletgym  # register PyBullet environments with OpenAI Gym
import pybullet
import pybullet_data

import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
import os
from os import path
import copy
import hickle as hkl

from IPython.display import HTML
from IPython import display as ipythondisplay

# Colab comes with PyTorch
# import torch
# import torch.nn as nn
# import torch.nn.functional as F
# from torch.autograd import Variable
from collections import namedtuple
from itertools import count
from PIL import Image
import shutil
import psutil
import gc
import statistics
import cv2

from keras.models import Sequential, Model, load_model
from keras.layers import Dense, Dropout, Input, BatchNormalization, \
                            Reshape, Flatten, Activation, ZeroPadding2D, \
                            Lambda, Convolution2D
from keras.layers.merge import Add, Multiply
from keras.optimizers import Adam
# import keras.backend as K

from keras.layers.advanced_activations import LeakyReLU
from keras.layers.convolutional import UpSampling2D, Conv2D
from keras.engine.topology import Layer
from keras import optimizers
from keras import initializers

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
from collections import deque

# use plaidml as backend
# install plaidml:
# pip install plaidml-keras
# plaidml-setup
# ======================================================
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"
# ======================================================
from keras import backend as K
# from tensorflow.keras import backend as K
from tensorflow.python.keras import backend as k
# use tensorflow as backend
# ======================================================
# import tensorflow as tf
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
# config = tf.compat.v1.ConfigProto() # for Tensorflow 2.1
config = tf.ConfigProto() # for Tensorflow 1.13.1
config.gpu_options.allow_growth = True
print("config: ", config)

# tf.compat.v1.keras.backend.set_session(tf.compat.v1.Session(config=config)) # for Tensorflow 2.1
K.set_session(tf.Session(config=config)) # for Tensorflow 1.13.1
# ======================================================

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


Instructions for updating:
non-resource variables are not supported in the long term
config:  gpu_options {
  allow_growth: true
}



# All Parameters

In [12]:
# ---------------
# All parameters
# ---------------
class Params:
    def __init__(self):
        # Parameters for this experiment
        self.exp_id = 'DQN_bk_smer_4' # bk: Breakout, smer: save max episode reward
        self.env_id = 'BreakoutDeterministic-v4'# 'BreakoutNoFrameskip-v4'
        self.server_path = '/home/bryanbc/Repos/rl/'
        self.hw = 'hw03'
        self.video_path = self.server_path + self.hw + '/' + self.exp_id + '/video/'
        self.mp4list_path = self.video_path + '*.mp4'
        
        # Parameters for environment
        self.seed_num = 123
        self.n_actions = 4
        self.s_len = 4
        self.input_shape = (None, 93, 80, self.s_len)
        self.init_epsilon = 1.
        self.final_epsilon = 0.01
        self.epsilon = copy.deepcopy(self.init_epsilon)
        self.max_episodes = 10000000
        self.max_steps = 1000
        self.exploration_steps = 500000
        self.cnt_frames = 0
        self.max_episode_reward = 0
        
        # Parameters for models
        self.init_learning_rate = 1e-4
        self.final_learning_rate = 5e-6
        self.learning_rate = copy.deepcopy(self.init_learning_rate)
        self.learning_rate_decay_step = 0
        self.batch_size = 32
        self.gamma = 0.99
        self.tau   =  0.001
        self.buffer_size = 500000
        
        self.train_frame_interval = 4
        self.update_target_network_episode_interval = 40
        self.save_model_episode_interval = 500
        self.episode_i = 0
        
        # saved models paths
        self.saved_models_path = '/ssd/bryanbc/saved_models/' + self.hw + '/' + self.exp_id
        self.saved_train_network_filepath = '%s/train_network_episode_%d.h5' % \
            (self.saved_models_path, self.episode_i)
        self.saved_target_network_filepath = '%s/target_network_episode_%d.h5' % \
            (self.saved_models_path, self.episode_i)
        self.saved_PARAMS_filepath = '%s/PARAMS_episode_%d.hkl' % \
            (self.saved_models_path, self.episode_i)
        
        # saved log path
        self.log_path = '/ssd/bryanbc/data/logs/hw/' + self.hw + '/'
        os.makedirs((self.log_path), exist_ok=True)
        self.log_filepath = self.log_path + self.exp_id + '_episode_reward.log'
        # Open the log file
        self.log_file = open((self.log_filepath), 'a')
        
        # saved replay buffer path
        # self.replay_buffer_path = self.saved_models_path + '/'
        # os.makedirs((self.replay_buffer_path), exist_ok=True)
        # self.saved_replay_buffer_filepath = self.replay_buffer_path + self.exp_id + '_replay_buffer_gzip.hkl'
        
    def load(self, load_PARAMS):
        self.epsilon, self.learning_rate, \
            self.learning_rate_decay_step, self.cnt_frames, \
            self.episode_i, self.max_episode_reward = load_PARAMS
        
PARAMS = Params()
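
The exploration settings above (`init_epsilon`, `final_epsilon`, `exploration_steps`) describe a linear annealing schedule. The sketch below reproduces that schedule in plain Python. `anneal_epsilon` and the upper-case constants are hypothetical names introduced for illustration; they are not part of the notebook.

```python
# Values copied from Params; the names are illustrative only.
INIT_EPSILON = 1.0
FINAL_EPSILON = 0.01
EXPLORATION_STEPS = 500000


def anneal_epsilon(epsilon, n_updates=1):
    """Lower epsilon linearly by n_updates steps, never below FINAL_EPSILON."""
    step = (INIT_EPSILON - FINAL_EPSILON) / EXPLORATION_STEPS
    return max(FINAL_EPSILON, epsilon - n_updates * step)


# Epsilon after 36 updates, and after far more updates than the schedule needs.
eps_after_36 = anneal_epsilon(INIT_EPSILON, n_updates=36)
eps_floor = anneal_epsilon(INIT_EPSILON, n_updates=10 * EXPLORATION_STEPS)
print(eps_after_36, eps_floor)
```

In the training loop, epsilon is only lowered on training steps (every `train_frame_interval` frames once the buffer holds a batch), so reaching `final_epsilon` takes roughly `exploration_steps * train_frame_interval` frames.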

To activate the virtual display, run the following once before training the agent:

In [13]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':6119'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':6119'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

The following code creates a virtual display to draw game images on. If you are running locally, just ignore it.

In [14]:
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY"))==0:
    !bash ../xvfb start
    %env DISPLAY=:1

In [15]:
"""
Utility functions to enable video recording of a gym environment and to display it.
To enable video, just do "env = wrap_env(env)".
"""
# mp4list_path_colab = '/content/gdrive/My Drive/video/*.mp4'
def show_video():
    mp4list = glob.glob(PARAMS.mp4list_path) # glob.glob('/content/video/*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
                </video>'''.format(encoded.decode('ascii'))))
    else: 
        print("Could not find video")
    
# video_path_colab = '/content/gdrive/My Drive/video/'
def wrap_env(env):
    env = Monitor(env, PARAMS.video_path, force=True) # Monitor(env, '/content/video', force=True)
    return env
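
`show_video` embeds the recording directly in the notebook by base64-encoding the MP4 bytes into a `data:` URL. The sketch below isolates that step so it can be checked without a recorded file. `video_tag` and the fake byte string are hypothetical and exist only for illustration.

```python
import base64


def video_tag(video_bytes, height_px=400):
    """Return an HTML <video> element with the MP4 bytes embedded as base64."""
    encoded = base64.b64encode(video_bytes).decode('ascii')
    return ('<video autoplay loop controls style="height: {h}px;">'
            '<source src="data:video/mp4;base64,{data}" type="video/mp4" />'
            '</video>').format(h=height_px, data=encoded)


# Placeholder bytes stand in for a real recording.
fake_video = b'\x00\x01fake-mp4-bytes'
tag = video_tag(fake_video)
print(tag[:60])
```

Because the whole file is inlined into the page, this approach suits short clips; long recordings make the notebook file large.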

# Utility

In [16]:
def dense_to_one_hot(data, depth=10):
    return (np.arange(depth) == np.array(data)[:, None]).astype(bool)


def rgb2gray(im):
    return (cv2.cvtColor(im, cv2.COLOR_RGB2GRAY)).astype(np.uint8)


def down_sample(gray):
    return gray[25::2, ::2]


class LayerNormalization(Layer):

    def __init__(self, eps=1e-5, activation=None, **kwargs):
        self.eps = eps
        self.channels = None
        self.activation = activation
        super(LayerNormalization, self).__init__(**kwargs)

    def build(self, input_shape):
        self.channels = input_shape[-1]
        shape = [1] * (len(input_shape) - 1)
        shape.append(self.channels)
        self.add_weight('gamma', shape, dtype='float32', initializer='ones')
        self.add_weight('beta', shape, dtype='float32', initializer='zeros')

        super(LayerNormalization, self).build(input_shape)  # Be sure to call this somewhere!

    def call(self, inputs, **kwargs):
        dim = len(K.int_shape(inputs)) - 1
        mean = K.mean(inputs, axis=dim, keepdims=True)
        var = K.mean(K.square(inputs - mean), axis=dim, keepdims=True)
        outputs = (inputs - mean) / K.sqrt(var + self.eps)
        outputs = outputs * self.trainable_weights[0] + self.trainable_weights[1]
        if self.activation is None:
            return outputs
        else:
            return self.activation(outputs)
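
The `LayerNormalization` layer above normalizes over the last (channel) axis and then applies a learned scale and shift. A minimal NumPy sketch of the same arithmetic follows, assuming `gamma` is all ones and `beta` all zeros as at initialization. `layer_norm` is a hypothetical helper, not the Keras layer itself.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last axis, then scale by gamma and shift by beta."""
    mean = x.mean(axis=-1, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta


x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 10.0, 10.0, 30.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=-1))  # close to 0 for each row
print(y.std(axis=-1))   # close to 1 for each row
```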

# **RL Algorithms**
Code based on: https://github.com/IntoxicatedDING/DQN-Beat-Atari.

# DQN Network

In [17]:
class Agent:
    def __init__(self):
        self.replay_buffer = deque()
        self.q_out, self.train_network = Agent.build_train_network()
        self.target_network = Agent.build_target_network()

        # self.opt = optimizers.rmsprop(lr=self.learning_rate, rho=0.95)
        self.opt = optimizers.Adam(lr=PARAMS.learning_rate)
        # self.opt = optimizers.RMSprop(lr=self.learning_rate, rho=0.95, epsilon=0.01)
        self.train_network.compile(optimizer=self.opt, loss=[Agent.huber_loss])

    # Append a transition (q, s, a, s_, r, not_done) into replay buffer
    def remember(self, transition):
        if len(self.replay_buffer) >= PARAMS.buffer_size:
            self.replay_buffer.popleft()
        self.replay_buffer.append(transition)
        return self.replay_buffer

    def sample_batch(self):
        # Sample a batch of transitions (q, s, a, s_, r, not_done) from replay buffer
        batch_q, batch_state, batch_mask, states_next, rewards, not_done =\
            map(lambda x: np.array(list(x)), zip(*random.sample(self.replay_buffer, PARAMS.batch_size)))
        
        # Stored states are (s_len, H, W); the network expects (H, W, s_len)
        batch_state = np.transpose(batch_state, axes=[0, 2, 3, 1])
        states_next = np.transpose(states_next, axes=[0, 2, 3, 1])
        # Boolean one-hot mask of the actions taken
        batch_mask = dense_to_one_hot(batch_mask, PARAMS.n_actions)
        q_next = self.target_network.predict(states_next)
        # TD target for the taken action; not_done is 0 at terminal states
        batch_q[batch_mask] = np.array(rewards) + PARAMS.gamma * np.array(not_done) * np.max(q_next, axis=1)
        return batch_q, batch_state, batch_mask

    @staticmethod
    def build_train_network():
        X = Input(shape=PARAMS.input_shape[1:], dtype='float32')
        mask = Input(shape=(PARAMS.n_actions,), dtype='float32')
        q_out, model = Agent.build_network(X)
        q_ = Lambda(lambda x: K.reshape(K.sum(x * mask, axis=1), (-1, 1)), output_shape=(1,))(q_out)
        return K.function([X], [q_out]), Model(inputs=[X, mask], outputs=q_)

    @staticmethod
    def huber_loss(x, y):
        error = K.abs(x - y)
        quadratic_part = K.clip(error, 0.0, 1.0)
        linear_part = error - quadratic_part
        loss = K.mean(0.5 * K.square(quadratic_part) + linear_part, axis=-1)
        return loss

    @staticmethod
    def build_target_network():
        X = Input(shape=PARAMS.input_shape[1:], dtype='float32')
        Q, model = Agent.build_network(X, trainable=False, init=initializers.zeros())
        return model

    @staticmethod
    def build_network(X, trainable=True, init=initializers.truncated_normal(stddev=0.01)):
        init_w = init
        init_b = initializers.constant(0.)
        normed = Lambda(lambda x: x / 255., output_shape=K.int_shape(X)[1:])(X)
        h_conv1 = Convolution2D(32, (8, 8), strides=(4, 4),
                                kernel_initializer=init_w, use_bias=False, padding='same')(normed)
        h_ln1 = LayerNormalization(activation=K.relu)(h_conv1)
        h_conv2 = Convolution2D(64, (4, 4), strides=(2, 2),
                                kernel_initializer=init_w, use_bias=False, padding='same')(h_ln1)
        h_ln2 = LayerNormalization(activation=K.relu)(h_conv2)
        h_conv3 = Convolution2D(64, (3, 3), strides=(1, 1),
                                kernel_initializer=init_w, use_bias=False, padding='same')(h_ln2)
        h_ln3 = LayerNormalization(activation=K.relu)(h_conv3)
        h_flat = Flatten()(h_ln3)
        fc1 = Dense(512, use_bias=False, kernel_initializer=init_w)(h_flat)
        h_ln_fc1 = LayerNormalization(activation=K.relu)(fc1)
        q = Dense(PARAMS.n_actions, kernel_initializer=init_w, use_bias=False, bias_initializer=init_b)(h_ln_fc1)
        # q = LayerNormalization()(fc2)
        model = Model(inputs=X, outputs=q)
        model.trainable = trainable
        return q, model

    def train(self):
        batch_q, batch_state, batch_mask = self.sample_batch()
        self.train_network.fit([batch_state, batch_mask], np.sum(batch_mask * batch_q, axis=1), verbose=0)

    def update_epsilon(self):
        PARAMS.epsilon = np.maximum(PARAMS.final_epsilon,
                                  PARAMS.epsilon - (PARAMS.init_epsilon - PARAMS.final_epsilon) / PARAMS.exploration_steps)

    def predict(self, state):
        q = self.q_out([state])
        q = np.array(q).flatten()
        # print(np.argmax(q))
        # print(q)
        return q, np.argmax(q)

    def update_learning_rate(self):
        PARAMS.learning_rate = PARAMS.learning_rate * (0.99 ** (PARAMS.learning_rate_decay_step / 100))
        K.set_value(self.train_network.optimizer.lr, PARAMS.learning_rate)
        PARAMS.learning_rate_decay_step += 1

    def update_target_network(self):
        self.target_network.set_weights(self.train_network.get_weights())

    # Save model weights and PARAMS
    def save(self, best=False):
        os.makedirs((PARAMS.saved_models_path), exist_ok=True)
        
        # Update file paths
        if best:
            PARAMS.saved_train_network_filepath = '%s/train_network_episode_%d_max_r_%d_best.h5' % \
                (PARAMS.saved_models_path, PARAMS.episode_i, PARAMS.max_episode_reward)
            PARAMS.saved_target_network_filepath = '%s/target_network_episode_%d_max_r_%d_best.h5' % \
                (PARAMS.saved_models_path, PARAMS.episode_i, PARAMS.max_episode_reward)
            PARAMS.saved_PARAMS_filepath = '%s/PARAMS_episode_%d_max_r_%d_best.hkl' % \
                (PARAMS.saved_models_path, PARAMS.episode_i, PARAMS.max_episode_reward)
            # PARAMS.saved_replay_buffer_filepath = '%s/replay_buffer_episode_%d_max_r_%d_best.hkl' % \
            #     (PARAMS.saved_models_path, PARAMS.episode_i, PARAMS.max_episode_reward)
        else:
            PARAMS.saved_train_network_filepath = '%s/train_network_episode_%d.h5' % \
                (PARAMS.saved_models_path, PARAMS.episode_i)
            PARAMS.saved_target_network_filepath = '%s/target_network_episode_%d.h5' % \
                (PARAMS.saved_models_path, PARAMS.episode_i)
            # PARAMS.saved_PARAMS_filepath = '%s/PARAMS_episode_%d.hkl' % \
            #     (PARAMS.saved_models_path, PARAMS.episode_i)
            # PARAMS.saved_replay_buffer_filepath = '%s/replay_buffer_episode_%d.hkl' % \
            #     (PARAMS.saved_models_path, PARAMS.episode_i)
        
        self.train_network.save_weights(PARAMS.saved_train_network_filepath)
        self.target_network.save_weights(PARAMS.saved_target_network_filepath)
        
        # Parameters to save
        save_PARAMS = (PARAMS.epsilon,
                       PARAMS.learning_rate,
                       PARAMS.learning_rate_decay_step,
                       PARAMS.cnt_frames,
                       PARAMS.episode_i,
                       PARAMS.max_episode_reward)
        hkl.dump(save_PARAMS, PARAMS.saved_PARAMS_filepath, mode='w')
        
        # Save replay buffer
        # hkl.dump(self.replay_buffer, PARAMS.saved_replay_buffer_filepath, mode='w', compression='gzip')
    
    # Load model weights and PARAMS
    def restore(self, episode_i, best, max_episode_reward):
        if best:
            saved_train_network_filepath = '%s/train_network_episode_%d_max_r_%d_best.h5' % \
                (PARAMS.saved_models_path, episode_i, max_episode_reward)
            saved_target_network_filepath = '%s/target_network_episode_%d_max_r_%d_best.h5' % \
                (PARAMS.saved_models_path, episode_i, max_episode_reward)
            saved_PARAMS_filepath = '%s/PARAMS_episode_%d_max_r_%d_best.hkl' % \
                (PARAMS.saved_models_path, episode_i, max_episode_reward)
            # saved_replay_buffer_filepath = '%s/replay_buffer_episode_%d_max_r_%d_best.hkl' % \
            #     (PARAMS.saved_models_path, episode_i, max_episode_reward)
        else:
            saved_train_network_filepath = '%s/train_network_episode_%d.h5' % \
                (PARAMS.saved_models_path, episode_i)
            saved_target_network_filepath = '%s/target_network_episode_%d.h5' % \
                (PARAMS.saved_models_path, episode_i)
            saved_PARAMS_filepath = '%s/PARAMS_episode_%d.hkl' % \
                (PARAMS.saved_models_path, episode_i)
            # saved_replay_buffer_filepath = '%s/replay_buffer_episode_%d.hkl' % \
            #     (PARAMS.saved_models_path, episode_i)
            
        if path.exists(saved_train_network_filepath) and \
            path.exists(saved_target_network_filepath) and \
            path.exists(saved_PARAMS_filepath):
            # path.exists(saved_replay_buffer_filepath):
            
            self.train_network.load_weights(saved_train_network_filepath)
            self.target_network.load_weights(saved_target_network_filepath)
            
            PARAMS.load(hkl.load(saved_PARAMS_filepath))
            
            print()
            print("====== Models and Parameters Loaded! ======")
            print("%s, %s, and %s loaded!" % (saved_train_network_filepath,
                                             saved_target_network_filepath,
                                             saved_PARAMS_filepath))
            print("Current PARAMS.epsilon: ", PARAMS.epsilon,
                    " PARAMS.learning_rate: ", PARAMS.learning_rate,
                    " PARAMS.learning_rate_decay_step: ", PARAMS.learning_rate_decay_step,
                    " PARAMS.cnt_frames: ", PARAMS.cnt_frames,
                    " PARAMS.episode_i: ", PARAMS.episode_i,
                    " PARAMS.max_episode_reward: ", PARAMS.max_episode_reward)
            
            # Load replay buffer
            # self.replay_buffer = hkl.load(saved_replay_buffer_filepath)
            # print("%s loaded!" % saved_replay_buffer_filepath)
            
        else:
            print("====== Saved files do not exist. Starting from episode 0. ======")
            PARAMS.episode_i = 0        
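
The core arithmetic of `sample_batch` and `huber_loss` can be checked in isolation. The NumPy sketch below computes the one-step TD target, writes it into the slot of the action taken via a boolean one-hot mask, and evaluates the Huber loss on two sample errors. The helper names `td_targets` and `huber` and all numbers are made up for illustration.

```python
import numpy as np

GAMMA = 0.99
N_ACTIONS = 4


def td_targets(rewards, not_done, q_next, gamma=GAMMA):
    """One-step target: r + gamma * max_a' Q_target(s', a'), zeroed at terminals."""
    return rewards + gamma * not_done * np.max(q_next, axis=1)


def huber(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    abs_err = np.abs(error)
    quadratic = np.clip(abs_err, 0.0, delta)
    return 0.5 * quadratic ** 2 + delta * (abs_err - quadratic)


# Two illustrative transitions; the second one ends its episode.
actions = np.array([1, 3])
rewards = np.array([1.0, 0.0])
not_done = np.array([1, 0])
q_next = np.array([[0.5, 2.0, 1.0, 0.0],
                   [3.0, 1.0, 0.0, 0.0]])

targets = td_targets(rewards, not_done, q_next)

# Boolean one-hot mask selects the action taken in each row.
mask = np.arange(N_ACTIONS) == actions[:, None]
batch_q = np.zeros((2, N_ACTIONS))
batch_q[mask] = targets

losses = huber(np.array([0.5, 3.0]))
print(targets, batch_q, losses)
```

With `delta=1.0` the `huber` helper matches the clip-based form used in `Agent.huber_loss`: errors up to 1 are penalized quadratically and larger errors linearly, which limits the gradient from outlier targets.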

# **Environment**

In [18]:
# Create Environment
PARAMS.env = gym.make(PARAMS.env_id)
PARAMS.env = wrap_env(PARAMS.env)
PARAMS.env.seed(PARAMS.seed_num)

s_dim = PARAMS.env.observation_space.shape[0]
PARAMS.n_actions = PARAMS.env.action_space.n

print("s_dim:", s_dim, " PARAMS.env.observation_space.shape: ", PARAMS.env.observation_space.shape)
print("PARAMS.n_actions: ", PARAMS.n_actions, " PARAMS.env.action_space: ", PARAMS.env.action_space)
# print("PARAMS.env.observation_space.high: ", PARAMS.env.observation_space.high)
# print("PARAMS.env.observation_space.low: ", PARAMS.env.observation_space.low)
# print("PARAMS.env.action_space.high: ", PARAMS.env.action_space.high)
# print("PARAMS.env.action_space.low: ", PARAMS.env.action_space.low)
print("PARAMS.env.unwrapped.get_action_meanings(): ", PARAMS.env.unwrapped.get_action_meanings())
print("PARAMS.env.unwrapped.ale.lives(): ", PARAMS.env.unwrapped.ale.lives())

  result = entry_point.load(False)


s_dim: 210  PARAMS.env.observation_space.shape:  (210, 160, 3)
PARAMS.n_actions:  4  PARAMS.env.action_space:  Discrete(4)
PARAMS.env.unwrapped.get_action_meanings():  ['NOOP', 'FIRE', 'RIGHT', 'LEFT']
PARAMS.env.unwrapped.ale.lives():  5
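
Given the `(210, 160, 3)` observation shape printed above, the preprocessing helpers reduce each frame before it reaches the network. The sketch below traces the shapes with NumPy only. `to_gray` is a hypothetical luminance-weighted stand-in for the `cv2.cvtColor` call in `rgb2gray`, and the all-zero frame is a placeholder observation.

```python
import numpy as np

S_LEN = 4  # number of stacked frames, as in Params.s_len


def to_gray(rgb):
    """Luminance-weighted grayscale; a NumPy stand-in for cv2.cvtColor."""
    weights = np.array([0.299, 0.587, 0.114])
    return (rgb @ weights).astype(np.uint8)


def down_sample(gray):
    """Drop the top 25 rows, then keep every second row and column."""
    return gray[25::2, ::2]


obs = np.zeros((210, 160, 3), dtype=np.uint8)  # placeholder observation
frame = down_sample(to_gray(obs))

# Stack S_LEN frames and move the stack axis last, as the training loop does.
state = np.expand_dims(np.transpose([frame] * S_LEN, [1, 2, 0]), axis=0)
print(frame.shape, state.shape)  # (93, 80) (1, 93, 80, 4)
```

The resulting batch shape agrees with `input_shape = (None, 93, 80, self.s_len)` in `Params`.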


# Training

In [19]:
################
### Training ###
################
# Let the agent interact with the environment

# Create an instance of DQN Agent.
agent = Agent()

# -------------
# Load weights
# -------------------------------------------------------
# Need to Specify episode_i, best and max_episode_reward.
# If you train the agent from scratch, comment out the 
# following lines of code.
# -------------------------------------------------------
# episode_i = 700
# max_episode_reward = 1 # Ignored when best=False, so any number works
# best = False
# agent.restore(episode_i=episode_i, best=best, max_episode_reward=max_episode_reward)

# -----------------------------
# Iterate through all episodes
# -----------------------------
print()
print("====== Start Interacting with Env ======")
while PARAMS.episode_i < PARAMS.max_episodes:
    obs = PARAMS.env.reset()
    frame = down_sample(rgb2gray(obs))
    frame_stack = [frame, frame, frame, frame]
    episode_reward = 0
    
    # Iterate through all steps
    for t in range(PARAMS.max_steps):
        PARAMS.env.render()
        
        PARAMS.cnt_frames += 1
        # s: current state
        s = frame_stack[-PARAMS.s_len:]
        # q: current q, a: current action
        q, a = agent.predict(
                    np.expand_dims(np.transpose(
                    s, [1, 2, 0]), axis=0))
        
        # With probability epsilon, take a random action to explore
        if np.random.random() < PARAMS.epsilon:
            a = PARAMS.env.action_space.sample()
        
        # One Step
        obs, r, done, info = PARAMS.env.step(a) 
        # r: immediate reward, done: terminal state indicator
        
        next_frame = down_sample(rgb2gray(obs))
        frame_stack.append(next_frame)
        # s_: next state
        s_ = frame_stack[-PARAMS.s_len:]
        
        # Store int(not done) so terminal transitions zero out the bootstrap term
        transition = (q, s, a, s_, r, int(not done))
        # Append a transition into replay buffer
        agent.remember(transition)
        
        if len(agent.replay_buffer) >= PARAMS.batch_size and \
            PARAMS.cnt_frames % PARAMS.train_frame_interval == 0:
            agent.train()
            agent.update_epsilon()
        
        frame_stack = frame_stack[-PARAMS.s_len:]
        
        episode_reward += r
        
        if done:
            break
    # -------------------
    # End of one episode
    # -------------------
    
    # -------------------
    # Log episode reward
    # -------------------
    PARAMS.log_file.write(str(episode_reward) + '\n')
    PARAMS.log_file.flush()
    print("PARAMS.episode_i: ", PARAMS.episode_i, "episode reward: ", episode_reward, " Episode finished after {} timesteps".format(t+1) + "\n"\
          "    PARAMS.epsilon: ", PARAMS.epsilon, " PARAMS.learning_rate: ", PARAMS.learning_rate)
    
    # Update PARAMS.learning_rate
    if np.abs(PARAMS.epsilon - PARAMS.final_epsilon) < 1e-5 and PARAMS.learning_rate > PARAMS.final_learning_rate:
        agent.update_learning_rate()
        # print("agent.update_learning_rate() Done!")
    # Update target network
    if PARAMS.episode_i % PARAMS.update_target_network_episode_interval == 0:
        agent.update_target_network()
        # print("agent.update_target_network() Done!")
        
    # ------------------------------
    # Save model weights and PARAMS
    # ------------------------------
    if episode_reward > PARAMS.max_episode_reward:
        PARAMS.max_episode_reward = episode_reward
        agent.save(best=True)
        print("PARAMS.max_episode_reward: ", PARAMS.max_episode_reward, " agent.save() Done!")
    elif PARAMS.episode_i % PARAMS.save_model_episode_interval == 0:
        agent.save()
        # print("agent.save() Done!")
    
        
    PARAMS.episode_i += 1
    
PARAMS.env.close()

Instructions for updating:
Colocations handled automatically by placer.

Instructions for updating:
Use tf.cast instead.
PARAMS.episode_i:  0 episode reward:  1.0  Episode finished after 173 timesteps
    PARAMS.epsilon:  0.9999287200000015  PARAMS.learning_rate:  0.0001
PARAMS.max_episode_reward:  1.0  agent.save() Done!
PARAMS.episode_i:  1 episode reward:  2.0  Episode finished after 227 timesteps
    PARAMS.epsilon:  0.999815860000004  PARAMS.learning_rate:  0.0001
PARAMS.max_episode_reward:  2.0  agent.save() Done!
PARAMS.episode_i:  2 episode reward:  1.0  Episode finished after 169 timesteps
    PARAMS.epsilon:  0.9997327000000058  PARAMS.learning_rate:  0.0001
PARAMS.episode_i:  3 episode reward:  2.0  Episode finished after 187 timesteps
    PARAMS.epsilon:  0.9996396400000078  PARAMS.learning_rate:  0.0001
PARAMS.episode_i:  4 episode reward:  0.0  Episode finished after 134 timesteps
    PARAMS.epsilon:  0.9995743000000092  PARAMS.learning_rate:  0.0001
PARAMS.episode_i:  5 

Error: Tried to reset environment which is not done. While the monitor is active for BreakoutDeterministic-v4, you cannot call reset() unless the episode is over.
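
The error above is raised by the `Monitor` wrapper, which rejects `reset()` while an episode is still running. With the loop above this can happen if the `for` loop ends at `max_steps` before `done` is set, or if the cell is interrupted and re-run mid-episode; the log does not show which one occurred here. The sketch below illustrates the first case and one possible workaround, using a hypothetical `StrictResetEnv` stub (not a Gym class) that mimics the reset check.

```python
class StrictResetEnv:
    """Hypothetical stand-in that, like Monitor, rejects reset() mid-episode."""

    def __init__(self, episode_length):
        self.episode_length = episode_length
        self.steps = 0
        self.done = True  # no episode in progress yet

    def reset(self):
        if not self.done:
            raise RuntimeError("reset() called before the episode is over")
        self.steps, self.done = 0, False
        return 0

    def step(self, action):
        self.steps += 1
        self.done = self.steps >= self.episode_length
        return 0, 0.0, self.done, {}


def run_capped(env, n_episodes, max_steps):
    """Step loop capped at max_steps; may leave an episode unfinished."""
    for _ in range(n_episodes):
        env.reset()
        for _ in range(max_steps):
            _, _, done, _ = env.step(0)
            if done:
                break


def run_until_done(env, n_episodes, max_steps):
    """Keep stepping until done; only count work for the first max_steps."""
    results = []
    for _ in range(n_episodes):
        env.reset()
        steps, trained, done = 0, 0, False
        while not done:
            _, _, done, _ = env.step(0)
            steps += 1
            if steps <= max_steps:
                trained += 1  # placeholder for remember() / train()
        results.append((steps, trained))
    return results


try:
    run_capped(StrictResetEnv(episode_length=12), n_episodes=2, max_steps=10)
    capped_error = None
except RuntimeError as exc:
    capped_error = str(exc)

results = run_until_done(StrictResetEnv(episode_length=12), n_episodes=3, max_steps=10)
print(capped_error, results)
```

Whether to keep training past `max_steps` or only finish the episode without learning is a design choice; the stub version shows the latter.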

In [None]:
show_video()