## **OpenAI Gym, PyBullet and PyBulletGym Installation**
[Click here to see the Gym documentation](https://gym.openai.com/docs/)

[Click here to see the PyBullet documentation](https://docs.google.com/document/d/10sXEhzFRSnvFcl3XxNGhnD4N2SedqwdAvK3dsihxVUA)

[Click here to see the PyBulletGym page](https://github.com/benelot/pybullet-gym)



Note that this assignment was done on a remote server.

**Before we start, update the apt package lists on the machine.**

In [0]:
import sys
print(sys.version)

3.7.4 (default, Aug 13 2019, 20:35:49) 
[GCC 7.3.0]


In [0]:
# !apt-get update

Most of the required Python packages are already installed on Colab. To run Gym, you have to install prerequisites such as xvfb, OpenGL, and other python-dev packages using the following commands.

In [0]:
# !pip install gym
# !apt-get install python-opengl -y
# !apt install xvfb -y

To render the environment, you can use pyvirtualdisplay, so install it as well:

In [0]:
# !pip install pyvirtualdisplay
# !pip install pyglet

In [0]:
# !pip install pybullet==2.5.9

In [0]:
# !git clone https://github.com/benelot/pybullet-gym.git # should already be there in my Google Drive

## **Update the source code**
In pybulletgym/envs/mujoco/envs/pendulum/inverted_pendulum_env.py, line 32, change

done = not np.isfinite(state).all() or np.abs(state[1]) > .2

to

done = abs(state[0][0]) > 2.4 or abs(state[0][1]) > 0.27

**Restart runtime and run the following cells.**

In [0]:
# cd /content/pybullet-gym/ # use the address below instead

In [0]:
# cd './pybullet-gym'

In [0]:
# !pip install -e .

Import everything.

In [0]:
from __future__ import division
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) # error only

import pybulletgym  # register PyBullet environments with OpenAI Gym
import pybullet
import pybullet_data

import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
import os
from os import path

from IPython.display import HTML
from IPython import display as ipythondisplay

# Colab comes with PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from collections import namedtuple
from itertools import count
from PIL import Image
import shutil

import psutil
import gc

import statistics

from collections import deque

In [0]:
# ---------------
# All parameters
# ---------------
class Params:
    def __init__(self):
        # Parameters for this experiment
        self.exp_id = 'htc_actor_critic_pytorch_1'
        self.env_id = 'HalfCheetahMuJoCoEnv-v0' # 'HalfCheetahMuJoCoEnv-v0' # 'Pendulum-v0' # 'InvertedPendulumMuJoCoEnv-v0' #
        self.server_path = '/home/bryanbc/Repos/rl/'
        self.hw = 'hw02'
        self.video_path = self.server_path + self.hw + '/' + self.exp_id + '/video/'
        self.mp4list_path = self.video_path + '*.mp4'
        
        self.save_model_episode_interval = 50
        self.saved_models_path = '/ssd/bryanbc/saved_models/' + self.hw + '/' + self.exp_id
        self.train_start_episode = 0
        
        # Parameters for models
        self.learning_rate = 0.001
        self.batch_size = 128
        self.epsilon = 1.0
        self.epsilon_decay = 0.99
        self.gamma = 0.99
        self.tau   =  0.001 # 0.125
        self.max_buffer = 1000000
        self.dropout_rate = 0.5
        self.sigma = 0.01
        
        # Parameters for environment
        self.max_episodes = 5000
        self.max_steps = 1000
        self.EPS = 0.003
        
        self.log_path = '/ssd/bryanbc/data/logs/hw/' + self.hw + '/' + self.exp_id + '_'
        
PARAMS = Params()

In [0]:
# Open log files
log_file = open((PARAMS.log_path + 'episode_reward.log'), 'a')

To activate the virtual display, run the following script once before training an agent:

In [0]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':4797'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':4797'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

The following code creates a virtual display to draw game images on. If you are running locally, just ignore it.

In [0]:
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY"))==0:
    !bash ../xvfb start
    %env DISPLAY=:1

In [0]:
"""
Utility functions to enable video recording of gym environment and displaying it
To enable video, just do "env = wrap_env(env)"
"""
# mp4list_path_colab = '/content/gdrive/My Drive/video/*.mp4'
def show_video():
    mp4list = glob.glob(PARAMS.mp4list_path) # glob.glob('/content/video/*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
                </video>'''.format(encoded.decode('ascii'))))
    else: 
        print("Could not find video")
    
# video_path_colab = '/content/gdrive/My Drive/video/'
def wrap_env(env):
    env = Monitor(env, PARAMS.video_path, force=True) # Monitor(env, '/content/video', force=True)
    return env

# **HalfCheetahMuJoCoEnv-v0**

In [0]:
# Create Environment
PARAMS.env = gym.make(PARAMS.env_id)
PARAMS.env = wrap_env(PARAMS.env)

s_dim = PARAMS.env.observation_space.shape[0]
a_dim = PARAMS.env.action_space.shape[0]

print("s_dim:", s_dim, " PARAMS.env.observation_space.shape: ", PARAMS.env.observation_space.shape)
print("a_dim: ", a_dim, " PARAMS.env.action_space.shape: ", PARAMS.env.action_space.shape)
print("PARAMS.env.action_space.high: ", PARAMS.env.action_space.high)
print("PARAMS.env.action_space.low: ", PARAMS.env.action_space.low)

  result = entry_point.load(False)


current_dir=/home/bryanbc/Apps/anaconda3/lib/python3.7/site-packages/pybullet_envs/bullet
WalkerBase::__init__
s_dim: 17  PARAMS.env.observation_space.shape:  (17,)
a_dim:  6  PARAMS.env.action_space.shape:  (6,)
PARAMS.env.action_space.high:  [1. 1. 1. 1. 1. 1.]
PARAMS.env.action_space.low:  [-1. -1. -1. -1. -1. -1.]


# **RL Algorithms**
Since the action space is continuous, I use DDPG (an actor-critic method for continuous actions): https://arxiv.org/pdf/1509.02971.pdf. Note that $\log π_{θ}(a|s)$ and the advantage term do not appear in the code. They are used to approximate the policy gradient $∇_{θ}J(θ)$ efficiently in stochastic policy-gradient methods, whereas DDPG computes $∇_{θ}J(θ)$ in a different way, as shown in the report and the paper.
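
As a brief sketch of that different gradient, the updates from the DDPG paper can be written as below. The notation is my own shorthand and an assumption relative to the code: $\mu_{\theta}$ is the actor, $Q_{\phi}$ is the critic, primes mark the target networks, and $N$ is the batch size.

$$y_i = r_i + \gamma \, Q_{\phi'}\big(s_{i+1}, \mu_{\theta'}(s_{i+1})\big)$$

$$\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i} \nabla_{a} Q_{\phi}(s, a)\big|_{s=s_i,\, a=\mu_{\theta}(s_i)} \; \nabla_{\theta}\mu_{\theta}(s)\big|_{s=s_i}$$

$$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad \phi' \leftarrow \tau\phi + (1-\tau)\phi'$$

The critic is regressed toward $y_i$ (the paper uses a squared error; the code in this notebook uses a smooth L1 loss instead), and the actor follows the second expression by backpropagating $-Q_{\phi}(s, \mu_{\theta}(s))$ through the critic.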

My code is based on these tutorials: https://github.com/vy007vikas/PyTorch-ActorCriticRL and https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html. Note that the action values lie in [-1, 1] in this homework, so tanh is used as the activation of the actor's last layer to match that range.

# Utility

In [0]:
# ----------------------------------------------------------------------------------- 
# OrnsteinUhlenbeckActionNoise to simulate the dynamics noise in the physical world
# -----------------------------------------------------------------------------------

# Based on http://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab
class OrnsteinUhlenbeckActionNoise:

    def __init__(self, a_dim, mu = 0, theta = 0.15, sigma = 0.2):
        self.a_dim = a_dim
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.X = np.ones(self.a_dim) * self.mu

    def reset(self):
        self.X = np.ones(self.a_dim) * self.mu

    def sample(self):
        dx = self.theta * (self.mu - self.X)
        dx = dx + self.sigma * np.random.randn(len(self.X))
        self.X = self.X + dx
        return self.X
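
To make the noise process easier to inspect on its own, here is a minimal standard-library sketch of the same update rule as `sample()` above, followed by the add-noise-and-clip step used for exploration. The helper names `ou_step` and `noisy_action` and the two-dimensional placeholder action are hypothetical and only serve the illustration.

```python
import random

def ou_step(x, mu=0.0, theta=0.15, sigma=0.2, rng=random):
    """One Ornstein-Uhlenbeck update per dimension: pull toward mu, then add Gaussian noise."""
    return [xi + theta * (mu - xi) + sigma * rng.gauss(0.0, 1.0) for xi in x]

def noisy_action(action, noise, low=-1.0, high=1.0):
    """Add exploration noise to an action and clip each component to [low, high]."""
    return [min(high, max(low, a + n)) for a, n in zip(action, noise)]

rng = random.Random(0)      # seeded generator so repeated runs give the same samples
x = [0.0, 0.0]              # noise state starts at mu, as in reset()
for _ in range(3):
    x = ou_step(x, rng=rng)  # consecutive samples are correlated through x

print(noisy_action([0.9, -0.9], x))
```

With `sigma=0` the state simply decays toward `mu` by a factor of `1 - theta` per step, which is the mean-reverting part of the process.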

# ReplayBuffer

In [0]:
'''
ReplayBuffer
'''
class ReplayBuffer:

    def __init__(self, size):
        self.buffer = deque(maxlen=size)
        self.max_size = size
        self.len = 0

    def sample(self, count):
        '''
        Sample a random batch from the replay buffer.
            input:
                count: batch size
            return:
                s_arr, a_arr, r_arr, s1_arr (float32 numpy arrays)
        '''
        batch = []
        count = min(count, self.len)
        
        # sample a batch of transitions from replay buffer
        batch = random.sample(self.buffer, count)

        s_arr = np.float32([arr[0] for arr in batch])
        a_arr = np.float32([arr[1] for arr in batch])
        r_arr = np.float32([arr[2] for arr in batch])
        s1_arr = np.float32([arr[3] for arr in batch])

        return s_arr, a_arr, r_arr, s1_arr

    def __len__(self):
        # A method named len would be shadowed by the instance attribute self.len
        return self.len

    def remember(self, s, a, r, s1):
        '''
        Append a transition in the replay buffer.
            input:
                transition (s, a, r, s1)
        '''
        transition = (s, a, r, s1)
        self.len += 1
        if self.len > self.max_size:
            self.len = self.max_size
        self.buffer.append(transition)
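
As a small illustration of how the buffer behaves once it is full, the sketch below uses a `deque` with a tiny `maxlen` and placeholder transitions (the scalar states and rewards are made up for the example). It shows the oldest transitions being evicted and a batch being split into per-field tuples, similar to what `sample()` does.

```python
import random
from collections import deque

buffer = deque(maxlen=3)    # tiny capacity so that eviction is visible
for step in range(5):
    # hypothetical transition (s, a, r, s1) with scalar placeholders
    buffer.append((step, 0.0, float(step), step + 1))

oldest_state = buffer[0][0]  # transitions 0 and 1 have been dropped

# Clamp the batch size to the number of stored transitions, as sample() does
batch = random.sample(list(buffer), min(2, len(buffer)))
states, actions, rewards, next_states = zip(*batch)
print(oldest_state, states, rewards)
```

Because `deque(maxlen=...)` discards old entries by itself, `len(buffer)` always equals the number of stored transitions.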

# Actor Critic Network

In [0]:
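def fanin_init(size, fanin=None):
    # Note: nn.Linear stores its weight as (out_features, in_features), so
    # size[0] is the layer's output width; pass fanin explicitly to use in_features.
    fanin = fanin or size[0]
    v = 1. / np.sqrt(fanin)
    return torch.Tensor(size).uniform_(-v, v)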

'''
Actor
'''
class Actor(nn.Module):

    def __init__(self, s_dim, a_dim):
        '''
        input:
            s_dim: state dimension (int)
            a_dim: output action dimension (int)
        '''
        super(Actor, self).__init__()

        self.s_dim = s_dim
        self.a_dim = a_dim

        self.fc1 = nn.Linear(s_dim, 256)
        self.fc1.weight.data = fanin_init(self.fc1.weight.data.size())

        self.fc2 = nn.Linear(256, 128)
        self.fc2.weight.data = fanin_init(self.fc2.weight.data.size())

        self.fc3 = nn.Linear(128, 64)
        self.fc3.weight.data = fanin_init(self.fc3.weight.data.size())

        self.fc4 = nn.Linear(64, a_dim)
        self.fc4.weight.data.uniform_(-PARAMS.EPS,PARAMS.EPS)

    def forward(self, s):
        '''
        Return policy function Pi(s) obtained from actor network.
        '''
        x = F.relu(self.fc1(s))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        a = torch.tanh(self.fc4(x)) # tanh to match the action space [-1, 1]; F.tanh is deprecated
        return a


'''
Critic
'''
class Critic(nn.Module):

    def __init__(self, s_dim, a_dim):
        '''
        input:
            s_dim: state dimension (int)
            a_dim: action dimension (int)
        '''
        super(Critic, self).__init__()

        self.s_dim = s_dim
        self.a_dim = a_dim

        self.fc_s1 = nn.Linear(s_dim, 256)
        self.fc_s1.weight.data = fanin_init(self.fc_s1.weight.data.size())
        self.fc_s2 = nn.Linear(256, 128)
        self.fc_s2.weight.data = fanin_init(self.fc_s2.weight.data.size())

        self.fc_a1 = nn.Linear(a_dim, 128)
        self.fc_a1.weight.data = fanin_init(self.fc_a1.weight.data.size())

        self.fc2 = nn.Linear(256, 128)
        self.fc2.weight.data = fanin_init(self.fc2.weight.data.size())

        self.fc3 = nn.Linear(128, 1)
        self.fc3.weight.data.uniform_(-PARAMS.EPS, PARAMS.EPS)

    def forward(self, s, a):
        s1 = F.relu(self.fc_s1(s))
        s2 = F.relu(self.fc_s2(s1))
        a1 = F.relu(self.fc_a1(a))
        x = torch.cat((s2, a1), dim=1)

        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
    

'''
Actor-critic
'''
class ActorCritic:

    def __init__(self, s_dim, a_dim, replay_buffer):
        '''
        input:
            s_dim: state dimension (int)
            a_dim: output action dimension (int)
            replay_buffer: replay_buffer object
        '''
        self.s_dim = s_dim
        self.a_dim = a_dim
        self.replay_buffer = replay_buffer
        self.noise = OrnsteinUhlenbeckActionNoise(self.a_dim)

        # ------------------------------------------------------
        #  Create network and target network instance for Actor
        # ------------------------------------------------------
        self.actor = Actor(self.s_dim, self.a_dim)
        self.target_actor = Actor(self.s_dim, self.a_dim)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), PARAMS.learning_rate)

        # ------------------------------------------------------
        #  Create network and target network instance for Critic
        # ------------------------------------------------------
        self.critic = Critic(self.s_dim, self.a_dim)
        self.target_critic = Critic(self.s_dim, self.a_dim)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), PARAMS.learning_rate)

        self.hard_update(self.target_actor, self.actor)
        self.hard_update(self.target_critic, self.critic)
        
    def select_action(self, s, a_type="exploitation"):
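        '''
        input:
            s: state (Numpy array)
            a_type: "exploration" (actor output plus noise) or "exploitation" (target actor, no noise)
        return:
            a: selected action (Numpy array), clipped to the action space bounds
        '''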
        if a_type == "exploration":
            '''
            Return an action from actor added with exploration noise.
            '''
            s = Variable(torch.from_numpy(s))
            a = self.actor.forward(s).detach()
            new_a = a.data.numpy() + (self.noise.sample())
            return np.clip(new_a, PARAMS.env.action_space.low, PARAMS.env.action_space.high)
        else:
            '''
            a_type="exploitation"
            Return an action from target actor without exploration noise
            '''
            s = Variable(torch.from_numpy(s))
            a = self.target_actor.forward(s).detach()
        return np.clip(a.data.numpy(), PARAMS.env.action_space.low, PARAMS.env.action_space.high)

    def train(self):
        '''
        Sample a random batch from replay memory and train the actor-critic model.
        '''
        s1, a1, r1, s2 = self.replay_buffer.sample(PARAMS.batch_size)

        # Convert numpy variables to Pytorch ones.
        s1 = Variable(torch.from_numpy(s1))
        a1 = Variable(torch.from_numpy(a1))
        r1 = Variable(torch.from_numpy(r1))
        s2 = Variable(torch.from_numpy(s2))

        # -------------
        # Train critic
        # -------------
        # Use target actor exploitation policy here for loss evaluation
        a2 = self.target_actor.forward(s2).detach()
        next_v = torch.squeeze(self.target_critic.forward(s2, a2).detach())
        exp_y = r1 + PARAMS.gamma * next_v
        pred_y = torch.squeeze(self.critic.forward(s1, a1))
        
        # compute critic loss, and update the critic
        loss_critic = F.smooth_l1_loss(pred_y, exp_y)
        self.critic_optimizer.zero_grad()
        loss_critic.backward()
        self.critic_optimizer.step()

        # ------------
        # Train actor
        # ------------
        pred_a1 = self.actor.forward(s1)
        # "Using gradient ascent, we can move θ toward the direction 
        # suggested by the gradient ∇θJ(θ) to find the best θ for πθ 
        # that produces the highest return." 
        #   https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html
        # Thus the actor loss is negated.
        loss_actor = -1*torch.sum(self.critic.forward(s1, pred_a1))
        self.actor_optimizer.zero_grad()
        loss_actor.backward()
        self.actor_optimizer.step()

        # soft update: blend the source weights into the target networks instead of copying them directly
        self.soft_update(self.target_actor, self.actor, PARAMS.tau)
        self.soft_update(self.target_critic, self.critic, PARAMS.tau)

    def save_models(self, episode_i):
        torch.save(self.target_actor.state_dict(), \
                   '%s/actor_episode_%d.pt' % 
                        (PARAMS.saved_models_path, episode_i))
        torch.save(self.target_critic.state_dict(), \
                   '%s/critic_episode_%d.pt' % 
                        (PARAMS.saved_models_path, episode_i))
        print('Models %s/actor_episode_%d.pt and %s/critic_episode_%d.pt saved successfully' % 
                        (PARAMS.saved_models_path, episode_i,
                         PARAMS.saved_models_path, episode_i))

    def load_models(self, episode_i):
        self.actor.load_state_dict(torch.load('%s/actor_episode_%d.pt' % 
                        (PARAMS.saved_models_path, episode_i)))
        self.critic.load_state_dict(torch.load('%s/critic_episode_%d.pt' % 
                        (PARAMS.saved_models_path, episode_i)))
        self.hard_update(self.target_actor, self.actor)
        self.hard_update(self.target_critic, self.critic)
        print('Models %s/actor_episode_%d.pt and %s/critic_episode_%d.pt loaded successfully' % 
                        (PARAMS.saved_models_path, episode_i,
                         PARAMS.saved_models_path, episode_i))
        
    def soft_update(self, target, source, tau):
        '''
        Copies the parameters from source network (x) to target network (y) using the below update
        y = TAU*x + (1 - TAU)*y
        input:
            target: Target network (PyTorch)
            source: Source network (PyTorch)
        '''
        for target_param, param in zip(target.parameters(), source.parameters()):
            target_param.data.copy_(
                target_param.data * (1.0 - tau) + param.data * tau
            )

    def hard_update(self, target, source):
        '''
        Copies the parameters from source network to target network
        input:
            target: Target network (PyTorch)
            source: Source network (PyTorch)
        '''
        for target_param, param in zip(target.parameters(), source.parameters()):
            target_param.data.copy_(param.data)
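
To get a feel for how slowly the target networks follow the learned ones, here is a minimal sketch of the same soft-update rule on plain Python lists. The two-element "weights" are placeholders, and the standalone `soft_update` function is a simplified stand-in for the method above, not a replacement for it.

```python
def soft_update(target, source, tau):
    """Return target weights blended toward source: tau * source + (1 - tau) * target."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]

target, source = [0.0, 0.0], [1.0, -1.0]  # placeholder weights
tau = 0.001                               # same value as PARAMS.tau

for _ in range(1000):
    target = soft_update(target, source, tau)

# After n updates the remaining gap is (1 - tau) ** n of the initial gap
print(target)
```

With `tau = 0.001` the target has closed only about 63% of the gap after 1000 updates, which is what keeps the critic's regression target stable between steps.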

# Training

In [0]:
################
### Training ###
################
# Let the agent interact with the environment
total_steps = 0
last_ten_episode_rewards, last_ten_episode_rewards_i = [], 0

# the saved model to start training
episode_i = PARAMS.train_start_episode

# Create an instance of replay buffer.
replay_buffer = ReplayBuffer(PARAMS.max_buffer)
# Create an instance of ActorCritic model.
actor_critic = ActorCritic(s_dim, a_dim, replay_buffer)

# -------------
# Load weights
# -------------
os.makedirs(PARAMS.saved_models_path, exist_ok=True)
if path.exists('%s/actor_episode_%d.pt' % 
        (PARAMS.saved_models_path, episode_i)) and \
    path.exists('%s/critic_episode_%d.pt' % 
        (PARAMS.saved_models_path, episode_i)):
    actor_critic.load_models(episode_i)
else:
    print('%s/actor_episode_%d.pt or %s/critic_episode_%d.pt does not exist. Will train the model from episode 0.' % 
                        (PARAMS.saved_models_path, episode_i,
                         PARAMS.saved_models_path, episode_i))
    episode_i = 0

# -----------------------------
# Iterate through all episodes
# -----------------------------
while episode_i < PARAMS.max_episodes:
    obs = PARAMS.env.reset()
    episode_reward = 0

    # Iterate through all steps
    for t in range(PARAMS.max_steps):
        PARAMS.env.render()
        
        # s: current state
        s = np.float32(obs).flatten() # Flatten state into 1D array
               
        # Select and perform an action
        a = actor_critic.select_action(s, "exploration")
        # print("pred a: ", a) # Verify predicted action value
               
        # One Step
        obs, r, done, info = PARAMS.env.step(a) # r: immediate reward, done: terminal state indicator
        s1 = np.float32(obs).flatten() # Flatten state into 1D array
        
        # Store the transition in the replay buffer
        replay_buffer.remember(s, a, r, s1)
        
        # Train the networks
        actor_critic.train()
        
        episode_reward += r

        if done:
            s1 = None
            # -------------------
            # Log episode reward
            # -------------------
            log_file.write(str(episode_reward) + '\n')
            log_file.flush()
            
            if len(last_ten_episode_rewards) < 10:
                last_ten_episode_rewards.append(episode_reward)
                print("episode_i: ", episode_i, "episode reward: ", episode_reward, " Episode finished after {} timesteps".format(t+1))
            else:
                last_ten_episode_rewards[last_ten_episode_rewards_i] = episode_reward
                last_ten_episode_rewards_i += 1
                last_ten_episode_rewards_i %= 10
                print("episode_i: ", episode_i, "episode reward: ", episode_reward, " mean reward: ", np.mean(last_ten_episode_rewards), " Episode finished after {} timesteps".format(t+1))
            break
            
    # -------------
    # Save weights
    # -------------
    if episode_i % PARAMS.save_model_episode_interval == 0:
        os.makedirs((PARAMS.saved_models_path), exist_ok=True)
        actor_critic.save_models(episode_i)
        
    episode_i += 1
    
PARAMS.env.close()

/ssd/bryanbc/saved_models/hw02/htc_actor_critic_pytorch_1/actor_episode_0.pt or /ssd/bryanbc/saved_models/hw02/htc_actor_critic_pytorch_1/actor_episode_0.pt does not exist. Will train the model from episode 0.
options= 




episode_i:  0 episode reward:  -407.7514723315412  Episode finished after 1000 timesteps
Models /ssd/bryanbc/saved_models/hw02/htc_actor_critic_pytorch_1/actor_episode_0.pt and /ssd/bryanbc/saved_models/hw02/htc_actor_critic_pytorch_1/critic_episode_0.pt saved successfully




episode_i:  1 episode reward:  -114.077880891529  Episode finished after 1000 timesteps
episode_i:  2 episode reward:  -99.13269838623003  Episode finished after 1000 timesteps
episode_i:  3 episode reward:  -61.19484414589196  Episode finished after 1000 timesteps
episode_i:  4 episode reward:  -79.18515017953045  Episode finished after 1000 timesteps
episode_i:  5 episode reward:  -85.99512327213516  Episode finished after 1000 timesteps
episode_i:  6 episode reward:  -99.54744067070662  Episode finished after 1000 timesteps
episode_i:  7 episode reward:  -64.45424406464234  Episode finished after 1000 timesteps




episode_i:  8 episode reward:  -50.23435435130421  Episode finished after 1000 timesteps
episode_i:  9 episode reward:  -80.00551991182924  Episode finished after 1000 timesteps
episode_i:  10 episode reward:  -13.574291815624855  mean reward:  -74.74015476894239  Episode finished after 1000 timesteps
episode_i:  11 episode reward:  -47.714130513365674  mean reward:  -68.10377973112605  Episode finished after 1000 timesteps
episode_i:  12 episode reward:  -44.47870815439793  mean reward:  -62.63838070794284  Episode finished after 1000 timesteps
episode_i:  13 episode reward:  33.846028355389606  mean reward:  -53.13429345781468  Episode finished after 1000 timesteps
episode_i:  14 episode reward:  47.42494712752842  mean reward:  -40.4732837271088  Episode finished after 1000 timesteps
episode_i:  15 episode reward:  147.1959724773489  mean reward:  -17.154174152160397  Episode finished after 1000 timesteps
episode_i:  16 episode reward:  227.31112074394753  mean reward:  15.531681989



episode_i:  27 episode reward:  355.15055434455104  mean reward:  109.83890099474281  Episode finished after 1000 timesteps
episode_i:  28 episode reward:  388.29919702142485  mean reward:  143.13295494942003  Episode finished after 1000 timesteps
episode_i:  29 episode reward:  406.46636940936065  mean reward:  164.83251539350778  Episode finished after 1000 timesteps
episode_i:  30 episode reward:  440.03025317485594  mean reward:  189.71416603035522  Episode finished after 1000 timesteps
episode_i:  31 episode reward:  432.0529999099895  mean reward:  228.18158374020172  Episode finished after 1000 timesteps
episode_i:  32 episode reward:  497.3664760562754  mean reward:  295.458327905185  Episode finished after 1000 timesteps
episode_i:  33 episode reward:  355.9516943563558  mean reward:  317.36067744776557  Episode finished after 1000 timesteps
episode_i:  34 episode reward:  443.37562911346373  mean reward:  367.2630875340265  Episode finished after 1000 timesteps
episode_i:  35



episode_i:  64 episode reward:  924.6850624764329  mean reward:  843.1595873661139  Episode finished after 1000 timesteps
episode_i:  65 episode reward:  758.4142394981261  mean reward:  841.623404480724  Episode finished after 1000 timesteps
episode_i:  66 episode reward:  983.480807088133  mean reward:  853.0057276966863  Episode finished after 1000 timesteps
episode_i:  67 episode reward:  892.0800104470626  mean reward:  856.712791223428  Episode finished after 1000 timesteps
episode_i:  68 episode reward:  940.8284767821556  mean reward:  864.015935269778  Episode finished after 1000 timesteps
episode_i:  69 episode reward:  1088.5984083722465  mean reward:  884.8875630619484  Episode finished after 1000 timesteps
episode_i:  70 episode reward:  890.1231936842383  mean reward:  906.9066218409387  Episode finished after 1000 timesteps
episode_i:  71 episode reward:  994.3872735958681  mean reward:  913.0286549067248  Episode finished after 1000 timesteps
episode_i:  72 episode rewa



episode_i:  125 episode reward:  1352.5578059880506  mean reward:  1170.9045990646996  Episode finished after 1000 timesteps
episode_i:  126 episode reward:  1157.0741866591943  mean reward:  1177.0491746741538  Episode finished after 1000 timesteps
episode_i:  127 episode reward:  1249.811184964978  mean reward:  1180.1527834176554  Episode finished after 1000 timesteps
episode_i:  128 episode reward:  1120.1733616495178  mean reward:  1172.2406302323586  Episode finished after 1000 timesteps
episode_i:  129 episode reward:  1198.5333853548716  mean reward:  1175.7683430161471  Episode finished after 1000 timesteps
episode_i:  130 episode reward:  1154.690649207908  mean reward:  1168.7029663493297  Episode finished after 1000 timesteps
episode_i:  131 episode reward:  1244.0418481519077  mean reward:  1177.163857174362  Episode finished after 1000 timesteps
episode_i:  132 episode reward:  1133.652363508197  mean reward:  1180.7724808983562  Episode finished after 1000 timesteps
epis
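
The mean reward column in the log above is a mean over the ten most recent episodes, which the training loop maintains by hand with a list and a wrap-around index. A minimal alternative sketch, assuming only the standard library and using placeholder rewards, keeps the same window with `deque(maxlen=10)`; the helper name `record` is hypothetical.

```python
from collections import deque
from statistics import mean

last_ten = deque(maxlen=10)  # the oldest reward drops out automatically

def record(reward):
    """Store an episode reward and return the mean over the last ten (or fewer) episodes."""
    last_ten.append(reward)
    return mean(last_ten)

for r in range(12):          # placeholder rewards 0.0, 1.0, ..., 11.0
    rolling = record(float(r))

print(rolling)               # mean of the rewards 2.0 through 11.0
```

Unlike the loop above, this version also reports a mean during the first ten episodes, computed over however many rewards exist so far.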

In [0]:
show_video()

In [0]:
# -------------------------------------------------
#  Write log data for visualization in Tensorboard
# -------------------------------------------------
from tensorboardX import SummaryWriter

class DataLogger():
    def __init__(self):
        self.log_path = PARAMS.log_path
        self.logdir = self.log_path[:-1] + '/runs/'
        self.writer = SummaryWriter(logdir=self.logdir)
        
    def write2tb(self):
        # Write log data into tensorboard
        with open((PARAMS.log_path + 'episode_reward.log'), 'r') as log_file:
            for i, reward in enumerate(log_file):
                self.writer.add_scalar('episode_reward', float(reward), i)
        self.writer.close()  # flush pending events to disk
        

data_logger = DataLogger()
data_logger.write2tb()
print("Check tensorboard now.")

Check tensorboard now.
