## Setup

You will need to make a copy of this notebook in your Google Drive before you can edit the homework files. You can do so with **File &rarr; Save a copy in Drive**.

In [None]:
# #@title mount your Google Drive
# #@markdown Your work will be stored in a folder called `cs285_f2023` by default to prevent Colab instance timeouts from deleting your edits.

# import os
# from google.colab import drive
# drive.mount('/content/gdrive', force_remount=True)

In [3]:
# #@title set up mount symlink

# DRIVE_PATH = '/content/gdrive/My\ Drive/cs285_f2023'
# DRIVE_PYTHON_PATH = DRIVE_PATH.replace('\\', '')
# if not os.path.exists(DRIVE_PYTHON_PATH):
#   %mkdir $DRIVE_PATH

# ## the space in `My Drive` causes some issues,
# ## make a symlink to avoid this
# SYM_PATH = '/content/cs285_f2023'
# if not os.path.exists(SYM_PATH):
#   !ln -s $DRIVE_PATH $SYM_PATH

In [None]:
# #@title apt install requirements

# #@markdown Run each section with Shift+Enter

# #@markdown Double-click on section headers to show code.

# !apt update
# !apt install -y --no-install-recommends \
#         build-essential \
#         curl \
#         git \
#         gnupg2 \
#         make \
#         cmake \
#         ffmpeg \
#         swig \
#         libz-dev \
#         unzip \
#         zlib1g-dev \
#         libglfw3 \
#         libglfw3-dev \
#         libxrandr2 \
#         libxinerama-dev \
#         libxi6 \
#         libxcursor-dev \
#         libgl1-mesa-dev \
#         libgl1-mesa-glx \
#         libglew-dev \
#         libosmesa6-dev \
#         lsb-release \
#         ack-grep \
#         patchelf \
#         wget \
#         xpra \
#         xserver-xorg-dev \
#         ffmpeg
# !apt-get install python-opengl -y
# !apt install xvfb -y

In [None]:
# #@title clone homework repo

# %cd $SYM_PATH
# !git clone https://github.com/berkeleydeeprlcourse/homework_fall2023.git
# %cd hw1
# %pip install -r requirements_colab.txt
# %pip install -e .

In [None]:
# #@title set up virtual display

# from pyvirtualdisplay import Display

# display = Display(visible=0, size=(1400, 900))
# display.start()

In [None]:
#@title test virtual display

#@markdown If you see a video of a four-legged ant fumbling about, setup is complete!

import gym
from cs285.infrastructure.colab_utils import (
    wrap_env,
    show_video
)

env = wrap_env(gym.make("Ant-v4", render_mode='rgb_array'))

observation = env.reset()
for i in range(100):
    env.render()
    obs, rew, term, _ = env.step(env.action_space.sample())
    if term:
        break

env.close()
print('Loading video...')
show_video()

## Editing Code

To edit code, click the folder icon in the left menu. Navigate to the corresponding file (`cs285_f2023/...`). Double-click a file to open it in the editor. You will need to edit code in the following files:
```markdown
* cs285/policies/MLP_policy.py
* cs285/infrastructure/utils.py
* cs285/scripts/run_hw1.py
```

## Run Behavior Cloning (Problem 1)

Note that an active Colab instance times out after about 12 hours (sooner if you close your browser window). We sync your edits to Google Drive so that you won't lose your work if the instance times out, but you will need to re-mount your Google Drive and re-install packages with every new instance.

In [2]:
#@title imports

import time

from cs285.scripts.run_hw1 import run_training_loop

%load_ext autoreload
%autoreload 2



In [3]:
#@title runtime arguments

class Args:

  def __getitem__(self, key):
    return getattr(self, key)

  def __setitem__(self, key, val):
    setattr(self, key, val)

  #@markdown expert data
  # expert_policy_file = 'cs285/policies/experts/Ant.pkl' #@param
  expert_policy_file = '../../cs285/policies/experts/Ant.pkl' #@param
  # expert_data = 'cs285/expert_data/expert_data_Ant-v4.pkl' #@param
  expert_data = '../../cs285/expert_data/expert_data_Ant-v4.pkl' #@param
  env_name = 'Ant-v4' #@param ['Ant-v4', 'Walker2d-v4', 'HalfCheetah-v4', 'Hopper-v4']
  exp_name = 'bc_ant' #@param
  do_dagger = False #@param {type: "boolean"}
  ep_len = 1000 #@param {type: "integer"}
  save_params = False #@param {type: "boolean"}

  num_agent_train_steps_per_iter = 1000 #@param {type: "integer"}
  n_iter = 1 #@param {type: "integer"}

  #@markdown batches & buffers
  batch_size_initial = 2000 #@param {type: "integer"}
  batch_size = 1000 #@param {type: "integer"}
  eval_batch_size = 5000 #@param {type: "integer"}
  train_batch_size = 100 #@param {type: "integer"}
  max_replay_buffer_size = 1000000 #@param {type: "integer"}

  #@markdown network
  n_layers = 2 #@param {type: "integer"}
  size = 64 #@param {type: "integer"}
  learning_rate = 5e-3 #@param {type: "number"}

  #@markdown logging
  video_log_freq = 5 #@param {type: "integer"}
  scalar_log_freq = 1 #@param {type: "integer"}

  #@markdown gpu & run-time settings
  no_gpu = False #@param {type: "boolean"}
  which_gpu = 0 #@param {type: "integer"}
  seed = 1 #@param {type: "integer"}

args = Args()

In [4]:
#@title create directory for logging

import os

def create_log_dir(args, part=''):
    if args.do_dagger:
        logdir_prefix = 'q2_'  # The autograder uses the prefix `q2_`
        assert args.n_iter>1, ('DAgger needs more than 1 iteration (n_iter>1) of training, to iteratively query the expert and train (after 1st warmstarting from behavior cloning).')
    else:
        logdir_prefix = 'q1_'  # The autograder uses the prefix `q1_`
        assert args.n_iter==1, ('Vanilla behavior cloning collects expert data just once (n_iter=1)')
    
    # data_path ='/content/cs285_f2023/hw1/data'
    data_path = '../../data'
    if not (os.path.exists(data_path)):
        os.makedirs(data_path)
    logdir = logdir_prefix + args.exp_name + '_' + args.env_name #+ \
             # '_' + time.strftime("%d-%m-%Y_%H-%M-%S")
    logdir = os.path.join(data_path, part, logdir)
    args['logdir'] = logdir
    if not(os.path.exists(logdir)):
        os.makedirs(logdir)

* (3.2) Experiment with one set of hyperparameters that affects the performance of the behavioral cloning agent, such as the amount of training steps, the amount of expert data provided, or something that you come up with yourself. For one of the tasks used in the previous question, show a graph of how the BC agent’s performance varies with the value of this hyperparameter. In the caption for the graph, state the hyperparameter and a brief rationale for why you chose it.

To study how hyperparameters affect performance, I vary three of them:
* `num_agent_train_steps_per_iter`
* `train_batch_size`
* `size`

over the values listed below.  
I chose `num_agent_train_steps_per_iter` because I expect it to be one of the hyperparameters that affects performance most, and I also try two others (`train_batch_size` and `size`).

In [5]:
# do grid search
from itertools import product

params = {
    'num_agent_train_steps_per_iter': [1000, 5000, 10000],
    'train_batch_size': [200, 500, 1000],
    'size': [64, 128, 256],
    'video_log_freq': [-1],
}

grid_params = [dict(zip(params, vals)) for vals in product(*params.values())]

# set exp_name by parameter combination
for param in grid_params:
    param['exp_name'] = f"t_step={param['num_agent_train_steps_per_iter']}-tb_size={param['train_batch_size']}-size={param['size']}"

grid_params

[{'num_agent_train_steps_per_iter': 1000,
  'train_batch_size': 200,
  'size': 64,
  'video_log_freq': -1,
  'exp_name': 't_step=1000-tb_size=200-size=64'},
 {'num_agent_train_steps_per_iter': 1000,
  'train_batch_size': 200,
  'size': 128,
  'video_log_freq': -1,
  'exp_name': 't_step=1000-tb_size=200-size=128'},
 {'num_agent_train_steps_per_iter': 1000,
  'train_batch_size': 200,
  'size': 256,
  'video_log_freq': -1,
  'exp_name': 't_step=1000-tb_size=200-size=256'},
 {'num_agent_train_steps_per_iter': 1000,
  'train_batch_size': 500,
  'size': 64,
  'video_log_freq': -1,
  'exp_name': 't_step=1000-tb_size=500-size=64'},
 {'num_agent_train_steps_per_iter': 1000,
  'train_batch_size': 500,
  'size': 128,
  'video_log_freq': -1,
  'exp_name': 't_step=1000-tb_size=500-size=128'},
 {'num_agent_train_steps_per_iter': 1000,
  'train_batch_size': 500,
  'size': 256,
  'video_log_freq': -1,
  'exp_name': 't_step=1000-tb_size=500-size=256'},
 {'num_agent_train_steps_per_iter': 1000,
  'train

In [None]:
# ## run training
# print(args.logdir)
# run_training_loop(args)

In [6]:
# run training with grid parameters
for param in grid_params:
    for p in param:
        args[p] = param[p]
    create_log_dir(args, part='3-2')
    run_training_loop(args)
    print('*' * 100)

['batch_size=1000', 'batch_size_initial=2000', 'do_dagger=False', 'env_name=Ant-v4', 'ep_len=1000', 'eval_batch_size=5000', 'exp_name=t_step=1000-tb_size=200-size=64', 'expert_data=../../cs285/expert_data/expert_data_Ant-v4.pkl', 'expert_policy_file=../../cs285/policies/experts/Ant.pkl', 'learning_rate=0.005', 'logdir=../../data\\3-2\\q1_t_step=1000-tb_size=200-size=64_Ant-v4', 'max_replay_buffer_size=1000000', 'n_iter=1', 'n_layers=2', 'no_gpu=False', 'num_agent_train_steps_per_iter=1000', 'save_params=False', 'scalar_log_freq=1', 'seed=1', 'size=64', 'train_batch_size=200', 'video_log_freq=-1', 'which_gpu=0']

########################
logging outputs to  ../../data\3-2\q1_t_step=1000-tb_size=200-size=64_Ant-v4
########################
GPU not detected. Defaulting to CPU.
Loading expert policy from... ../../cs285/policies/experts/Ant.pkl
obs (1, 111) (1, 111)
Done restoring expert policy...


********** Iteration 0 ************

Collecting data to be used for training...

Training age




Beginning logging procedure...

Collecting data for eval...




Eval_AverageReturn : 1503.58984375
Eval_StdReturn : 515.1749267578125
Eval_MaxReturn : 2144.9677734375
Eval_MinReturn : 696.6118774414062
Eval_AverageEpLen : 1000.0
Train_AverageReturn : 4681.891673935816
Train_StdReturn : 30.70862278765526
Train_MaxReturn : 4712.600296723471
Train_MinReturn : 4651.18305114816
Train_AverageEpLen : 1000.0
Training Loss : 0.03301353007555008
Train_EnvstepsSoFar : 0
TimeSinceStart : 2.8112382888793945
Initial_DataCollection_AverageReturn : 4681.891673935816
Done logging...


****************************************************************************************************
['batch_size=1000', 'batch_size_initial=2000', 'do_dagger=False', 'env_name=Ant-v4', 'ep_len=1000', 'eval_batch_size=5000', 'exp_name=t_step=1000-tb_size=200-size=128', 'expert_data=../../cs285/expert_data/expert_data_Ant-v4.pkl', 'expert_policy_file=../../cs285/policies/experts/Ant.pkl', 'learning_rate=0.005', 'logdir=../../data\\3-2\\q1_t_step=1000-tb_size=200-size=128_Ant-v4', 'max_




Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 1274.5731201171875
Eval_StdReturn : 641.5538330078125
Eval_MaxReturn : 2241.037353515625
Eval_MinReturn : 546.1234130859375
Eval_AverageEpLen : 855.2857142857143
Train_AverageReturn : 4681.891673935816
Train_StdReturn : 30.70862278765526
Train_MaxReturn : 4712.600296723471
Train_MinReturn : 4651.18305114816
Train_AverageEpLen : 1000.0
Training Loss : 0.03430361673235893
Train_EnvstepsSoFar : 0
TimeSinceStart : 3.5667097568511963
Initial_DataCollection_AverageReturn : 4681.891673935816
Done logging...


****************************************************************************************************
['batch_size=1000', 'batch_size_initial=2000', 'do_dagger=False', 'env_name=Ant-v4', 'ep_len=1000', 'eval_batch_size=5000', 'exp_name=t_step=1000-tb_size=200-size=256', 'expert_data=../../cs285/expert_data/expert_data_Ant-v4.pkl', 'expert_policy_file=../../cs285/policies/experts/Ant.pkl', 'learning_rate=0.00

* (3.1) Run behavioral cloning (BC) and report results on two tasks: one where a behavioral cloning agent should achieve at least 30% of the performance of the expert, and one environment of your choosing where it does not.

After reviewing the grid-search results, I selected one combination of hyperparameters and used it to train the agent on all four tasks. The selected hyperparameters are:
* num_agent_train_steps_per_iter = 10000 
* train_batch_size = 200
* size = 64
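One way to review the grid-search runs is to parse the scalar lines that each run prints (for example `Eval_AverageReturn : ...`) and rank the runs by a metric. The sketch below is only an illustration: `parse_runs` and `best_run` are hypothetical helpers, not part of the `cs285` package, and they assume the captured output is available as a string in which each run prints its `'exp_name=...'` argument before its scalars. The sample text is an abridged copy of two runs from the output above.

```python
import re

# Hypothetical helpers (not part of cs285). They parse text in the
# "Key : value" format printed by the training loop.
EXP_NAME = re.compile(r"'exp_name=([^']+)'")
SCALAR = re.compile(r"^([\w ]+?) : (-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)$")


def parse_runs(log_text):
    """Return {exp_name: {metric: value}} for every run found in log_text."""
    runs = {}
    current = None
    for line in log_text.splitlines():
        name = EXP_NAME.search(line)
        if name:
            # A new argument list marks the start of a new run.
            current = runs.setdefault(name.group(1), {})
            continue
        scalar = SCALAR.match(line.strip())
        if scalar and current is not None:
            current[scalar.group(1)] = float(scalar.group(2))
    return runs


def best_run(runs, metric="Eval_AverageReturn"):
    """Name of the run with the highest value of `metric`."""
    scored = {name: m[metric] for name, m in runs.items() if metric in m}
    return max(scored, key=scored.get)


# Abridged sample of the captured output (argument lists shortened).
sample_log = """
['batch_size=1000', 'exp_name=t_step=1000-tb_size=200-size=64', 'seed=1']
Eval_AverageReturn : 1503.58984375
Eval_StdReturn : 515.1749267578125
['batch_size=1000', 'exp_name=t_step=1000-tb_size=200-size=128', 'seed=1']
Eval_AverageReturn : 1274.5731201171875
Eval_StdReturn : 641.5538330078125
"""

runs = parse_runs(sample_log)
print(best_run(runs))
```

Ranking by `Eval_AverageReturn` alone ignores `Eval_StdReturn`; with a single seed per configuration, close results are probably best treated as ties.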

In [5]:
args = Args()
args['num_agent_train_steps_per_iter'] = 10000
args['train_batch_size'] = 200
args['size'] = 64
args['video_log_freq'] = 1

In [6]:
param_Ant = {
    'expert_policy_file': '../../cs285/policies/experts/Ant.pkl',
    'expert_data': '../../cs285/expert_data/expert_data_Ant-v4.pkl',
    'env_name': 'Ant-v4',
    'exp_name': 'bc_ant',
}

for k in param_Ant:
    args[k] = param_Ant[k]

In [7]:
# run training on Ant
create_log_dir(args, part='3-1')
run_training_loop(args)

['batch_size=1000', 'batch_size_initial=2000', 'do_dagger=False', 'env_name=Ant-v4', 'ep_len=1000', 'eval_batch_size=5000', 'exp_name=bc_ant', 'expert_data=../../cs285/expert_data/expert_data_Ant-v4.pkl', 'expert_policy_file=../../cs285/policies/experts/Ant.pkl', 'learning_rate=0.005', 'logdir=../../data\\3-1\\q1_bc_ant_Ant-v4', 'max_replay_buffer_size=1000000', 'n_iter=1', 'n_layers=2', 'no_gpu=False', 'num_agent_train_steps_per_iter=10000', 'save_params=False', 'scalar_log_freq=1', 'seed=1', 'size=64', 'train_batch_size=200', 'video_log_freq=1', 'which_gpu=0']

########################
logging outputs to  ../../data\3-1\q1_bc_ant_Ant-v4
########################
GPU not detected. Defaulting to CPU.
Loading expert policy from... ../../cs285/policies/experts/Ant.pkl
obs (1, 111) (1, 111)
Done restoring expert policy...


********** Iteration 0 ************

Collecting data to be used for training...

Training agent using sampled data from replay buffer...





Beginning logging procedure...

Collecting video rollouts eval





Collecting data for eval...
Eval_AverageReturn : 4749.41015625
Eval_StdReturn : 59.842185974121094
Eval_MaxReturn : 4807.9140625
Eval_MinReturn : 4649.9013671875
Eval_AverageEpLen : 1000.0
Train_AverageReturn : 4681.891673935816
Train_StdReturn : 30.70862278765526
Train_MaxReturn : 4712.600296723471
Train_MinReturn : 4651.18305114816
Train_AverageEpLen : 1000.0
Training Loss : 0.00029027400887571275
Train_EnvstepsSoFar : 0
TimeSinceStart : 45.74547266960144
Initial_DataCollection_AverageReturn : 4681.891673935816
Done logging...




In [6]:
param_HalfCheetah = {
    'expert_policy_file': '../../cs285/policies/experts/HalfCheetah.pkl',
    'expert_data': '../../cs285/expert_data/expert_data_HalfCheetah-v4.pkl',
    'env_name': 'HalfCheetah-v4',
    'exp_name': 'bc_halfcheetah',
}

for k in param_HalfCheetah:
    args[k] = param_HalfCheetah[k]

In [7]:
# run training on HalfCheetah
create_log_dir(args, part='3-1')
run_training_loop(args)

['batch_size=1000', 'batch_size_initial=2000', 'do_dagger=False', 'env_name=HalfCheetah-v4', 'ep_len=1000', 'eval_batch_size=5000', 'exp_name=bc_halfcheetah', 'expert_data=../../cs285/expert_data/expert_data_HalfCheetah-v4.pkl', 'expert_policy_file=../../cs285/policies/experts/HalfCheetah.pkl', 'learning_rate=0.005', 'logdir=../../data\\3-1\\q1_bc_halfcheetah_HalfCheetah-v4', 'max_replay_buffer_size=1000000', 'n_iter=1', 'n_layers=2', 'no_gpu=False', 'num_agent_train_steps_per_iter=10000', 'save_params=False', 'scalar_log_freq=1', 'seed=1', 'size=64', 'train_batch_size=200', 'video_log_freq=1', 'which_gpu=0']

########################
logging outputs to  ../../data\3-1\q1_bc_halfcheetah_HalfCheetah-v4
########################
GPU not detected. Defaulting to CPU.
Loading expert policy from... ../../cs285/policies/experts/HalfCheetah.pkl
obs (1, 17) (1, 17)
Done restoring expert policy...


********** Iteration 0 ************

Collecting data to be used for training...

Training agent us




Beginning logging procedure...

Collecting video rollouts eval





Collecting data for eval...
Eval_AverageReturn : 4080.100341796875
Eval_StdReturn : 58.103641510009766
Eval_MaxReturn : 4181.392578125
Eval_MinReturn : 4019.67529296875
Eval_AverageEpLen : 1000.0
Train_AverageReturn : 4034.7999834965067
Train_StdReturn : 32.8677631311341
Train_MaxReturn : 4067.6677466276406
Train_MinReturn : 4001.9322203653724
Train_AverageEpLen : 1000.0
Training Loss : 0.0007307063206098974
Train_EnvstepsSoFar : 0
TimeSinceStart : 40.932074546813965
Initial_DataCollection_AverageReturn : 4034.7999834965067
Done logging...




In [10]:
param_Hopper = {
    'expert_policy_file': '../../cs285/policies/experts/Hopper.pkl',
    'expert_data': '../../cs285/expert_data/expert_data_Hopper-v4.pkl',
    'env_name': 'Hopper-v4',
    'exp_name': 'bc_hopper',
}

for k in param_Hopper:
    args[k] = param_Hopper[k]

In [11]:
# run training on Hopper
create_log_dir(args, part='3-1')
run_training_loop(args)

['batch_size=1000', 'batch_size_initial=2000', 'do_dagger=False', 'env_name=Hopper-v4', 'ep_len=1000', 'eval_batch_size=5000', 'exp_name=bc_hopper', 'expert_data=../../cs285/expert_data/expert_data_Hopper-v4.pkl', 'expert_policy_file=../../cs285/policies/experts/Hopper.pkl', 'learning_rate=0.005', 'logdir=../../data\\3-1\\q1_bc_hopper_Hopper-v4', 'max_replay_buffer_size=1000000', 'n_iter=1', 'n_layers=2', 'no_gpu=False', 'num_agent_train_steps_per_iter=10000', 'save_params=False', 'scalar_log_freq=1', 'seed=1', 'size=64', 'train_batch_size=200', 'video_log_freq=1', 'which_gpu=0']

########################
logging outputs to  ../../data\3-1\q1_bc_hopper_Hopper-v4
########################
GPU not detected. Defaulting to CPU.
Loading expert policy from... ../../cs285/policies/experts/Hopper.pkl
obs (1, 11) (1, 11)
Done restoring expert policy...


********** Iteration 0 ************

Collecting data to be used for training...

Training agent using sampled data from replay buffer...

Begin

In [12]:
param_Walker2d = {
    'expert_policy_file': '../../cs285/policies/experts/Walker2d.pkl',
    'expert_data': '../../cs285/expert_data/expert_data_Walker2d-v4.pkl',
    'env_name': 'Walker2d-v4',
    'exp_name': 'bc_walker2d',
}

for k in param_Walker2d:
    args[k] = param_Walker2d[k]

In [13]:
# run training on Walker2d
create_log_dir(args, part='3-1')
run_training_loop(args)

['batch_size=1000', 'batch_size_initial=2000', 'do_dagger=False', 'env_name=Walker2d-v4', 'ep_len=1000', 'eval_batch_size=5000', 'exp_name=bc_walker2d', 'expert_data=../../cs285/expert_data/expert_data_Walker2d-v4.pkl', 'expert_policy_file=../../cs285/policies/experts/Walker2d.pkl', 'learning_rate=0.005', 'logdir=../../data\\3-1\\q1_bc_walker2d_Walker2d-v4', 'max_replay_buffer_size=1000000', 'n_iter=1', 'n_layers=2', 'no_gpu=False', 'num_agent_train_steps_per_iter=10000', 'save_params=False', 'scalar_log_freq=1', 'seed=1', 'size=64', 'train_batch_size=200', 'video_log_freq=1', 'which_gpu=0']

########################
logging outputs to  ../../data\3-1\q1_bc_walker2d_Walker2d-v4
########################
GPU not detected. Defaulting to CPU.
Loading expert policy from... ../../cs285/policies/experts/Walker2d.pkl
obs (1, 17) (1, 17)
Done restoring expert policy...


********** Iteration 0 ************

Collecting data to be used for training...

Training agent using sampled data from repla

Result:
|                              | Ant               | HalfCheetah       | Hopper            | Walker2d          |
| ---------------------------- | ----------------- | ----------------- | ----------------- | ----------------- |
| AverageReturn (Train / Eval) | 4681.89 / 4749.41 | 4034.79 / 4080.10 | 3717.51 / 2664.45 | 5383.31 / 5266.77 |
| StdReturn (Train / Eval)     | 30.70 / 59.84     | 32.86 / 58.10     | 0.35 / 714.44     | 54.15 / 35.68     |

It looks like all four tasks achieve at least 30% of the expert's performance (the expert's return is the Train value). :)
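The 30% check can be made explicit by dividing each evaluation return by the expert's return. The sketch below uses the AverageReturn values from the table above and treats the Train column as the expert's return, as the text does. The names `bc_results`, `THRESHOLD` and `fraction_of_expert` are illustrative only.

```python
# (expert/train return, BC eval return), copied from the results table above.
bc_results = {
    "Ant": (4681.89, 4749.41),
    "HalfCheetah": (4034.79, 4080.10),
    "Hopper": (3717.51, 2664.45),
    "Walker2d": (5383.31, 5266.77),
}

THRESHOLD = 0.30  # the "at least 30% of the expert" criterion from question 3.1


def fraction_of_expert(expert_return, eval_return):
    """Evaluation return expressed as a fraction of the expert's return."""
    return eval_return / expert_return


fractions = {env: fraction_of_expert(*vals) for env, vals in bc_results.items()}
for env, frac in fractions.items():
    status = "meets" if frac >= THRESHOLD else "misses"
    print(f"{env:12s} {frac:6.1%} of expert ({status} the 30% threshold)")
```

On these numbers Hopper is the weakest task, at roughly 72% of the expert's return, which is still well above the threshold.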

In [1]:
#@markdown You can visualize your runs with tensorboard from within the notebook

%load_ext tensorboard
# %tensorboard --logdir /content/cs285_f2023/hw1/data
# %tensorboard --logdir ../../data/3-2

In [None]:
%tensorboard --logdir ../../data/3-1

## Running DAgger (Problem 2)
Modify the settings above:
1. check the `do_dagger` box
2. set `n_iter` to `10`
3. set `exp_name` to `dagger_{env_name}`

Then rerun the code.

* (4.2) Run DAgger and report results on the two tasks you tested previously with behavioral cloning.

I use the same hyperparameters as for behavior cloning and train on all four tasks.
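For orientation, the loop that `do_dagger` switches on (collect data with the current policy, relabel it with the expert, retrain on the aggregated data; compare the log messages in the output below) can be sketched on a toy problem. This is not the homework's implementation. The 1-D environment, `expert_action`, `rollout`, `fit_gain` and `dagger` are all hypothetical stand-ins chosen so that the example runs with the standard library only.

```python
import random


def expert_action(obs):
    """Hypothetical expert for a 1-D toy task: push the state toward zero."""
    return -0.5 * obs


def rollout(policy, steps, rng):
    """Run `policy` in the toy environment and return the visited observations."""
    obs = rng.uniform(-1.0, 1.0)
    visited = []
    for _ in range(steps):
        visited.append(obs)
        obs = obs + policy(obs) + rng.gauss(0.0, 0.01)  # noisy transition
    return visited


def fit_gain(dataset):
    """Least-squares fit of action = gain * obs (stands in for the training step)."""
    num = sum(o * a for o, a in dataset)
    den = sum(o * o for o, _ in dataset)
    return num / den


def dagger(n_iter, steps=50, seed=0):
    rng = random.Random(seed)
    # Iteration 0 is plain behavior cloning on expert demonstrations.
    dataset = [(o, expert_action(o)) for o in rollout(expert_action, steps, rng)]
    gain = fit_gain(dataset)
    for _ in range(1, n_iter):
        # 1. Collect observations by running the current learner.
        visited = rollout(lambda o: gain * o, steps, rng)
        # 2. Relabel them with the expert's actions and aggregate.
        dataset.extend((o, expert_action(o)) for o in visited)
        # 3. Retrain on the aggregated dataset.
        gain = fit_gain(dataset)
    return gain, len(dataset)


gain, n_samples = dagger(n_iter=10)
print(gain, n_samples)
```

With `n_iter=1` the loop body never runs and the function reduces to behavior cloning, which mirrors the `n_iter` assertions in `create_log_dir`.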

In [5]:
args = Args()
args['expert_policy_file'] = '../../cs285/policies/experts/Ant.pkl'
args['expert_data'] = '../../cs285/expert_data/expert_data_Ant-v4.pkl'
args['env_name'] = 'Ant-v4'
args['exp_name'] = 'dagger_ant'

# hyperparameters for DAgger
args['do_dagger'] = True
args['n_iter'] = 10

# the same as Behavior Cloning; the others are as default
args['num_agent_train_steps_per_iter'] = 10000
args['train_batch_size'] = 200
args['size'] = 64

In [6]:
# run training on Ant
create_log_dir(args, part='4-2')
run_training_loop(args)

['batch_size=1000', 'batch_size_initial=2000', 'do_dagger=True', 'env_name=Ant-v4', 'ep_len=1000', 'eval_batch_size=5000', 'exp_name=dagger_ant', 'expert_data=../../cs285/expert_data/expert_data_Ant-v4.pkl', 'expert_policy_file=../../cs285/policies/experts/Ant.pkl', 'learning_rate=0.005', 'logdir=../../data\\4-2\\q2_dagger_ant_Ant-v4', 'max_replay_buffer_size=1000000', 'n_iter=10', 'n_layers=2', 'no_gpu=False', 'num_agent_train_steps_per_iter=10000', 'save_params=False', 'scalar_log_freq=1', 'seed=1', 'size=64', 'train_batch_size=200', 'video_log_freq=5', 'which_gpu=0']

########################
logging outputs to  ../../data\4-2\q2_dagger_ant_Ant-v4
########################
GPU not detected. Defaulting to CPU.
Loading expert policy from... ../../cs285/policies/experts/Ant.pkl
obs (1, 111) (1, 111)
Done restoring expert policy...


********** Iteration 0 ************

Collecting data to be used for training...

Training agent using sampled data from replay buffer...





Beginning logging procedure...

Collecting data for eval...




Eval_AverageReturn : 4767.443359375
Eval_StdReturn : 30.58685874938965
Eval_MaxReturn : 4818.509765625
Eval_MinReturn : 4727.31982421875
Eval_AverageEpLen : 1000.0
Train_AverageReturn : 4681.891673935816
Train_StdReturn : 30.70862278765526
Train_MaxReturn : 4712.600296723471
Train_MinReturn : 4651.18305114816
Train_AverageEpLen : 1000.0
Training Loss : 0.0003318272647447884
Train_EnvstepsSoFar : 0
TimeSinceStart : 11.092876672744751
Initial_DataCollection_AverageReturn : 4681.891673935816
Done logging...




********** Iteration 1 ************

Collecting data to be used for training...





Relabelling collected observations with labels from an expert policy...

Training agent using sampled data from replay buffer...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 4588.5986328125
Eval_StdReturn : 83.02816009521484
Eval_MaxReturn : 4708.1083984375
Eval_MinReturn : 4482.7734375
Eval_AverageEpLen : 1000.0
Train_AverageReturn : 4774.0146484375
Train_StdReturn : 0.0
Train_MaxReturn : 4774.0146484375
Train_MinReturn : 4774.0146484375
Train_AverageEpLen : 1000.0
Training Loss : 0.00013036698510404676
Train_EnvstepsSoFar : 1000
TimeSinceStart : 22.77952289581299
Done logging...




********** Iteration 2 ************

Collecting data to be used for training...

Relabelling collected observations with labels from an expert policy...

Training agent using sampled data from replay buffer...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 4729.0380859375
Eval_StdReturn : 84.97787475585938
Eval_MaxReturn : 4838.45654




Collecting data for eval...
Eval_AverageReturn : 4732.171875
Eval_StdReturn : 42.655364990234375
Eval_MaxReturn : 4792.9619140625
Eval_MinReturn : 4676.5458984375
Eval_AverageEpLen : 1000.0
Train_AverageReturn : 4901.9560546875
Train_StdReturn : 0.0
Train_MaxReturn : 4901.9560546875
Train_MinReturn : 4901.9560546875
Train_AverageEpLen : 1000.0
Training Loss : 0.00011696351430146024
Train_EnvstepsSoFar : 4000
TimeSinceStart : 87.47571349143982
Done logging...




********** Iteration 5 ************

Collecting data to be used for training...

Relabelling collected observations with labels from an expert policy...

Training agent using sampled data from replay buffer...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 4751.0478515625
Eval_StdReturn : 94.81939697265625
Eval_MaxReturn : 4893.2666015625
Eval_MinReturn : 4619.4677734375
Eval_AverageEpLen : 1000.0
Train_AverageReturn : 4795.8134765625
Train_StdReturn : 0.0
Train_MaxReturn : 4795.8134765625
Tr

In [7]:
args = Args()
args['expert_policy_file'] = '../../cs285/policies/experts/HalfCheetah.pkl'
args['expert_data'] = '../../cs285/expert_data/expert_data_HalfCheetah-v4.pkl'
args['env_name'] = 'HalfCheetah-v4'
args['exp_name'] = 'dagger_halfcheetah'

# hyperparameters for DAgger
args['do_dagger'] = True
args['n_iter'] = 10

# the same as Behavior Cloning; the others are as default
args['num_agent_train_steps_per_iter'] = 10000
args['train_batch_size'] = 200
args['size'] = 64

In [8]:
# run training on HalfCheetah
create_log_dir(args, part='4-2')
run_training_loop(args)

['batch_size=1000', 'batch_size_initial=2000', 'do_dagger=True', 'env_name=HalfCheetah-v4', 'ep_len=1000', 'eval_batch_size=5000', 'exp_name=dagger_halfcheetah', 'expert_data=../../cs285/expert_data/expert_data_HalfCheetah-v4.pkl', 'expert_policy_file=../../cs285/policies/experts/HalfCheetah.pkl', 'learning_rate=0.005', 'logdir=../../data\\4-2\\q2_dagger_halfcheetah_HalfCheetah-v4', 'max_replay_buffer_size=1000000', 'n_iter=10', 'n_layers=2', 'no_gpu=False', 'num_agent_train_steps_per_iter=10000', 'save_params=False', 'scalar_log_freq=1', 'seed=1', 'size=64', 'train_batch_size=200', 'video_log_freq=5', 'which_gpu=0']

########################
logging outputs to  ../../data\4-2\q2_dagger_halfcheetah_HalfCheetah-v4
########################
GPU not detected. Defaulting to CPU.
Loading expert policy from... ../../cs285/policies/experts/HalfCheetah.pkl
obs (1, 17) (1, 17)
Done restoring expert policy...


********** Iteration 0 ************

Collecting data to be used for training...

Train

In [9]:
args = Args()
args['expert_policy_file'] = '../../cs285/policies/experts/Hopper.pkl'
args['expert_data'] = '../../cs285/expert_data/expert_data_Hopper-v4.pkl'
args['env_name'] = 'Hopper-v4'
args['exp_name'] = 'dagger_hopper'

# hyperparameters for DAgger
args['do_dagger'] = True
args['n_iter'] = 10

# the same as Behavior Cloning; the others are as default
args['num_agent_train_steps_per_iter'] = 10000
args['train_batch_size'] = 200
args['size'] = 64

In [10]:
# run training on Hopper
create_log_dir(args, part='4-2')
run_training_loop(args)

['batch_size=1000', 'batch_size_initial=2000', 'do_dagger=True', 'env_name=Hopper-v4', 'ep_len=1000', 'eval_batch_size=5000', 'exp_name=dagger_hopper', 'expert_data=../../cs285/expert_data/expert_data_Hopper-v4.pkl', 'expert_policy_file=../../cs285/policies/experts/Hopper.pkl', 'learning_rate=0.005', 'logdir=../../data\\4-2\\q2_dagger_hopper_Hopper-v4', 'max_replay_buffer_size=1000000', 'n_iter=10', 'n_layers=2', 'no_gpu=False', 'num_agent_train_steps_per_iter=10000', 'save_params=False', 'scalar_log_freq=1', 'seed=1', 'size=64', 'train_batch_size=200', 'video_log_freq=5', 'which_gpu=0']

########################
logging outputs to  ../../data\4-2\q2_dagger_hopper_Hopper-v4
########################
GPU not detected. Defaulting to CPU.
Loading expert policy from... ../../cs285/policies/experts/Hopper.pkl
obs (1, 11) (1, 11)
Done restoring expert policy...


********** Iteration 0 ************

Collecting data to be used for training...

Training agent using sampled data from replay buff

In [5]:
args = Args()
args['expert_policy_file'] = '../../cs285/policies/experts/Walker2d.pkl'
args['expert_data'] = '../../cs285/expert_data/expert_data_Walker2d-v4.pkl'
args['env_name'] = 'Walker2d-v4'
args['exp_name'] = 'dagger_walker2d'

# hyperparameters for DAgger
args['do_dagger'] = True
args['n_iter'] = 10

# the same as Behavior Cloning; the others are as default
args['num_agent_train_steps_per_iter'] = 10000
args['train_batch_size'] = 200
args['size'] = 64

In [6]:
# run training on Walker2d
create_log_dir(args, part='4-2')
run_training_loop(args)

['batch_size=1000', 'batch_size_initial=2000', 'do_dagger=True', 'env_name=Walker2d-v4', 'ep_len=1000', 'eval_batch_size=5000', 'exp_name=dagger_walker2d', 'expert_data=../../cs285/expert_data/expert_data_Walker2d-v4.pkl', 'expert_policy_file=../../cs285/policies/experts/Walker2d.pkl', 'learning_rate=0.005', 'logdir=../../data\\4-2\\q2_dagger_walker2d_Walker2d-v4', 'max_replay_buffer_size=1000000', 'n_iter=10', 'n_layers=2', 'no_gpu=False', 'num_agent_train_steps_per_iter=10000', 'save_params=False', 'scalar_log_freq=1', 'seed=1', 'size=64', 'train_batch_size=200', 'video_log_freq=5', 'which_gpu=0']

########################
logging outputs to  ../../data\4-2\q2_dagger_walker2d_Walker2d-v4
########################
GPU not detected. Defaulting to CPU.
Loading expert policy from... ../../cs285/policies/experts/Walker2d.pkl
obs (1, 17) (1, 17)
Done restoring expert policy...


********** Iteration 0 ************

Collecting data to be used for training...

Training agent using sampled dat




Beginning logging procedure...

Collecting data for eval...




Eval_AverageReturn : 4927.95849609375
Eval_StdReturn : 677.58935546875
Eval_MaxReturn : 5363.66552734375
Eval_MinReturn : 3425.5888671875
Eval_AverageEpLen : 950.0
Train_AverageReturn : 5383.310325177668
Train_StdReturn : 54.15251563871789
Train_MaxReturn : 5437.462840816386
Train_MinReturn : 5329.1578095389505
Train_AverageEpLen : 1000.0
Training Loss : 0.003240828402340412
Train_EnvstepsSoFar : 0
TimeSinceStart : 10.413546323776245
Initial_DataCollection_AverageReturn : 5383.310325177668
Done logging...




********** Iteration 1 ************

Collecting data to be used for training...





Relabelling collected observations with labels from an expert policy...

Training agent using sampled data from replay buffer...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 4694.85791015625
Eval_StdReturn : 1556.56591796875
Eval_MaxReturn : 5450.8916015625
Eval_MinReturn : 1215.750732421875
Eval_AverageEpLen : 883.5
Train_AverageReturn : 5283.61474609375
Train_StdReturn : 0.0
Train_MaxReturn : 5283.61474609375
Train_MinReturn : 5283.61474609375
Train_AverageEpLen : 1000.0
Training Loss : 0.0028004648629575968
Train_EnvstepsSoFar : 1000
TimeSinceStart : 21.13810133934021
Done logging...




********** Iteration 2 ************

Collecting data to be used for training...

Relabelling collected observations with labels from an expert policy...

Training agent using sampled data from replay buffer...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 5327.6884765625
Eval_StdReturn : 42.142616271972656
Eval_MaxReturn : 538




Collecting data for eval...
Eval_AverageReturn : 5381.42724609375
Eval_StdReturn : 38.432857513427734
Eval_MaxReturn : 5447.4462890625
Eval_MinReturn : 5334.1787109375
Eval_AverageEpLen : 1000.0
Train_AverageReturn : 5392.83740234375
Train_StdReturn : 0.0
Train_MaxReturn : 5392.83740234375
Train_MinReturn : 5392.83740234375
Train_AverageEpLen : 1000.0
Training Loss : 0.0014680755557492375
Train_EnvstepsSoFar : 4000
TimeSinceStart : 89.11248660087585
Done logging...




********** Iteration 5 ************

Collecting data to be used for training...

Relabelling collected observations with labels from an expert policy...

Training agent using sampled data from replay buffer...

Beginning logging procedure...

Collecting data for eval...
Eval_AverageReturn : 5430.9033203125
Eval_StdReturn : 20.93875503540039
Eval_MaxReturn : 5462.21484375
Eval_MinReturn : 5406.53369140625
Eval_AverageEpLen : 1000.0
Train_AverageReturn : 5386.66796875
Train_StdReturn : 0.0
Train_MaxReturn : 5386.66796875


Result of the last iteration:
|                              | Ant               | HalfCheetah       | Hopper            | Walker2d          |
| ---------------------------- | ----------------- | ----------------- | ----------------- | ----------------- |
| AverageReturn (Train / Eval) | 4844.54 / 4660.49 | 4136.01 / 4117.94 | 3716.81 / 3713.66 | 5371.93 / 5432.31 |
| StdReturn (Train / Eval)     | 0.0 / 72.78       | 0.0 / 71.78       | 0.0 / 1.77        | 0.0 / 32.73       |

In [None]:
%load_ext tensorboard
%tensorboard --logdir ../../data/4-2

Compare Behavior Cloning vs. DAgger
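
A quick numeric comparison, using the Eval AverageReturn values from the two result tables above (BC after its single iteration, DAgger after its last iteration). The dictionary and variable names are illustrative only, and every number comes from a single seed, so small differences should be read with care.

```python
# Eval AverageReturn copied from the BC and DAgger result tables above.
bc_eval = {"Ant": 4749.41, "HalfCheetah": 4080.10, "Hopper": 2664.45, "Walker2d": 5266.77}
dagger_eval = {"Ant": 4660.49, "HalfCheetah": 4117.94, "Hopper": 3713.66, "Walker2d": 5432.31}

# Positive values mean the DAgger run evaluated higher than the BC run.
improvement = {env: dagger_eval[env] - bc_eval[env] for env in bc_eval}

for env, delta in improvement.items():
    print(f"{env:12s} BC={bc_eval[env]:8.2f}  DAgger={dagger_eval[env]:8.2f}  diff={delta:+9.2f}")
```

The largest change is on Hopper (about +1049), where BC had by far the highest evaluation StdReturn (714.44). On Ant the DAgger run evaluates slightly lower than BC (about -89).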

In [None]:
%tensorboard --logdir ../../data/