# Setup

In [1]:
%load_ext autoreload
%autoreload 2

# 3.1.1 Bootstrapping

* (Testing) Train an agent on Pendulum-v1 with the sample configuration experiments/sac/sanity_pendulum.yaml. It shouldn’t get high reward yet (you’re not training an actor), but the Q-values should stabilize at some large negative number. The “do-nothing” reward for this environment is about -10 per step; you can use that together with the discount factor γ to calculate (approximately) what Q should be. If the Q-values go to minus infinity or stay close to zero, you probably have a bug.
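
The target value for this sanity check follows from a geometric series. A minimal sketch, assuming the agent receives a constant reward of -10 on every step (the approximate "do-nothing" figure above) and `discount = 0.99` as in the logged config. The helper `truncated_q` is a hypothetical name, and the 200-step horizon is Pendulum-v1's default time limit; which of the two numbers applies depends on whether the critic bootstraps past time-limit truncation.

```python
# Assumption: the reward is roughly constant at -10 per step.
reward_per_step = -10.0   # approximate "do-nothing" reward per step
discount = 0.99           # 'discount' in sanity_pendulum.yaml

# Infinite horizon: sum_{t>=0} gamma^t * r = r / (1 - gamma)
q_infinite = reward_per_step / (1.0 - discount)


def truncated_q(horizon: int) -> float:
    """Discounted return when no value is bootstrapped after `horizon` steps."""
    return reward_per_step * (1.0 - discount ** horizon) / (1.0 - discount)


print(f"infinite-horizon Q estimate: {q_infinite:.1f}")
print(f"200-step Q estimate:         {truncated_q(200):.1f}")
```

Logged Q-values that settle somewhere in the range spanned by these two estimates are consistent with a working critic update.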

In [3]:
# Test the hard target update; the actor is not trained yet
# Set use_entropy_bonus=False and use_soft_target_update=False (default) in sanity_pendulum.yaml
%run cs285/scripts/run_hw3_sac.py -cfg experiments/sac/sanity_pendulum.yaml \
    --exp_name sanity_pendulum_no-entropy_hard-update

Namespace(config_file='experiments/sac/sanity_pendulum.yaml', eval_interval=5000, num_eval_trajectories=10, num_render_trajectories=0, seed=1, no_gpu=False, which_gpu=0, log_interval=1000, exp_name='sanity_pendulum_no-entropy_hard-update', video_log_freq=-1)
{'exp_name': 'sanity_pendulum_no-entropy_hard-update', 'total_steps': 300000, 'random_steps': 5000, 'training_starts': 10000, 'batch_size': 128, 'replay_buffer_capacity': 1000000, 'ep_len': None, 'discount': 0.99, 'use_soft_target_update': False, 'target_update_period': 1000, 'soft_target_update_rate': None, 'actor_gradient_type': 'reinforce', 'num_actor_samples': 1, 'num_critic_updates': 1, 'num_critic_networks': 1, 'target_critic_backup_type': 'mean', 'backup_entropy': True, 'use_entropy_bonus': False, 'temperature': 0.1, 'log_string': 'sanity_pendulum_no-entropy_hard-update_Pendulum-v1', 'actor_fixed_std': None, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'env_name': 'Pendulum-v1', 'hidden_size': 128, 'num_lay

In [4]:
# Test the soft target update; the actor is not trained yet
# Set use_entropy_bonus=False and use_soft_target_update=True in sanity_pendulum.yaml
%run cs285/scripts/run_hw3_sac.py -cfg experiments/sac/sanity_pendulum.yaml \
    --exp_name sanity_pendulum_no-entropy_soft-update

Namespace(config_file='experiments/sac/sanity_pendulum.yaml', eval_interval=5000, num_eval_trajectories=10, num_render_trajectories=0, seed=1, no_gpu=False, which_gpu=0, log_interval=1000, exp_name='sanity_pendulum_no-entropy_soft-update', video_log_freq=-1)
{'exp_name': 'sanity_pendulum_no-entropy_soft-update', 'total_steps': 300000, 'random_steps': 5000, 'training_starts': 10000, 'batch_size': 128, 'replay_buffer_capacity': 1000000, 'ep_len': None, 'discount': 0.99, 'use_soft_target_update': True, 'target_update_period': None, 'soft_target_update_rate': 0.005, 'actor_gradient_type': 'reinforce', 'num_actor_samples': 1, 'num_critic_updates': 1, 'num_critic_networks': 1, 'target_critic_backup_type': 'mean', 'backup_entropy': True, 'use_entropy_bonus': False, 'temperature': 0.1, 'log_string': 'sanity_pendulum_no-entropy_soft-update_Pendulum-v1', 'actor_fixed_std': None, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'env_name': 'Pendulum-v1', 'hidden_size': 128, 'num_lay
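
The two sanity runs above differ only in how the target critic is refreshed: a full copy every `target_update_period` steps (hard) or a small Polyak step on every update with rate `soft_target_update_rate` (soft). A minimal NumPy sketch of both rules follows. The dictionaries stand in for network parameters, and `hard_update` and `soft_update` are hypothetical helper names, not functions from the homework code.

```python
import numpy as np


def hard_update(online: dict) -> dict:
    """Replace the target parameters with a copy of the online parameters."""
    return {name: value.copy() for name, value in online.items()}


def soft_update(target: dict, online: dict, tau: float) -> dict:
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return {name: (1.0 - tau) * target[name] + tau * online[name] for name in target}


# Stand-in parameters (assumption: a single weight vector per network).
online = {"w": np.ones(4)}
target = {"w": np.zeros(4)}

tau = 0.005  # soft_target_update_rate from the logged config
for _ in range(1000):
    target = soft_update(target, online, tau)

# With the online parameters held fixed, the gap shrinks by (1 - tau) per step.
gap = float(np.max(np.abs(online["w"] - target["w"])))
print(f"remaining gap after 1000 soft updates: {gap:.4f}")
```

With `tau = 0.005`, the target covers most of the distance to a fixed online network within about 1000 steps, which is the same time scale as the hard update period of 1000 used in the other run.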

In [None]:
%load_ext tensorboard
%tensorboard --logdir data/hw3_sac

# 3.1.2 Entropy Bonus and Soft Actor-Critic

* (Testing) The code should be logging entropy during the critic updates. If you run sanity_pendulum.yaml from before, it should achieve (close to) the maximum possible entropy for a 1-dimensional action space. Entropy is maximized by a uniform distribution:  
$$ \mathcal{H}(\mathcal{U}[-1, 1]) = \mathbb{E}[-\log p(x)] = -\log \frac{1}{2} = \log 2 \approx 0.69 $$
Because currently our actor loss **only** consists of the entropy bonus (we haven’t implemented anything to maximize rewards yet), the entropy should increase until it arrives at roughly this level.  
If your logged entropy is higher than this, or significantly lower, you have a bug.
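
The 0.69 ceiling can be checked numerically. A minimal sketch, assuming only NumPy: it computes the analytic entropy of a uniform distribution and a histogram-based estimate from samples. `uniform_entropy` and `histogram_entropy` are hypothetical helper names used for illustration.

```python
import math

import numpy as np


def uniform_entropy(low: float, high: float) -> float:
    """Differential entropy of U[low, high]: -log(1 / (high - low))."""
    return math.log(high - low)


def histogram_entropy(samples: np.ndarray, low: float, high: float, bins: int = 50) -> float:
    """Estimate differential entropy from samples using a histogram density."""
    density, edges = np.histogram(samples, bins=bins, range=(low, high), density=True)
    width = edges[1] - edges[0]
    nonzero = density > 0  # skip empty bins, where p * log p is taken as 0
    return float(-np.sum(density[nonzero] * np.log(density[nonzero])) * width)


rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=100_000)

analytic = uniform_entropy(-1.0, 1.0)
estimated = histogram_entropy(samples, -1.0, 1.0)
print(f"analytic: {analytic:.4f}, histogram estimate: {estimated:.4f}")
```

Any distribution supported on [-1, 1] that is more concentrated than uniform gives a smaller value, which is why a logged entropy above roughly 0.69 points to a bug.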

In [5]:
# Test the entropy bonus; the actor loss contains only the entropy term at this stage
# Set use_entropy_bonus=True (default) and use_soft_target_update=False (default) in sanity_pendulum.yaml
%run cs285/scripts/run_hw3_sac.py -cfg experiments/sac/sanity_pendulum.yaml

Namespace(config_file='experiments/sac/sanity_pendulum.yaml', eval_interval=5000, num_eval_trajectories=10, num_render_trajectories=0, seed=1, no_gpu=False, which_gpu=0, log_interval=1000, exp_name='', video_log_freq=-1)
{'exp_name': 'sanity_pendulum', 'total_steps': 300000, 'random_steps': 5000, 'training_starts': 10000, 'batch_size': 128, 'replay_buffer_capacity': 1000000, 'ep_len': None, 'discount': 0.99, 'use_soft_target_update': False, 'target_update_period': 1000, 'soft_target_update_rate': None, 'actor_gradient_type': 'reinforce', 'num_actor_samples': 1, 'num_critic_updates': 1, 'num_critic_networks': 1, 'target_critic_backup_type': 'mean', 'backup_entropy': True, 'use_entropy_bonus': True, 'temperature': 0.1, 'log_string': 'sanity_pendulum_Pendulum-v1', 'actor_fixed_std': None, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'env_name': 'Pendulum-v1', 'hidden_size': 128, 'num_layers': 3, 'use_tanh': True}
########################
logging outputs to  C:\Users\user

In [None]:
%load_ext tensorboard
%tensorboard --logdir data/hw3_sac

# 3.1.3 Actor with REINFORCE

* (Testing) Train an agent on InvertedPendulum-v4 using sanity_invertedpendulum_reinforce.yaml. You should achieve reward close to 1000, which corresponds to staying upright for all time steps.

In [21]:
%run cs285/scripts/run_hw3_sac.py -cfg experiments/sac/sanity_invertedpendulum_reinforce.yaml

Namespace(config_file='experiments/sac/sanity_invertedpendulum_reinforce.yaml', eval_interval=5000, num_eval_trajectories=10, num_render_trajectories=0, seed=1, no_gpu=False, which_gpu=0, log_interval=1000, exp_name='', video_log_freq=-1)
{'exp_name': 'sanity_invpendulum_reinforce', 'total_steps': 300000, 'random_steps': 5000, 'training_starts': 10000, 'batch_size': 128, 'replay_buffer_capacity': 1000000, 'ep_len': None, 'discount': 0.99, 'use_soft_target_update': False, 'target_update_period': 1000, 'soft_target_update_rate': None, 'actor_gradient_type': 'reinforce', 'num_actor_samples': 1, 'num_critic_updates': 1, 'num_critic_networks': 1, 'target_critic_backup_type': 'mean', 'backup_entropy': True, 'use_entropy_bonus': True, 'temperature': 0.1, 'log_string': 'sanity_invpendulum_reinforce_InvertedPendulum-v4', 'actor_fixed_std': None, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'env_name': 'InvertedPendulum-v4', 'hidden_size': 128, 'num_layers': 3, 'use_tanh': True

* Train an agent on HalfCheetah-v4 using the provided config (halfcheetah_reinforce1.yaml). Note that this configuration uses only one sampled action per training example.

In [4]:
%run cs285/scripts/run_hw3_sac.py -cfg experiments/sac/halfcheetah_reinforce1.yaml

Namespace(config_file='experiments/sac/halfcheetah_reinforce1.yaml', eval_interval=5000, num_eval_trajectories=10, num_render_trajectories=0, seed=1, no_gpu=False, which_gpu=0, log_interval=1000, exp_name='', video_log_freq=-1)
{'exp_name': 'halfcheetah_reinforce1', 'total_steps': 1000000, 'random_steps': 5000, 'training_starts': 10000, 'batch_size': 128, 'replay_buffer_capacity': 1000000, 'ep_len': None, 'discount': 0.99, 'use_soft_target_update': True, 'target_update_period': None, 'soft_target_update_rate': 0.005, 'actor_gradient_type': 'reinforce', 'num_actor_samples': 1, 'num_critic_updates': 1, 'num_critic_networks': 1, 'target_critic_backup_type': 'mean', 'backup_entropy': True, 'use_entropy_bonus': True, 'temperature': 0.2, 'log_string': 'halfcheetah_reinforce1_HalfCheetah-v4', 'actor_fixed_std': None, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'env_name': 'HalfCheetah-v4', 'hidden_size': 128, 'num_layers': 3, 'use_tanh': True}
########################
loggi

* Train another agent with halfcheetah_reinforce10.yaml. This configuration takes many samples from the actor for computing the REINFORCE gradient (we’ll call this REINFORCE-10, and the single-sample version REINFORCE-1). Plot the results (evaluation return over time) on the same axes as the single-sample REINFORCE. Compare and explain your results.

In [5]:
%run cs285/scripts/run_hw3_sac.py -cfg experiments/sac/halfcheetah_reinforce10.yaml

Namespace(config_file='experiments/sac/halfcheetah_reinforce10.yaml', eval_interval=5000, num_eval_trajectories=10, num_render_trajectories=0, seed=1, no_gpu=False, which_gpu=0, log_interval=1000, exp_name='', video_log_freq=-1)
{'exp_name': 'halfcheetah_reinforce10', 'total_steps': 1000000, 'random_steps': 5000, 'training_starts': 10000, 'batch_size': 128, 'replay_buffer_capacity': 1000000, 'ep_len': None, 'discount': 0.99, 'use_soft_target_update': True, 'target_update_period': None, 'soft_target_update_rate': 0.005, 'actor_gradient_type': 'reinforce', 'num_actor_samples': 10, 'num_critic_updates': 1, 'num_critic_networks': 1, 'target_critic_backup_type': 'mean', 'backup_entropy': True, 'use_entropy_bonus': True, 'temperature': 0.2, 'log_string': 'halfcheetah_reinforce10_HalfCheetah-v4', 'actor_fixed_std': None, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'env_name': 'HalfCheetah-v4', 'hidden_size': 128, 'num_layers': 3, 'use_tanh': True}
########################
l

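REINFORCE-10 works better than REINFORCE-1. The REINFORCE (score-function) gradient estimator has high variance, and averaging it over 10 sampled actions per state instead of 1 reduces that variance, so the actor receives a less noisy update signal.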
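
The variance argument can be illustrated on a toy problem that is unrelated to HalfCheetah. A minimal sketch, assuming a 1-D Gaussian policy and a made-up quadratic critic `q_value` (both are illustrative assumptions, not the homework's networks): it estimates the gradient of the expected Q-value with respect to the policy mean using 1 and 10 sampled actions, and compares the spread of the two estimators over many trials.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0     # hypothetical 1-D Gaussian policy N(mu, sigma^2)
num_trials = 20_000      # independent gradient estimates per setting


def q_value(action: np.ndarray) -> np.ndarray:
    """Made-up critic whose best action is a = 2."""
    return -(action - 2.0) ** 2


def reinforce_grad(num_samples: int) -> np.ndarray:
    """Score-function estimate of d/dmu E[Q(a)], averaged over num_samples actions."""
    actions = rng.normal(mu, sigma, size=(num_trials, num_samples))
    score = (actions - mu) / sigma**2   # d/dmu log N(a; mu, sigma^2)
    return np.mean(score * q_value(actions), axis=1)


grad_1 = reinforce_grad(1)
grad_10 = reinforce_grad(10)
true_grad = -2.0 * (mu - 2.0)   # analytic gradient of E[Q(a)] w.r.t. mu

print(f"true gradient:        {true_grad:.2f}")
print(f"REINFORCE-1  mean/var: {grad_1.mean():.2f} / {grad_1.var():.2f}")
print(f"REINFORCE-10 mean/var: {grad_10.mean():.2f} / {grad_10.var():.2f}")
```

Both estimators are unbiased, but the 10-sample average has roughly one tenth of the variance, at the cost of 10 critic evaluations per state.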

In [None]:
%load_ext tensorboard
%tensorboard --logdir data/hw3_sac

# 3.1.4 Actor with REPARAMETRIZE

* (Testing) Make sure you can solve InvertedPendulum-v4 (use sanity_invertedpendulum_reparametrize.yaml) and achieve similar reward to the REINFORCE case.

In [11]:
%run cs285/scripts/run_hw3_sac.py -cfg experiments/sac/sanity_invertedpendulum_reparametrize.yaml

Namespace(config_file='experiments/sac/sanity_invertedpendulum_reparametrize.yaml', eval_interval=5000, num_eval_trajectories=10, num_render_trajectories=0, seed=1, no_gpu=False, which_gpu=0, log_interval=1000, exp_name='', video_log_freq=-1)
{'exp_name': 'sanity_invpendulum_reparametrize', 'total_steps': 300000, 'random_steps': 5000, 'training_starts': 10000, 'batch_size': 128, 'replay_buffer_capacity': 1000000, 'ep_len': None, 'discount': 0.99, 'use_soft_target_update': False, 'target_update_period': 1000, 'soft_target_update_rate': None, 'actor_gradient_type': 'reparametrize', 'num_actor_samples': 1, 'num_critic_updates': 1, 'num_critic_networks': 1, 'target_critic_backup_type': 'mean', 'backup_entropy': True, 'use_entropy_bonus': True, 'temperature': 0.1, 'log_string': 'sanity_invpendulum_reparametrize_InvertedPendulum-v4', 'actor_fixed_std': None, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'env_name': 'InvertedPendulum-v4', 'hidden_size': 128, 'num_layers': 3, 

* Train (once again) on HalfCheetah-v4 with halfcheetah_reparametrize.yaml. Plot results for all three gradient estimators (REINFORCE-1, REINFORCE-10 samples, and REPARAMETRIZE) on the same set of axes, with number of environment steps on the x-axis and evaluation return on the y-axis.
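
To see why the reparametrized estimator can behave differently from REINFORCE, it helps to compare the two on a toy problem. A minimal sketch, assuming a 1-D Gaussian policy and a made-up differentiable critic (`q_value` and its derivative `dq_da` are illustrative assumptions, not the homework's networks): REINFORCE multiplies the score by the Q-value, while the reparametrized estimator writes the action as `mu + sigma * eps` and differentiates through the critic.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0     # hypothetical 1-D Gaussian policy N(mu, sigma^2)
num_trials = 20_000


def q_value(action: np.ndarray) -> np.ndarray:
    """Made-up critic whose best action is a = 2."""
    return -(action - 2.0) ** 2


def dq_da(action: np.ndarray) -> np.ndarray:
    """Analytic derivative of the made-up critic with respect to the action."""
    return -2.0 * (action - 2.0)


eps = rng.normal(size=num_trials)
actions = mu + sigma * eps               # reparametrized sample a = mu + sigma * eps

# REINFORCE: score function (a - mu) / sigma^2 = eps / sigma, times Q(a).
reinforce = (eps / sigma) * q_value(actions)

# Reparametrize: dQ/dmu = dQ/da * da/dmu, and da/dmu = 1.
reparam = dq_da(actions)

print(f"REINFORCE    mean/var: {reinforce.mean():.2f} / {reinforce.var():.2f}")
print(f"REPARAMETRIZE mean/var: {reparam.mean():.2f} / {reparam.var():.2f}")
```

On this toy problem both estimators target the same gradient, and the reparametrized one has much lower variance from a single sample because it uses the critic's slope rather than only its value.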

In [3]:
%run cs285/scripts/run_hw3_sac.py -cfg experiments/sac/halfcheetah_reparametrize.yaml

Namespace(config_file='experiments/sac/halfcheetah_reparametrize.yaml', eval_interval=5000, num_eval_trajectories=10, num_render_trajectories=0, seed=1, no_gpu=False, which_gpu=0, log_interval=1000, exp_name='', video_log_freq=-1)
{'exp_name': 'halfcheetah_reparametrize', 'total_steps': 1000000, 'random_steps': 5000, 'training_starts': 10000, 'batch_size': 128, 'replay_buffer_capacity': 1000000, 'ep_len': None, 'discount': 0.99, 'use_soft_target_update': True, 'target_update_period': None, 'soft_target_update_rate': 0.005, 'actor_gradient_type': 'reparametrize', 'num_actor_samples': 1, 'num_critic_updates': 1, 'num_critic_networks': 1, 'target_critic_backup_type': 'mean', 'backup_entropy': True, 'use_entropy_bonus': True, 'temperature': 0.1, 'log_string': 'halfcheetah_reparametrize_HalfCheetah-v4', 'actor_fixed_std': None, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'env_name': 'HalfCheetah-v4', 'hidden_size': 128, 'num_layers': 3, 'use_tanh': True}
#################

* Train an agent for the Humanoid-v4 environment with humanoid.yaml and plot results.

In [4]:
# Ran for about 60.5 hours (local CPU, i5-12400)
%run cs285/scripts/run_hw3_sac.py -cfg experiments/sac/humanoid.yaml \
    --video_log_freq 1_000_000 --num_render_trajectories 2

Namespace(config_file='experiments/sac/humanoid.yaml', eval_interval=5000, num_eval_trajectories=10, num_render_trajectories=2, seed=1, no_gpu=False, which_gpu=0, log_interval=1000, exp_name='', video_log_freq=1000000)
{'exp_name': 'humanoid', 'total_steps': 5000000, 'random_steps': 5000, 'training_starts': 10000, 'batch_size': 256, 'replay_buffer_capacity': 1000000, 'ep_len': None, 'discount': 0.99, 'use_soft_target_update': True, 'target_update_period': None, 'soft_target_update_rate': 0.005, 'actor_gradient_type': 'reparametrize', 'num_actor_samples': 1, 'num_critic_updates': 1, 'num_critic_networks': 2, 'target_critic_backup_type': 'min', 'backup_entropy': True, 'use_entropy_bonus': True, 'temperature': 0.05, 'log_string': 'humanoid_Humanoid-v4', 'actor_fixed_std': None, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'env_name': 'Humanoid-v4', 'hidden_size': 256, 'num_layers': 3, 'use_tanh': True}
########################
logging outputs to  C:\Users\user\Colab\Berk


************ Step 1500000 ************

Collecting data for eval...
eval_return : 2436.89794921875
eval_ep_len : 484.2
eval/return_std : 1635.2459716796875
eval/return_max : 5105.0830078125
eval/return_min : 450.7032165527344
eval/ep_len_std : 323.3477385107247
eval/ep_len_max : 1000
eval/ep_len_min : 90
TimeSinceStart : 49007.102304935455
Done logging...


************ Step 2000000 ************

Collecting data for eval...
eval_return : 3819.98388671875
eval_ep_len : 731.5
eval/return_std : 1281.099609375
eval/return_max : 5213.82373046875
eval/return_min : 1239.657470703125
eval/ep_len_std : 249.33240864356162
eval/ep_len_max : 1000
eval/ep_len_min : 224
TimeSinceStart : 71976.43786811829
Done logging...

Collecting video rollouts...


************ Step 2500000 ************

Collecting data for eval...
eval_return : 2825.539794921875
eval_ep_len : 522.5
eval/return_std : 1787.7298583984375
eval/return_max : 5674.67431640625
eval/return_min : 321.5865783691406
eval/ep_len_std : 314.2

In [None]:
%load_ext tensorboard
%tensorboard --logdir data/hw3_sac

# 3.1.5 Stabilizing Target Values

* Run single-Q, double-Q, and clipped double-Q on Hopper-v4 using the corresponding configuration files. Which one works best? Plot the logged eval_return from each of them as well as q_values. Discuss how these results relate to overestimation bias.

In [6]:
%run cs285/scripts/run_hw3_sac.py -cfg experiments/sac/hopper.yaml

Namespace(config_file='experiments/sac/hopper.yaml', eval_interval=5000, num_eval_trajectories=10, num_render_trajectories=0, seed=1, no_gpu=False, which_gpu=0, log_interval=1000, exp_name='', video_log_freq=-1)
{'exp_name': 'hopper_singlecritic', 'total_steps': 100000, 'random_steps': 5000, 'training_starts': 10000, 'batch_size': 256, 'replay_buffer_capacity': 100000, 'ep_len': None, 'discount': 0.99, 'use_soft_target_update': True, 'target_update_period': None, 'soft_target_update_rate': 0.005, 'actor_gradient_type': 'reparametrize', 'num_actor_samples': 1, 'num_critic_updates': 1, 'num_critic_networks': 1, 'target_critic_backup_type': 'mean', 'backup_entropy': False, 'use_entropy_bonus': True, 'temperature': 0.05, 'log_string': 'hopper_singlecritic_Hopper-v4', 'actor_fixed_std': None, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'env_name': 'Hopper-v4', 'hidden_size': 128, 'num_layers': 3, 'use_tanh': True}
########################
logging outputs to  C:\Users\user

In [7]:
%run cs285/scripts/run_hw3_sac.py -cfg experiments/sac/hopper_doubleq.yaml

Namespace(config_file='experiments/sac/hopper_doubleq.yaml', eval_interval=5000, num_eval_trajectories=10, num_render_trajectories=0, seed=1, no_gpu=False, which_gpu=0, log_interval=1000, exp_name='', video_log_freq=-1)
{'exp_name': 'hopper_doubleq', 'total_steps': 100000, 'random_steps': 5000, 'training_starts': 10000, 'batch_size': 256, 'replay_buffer_capacity': 100000, 'ep_len': None, 'discount': 0.99, 'use_soft_target_update': True, 'target_update_period': None, 'soft_target_update_rate': 0.005, 'actor_gradient_type': 'reparametrize', 'num_actor_samples': 1, 'num_critic_updates': 1, 'num_critic_networks': 2, 'target_critic_backup_type': 'doubleq', 'backup_entropy': False, 'use_entropy_bonus': True, 'temperature': 0.05, 'log_string': 'hopper_doubleq_Hopper-v4', 'actor_fixed_std': None, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'env_name': 'Hopper-v4', 'hidden_size': 128, 'num_layers': 3, 'use_tanh': True}
########################
logging outputs to  C:\Users\use

In [8]:
%run cs285/scripts/run_hw3_sac.py -cfg experiments/sac/hopper_clipq.yaml

Namespace(config_file='experiments/sac/hopper_clipq.yaml', eval_interval=5000, num_eval_trajectories=10, num_render_trajectories=0, seed=1, no_gpu=False, which_gpu=0, log_interval=1000, exp_name='', video_log_freq=-1)
{'exp_name': 'hopper_clipq', 'total_steps': 100000, 'random_steps': 5000, 'training_starts': 10000, 'batch_size': 256, 'replay_buffer_capacity': 100000, 'ep_len': None, 'discount': 0.99, 'use_soft_target_update': True, 'target_update_period': None, 'soft_target_update_rate': 0.005, 'actor_gradient_type': 'reparametrize', 'num_actor_samples': 1, 'num_critic_updates': 1, 'num_critic_networks': 2, 'target_critic_backup_type': 'min', 'backup_entropy': False, 'use_entropy_bonus': True, 'temperature': 0.05, 'log_string': 'hopper_clipq_Hopper-v4', 'actor_fixed_std': None, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'env_name': 'Hopper-v4', 'hidden_size': 128, 'num_layers': 3, 'use_tanh': True}
########################
logging outputs to  C:\Users\user\Colab\Be

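q_values: single-Q > double-Q > clipped double-Q. eval_return: clipped double-Q > double-Q > single-Q, so clipped double-Q works best. Single-Q suffers from overestimation bias; double-Q and clipped double-Q both alleviate it noticeably, and the lower Q-value estimates go together with higher evaluation returns.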
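
The ordering of the Q-values matches what a simple noise model predicts. A minimal sketch, assuming a discrete-action toy problem in which every true Q-value is 0 and two independent critics see it through unit Gaussian noise. This is an illustration of the bias only, not the SAC backup used in the homework, where the next action is sampled from the actor.

```python
import numpy as np

rng = np.random.default_rng(0)
num_trials, num_actions = 10_000, 10

# Assumption: the true Q-value of every action is 0; the critics are pure noise.
q1 = rng.normal(size=(num_trials, num_actions))
q2 = rng.normal(size=(num_trials, num_actions))

rows = np.arange(num_trials)
greedy = np.argmax(q1, axis=1)           # action chosen using critic 1

single = q1[rows, greedy]                # select and evaluate with the same critic
double = q2[rows, greedy]                # select with critic 1, evaluate with critic 2
clipped = np.minimum(single, double)     # clipped double-Q: minimum of both critics

print(f"single-Q target mean:         {single.mean():+.3f}")
print(f"double-Q target mean:         {double.mean():+.3f}")
print(f"clipped double-Q target mean: {clipped.mean():+.3f}")
```

Selecting and evaluating with the same noisy critic gives a clearly positive target even though the true value is 0. Evaluating with an independent critic removes that bias, and taking the minimum pushes the target slightly below the true value, which is the same single > double > clipped ordering seen in the logged q_values.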

* Pick the best configuration (single-Q/double-Q/clipped double-Q, or REDQ if you implement it) and run it on Humanoid-v4 using humanoid.yaml (edit the config to use the best option). You can truncate it after 500K environment steps. If you got results from the humanoid environment in the last homework, plot them together with environment steps on the x-axis and evaluation return on the y-axis. Otherwise, we will provide a humanoid log file that you can use for comparison. How do the off-policy and on-policy algorithms compare in terms of sample efficiency? *Note: if you’d like to run training to completion (5M steps), you should get a proper, walking humanoid! You can run with videos enabled by using **-nvid 1**. If you run with videos, you can strip videos from the logs for submission with [this script](https://gist.github.com/kylestach/e9964f5f34ee74367547dec83eaf5fae).*

In [18]:
# Ran for about 111.7 hours (local CPU, i5-12400)
%run cs285/scripts/run_hw3_sac.py -cfg experiments/sac/humanoid_redq_reparametrize.yaml \
    --video_log_freq 1_000_000 --num_render_trajectories 2

Namespace(config_file='experiments/sac/humanoid_redq_reparametrize.yaml', eval_interval=5000, num_eval_trajectories=10, num_render_trajectories=2, seed=1, no_gpu=False, which_gpu=0, log_interval=1000, exp_name='', video_log_freq=1000000)
{'exp_name': 'humanoid_redq_reparametrize', 'total_steps': 5000000, 'random_steps': 5000, 'training_starts': 10000, 'batch_size': 256, 'replay_buffer_capacity': 1000000, 'ep_len': None, 'discount': 0.99, 'use_soft_target_update': True, 'target_update_period': None, 'soft_target_update_rate': 0.005, 'actor_gradient_type': 'reparametrize', 'num_actor_samples': 1, 'num_critic_updates': 1, 'num_critic_networks': 10, 'target_critic_backup_type': 'redq', 'backup_entropy': True, 'use_entropy_bonus': True, 'temperature': 0.05, 'log_string': 'humanoid_redq_reparametrize_Humanoid-v4', 'actor_fixed_std': None, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'env_name': 'Humanoid-v4', 'hidden_size': 256, 'num_layers': 3, 'use_tanh': True}
##########

In [None]:
%load_ext tensorboard
%tensorboard --logdir data/hw3_sac