
The Mirage of Action-Dependent Baselines in Reinforcement Learning

Code to reproduce the experiments in The Mirage of Action-Dependent Baselines in Reinforcement Learning. George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard E. Turner, Zoubin Ghahramani, Sergey Levine. ICML 2018.

Linear-Quadratic-Gaussian (LQG) systems

See Appendix Section 9 for a detailed description of the LQG system. The code in this folder was used to generate the results in the LQG section (3.1) and Figures 1 and 5.
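As a reminder of the setup, an LQG system has linear dynamics, quadratic costs, and Gaussian noise. The toy simulation below is only an illustrative sketch; the dimensions, matrices, and policy gain are made-up placeholders, not the values from the appendix:

```python
import numpy as np

# Illustrative Linear-Quadratic-Gaussian (LQG) rollout.
# All constants below are hypothetical; see Appendix Section 9 for the real setup.
rng = np.random.default_rng(0)

d = 2                      # state/action dimension (assumption)
A = np.eye(d)              # state transition matrix
B = np.eye(d)              # action effect matrix
Q = np.eye(d)              # quadratic state cost
R = 0.1 * np.eye(d)        # quadratic action cost
sigma = 0.1                # Gaussian dynamics noise scale
K = -0.5 * np.eye(d)       # a fixed linear policy gain (not the optimal LQR gain)

def rollout(x0, horizon=20):
    """Simulate one trajectory under a linear-Gaussian policy; return total cost."""
    x, total_cost = x0.copy(), 0.0
    for _ in range(horizon):
        a = K @ x + 0.1 * rng.standard_normal(d)   # linear-Gaussian policy
        total_cost += x @ Q @ x + a @ R @ a        # quadratic per-step cost
        x = A @ x + B @ a + sigma * rng.standard_normal(d)  # linear dynamics + noise
    return total_cost

print(rollout(np.ones(d)))
```

A key property the paper exploits is that, in this setting, gradient variances can be computed analytically rather than estimated from rollouts.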

Q-Prop

We modified the Q-Prop implementation published by the authors (commit 4d55f96). For our experiments, we used the conservative variant of Q-Prop, as is used throughout the experimental section of the original paper, with the default choices of policy and value functions, learning rates, and other hyperparameters. This code was used to generate Figure 3; we describe the modifications in detail in Appendix 8.1.

The experimental data for all the results is contained in data/local/*. To reproduce the plots from the paper, run the plotting script with python; pass the --mini flag to generate the same plot with a separate legend in each subfigure (useful for cropping).

NOTE: Running the experiments found in sandbox/rocky/tf/launchers/ might throw a ModuleNotFoundError. To fix this, add the top-level folder to your environment variable PYTHONPATH.
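For example, from the repository's top-level directory in a POSIX shell:

```shell
# Prepend the repository root to PYTHONPATH so the scripts under
# sandbox/rocky/tf/launchers/ can resolve their top-level imports.
export PYTHONPATH="$(pwd):$PYTHONPATH"
```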

Backpropagation through the Void

We used the implementation published by the authors (commit 0e6623d) with the following modification: we measure the variance of the policy gradient estimator. In the original code, the authors accidentally measure the variance of a gradient estimator that neither method uses. We note that Grathwohl et al. (2018) recently corrected a bug in the code that caused the LAX method to use a different advantage estimator than the base method; we use this bug fix. The code was used to generate Figure 13.
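To make the diagnostic concrete, here is a hedged one-dimensional toy sketch (not the authors' code) of measuring the variance of a score-function gradient estimator, with and without a constant baseline; the policy, reward, and baseline below are all invented for illustration:

```python
import numpy as np

# Toy score-function (REINFORCE) gradient estimator for a 1-D Gaussian policy
# a ~ N(theta, 1) with a quadratic reward. Everything here is a made-up example.
rng = np.random.default_rng(0)
theta = 0.5

def grad_samples(baseline=0.0, n=100_000):
    a = theta + rng.standard_normal(n)       # sample actions from the policy
    reward = -(a - 2.0) ** 2                 # toy reward function
    score = a - theta                        # d/dtheta log N(a; theta, 1)
    return (reward - baseline) * score       # per-sample gradient estimates

g_no_base = grad_samples(baseline=0.0)
b = -((theta - 2.0) ** 2 + 1.0)              # E[reward], closed form for this toy
g_base = grad_samples(baseline=b)

# Both estimators are unbiased (same mean), but the baseline lowers variance.
print(np.var(g_no_base), np.var(g_base))
```

The point of the measurement in the modified code is analogous: compare the empirical variance of the gradient estimates each method actually uses, rather than of some other quantity.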

To generate the figure, run the following commands:


Action-dependent Control Variates for Policy Optimization via Stein's Identity

We used the Stein control variate implementation published by the authors (commit 6eec471). We describe the experiments in Appendix Section 8.2 and use the code to generate Figures 8 and 12.

To generate Figure 8, first create runner scripts with


Then run the bash scripts to generate results. Use


to generate the figure from the log files (included in the repo).

To generate Figure 12,


TRPO experiments

We modified an open-source TRPO implementation (commit 27400b8).

Performance comparison

To generate the performance comparison plot (Figure 4), switch to the state_comparison branch, run the commands in the run_*.sh scripts, and copy down the logs. Then run the plotting script to generate Figure 4.

Variance calculations

To generate the variance plots (Figures 2, 9, 10, and 11), switch to branch variance and run

Horizon-aware Comparison

To generate the figures for the horizon-aware comparison experiments (Figures 6 and 7), switch to the horizon_aware_comparison branch and run the training in:


This script uses a simple utility (berg, not included) to schedule jobs on Google Cloud Platform. The resulting log files are included in the gs_results folder.

To generate the figure from the data, run


This is not an officially supported Google product. George Tucker maintains this.

