A Hitchhiker's Guide to Statistical Comparisons of Reinforcement Learning Algorithms

This repository accompanies the paper:

A Hitchhiker's Guide to Statistical Comparisons of Reinforcement Learning Algorithms.

This code serves two purposes: reproducing the experiments from the paper, and providing an example of rigorous testing and visualization of algorithm performance.

Reproducing the Results:

Running experiments:

python3 run_experiment.py --study equal_dist_equal_var

Possible studies:

equal_dist_equal_var
equal_dist_unequal_var
unequal_dist_equal_var
unequal_dist_unequal_var_1: the first distribution of each pair has the smaller standard deviation
unequal_dist_unequal_var_2: the first distribution of each pair has the larger standard deviation

Running the command above with --study equal_dist_equal_var creates one pickle file per pair of distributions in ./data/equal_dist_equal_var/. A bash script, run_expe.sh, is provided to launch the experiments on a Slurm cluster.

We advise running the experiment with fewer iterations first to make sure everything works.
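
For intuition about what a study such as equal_dist_equal_var measures, here is a minimal sketch of a false positive rate estimate. It is not the code in run_experiment.py: the function name, the standard normal distribution, the choice of Welch's t-test, and the iteration count are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def false_positive_rate(sample_size, n_iterations=2000, alpha=0.05, seed=0):
    """Estimate how often Welch's t-test rejects when both samples share one distribution.

    Hypothetical helper for illustration; not part of the repository.
    """
    rng = np.random.default_rng(seed)
    # Both samples are drawn from the same N(0, 1), so every rejection is a false positive.
    a = rng.normal(0.0, 1.0, size=(n_iterations, sample_size))
    b = rng.normal(0.0, 1.0, size=(n_iterations, sample_size))
    # One test per row (axis=1); equal_var=False selects Welch's t-test.
    p_values = stats.ttest_ind(a, b, axis=1, equal_var=False).pvalue
    return float(np.mean(p_values < alpha))


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, false_positive_rate(n))
```

A well-calibrated test should give a rate close to alpha. Lowering n_iterations makes the run faster but the estimate noisier, which is why a short trial run is useful only as a smoke test.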

Plots and Tables

To plot the false positive rate as a function of the sample size for the various tests, run the plot_false_positive.py script:

python3 plot_false_positive.py --study equal_dist_equal_var

To generate the LaTeX code for a table of the statistical power results, use the table_from_results.py script:

python3 table_from_results.py --study equal_dist_equal_var
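
To illustrate what a statistical power table contains, the sketch below estimates power by simulation and formats it as LaTeX. It does not read the pickle files produced by run_experiment.py and is not the logic of table_from_results.py: the function names, the normal distributions, the effect size, and the table layout are assumptions.

```python
import numpy as np
from scipy import stats


def estimate_power(sample_size, effect=1.0, n_iterations=2000, alpha=0.05, seed=0):
    """Fraction of simulated comparisons in which a true mean difference is detected.

    Hypothetical helper: compares N(0, 1) with N(effect, 1) using Welch's t-test.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(n_iterations, sample_size))
    b = rng.normal(effect, 1.0, size=(n_iterations, sample_size))
    p_values = stats.ttest_ind(a, b, axis=1, equal_var=False).pvalue
    # The distributions really differ, so each rejection is a true positive.
    return float(np.mean(p_values < alpha))


def power_table(sample_sizes, effect=1.0):
    """Return LaTeX code for a two-column table: sample size and estimated power."""
    lines = [r"\begin{tabular}{rr}", r"\hline", r"Sample size & Power \\", r"\hline"]
    for n in sample_sizes:
        value = estimate_power(n, effect=effect)
        lines.append(rf"{n} & {value:.2f} \\")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(power_table((5, 10, 20, 50)))
```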

Test and Plot two samples

python3 example_test_and_plot.py

The data we used are:

192 runs of Soft Actor-Critic (SAC) for 2M timesteps on HalfCheetah-v2, using the Spinning Up implementation.
192 runs of Twin Delayed Deep Deterministic Policy Gradient (TD3) for 2M timesteps on HalfCheetah-v2, using the Spinning Up implementation.

This example draws one sample of a given size from each set of runs, compares the two samples using a statistical test, and plots the learning curves, with shaded error areas and dots indicating statistical significance.

The central tendency, the type of error, the test used, the confidence level and the sample size are tunable parameters.
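
The procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the content of example_test_and_plot.py: the function names are hypothetical, the data are synthetic stand-ins rather than the SAC and TD3 runs, the central tendency is the mean, the error is the standard error of the mean, and the test is Welch's t-test applied independently at each evaluation point, with no correction for multiple comparisons.

```python
import numpy as np
from scipy import stats


def compare_curves(runs_a, runs_b, sample_size, alpha=0.05, seed=0):
    """Subsample runs, summarize each algorithm, and test each evaluation point.

    runs_a, runs_b: arrays of shape (n_runs, n_steps). Hypothetical helper.
    """
    rng = np.random.default_rng(seed)
    # Draw sample_size distinct runs from each pool.
    a = runs_a[rng.choice(len(runs_a), size=sample_size, replace=False)]
    b = runs_b[rng.choice(len(runs_b), size=sample_size, replace=False)]
    summary = {}
    for name, x in (("A", a), ("B", b)):
        mean = x.mean(axis=0)
        sem = x.std(axis=0, ddof=1) / np.sqrt(sample_size)
        summary[name] = (mean, sem)
    # Welch's t-test across runs (axis=0), one p-value per evaluation point.
    p_values = stats.ttest_ind(a, b, axis=0, equal_var=False).pvalue
    return summary, p_values < alpha


def plot_comparison(summary, significant, path="comparison.png"):
    """Plot mean curves with shaded error and dots where the difference is significant."""
    import matplotlib.pyplot as plt  # imported here so the analysis runs without it

    steps = np.arange(len(significant))
    fig, ax = plt.subplots()
    for name, (mean, sem) in summary.items():
        ax.plot(steps, mean, label=name)
        ax.fill_between(steps, mean - sem, mean + sem, alpha=0.3)
    y_top = ax.get_ylim()[1]
    ax.scatter(steps[significant], np.full(int(significant.sum()), y_top), s=8, c="k")
    ax.set_xlabel("evaluation point")
    ax.set_ylabel("performance")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)


# Synthetic stand-in data: 192 noisy linear curves per algorithm (not real results).
data_rng = np.random.default_rng(1)
progress = np.linspace(0.0, 1.0, 100)
fake_a = 10.0 * progress + data_rng.normal(0.0, 1.0, size=(192, 100))
fake_b = 8.0 * progress + data_rng.normal(0.0, 1.0, size=(192, 100))

summary, significant = compare_curves(fake_a, fake_b, sample_size=20)
```

Calling plot_comparison(summary, significant) then writes the figure to comparison.png. Swapping the mean for the median, or the standard error for another error measure, only changes the two lines that fill summary.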

SAC and TD3 Performances

This repository also provides text files with the learning curves of 192 runs of Soft Actor-Critic (SAC) and 192 runs of Twin Delayed Deep Deterministic Policy Gradient (TD3), each run for 2M timesteps on HalfCheetah-v2.

