Code for Double Gumbel Q-Learning
Data (5.4 MB): https://drive.google.com/file/d/12wyYZ92bvVdkEQIHms8mVR5zYJZue-cd/view?usp=sharing
Logs (4.21 GB): https://drive.google.com/file/d/1LpR3lrKUx-qTaCrI4YViAjc0QA5kb8P2/view?usp=sharing
Tested on Python 3.9 with CUDA 12.2.1 and cuDNN 8.8.0.
git clone git@github.com:dyth/doublegum.git
cd doublegum
create virtualenv
virtualenv <VIRTUALENV_LOCATION>/doublegum
source <VIRTUALENV_LOCATION>/doublegum/bin/activate
or conda
conda create --name doublegum python=3.9
conda activate doublegum
install mujoco
mkdir .mujoco
cd .mujoco
wget https://mujoco.org/download/mujoco210-linux-x86_64.tar.gz
tar -xf mujoco210-linux-x86_64.tar.gz
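If the environments are loaded through mujoco-py, it typically expects the extracted mujoco210 folder under the home directory (~/.mujoco) and its libraries on LD_LIBRARY_PATH. This is an assumption about this setup rather than something the repo states; adjust the path to wherever you unpacked the tarball:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin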
install packages
pip install -r requirements.txt
pip install "jax[cuda12_pip]==0.4.14" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
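As a quick sanity check (not a step the repo requires), you can confirm that the CUDA build of JAX sees the GPU before launching any runs:
python -c "import jax; print(jax.devices())"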
test that the code runs
./test.sh
python main_cont.py --env <ENV_NAME> --policy <POLICY>
MetaWorld envs are run with --env MetaWorld_<ENVNAME>. Example invocations follow the policy lists below.
Policies benchmarked in our paper were:
- DoubleGum: DoubleGum (our algorithm)
- DDPG: DDPG (Deep Deterministic Policy Gradients), [Lillicrap et al., 2015]
- TD3: TD3 (Twin Delayed DDPG), [Fujimoto et al., 2018]
- SAC: SAC (Soft Actor-Critic, defaults to use Twin Critics), [Haarnoja et al., 2018]
- XQL --ensemble 1: XQL (Extreme Q-Learning), [Garg et al., 2023]
- MoG-DDPG: MoG-DDPG (Mixture of Gaussians Critics DDPG), [Barth-Maron et al., 2018; Shahriari et al., 2022]
Policies we created/modified as additional benchmarks were:
- QR-DDPG: QR-DDPG (Quantile Regression [Dabney et al., 2018] with DDPG, defaults to use Twin Critics)
- QR-DDPG --ensemble 1: QR-DDPG without Twin Critics
- SAC --ensemble 1: SAC without Twin Critics
- XQL: XQL with Twin Critics
- TD3 --ensemble 5 --pessimism <p>: Finer TD3, where p is an integer between 0 and 4
Policies included in this repository but not benchmarked in our paper were:
- IQL: Implicit Q-Learning adapted to an online setting, [Kostrikov et al., 2022]
- SACLite: SAC without the entropy term on the critic, [Yu et al., 2022]
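For example, keeping <ENV_NAME> as a placeholder for a continuous-control task supported by the repo, a DoubleGum run, a MetaWorld run, and a Finer TD3 run with pessimism level 2 would be launched as follows (illustrative invocations built only from the flags listed above):
python main_cont.py --env <ENV_NAME> --policy DoubleGum
python main_cont.py --env MetaWorld_<ENVNAME> --policy DoubleGum
python main_cont.py --env <ENV_NAME> --policy TD3 --ensemble 5 --pessimism 2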
python main_disc.py --env <ENV_NAME> --policy <POLICY>
Example invocations follow the policy lists below.
Policies benchmarked in our paper were:
- DoubleGum: DoubleGum (our algorithm)
- DQN: DQN, [Mnih et al., 2015]
- DDQN: DDQN (Double DQN), [van Hasselt et al., 2016]
- DuellingDQN: DuellingDQN, [Wang et al., 2016]
Policies we created/modified as additional benchmarks were:
- DuellingDDQN: DuellingDDQN (Duelling Double DQN)
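For example, with <ENV_NAME> again standing in for a discrete-control task supported by the repo, DoubleGum and Duelling Double DQN runs would be launched as:
python main_disc.py --env <ENV_NAME> --policy DoubleGum
python main_disc.py --env <ENV_NAME> --policy DuellingDDQN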
Plots and tables are reproduced from the raw data in Data and Logs. Logs (4.21 GB) contains data for Section 4 (Figures 1 and 2) and Appendix E.2 (Figures 6 and 7), while Data (5.4 MB) contains benchmark results for DoubleGum and baselines used in all other graphs, results and tables.
Run with
python plotting/fig<x>.py
python tables/tab<x>.py
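For instance, assuming the scripts are numbered after the paper's figures and tables (not verified here), Figure 1 and Table 1 would be reproduced with:
python plotting/fig1.py
python tables/tab1.py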
- Wrappers from ikostrikov/jaxrl
- Distributional RL from google-deepmind/acme
- Control flow from yifan12wu/td3-jax