Figure 1: Behaviour foundation models (BFMs) with memory.
This is the official codebase for Zero-Shot Reinforcement Learning Under Partial Observability by Scott Jeen, Tom Bewley and Jonathan Cullen.
Assuming you have MuJoCo installed, set up a conda environment with Python 3.9.16:
conda create --name zsrl python=3.9.16
Then install the dependencies from requirements.txt:
pip install -r requirements.txt
We provide implementations of the following algorithms:
| Algorithm | Command Line Argument |
|---|---|
| FB with Memory | fb_m |
| HILP with Memory | hilp_m |
We provide implementations for a range of memory model architectures:
| Memory Model | Authors | Command Line Argument |
|---|---|---|
| GRU | Cho et al. (2014) | --memory_type=gru |
| Transformer | Vaswani et al. (2017) | --memory_type=transformer |
| S4d | Gu et al. (2022) | --memory_type=s4d |
| MLP (frame-stacking) | Mnih et al. (2015) | --memory_type=mlp |
You can modify their hyperparameters:
| Hyperparameter | Description | Default | Command Line Arg |
|---|---|---|---|
| | Length of the trajectory passed to the forward model | | --history_length |
| | Length of the trajectory passed to the backward model | | --backward_history_length |
| Model Dimension | Hidden state dimension | | --model_dimension |
| Memory-based forward model / policy | Whether the forward model / policy uses memory | True | --recurrent_F/--no-recurrent_F |
| Memory-based backward model | Whether the backward model uses memory | True | --recurrent_B/--no-recurrent_B |
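To make the history-length hyperparameters concrete, here is a minimal sketch of frame-stacking, the idea behind the MLP memory model: the last few observations are concatenated into one flat input. This is not the repository's implementation. The `FrameStacker` class, its method names, and the zero-padding at episode start are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List, Sequence


class FrameStacker:
    """Hypothetical helper: keeps the most recent observations and flattens them."""

    def __init__(self, obs_dim: int, history_length: int) -> None:
        self.obs_dim = obs_dim
        self.history_length = history_length
        # A bounded deque discards the oldest frame automatically.
        self._frames: Deque[List[float]] = deque(maxlen=history_length)
        self.reset()

    def reset(self) -> None:
        # Assumption: pad the history with zeros at the start of an episode.
        self._frames.clear()
        for _ in range(self.history_length):
            self._frames.append([0.0] * self.obs_dim)

    def push(self, obs: Sequence[float]) -> List[float]:
        """Add a new observation and return the stacked input, oldest frame first."""
        if len(obs) != self.obs_dim:
            raise ValueError(f"expected {self.obs_dim} values, got {len(obs)}")
        self._frames.append(list(obs))
        return [value for frame in self._frames for value in frame]


if __name__ == "__main__":
    stacker = FrameStacker(obs_dim=2, history_length=3)
    for step in range(4):
        print(stacker.push([float(step), float(step) + 0.5]))
```

The stacked vector always has `obs_dim * history_length` entries, so a plain MLP can consume it without any recurrent state.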
You can recover standard FB and HILP by passing both --no-recurrent_F and --no-recurrent_B.
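For readers unfamiliar with paired on/off flags, the sketch below shows one way flags of the form --recurrent_F/--no-recurrent_F can be declared with Python's standard `argparse` (3.9+). How the repository actually parses its arguments is not shown here, so `build_parser` and `uses_memory` are hypothetical names used only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two memory flags from the table above."""
    parser = argparse.ArgumentParser(description="Illustrative flag sketch")
    # BooleanOptionalAction generates both --recurrent_F and --no-recurrent_F.
    parser.add_argument(
        "--recurrent_F", action=argparse.BooleanOptionalAction, default=True
    )
    parser.add_argument(
        "--recurrent_B", action=argparse.BooleanOptionalAction, default=True
    )
    return parser


def uses_memory(args: argparse.Namespace) -> bool:
    """True if either the forward or the backward model is memory-based."""
    return bool(args.recurrent_F or args.recurrent_B)


if __name__ == "__main__":
    args = build_parser().parse_args(["--no-recurrent_F", "--no-recurrent_B"])
    print(args.recurrent_F, args.recurrent_B, uses_memory(args))
```

With both flags negated, neither model uses memory, which corresponds to the memory-free setting described above.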
In the paper we report results with agents trained on different partially observed variants of ExORL domains. The domains are:
| Domain | Eval Tasks | Dimensionality | Type | Reward |
|---|---|---|---|---|
| Walker | stand walk run flip | Low | Locomotion | Dense |
| Quadruped | stand roll roll_fast jump escape | High | Locomotion | Dense |
| Cheetah | run run_backward walk walk_backward | Low | Locomotion | Dense |
We implement a set of POMDPs that exhibit different types of partial observability:
| POMDP Setting | Description | Default Hyperparameter(s) | Environment Command Line Arg |
|---|---|---|---|
| Flickering states | States are dropped (zeroed) with probability flickering_prob | flickering_prob=0.2 | {env_name}_flickering |
| Noisy states | Isotropic zero-mean Gaussian noise is added to states, with variance controlled by noise_std | noise_std=0.2 | {env_name}_noise |
| Dropped state variables | Subsets of state variables (sensors) are dropped (zeroed) with probability missing_sensor_prob | missing_sensor_prob=0.2 | {env_name}_sensors |
| Removed velocities | Velocities are removed from the state | n/a | {env_name}_occluded |
| Changed dynamics | Mass and damping coefficients in the underlying MuJoCo simulator are scaled to different values between training and testing | train_multiplies=1.0 eval_multipliers=1.0 | {env_name} |
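The first three settings in the table can be sketched as simple functions applied to each observation. This is an illustrative sketch, not the repository's environment code. The function names are hypothetical, `noise_std` is treated as a standard deviation (an assumption based on its name), and sensors are assumed to be dropped independently at every step.

```python
import random
from typing import List, Sequence


def flicker(obs: Sequence[float], flickering_prob: float, rng: random.Random) -> List[float]:
    """Zero the whole observation with probability `flickering_prob`."""
    if rng.random() < flickering_prob:
        return [0.0] * len(obs)
    return list(obs)


def add_noise(obs: Sequence[float], noise_std: float, rng: random.Random) -> List[float]:
    """Add independent zero-mean Gaussian noise to every state variable."""
    # Assumption: noise_std is the standard deviation of the noise.
    return [x + rng.gauss(0.0, noise_std) for x in obs]


def drop_sensors(obs: Sequence[float], missing_sensor_prob: float, rng: random.Random) -> List[float]:
    """Zero each state variable independently with probability `missing_sensor_prob`."""
    # Assumption: the dropped subset is resampled at every call.
    return [0.0 if rng.random() < missing_sensor_prob else x for x in obs]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducibility
    obs = [0.1, -0.4, 1.2, 0.0]
    print(flicker(obs, flickering_prob=0.2, rng=rng))
    print(add_noise(obs, noise_std=0.2, rng=rng))
    print(drop_sensors(obs, missing_sensor_prob=0.2, rng=rng))
```

All three functions preserve the observation's length, so the agent's input dimension is unchanged and only the information content degrades.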
For each domain, you'll need to download the RND dataset manually from the ExORL benchmark then reformat it.
For example, for the rnd dataset on the walker domain, join the domain and dataset names with an _ and run:
python exorl_reformatter.py walker_rnd
This will create a single dataset.npz file in the dataset/walker/rnd/buffer directory.
To train a standard FB-M model with a GRU memory model on the rnd dataset to solve all tasks in the walker_flickering domain, run:
python main_exorl.py fb_m walker_flickering rnd --memory_type=gru --eval_task stand run walk flip
Read the full paper for more details! If you found this work useful, please consider citing it:
@article{jeen2025zero,
author = {Jeen, Scott and Bewley, Tom and Cullen, Jonathan M.},
title = {Zero-Shot Reinforcement Learning Under Partial Observability},
journal={arXiv preprint arXiv:2506.15446},
year={2025}
}
This work is licensed under a standard MIT License; see LICENSE.md for further details.
