This repository contains the official implementation accompanying “Benchmarking RL via Converse Optimality: Generating Systems with Known Optimal Policies” (submitted for peer review).
The project provides a scientifically rigorous benchmarking framework for reinforcement learning, built around converse-optimality constructions that generate dynamical systems with:
- provably known optimal policies (π*)
- closed-form optimal value functions (V*)
- stochastic nonlinear control-affine dynamics
- tunable difficulty & conditioning
This enables ground-truth evaluation of RL algorithms in ways that classical environments cannot: true optimality gaps, exact regret, and precise sensitivity analysis.
Standard RL benchmarks do not provide access to the optimal value function or optimal feedback policy. As a result, evaluating or comparing algorithms is fundamentally indirect.
Our framework reverses the usual direction of optimal control:
Instead of searching for the optimal policy of a given system, we search for a system for which a given policy is optimal.
This “converse optimality” principle enables designing environments where:
- the optimal controller is analytically known
- the Bellman identity holds exactly
- nonlinearity and stochasticity are injected afterward, while preserving optimality
- difficulty is parameterized continuously
The result is a set of RL environments with certified ground truth, ideal for robust benchmarking.
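In the Quadratic–Gaussian (LQR) specialization, the converse direction can be sketched in a few lines: instead of solving a Riccati equation for the value function given a cost, we fix a quadratic value function V(x) = xᵀPx and back out the state cost for which it is optimal. The matrices below are illustrative, not taken from the paper:

```python
import numpy as np

# Converse-optimality sketch in the LQR specialization (illustrative
# matrices, not the paper's construction): choose P, then derive the
# state cost Q for which V(x) = x^T P x is the optimal value function.
A = np.array([[0.9, 0.1],
              [0.0, 0.9]])          # stable linear dynamics
B = np.array([[0.0],
              [1.0]])
R = np.eye(1)                       # input cost (fixed)

P = np.diag([2.0, 1.0])             # chosen quadratic value function

# Induced optimal gain: u* = -K x
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# State cost that makes P satisfy the discrete algebraic Riccati equation
# (must be checked positive semidefinite to define a valid problem).
Q = P - A.T @ P @ A + A.T @ P @ B @ K

# Verify the Bellman/Riccati identity holds exactly for (A, B, Q, R, P)
dare = A.T @ P @ A - A.T @ P @ B @ K + Q
assert np.allclose(dare, P)
assert np.min(np.linalg.eigvalsh(Q)) >= 0
```

Not every choice of P yields a positive semidefinite Q, which is why the sketch ends with an explicit check.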
```
Code/
├── benchmarks/
│   ├── envs/
│   │   ├── __init__.py
│   │   ├── lqr_env.py
│   │   └── nudex_env.py
│   ├── specs/
│   │   ├── LQR_doubleint_hard_v1.yaml
│   │   └── NUDEx_K1_easy_v1.yaml
│   ├── __init__.py
│   ├── oracles.py
│   └── registry.py
├── cards/
│   └── nudex_k1_easy_v1.md
├── dataset/
│   ├── code/
│   │   └── converse_four_systems_with_disturbances.py
│   ├── figures/
│   │   ├── fig_S1_COST.pdf
│   │   ├── fig_S1_COST.png
│   │   ├── fig_S2_COST.pdf
│   │   ├── fig_S2_COST.png
│   │   ├── fig_S3_COST.pdf
│   │   ├── fig_S3_COST.png
│   │   ├── fig_S4_COST.pdf
│   │   └── fig_S4_COST.png
│   ├── math/
│   │   └── converse_four_systems_math.pdf
│   ├── manifest.json
│   └── README.md
├── nvdex runs/
│   ├── CSVs/
│   │   └── summary_nudex.csv
│   ├── evaluation/
│   │   ├── plotting code/
│   │   │   └── plot_summary_nudex_suite.py
│   │   └── analyze_nudex_runs.py
│   ├── plotting code/
│   │   ├── plot_summary_nudex_suite.py
│   │   └── r
│   ├── nudex_env.py
│   ├── nudex_family.py
│   ├── resume_all_nudex_runs.sh
│   ├── run_all_nudex.sh
│   ├── run_nudex_k2_parallel_resume.sh
│   └── train_baselines_nudex.py
├── tests/
│   └── test_sanity.py
├── yaml configs/
│   ├── arm link/
│   │   ├── algorithms.yaml
│   │   ├── global.yaml
│   │   ├── grid.yaml
│   │   └── r
│   ├── nvdex/
│   │   ├── algorithms.yaml
│   │   ├── difficulties.yaml
│   │   ├── global.yaml
│   │   ├── r
│   │   └── schedules.yaml
│   └── r
├── converse_arm_env.py
├── family.py
├── ReproducibilityChecklist (1).pdf
├── requirements.txt
├── run_all.sh
└── train_baselines_arm_gpu.py
LICENSE
README.md
```
This repository contains several canonical families derived through converse-optimality constructions.
All systems are discrete-time, stochastic, control-affine, and equipped with:
- a closed-form quadratic value function
- an optimal linear state-feedback policy
- nonlinear drifts constructed via differential geometry while maintaining full optimality
NVDEx (NUDEx), the Nonholonomic Vehicle with Dynamic Extension, is a converse-optimal family built on a bicycle/unicycle geometry with input dynamic extension. The system is:
- locally open-loop unstable
- nonholonomic
- rotationally coupled through the swirl operator
- solvable in closed form in the Quadratic–Gaussian specialization
Optimal policy and optimal value are known analytically.
The converse arm benchmark is a serial n-link planar arm with revolute joints and torque inputs. This benchmark offers:
- a physically interpretable morphology
- highly coupled nonlinear dynamics
- closed-form optimal policies
- tunable difficulty via geometric parameters (swirl strength, conditioning, noise)
Each benchmark exposes:
- `oracle.action(s)` — exact optimal action
- `oracle.value(s)` — exact value function
- `oracle.q_value(s, a)` — exact Q-function

This allows measurement of:
- optimality gaps
- exact regret
- Bellman residuals
- finite-horizon deviation from optimality
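As a hedged illustration of how the oracle interface supports these measurements (the repository's actual oracle classes may differ), `QuadraticOracle` below is a hypothetical stand-in with the same three methods:

```python
import numpy as np

# Hypothetical stand-in with the same interface as the repository's
# oracles (action / value / q_value); matrices are illustrative.
class QuadraticOracle:
    def __init__(self, P, K):
        self.P, self.K = P, K

    def action(self, s):
        return -self.K @ s                       # exact optimal action

    def value(self, s):
        return float(s @ self.P @ s)             # exact value function

    def q_value(self, s, a):
        # Toy Q-function: optimal value plus a penalty for deviating
        # from the optimal action (costs are minimized, so Q >= V).
        a_star = self.action(s)
        return self.value(s) + float(np.sum((a - a_star) ** 2))

oracle = QuadraticOracle(P=np.eye(2), K=np.array([[0.5, 0.1]]))

s = np.array([1.0, -0.5])
a_learned = np.array([-0.3])                     # action from a learned policy

# Per-state optimality gap: zero iff the learned action is optimal.
gap = oracle.q_value(s, a_learned) - oracle.value(s)
```

Summing this per-state gap along a rollout gives exact regret; evaluating `q_value` on a learned value estimate gives Bellman residuals.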
Every benchmark family supports continuous difficulty control through:
- noise covariance
- drift conditioning
- swirl strength
- dynamic extension order
- cost geometry
Benchmarks use:
- deterministic system generation (via seeds)
- stochastic rollouts (Gaussian noise)
- stable YAML specifications (`benchmarks/specs/*.yaml`)
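A minimal sketch of what seed-deterministic generation means in practice (the real generator lives in the repository; `make_system` below is hypothetical):

```python
import numpy as np

# Hypothetical generator -- the real one is in the repository, but the
# contract is the same: identical seeds must yield identical systems.
def make_system(seed):
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(2, 2))    # drift parameters
    B = rng.normal(size=(2, 1))    # input matrix
    return A, B

A1, B1 = make_system(42)
A2, B2 = make_system(42)
assert np.array_equal(A1, A2) and np.array_equal(B1, B2)   # reproducible
```

Stochasticity then enters only through the rollout noise, not through the system definition.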
Benchmarks can be instantiated from a single name:

```python
from Code.benchmarks import registry

env, oracle = registry.make("NUDEx_K1_easy_v1")
```

To install:

```bash
git clone https://github.com/converseoptimality/RL-Benchmarking.git
cd RL-Benchmarking/Code
pip install -r requirements.txt
```

Minimal usage example:

```python
from Code.benchmarks import registry

env, oracle = registry.make("NUDEx_K1_easy_v1")
obs = env.reset()

# Evaluate the optimal policy
a_star = oracle.action(obs)
obs, reward, done, info = env.step(a_star)
```

The repository includes an example GPU training script:

```bash
python Code/train_baselines_arm_gpu.py
```

and a convenience launcher:

```bash
bash Code/run_all.sh
```
This section describes how to reproduce all experiments, figures, and numerical results from the paper. Experiments are divided into two benchmark families:
- Converse arm system
- NVDEx system (Nonholonomic Vehicle with Dynamic Extension)
Both pipelines rely only on components included in the repository and require no external data.
Make sure the required packages are installed (as described above).
Ensure the repository root is on the Python path:
```bash
export PYTHONPATH=$PWD
```

A GPU is recommended but not required for small-scale runs.
The arm benchmark experiments include:
- Optimal vs uncontrolled rollouts
- Optimality gap evaluations
- Grid-based heatmaps
- Difficulty-ladder trajectories
- Figures used in the paper
All arm benchmark configuration files are located in:
```
Code/yaml configs/arm link/
```
Use the main training script:
```bash
python Code/train_baselines_arm_gpu.py
```

This script reads the following configuration files:

- `yaml configs/arm link/algorithms.yaml` — RL hyperparameters
- `yaml configs/arm link/global.yaml` — global experimental settings
- `yaml configs/arm link/grid.yaml` — grid rollout specification
The script generates:
```
arm_results/
├── logs/    # raw training logs
├── models/  # trained policies
├── eval/    # rollout CSVs
└── figs/    # learning curves, heatmaps, ideal vs uncontrolled
```
(If directory names differ slightly on your local version, adjust accordingly.)
After training, generate all arm figures using:
```bash
bash Code/run_all.sh
```

This script performs:
- RL policy evaluation
- Optimality gap computation
- Heatmap generation
- All plots used in the paper
This includes:
- Controlled vs uncontrolled rollouts
- State heatmaps
- Optimality gap plots
- Difficulty ladder rollouts
Supplementary ARM figures (S1–S4) can be regenerated via:
```bash
python Code/dataset/code/converse_four_systems_with_disturbances.py
```

Figures are written to:

```
Code/dataset/figures/
```
NVDEx (NUDEx) experiments include:
- Optimal vs uncontrolled trajectories
- Difficulty ladder analysis
- Heatmaps and cost distributions
- Performance summaries across initial conditions
- All supplementary NVDEx figures
NUDEx-specific configuration files are located in:
```
Code/yaml configs/nvdex/
```
The core training script:
```bash
python "Code/nvdex runs/train_baselines_nudex.py"
```

(the path contains a space, so it must be quoted) reads:

- `yaml configs/nvdex/algorithms.yaml`
- `yaml configs/nvdex/global.yaml`
- `yaml configs/nvdex/difficulties.yaml`
- `yaml configs/nvdex/schedules.yaml`
and produces:
```
nvdex runs/
├── CSVs/
└── evaluation/
```
All runs can be launched with:

```bash
bash "Code/nvdex runs/run_all_nudex.sh"
```

After training is complete, run:

```bash
python "Code/nvdex runs/evaluation/analyze_nudex_runs.py"
```

and generate summary plots using:

```bash
python "Code/nvdex runs/evaluation/plotting code/plot_summary_nudex_suite.py"
```

This produces:

```
nvdex runs/CSVs/summary_nudex.csv
```
This directory contains:
- Difficulty ladder performance
- State and control trajectories
- Heatmaps
- All plots used in paper figures
Correctness can be checked via:
```bash
pytest Code/tests/test_sanity.py
```

This ensures:

- correct environment construction
- correct oracle (π*, V*) evaluations
- a consistent interface with RL algorithms
- All configurations are version-controlled via YAML files.
- All figures from the paper correspond to the outputs of the scripts above.
- No external assets are required.
The `Code/dataset/` directory provides:
- benchmark metadata & manifest
- four canonical systems (math + code)
- figures used in supplementary material
- scripts generating QG-specialized dynamics
This dataset corresponds to the figures provided in the paper.
Basic sanity tests validate that each benchmark:
- initializes without error
- exposes the `reset`, `step`, and `oracle` interfaces
- preserves shape and dtype guarantees
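In the same spirit, a sanity check might look like the sketch below (illustrative only; the real checks live in `Code/tests/test_sanity.py`, and `DummyEnv`/`DummyOracle` are hypothetical stand-ins for objects created via `registry.make`):

```python
import numpy as np

# Hypothetical environment/oracle pair standing in for a real benchmark.
class DummyEnv:
    def reset(self):
        self.s = np.ones(2)
        return self.s

    def step(self, a):
        self.s = 0.9 * self.s + np.array([0.0, float(a[0])])
        reward = -float(self.s @ self.s)
        return self.s, reward, False, {}

class DummyOracle:
    def action(self, s):
        return np.array([-0.5 * s[1]])

def check_benchmark(env, oracle, n_steps=10):
    """Roll out the oracle policy and check shapes and finiteness."""
    obs = env.reset()
    for _ in range(n_steps):
        a = oracle.action(obs)
        obs, reward, done, info = env.step(a)
        assert obs.shape == (2,) and np.all(np.isfinite(obs))
        assert np.isfinite(reward) and isinstance(info, dict)
    return True

assert check_benchmark(DummyEnv(), DummyOracle())
```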
Run tests via:
```bash
pytest Code/tests/
```
If you use this repository or benchmark suite in your research, please cite the paper:
```bibtex
@article{<to-be-added>,
  title={Benchmarking RL via Converse Optimality: Generating Systems with Known Optimal Policies},
  author={...},
  year={2025},
  journal={...},
}
```
A final BibTeX entry will be added upon publication.

