This repository contains the official implementation accompanying “Benchmarking RL via Converse Optimality: Generating Systems with Known Optimal Policies” (submitted for peer review).
The project provides a scientifically rigorous benchmarking framework for reinforcement learning, built around converse-optimality constructions that generate dynamical systems with:
- provably known optimal policies (π*)
- closed-form optimal value functions (V*)
- stochastic nonlinear control-affine dynamics
- tunable difficulty & conditioning
This enables ground-truth evaluation of RL algorithms in ways that classical environments cannot: true optimality gaps, exact regret, and precise sensitivity analysis.
Standard RL benchmarks do not provide access to the optimal value function or optimal feedback policy. As a result, evaluating or comparing algorithms is fundamentally indirect.
Our framework reverses the usual direction of optimal control:
Instead of searching for the optimal policy of a given system, we search for a system for which a given policy is optimal.
This “converse optimality” principle enables designing environments where:
- the optimal controller is analytically known
- the Bellman identity holds exactly
- nonlinearity and stochasticity are injected afterward, while preserving optimality
- difficulty is parameterized continuously
The result is a set of RL environments with certified ground truth, ideal for robust benchmarking.
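In the Quadratic–Gaussian (LQR) specialization, the converse direction can be sketched in a few lines: instead of solving a Riccati equation for the value function given a cost, we fix a quadratic value function V(x) = xᵀPx and back out the state cost for which it is optimal. The matrices below are illustrative, not taken from the paper:

```python
import numpy as np

# Converse-optimality sketch in the LQR specialization (illustrative
# matrices, not the paper's construction): choose P, then derive the
# state cost Q for which V(x) = x^T P x is the optimal value function.
A = np.array([[0.9, 0.1],
              [0.0, 0.9]])          # stable linear dynamics
B = np.array([[0.0],
              [1.0]])
R = np.eye(1)                       # input cost (fixed)

P = np.diag([2.0, 1.0])             # chosen quadratic value function

# Induced optimal gain: u* = -K x
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# State cost that makes P satisfy the discrete algebraic Riccati equation
# (must be checked positive semidefinite to define a valid problem).
Q = P - A.T @ P @ A + A.T @ P @ B @ K

# Verify the Bellman/Riccati identity holds exactly for (A, B, Q, R, P)
dare = A.T @ P @ A - A.T @ P @ B @ K + Q
assert np.allclose(dare, P)
assert np.min(np.linalg.eigvalsh(Q)) >= 0
```

Not every choice of P yields a positive semidefinite Q, which is why the sketch ends with an explicit check.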
```
Code/
├── benchmarks/
│   ├── envs/
│   │   ├── __init__.py
│   │   ├── lqr_env.py
│   │   └── nudex_env.py
│   ├── specs/
│   │   ├── LQR_doubleint_hard_v1.yaml
│   │   └── NUDEx_K1_easy_v1.yaml
│   ├── __init__.py
│   ├── oracles.py
│   └── registry.py
├── cards/
│   └── nudex_k1_easy_v1.md
├── dataset/
│   ├── code/
│   │   └── converse_four_systems_with_disturbances.py
│   ├── figures/
│   │   ├── fig_S1_COST.pdf
│   │   ├── fig_S1_COST.png
│   │   ├── fig_S2_COST.pdf
│   │   ├── fig_S2_COST.png
│   │   ├── fig_S3_COST.pdf
│   │   ├── fig_S3_COST.png
│   │   ├── fig_S4_COST.pdf
│   │   └── fig_S4_COST.png
│   ├── math/
│   │   └── converse_four_systems_math.pdf
│   ├── manifest.json
│   └── README.md
├── nvdex runs/
│   ├── CSVs/
│   │   └── summary_nudex.csv
│   ├── evaluation/
│   │   ├── plotting code/
│   │   │   └── plot_summary_nudex_suite.py
│   │   └── analyze_nudex_runs.py
│   ├── plotting code/
│   │   ├── plot_summary_nudex_suite.py
│   │   └── r
│   ├── nudex_env.py
│   ├── nudex_family.py
│   ├── resume_all_nudex_runs.sh
│   ├── run_all_nudex.sh
│   ├── run_nudex_k2_parallel_resume.sh
│   └── train_baselines_nudex.py
├── tests/
│   └── test_sanity.py
├── yaml configs/
│   ├── arm link/
│   │   ├── algorithms.yaml
│   │   ├── global.yaml
│   │   ├── grid.yaml
│   │   └── r
│   ├── nvdex/
│   │   ├── algorithms.yaml
│   │   ├── difficulties.yaml
│   │   ├── global.yaml
│   │   ├── r
│   │   └── schedules.yaml
│   └── r
├── converse_arm_env.py
├── family.py
├── ReproducibilityChecklist (1).pdf
├── requirements.txt
├── run_all.sh
└── train_baselines_arm_gpu.py
LICENSE
README.md
```
This repository contains several canonical families derived through converse-optimality constructions.
All systems are discrete-time, stochastic, control-affine, and equipped with:
- a closed-form quadratic value function
- an optimal linear state-feedback policy
- nonlinear drifts constructed via differential geometry while maintaining full optimality
NVDEx (NUDEx), the Nonholonomic Vehicle with Dynamic Extension, is a converse-optimal family built on a bicycle/unicycle geometry with input dynamic extension. The system is:
- locally open-loop unstable
- nonholonomic
- rotationally coupled through the swirl operator
- solvable in closed form in the Quadratic–Gaussian specialization
Optimal policy and optimal value are known analytically.
The converse arm benchmark is a serial n-link planar arm with revolute joints and torque inputs. This benchmark offers:
- a physically interpretable morphology
- highly coupled nonlinear dynamics
- closed-form optimal policies
- tunable difficulty via geometric parameters (swirl strength, conditioning, noise)
Each benchmark exposes:
- `oracle.action(s)` — exact optimal action
- `oracle.value(s)` — exact value function
- `oracle.q_value(s, a)` — exact Q-function

This allows measurement of:
- optimality gaps
- exact regret
- Bellman residuals
- finite-horizon deviation from optimality
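As a hedged illustration of how the oracle interface supports these measurements (the repository's actual oracle classes may differ), `QuadraticOracle` below is a hypothetical stand-in with the same three methods:

```python
import numpy as np

# Hypothetical stand-in with the same interface as the repository's
# oracles (action / value / q_value); matrices are illustrative.
class QuadraticOracle:
    def __init__(self, P, K):
        self.P, self.K = P, K

    def action(self, s):
        return -self.K @ s                       # exact optimal action

    def value(self, s):
        return float(s @ self.P @ s)             # exact value function

    def q_value(self, s, a):
        # Toy Q-function: optimal value plus a penalty for deviating
        # from the optimal action (costs are minimized, so Q >= V).
        a_star = self.action(s)
        return self.value(s) + float(np.sum((a - a_star) ** 2))

oracle = QuadraticOracle(P=np.eye(2), K=np.array([[0.5, 0.1]]))

s = np.array([1.0, -0.5])
a_learned = np.array([-0.3])                     # action from a learned policy

# Per-state optimality gap: zero iff the learned action is optimal.
gap = oracle.q_value(s, a_learned) - oracle.value(s)
```

Summing this per-state gap along a rollout gives exact regret; evaluating `q_value` on a learned value estimate gives Bellman residuals.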
Every benchmark family supports continuous difficulty control through:
- noise covariance
- drift conditioning
- swirl strength
- dynamic extension order
- cost geometry
Benchmarks use:
- deterministic system generation (via seeds)
- stochastic rollouts (Gaussian noise)
- stable YAML specifications (`benchmarks/specs/*.yaml`)
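A minimal sketch of what seed-deterministic generation means in practice (the real generator lives in the repository; `make_system` below is hypothetical):

```python
import numpy as np

# Hypothetical generator -- the real one is in the repository, but the
# contract is the same: identical seeds must yield identical systems.
def make_system(seed):
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(2, 2))    # drift parameters
    B = rng.normal(size=(2, 1))    # input matrix
    return A, B

A1, B1 = make_system(42)
A2, B2 = make_system(42)
assert np.array_equal(A1, A2) and np.array_equal(B1, B2)   # reproducible
```

Stochasticity then enters only through the rollout noise, not through the system definition.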
Benchmarks can be instantiated from a single name:

```python
from Code.benchmarks import registry

env, oracle = registry.make("NUDEx_K1_easy_v1")
```

To install:

```bash
git clone https://github.com/converseoptimality/RL-Benchmarking.git
cd RL-Benchmarking/Code
pip install -r requirements.txt
```

Minimal usage example:

```python
from Code.benchmarks import registry

env, oracle = registry.make("NUDEx_K1_easy_v1")
obs = env.reset()

# Evaluate the optimal policy
a_star = oracle.action(obs)
obs, reward, done, info = env.step(a_star)
```

The repository includes an example GPU training script:

```bash
python Code/train_baselines_arm_gpu.py
```

and a convenience launcher:

```bash
bash Code/run_all.sh
```
This section describes how to reproduce all experiments, figures, and numerical results from the paper. Experiments are divided into two benchmark families:
- Converse arm system
- NVDEx system (Nonholonomic Vehicle with Dynamic Extension)
Both pipelines rely only on components included in the repository and require no external data.
Make sure the required packages are installed (as described above).
Ensure the repository root is on the Python path:
```bash
export PYTHONPATH=$PWD
```

A GPU is recommended but not required for small-scale runs.
The arm benchmark experiments include:
- Optimal vs uncontrolled rollouts
- Optimality gap evaluations
- Grid-based heatmaps
- Difficulty-ladder trajectories
- Figures used in the paper
All arm benchmark configuration files are located in:
```
Code/yaml configs/arm link/
```
Use the main training script:
```bash
python Code/train_baselines_arm_gpu.py
```

This script reads the following configuration files:

- `yaml configs/arm link/algorithms.yaml` — RL hyperparameters
- `yaml configs/arm link/global.yaml` — global experimental settings
- `yaml configs/arm link/grid.yaml` — grid rollout specification
The script generates:
```
arm_results/
├── logs/    # raw training logs
├── models/  # trained policies
├── eval/    # rollout CSVs
└── figs/    # learning curves, heatmaps, ideal vs uncontrolled
```
(If directory names differ slightly on your local version, adjust accordingly.)
After training, generate all arm figures using:
```bash
bash Code/run_all.sh
```

This script performs:
- RL policy evaluation
- Optimality gap computation
- Heatmap generation
- All plots used in the paper
This includes:
- Controlled vs uncontrolled rollouts
- State heatmaps
- Optimality gap plots
- Difficulty ladder rollouts
Supplementary ARM figures (S1–S4) can be regenerated via:
```bash
python Code/dataset/code/converse_four_systems_with_disturbances.py
```

Figures are written to:

```
Code/dataset/figures/
```
NVDEx (NUDEx) experiments include:
- Optimal vs uncontrolled trajectories
- Difficulty ladder analysis
- Heatmaps and cost distributions
- Performance summaries across initial conditions
- All supplementary NVDEx figures
NUDEx-specific configuration files are located in:
```
Code/yaml configs/nvdex/
```
The core training script:
```bash
python "Code/nvdex runs/train_baselines_nudex.py"
```

(the path contains a space, so it must be quoted) reads:

- `yaml configs/nvdex/algorithms.yaml`
- `yaml configs/nvdex/global.yaml`
- `yaml configs/nvdex/difficulties.yaml`
- `yaml configs/nvdex/schedules.yaml`
and produces:
```
nvdex runs/
├── CSVs/
└── evaluation/
```
All runs can be launched with:

```bash
bash "Code/nvdex runs/run_all_nudex.sh"
```

After training is complete, run:

```bash
python "Code/nvdex runs/evaluation/analyze_nudex_runs.py"
```

and generate summary plots using:

```bash
python "Code/nvdex runs/evaluation/plotting code/plot_summary_nudex_suite.py"
```

This produces:

```
nvdex runs/CSVs/summary_nudex.csv
```
This directory contains:
- Difficulty ladder performance
- State and control trajectories
- Heatmaps
- All plots used in paper figures
Correctness can be checked via:
```bash
pytest Code/tests/test_sanity.py
```

This ensures:

- correct environment construction
- correct oracle (π*, V*) evaluations
- a consistent interface with RL algorithms
- All configurations are version-controlled via YAML files.
- All figures from the paper correspond to the outputs of the scripts above.
- No external assets are required.
The `Code/dataset/` directory provides:
- benchmark metadata & manifest
- four canonical systems (math + code)
- figures used in supplementary material
- scripts generating QG-specialized dynamics
This dataset corresponds to the figures provided in the paper.
Basic sanity tests validate that each benchmark:
- initializes without error
- exposes the `reset`, `step`, and `oracle` interfaces
- preserves shape and dtype guarantees
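In the same spirit, a sanity check might look like the sketch below (illustrative only; the real checks live in `Code/tests/test_sanity.py`, and `DummyEnv`/`DummyOracle` are hypothetical stand-ins for objects created via `registry.make`):

```python
import numpy as np

# Hypothetical environment/oracle pair standing in for a real benchmark.
class DummyEnv:
    def reset(self):
        self.s = np.ones(2)
        return self.s

    def step(self, a):
        self.s = 0.9 * self.s + np.array([0.0, float(a[0])])
        reward = -float(self.s @ self.s)
        return self.s, reward, False, {}

class DummyOracle:
    def action(self, s):
        return np.array([-0.5 * s[1]])

def check_benchmark(env, oracle, n_steps=10):
    """Roll out the oracle policy and check shapes and finiteness."""
    obs = env.reset()
    for _ in range(n_steps):
        a = oracle.action(obs)
        obs, reward, done, info = env.step(a)
        assert obs.shape == (2,) and np.all(np.isfinite(obs))
        assert np.isfinite(reward) and isinstance(info, dict)
    return True

assert check_benchmark(DummyEnv(), DummyOracle())
```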
Run tests via:
```bash
pytest Code/tests/
```
If you use this repository or benchmark suite in your research, please cite the paper:
```bibtex
@article{<to-be-added>,
  title={Benchmarking RL via Converse Optimality: Generating Systems with Known Optimal Policies},
  author={...},
  year={2025},
  journal={...},
}
```
A final BibTeX entry will be added upon publication.

