This repository contains the official codebase, pre-trained weights, and evaluation environments for the preprint: "Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO". We provide a minimal, standalone reproducible example (MRE) using standard MLPs on LunarLander-v2 to demonstrate the pathology of surrogate hacking and our proposed solution.
We identify and formalize two severe optimization pathologies in multi-timescale RL: Surrogate Objective Hacking (exploiting short-term shaping rewards at the expense of the true objective) and the Paradox of Temporal Uncertainty (irreversible myopic degeneration caused by gradient-free variance routing).
To overcome these fundamental vulnerabilities, we introduce Target Decoupling, a novel architectural and algorithmic intervention that disentangles representation learning from temporal routing, allowing the agent to align with the true long-term objective (γ = 0.999) without collapsing into short-term behavioral traps.
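To see why the timescale matters, consider a back-of-the-envelope calculation (illustrative only; the 200-step delay and the short-horizon γ = 0.9 are assumptions for illustration, not values from the paper): under γ = 0.999, a terminal landing bonus a couple of hundred steps away retains most of its value, while a short-horizon surrogate discounts it to almost nothing, which is precisely the incentive gap that surrogate hacking exploits.

```python
# Illustrative calculation only (not part of the repository): how the discount
# factor determines the present value of a delayed reward.
def discounted_value(reward: float, delay: int, gamma: float) -> float:
    """Present value of `reward` received `delay` steps in the future."""
    return reward * gamma ** delay

# LunarLander's +100 landing bonus, assuming a (hypothetical) 200-step delay.
long_horizon = discounted_value(100.0, 200, 0.999)  # gamma from the paper
short_horizon = discounted_value(100.0, 200, 0.9)   # hypothetical short-timescale gamma

print(f"gamma=0.999 values the landing at {long_horizon:.2f}")   # ~81.9
print(f"gamma=0.9   values the landing at {short_horizon:.2e}")  # effectively zero
```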
The core contribution of this work is isolating and systematically solving the pathologies of multi-timescale learning. The contrast between Stage 1 and Stage 4 is particularly striking: while the baseline is paralyzed by fear of crashing and greedily hoards small centering rewards, our decoupled agent acts with genuine foresight.
The repository mirrors our 4-stage ablation study. Each stage is fully standalone and uses only standard MLPs for clarity and ease of reproducibility.
.
├── 1_baseline.py                     # Stage 1: Standard PPO baseline
├── 2_surrogate_hacking_attention.py  # Stage 2: Introduction of multi-timescale collapse
├── 3_temporal_paradox_variance.py    # Stage 3: Attempted variance reduction
├── 4_target_decoupling_final.py     # Stage 4: Proposed Target Decoupling architecture
├── 5_evaluate_seeds_plot.py         # Multi-seed evaluation and plotting script
├── record_1_baseline.py             # Evaluation script for Stage 1
├── record_2_surrogate.py            # Evaluation script for Stage 2
├── record_3_paradox.py              # Evaluation script for Stage 3
├── record_4_decoupling.py           # Evaluation script for Stage 4
├── weights_stage_1.pth              # Pre-trained weights for Baseline
├── weights_stage_2.pth              # Pre-trained weights for Surrogate Hacking
├── weights_stage_3.pth              # Pre-trained weights for Temporal Paradox
├── weights_stage_4.pth              # Pre-trained weights for Target Decoupling
└── docs/                            # Assets (GIFs, plots)
    ├── baseline_hovering.gif
    ├── seed_comparison_plot.png
    ├── surrogate_hacking_crash.gif
    ├── temporal_paradox_wandering.gif
    └── target_decoupling_landing.gif
Evaluating the pre-trained models requires no extra setup.

- Install dependencies:

  pip install -r requirements.txt

- Evaluate the proposed solution (Stage 4) and watch the Target Decoupling agent solve the environment:

  python record_4_decoupling.py

- Observe the baseline pathology (Stage 1) for contrast: the baseline agent hovers frantically and wastes fuel:

  python record_1_baseline.py

- Run the full multi-seed comparison across 5 random seeds to reproduce the statistical significance plots:

  python 5_evaluate_seeds_plot.py

Note: Any of the standalone X_*.py scripts can be run directly to train the corresponding stage from scratch.
To validate our claims rigorously, we evaluate the Target Decoupling architecture against the Baseline over multiple random seeds (n=5). The Target Decoupling agent consistently solves the environment with minimal variance, eliminating the baseline's failure modes and escaping the hovering local optimum.
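As a rough sketch of the kind of aggregation `5_evaluate_seeds_plot.py` performs (the helper name and the return values below are illustrative placeholders, not actual results from the paper), per-seed episode returns can be reduced to a mean and standard deviation across seeds:

```python
import statistics

def summarize_across_seeds(returns_per_seed):
    """Reduce per-seed lists of episode returns to (mean, std) across seeds."""
    per_seed_means = [statistics.mean(r) for r in returns_per_seed]
    return statistics.mean(per_seed_means), statistics.stdev(per_seed_means)

# Placeholder returns for 5 seeds, 2 evaluation episodes each (illustration only).
fake_returns = [
    [210.0, 230.0],
    [250.0, 240.0],
    [220.0, 225.0],
    [235.0, 245.0],
    [215.0, 230.0],
]
mean, std = summarize_across_seeds(fake_returns)
print(f"mean return: {mean:.1f} +/- {std:.1f} over {len(fake_returns)} seeds")
```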
If you find this code or our insights useful in your research, please consider citing our work:
@misc{sunRepresentationRoutingOvercoming2026b,
  title = {Representation over {{Routing}}: {{Overcoming Surrogate Hacking}} in {{Multi-Timescale PPO}}},
  shorttitle = {Representation over {{Routing}}},
  author = {Sun, Jing},
  year = {2026},
  publisher = {arXiv},
  doi = {10.48550/ARXIV.2604.13517},
  urldate = {2026-04-16},
  copyright = {Creative Commons Attribution 4.0 International},
  keywords = {Artificial Intelligence (cs.AI), FOS: Computer and information sciences, Machine Learning (cs.LG)}
}