This project presents PAE (Prior-Aligned AutoEncoder), a tokenizer framework that explicitly shapes a diffusion-friendly latent manifold for latent diffusion models. Instead of relying solely on reconstruction fidelity or passively inheriting pretrained representations, PAE identifies and optimizes three key properties of a diffusion-friendly latent space — spatial structure coherence, local manifold continuity, and global manifold semantics — through targeted prior-alignment regularizations. On ImageNet 256×256, PAE achieves a new state-of-the-art gFID of 1.03 with up to 13× faster convergence than RAE under the same LightningDiT setup.
Prior alignment constructs a diffusion-friendly latent manifold. Left: Compared with reconstruction-oriented counterparts, the prior-aligned latent manifold is more structurally coherent, locally continuous, and semantically organized. Right: PAE yields faster convergence, better generation quality, and robust few-step sampling performance.
Class-conditional samples generated by PAE with LightningDiT-XL/1 on ImageNet 256×256.
After only 80 epochs, PAE reaches a gFID of 1.27, which already beats many baselines trained for 800 epochs, such as FAE (gFID 1.29) and AlignTok (gFID 1.37). With extended training, PAE sets a new state-of-the-art gFID of 1.03 on ImageNet 256×256.
- [2026.05.09] 🚀 🚀 🚀 We release PAE. Code and pretrained models are now available!
- [2026.05.10] 🛠️ Scale-PAE is currently in progress. Once it is complete, we will release the full dynamic-resolution PAE training code.
- 🎯 New Perspective: We study what makes a latent manifold diffusion-friendly, identifying three key properties: spatial structure coherence, local manifold continuity, and global manifold semantics.
- 🏗️ Explicit Manifold Shaping: PAE turns these properties into explicit training objectives via three prior-alignment regularizations (SSR, MCR, SCR), rather than leaving them to emerge indirectly.
- ⚡ 13× Faster Convergence: PAE reaches performance comparable to RAE in up to 13× fewer training epochs under the same LightningDiT setup.
- 🏆 State-of-the-Art: Achieves gFID 1.03 on ImageNet 256×256, the best result among all compared methods.
- 🔄 Encoder-Agnostic: Compatible with multiple VFM backbones including DINOv2, SigLIP2, DINOv3, and MAE.
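
As a rough illustration of "explicit training objectives", the three regularizations can be thought of as extra terms added to the tokenizer's reconstruction loss. The sketch below is only an assumption about how such terms might be combined: the weighted-sum form, the name `PriorAlignmentWeights`, the function `total_loss`, and all numeric values are hypothetical and are not taken from the PAE code or paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriorAlignmentWeights:
    """Hypothetical loss weights; the defaults are placeholders, not PAE's values."""
    ssr: float = 1.0
    mcr: float = 1.0
    scr: float = 1.0


def total_loss(recon: float, ssr: float, mcr: float, scr: float,
               w: PriorAlignmentWeights = PriorAlignmentWeights()) -> float:
    """Weighted sum of a reconstruction term and the three regularization terms."""
    return recon + w.ssr * ssr + w.mcr * mcr + w.scr * scr


# Example with made-up per-batch loss values.
loss = total_loss(recon=0.40, ssr=0.10, mcr=0.05, scr=0.20,
                  w=PriorAlignmentWeights(ssr=0.5, mcr=1.0, scr=0.25))
print(round(loss, 4))
```

Keeping each term separate makes it easy to ablate one regularization at a time by setting its weight to zero.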
Overview of the PAE framework. A frozen VFM provides stable representation features. DAM injects pixel detail while preserving the VFM as the dominant semantic source. Three prior-alignment objectives explicitly shape the latent manifold.
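
The overview does not spell out how DAM combines pixel detail with the frozen VFM features. One common pattern that matches the description ("inject detail while the VFM stays dominant") is a gated residual, sketched below in plain Python. Everything here is an assumption made for illustration: `random_tokens`, `inject_detail`, the `gate` parameter, and the random placeholder features are hypothetical and do not reflect the actual DAM implementation.

```python
import math
import random


def random_tokens(num_tokens, dim, seed):
    """Placeholder token features: num_tokens vectors of length dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


def inject_detail(vfm_tokens, detail_tokens, gate=0.1):
    """Hypothetical gated residual: latent = vfm + gate * detail.

    A small gate limits how far the detail branch can move the latent
    away from the frozen VFM features.
    """
    if not 0.0 <= gate < 1.0:
        raise ValueError("gate must be in [0, 1)")
    return [[v + gate * d for v, d in zip(v_row, d_row)]
            for v_row, d_row in zip(vfm_tokens, detail_tokens)]


def l2_norm(tokens):
    """Euclidean norm over all entries of a token matrix."""
    return math.sqrt(sum(x * x for row in tokens for x in row))


vfm = random_tokens(num_tokens=4, dim=8, seed=0)     # stands in for frozen VFM output
detail = random_tokens(num_tokens=4, dim=8, seed=1)  # stands in for a pixel-detail branch
latent = inject_detail(vfm, detail, gate=0.1)

# The shift away from the VFM features is the detail scaled by the gate.
shift = [[l - v for l, v in zip(l_row, v_row)] for l_row, v_row in zip(latent, vfm)]
print(l2_norm(shift) / l2_norm(detail))
```

In a real model the gate would more likely be a learned parameter or a small network; a fixed scalar is used here only to keep the sketch self-contained.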
Our work builds on the foundations laid by many excellent projects in the field. We would like to thank the authors of LightningDiT, RAE, GAE, and ADM for their contributions to the community.
If you find our work helpful, please consider citing it as follows.
@misc{yue2026mattersdiffusionfriendlylatentmanifold,
  title={What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion},
  author={Zhengrong Yue and Taihang Hu and Mengting Chen and Haiyu Zhang and Zihao Pan and Tao Liu and Zikang Wang and Jinsong Lan and Xiaoyong Zhu and Bo Zheng and Yali Wang},
  year={2026},
  eprint={2605.07915},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.07915},
}