Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
Yueming Pan1,2‡, Ruoyu Feng3‡, Qi Dai2, Yuqi Wang3, Wenfeng Lin3,
Mingyu Guo3, Chong Luo2†, Nanning Zheng1†
¹IAIR, Xi’an Jiaotong University ²Microsoft Research Asia ³ByteDance
‡ Equal contribution † Corresponding author
- We propose Semantic-First Diffusion (SFD), a novel latent diffusion paradigm that performs asynchronous denoising on semantic and texture latents, allowing semantics to denoise earlier and subsequently guide texture generation.
- SFD achieves a state-of-the-art FID of 1.04 on ImageNet 256×256 generation.
- Exhibits 100× and 33.3× faster training convergence compared to DiT and LightningDiT, respectively.
Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure emerges slightly earlier than fine-grained texture. This suggests that the earlier-formed semantics can benefit texture generation by providing a semantic anchor. However, existing methods denoise semantic and texture latents synchronously, overlooking this natural ordering.
We propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD constructs composite latents by combining compact semantic representations from a pretrained visual encoder (via a Semantic VAE) with texture latents, and performs asynchronous denoising with separate noise schedules: semantics denoise earlier to guide texture refinement. During denoising, SFD operates in three phases: Stage I – Semantic initialization, where semantic latents denoise first; Stage II – Asynchronous generation, where semantics and textures denoise jointly but asynchronously, with semantics ahead of textures; Stage III – Texture completion, where only textures continue refining. After denoising, only the texture latent is decoded for the final image.
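To make the asynchrony concrete, one illustrative parameterization (the notation here is ours, not necessarily the exact schedule used in the code) lets a shared progress variable u run from 1 + Δt down to 0, with t = 1 denoting pure noise and t = 0 a clean latent, and sets t_sem = clip(u − Δt, 0, 1) for the semantic latent and t_tex = clip(u, 0, 1) for the texture latent. Under this view, u ∈ (1, 1 + Δt] corresponds to Stage I (textures remain pure noise while semantics begin denoising), u ∈ (Δt, 1] to Stage II (both denoise, with semantics leading by Δt), and u ∈ [0, Δt] to Stage III (semantics are finished and only textures keep refining).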
On ImageNet 256×256, SFD demonstrates both superior quality and remarkable convergence acceleration. SFD achieves state-of-the-art FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while exhibiting approximately 100× and 33.3× faster training convergence compared to DiT and LightningDiT, respectively. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling.
- [2025.12.05] Released inference code and pre-trained model weights of SFD on ImageNet 256×256.
- Inference code and model weights
- Training code of Semantic VAE and diffusion model (SFD)
Explicitly leading semantics ahead of textures with a moderate offset (Δt = 0.3) achieves an optimal balance between early semantic stabilization and texture collaboration, effectively harmonizing their joint modeling.
- On ImageNet 256×256, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL).
- 100× and 33.3× faster training convergence compared to DiT and LightningDiT, respectively.
conda create -n sfd python=3.10.12
conda activate sfd
pip install -r requirements.txt
pip install numpy==1.24.3 protobuf==3.20.0
## guided-diffusion evaluation environment
git clone https://github.com/openai/guided-diffusion.git
pip install tensorflow==2.8.0
sed -i 's/dtype=np\.bool)/dtype=np.bool_)/g' guided-diffusion/evaluations/evaluator.py  # otherwise you will encounter: "AttributeError: module 'numpy' has no attribute 'bool'"

# Prepare the decoder of SD-VAE
mkdir -p outputs/model_weights/va-vae-imagenet256-experimental-variants
wget https://huggingface.co/hustvl/va-vae-imagenet256-experimental-variants/resolve/main/ldm-imagenet256-f16d32-50ep.ckpt \
--no-check-certificate -O outputs/model_weights/va-vae-imagenet256-experimental-variants/ldm-imagenet256-f16d32-50ep.ckpt
# Prepare evaluation batches of ImageNet 256x256 from guided-diffusion
mkdir -p outputs/ADM_npz
wget https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz -O outputs/ADM_npz/VIRTUAL_imagenet256_labeled.npz
# Download files from huggingface
mkdir temp
mkdir -p outputs/dataset/imagenet1k-latents
mkdir -p outputs/train
# Prepare latent statistics
huggingface-cli download SFD-Project/SFD --include "imagenet1k-latents/*" --local-dir temp
mv temp/imagenet1k-latents/* outputs/dataset/imagenet1k-latents/
# Prepare the autoguidance model
huggingface-cli download SFD-Project/SFD --include "model_weights/sfd_autoguidance_b/*" --local-dir temp
mv temp/model_weights/sfd_autoguidance_b outputs/train/
# Prepare XL model (675M)
huggingface-cli download SFD-Project/SFD --include "model_weights/sfd_xl/*" --local-dir temp
mv temp/model_weights/sfd_xl outputs/train/
# Prepare XXL model (1.0B)
huggingface-cli download SFD-Project/SFD --include "model_weights/sfd_1p0/*" --local-dir temp
mv temp/model_weights/sfd_1p0 outputs/train/
rm -rf temp
# Alternatively, you can download the checkpoints directly from Hugging Face: https://huggingface.co/SFD-Project/SFD. Put the files under model_weights/ of SFD-Project/SFD into outputs/train.

Inference demo
PRECISION=bf16 bash run_fast_inference.sh $INFERENCE_CONFIG
# take the XL model (675M) as an example.
CFG_SCALE="1.5" \
AUTOGUIDANCE_MODEL_SIZE="b" \
AUTOGUIDANCE_CKPT_ITER="70" \
PRECISION=bf16 bash run_fast_inference.sh configs/sfd/lightningdit_xl/inference_4m_autoguidance_demo.yaml

Images will be saved into demo_images/demo_samples.png, e.g. the following one:
Inference 50K samples
To run inference without AutoGuidance, use the following command:
# w/o AutoGuidance
FID_NUM=50000 \
GPUS_PER_NODE=$GPU_NUM PRECISION=bf16 bash run_inference.sh \
$INFERENCE_CONFIG
# take the XL model (675M) as an example.
FID_NUM=50000 \
GPUS_PER_NODE=8 PRECISION=bf16 bash run_inference.sh \
configs/sfd/lightningdit_xl/inference_4m.yaml

More inference configs can be found in configs/sfd/lightningdit_xl and configs/sfd/lightningdit_1p0, corresponding to the XL (675M) and XXL (1.0B) models, respectively.
To run inference with AutoGuidance, use the following command:
# w/ AutoGuidance
CFG_SCALE="$GUIDANCE_SCALE" \
AUTOGUIDANCE_MODEL_SIZE="b" \
AUTOGUIDANCE_CKPT_ITER="$GUIDANCE_ITER" \
FID_NUM=50000 \
GPUS_PER_NODE=$GPU_NUM PRECISION=bf16 bash run_inference.sh \
$INFERENCE_CONFIG
# take the XL model (675M) as an example.
CFG_SCALE="1.5" \
AUTOGUIDANCE_MODEL_SIZE="b" \
AUTOGUIDANCE_CKPT_ITER="70" \
FID_NUM=50000 \
GPUS_PER_NODE=8 PRECISION=bf16 bash run_inference.sh \
configs/sfd/lightningdit_xl/inference_4m_autoguidance.yaml

More inference configs can be found in configs/sfd/lightningdit_xl and configs/sfd/lightningdit_1p0, corresponding to the XL (675M) and XXL (1.0B) models, respectively. When using AutoGuidance, the detailed parameters for each configuration are listed in the following table:
| Model | Epochs | Params | Degraded Model | Degraded Ckpt Iterations | Guidance Scale |
|---|---|---|---|---|---|
| LightningDiT-XL | 80 | 675M | LightningDiT-B | 70K | 1.6 |
| LightningDiT-XL | 800 | 675M | LightningDiT-B | 70K | 1.5 |
| LightningDiT-XXL | 80 | 1.0B | LightningDiT-B | 60K | 1.5 |
| LightningDiT-XXL | 800 | 1.0B | LightningDiT-B | 120K | 1.5 |
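As an example of mapping a table row onto the command above, the 800-epoch XXL (1.0B) entry would translate to roughly the following settings (following the XL example, where the 70K degraded checkpoint appears as AUTOGUIDANCE_CKPT_ITER="70"; pick the matching AutoGuidance config from configs/sfd/lightningdit_1p0):

# e.g., XXL (1.0B) model at 800 epochs with AutoGuidance
CFG_SCALE="1.5" \
AUTOGUIDANCE_MODEL_SIZE="b" \
AUTOGUIDANCE_CKPT_ITER="120" \
FID_NUM=50000 \
GPUS_PER_NODE=8 PRECISION=bf16 bash run_inference.sh \
$INFERENCE_CONFIG  # an AutoGuidance config from configs/sfd/lightningdit_1p0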
# get final scores via guided-diffusion's evaluation tools
bash run_eval_via_guided_diffusion.sh $OUTPUT_IMAGES_DIR
# e.g.,
bash run_eval_via_guided_diffusion.sh outputs/train/sfd_xl/lightningdit-xl-1-ckpt-4000000-dopri5-250-balanced

Note that our models were trained and evaluated on 16 NPUs (consistent with the results reported in our paper). When testing on 8 A100 GPUs, we observed minor performance variations. The detailed results are presented below:
Without AutoGuidance
| Model | Epochs | #Params | FID (NPU) | FID (GPU) |
|---|---|---|---|---|
| SFD-XL | 80 | 675M | 3.43 | 3.50 |
| SFD-XL | 800 | 675M | 2.54 | 2.66 |
| SFD-XXL | 80 | 1.0B | 2.84 | 2.92 |
| SFD-XXL | 800 | 1.0B | 2.38 | 2.36 |
With AutoGuidance
| Model | Epochs | #Params | FID (NPU) | FID (GPU) |
|---|---|---|---|---|
| SFD-XL | 80 | 675M | 1.30 | 1.29 |
| SFD-XL | 800 | 675M | 1.06 | 1.03 |
| SFD-XXL | 80 | 1.0B | 1.19 | 1.20 |
| SFD-XXL | 800 | 1.0B | 1.04 | 1.04 |
These slight discrepancies are likely due to numerical precision differences between hardware platforms, but the overall performance remains consistent.
Our code is based on the LightningDiT, REPA, and ADM repositories. We sincerely thank the authors for releasing their code.
If you find our work, this repository, or pretrained models useful, please consider giving a star ⭐ and citing:
@article{Pan2025SFD,
title={Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion},
author={Pan, Yueming and Feng, Ruoyu and Dai, Qi and Wang, Yuqi and Lin, Wenfeng and Guo, Mingyu and Luo, Chong and Zheng, Nanning},
journal={arXiv preprint arXiv:2512.04926},
year={2025}
}




