EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model
This repository provides the official PyTorch/GPU implementation of the paper EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model.
- 2026.4.13: Released the training & inference code.
- 2026.4.13: Released the EchoGen-2B checkpoint.
We propose EchoGen, the first feed-forward subject-driven image synthesis framework built on Visual Auto-Regressive (VAR) models. EchoGen generates faithful renditions of a given subject in arbitrary text-described scenes without any test-time optimization. Unlike prior subject-driven approaches, it leverages the efficiency and hierarchical generation capability of visual autoregressive modeling.
Our method introduces:
- A feed-forward visual autoregressive paradigm for subject-driven generation, bringing subject control into visual autoregressive modeling and enabling zero-shot synthesis of subjects with significantly lower inference latency.
- A dual-path subject injection mechanism, which disentangles subject identity into high-level semantic features and fine-grained visual details, and injects them through decoupled cross-attention or multi-modal attention, respectively, to achieve more faithful and robust subject preservation across diverse scenes.
Evaluated on DreamBench, EchoGen achieves subject fidelity, text alignment, and image quality comparable to or better than leading diffusion-based methods, while offering substantially faster sampling.
This repo contains:
- 🪐 A clean PyTorch implementation of our EchoGen model.
- 🛸 Training scripts for EchoGen, built with PyTorch FSDP.
- 🦄 An inference script for EchoGen to generate high-fidelity subject-driven images with the EchoGen-2B model.
As in Infinity, the training dataset should follow the structure below. Specifically, the dataset consists of a set of JSONL files named `[h_div_w_template]_[num_examples].jsonl`, where:
- `h_div_w_template` denotes the target aspect-ratio template, defined as the height-to-width ratio of the images.
- `num_examples` denotes the number of training examples whose actual `h / w` values are close to this template ratio.
In other words, each filename should explicitly indicate both the aspect-ratio template and the number of examples associated with it.
```
/path/to/dataset/:
    [h_div_w_template1]_[num_examples].jsonl
    [h_div_w_template2]_[num_examples].jsonl
    [h_div_w_template3]_[num_examples].jsonl
```
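As a hedged sketch of this naming convention (the template list and helper names below are illustrative, not the official ones used by the repo):

```python
# Illustrative sketch of the filename convention above.
# TEMPLATES is a made-up example list, not the official set of ratio templates.
TEMPLATES = [0.5, 0.75, 1.0, 1.333, 2.0]

def nearest_template(h, w, templates=TEMPLATES):
    """Pick the aspect-ratio template closest to this image's h / w."""
    ratio = h / w
    return min(templates, key=lambda t: abs(t - ratio))

def jsonl_name(template, num_examples):
    """Build the [h_div_w_template]_[num_examples].jsonl filename."""
    return f"{template}_{num_examples}.jsonl"
```

For example, a 768x1024 image (`h / w = 0.75`) would be grouped under `0.75_[num_examples].jsonl`.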
Each `[h_div_w_template]_[num_examples].jsonl` file contains one JSON object per line, and each object should include the following fields:

```json
{
  "output_image": "path/to/the ground-truth image, required",
  "cond_image": "path/to/the conditioning image, required",
  "h_div_w": "float value representing the image aspect ratio, required",
  "long_caption": "long-form text caption of the image, required"
}
```

We also provide a toy dataset as a minimal example in this directory, and you can prepare your own dataset by following the same format.
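A minimal sketch for writing and checking one such line (the field names come from the schema above; the validator helper and example paths are ours, not part of the repo):

```python
import json

# Required fields from the schema above.
REQUIRED_FIELDS = {"output_image", "cond_image", "h_div_w", "long_caption"}

def validate_record(line):
    """Parse one JSONL line and check that all required fields are present."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return record

# One example line (paths and caption are placeholders).
example_line = json.dumps({
    "output_image": "images/0001.png",
    "cond_image": "cond/0001.png",
    "h_div_w": 0.75,
    "long_caption": "A corgi sitting on a wooden bench in an autumn park.",
})
record = validate_record(example_line)
```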
To keep this repository focused on the core contributions of EchoGen, we do not include the subject-segmentation preprocessing pipeline in this release. You may prepare the conditioning image with any off-the-shelf vision-language model (e.g., Qwen 3.5) and a subject-segmentation tool (e.g., GroundingDINO), following the discussion in the paper.
Clone the repository:
```shell
git clone https://github.com/drx-code/EchoGen.git
cd EchoGen
```

A suitable conda environment can be created and activated as follows:
```shell
conda create -n echogen python=3.10
conda activate echogen
pip install -r requirements.txt
pip install flash_attn --no-build-isolation --no-cache-dir
```

To set up the essential core components in EchoGen, run the provided build script. This will automatically download the following components:
- DINO-v2: Semantic image encoder.
- FLUX.1-dev VAE: Detailed content image encoder.
- GroundingDINO: Pre-trained segmentation model for subject segmentation.
- Infinity: Core pre-trained weights for the generation backbone.
```shell
sh scripts/build.sh
```

Then, download the pre-trained EchoGen-2B model:
```shell
sh scripts/download.sh
```

For convenience, the pre-trained EchoGen-2B model can also be downloaded directly here:
Training the EchoGen-2B model can be started by running the script:
```shell
# train EchoGen-2B model
sh scripts/train.sh
```

To train EchoGen with different model sizes {125M, 2B} and different resolutions {256, 1024}, you can run the following commands:
```shell
# 125M, layer12, pixel number = 256 x 256 = 0.06M pixels
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --model=layer12c4 --pn 0.06M --exp_name=infinity_125M_pn_0.06M \
  --content_feats_len=1024 --content_feats_dim=16 --semantic_feats_len=257 --semantic_feats_dim=768 \
  --vae_type 16 --vae_ckpt=pretrained/infinity/infinity_vae_d16.pth --rush_resume=pretrained/infinity/infinity_125M_256x256.pth

# 2B, layer32, pixel number = 256 x 256 = 0.06M pixels
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --model=2bc8 --pn 0.06M --exp_name=infinity_2B_pn_0.06M \
  --content_feats_len=1024 --content_feats_dim=16 --semantic_feats_len=257 --semantic_feats_dim=768 \
  --vae_type 32 --vae_ckpt=pretrained/infinity/infinity_vae_d32reg.pth --rush_resume=pretrained/infinity/infinity_2b_reg.pth

# 2B, layer32, pixel number = 1024 x 1024 = 1M pixels
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --model=2bc8 --pn 1M --exp_name=infinity_2B_pn_1M \
  --content_feats_len=4096 --content_feats_dim=16 --semantic_feats_len=1025 --semantic_feats_dim=768 \
  --vae_type 32 --vae_ckpt=pretrained/infinity/infinity_vae_d32reg.pth --rush_resume=pretrained/infinity/infinity_2b_reg.pth
```

As in Infinity, a folder named `local_output` will be created to save checkpoints and logs. You can monitor the training process by checking `local_output/log.txt` and `local_output/stdout.txt`, or use wandb for more detailed logging.
If your experiment is interrupted, simply re-run the command. Training will automatically resume from the latest checkpoint matching `local_output/ckpt*.pth`.

You can also fine-tune the model from our EchoGen-2B pretrained checkpoint by setting `--rush_resume=[echogen_2b.pth]` in `train.sh`. After fine-tuning, you will obtain a checkpoint like `[local_output]/ar-ckpt-giter(xxx)K-ep(xxx)-iter(xxx)-last.pth`.
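The auto-resume behavior can also be mimicked by hand, e.g. to pick a checkpoint for fine-tuning. A hedged sketch (the helper name is ours, not part of the repo):

```python
import glob
import os

def latest_checkpoint(run_dir="local_output"):
    """Return the most recently modified ckpt*.pth in run_dir, or None."""
    ckpts = glob.glob(os.path.join(run_dir, "ckpt*.pth"))
    return max(ckpts, key=os.path.getmtime) if ckpts else None
```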
We provide `scripts/infer.sh` for inference.

```shell
sh scripts/infer.sh
```

For inference with your own control image and text instruction, you can set `--control_img` and `--prompt` accordingly.
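For example (a hypothetical invocation: the control-image path and prompt below are placeholders, and `scripts/infer.sh` may expect its arguments in a different form):

```shell
# Placeholder image path and prompt; adjust to your own subject and scene.
sh scripts/infer.sh \
  --control_img assets/example_subject.png \
  --prompt "the subject standing on a snowy street at dusk"
```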
To perform inference with our EchoGen-2B model at 1024px resolution, please use the following arguments:
```shell
pn=1M
model_type=echogen_2b
vae_type=32
vae_path=pretrained/infinity/infinity_vae_d32reg.pth
content_feats_len=4096
semantic_feats_len=1025
```

If you want to perform inference with the EchoGen-125M model at 256px resolution, use:
```shell
pn=0.06M
model_type=echogen_layer12
vae_type=16
vae_path=pretrained/infinity/infinity_vae_d16.pth
content_feats_len=1024
semantic_feats_len=257
```

The code in this repository is mainly based on Infinity.
This repository is licensed under the MIT License. See the LICENSE file for details.
If you have any questions, please contact us by email: dongruixiaoyx@mail.ustc.edu.cn.
If our work contributes to your research, please don't hesitate to give us a star ⭐ and cite us as follows:
```bibtex
@inproceedings{
  dong2026echogen,
  title={EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model},
  author={Ruixiao Dong and Zhendong Wang and Keli Liu and Li Li and Ying Chen and Kai Li and Daowen Li and Houqiang Li},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=ctmyCjo18u}
}
```