Xin Yu1,2, Xiaojuan Qi1*†, Zhengqi Li2, Kai Zhang2, Richard Zhang2, Zhe Lin2, Eli Shechtman2, Tianyu Wang2†, Yotam Nitzan2†
1 The University of Hong Kong 2 Adobe Research
* Corresponding author. † Project lead.
We introduce the Self-Evaluating Model (Self-E), a framework for any-step text-to-image generation trained from scratch. Self-E unlocks this capability without requiring distillation from a pre-trained teacher model. Instead, Self-E learns the structure of the local data distribution in a manner similar to conditional flow matching (learning from data), while simultaneously employing a mechanism that evaluates its own few-step generated samples using its own score estimates (self-evaluation).
- From-scratch training — no pretrained teacher, no distillation pipeline
- Flexible-step inference across practical low-step and mid-step sampling budgets
- Self-evaluation via classifier score + auxiliary terms (Eq. 13)
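For orientation, the data-side objective resembles standard conditional flow matching. The sketch below shows only that standard part; the model interface, the linear interpolation schedule, and the placement of the self-evaluation term are illustrative assumptions, not the repository's actual code:

```python
import torch

def cfm_loss(model, x1, cond):
    """Conditional flow matching sketch: regress the constant velocity
    (x1 - x0) along the straight-line path x_t = (1 - t) * x0 + t * x1."""
    x0 = torch.randn_like(x1)                                # noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)),
                   device=x1.device)                         # per-sample time
    xt = (1 - t) * x0 + t * x1                               # linear interpolant
    v_target = x1 - x0                                       # target velocity
    v_pred = model(xt, t, cond)
    loss = ((v_pred - v_target) ** 2).mean()
    # Self-E additionally evaluates its own few-step samples with its own
    # score estimates (the self-evaluation term, Eq. 13) -- omitted here.
    return loss
```
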
The original experiments were conducted at Adobe Research. Due to licensing constraints, pretrained weights cannot be publicly released. This repository is a clean code re-implementation by the first author, provided for the research community to reproduce the method on their own data and compute.
```bash
# Requirements: Python 3.10, CUDA 12, Linux, NVIDIA GPUs
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --no-dev
```

Training requires three pretrained model families:
| Model | HuggingFace ID | Role |
|---|---|---|
| T5-XXL | `google/t5-v1_1-xxl` | Text encoder |
| CLIP ViT-L/14 | `openai/clip-vit-large-patch14` | Text encoder |
| FLUX.1-dev | `black-forest-labs/FLUX.1-dev` | VAE (`ae.safetensors`) |
Note: FLUX.1-dev is a gated model — you need to accept its license on HuggingFace and obtain an access token before downloading.
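If you prefer to fetch the gated VAE weights without the caching script, a minimal sketch using the `huggingface_hub` client library may help (the helper name and the cache path are placeholders; `hf_hub_download` is the library's standard download function):

```python
from huggingface_hub import hf_hub_download

def fetch_flux_vae(token: str, cache_dir: str) -> str:
    """Download only the FLUX.1-dev autoencoder weights (ae.safetensors).
    Requires having accepted the model license on HuggingFace and a valid
    access token; returns the local file path."""
    return hf_hub_download(
        repo_id="black-forest-labs/FLUX.1-dev",
        filename="ae.safetensors",
        token=token,
        local_dir=cache_dir,
    )
```

Usage: `fetch_flux_vae("<your_hf_token>", "/path/to/model_cache")`.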
Use the provided caching script to download everything in one go:
```bash
bash scripts/cache_pretrained_only.sh \
  --cache-dir /path/to/model_cache \
  --hf-token <your_hf_token> \
  --flux-vae-only
```

Then set the cache directory before training/inference:

```bash
export SELFE_CACHE_DIR=/path/to/model_cache
```

Training data uses a tab-separated manifest:
```
path/to/image_0001.jpg	A photo of a red bird on a branch
path/to/image_0002.png	A watercolor painting of a mountain village
```
- Column 1: image path
- Column 2: prompt text
- Relative image paths are resolved against `data.params.base_dir`
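A small stdlib-only validator for this manifest format can be sketched as follows (the function name and paths are illustrative; the repository's own data loader may differ):

```python
import os

def load_manifest(manifest_path: str, base_dir: str):
    """Parse a tab-separated manifest of (image path, prompt) rows,
    resolving relative image paths against base_dir."""
    samples = []
    with open(manifest_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.rstrip("\n")
            if not line:
                continue                       # skip blank lines
            parts = line.split("\t", 1)        # split on the first tab only
            if len(parts) != 2:
                raise ValueError(f"line {lineno}: expected 2 tab-separated columns")
            path, prompt = parts
            if not os.path.isabs(path):
                path = os.path.join(base_dir, path)
            samples.append((path, prompt))
    return samples
```
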
We provide a small bundled dataset in debug_data/t2i_smoke/ to verify the full pipeline before downloading pretrained models:
```bash
SELFE_SKIP_PRETRAINED=1 ./run.sh configs/debug/selfe_smoke.yaml 1
```

`SELFE_SKIP_PRETRAINED=1` replaces all pretrained weights (T5, CLIP, VAE) with random initialization, so no `SELFE_CACHE_DIR` is needed. Outputs are garbage, but the full pipeline (data loading → forward → loss → backward → checkpointing) runs correctly.
- Edit `configs/train/selfe.yaml`:
  - `data.params.train_manifest`
  - `data.params.base_dir`
  - optionally `data.params.val_manifest`
- Launch training:

```bash
./run.sh configs/train/selfe.yaml 8
```

The public release exposes a single training config, plus the debug config:

- `configs/train/selfe.yaml`: full SelfE training
- `configs/debug/selfe_smoke.yaml`: bundled smoke-test config
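The data section of the training config might look roughly like the following; the field nesting is inferred from the dotted paths above, and all values are placeholders:

```yaml
data:
  params:
    train_manifest: /data/t2i/train.tsv
    base_dir: /data/t2i/images
    val_manifest: /data/t2i/val.tsv   # optional
```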
```bash
# Single GPU
./infer.sh configs/infer/selfe.yaml

# Multi-GPU (e.g., 4 GPUs)
./infer.sh configs/infer/selfe.yaml 4
```

Edit `configs/infer/selfe.yaml` to set:

- `inferencer.ckpt_dir`
- `inferencer.output_dir`
- prompt file and sampling settings under `inferencer.defaults`
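Any-step inference in a flow-matching model amounts to integrating the learned velocity field with a chosen number of steps. A minimal Euler-sampler sketch (the model interface is a placeholder, not the repository's inferencer API):

```python
import torch

@torch.no_grad()
def sample(model, cond, shape, num_steps=4, device="cpu"):
    """Flexible-step Euler sampler for a flow-matching model:
    integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (data)."""
    x = torch.randn(shape, device=device)                 # start from noise
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t = ts[i].expand(shape[0])                        # per-sample time
        x = x + (ts[i + 1] - ts[i]) * model(x, t, cond)   # Euler update
    return x
```

The sampling budget is just `num_steps`; the same trained model is queried whether you take 1 step or 50.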
```
configs/
  base/selfe_system.yaml
  train/selfe.yaml
  infer/selfe.yaml
  debug/selfe_smoke.yaml
```

- `base/`: shared system definition
- `train/`: dataset-backed training entrypoint
- `infer/`: checkpoint-backed inference entrypoint
- `debug/`: smoke-test config with bundled data
If you find this work useful, please cite:
```bibtex
@article{yu2025selfe,
  title={Self-Evaluation Unlocks Any-Step Text-to-Image Generation},
  author={Yu, Xin and Qi, Xiaojuan and Li, Zhengqi and Zhang, Kai and Zhang, Richard and Lin, Zhe and Shechtman, Eli and Wang, Tianyu and Nitzan, Yotam},
  journal={arXiv preprint arXiv:2512.22374},
  year={2025}
}
```

This codebase is built on top of minFM by Kai Zhang et al.
This project is licensed under the Apache 2.0 License — see LICENSE for details.

