
Self-E: Self-Evaluation Unlocks Any-Step Text-to-Image Generation

arXiv · CVPR 2026 · Blog · License: Apache 2.0

Xin Yu1,2, Xiaojuan Qi1*†, Zhengqi Li2, Kai Zhang2, Richard Zhang2, Zhe Lin2, Eli Shechtman2, Tianyu Wang2†, Yotam Nitzan2†

1 The University of Hong Kong    2 Adobe Research

\* Corresponding author.    † Project lead.

We introduce the Self-Evaluating Model (Self-E), a framework for any-step text-to-image generation that is trained from scratch. Self-E unlocks this capability without distilling from a pre-trained teacher model. Instead, it learns the structure of the local data distribution in a manner similar to conditional flow matching (learning from data), while simultaneously evaluating its own few-step generated samples using its own score estimates (self-evaluation).
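To make the "learning from data" term concrete, here is a minimal sketch of a standard conditional flow matching loss. This is illustrative only: the function and argument names are assumptions, not the repo's actual API, and the self-evaluation term (Eq. 13) is omitted.

```python
import torch

def cfm_loss(model, x1, cond):
    """Conditional flow matching loss on a batch of data samples `x1`.

    Illustrative sketch, not the repo's implementation: assumes a linear
    interpolation path from noise `x0` to data `x1` and a velocity-predicting
    `model(x_t, t, cond)`.
    """
    b = x1.shape[0]
    t = torch.rand(b, device=x1.device).view(b, 1, 1, 1)  # random time in [0, 1]
    x0 = torch.randn_like(x1)                             # noise endpoint
    xt = (1 - t) * x0 + t * x1                            # point on the linear path
    v_target = x1 - x0                                    # target velocity along the path
    v_pred = model(xt, t.view(b), cond)                   # model's predicted velocity
    return ((v_pred - v_target) ** 2).mean()
```

Self-E combines a data term of this flavor with a self-evaluation mechanism that scores the model's own few-step samples; see the paper for the full objective.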

Method

Highlights

  • From-scratch training — no pretrained teacher, no distillation pipeline
  • Flexible-step inference across practical low-step and mid-step sampling budgets
  • Self-evaluation via classifier score + auxiliary terms (Eq. 13)

The original experiments were conducted at Adobe Research. Due to licensing constraints, pretrained weights cannot be publicly released. This repository is a clean code re-implementation by the first author, provided for the research community to reproduce the method on their own data and compute.

Installation

# Requirements: Python 3.10, CUDA 12, Linux, NVIDIA GPUs
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --no-dev

Pretrained Models

Training requires three pretrained model families:

| Model | HuggingFace ID | Role |
| --- | --- | --- |
| T5-XXL | google/t5-v1_1-xxl | Text encoder |
| CLIP ViT-L/14 | openai/clip-vit-large-patch14 | Text encoder |
| FLUX.1-dev | black-forest-labs/FLUX.1-dev | VAE (ae.safetensors) |

Note: FLUX.1-dev is a gated model — you need to accept its license on HuggingFace and obtain an access token before downloading.
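If you prefer to fetch the gated VAE weights manually rather than through the caching script, the standard `huggingface_hub` call looks roughly like the sketch below. The helper name `fetch_flux_vae` is hypothetical; only the `hf_hub_download` call and the repo/file names come from the table above.

```python
from huggingface_hub import hf_hub_download

def fetch_flux_vae(cache_dir: str, token: str) -> str:
    """Download the FLUX.1-dev VAE weights into `cache_dir`.

    FLUX.1-dev is gated: you must accept its license on HuggingFace and
    pass a valid access token, or this call will fail with a 401/403.
    """
    return hf_hub_download(
        repo_id="black-forest-labs/FLUX.1-dev",
        filename="ae.safetensors",
        cache_dir=cache_dir,
        token=token,
    )

# Example (requires a token with FLUX.1-dev access):
# path = fetch_flux_vae("/path/to/model_cache", os.environ["HF_TOKEN"])
```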

Use the provided caching script to download everything in one go:

bash scripts/cache_pretrained_only.sh \
  --cache-dir /path/to/model_cache \
  --hf-token <your_hf_token> \
  --flux-vae-only

Then set the cache directory before training/inference:

export SELFE_CACHE_DIR=/path/to/model_cache

Dataset Format

Training data uses a tab-separated manifest:

path/to/image_0001.jpg	A photo of a red bird on a branch
path/to/image_0002.png	A watercolor painting of a mountain village
  • Column 1: image path
  • Column 2: prompt text
  • Relative image paths are resolved against data.params.base_dir
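A manifest in this format can be parsed in a few lines. The sketch below is illustrative, not the repo's actual data loader; it mirrors the documented behavior of resolving relative paths against `data.params.base_dir`.

```python
import csv
import os

def read_manifest(manifest_path, base_dir=""):
    """Read a tab-separated (image_path, prompt) manifest into a list of pairs.

    Illustrative sketch; the repo's loader may differ. Relative image paths
    are joined with `base_dir`, matching data.params.base_dir in the config.
    """
    pairs = []
    with open(manifest_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 2:
                continue  # skip malformed lines
            path, prompt = row[0], row[1]
            if not os.path.isabs(path):
                path = os.path.join(base_dir, path)
            pairs.append((path, prompt))
    return pairs
```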

Quick Smoke Test

We provide a small bundled dataset in debug_data/t2i_smoke/ to verify the full pipeline before downloading pretrained models:

SELFE_SKIP_PRETRAINED=1 ./run.sh configs/debug/selfe_smoke.yaml 1

SELFE_SKIP_PRETRAINED=1 replaces all pretrained weights (T5, CLIP, VAE) with random initialization — no SELFE_CACHE_DIR needed. Outputs are garbage, but the full pipeline (data loading → forward → loss → backward → checkpointing) runs correctly.

Training

  1. Edit configs/train/selfe.yaml:
    • data.params.train_manifest
    • data.params.base_dir
    • optionally data.params.val_manifest
  2. Launch training:
./run.sh configs/train/selfe.yaml 8

The public release exposes one full training config and one smoke-test config:

  • configs/train/selfe.yaml: full SelfE training
  • configs/debug/selfe_smoke.yaml: bundled smoke-test config

Inference

# Single GPU
./infer.sh configs/infer/selfe.yaml

# Multi-GPU (e.g., 4 GPUs)
./infer.sh configs/infer/selfe.yaml 4

Edit configs/infer/selfe.yaml to set:

  • inferencer.ckpt_dir
  • inferencer.output_dir
  • prompt file and sampling settings under inferencer.defaults

Config Layout

configs/
  base/selfe_system.yaml
  train/selfe.yaml
  infer/selfe.yaml
  debug/selfe_smoke.yaml
  • base/: shared system definition
  • train/: dataset-backed training entrypoint
  • infer/: checkpoint-backed inference entrypoint
  • debug/: smoke test config with bundled data

Citation

If you find this work useful, please cite:

@article{yu2025selfe,
  title={Self-Evaluation Unlocks Any-Step Text-to-Image Generation},
  author={Yu, Xin and Qi, Xiaojuan and Li, Zhengqi and Zhang, Kai and Zhang, Richard and Lin, Zhe and Shechtman, Eli and Wang, Tianyu and Nitzan, Yotam},
  journal={arXiv preprint arXiv:2512.22374},
  year={2025}
}

Acknowledgments

This codebase is built on top of minFM by Kai Zhang et al.

License

This project is licensed under the Apache 2.0 License — see LICENSE for details.
