
Self-E: Self-Evaluation Unlocks Any-Step Text-to-Image Generation

arXiv · CVPR 2026 · Blog · License: Apache 2.0

Xin Yu1,2, Xiaojuan Qi1*†, Zhengqi Li2, Kai Zhang2, Richard Zhang2, Zhe Lin2, Eli Shechtman2, Tianyu Wang2†, Yotam Nitzan2†

1 The University of Hong Kong    2 Adobe Research

\* Corresponding author.    † Project lead.

We introduce the Self-Evaluating Model (Self-E), a framework for any-step text-to-image generation that is trained from scratch. Self-E unlocks this capability without distilling from a pre-trained teacher model. Instead, it learns the structure of the local data distribution in a manner similar to conditional flow matching (learning from data), while simultaneously evaluating its own few-step generated samples using its own score estimates (self-evaluation).
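To make the "learning from data" term concrete, here is a minimal sketch of a standard conditional flow matching loss. This is illustrative only: the function and argument names are assumptions, not the repo's actual API, and the self-evaluation term (Eq. 13) is omitted.

```python
import torch

def cfm_loss(model, x1, cond):
    """Conditional flow matching loss on a batch of data samples `x1`.

    Illustrative sketch, not the repo's implementation: assumes a linear
    interpolation path from noise `x0` to data `x1` and a velocity-predicting
    `model(x_t, t, cond)`.
    """
    b = x1.shape[0]
    t = torch.rand(b, device=x1.device).view(b, 1, 1, 1)  # random time in [0, 1]
    x0 = torch.randn_like(x1)                             # noise endpoint
    xt = (1 - t) * x0 + t * x1                            # point on the linear path
    v_target = x1 - x0                                    # target velocity along the path
    v_pred = model(xt, t.view(b), cond)                   # model's predicted velocity
    return ((v_pred - v_target) ** 2).mean()
```

Self-E combines a data term of this flavor with a self-evaluation mechanism that scores the model's own few-step samples; see the paper for the full objective.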

Method

Highlights

  • From-scratch training — no pretrained teacher, no distillation pipeline
  • Flexible-step inference across practical low-step and mid-step sampling budgets
  • Self-evaluation via classifier score + auxiliary terms (Eq. 13)

The original experiments were conducted at Adobe Research. Due to licensing constraints, pretrained weights cannot be publicly released. This repository is a clean code re-implementation by the first author, provided for the research community to reproduce the method on their own data and compute.

Installation

# Requirements: Python 3.10, CUDA 12, Linux, NVIDIA GPUs
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --no-dev

Pretrained Models

Training requires three pretrained model families:

| Model | HuggingFace ID | Role |
| --- | --- | --- |
| T5-XXL | google/t5-v1_1-xxl | Text encoder |
| CLIP ViT-L/14 | openai/clip-vit-large-patch14 | Text encoder |
| FLUX.1-dev | black-forest-labs/FLUX.1-dev | VAE (ae.safetensors) |

Note: FLUX.1-dev is a gated model — you need to accept its license on HuggingFace and obtain an access token before downloading.
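If you prefer to fetch the gated VAE weights manually rather than through the caching script, the standard `huggingface_hub` call looks roughly like the sketch below. The helper name `fetch_flux_vae` is hypothetical; only the `hf_hub_download` call and the repo/file names come from the table above.

```python
from huggingface_hub import hf_hub_download

def fetch_flux_vae(cache_dir: str, token: str) -> str:
    """Download the FLUX.1-dev VAE weights into `cache_dir`.

    FLUX.1-dev is gated: you must accept its license on HuggingFace and
    pass a valid access token, or this call will fail with a 401/403.
    """
    return hf_hub_download(
        repo_id="black-forest-labs/FLUX.1-dev",
        filename="ae.safetensors",
        cache_dir=cache_dir,
        token=token,
    )

# Example (requires a token with FLUX.1-dev access):
# path = fetch_flux_vae("/path/to/model_cache", os.environ["HF_TOKEN"])
```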

Use the provided caching script to download everything in one go:

bash scripts/cache_pretrained_only.sh \
  --cache-dir /path/to/model_cache \
  --hf-token <your_hf_token> \
  --flux-vae-only

Then set the cache directory before training/inference:

export SELFE_CACHE_DIR=/path/to/model_cache

Dataset Format

Training data uses a tab-separated manifest:

path/to/image_0001.jpg	A photo of a red bird on a branch
path/to/image_0002.png	A watercolor painting of a mountain village
  • Column 1: image path
  • Column 2: prompt text
  • Relative image paths are resolved against data.params.base_dir
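A manifest in this format can be parsed in a few lines. The sketch below is illustrative, not the repo's actual data loader; it mirrors the documented behavior of resolving relative paths against `data.params.base_dir`.

```python
import csv
import os

def read_manifest(manifest_path, base_dir=""):
    """Read a tab-separated (image_path, prompt) manifest into a list of pairs.

    Illustrative sketch; the repo's loader may differ. Relative image paths
    are joined with `base_dir`, matching data.params.base_dir in the config.
    """
    pairs = []
    with open(manifest_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 2:
                continue  # skip malformed lines
            path, prompt = row[0], row[1]
            if not os.path.isabs(path):
                path = os.path.join(base_dir, path)
            pairs.append((path, prompt))
    return pairs
```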

Quick Smoke Test

We provide a small bundled dataset in debug_data/t2i_smoke/ to verify the full pipeline before downloading pretrained models:

SELFE_SKIP_PRETRAINED=1 ./run.sh configs/debug/selfe_smoke.yaml 1

SELFE_SKIP_PRETRAINED=1 replaces all pretrained weights (T5, CLIP, VAE) with random initialization — no SELFE_CACHE_DIR needed. Outputs are garbage, but the full pipeline (data loading → forward → loss → backward → checkpointing) runs correctly.

Training

  1. Edit configs/train/selfe.yaml:
    • data.params.train_manifest
    • data.params.base_dir
    • optionally data.params.val_manifest
  2. Launch training:
./run.sh configs/train/selfe.yaml 8

The public release exposes one full training config and one smoke-test config:

  • configs/train/selfe.yaml: full SelfE training
  • configs/debug/selfe_smoke.yaml: bundled smoke-test config

Inference

# Single GPU
./infer.sh configs/infer/selfe.yaml

# Multi-GPU (e.g., 4 GPUs)
./infer.sh configs/infer/selfe.yaml 4

Edit configs/infer/selfe.yaml to set:

  • inferencer.ckpt_dir
  • inferencer.output_dir
  • prompt file and sampling settings under inferencer.defaults

Config Layout

configs/
  base/selfe_system.yaml
  train/selfe.yaml
  infer/selfe.yaml
  debug/selfe_smoke.yaml
  • base/: shared system definition
  • train/: dataset-backed training entrypoint
  • infer/: checkpoint-backed inference entrypoint
  • debug/: smoke test config with bundled data

Citation

If you find this work useful, please cite:

@article{yu2025selfe,
  title={Self-Evaluation Unlocks Any-Step Text-to-Image Generation},
  author={Yu, Xin and Qi, Xiaojuan and Li, Zhengqi and Zhang, Kai and Zhang, Richard and Lin, Zhe and Shechtman, Eli and Wang, Tianyu and Nitzan, Yotam},
  journal={arXiv preprint arXiv:2512.22374},
  year={2025}
}

Acknowledgments

This codebase is built on top of minFM by Kai Zhang et al.

License

This project is licensed under the Apache 2.0 License — see LICENSE for details.
