Code for Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach

drboog/ProFusion

ProFusion

ProFusion (with an encoder pre-trained on a large dataset such as CC3M) can be used to efficiently construct a customization dataset, which can in turn be used to train a tuning-free customization assistant (CAFE).

Given a test image, the assistant performs customized generation in a tuning-free manner. It can take complex user input and generate a text explanation and elaboration along with the image, without any fine-tuning.


Figure: Results from CAFE

Code for Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach.


Figure: Results from ProFusion


ProFusion is a framework for customizing pre-trained large-scale text-to-image generation models; our examples use Stable Diffusion 2.

Figure: Illustration of the proposed ProFusion framework


With ProFusion, you can generate an unlimited number of creative images for a novel or unique concept from a single test image on a single GPU (about 20 GB of GPU memory is needed when fine-tuning with batch size 1).


Figure: Results from ProFusion


Example

  • Install dependencies (we modified the original diffusers, so install the copy in ./diffusers);

      cd ./diffusers
      pip install -e .
      cd ..
      pip install accelerate==0.16.0 torchvision transformers==4.25.1 datasets ftfy tensorboard Jinja2 regex tqdm joblib 
    
  • Initialize Accelerate;

      accelerate config
    
  • Download a model pre-trained on FFHQ;

  • Customize the model with a test image; an example is shown in the notebook test.ipynb.
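
Before opening the notebook, it can help to confirm that the two pinned packages from the install step are the versions actually present in the environment. The following is a minimal sketch using only the Python standard library; the helper name `check_pins` and the `PINNED` table are illustrative and not part of this repository.

```python
from importlib import metadata

# Pins copied from the pip command above; the other packages are unpinned there.
PINNED = {"accelerate": "0.16.0", "transformers": "4.25.1"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatched or missing package.

    `found` is None when the package is not installed.
    `get_version` is injectable so the check can be exercised without the packages.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    issues = check_pins(PINNED)
    for name, (expected, found) in issues.items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
    if not issues:
        print("Pinned packages match.")
```

An empty result means both pins match; anything else lists the expected and installed versions side by side.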

Train Your Own Encoder

To train a PromptNet encoder for another domain or on your own dataset:

  • First, prepare an image-only dataset;

    • In our experiments on the human face domain, we use FFHQ. Our pre-processed FFHQ can be found at the Google Drive link;
    • We also trained an encoder on CC3M, which leads to good customization results on arbitrary downstream domains;
  • Then, run

      accelerate launch --mixed_precision="fp16" train.py \
            --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-base" \
            --train_data_dir=./images_512 \
            --max_train_steps=80000 \
            --learning_rate=2e-05 \
            --output_dir="./promptnet" \
            --train_batch_size=8 \
            --promptnet_l2_reg=0.000 \
            --gradient_checkpointing
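
When adapting the command above to a different dataset, a rough estimate of how many images the run consumes can guide the choice of `--max_train_steps`. The sketch below assumes that each counted step processes one batch per process, as in the standard diffusers training scripts; this repository's `train.py` may count steps differently. The function names, the `num_processes` and `grad_accum_steps` parameters, and the use of the original FFHQ size of 70,000 images (the pre-processed copy may differ) are assumptions for illustration.

```python
def samples_seen(max_train_steps, train_batch_size, num_processes=1, grad_accum_steps=1):
    """Images consumed over a run, assuming one optimizer update per counted step."""
    return max_train_steps * train_batch_size * num_processes * grad_accum_steps


def epochs_over(total_samples, dataset_size):
    """Approximate number of passes over a dataset of `dataset_size` images."""
    return total_samples / dataset_size


# Values taken from the command above: 80000 steps, batch size 8, single process.
total = samples_seen(max_train_steps=80_000, train_batch_size=8)
print(total)

# Assumed dataset size: the original FFHQ release contains 70,000 images.
print(round(epochs_over(total, 70_000), 2))
```

Under these assumptions the run sees 640,000 images, or roughly nine passes over FFHQ. For a much larger dataset such as CC3M, the same step budget covers only a fraction of one pass.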
    

Citation

@article{zhou2023enhancing,
  title={Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach},
  author={Zhou, Yufan and Zhang, Ruiyi and Sun, Tong and Xu, Jinhui},
  journal={arXiv preprint arXiv:2305.13579},
  year={2023}
}
