Official Pytorch implementation of the paper Learning Input-agnostic Manipulation Directions in StyleGAN with Text Guidance (accepted to ICLR 2023)

Learning Input-agnostic Manipulation Directions in StyleGAN with Text Guidance — Official PyTorch Implementation


Authors: Yoonjeon Kim1, Hyunsu Kim2, Junho Kim2, Yunjey Choi2, Eunho Yang1,3†
1 Korea Advanced Institute of Science and Technology (KAIST), 2 NAVER AI Lab, 3 AITRICS, † Corresponding author
With the advantages of fast inference and human-friendly flexible manipulation, image-agnostic style manipulation via text guidance enables new applications that were not previously available. The state-of-the-art text-guided image-agnostic manipulation method embeds the representation of each channel of StyleGAN independently in the Contrastive Language-Image Pre-training (CLIP) space, and provides it in the form of a Dictionary to quickly find the channel-wise manipulation direction at inference time. However, in this paper we argue that this dictionary, which is constructed by controlling each channel individually, is limited in accommodating the versatility of text guidance, since the collective and interactive relations among multiple channels are not considered. Indeed, we show that it fails to discover a large portion of the manipulation directions that can be found by existing methods, which manually manipulate the latent space without text. To alleviate this issue, we propose a novel method that learns a Dictionary, whose entries each correspond to the representation of a single channel, by taking into account the manipulation effect arising from the interaction with multiple other channels. We demonstrate that our strategy resolves the inability of previous methods to find diverse known directions from unsupervised methods and unknown directions from random text, while maintaining real-time inference speed and disentanglement ability.

Manipulation Examples

Here we show several examples of text-guided manipulation. The first column is the original image to be manipulated, the second column is the manipulation result of StyleCLIP, and the last column is the result of our method, Multi2One.



The code runs on Python 3. To install the dependencies, run:

    conda install pytorch torchvision torchaudio pytorch-cuda=<YOUR_CUDA_VERSION> -c pytorch -c nvidia
    pip install ftfy regex tqdm pyyaml matplotlib pillow==6.2.2
    pip install git+

1. Pretrained Models and StyleSpace Statistics for Multi2One

Multi2One requires checkpoints of pre-trained StyleGAN2-ADA models:

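| Image Type to Edit | Size      | Pretrained Model | Dataset     |
|--------------------|-----------|------------------|-------------|
| Human face         | 1024×1024 | StyleGAN2-ADA    | FFHQ        |
| Car                | 512×384   | StyleGAN2-ADA    | LSUN-Car    |
| Church             | 256×256   | StyleGAN2-ADA    | LSUN-Church |
| Cat                | 512×512   | StyleGAN2-ADA    | AFHQ-Cat    |
| Dog                | 512×512   | StyleGAN2-ADA    | AFHQ-Dog    |
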
  • For convenience, we provide the checkpoints of the pretrained StyleGAN models and the StyleSpace statistics used in our experiments. They are downloaded automatically from Google Drive into the appropriate directories. Checkpoints are stored under Pretrained. Precomputed StyleSpace statistics of the pretrained models, which are used in the manipulation process, are stored under stylespace.
  • The StyleSpace statistics and checkpoints are the same as those from StyleCLIP.

2. Dictionary for Manipulation

Both Multi2One and StyleCLIP Global Direction require a Dictionary, which is a CLIP representation of each StyleSpace channel. Under the directory dictionary, each dataset has its corresponding dictionary for Multi2One (named and StyleCLIP (named fs3.npy). We use the fs3.npy files provided by the official StyleCLIP GitHub repository for direct comparison with our method.
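
To make the role of the Dictionary concrete, here is a minimal sketch of how a per-channel dictionary can be scored against a text direction. It is not the repository's code: the toy values, the names `toy_dictionary`, `text_direction`, and `cosine`, and the use of cosine similarity as the relevance score are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy dictionary: one CLIP-space vector per StyleSpace channel
# (3 channels, 4 dimensions; real dictionaries are far larger).
toy_dictionary = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]

# Hypothetical text direction, e.g. embedding(target prompt) - embedding(neutral prompt).
text_direction = [1.0, 0.0, 0.0, 0.0]

# One relevance score per channel: how well moving that channel matches the text.
channel_scores = [cosine(row, text_direction) for row in toy_dictionary]
```

In this toy setup the first channel aligns fully with the text direction, the second not at all, and the third partially, so a lookup over the dictionary yields a channel-wise direction without any per-image optimization.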

3. Manipulation of Real Images

For manipulation, we provide a single inverted code ( of a real image for the human face domain (FFHQ). For inversion, we follow StyleCLIP, which relies on e4e. To manipulate other real images, please refer to e4e-github and invert them into the W space of the pretrained StyleGAN.

Image Manipulation

To generate more samples of text-driven manipulation, run

    python --dataset <dataset>

where dataset is one of

  • ffhq
  • church
  • car
  • afhqdog
  • afhqcat

The manipulation parameters for each dataset are set in the configuration file config.yaml. See the file for details. The main parameters are:

  • targets: List of text prompts used for manipulation.
  • alpha: The manipulation strength, which controls the intensity of the manipulation. It should be between 0 and 10.
  • topk: The number of channels that are manipulated, which controls the level of disentanglement. It should be between 0 and 300 for realistic manipulation results.
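
As a rough illustration of how alpha and topk interact, the sketch below keeps only the topk channels with the largest absolute score, scales them by alpha, and adds the result to a style code. The function names `build_direction` and `apply_direction` and all values are hypothetical, and the selection rule is an assumption; the repository's actual implementation may differ.

```python
def build_direction(channel_scores, alpha, topk):
    """Keep the topk channels with the largest |score|, zero the rest, scale by alpha."""
    ranked = sorted(
        range(len(channel_scores)),
        key=lambda i: abs(channel_scores[i]),
        reverse=True,
    )
    keep = set(ranked[:topk])
    return [alpha * s if i in keep else 0.0 for i, s in enumerate(channel_scores)]

def apply_direction(style_code, direction):
    """Shift each channel of a style code by the matching entry of the direction."""
    return [s + d for s, d in zip(style_code, direction)]

# Hypothetical per-channel relevance scores for one text prompt.
scores = [0.75, -0.125, 0.5, -0.625, 0.0625]

direction = build_direction(scores, alpha=4.0, topk=2)
edited = apply_direction([1.0, 1.0, 1.0, 1.0, 1.0], direction)
```

Under these assumptions, a smaller topk touches fewer channels, which tends to keep the edit localized, while a larger alpha pushes the selected channels further from the original style code.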

Learning Dictionary

Though we provide the dictionary learned from the pair of unsupervised directions and their CLIP representations, for those who want to reproduce the dictionary learning process from scratch, we demonstrate the process below.

Train a dictionary using unsupervised directions

    python --dataset <dataset>

where dataset is one of

  • ffhq
  • church
  • car
  • afhqdog
  • afhqcat

This creates a new Multi2One dictionary under dictionary/<args.dataset>/. To use the new dictionary, replace the dictionary path used for manipulation with the path to the new file. The pairs of unsupervised directions and their CLIP representations are stored under precomputed_pairs. As explained in the paper, we rely on these pairs to learn a dictionary.
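
The idea of learning a dictionary from such pairs can be sketched as a small regression problem. The example below is a simplified stand-in, not the paper's objective or the repository's training code: it assumes that the CLIP-space effect of a multi-channel direction is the direction-weighted sum of dictionary rows, and fits the rows by plain gradient descent on a squared error. All names, sizes, and hyperparameters are hypothetical.

```python
import random

def predict(dictionary, direction):
    """Predicted CLIP-space effect: sum over channels of direction[c] * dictionary[c]."""
    dim = len(dictionary[0])
    return [
        sum(direction[c] * dictionary[c][k] for c in range(len(dictionary)))
        for k in range(dim)
    ]

def mean_squared_error(dictionary, pairs):
    """Average squared distance between predicted and target CLIP representations."""
    total = 0.0
    for direction, target in pairs:
        pred = predict(dictionary, direction)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(pairs)

def fit_dictionary(pairs, num_channels, dim, lr=0.05, steps=200):
    """Full-batch gradient descent on the mean squared error, starting from zeros."""
    dictionary = [[0.0] * dim for _ in range(num_channels)]
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(num_channels)]
        for direction, target in pairs:
            pred = predict(dictionary, direction)
            for c in range(num_channels):
                for k in range(dim):
                    grads[c][k] += 2.0 * direction[c] * (pred[k] - target[k]) / len(pairs)
        for c in range(num_channels):
            for k in range(dim):
                dictionary[c][k] -= lr * grads[c][k]
    return dictionary

# Synthetic stand-in for precomputed pairs: (multi-channel direction, CLIP representation).
rng = random.Random(0)
num_channels, dim = 4, 3
true_dictionary = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_channels)]
directions = [[rng.uniform(-1, 1) for _ in range(num_channels)] for _ in range(6)]
pairs = [(d, predict(true_dictionary, d)) for d in directions]

initial_error = mean_squared_error([[0.0] * dim for _ in range(num_channels)], pairs)
learned = fit_dictionary(pairs, num_channels, dim)
final_error = mean_squared_error(learned, pairs)
```

Because every training direction moves several channels at once, each learned row reflects how a channel behaves in combination with others, which is the intuition behind learning from multi-channel directions rather than from single-channel edits.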


This code is mainly built upon StyleCLIP Global Direction and rosinality's PyTorch implementation of StyleGAN.


title={Learning Input-agnostic Manipulation Directions in StyleGAN with Text Guidance},
author={Kim, Yoonjeon and Kim, Hyunsu and Kim, Junho and Choi, Yunjey and Yang, Eunho},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023} }

