Wenjie Xuan, Jing Zhang, Juhua Liu, Bo Du, Dacheng Tao
This is the official implementation for the paper "Rethink Sparse Signals for Pose-guided Text-to-image Generation", accepted by ICCV 2025. We propose a Spatial-Pose ControlNet, namely SP-Ctrl, which equips sparse signals with robust controllability for pose-guided image generation. Our work features an improved spatial-pose representation and a keypoint-concept learning and alignment strategy. Please refer to our paper for more details.
- [2025/07/10]: The code for training, inference, evaluation, and visualization is released. Some checkpoints are available for AP-10K and HumanArt.
- [2025/06/26]: Our work is accepted by ICCV 2025. We are preparing to release the code in a few weeks.
- Further curation of the code.
- Release the checkpoints of our SP-Ctrl on AP-10K. Coming soon!
- Release the checkpoints of our SP-Ctrl on HumanArt.
- Release the checkpoints of pre-trained ControlNet on AP-10K with different conditions.
- Release the code for training, inference, evaluation, and visualization.
- Release the code for preparing datasets.
- We rethink the sparse signals for pose-guided T2I generation and propose SP-Ctrl, a spatial-pose ControlNet that enables precise pose alignment with sparse signals. It reveals the potential of sparse signals in spatially controllable generation.
- A Spatial-Pose Representation for better pose instructions. We introduce a Spatial-Pose Representation with learnable keypoint embeddings to enhance the expressiveness of sparse pose signals.
- A Keypoint-Concept Learning strategy to encourage text-spatial alignment. We propose Keypoint Concept Learning, a novel strategy that sharpens keypoint focus and discrimination, which enhances details and improves pose alignment.
- Better pose accuracy, diversity and cross-species generation ability. Experiments on animal- and human-centric T2I generation validate that our method achieves performance comparable to dense signals. Moreover, our method advances in diversity and cross-species generation.
Recommended: Python=3.11 torch=2.1.2 CUDA=12.2 diffusers=0.25.1 mmpose==0.24.0 mmcv==1.5.0 mmdet==2.26.0
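Once an environment has been set up with the commands that follow, the recommended versions above can be compared against what is actually installed. The snippet below is a minimal sketch and not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, the distribution names are assumed to match the names written above (for example, mmcv is sometimes installed as `mmcv-full`), and the mmpose-related packages are expected only in the optional ViTPose environment.

```python
from importlib import metadata

# Recommended versions copied from the line above (names as written there).
RECOMMENDED = {
    "torch": "2.1.2",
    "diffusers": "0.25.1",
    "mmpose": "0.24.0",
    "mmcv": "1.5.0",
    "mmdet": "2.26.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(recommended, lookup=installed_version):
    """Map each package whose version differs to (installed, recommended)."""
    mismatches = {}
    for name, wanted in recommended.items():
        found = lookup(name)
        # Ignore local build tags such as "+cu121" when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (found, wanted)
    return mismatches


for name, (found, wanted) in find_mismatches(RECOMMENDED).items():
    print(f"{name}: installed={found}, recommended={wanted}")
```

A mismatch is not necessarily an error; the listed versions are only the recommended configuration.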
# set up repository
git clone https://github.com/DREAMXFAR/SP-Ctrl.git SP-Ctrl-main
cd SP-Ctrl-main
# install conda environment
conda create -n spctrl python=3.11
conda activate spctrl
pip install -r requirements.txt
# [optional] To run the evaluation with ViTPose, prepare a separate environment with mmpose.
# We provide our environment configuration yaml for reference, and we recommend following
# the official ViTPose instructions: https://github.com/ViTAE-Transformer/ViTPose
conda create -n vitpose python=3.11
conda activate vitpose
pip install -r vitpose_requirement.txt

You can download the following checkpoints and put them in controlnet_checkpoint/.
| Model | Download | Key | Notes |
|---|---|---|---|
| ControlNet-ap10k-mask | link | 2025 | ControlNet for AP-10K conditioned on mask. |
| ControlNet-ap10k-depth | link | 2025 | ControlNet for AP-10K conditioned on depth. |
| ControlNet-ap10k-openpose | link | 2025 | ControlNet for AP-10K conditioned on openpose. |
| Ours-ap10k | - | - | Available in a few weeks. |
| Ours-humanart | link | 2025 | SP-Ctrl for HumanArt. |
Here we suppose you are in `SP-Ctrl-main/`.

- Download our released checkpoints and put them under `./controlnet_checkpoint`.
- We provide a demo for inference on HumanArt. Run the following script; the output is stored in `./output_dir/demo`.

      # 1. configure the checkpoint path and prompt first
      # 2. run the following script for inference
      bash inference/submit_inference_humanart.sh
- Download the AP-10K dataset and HumanArt dataset.
- Pre-process the raw dataset and generate new `jsonl` files for training and validation by running the following scripts. Note that these `.py` files are just for reference, and you can process the raw dataset on your own. We recommend having a look at the code before running them.

      # generate jsonl for training. Note that mmpose is required.
      cd custom_dataset/utils
      # prepare our spa-textual pose, which renders the pose with the keypoint and skeleton indices.
      # for kpt, idx=kpt_id; for skeleton, idx=sks_id+50. refer to gen_keypoint_spatext.py, line 93 for details
      python gen_keypoint_spatext.py
      # [optional] generate other types of conditions; some extra models need to be downloaded.
      python get_mask_with_sam.py   # for samhq mask
      python gen_pose_gt.py         # for openpose
      python gen_depth_map.py       # for ZoeDepth
      # generate jsonl for dataset
      python prepare_image_json.py

- If you follow the provided scripts, the final dataset structure should be organized as follows. Note that this layout is only required by our data loaders, including `coco_keypoints_offline_dataset.py` and `coco_humanart_keypoints_offline_dataset.py` under `./custom_datasets/`. Remember to configure the global variables in these files. You can also write your own dataloader.

      - AP-10K_triplet
          - data/                          # AP-10K images
          - pose_kptindex/                 # spatext pose
          - [optional] sam_hq_mask/        # mask
          - [optional] mmpose_skeleton/    # openpose
          - [optional] zeo_depth_masked/   # masked depth map
          - categories.jsonl               # category information
          - train_split1.jsonl             # train information
          - val_split1.jsonl               # val information
      - HumanArt_triplet
          - images/                        # HumanArt images
          - annotations/                   # HumanArt annotations
          - pose/                          # HumanArt pose annotations
          - humanart_kptindex/             # spatext pose
          - [optional] humanart_openpose/  # openpose
          - categories.jsonl               # category information
          - train_split1.jsonl             # train information
          - val_split1.jsonl               # val information
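The index convention mentioned in the script comments (keypoints drawn with `kpt_id`, skeleton links with `sks_id + 50`) can be illustrated with a small pure-Python sketch. It is not the code in `gen_keypoint_spatext.py`: the function names, the background value `255`, the single-pixel keypoints, and the one-pixel-wide lines are all assumptions made for illustration.

```python
BACKGROUND = 255       # assumed value for pixels not covered by the pose
SKELETON_OFFSET = 50   # from the script comment: skeleton idx = sks_id + 50


def line_pixels(x0, y0, x1, y1):
    """Integer pixels on the segment (x0, y0)-(x1, y1), via Bresenham's algorithm."""
    pixels = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        pixels.append((x0, y0))
        if x0 == x1 and y0 == y1:
            return pixels
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def render_index_map(height, width, keypoints, skeleton):
    """Render a 2-D index map as a list of rows.

    keypoints: list of (x, y) or None for missing keypoints; position = kpt_id.
    skeleton:  list of (kpt_id_a, kpt_id_b) pairs; position = sks_id.
    """
    canvas = [[BACKGROUND] * width for _ in range(height)]

    def put(x, y, value):
        if 0 <= x < width and 0 <= y < height:
            canvas[y][x] = value

    # Draw skeleton links first so keypoint indices stay visible on top.
    for sks_id, (a, b) in enumerate(skeleton):
        if keypoints[a] is None or keypoints[b] is None:
            continue
        for x, y in line_pixels(*keypoints[a], *keypoints[b]):
            put(x, y, sks_id + SKELETON_OFFSET)

    for kpt_id, point in enumerate(keypoints):
        if point is not None:
            put(point[0], point[1], kpt_id)
    return canvas
```

Storing indices rather than colors means each pixel can later be mapped to a learnable embedding by a simple lookup, which is the role the spatial-pose representation plays in the method.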
- Download the SD_v1.5 parameters from Hugging Face. Put the SD checkpoints anywhere you like.
- Configure the hyper-parameters in `./train.sh` and run the script.

      # train, here is the recommended configuration for training
      sh train.sh

  Basic configurations

      --pretrained_model_name_or_path     # Path to pretrained model or model identifier from huggingface.co/models.
      --controlnet_model_name_or_path     # Path to pretrained controlnet model or model identifier from huggingface.co/models. If not specified, controlnet weights are initialized from unet.
      --output_dir                        # The output directory where the model predictions and checkpoints will be written.
      --dataset_name                      # Set the dataset, e.g. ap10k or humanart.
      --conditioning_mode                 # Set the condition mode for the custom offline dataset, e.g. spatext or openpose.
      --use_extra_mask                    # Whether to add extra mask conditions.

  Spatext related configurations

      --add_keypoint_prompt               # The percentage of adding keypoint names to the prompt.
      --enable_learnable_kpttoken         # Use learnable textual embeddings of the keypoint name text.
      --spatext_cond_channels             # The number of output channels of the spatext_embedding layer.
      --spatext_init_emb                  # The initial spatext embedding, None for random, or xxx.bin to load.
      --spatext_skeleton_type             # SPATEXT_SKELETON_TYPE, the skeleton embedding.
      --spatext_enable_learn_kpt          # If True, let the model learn the spatext embedding of the keypoints.
      --spatext_enable_learn_sks          # If True, let the model learn the spatext embedding of the skeleton.
      --spatext_detach_kpt_grad_from_sks  # If True, detach the gradient from keypoint interpolation.
      --spatext_freeze_embedding          # If True, freeze the parameters of the spatext embedding layers.
      --spatext_layer_version             # {sigmoid_version} The spatext embedding layer version.

  Cross-attention loss configurations

      --add_cross_attn_loss               # Add cross-attention loss.
      --attn_loss_weight                  # Weight for the cross-attention loss; only takes effect when add_cross_attn_loss=True.
      --heatmap_type                      # The heatmap type; use the whole or partition heatmap.
      --attn_loss_type                    # The type of cross-attention map loss, bce or mse; default mse.
      --attn_loss_start_timestep          # Start computing the cross-attention loss from this timestep.
      --attn_loss_end_timestep            # Stop computing the cross-attention loss at this timestep.
      --attn_loss_select_layer_type       # Select only some layers to compute the cross-attention loss. {down_blocks.0,down_blocks.1,down_blocks.2,mid_block}
      --attn_loss_detach_query            # Detach the loss backward to init noise in the attention map.
      --text_encoder_lr                   # Learning rate for the text encoder.
      --text_encoder_lr_scheduler         # Set the learning rate schedule for the text encoder.
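To make the cross-attention loss options above more concrete, here is a minimal pure-Python sketch of how an attention map might be compared with a keypoint heatmap. It is an illustration under assumptions, not the repository's implementation: the function names and default values are hypothetical, maps are plain nested lists with values in [0, 1], and the loss is assumed to be active for timesteps between `start_timestep` and `end_timestep` inclusive.

```python
import math


def attn_map_loss(attn, heatmap, loss_type="mse"):
    """Mean MSE or BCE between an attention map and a target heatmap of equal shape."""
    eps = 1e-7
    total, count = 0.0, 0
    for attn_row, heat_row in zip(attn, heatmap):
        for a, h in zip(attn_row, heat_row):
            if loss_type == "mse":
                total += (a - h) ** 2
            elif loss_type == "bce":
                a = min(max(a, eps), 1.0 - eps)  # clamp to keep log() finite
                total += -(h * math.log(a) + (1.0 - h) * math.log(1.0 - a))
            else:
                raise ValueError(f"unknown loss_type: {loss_type}")
            count += 1
    return total / count


def total_loss(denoise_loss, attn, heatmap, timestep,
               attn_loss_weight=0.1, start_timestep=0, end_timestep=1000,
               loss_type="mse"):
    """Add the weighted attention loss only inside the assumed timestep window.

    The default weight and window are placeholders, not the recommended settings.
    """
    if start_timestep <= timestep <= end_timestep:
        return denoise_loss + attn_loss_weight * attn_map_loss(attn, heatmap, loss_type)
    return denoise_loss
```

In this sketch, `loss_type` mirrors `--attn_loss_type`, `attn_loss_weight` mirrors `--attn_loss_weight`, and the window mirrors `--attn_loss_start_timestep` and `--attn_loss_end_timestep`.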
- To run inference, please download the pre-trained checkpoints first. Then set the paths in the code and run the following scripts.

      # for ap10k-mask, depth, openpose, and our spatial-pose
      bash inference/submit_inference_ap10k.sh
      # for humanart-spatial-pose
      bash inference/submit_inference_humanart.sh

- We provide some utilities for convenience.

      # generate prompts with keypoint tokens from the jsonl file; only supports ap10k
      python inference/get_pose_token_prompt.py

- Here is a tool for visualizing cross-attention maps.

      cd tools/cross-attention-map
      # correct the model paths in this file before running it.
      python single_inference.py
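The idea of adding keypoint names to a prompt (compare `--add_keypoint_prompt` and `inference/get_pose_token_prompt.py`) can be sketched as follows. This is a hypothetical illustration, not the repository's script: the function name, the prompt template, and the short keypoint list are assumptions, and the real token format may differ.

```python
import random

# A few illustrative keypoint names; the list used by the repository may differ.
EXAMPLE_KEYPOINTS = ["left eye", "right eye", "nose", "neck", "root of tail"]


def add_keypoint_names(caption, keypoint_names, visible, probability, rng=random):
    """With the given probability, append the names of visible keypoints to a caption.

    visible: list of booleans aligned with keypoint_names.
    """
    if rng.random() >= probability:
        return caption
    names = [name for name, seen in zip(keypoint_names, visible) if seen]
    if not names:
        return caption
    return f"{caption}, with {', '.join(names)}"
```

With `probability=1.0` the names are always appended, which corresponds to building a fully keypoint-annotated prompt at inference time; smaller values mix plain and annotated captions.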
- Here is the code for evaluating with ViTPose. To run the evaluation, you need to prepare the environment for ViTPose following the official ViTPose instructions, and put our checkpoints under `./checkpoints`.
- Download the pre-trained ViTPose checkpoints. We use the official pre-trained ViTPose+-H for AP-10K, and the mmpose ViTPose-H-humanart-coco for HumanArt.
- Run the test with the provided scripts. We provide our ViTPose config files under `tools/eval_vitpose_metrics/config` for reference. We recommend having a look at the code before running it.

      # 1. Set up the config files.
      # 2. Conduct evaluation with ViTPose
      sh ./tools/eval_vitpose_metrics/submit_test.sh
      sh ./tools/eval_vitpose_metrics/submit_infer.sh   # output with visualized results
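Pose mAP in COCO-style keypoint evaluation is built on Object Keypoint Similarity (OKS). The sketch below shows the standard OKS formula for one predicted pose against one ground-truth pose. It is an illustration only, assuming the evaluation follows the usual COCO convention; the function name and the example sigmas are hypothetical and not taken from the evaluation scripts.

```python
import math


def object_keypoint_similarity(pred, gt, visibility, area, sigmas):
    """COCO-style OKS between predicted and ground-truth keypoints.

    pred, gt:    lists of (x, y) coordinates, one per keypoint.
    visibility:  ground-truth visibility flags (> 0 means labelled).
    area:        ground-truth object area in pixels (the scale term s^2).
    sigmas:      per-keypoint falloff constants.
    """
    total, labelled = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, sigmas):
        if v <= 0:
            continue  # unlabelled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k2 = (2.0 * sigma) ** 2
        total += math.exp(-d2 / (2.0 * area * k2))
        labelled += 1
    return total / labelled if labelled else 0.0
```

In the COCO protocol, a prediction counts as correct when its OKS exceeds a threshold, and mAP averages the resulting precision over thresholds from 0.50 to 0.95 in steps of 0.05.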
For more details, you can refer to our paper.
| Dataset | Method | Pose mAP | FID | CLIP-Score |
|---|---|---|---|---|
| AP-10K | T2I-Adapter | 48.16 | 27.29 | 25.52 |
| AP-10K | ControlNet | 44.25 | 19.40 | 24.77 |
| AP-10K | Ours | 55.63 | 18.52 | 23.86 |
| HumanArt | T2I-Adapter | 38.37 | 40.01 | 27.36 |
| HumanArt | ControlNet | 45.26 | 26.69 | 27.84 |
| HumanArt | HumanSD | 49.92 | 35.18 | 27.35 |
| HumanArt | GRPose | 50.93 | 28.85 | 27.95 |
| HumanArt | Ours | 29.30 | 25.94 | 9.11 |
Here we showcase several advantages of our method. Please refer to our paper for more details.
- Our method can synthesize images with precisely aligned poses and diverse shapes.
- Our method can realize cross-species generation with accurate pose control.
- Benefiting from the sparse pose signal, our method is friendlier to pose editing.
If you have any questions, please feel free to contact us. We will respond as soon as possible.
- None
- Our implementation is largely based on diffusers, textual-inversion, ControlNet, and attention-map-diffusers. Thanks to all the contributors for their wonderful work.
If you find our findings helpful in your research, please consider giving this repository a star and citing:
@misc{xuan2025rethinksparsesignalsposeguided,
title={Rethink Sparse Signals for Pose-guided Text-to-image Generation},
author={Wenjie Xuan and Jing Zhang and Juhua Liu and Bo Du and Dacheng Tao},
year={2025},
eprint={2506.20983},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.20983},
}



