PixelWizard is a high-resolution text-to-video generation framework for efficient 2K/4K video synthesis. It decouples global spatial-temporal structure modeling from high-resolution detail generation, then accelerates the expensive high-resolution stage with shortcut step-size conditioning.
- [2026.05] Initial repository for PixelWizard.
- Project page, paper link, checkpoints, and demo videos are coming soon.
git clone https://github.com/VisionForge-arch/PixelWizard
cd PixelWizard

# 1. Create and activate a clean environment.
conda create -n pixelwizard python=3.10
conda activate pixelwizard
# 2. Install PyTorch first. Choose the command matching your CUDA version.
# Example for CUDA 12.1:
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu121
# 3. Install the remaining Python dependencies.
pip install -r requirements.txt
# 4. Install flash-attn after PyTorch is available.
pip install flash-attn --no-build-isolation

Put all model weights under ./weight:
weight/
  Wan2.2-TI2V-5B/
  PixelWizard/
    lr/model.pt
    2k/model.pt
    4k/model.pt
Download the Wan2.2-TI2V-5B base checkpoint:
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./weight/Wan2.2-TI2V-5B

Download the PixelWizard checkpoints and place them under ./weight/PixelWizard:
huggingface-cli download wxli318/PixelWizard --local-dir ./weight/PixelWizard

generate.py takes the following checkpoint arguments:

- --ckpt_dir: Wan2.2-TI2V-5B base checkpoint directory, for example ./weight/Wan2.2-TI2V-5B.
- --lr_ckpt: optional low-resolution (LR) anchor checkpoint, for example ./weight/PixelWizard/lr/model.pt. If omitted, the LR stage uses the base Wan2.2 weights.
- --hr_ckpt: required PixelWizard high-resolution (HR) shortcut checkpoint, for example ./weight/PixelWizard/2k/model.pt.
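
Before launching a run, it can help to confirm that the paths passed to these flags actually exist. The following is a minimal sketch, not part of the repository: the helper name `missing_weights` and its parameters are hypothetical, and it only assumes the ./weight layout described above.

```python
from pathlib import Path


def missing_weights(root, resolution, use_lr_ckpt=True):
    """Return the expected checkpoint paths that do not exist under `root`.

    Hypothetical helper: it only mirrors the ./weight layout from this README.
    """
    root = Path(root)
    expected = [
        root / "Wan2.2-TI2V-5B",                         # value for --ckpt_dir
        root / "PixelWizard" / resolution / "model.pt",  # value for --hr_ckpt
    ]
    if use_lr_ckpt:
        # --lr_ckpt is optional, so only check it when it will be passed.
        expected.append(root / "PixelWizard" / "lr" / "model.pt")
    return [path for path in expected if not path.exists()]
```

Calling `missing_weights("./weight", "2k")` returns an empty list when everything needed for a 2K run is in place, and the list of missing paths otherwise.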
Single-GPU generation:
python generate.py \
--ckpt_dir ./weight/Wan2.2-TI2V-5B \
--lr_ckpt ./weight/PixelWizard/lr/model.pt \
--hr_ckpt ./weight/PixelWizard/<resolution>/model.pt \
--prompt_file prompts.txt \
--video_dir outputs/videos \
--resolution <2k_or_4k>

For single-GPU inference, expect approximately 52 GB VRAM for 2K generation and 100 GB VRAM for 4K generation.
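
The approximate VRAM figures above can be turned into a quick pre-flight check. This is only a sketch: the names `APPROX_VRAM_GB` and `largest_preset` are hypothetical, and the thresholds are the rough numbers quoted in this README rather than measured limits.

```python
# Approximate single-GPU requirements quoted in this README, in GB.
APPROX_VRAM_GB = {"2k": 52, "4k": 100}


def fits_single_gpu(total_vram_gb, resolution):
    """True if a GPU with `total_vram_gb` meets the quoted requirement."""
    return total_vram_gb >= APPROX_VRAM_GB[resolution]


def largest_preset(total_vram_gb):
    """Return the largest preset that fits on one GPU, or None if neither does."""
    for resolution in ("4k", "2k"):
        if fits_single_gpu(total_vram_gb, resolution):
            return resolution
    return None
```

With PyTorch installed, the total memory of the first device can be read from `torch.cuda.get_device_properties(0).total_memory` (in bytes). If `largest_preset` returns `None`, distributed generation is the option to try.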
Distributed generation:
torchrun --standalone --nproc_per_node=<n_gpus> generate.py \
--ckpt_dir ./weight/Wan2.2-TI2V-5B \
--lr_ckpt ./weight/PixelWizard/lr/model.pt \
--hr_ckpt ./weight/PixelWizard/<resolution>/model.pt \
--prompt_file prompts.txt \
--video_dir outputs/videos \
--resolution <2k_or_4k> \
--dit_fsdp \
--t5_fsdp \
--ulysses_size <n_gpus>

Set <resolution> to 2k or 4k. Distributed inference uses FSDP/Ulysses for multi-GPU memory sharding. Set <n_gpus> to the number of GPUs in the job. The pipeline still processes prompts one by one rather than distributing different prompts across GPUs.
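
For scripted launches, the two commands above can be assembled from a resolution and a GPU count. The sketch below is illustrative: `build_command` and its default paths are hypothetical, and it only uses the flags shown in this README.

```python
import shlex


def build_command(resolution, n_gpus=1, weight_root="./weight",
                  prompt_file="prompts.txt", video_dir="outputs/videos"):
    """Build the generate.py argv for single-GPU or distributed runs."""
    if resolution not in ("2k", "4k"):
        raise ValueError("resolution must be '2k' or '4k'")
    args = [
        "generate.py",
        "--ckpt_dir", f"{weight_root}/Wan2.2-TI2V-5B",
        "--lr_ckpt", f"{weight_root}/PixelWizard/lr/model.pt",
        "--hr_ckpt", f"{weight_root}/PixelWizard/{resolution}/model.pt",
        "--prompt_file", prompt_file,
        "--video_dir", video_dir,
        "--resolution", resolution,
    ]
    if n_gpus == 1:
        return ["python"] + args
    # Distributed runs add the FSDP flags and set --ulysses_size to the GPU count.
    launcher = ["torchrun", "--standalone", f"--nproc_per_node={n_gpus}"]
    return launcher + args + ["--dit_fsdp", "--t5_fsdp", "--ulysses_size", str(n_gpus)]
```

`shlex.join(build_command("4k", n_gpus=8))` gives a shell-ready string, and the list form can be passed directly to `subprocess.run`.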
By default, generate.py does not save HR latent .pt files. To save HR latents for later decoding or debugging, pass --save_dir outputs/hr_latents.
| Preset | Anchor Resolution | HR Resolution | HR Steps | Shift | Decode Patches |
|---|---|---|---|---|---|
| 2k | 448x256 | 2560x1440 | 4 | 5.5 | 3 |
| 4k | 448x256 | 3840x2144 | 4 | 5.8 | 4 |
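
To see how much work the HR stage adds on top of the anchor, the table can be written as data. This is a sketch with hypothetical names (`Preset`, `PRESETS`); it assumes resolutions are given as width x height and does not reflect how generate.py stores its presets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    anchor: tuple        # (width, height) of the LR anchor, assumed ordering
    hr: tuple            # (width, height) of the HR output, assumed ordering
    hr_steps: int
    shift: float
    decode_patches: int

    def pixel_ratio(self):
        """How many HR pixels are generated per anchor pixel."""
        return (self.hr[0] * self.hr[1]) / (self.anchor[0] * self.anchor[1])


# Values copied from the preset table above.
PRESETS = {
    "2k": Preset((448, 256), (2560, 1440), 4, 5.5, 3),
    "4k": Preset((448, 256), (3840, 2144), 4, 5.8, 4),
}
```

Under these assumptions, the 2k preset produces roughly 32 times as many pixels per frame as the anchor and the 4k preset roughly 72 times, while the number of HR steps stays at 4 for both.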
generate.py processes prompts one by one: LR anchor latent -> HR latent -> decoded video, then moves to the next prompt. The default --model_load_mode auto keeps models resident with CPU offload for single-process runs and reloads LR/HR models per prompt for distributed runs to reduce peak memory.
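
The control flow described above can be sketched as follows. This is not the code in generate.py: the function names, the stage callables, and the mode labels "resident" and "reload" are all hypothetical stand-ins for the behavior the paragraph describes.

```python
def resolve_load_mode(mode, world_size):
    """Resolve 'auto' as described above; other values pass through unchanged.

    'resident' and 'reload' are hypothetical labels, not actual CLI values.
    """
    if mode != "auto":
        return mode
    # Single process: keep models resident (with CPU offload).
    # Distributed: reload LR/HR models per prompt to reduce peak memory.
    return "resident" if world_size == 1 else "reload"


def run_prompts(prompts, lr_stage, hr_stage, decode):
    """Process prompts one by one: LR anchor -> HR latent -> decoded video."""
    videos = []
    for prompt in prompts:
        anchor_latent = lr_stage(prompt)
        hr_latent = hr_stage(prompt, anchor_latent)
        videos.append(decode(hr_latent))
    return videos
```

The point of the sketch is the ordering: each prompt finishes all three stages before the next one starts, regardless of how many GPUs take part.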
Decode options:
- --num_patches: number of spatial chunks for HR VAE decode.
- --patch_dim: decode split dimension, w by default.
- --overlap: latent-space overlap between chunks, blended with a cosine ramp.
- --vae_path: optional path to the Wan2.2 VAE checkpoint. If omitted, the VAE under --ckpt_dir is used.
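
To illustrate what chunked decoding with a cosine-ramp blend means, here is a one-dimensional sketch in plain Python. It is an assumption-laden toy, not the repository's decoder: the function names are hypothetical, the chunks are slices of a list rather than decoded VAE outputs, and the real pipeline applies the overlap in latent space along the chosen split dimension.

```python
import math


def chunk_bounds(length, num_patches, overlap):
    """Split [0, length) into chunks where neighbors share `overlap` elements."""
    edges = [i * length // num_patches for i in range(num_patches + 1)]
    bounds = []
    for i in range(num_patches):
        if edges[i + 1] - edges[i] < overlap:
            raise ValueError("overlap must not exceed the core chunk size")
        # Every chunk except the last extends into its right neighbor.
        extra = overlap if i < num_patches - 1 else 0
        bounds.append((edges[i], edges[i + 1] + extra))
    return bounds


def cosine_ramp(n):
    """Weights rising smoothly from near 0 to near 1 over n samples."""
    return [0.5 * (1.0 - math.cos(math.pi * (k + 0.5) / n)) for k in range(n)]


def blend_chunks(chunks, bounds, length, overlap):
    """Recombine chunks, cross-fading each overlap with complementary weights."""
    out = [0.0] * length
    ramp = cosine_ramp(overlap) if overlap > 0 else []
    last = len(chunks) - 1
    for i, (chunk, (start, end)) in enumerate(zip(chunks, bounds)):
        size = end - start
        for k, value in enumerate(chunk):
            weight = 1.0
            if i > 0 and k < overlap:
                weight = ramp[k]  # fade in over the left overlap
            elif i < last and k >= size - overlap:
                weight = 1.0 - ramp[k - (size - overlap)]  # fade out on the right
            out[start + k] += weight * value
    return out
```

Because the fade-in and fade-out weights sum to one at every overlapping position, chunks that agree in the overlap are reproduced exactly, and chunks that disagree slightly are cross-faded instead of meeting at a hard seam.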
If PixelWizard is useful for your research, please cite our paper. BibTeX will be updated after publication.
@misc{pixelwizard,
title = {PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolutions},
author = {Li, Wenxue and Ren, Jingjing and Zhang, Peng and Ye, Tian and Zhou, Daiguo and Luan, Jian and Zhu, Lei},
year = {2026}
}

PixelWizard is built on Wan2.2. We thank the Wan team for releasing their open video generation models and infrastructure.