PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolutions

PixelWizard is a high-resolution text-to-video generation framework for efficient 2K/4K video synthesis. It decouples global spatial-temporal structure modeling from high-resolution detail generation, then accelerates the expensive high-resolution stage with shortcut step-size conditioning.

News

[2026.05] Initial repository for PixelWizard.
Project page, paper link, checkpoints, and demo videos are coming soon.

Getting Started

1. Clone the Repository

git clone https://github.com/VisionForge-arch/PixelWizard
cd PixelWizard

2. Set Up the Environment

# 1. Create and activate a clean environment.
conda create -n pixelwizard python=3.10
conda activate pixelwizard

# 2. Install PyTorch first. Choose the command matching your CUDA version.
# Example for CUDA 12.1:
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu121

# 3. Install the remaining Python dependencies.
pip install -r requirements.txt

# 4. Install flash-attn after PyTorch is available.
pip install flash-attn --no-build-isolation
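
Before downloading weights, it can help to confirm that the key packages are importable. The snippet below is a minimal sketch and not part of the repository. The helper name `missing_packages` is hypothetical, and it only checks that the modules can be found, not that their versions or CUDA builds match.

```python
import importlib.util


def missing_packages(names=("torch", "torchvision", "flash_attn")):
    """Return the module names from `names` that cannot be found.

    The default names are the import names of the packages installed in the
    steps above (flash-attn is imported as `flash_attn`).
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All checked packages are importable.")
```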

3. Download Weights

Put all model weights under ./weight:

weight/
  Wan2.2-TI2V-5B/
  PixelWizard/
    lr/model.pt
    2k/model.pt
    4k/model.pt
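
To catch a misplaced download early, the layout above can be checked with a few lines of Python. This is a hedged sketch rather than a repository script. The function name `missing_weights` is hypothetical, and it only tests that the paths exist, not that the files are complete.

```python
from pathlib import Path

# Entries copied from the layout above, relative to the weight root.
EXPECTED = [
    "Wan2.2-TI2V-5B",
    "PixelWizard/lr/model.pt",
    "PixelWizard/2k/model.pt",
    "PixelWizard/4k/model.pt",
]


def missing_weights(root="./weight"):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_weights():
        print("missing:", rel)
```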

Download the Wan2.2-TI2V-5B base checkpoint:

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./weight/Wan2.2-TI2V-5B

Download the PixelWizard checkpoints and place them under ./weight/PixelWizard:

huggingface-cli download wxli318/PixelWizard --local-dir ./weight/PixelWizard

The inference commands below reference these checkpoints through the following arguments:

--ckpt_dir: Wan2.2-TI2V-5B base checkpoint directory, for example ./weight/Wan2.2-TI2V-5B.
--lr_ckpt: optional low-resolution (LR) anchor checkpoint, for example ./weight/PixelWizard/lr/model.pt. If omitted, the LR stage uses the base Wan2.2 weights.
--hr_ckpt: required PixelWizard high-resolution (HR) shortcut checkpoint, for example ./weight/PixelWizard/2k/model.pt.

4. Run Inference

Single-GPU generation:

python generate.py \
    --ckpt_dir ./weight/Wan2.2-TI2V-5B \
    --lr_ckpt ./weight/PixelWizard/lr/model.pt \
    --hr_ckpt ./weight/PixelWizard/<resolution>/model.pt \
    --prompt_file prompts.txt \
    --video_dir outputs/videos \
    --resolution <2k_or_4k>

For single-GPU inference, expect approximately 52 GB VRAM for 2K generation and 100 GB VRAM for 4K generation.
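
As a rough pre-flight check, the VRAM figures above can be compared against what `nvidia-smi` reports. The sketch below is an assumption-laden illustration. The helper names are hypothetical, the MiB values are divided by 1024 and compared directly with the "GB" figures quoted above, and having enough total memory does not guarantee a run will fit.

```python
import subprocess

# Approximate single-GPU requirements quoted above, in GB.
REQUIRED_GB = {"2k": 52, "4k": 100}


def parse_total_mib(smi_output):
    """Parse one integer MiB value per non-empty line of nvidia-smi output."""
    return [int(line) for line in smi_output.splitlines() if line.strip()]


def presets_that_fit(total_mib):
    """Return the presets whose quoted requirement is within `total_mib`."""
    total_gb = total_mib / 1024
    return [name for name, need in REQUIRED_GB.items() if total_gb >= need]


def query_gpus():
    """Ask nvidia-smi for the total memory (MiB) of each visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_total_mib(result.stdout)


if __name__ == "__main__":
    for index, mib in enumerate(query_gpus()):
        print(f"GPU {index}: {mib} MiB, presets that may fit: {presets_that_fit(mib)}")
```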

Distributed generation:

torchrun --standalone --nproc_per_node=<n_gpus> generate.py \
    --ckpt_dir ./weight/Wan2.2-TI2V-5B \
    --lr_ckpt ./weight/PixelWizard/lr/model.pt \
    --hr_ckpt ./weight/PixelWizard/<resolution>/model.pt \
    --prompt_file prompts.txt \
    --video_dir outputs/videos \
    --resolution <2k_or_4k> \
    --dit_fsdp \
    --t5_fsdp \
    --ulysses_size <n_gpus>

Set <resolution> to 2k or 4k, and set <n_gpus> to the number of GPUs in the job. Distributed inference uses FSDP and Ulysses to shard memory across GPUs; it does not distribute different prompts across GPUs, so the pipeline still processes prompts one by one.

By default, generate.py does not save HR latent .pt files. To save HR latents for later decoding or debugging, pass --save_dir outputs/hr_latents.
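
When launching runs from a script, the two commands above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in this section. The function `build_command` and its parameter names are hypothetical and not part of the repository.

```python
import shlex


def build_command(resolution, n_gpus=1, weight_root="./weight",
                  prompt_file="prompts.txt", video_dir="outputs/videos",
                  save_dir=None):
    """Build the argv list for a single-GPU or distributed generate.py run."""
    if resolution not in ("2k", "4k"):
        raise ValueError("resolution must be '2k' or '4k'")
    args = [
        "generate.py",
        "--ckpt_dir", f"{weight_root}/Wan2.2-TI2V-5B",
        "--lr_ckpt", f"{weight_root}/PixelWizard/lr/model.pt",
        "--hr_ckpt", f"{weight_root}/PixelWizard/{resolution}/model.pt",
        "--prompt_file", prompt_file,
        "--video_dir", video_dir,
        "--resolution", resolution,
    ]
    if save_dir is not None:
        # Optional: also save HR latent .pt files.
        args += ["--save_dir", save_dir]
    if n_gpus > 1:
        # Distributed form: torchrun plus the FSDP/Ulysses flags.
        return ["torchrun", "--standalone", f"--nproc_per_node={n_gpus}",
                *args, "--dit_fsdp", "--t5_fsdp", "--ulysses_size", str(n_gpus)]
    return ["python", *args]


if __name__ == "__main__":
    print(shlex.join(build_command("2k")))
    print(shlex.join(build_command("4k", n_gpus=8, save_dir="outputs/hr_latents")))
```

The resulting list can be passed to `subprocess.run` as-is; `shlex.join` is used here only to print a copy-pasteable command line.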

Resolution Presets

| Preset | Anchor Resolution | HR Resolution | HR Steps | Shift | Decode Patches |
| --- | --- | --- | --- | --- | --- |
| `2k` | 448x256 | 2560x1440 | 4 | 5.5 | 3 |
| `4k` | 448x256 | 3840x2144 | 4 | 5.8 | 4 |
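
For scripting, the preset table can be mirrored as a small data structure. This is an illustrative sketch only: the `Preset` class and `pixel_ratio` helper are hypothetical names, the values are copied from the table, and resolutions are assumed to be written as width x height.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    anchor: tuple        # (width, height) of the LR anchor
    hr: tuple            # (width, height) of the HR output
    hr_steps: int
    shift: float
    decode_patches: int


# Values copied from the preset table.
PRESETS = {
    "2k": Preset((448, 256), (2560, 1440), 4, 5.5, 3),
    "4k": Preset((448, 256), (3840, 2144), 4, 5.8, 4),
}


def pixel_ratio(preset):
    """How many HR pixels are generated per anchor pixel, per frame."""
    hr_pixels = preset.hr[0] * preset.hr[1]
    anchor_pixels = preset.anchor[0] * preset.anchor[1]
    return hr_pixels / anchor_pixels


if __name__ == "__main__":
    for name, preset in PRESETS.items():
        print(f"{name}: {pixel_ratio(preset):.2f}x pixels per frame over the anchor")
```

The ratio gives a sense of how much more spatial detail the HR stage has to produce relative to the anchor.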

generate.py processes prompts one by one: LR anchor latent -> HR latent -> decoded video, then moves to the next prompt. The default --model_load_mode auto keeps models resident with CPU offload for single-process runs and reloads LR/HR models per prompt for distributed runs to reduce peak memory.

Decode options:

--num_patches: number of spatial chunks for HR VAE decode.
--patch_dim: decode split dimension, w by default.
--overlap: latent-space overlap between chunks, blended with a cosine ramp.
--vae_path: optional path to the Wan2.2 VAE checkpoint. If omitted, the VAE under --ckpt_dir is used.
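
To make the --num_patches and --overlap options more concrete, here is a one-dimensional sketch of overlapped chunking with a cosine-ramp crossfade. It is an illustration under assumptions, not the repository's implementation. The function names are hypothetical, the "decode" step is the identity, each chunk is assumed to extend by `overlap` into each neighbour, and the real pipeline works on tensors rather than Python lists.

```python
import math


def chunk_bounds(width, num_patches, overlap):
    """Split [0, width) into num_patches ranges, each widened by `overlap`."""
    edges = [i * width // num_patches for i in range(num_patches + 1)]
    return [(max(0, edges[i] - overlap), min(width, edges[i + 1] + overlap))
            for i in range(num_patches)]


def cosine_weights(length, ramp_left, ramp_right):
    """Weights of 1.0 with half-cosine ramps of the given lengths at the ends."""
    weights = [1.0] * length
    for k in range(min(ramp_left, length)):
        ramp = 0.5 * (1 - math.cos(math.pi * (k + 0.5) / ramp_left))
        weights[k] = min(weights[k], ramp)
    for k in range(min(ramp_right, length)):
        ramp = 0.5 * (1 - math.cos(math.pi * (k + 0.5) / ramp_right))
        weights[length - 1 - k] = min(weights[length - 1 - k], ramp)
    return weights


def blend_chunks(chunks, bounds, width):
    """Crossfade per-chunk outputs back into one sequence of length `width`."""
    acc = [0.0] * width
    norm = [0.0] * width
    for i, ((start, end), chunk) in enumerate(zip(bounds, chunks)):
        # Ramp over the region shared with the previous / next chunk.
        left = bounds[i - 1][1] - start if i > 0 else 0
        right = end - bounds[i + 1][0] if i < len(bounds) - 1 else 0
        weights = cosine_weights(end - start, left, right)
        for k, value in enumerate(chunk):
            acc[start + k] += weights[k] * value
            norm[start + k] += weights[k]
    # Normalising keeps the result exact even where ramps do not sum to 1.
    return [a / n for a, n in zip(acc, norm)]


if __name__ == "__main__":
    signal = [float(v) for v in range(32)]
    bounds = chunk_bounds(len(signal), num_patches=4, overlap=2)
    chunks = [signal[s:e] for s, e in bounds]  # stand-in for per-chunk decode
    blended = blend_chunks(chunks, bounds, len(signal))
    print(max(abs(a - b) for a, b in zip(blended, signal)))
```

With an identity "decode", the blended result matches the input, which is a quick sanity check that the ramps are complementary. With a real decoder, the crossfade smooths out differences between neighbouring chunks at the seams.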

Citation

If PixelWizard is useful for your research, please cite our paper. BibTeX will be updated after publication.

@misc{pixelwizard,
  title   = {PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolutions},
  author  = {Li, Wenxue and Ren, Jingjing and Zhang, Peng and Ye, Tian and Zhou, Daiguo and Luan, Jian and Zhu, Lei},
  year    = {2026}
}

Acknowledgements

PixelWizard is built on Wan2.2. We thank the Wan team for releasing their open video generation models and infrastructure.
