Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints

ECCV 2026

Chenyangguang Zhang · Botao Ye · Boqi Chen · Alexandros Delitzas · Fangjinhua Wang · Marc Pollefeys · Xi Wang

Given a starting frame and a sequence of 3D hand-joint trajectories, our method generates egocentric hand-object interaction videos that faithfully follow the prescribed hand motion — using occlusion-aware conditioning on sparse 3D hand joints instead of dense 2D tracks or projected pose tokens.

Overview

This repository contains the official implementation of "Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints", built on top of the Wan2.1-I2V-14B-480P video diffusion backbone.

The model takes a starting frame and a sequence of 3D hand-joint trajectories and synthesizes an egocentric video that follows the prescribed hand motion. It relies on:

an occlusion-aware conditioning module that down-weights unreliable features from occluded joints,
3D geometric embeddings injected directly into the latent space (no extra tokenizer / dense 2D projection required),
a released HandConditioningModule trained on top of a frozen Wan2.1 backbone via LoRA.
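
To make the first item concrete, here is a minimal sketch of visibility-weighted pooling over per-joint features. It is not the released HandConditioningModule: the function name, the per-joint visibility score in [0, 1], and the weighted-mean pooling are all assumptions chosen only to illustrate what "down-weighting occluded joints" can mean.

```python
from typing import List, Sequence


def occlusion_weighted_mean(
    joint_feats: Sequence[Sequence[float]],
    visibility: Sequence[float],
    eps: float = 1e-8,
) -> List[float]:
    """Pool per-joint feature vectors, down-weighting occluded joints.

    joint_feats: one feature vector per joint (hypothetical layout).
    visibility:  one score per joint in [0, 1]; 0 means fully occluded (assumed).
    """
    if not joint_feats:
        raise ValueError("need at least one joint")
    if len(joint_feats) != len(visibility):
        raise ValueError("need exactly one visibility score per joint")
    dim = len(joint_feats[0])
    total = sum(visibility)
    pooled = [0.0] * dim
    for feat, vis in zip(joint_feats, visibility):
        for d in range(dim):
            pooled[d] += vis * feat[d]
    # eps keeps the division defined when every joint is occluded.
    return [p / (total + eps) for p in pooled]


feats = [[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
vis = [1.0, 1.0, 0.0]  # the third joint is occluded, so it contributes nothing
pooled = occlusion_weighted_mean(feats, vis)
print(pooled)
```

With a hard 0/1 mask the occluded joint is ignored entirely; fractional scores would blend it in proportionally.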

Checkpoints

🤗 CyrusZhang312/JointControlvideo

The checkpoint bundle contains two files: dit.safetensors (LoRA weights + expanded patch_embedding) and hand-controller.safetensors (the HandConditioningModule).
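
After downloading the bundle, a quick sanity check can confirm that both files are present. The sketch below uses only the two file names stated above; the helper name and the temporary directory used for the demonstration are illustrative assumptions.

```python
import tempfile
from pathlib import Path
from typing import List

# File names as listed in the checkpoint bundle description.
EXPECTED_FILES = ("dit.safetensors", "hand-controller.safetensors")


def missing_checkpoint_files(ckpt_dir: str) -> List[str]:
    """Return the expected checkpoint files that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demonstration on a throwaway directory that holds only one of the two files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dit.safetensors").touch()
    missing = missing_checkpoint_files(tmp)
print("missing:", missing)
```

In practice, point `missing_checkpoint_files` at your local download directory.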

Dataset

🤗 [Coming soon] — an automatically-annotated egocentric dataset of hand-object interaction clips paired with precise 3D hand trajectories.

Environment setup

Training was run, and the released checkpoints were produced, inside the NGC pytorch:25.03 container (PyTorch 2.7, CUDA 12.8, Python 3.12) with a layered Python venv. The recipe below reproduces that environment with a standard conda + pip install; no NGC image is required.

1. Create the conda environment

conda create -n handcontrolvideo python=3.12 -y
conda activate handcontrolvideo

2. Install PyTorch (pick the wheel matching your CUDA driver)

# CUDA 12.8 (matches the NGC 25.03 release used for training)
pip install torch==2.7.0 torchvision --index-url https://download.pytorch.org/whl/cu128

# CUDA 12.6 fallback (PyTorch 2.7.0 wheels are not published on the cu124 index)
# pip install torch==2.7.0 torchvision --index-url https://download.pytorch.org/whl/cu126
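
Once the wheel is installed, it can help to confirm which CUDA build was actually picked up. The following is a small sketch rather than part of the repository; the function name is hypothetical, and it degrades gracefully if PyTorch is not installed yet.

```python
import importlib
from typing import Any, Dict


def torch_cuda_report() -> Dict[str, Any]:
    """Summarize the installed PyTorch build, or report that it is missing."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "torch": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
    }


report = torch_cuda_report()
print(report)
```

If `cuda_available` is false while `cuda_build` is set, the installed driver is probably too old for the chosen wheel.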

3. Install the rest of the dependencies

pip install -r requirements.txt
pip install -e .                          # registers the `diffsynth` package

requirements.txt pins everything to the exact versions that produced the released checkpoints (accelerate==1.12.0, transformers==4.57.2, peft==0.18.0, h5py==3.15.1, lmdb==1.7.5, av==16.0.1, opencv-python-headless==4.11.0.86, smplx==0.1.28, chumpy==0.70, etc.).
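
To check that an existing environment matches these pins, something like the sketch below can be used. It compares installed versions against a subset of the pins quoted above using the standard library's `importlib.metadata`; the helper name is an assumption, and requirements.txt remains the source of truth.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict

# A subset of the pins quoted in the text above.
PINS = {
    "accelerate": "1.12.0",
    "transformers": "4.57.2",
    "peft": "0.18.0",
    "h5py": "3.15.1",
    "lmdb": "1.7.5",
    "av": "16.0.1",
}


def pin_mismatches(pins: Dict[str, str]) -> Dict[str, str]:
    """Map each package that deviates from its pin to a short explanation."""
    problems: Dict[str, str] = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            problems[name] = f"not installed (want {wanted})"
            continue
        if found != wanted:
            problems[name] = f"found {found} (want {wanted})"
    return problems


for name, message in pin_mismatches(PINS).items():
    print(f"{name}: {message}")
```

No output means every listed package is installed at the pinned version.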

4. (Optional) HuggingFace cache

The Wan2.1 backbone (text encoder, VAE, CLIP, base DiT) is downloaded from HuggingFace on first use; set HF_HOME to keep it in a known location:

export HF_HOME=$PWD/.hf_cache
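
The same setting can be applied from Python when a shell export is inconvenient, for example in a notebook. This is a sketch with a hypothetical helper name; it assumes the variable is set before any HuggingFace library is imported, since those libraries read it at import time.

```python
import os
from pathlib import Path


def ensure_hf_home(default_dir: str = ".hf_cache") -> str:
    """Set HF_HOME to an absolute path unless it is already defined."""
    target = str(Path(default_dir).resolve())
    # setdefault leaves a value exported by the user untouched.
    return os.environ.setdefault("HF_HOME", target)


hf_home = ensure_hf_home()
print("HF_HOME =", hf_home)
```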

Inference

A single-sample demo script reads a 3D hand-joint trajectory from the dataset and renders one controlled video:

python inf_wm_hand.py

Output goes to outputs/. To use your own data, edit the dataset paths near the bottom of the script.
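
When preparing your own trajectories, it may help to validate their shape before running the script. The on-disk layout expected by inf_wm_hand.py is not documented here, so the (frames, joints, xyz) nesting below is an assumption; only the joint count of 42 comes from the trainer flag described under Training.

```python
from typing import Sequence

NUM_JOINTS = 42  # matches the --num_joints 42 trainer flag

Frame = Sequence[Sequence[float]]


def validate_trajectory(traj: Sequence[Frame], num_joints: int = NUM_JOINTS) -> int:
    """Check an assumed (frames, joints, xyz) layout and return the frame count."""
    if not traj:
        raise ValueError("trajectory has no frames")
    for t, frame in enumerate(traj):
        if len(frame) != num_joints:
            raise ValueError(
                f"frame {t}: expected {num_joints} joints, got {len(frame)}"
            )
        for j, joint in enumerate(frame):
            if len(joint) != 3:
                raise ValueError(f"frame {t}, joint {j}: expected 3 coordinates")
    return len(traj)


# Toy input: 5 frames in which every joint drifts along x by 0.01 per frame.
toy = [[[0.01 * t, 0.0, 0.0] for _ in range(NUM_JOINTS)] for t in range(5)]
num_frames = validate_trajectory(toy)
print(num_frames)
```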

Training

bash train_script_joint_control.sh

This runs accelerate launch against the trainer (examples/wanvideo/model_training/train.py). Key trainer flags:

--num_joints 42 — MANO joint count for the HandConditioningModule
--use_lmdb_dataset — read videos + H5 joint trajectories from an LMDB store

Adjust --num_machines / --num_processes in accelerate launch to match your setup. Training was orchestrated under SLURM; the sbatch wrappers are cluster-specific and therefore not included. The accelerate launch block inside the shell script is the portable part.
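
As a rough illustration of how the two accelerate flags relate, the sketch below assembles a launch command for a given cluster shape. The helper name is hypothetical, the trainer path and the two trainer flags are the ones listed above, and the real shell script may pass further arguments (for example multi-node rendezvous options) that are omitted here.

```python
import shlex
from typing import List

TRAINER = "examples/wanvideo/model_training/train.py"


def build_launch_command(num_machines: int, gpus_per_machine: int) -> List[str]:
    """Assemble an accelerate launch argv for the given cluster shape."""
    if num_machines < 1 or gpus_per_machine < 1:
        raise ValueError("need at least one machine and one GPU per machine")
    # accelerate's --num_processes is the total across all machines.
    num_processes = num_machines * gpus_per_machine
    return [
        "accelerate", "launch",
        "--num_machines", str(num_machines),
        "--num_processes", str(num_processes),
        TRAINER,
        "--num_joints", "42",
        "--use_lmdb_dataset",
    ]


cmd = build_launch_command(num_machines=2, gpus_per_machine=8)
print(shlex.join(cmd))
```

For a single 8-GPU node, `build_launch_command(1, 8)` yields `--num_machines 1 --num_processes 8`.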

Acknowledgement

This codebase is a focused fork of DiffSynth-Studio; the Wan2.1 pipeline, model loader and training framework are upstream components.

Citation

If you find this work useful, please consider citing:

@inproceedings{zhang2026controllable,
  title     = {Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints},
  author    = {Zhang, Chenyangguang and Ye, Botao and Chen, Boqi and Delitzas, Alexandros and Wang, Fangjinhua and Pollefeys, Marc and Wang, Xi},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
