Official implementation for the paper:
DuoMo: Dual Motion Diffusion for World-space Human Reconstruction
Yufu Wang, Evonne Ng, Soyong Shin, Rawal Khirodkar, Yuan Dong, Zhaoen Su, Jinhyung Park
Kris Kitani, Alexander Richard, Fabian Prada, Michael Zollhoefer
[Project Page]
[Arxiv]
- Clone the repository.
git clone https://github.com/facebookresearch/duomo.git
cd duomo
- Set up the Python environment.
conda create -n duomo python=3.12
conda activate duomo
- Install dependencies.
# Set CUDA_HOME
export CUDA_HOME=$CONDA_PREFIX
# Install CUDA toolkit
conda install -c nvidia cuda-toolkit=12.8
# Install standard packages (with PyTorch cu128 wheel specified)
pip install -r requirements.txt
# Compile and install custom GitHub packages
pip install "git+https://github.com/mattloper/chumpy@9b045ff5d6588a24a0bab52c83f032e2ba433e17" --no-build-isolation
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable" --no-build-isolation
- Install third-party dependencies into third_party/.
# Pulling external repositories (e.g., GVHMR, PromptHMR)
bash scripts/install_third_party.sh
Run the following commands to download all checkpoints and processed dataset features into data/. The second command will prompt you to register and log in to access SMPL.
# Checkpoints and annotations
bash scripts/download_data.sh
# SMPLX family models
bash scripts/download_smplx.sh
Demo for an example video with a static camera:
# By default it assumes static camera
python scripts/inference.py --video_path data/dance.mp4
By default, the demo assumes a static camera setup. While this release does not include a SLAM module, the inference pipeline supports moving cameras. If you have pre-computed camera motion (e.g., from SLAM or device sensors), you can provide it as input.
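The static-camera and moving-camera invocations differ only in optional arguments. The following is a minimal sketch of a wrapper that assembles the command line, assuming only the flags that appear in this README (--video_path, --camera_param, --boxes). The helper name build_inference_cmd is hypothetical and not part of the repository.

```python
import sys
from typing import List, Optional


def build_inference_cmd(
    video_path: str,
    camera_param: Optional[str] = None,
    boxes: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for scripts/inference.py.

    Hypothetical helper: only the flags shown in this README are used.
    With no camera_param, the demo assumes a static camera.
    """
    cmd = [sys.executable, "scripts/inference.py", "--video_path", str(video_path)]
    if camera_param is not None:
        # Pre-computed camera motion (e.g., from SLAM or device sensors)
        cmd += ["--camera_param", str(camera_param)]
    if boxes is not None:
        # Pre-computed bounding boxes
        cmd += ["--boxes", str(boxes)]
    return cmd


# Usage (run from the repository root):
#   import subprocess
#   subprocess.run(build_inference_cmd("data/dance.mp4"), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.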
As an example, we sample a video from EMDB with precomputed camera poses from TRAM and ground truth bounding boxes. See scripts/data_prep/create_emdb_example.py for the definition of the camera parameters.
# Get an EMDB video, save to data/
python scripts/data_prep/create_emdb_example.py --dataset_dir /your_emdb_dir
# Run inference on the video with SLAM camera poses and GT boxes
python scripts/inference.py --video_path data/emdb_sample.mp4 --camera_param data/emdb_sample_cam.pt --boxes data/emdb_sample_boxes.pt
Output Coordinates: When camera extrinsics are provided, the final reconstructed motion is aligned to the coordinate system of the first video frame (the first camera pose is the world origin).
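To illustrate what "first camera pose as world origin" means, here is a minimal, dependency-free sketch. It assumes the camera poses are 4x4 camera-to-world rigid transforms. That convention is an assumption made for illustration, and the repository's actual camera parameter layout is defined in scripts/data_prep/create_emdb_example.py and may differ. The function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def invert_rigid(T: Matrix) -> Matrix:
    """Invert a 4x4 rigid transform [R | t] as [R^T | -R^T t]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]  # transposed rotation
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def align_to_first_frame(cam_to_world: Sequence[Matrix]) -> List[Matrix]:
    """Re-express all poses so that the first camera pose becomes the identity.

    Assumes camera-to-world matrices (an illustrative convention only).
    """
    T0_inv = invert_rigid(cam_to_world[0])
    return [matmul(T0_inv, T) for T in cam_to_world]
```

After this alignment the first pose is the identity, so any world-space quantity expressed in the new frame is measured relative to the first video frame's camera. In practice the same operation would typically be written with NumPy or PyTorch batched matrix products.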
We provide pre-computed features (dense keypoints, image features, etc.) under data/processed for evaluation. However, to complete the evaluation, you need to obtain the annotations from the official EMDB, RICH, and EgoBody websites. For EgoBody, only a subset is needed, which you can download with scripts/download_egobody.py. After obtaining the official annotations, combine them with our pre-computed features as follows. Please see scripts/data_prep/README.md for more details.
# EMDB
python scripts/data_prep/process_emdb.py --dataset_dir /your_emdb_dir
# EgoBody
python scripts/data_prep/process_egobody.py --dataset_dir /your_egobody_dir
After that, the dataset files in data/processed are updated. Use the following command for the actual evaluation.
# Available datasets: EMDB, RICH, EgoBody
python scripts/evaluation.py --dataset emdb2
We have included the full training pipeline for our models. However, we do not provide preprocessed dataset labels. To train the models from scratch, you will need to implement your own dataset preprocessing. We provide some example preprocessing scripts (e.g., scripts/data_prep/process_amass.py). Please refer to our data preprocessing and loading implementation to understand the expected structure. The following commands run the training loop.
# Training the camera-space motion diffusion model
sbatch scripts/train_stage1.sh
# Training the world-space motion diffusion model
sbatch scripts/train_stage2.sh
DuoMo/
├── src/
│ ├── data/ # <-- dataset loading
│ ├── models/ # <-- architecture
│ ├── processors/ # <-- wrapper for third-party processes
│ ├── recipes/ # <-- configurations
│ ├── trainer/ # <-- training
│ ├── utils/
│ ├── vis/
│ ├── __init__.py
│ ├── inference.py # <-- pipeline
│ └── evaluation.py # <-- evaluation
│
├── third_party/ (after installation)
│ ├── GVHMR/ # <-- for synthetic camera motion on AMASS
│ └── PromptHMR/ # <-- for image feature encoding
│
├── scripts/ # <-- demo, training and evaluation scripts
└── data/ # <-- holds checkpoints and processed data
@article{wang2026duomo,
title={DuoMo: Dual Motion Diffusion for World-space Human Reconstruction},
author={Wang, Yufu and Ng, Evonne and Shin, Soyong and Khirodkar, Rawal and Dong, Yuan and Su, Zhaoen and Park, Jinhyung and Kitani, Kris and Richard, Alexander and Prada, Fabian and Zollhoefer, Michael},
year={2026}
}
DuoMo is licensed under the XRCIA Noncommercial Research License Agreement License. A copy of the license can be found here.
