
MoCapAnything V2

End-to-End Motion Capture for Arbitrary Skeletons from Monocular Videos

Project Page · Paper (arXiv) · Install & Run

⚠️ Unofficial code release. This repository is a reimplementation based on the paper — use as a reference, not a reproduction.

▶ Watch the 90-second teaser on the project page.

Highlights

  • 🔗 Fully end-to-end. Video → Pose → Rotation jointly optimized; no analytical IK in the loop.
  • Reference-anchored rotation. A single reference pose–rotation pair from the target asset defines the rotation coordinate system, turning pose-to-rotation into a well-constrained problem (see the sketch after this list).
  • Mesh-free and fast. Joints are predicted directly from video, ~20× faster than mesh-based pipelines.
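
The reference-anchored formulation can be made concrete with a minimal sketch. Everything below is a hypothetical illustration, not this repository's API: it assumes rotations are 3×3 matrices and that the network emits a per-joint 6D rotation delta, which is composed with the reference rotation so that the single reference pose–rotation pair fixes the coordinate system.

```python
# Hypothetical sketch (not the repository's API): compose a predicted
# per-joint delta with the reference rotation, so a single reference
# pose-rotation pair fixes the rotation coordinate system.
import torch
import torch.nn.functional as F

def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    """Map a 6D rotation representation to a 3x3 matrix via Gram-Schmidt."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack((b1, b2, b3), dim=-2)

def anchor_rotations(delta_6d: torch.Tensor, ref_rot: torch.Tensor) -> torch.Tensor:
    """Compose predicted deltas (B, J, 6) with reference rotations (J, 3, 3)."""
    delta = rotation_6d_to_matrix(delta_6d)  # (B, J, 3, 3)
    return delta @ ref_rot                   # rotations expressed relative to the reference
```

Because every prediction is an offset from the reference, the network never has to guess an absolute joint frame, which is what makes pose-to-rotation well constrained.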

Pipeline

The V2 main model is video2pose2rot — a single end-to-end network that maps a video directly to BVH-ready joint rotations. Internally it composes two subtasks, video2pose and pose2rot, that share weights and are jointly fine-tuned; they can be run standalone (e.g. for ablations or debugging), but normal usage is the joint model. The V1 mesh-based pipeline (video2mesh + mesh2pose) is included as a baseline for comparison.

| Stage | Role | Input | Output |
| --- | --- | --- | --- |
| video2pose2rot | V2: main end-to-end model | Image sequence + reference | Joint rotations (BVH) |
| ↳ video2pose | V2 subtask (standalone-runnable) | Image sequence + reference pose | Joint positions |
| ↳ pose2rot | V2 subtask (standalone-runnable) | Joint positions + rest pose / reference pose–rot pair | Joint rotations |
| video2mesh | V1 baseline: mesh sequence (TripoSG) | Image sequence | Mesh (.glb / latent) |
| mesh2pose | V1 baseline: joints from per-frame meshes | Mesh sequence + reference pose | Joint positions |
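
A rough structural sketch of this composition follows; the class and argument names are illustrative assumptions, not the repository's actual interfaces:

```python
# Hypothetical structural sketch; names are illustrative, not the repo's classes.
import torch.nn as nn

class VideoToPoseToRot(nn.Module):
    """End-to-end: image sequence -> joint positions -> joint rotations."""

    def __init__(self, video2pose: nn.Module, pose2rot: nn.Module):
        super().__init__()
        self.video2pose = video2pose  # image sequence + reference pose -> joint positions
        self.pose2rot = pose2rot      # joint positions + reference pair -> joint rotations

    def forward(self, frames, ref_pose, ref_rot):
        joints = self.video2pose(frames, ref_pose)       # (B, T, J, 3)
        rots = self.pose2rot(joints, ref_pose, ref_rot)  # (B, T, J, 3, 3)
        return rots  # BVH-ready; each submodule remains callable on its own
```

Since both subtasks sit in one differentiable graph, gradients from the rotation loss flow back through the pose predictor, which is what "no analytical IK in the loop" refers to.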

A reference frame from a matching species guides the per-species skeleton and scale, enabling generalization to unseen animals.
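
One plausible realization of this conditioning, again a hedged sketch whose module name and shapes are assumptions rather than the repository's code, is to embed the reference skeleton's rest-pose joints and let per-frame video tokens cross-attend to them:

```python
# Hypothetical reference conditioning (illustrative only): per-frame video
# tokens cross-attend to an embedding of the reference skeleton, letting one
# network adapt its joint count and proportions to the target species.
import torch
import torch.nn as nn

class ReferenceConditioner(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.joint_embed = nn.Linear(3, dim)  # rest-pose xyz -> token
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, ref_pose: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, dim); ref_pose: (B, J, 3) rest-pose joint positions
        ref_tokens = self.joint_embed(ref_pose)                   # (B, J, dim)
        out, _ = self.attn(video_tokens, ref_tokens, ref_tokens)  # cross-attention
        return video_tokens + out                                 # residual conditioning

# Usage: ReferenceConditioner()(torch.randn(2, 196, 256), torch.randn(2, 24, 3))
```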

Install & Run

Environment setup, dataset layout, training commands, and inference (including in-the-wild mode) live in RUN.md.

Citation

If you use this code, please consider citing:

@article{gong2026mocapanythingv2,
  title   = {MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons},
  author  = {Gong, Kehong and Wen, Zhengyu and Phong, Dao Thien and
             Xu, Mingxi and He, Weixia and Wang, Qi and Zhang, Ning and
             Li, Zhengyu and Hou, Guanli and Lian, Dongze and He, Xiaoyu and
             Zhang, Mingyuan and Zhang, Hanwang},
  journal = {arXiv preprint arXiv:2604.28130},
  year    = {2026}
}

If you build on the V1 baselines, please also cite the underlying papers — mesh2pose is from MoCapAnything (V1), and video2mesh is from SWiT-4D:

@article{gong2025mocapanything,
  title   = {MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos},
  author  = {Gong, Kehong and Wen, Zhengyu and He, Weixia and Xu, Mingxi and
             Wang, Qi and Zhang, Ning and Li, Zhengyu and
             Lian, Dongze and Zhao, Wei and He, Xiaoyu and Zhang, Mingyuan},
  journal = {arXiv preprint arXiv:2512.10881},
  year    = {2025}
}

@article{gong2025swit4d,
  title   = {SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation},
  author  = {Gong, Kehong and Wen, Zhengyu and Xu, Mingxi and He, Weixia and Wang, Qi and
             Zhang, Ning and Li, Zhengyu and Li, Chenbin and Lian, Dongze and
             Zhao, Wei and He, Xiaoyu and Zhang, Mingyuan},
  journal = {arXiv preprint arXiv:2512.10860},
  year    = {2025}
}

License

See the repository for license information.
