A review of the papers listed on paperswithcode.com shows that the majority of SOTA monocular 3D human pose estimation pipelines increasingly rely on Transformer-based models. Despite significant strides in monocular 3D human pose estimation, a considerable performance gap with multi-view systems still persists.
This repository implements a monocular 3D human pose estimation pipeline based on the paper MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation (Wenhao Li, Hong Liu, Hao Tang, Pichao Wang, and Luc Van Gool; IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022).
While the authors' original research code and pre-trained models form the foundation for this repository, some refactoring and optimizations have been carried out to transform "research-like" code into a more "inference-ready" format.
Furthermore, a simple yet effective approach to recognizing the clapping-above-the-head action is proposed. This heuristic tracks the predicted 3D positions of the wrist joints: a clap above the head is detected when the distance between these joints drops below a certain threshold, and several additional conditions must be met to avoid counting the same clap across consecutive frames. A minimal sketch of the idea is shown below. Note, however, that there are certainly more powerful deep-learning-based approaches available, as detailed here: paper, paper and paper.
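The sketch below illustrates the heuristic. The joint indices (Human3.6M ordering), the distance threshold, the head-height condition, the y-up axis convention, and the debounce window are illustrative assumptions rather than the exact values used in this repository.

```python
# A minimal sketch of the clapping-above-the-head heuristic. Joint indices,
# thresholds, and the head-height condition are illustrative assumptions.
import numpy as np

LEFT_WRIST, RIGHT_WRIST, HEAD = 13, 16, 10   # assumed Human3.6M joint indices
WRIST_DISTANCE_THRESHOLD = 0.15              # metres (assumed)
MIN_FRAMES_BETWEEN_CLAPS = 15                # debounce window in frames (assumed)


def count_claps_above_head(poses_3d):
    """Count claps above the head in a sequence of 3D poses.

    poses_3d: array of shape (num_frames, num_joints, 3), y axis pointing up (assumed).
    """
    clap_count = 0
    last_clap_frame = -MIN_FRAMES_BETWEEN_CLAPS
    for frame_idx, pose in enumerate(poses_3d):
        left, right, head = pose[LEFT_WRIST], pose[RIGHT_WRIST], pose[HEAD]
        wrists_close = np.linalg.norm(left - right) < WRIST_DISTANCE_THRESHOLD
        wrists_above_head = left[1] > head[1] and right[1] > head[1]
        # Additional condition: do not count the same clap in consecutive frames.
        debounced = frame_idx - last_clap_frame >= MIN_FRAMES_BETWEEN_CLAPS
        if wrists_close and wrists_above_head and debounced:
            clap_count += 1
            last_clap_frame = frame_idx
    return clap_count
```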
For more detailed information on the original research and model, please refer to the MHFormer paper and the authors' original code.
It is also worth noting that there are many potential enhancements, such as using different temporal filters (e.g., an Extended Kalman Filter; see the smoothing sketch below), better and faster landmark detectors like OpenPose or MoveNet (or custom ones), and further optimizing the Transformer model for inference on low-end devices. And, more generally, simply switching to a multi-view system :)
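As an illustration of the filtering idea, the sketch below smooths each 3D joint coordinate with a plain linear constant-velocity Kalman filter (a simplification of the Extended Kalman Filter mentioned above). The (num_frames, num_joints, 3) landmark layout and the noise parameters are assumptions, not values taken from this repository.

```python
# A minimal sketch of temporal smoothing for predicted 3D landmarks using a
# per-coordinate linear Kalman filter with a constant-velocity model.
import numpy as np


def kalman_smooth_1d(z, process_var=1e-4, meas_var=1e-2):
    """Smooth a single 1D trajectory z[t] with a constant-velocity Kalman filter."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition: position + velocity
    H = np.array([[1.0, 0.0]])               # only the position is observed
    Q = process_var * np.eye(2)              # process noise covariance (assumed)
    R = np.array([[meas_var]])               # measurement noise covariance (assumed)

    x = np.array([z[0], 0.0])                # initial state: first measurement, zero velocity
    P = np.eye(2)                            # initial state covariance
    out = np.empty_like(z)

    for t, zt in enumerate(z):
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        # Update
        y = zt - H @ x                       # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
        x = x + (K @ y).ravel()
        P = (np.eye(2) - K @ H) @ P
        out[t] = x[0]
    return out


def smooth_landmarks(landmarks_3d):
    """Apply the 1D filter independently to every joint coordinate.

    landmarks_3d: array of shape (num_frames, num_joints, 3) (assumed layout).
    """
    smoothed = np.empty_like(landmarks_3d)
    for j in range(landmarks_3d.shape[1]):
        for c in range(3):
            smoothed[:, j, c] = kalman_smooth_1d(landmarks_3d[:, j, c])
    return smoothed
```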
- Put the pretrained MHFormer model in the ./mhformer_checkpoints/receptive_field_81 directory, and the pretrained HRNet and YOLOv3 models in the ./pose_estimation/utils/data directory. The resulting structure should look like this:
.
├── mhformer_checkpoints
│   └── receptive_field_81
│       └── mhformer_pretrained.pth
├── pose_estimation
│   └── utils
│       └── data
│           ├── yolov3.weights
│           ├── pose_hrnet_w48_384x288.pth
│           ...
- Put an in-the-wild video in the ./data/input directory.
- Build the image:
docker build -t <tag name> .
- Run the container:
docker run -v $(pwd)/data/input:/app/data/input -v $(pwd)/data/output:/app/data/output -it <tag name> /bin/bash
Inside the container:
cd ./scripts && python run_pose_estimation.py --video '../data/input/<video_name>.mp4'
- --video : path to the input video.
Results will be stored in ./data/output/<video_name>/ in the following format:
.
├── data
│   ├── output
│   │   └── <video_name>
│   │       ├── 2d+3d.mp4 (predicted 2D and 3D skeletons with the clapping counter overlaid)
│   │       ├── body_2d_landmarks.json
│   │       ├── body_3d_landmarks.json
│   │       └── 2d+3d
│   │           ├── 0000.png
│   │           ├── 0001.png
│   │           ...
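The exported landmarks can be loaded for further analysis. The snippet below is a minimal sketch that assumes body_3d_landmarks.json stores one list of [x, y, z] joint coordinates per frame; adjust the parsing to the actual schema written by run_pose_estimation.py.

```python
import json

import numpy as np

# NOTE: the assumed JSON layout (a list of per-frame [x, y, z] joint coordinates)
# may differ from the actual format produced by this repository.
with open("./data/output/<video_name>/body_3d_landmarks.json") as f:
    frames = json.load(f)

landmarks_3d = np.asarray(frames, dtype=np.float32)  # expected shape: (num_frames, num_joints, 3)
print("frames:", landmarks_3d.shape[0], "joints per frame:", landmarks_3d.shape[1])
```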
This project is licensed under the terms of the MIT license.