The table below shows the pose estimation models available for each task category.
| Category   | Model   | Documentation        |
| ---------- | ------- | -------------------- |
| Whole body | HRNet   | :mod:`model.hrnet`   |
|            | PoseNet | :mod:`model.posenet` |
|            | MoveNet | :mod:`model.movenet` |
The table below shows the frames per second (FPS) of each model type.
| Model        | Type                 | Size                  | CPU (single) | CPU (multiple) | GPU (single) | GPU (multiple) |
| ------------ | -------------------- | --------------------- | ------------ | -------------- | ------------ | -------------- |
| PoseNet      | 50                   | 225                   | 64.46        | 51.95          | 136.31       | 89.37          |
|              | 75                   | 225                   | 57.62        | 47.01          | 132.84       | 83.73          |
|              | 100                  | 225                   | 44.70        | 37.60          | 132.73       | 81.24          |
|              | resnet               | 225                   | 18.77        | 17.21          | 73.15        | 51.65          |
| HRNet (YOLO) | (v4tiny)             | 256 × 192 (416)       | 5.86         | 1.09           | 21.91        | 13.86          |
| MoveNet      | SinglePose Lightning | 192                   | 40.78        | 40.54          | 99.47        | --             |
|              | SinglePose Thunder   | 256                   | 25.13        | 24.87          | 92.05        | --             |
|              | MultiPose Lightning  | 256 or multiple of 32 | 25.33        | 24.90          | 80.64        | 79.32          |
- The following hardware was used to conduct the FPS benchmarks:
  - CPU: 2.8 GHz 4-Core Intel Xeon (2020, Cascade Lake) CPU and 16GB RAM
  - GPU: NVIDIA A100, paired with 2.2 GHz 6-Core Intel Xeon CPU and 85GB RAM
- The following test conditions were used:
  - The :mod:`input.visual`, the model of interest, and :mod:`dabble.fps` nodes were used to perform inference on videos (see the pipeline sketch after this list)
  - 2 videos were used to benchmark each model: one with only 1 human (*single*), and the other with multiple humans (*multiple*)
  - Both videos are about 1 minute each, recorded at ~30 FPS, which translates to about 1,800 frames to process per video
  - 1280×720 (HD ready) resolution was used, as a bridge between the 640×480 (VGA) of poorer quality webcams and the 1920×1080 (Full HD) of CCTVs
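As a concrete illustration, the sketch below assembles such a three-node pipeline, assuming PeekingDuck's Python API (`Runner` plus per-node constructors). The video path and the choice of :mod:`model.movenet` with `model_type="singlepose_lightning"` are placeholder values for this example; any of the models in the table above can be substituted.

```python
# A minimal sketch of the FPS benchmark pipeline, assuming PeekingDuck's
# Python API (Runner plus node constructors). "video.mp4" and the MoveNet
# model_type are placeholder values.
from peekingduck.pipeline.nodes.dabble import fps
from peekingduck.pipeline.nodes.input import visual
from peekingduck.pipeline.nodes.model import movenet
from peekingduck.runner import Runner

# input.visual reads frames from the benchmark video
visual_node = visual.Node(source="video.mp4")  # placeholder path

# The model node of interest; swap in another model node to benchmark it
model_node = movenet.Node(model_type="singlepose_lightning")

# dabble.fps computes the frames-per-second of the running pipeline
fps_node = fps.Node()

runner = Runner(nodes=[visual_node, model_node, fps_node])
runner.run()
```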
The table below shows the performance of our pose estimation models, measured with the keypoint evaluation metrics from COCO. A description of these metrics can be found in the COCO keypoint evaluation documentation.
| Model        | Type                 | Size            | AP   | AP (OKS=.50) | AP (OKS=.75) | AP (medium) | AP (large) | AR   | AR (OKS=.50) | AR (OKS=.75) | AR (medium) | AR (large) |
| ------------ | -------------------- | --------------- | ---- | ------------ | ------------ | ----------- | ---------- | ---- | ------------ | ------------ | ----------- | ---------- |
| PoseNet      | 50                   | 225             | 5.2  | 15.5         | 2.7          | 0.8         | 11.8       | 9.6  | 22.7         | 7.1          | 1.4         | 20.7       |
|              | 75                   | 225             | 7.2  | 19.7         | 3.6          | 1.3         | 15.9       | 12.1 | 26.5         | 9.3          | 2.2         | 25.5       |
|              | 100                  | 225             | 7.7  | 20.8         | 4.4          | 1.5         | 17.1       | 12.6 | 27.7         | 10.1         | 2.3         | 26.5       |
|              | resnet               | 225             | 11.9 | 27.4         | 8.3          | 2.2         | 25.3       | 17.3 | 32.5         | 15.9         | 2.9         | 36.8       |
| HRNet (YOLO) | (v4tiny)             | 256 × 192 (416) | 35.8 | 61.5         | 37.5         | 30.1        | 44.0       | 40.2 | 64.4         | 42.7         | 33.0        | 50.2       |
| MoveNet      | singlepose_lightning | 256 × 256       | 7.3  | 15.7         | 5.7          | 1.3         | 15.4       | 8.8  | 17.6         | 7.7          | 1.1         | 19.2       |
|              | singlepose_thunder   | 256 × 256       | 11.6 | 21.3         | 10.7         | 3.0         | 23.1       | 13.1 | 22.5         | 12.8         | 2.8         | 27.1       |
|              | multipose_lightning  | 256 × 256       | 18.7 | 36.8         | 16.3         | 9.0         | 31.8       | 21.0 | 38.5         | 19.2         | 9.3         | 37.0       |
The MS COCO (val 2017) dataset was used; all images from its "person" category were processed. We integrated the COCO API into the PeekingDuck pipeline for loading the annotations and evaluating the outputs from the models. All values are reported in percentages.
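For reference, the object keypoint similarity (OKS) underlying the AP and AR thresholds above is defined by COCO as:

$$
\mathrm{OKS} = \frac{\sum_i \exp\!\left(-d_i^2 / (2 s^2 k_i^2)\right)\, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}
$$

where $d_i$ is the Euclidean distance between detected keypoint $i$ and its ground truth, $v_i$ is the ground-truth visibility flag, $s$ is the object scale, and $k_i$ is a per-keypoint constant that controls falloff.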
- The following test conditions were used:
  - The tests were performed using pycocotools on the MS COCO dataset (a minimal evaluation sketch follows this list)
  - The evaluation metrics were compared against those reported in the original repositories of the respective pose estimation models for consistency
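The sketch below shows such an evaluation with pycocotools. The annotation and results file paths are placeholders, and the results file is assumed to hold model outputs in the COCO keypoint results format.

```python
# Minimal keypoint-evaluation sketch using pycocotools; file paths are
# placeholders, and "results.json" is assumed to contain model outputs
# in the COCO keypoint results format.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Load ground-truth annotations and the model's detections
coco_gt = COCO("annotations/person_keypoints_val2017.json")
coco_dt = coco_gt.loadRes("results.json")

# Evaluate with the OKS-based keypoint metrics reported in the table above
evaluator = COCOeval(coco_gt, coco_dt, iouType="keypoints")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP/AR at the various OKS thresholds
```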
The table below maps each whole-body keypoint to its ID.

| Keypoint       | ID | Keypoint    | ID |
| -------------- | -- | ----------- | -- |
| nose           | 0  | left wrist  | 9  |
| left eye       | 1  | right wrist | 10 |
| right eye      | 2  | left hip    | 11 |
| left ear       | 3  | right hip   | 12 |
| right ear      | 4  | left knee   | 13 |
| left shoulder  | 5  | right knee  | 14 |
| right shoulder | 6  | left ankle  | 15 |
| left elbow     | 7  | right ankle | 16 |
| right elbow    | 8  |             |    |
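As an illustration, the snippet below reads one keypoint by its ID, assuming the model's `keypoints` output is an array of shape `(num_people, 17, 2)` holding (x, y) coordinates ordered by the IDs in the table above.

```python
# Hypothetical example of indexing a keypoint by its ID, assuming the
# "keypoints" output is a (num_people, 17, 2) array of (x, y) coordinates
# ordered by the IDs in the table above.
import numpy as np

LEFT_WRIST = 9  # ID from the keypoint table

# Dummy detections for two people, standing in for real model output
keypoints = np.random.rand(2, 17, 2)

for person_id, person in enumerate(keypoints):
    x, y = person[LEFT_WRIST]
    print(f"person {person_id}: left wrist at ({x:.2f}, {y:.2f})")
```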