
About inference in real time #13

Closed
liamsun2019 opened this issue Apr 2, 2022 · 9 comments

Comments

@liamsun2019

Hi author,

Thanks for your excellent work. I did some training and tests based on your paper and code, and the results are good. I am now curious about real-time inference. My intention is to estimate the 3D coordinates while playing back a video. According to your strategy and demo code, estimating the pose of a center frame needs the 2D poses both before and after it, which means the 3D pose of a given frame cannot be obtained until the 2D poses after it have been computed. I have no idea how to handle such a case, and I don't think it's a good idea to just pad with dummy data such as zeros. I would appreciate any suggestions from you. Thanks.

@Vegetebird
Owner

Vegetebird commented Apr 4, 2022

Hi~ Thanks for your interest in our work.

MHFormer outputs a sequence of 3D poses, and we just select the pose of the center frame to evaluate the performance. You can directly take the whole sequence of 3D poses as output; it is more efficient, has negligible performance degradation, and can be applied to real-time scenarios.
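
For illustration, a minimal sketch of the two output choices described above (the tensor shapes are assumptions, not checked against the repository's code):

```python
import torch

# Hypothetical tensor standing in for model(input_2d):
# (batch, frames, joints, 3), e.g. a 243-frame window with 17 joints.
output_3d = torch.randn(1, 243, 17, 3)

# Evaluation protocol: keep only the pose of the center frame.
center_pose = output_3d[:, output_3d.shape[1] // 2]   # (batch, joints, 3)

# Real-time alternative suggested above: take the whole predicted sequence,
# so one forward pass yields a 3D pose for every frame in the window.
all_poses = output_3d                                  # (batch, frames, joints, 3)
```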

@liamsun2019
Author

Thanks for your prompt reply. I understand that the output 3D poses form a sequence. I'm just curious about the input preparation during inference. For example, I intend to get the 3D poses for all video frames one by one in real time. For the 3D pose estimation of the 1st video frame, I first get its 2D pose. Then I must prepare the input for MHFormer to get the 3D pose. From the code, I have to collect 243 frames of 2D poses (alternatives are 9/27/81...), among which 121 are before the frame and 121 are after it. But that information is simply not available at the moment the 1st video frame is processed. I wonder how to prepare the other 121x2 2D poses. Just set them to zeros? I don't think that's a good way.

@Vegetebird
Owner

Yeah, this is a video-based method; we need to feed a sequence to the model. For the first frame in real time, we pad with the edge values of the array rather than setting them to zeros.

On the other hand, in real-time applications the "causal setting" mentioned in [1] is more appropriate, i.e., the model takes a video sequence as input and outputs the pose of the final frame.

[1] Pavllo, et al. 3D human pose estimation in video with temporal convolutions and semi-supervised training, CVPR 2019.
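
Purely for illustration, a minimal sketch of the edge-value padding and the causal window (the shapes, names, and `receptive_field` value are assumptions, not the repository's exact code):

```python
import numpy as np

receptive_field = 243                  # e.g. the 243-frame model
pad = (receptive_field - 1) // 2       # 121 frames on each side

# 2D poses of the frames seen so far, shape (num_frames, 17, 2).
keypoints = np.zeros((5, 17, 2))       # hypothetical: only 5 frames available

# Centered window: repeat the first/last available pose instead of zero-filling.
padded = np.pad(keypoints, ((pad, pad), (0, 0), (0, 0)), mode='edge')
window_frame_0 = padded[0:receptive_field]              # centered on original frame 0

# Causal variant (Pavllo et al.): pad only on the left with edge values, so the
# window ends at the current frame and the prediction for the final frame is
# the pose of "now".
pad_left = max(receptive_field - keypoints.shape[0], 0)
causal = np.pad(keypoints, ((pad_left, 0), (0, 0), (0, 0)), mode='edge')
causal_window = causal[-receptive_field:]               # shape (243, 17, 2)
```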

@liamsun2019
Author

Thanks for your tip. I am now in a dilemma. I already have a 2D pose estimator that achieves a good balance between performance and speed, even after quantization and deployment on a mobile device. My thought is to combine it with MHFormer to build a real-time 3D pose estimator, i.e., first get the 2D poses and then acquire the 3D pose. Actually, I am a little confused about the training strategy. My understanding is that the frames "before" the current frame should be enough for prediction, so why are the frames "after" the current one also collected for training? That is what the training code does. My naive idea is to use only the "before" frames as the input sequence for inference, excluding the "after" frames. I need your comment, big thanks.

@mnauf
Contributor

mnauf commented Apr 9, 2022

@liamsun2019 As the author said, this is a video-based method. Such methods have greater accuracy than methods that work in real time because the latter have no knowledge of future frames. This also makes sense intuitively: as a human, you too could estimate a 3D pose with greater confidence/accuracy if you knew both what happened in the past and what the pose will be in the future.

@liamsun2019
Author

@mnauf Thanks for your comment. I understand that the accuracy is better when considering the frames both before and after the current frame. But such a strategy is not applicable to real-time applications, where the 2D poses of the following frames cannot be accessed for the current frame. In fact, I tried to train the model with only left-padding, and the resulting model still achieves acceptable accuracy. I would appreciate it if you could kindly suggest better approaches.

@Vegetebird
Owner

@liamsun2019 For real-time applications, training the model with only left-padding is a good choice.
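
A minimal sketch of what a left-padding-only, frame-by-frame inference loop could look like (all names, shapes, and the dummy models below are hypothetical, not taken from the repository):

```python
from collections import deque
import numpy as np

receptive_field = 81                        # assumed window length
history = deque(maxlen=receptive_field)     # keeps only the most recent 2D poses

# Dummy stand-ins so the sketch runs end to end; replace with the real 2D
# estimator and a causally (left-padding-only) trained 3D model.
estimate_2d = lambda frame: np.zeros((17, 2))
model_3d = lambda x: np.zeros((x.shape[0], x.shape[1], 17, 3))

def infer_current_pose(frame):
    history.append(estimate_2d(frame))      # 2D pose of the current frame
    window = np.stack(list(history))        # (n, 17, 2), n <= receptive_field
    if len(history) < receptive_field:      # left-pad by repeating the oldest pose
        missing = receptive_field - len(history)
        window = np.concatenate([np.repeat(window[:1], missing, axis=0), window])
    # The model sees only past and current frames; its prediction for the final
    # frame is taken as the real-time 3D pose.
    return model_3d(window[None])[0, -1]    # (17, 3)

pose_3d = infer_current_pose(frame=None)    # dummy call with a placeholder frame
```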

@liamsun2019
Author

Thanks for your suggestion. Closed.

@henbucuoshanghai
Copy link

Is the result good when training the model with only left-padding?
With only left-padding, can the model be tested on a live camera?
