
About inference in real time #13

Closed
liamsun2019 opened this issue Apr 2, 2022 · 9 comments

Comments

@liamsun2019

Hi author,

Thanks for your excellent work. I did some training and tests based on your paper and code, and the results are good. I am now curious about real-time inference. My intention is to estimate the 3D coordinates while playing back a video. According to your strategy and demo code, estimating the pose of a center frame needs the 2D poses both before and after it, which means the 3D pose of a given frame cannot be obtained until the 2D poses after it have been computed. I have no idea how to handle such a case, and I don't think it's a good idea to just pad with dummy data such as zeros. I would appreciate any suggestions from you. Thanks.

@Vegetebird
Owner

Vegetebird commented Apr 4, 2022

Hi~ Thanks for your interest in our work.

MHFormer outputs a sequence of 3D poses, and we just select the pose of the center frame to evaluate the performance. You can directly take the whole sequence of 3D poses as output; it is more efficient, has negligible performance degradation, and can be applied to real-time scenarios.
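
For illustration, a minimal sketch of the two output choices described above (the tensor shapes are assumptions, not checked against the repository's code):

```python
import torch

# Hypothetical tensor standing in for model(input_2d):
# (batch, frames, joints, 3), e.g. a 243-frame window with 17 joints.
output_3d = torch.randn(1, 243, 17, 3)

# Evaluation protocol: keep only the pose of the center frame.
center_pose = output_3d[:, output_3d.shape[1] // 2]   # (batch, joints, 3)

# Real-time alternative suggested above: take the whole predicted sequence,
# so one forward pass yields a 3D pose for every frame in the window.
all_poses = output_3d                                  # (batch, frames, joints, 3)
```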

@liamsun2019
Author

Thanks for your prompt reply. I understand that the output 3D poses form a sequence. I'm just curious about the input preparation during inference. For example, I intend to get the 3D poses for all video frames one by one in real time. For the 3D pose estimation of the 1st video frame, I first get its 2D pose. Then I must prepare the input for MHFormer to get the 3D pose. From the code, I have to collect 243 frames of 2D poses (alternatives are 9/27/81...), among which 121 are before the frame and 121 are after it. But that information is simply not available at the moment the 1st video frame is processed. I wonder how to prepare the other 121x2 2D poses. Just set them to zeros? I don't think that's a good way.

@Vegetebird
Owner

Yeah, this is a video-based method; we need to feed a sequence to the model. For the first frame in real time, we pad with the edge values of the array rather than setting them to zeros.

On the other hand, in real-time applications the "causal setting" mentioned in [1] is more appropriate, i.e., the model takes a video sequence as input and outputs the pose of the final frame.

[1] Pavllo, et al. 3D human pose estimation in video with temporal convolutions and semi-supervised training, CVPR 2019.
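
Purely for illustration, a minimal sketch of the edge-value padding and the causal window (the shapes, names, and `receptive_field` value are assumptions, not the repository's exact code):

```python
import numpy as np

receptive_field = 243                  # e.g. the 243-frame model
pad = (receptive_field - 1) // 2       # 121 frames on each side

# 2D poses of the frames seen so far, shape (num_frames, 17, 2).
keypoints = np.zeros((5, 17, 2))       # hypothetical: only 5 frames available

# Centered window: repeat the first/last available pose instead of zero-filling.
padded = np.pad(keypoints, ((pad, pad), (0, 0), (0, 0)), mode='edge')
window_frame_0 = padded[0:receptive_field]              # centered on original frame 0

# Causal variant (Pavllo et al.): pad only on the left with edge values, so the
# window ends at the current frame and the prediction for the final frame is
# the pose of "now".
pad_left = max(receptive_field - keypoints.shape[0], 0)
causal = np.pad(keypoints, ((pad_left, 0), (0, 0), (0, 0)), mode='edge')
causal_window = causal[-receptive_field:]               # shape (243, 17, 2)
```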

@liamsun2019
Author

Thanks for your tip. I am now in a dilemma. I already have a 2D pose estimator that achieves a good balance between performance and speed, even after quantization and deployment on a mobile device. My thought is to combine it with MHFormer to build a real-time 3D pose estimator, i.e., first get the 2D poses and then acquire the 3D pose. Actually, I am a little confused about the training strategy. My understanding is that the frames "before" the current frame should be enough for prediction, so why are the frames "after" the current one also collected for training? That is what the training code does. My naive idea is to use only the "before" frames as the input sequence for inference, excluding the "after" frames. I need your comment, big thanks.

@mnauf
Contributor

mnauf commented Apr 9, 2022

@liamsun2019 As the author said, this is a video-based method. Such methods have greater accuracy than methods that work in real time because the latter have no knowledge of future frames. This also makes sense intuitively: as a human, you too could estimate a 3D pose with greater confidence/accuracy if you knew both what happened in the past and what the pose will be in the future.

@liamsun2019
Author

@mnauf Thanks for your comment. I understand that the accuracy is better when considering the frames both before and after the current frame. But such a strategy is not applicable to real-time applications, where the 2D poses of the following frames cannot be accessed for the current frame. In fact, I tried to train the model with only left-padding, and the resulting model still achieves acceptable accuracy. I would appreciate it if you could kindly suggest better approaches.

@Vegetebird
Owner

@liamsun2019 For real-time applications, training the model with only left-padding is a good choice.
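
A minimal sketch of what a left-padding-only, frame-by-frame inference loop could look like (all names, shapes, and the dummy models below are hypothetical, not taken from the repository):

```python
from collections import deque
import numpy as np

receptive_field = 81                        # assumed window length
history = deque(maxlen=receptive_field)     # keeps only the most recent 2D poses

# Dummy stand-ins so the sketch runs end to end; replace with the real 2D
# estimator and a causally (left-padding-only) trained 3D model.
estimate_2d = lambda frame: np.zeros((17, 2))
model_3d = lambda x: np.zeros((x.shape[0], x.shape[1], 17, 3))

def infer_current_pose(frame):
    history.append(estimate_2d(frame))      # 2D pose of the current frame
    window = np.stack(list(history))        # (n, 17, 2), n <= receptive_field
    if len(history) < receptive_field:      # left-pad by repeating the oldest pose
        missing = receptive_field - len(history)
        window = np.concatenate([np.repeat(window[:1], missing, axis=0), window])
    # The model sees only past and current frames; its prediction for the final
    # frame is taken as the real-time 3D pose.
    return model_3d(window[None])[0, -1]    # (17, 3)

pose_3d = infer_current_pose(frame=None)    # dummy call with a placeholder frame
```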

@liamsun2019
Author

Thanks for your suggestion. Closed.

@henbucuoshanghai
Copy link

Is the result good when training the model with only left-padding?
With only left-padding, can the model be tested on a live camera?
