Speeding up 3D model fitting #1728
Replies: 3 comments
-
In general, I suggest collecting traces of your fit with the PyTorch profiler, both CPU+GPU and GPU-only. You can learn a lot from them, including where most of the time is spent, and they often suggest ways to speed a workload up. If the GPU sits idle for much of a job because the CPU isn't feeding it work fast enough, then reducing precision might not help much. Some tricks which might be helpful are (1) using torch.compile if you can, (2) running independent parts of the calculation on separate CUDA streams, and (3) capturing a group of operations which always run in the same sequence as a CUDA graph.
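As a rough sketch of what collecting such a trace might look like: the `fit_step` function and the tensor shapes below are made-up stand-ins for a real fitting iteration, not anything from PyTorch3D. The snippet falls back to CPU-only profiling when no GPU is present.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def fit_step(params, target):
    # Hypothetical stand-in for one optimization step of a fit:
    # a forward pass producing a scalar loss, followed by backprop.
    loss = ((params @ params.T - target) ** 2).mean()
    loss.backward()
    return loss

# Profile CPU always, and CUDA kernels too when a GPU is available.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

device = "cuda" if torch.cuda.is_available() else "cpu"
params = torch.randn(256, 64, device=device, requires_grad=True)
target = torch.randn(256, 256, device=device)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):
        params.grad = None  # reset gradients between steps
        fit_step(params, target)

# Summary of the most expensive operators, sorted by total CPU time.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)

# Optionally write a timeline that can be opened in a trace viewer:
# prof.export_chrome_trace("fit_trace.json")
```

In the exported timeline, long gaps between GPU kernels usually mean the CPU side (Python overhead, kernel launches, data preparation) is the bottleneck rather than the GPU arithmetic itself.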
-
Thank you @bottler. My impression is that most of the time is spent within the PyTorch3D internals and backpropagation. Trying to apply CUDA graphs led me to errors in internal PyTorch3D methods related to cameras and point projection functions. I wonder whether delving deeper here is the right path. Is it maybe possible to downsample the mesh? Does PyTorch3D offer mesh subsampling utilities? I'm thinking of possibly using a hierarchical approach to reduce the number of iterations, and a coarser mesh may just do.
-
I don't think PyTorch3D has the subsampling you want, e.g. to coarsen a mesh.
-
I'm trying to fit 3DMM models to ~100,000 images using PyTorch3D with an analysis-by-synthesis type framework. Each fit is independent of the others, so batching doesn't make sense here, but is there a way to parallelize the fitting process or speed up the individual fits? Is there a way to perform the fitting using lower-precision arithmetic to gain an additional boost?
Right now, it's about 2 min/frame and the GPU doesn't seem to be fully utilized.
I'm under the impression that the size of the rendered image has little to no effect on the runtime (128x128 vs. 512x512 didn't make a difference).
I've tried running multiple processes using mp.Process, but I can usually run about 4 frames per GPU. I'm not sure this is the most efficient way to use the GPU.
I'd appreciate any ideas.
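For reference, the multi-process setup described above might be laid out roughly like this. This is only a sketch: `fit_one` is a hypothetical placeholder for a single fit, and the round-robin sharding and the `num_gpus`/`procs_per_gpu` parameters are assumptions for illustration, not the actual code.

```python
import multiprocessing as mp

def shard_round_robin(items, num_workers):
    """Split items into num_workers lists, assigning them round-robin."""
    return [items[i::num_workers] for i in range(num_workers)]

def fit_one(image_path, device):
    # Hypothetical placeholder: a real version would run the full
    # analysis-by-synthesis fit for one image on the given device.
    return f"{image_path}@{device}"

def worker(image_paths, gpu_id, out_queue):
    # Each process handles its own shard of images on one GPU.
    device = f"cuda:{gpu_id}"
    for path in image_paths:
        out_queue.put(fit_one(path, device))
    out_queue.put(None)  # sentinel: this worker has finished

def launch(image_paths, num_gpus, procs_per_gpu):
    # "spawn" is the start method generally needed when child
    # processes use CUDA.
    ctx = mp.get_context("spawn")
    out_queue = ctx.Queue()
    num_workers = num_gpus * procs_per_gpu
    shards = shard_round_robin(image_paths, num_workers)
    procs = [
        ctx.Process(target=worker, args=(shard, w % num_gpus, out_queue))
        for w, shard in enumerate(shards)
    ]
    for p in procs:
        p.start()
    # Drain results until every worker has sent its sentinel.
    results, finished = [], 0
    while finished < num_workers:
        item = out_queue.get()
        if item is None:
            finished += 1
        else:
            results.append(item)
    for p in procs:
        p.join()
    return results
```

With the spawn start method, `launch(...)` would need to be called from a script guarded by `if __name__ == "__main__":`.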