Training reproducible with PyTorch but not with PyTorch + PyTorch3D #659

Closed
abhi1kumar opened this issue Apr 27, 2021 · 11 comments

abhi1kumar commented Apr 27, 2021

❓ How to ensure reproducibility of training with PyTorch3D

I am trying to make training reproducible with PyTorch + PyTorch3D. When I use only PyTorch, without PyTorch3D, my entire training is reproducible: each time I execute my training script, the errors and the logs match. However, when I introduce PyTorch3D-based rendering into training, it is no longer reproducible.

Libraries and their versions -

  • PyTorch3D 0.4.0
  • PyTorch 1.5.1
  • Torchvision 0.6.1
  • CUDA 10.1

Code used to seed the training

import os
import random

import numpy as np
import torch


def init_torch(rng_seed, cuda_seed):
    """
    Initializes the seeds for ALL potential randomness, including numpy, random and torch.

    Args:
        rng_seed (int): the shared random seed to use for numpy, random and torch.manual_seed
        cuda_seed (int): the random seed to use for pytorch's torch.cuda.manual_seed_all function
    """
    np.random.seed(rng_seed)
    random.seed(rng_seed)
    # Note: this only affects subprocesses; hashing in the current interpreter is fixed at startup.
    os.environ['PYTHONHASHSEED'] = str(rng_seed)

    torch.manual_seed(rng_seed)
    torch.cuda.manual_seed(cuda_seed)
    torch.cuda.manual_seed_all(cuda_seed)

    # Ask cuDNN for deterministic kernels and disable its autotuner.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

I also checked whether I am missing something from the PyTorch 1.5.1 reproducibility documentation, but could not find anything else.

The latest PyTorch reproducibility documentation says:
> Furthermore, if you are using CUDA tensors, and your CUDA version is 10.2 or greater, you should set the environment variable CUBLAS_WORKSPACE_CONFIG according to CUDA documentation
Since I am using CUDA 10.1, I assume this problem should not arise.

It would be great if you could explain how to remove the randomness when using PyTorch3D, so that the training can be fully reproduced.
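One way to narrow down which step introduces the run-to-run differences is to execute a single step several times under identical seeds and compare the results. The helper below is only a sketch of that idea in plain Python; `first_divergent_run` and its parameters are hypothetical names, not part of PyTorch or PyTorch3D.

```python
from typing import Any, Callable, Optional


def first_divergent_run(
    run_once: Callable[[], Any],
    n_runs: int = 3,
    equal: Callable[[Any, Any], bool] = lambda a, b: a == b,
) -> Optional[int]:
    """Call run_once() n_runs times and compare every result with the first.

    Returns the index of the first run that differs, or None if all match.
    run_once is expected to re-seed everything itself before doing its work.
    """
    reference = run_once()
    for i in range(1, n_runs):
        if not equal(reference, run_once()):
            return i
    return None
```

For tensors, `equal=torch.equal` would be the natural comparison, and `run_once` could call `init_torch(...)` and then render a single depth map, so that the rasterizer is tested in isolation from the rest of the training loop.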

@bottler
Contributor

bottler commented Apr 27, 2021

Which parts of PyTorch3D are you using?

@bottler bottler added the question Further information is requested label Apr 27, 2021
@abhi1kumar
Author

@bottler Thank you for your reply. I am using MeshRasterizer with the following settings.

import torch
import torch.nn as nn
from pytorch3d.renderer import MeshRasterizer, RasterizationSettings


class MeshRendererWithDepth(nn.Module):
    """Runs only the rasterizer and returns the per-pixel depth buffer."""

    def __init__(self, rasterizer):
        super().__init__()
        self.rasterizer = rasterizer

    def forward(self, meshes_world, **kwargs) -> torch.Tensor:
        fragments = self.rasterizer(meshes_world, **kwargs)
        return fragments.zbuf  # shape (N, H, W, faces_per_pixel)


# raster_image_size, cameras, mesh, R_camera and T_camera are defined elsewhere in the training code.
raster_settings = RasterizationSettings(
    image_size=raster_image_size,
    blur_radius=0,
    faces_per_pixel=2,
    perspective_correct=False,
    cull_backfaces=True,
    max_faces_per_bin=320,
)
renderer = MeshRendererWithDepth(
    rasterizer=MeshRasterizer(
        cameras=cameras,
        raster_settings=raster_settings,
    )
)

depth_maps = renderer(meshes_world=mesh, R=R_camera, T=T_camera)

@bottler
Contributor

bottler commented Apr 28, 2021

Separate CUDA threads deal with separate faces. If two or more faces have exactly the same distance to a certain pixel, then the order in which they appear in the output for that pixel is not determined. Further, if the nearest faces_per_pixel faces need to include one but not all of these equally-distant faces, then it is not determined which of them will be included in the output. In many applications, these exact ties should be rare.

It should be possible to change PyTorch3D to remove this non-determinism, e.g. by making a lower-indexed equally-distant face count as "closer".
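The tie-breaking rule suggested here amounts to ranking candidates by the pair (distance, face index) rather than by distance alone. The following is a plain-Python illustration of that ordering only, not the CUDA kernel; `nearest_faces` and its input format are hypothetical.

```python
from typing import List, Tuple


def nearest_faces(candidates: List[Tuple[int, float]], k: int) -> List[int]:
    """Return the k nearest face indices for one pixel.

    candidates holds (face_index, distance) pairs in arbitrary order, as they
    might arrive from independent threads. Equal distances are ordered by face
    index, so the result does not depend on the arrival order.
    """
    ranked = sorted(candidates, key=lambda c: (c[1], c[0]))
    return [face for face, _ in ranked[:k]]
```

With this key, two faces at the same distance always come out in the same order, and the cut at `k` always falls on the same face.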

@abhi1kumar
Author

@bottler Thank you for your reply.

One option is to set faces_per_pixel=1 instead of 2. However, I am not sure whether I would still get useful gradients.
The other option is to always treat the lower-indexed face as the closest one among all equally distant faces.

Can we make a lower-indexed, equally-distant face count as the closer face through a RasterizationSettings option? If so, would you mind telling me which option does this? Or do we need to re-compile PyTorch3D and change its internals?

@bottler
Contributor

bottler commented Apr 28, 2021

> Or do we need to re-compile PyTorch3D and change its internals?

Yes. This would be a code change in a couple of places in /pytorch3d/csrc/rasterize_meshes/rasterize_meshes.cu.

@bottler
Contributor

bottler commented Apr 28, 2021

You might know more about your specific meshes. But in general, setting faces_per_pixel=1 doesn't solve the problem. You may be able to increase faces_per_pixel to more than you need, and then sort the rasterization output to resolve ties, and then truncate it to what you need.
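A minimal sketch of the oversample, sort and truncate idea, under some assumptions: the inputs are the `pix_to_face` and `zbuf` tensors of the rasterizer output, each of shape (N, H, W, K) with -1 marking empty slots, and `torch.sort` supports `stable=True` (PyTorch 1.9 or later, so newer than the 1.5.1 mentioned above). The function name `sort_and_truncate` is hypothetical.

```python
import torch


def sort_and_truncate(pix_to_face, zbuf, k):
    """Order the K candidates per pixel by (depth, face index), keep the first k.

    pix_to_face: (N, H, W, K) int64 tensor, -1 marks an empty slot.
    zbuf:        (N, H, W, K) float tensor with the depth of each candidate.
    Returns the reordered pix_to_face and zbuf plus the gather indices used.
    """
    empty = pix_to_face < 0
    # Give empty slots infinite depth so that they end up last.
    depth_key = torch.where(empty, torch.full_like(zbuf, float("inf")), zbuf)

    # Lexicographic sort via two stable passes: secondary key (face index) first...
    _, order = torch.sort(pix_to_face, dim=-1, stable=True)
    depth_key = depth_key.gather(-1, order)
    # ...then the primary key (depth); stability keeps face-index order among ties.
    _, order2 = torch.sort(depth_key, dim=-1, stable=True)
    order = order.gather(-1, order2)[..., :k]

    return pix_to_face.gather(-1, order), zbuf.gather(-1, order), order
```

One would rasterize with a larger faces_per_pixel than needed (for example 4 when 2 are wanted) and pass `fragments.pix_to_face` and `fragments.zbuf` with `k=2`; the returned indices can be used to reorder the other per-face outputs in the same way. This only removes the ambiguity when every tied face is among the K oversampled candidates, so it is a mitigation rather than a guarantee.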

@abhi1kumar
Author

> Or do we need to re-compile PyTorch3D and change its internals?
>
> Yes. This would be a code change in a couple of places in /pytorch3d/csrc/rasterize_meshes/rasterize_meshes.cu.

Will determinism be added as a PyTorch3D feature in the future? In other words, is reproducibility on your TODO list? In my opinion, reproducibility in rasterization is an important feature to add to PyTorch3D, since it would make training straightforward to reproduce.

> You may be able to increase faces_per_pixel to more than you need, and then sort the rasterization output to resolve ties, and then truncate it to what you need.

Could you elaborate on this? A code snippet illustrating it would be great.

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jun 23, 2021
@github-actions

This issue was closed because it has been stalled for 5 days with no activity.

@bottler bottler removed the Stale label Jun 29, 2021
@bottler bottler reopened this Jun 29, 2021
@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jul 30, 2021
@bottler bottler added do-not-reap Do not delete this pull request or issue due to inactivity. and removed Stale labels Jul 30, 2021
@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
