
A simple trick for a fully deterministic ROIAlign, and thus MaskRCNN training and inference #4723

Open
ASDen opened this issue Dec 28, 2022 · 6 comments
Labels
enhancement Improvements or good new features

Comments

ASDen commented Dec 28, 2022

Non-determinism of MaskRCNN

There have been many discussions and inquiries in this repo about a fully deterministic MaskRCNN, e.g. #4260, #3203, #2615, #2480, and also on other detection repositories (e.g. MMDetection here and here, and torchvision here). Unfortunately, even after seeding everything and setting PyTorch's deterministic flags, results are still not repeatable.

It boils down to the fact that some of the PyTorch / torchvision ops used don't have a deterministic GPU implementation (most notably due to the use of atomicAdd in the backward pass), so the only workaround has been to train for as long as possible to reduce variance in the results. It is worth noting that not only training but also evaluation (see #2480) of MaskRCNN (and in fact most detectron2 models) is non-deterministic.
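
As a quick illustration of why the scheduling order of atomicAdd matters (my example, not from the issue): floating-point addition is not associative, so accumulating the same gradient contributions in a different order can round differently:

print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
print(0.1 + (0.2 + 0.3))  # 0.6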

Based on the minimal example in #4260, I analyzed the ops used by MaskRCNN and found that the main source of non-determinism is the backward pass of ROIAlign (see here).

Proposed solution

I am proposing here a simple trick that makes ROIAlign practically fully reproducible, without touching the CUDA kernel! It introduces only trivial additional memory and computation. It can be summarized as:

  • Truncate the input to a smaller datatype; this gives a starting point that uses only a small number of significand bits
  • Then cast to a larger datatype just before the computations that involve atomicAdd

In terms of code, this translates to simply modifying this function call to

return roi_align(
    input.half().double(),
    rois.half().double(),
    self.output_size,
    self.spatial_scale,
    self.sampling_ratio,
    self.aligned,
).to(dtype=input.dtype)
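
To see why the truncation matters, here is a minimal sketch (purely illustrative, not from the issue) showing that a double-precision sum of half-truncated values is exact, and therefore identical for every summation order, while a plain float32 sum typically is not:

import torch

torch.manual_seed(0)
x = torch.randn(100_000)

def sums_over_random_orders(t, trials=10):
    # Sum the same values in several random orders; collect the distinct results
    return {t[torch.randperm(t.numel())].sum().item() for _ in range(trials)}

# float32: reordering the reduction typically changes the rounded result
print(len(sums_over_random_orders(x)))                   # usually > 1
# half-truncated values in a double accumulator: few significand bits plus a
# wide accumulator make the sum exact, so every order gives the same answer
print(len(sums_over_random_orders(x.half().double())))   # 1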

Test

The conversion to double adds only a trivial amount of memory and computation, but performing it after the truncation significantly increases reproducibility.

This solution was tested and found fully deterministic (loss values, and evaluation results on COCO) for up to tens of thousands of steps (using the same code as in #4260) for:

  • MaskRCNN with a ResNet-50 backbone
  • MaskRCNN with a ResNeXt-101 backbone
  • A wide range of batch sizes
  • Mixed-precision training
  • Single- and multi-GPU training
  • A100s & V100s

Note on A100

Ampere uses the TF32 format by default for tensor-core computations, which means that the above truncation is done implicitly! So on Ampere-based devices it is enough to just cast to double, i.e.

return roi_align(
    input.double(),
    rois.double(),
    self.output_size,
    self.spatial_scale,
    self.sampling_ratio,
    self.aligned,
).to(dtype=input.dtype)

Note: This is the default mode for PyTorch, but if TF32 is disabled for some reason (e.g. torch.backends.cudnn.allow_tf32 = False) then the above truncation with .half() is still necessary
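
Putting the two variants together, here is a minimal sketch of a wrapper (the function name is mine; it assumes torchvision's roi_align, and the flag check is an assumption that presumes an Ampere-class GPU whenever TF32 is enabled):

import torch
from torchvision.ops import roi_align

def reproducible_roi_align(input, rois, output_size, spatial_scale,
                           sampling_ratio, aligned):
    # If TF32 is enabled (and the GPU actually uses it), the truncation is
    # assumed to happen implicitly; otherwise truncate to half explicitly.
    if torch.backends.cudnn.allow_tf32:
        x, r = input, rois
    else:
        x, r = input.half(), rois.half()
    return roi_align(
        x.double(), r.double(),
        output_size, spatial_scale, sampling_ratio, aligned,
    ).to(dtype=input.dtype)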

Note

  • This solution was tested and found to work well for other non-deterministic PyTorch ops, including F.interpolate and F.grid_sample (see the sketch after this list)
  • This is not a general solution to the problem of random-order reproducible floating-point summation, but a practical mitigation that works well for this setup / scenario
  • At least in theory, this should work even better if applied inside the kernel, right before the atomicAdd
  • The only current alternative is training each experiment for a very long time, which isn't practical in many setups and still isn't fully reproducible
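
For example, a minimal sketch of the same trick applied to F.grid_sample (the wrapper name is mine, purely illustrative):

import torch
import torch.nn.functional as F

def deterministic_grid_sample(input, grid, **kwargs):
    # Same truncate-then-upcast trick; grid_sample's backward also uses atomicAdd
    out = F.grid_sample(input.half().double(), grid.half().double(), **kwargs)
    return out.to(dtype=input.dtype)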

Would love to hear what people think about this!
@ppwwyyxx @fmassa

ASDen added the enhancement label Dec 28, 2022
muncok commented Jan 16, 2023

Thank you for sharing your tip.

I have tried your solution, but the loss values across runs were not identical.

You said

This solution was tested and found fully deterministic (loss values, and evaluation results on COCO) for up to tens of thousands of steps (using the same code as in #4260) for:

Were the loss values perfectly identical in your experiments?

Could you please let me know which version of PyTorch you are using?

@LucQueen

Thank you @ASDen, but when I tried your solution I got an error training on an A100:

  warnings.warn(
/opt/conda/lib/python3.8/site-packages/torch/functional.py:599: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:2299.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    launch(
  File "~/detectron2/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "tools/train_net.py", line 150, in main
    return trainer.train()
  File "~/detectron2/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "~/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "~/detectron2/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "~/detectron2/detectron2/engine/train_loop.py", line 274, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1111, in _call_impl
    return forward_call(*input, **kwargs)
  File "~/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 167, in forward
    _, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1111, in _call_impl
    return forward_call(*input, **kwargs)
  File "~/detectron2/detectron2/modeling/roi_heads/roi_heads.py", line 739, in forward
    losses = self._forward_box(features, proposals)
  File "~/detectron2/detectron2/modeling/roi_heads/roi_heads.py", line 798, in _forward_box
    box_features = self.box_pooler(features, [x.proposal_boxes for x in proposals])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1111, in _call_impl
    return forward_call(*input, **kwargs)
  File "~/detectron2/detectron2/modeling/poolers.py", line 261, in forward
    output.index_put_((inds,), pooler(x[level], pooler_fmt_boxes_level))
RuntimeError: expected scalar type Double but found Float

muncok commented Feb 2, 2023

Oh, my code became reproducible after also applying the following snippet.

Thank you!!

import os
# Must be set before CUDA initializes; required for deterministic cuBLAS ops
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch
# Force deterministic cuDNN kernels and disable the autotuner
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# Raise an error on any op that lacks a deterministic implementation
torch.use_deterministic_algorithms(True)

@GeoffreyChen777

You are a real lifesaver.

@luca-serra

Works like a charm!
The training is fully reproducible and the performance does not worsen.
Thanks for this hack!

@katherinegls

Thank you @ASDen. I tried your solution with Sparse R-CNN, based on detectron2, on an RTX 3090. Although training is fully reproducible, the loss does not decrease. My modification to ROIAlign is as follows:

return roi_align(
    input.half().double(),
    rois.half().double(),
    self.output_size,
    self.spatial_scale,
    self.sampling_ratio,
    self.aligned,
).to(dtype=input.dtype)

How can I solve the problem of the loss not decreasing?
