CUDA error: device-side assert triggered in sample_points_from_meshes function #117

Closed
rahuldey91 opened this issue Mar 19, 2020 · 7 comments
Labels: bug (Something isn't working)

Comments


rahuldey91 commented Mar 19, 2020

Hi. First of all, thanks for developing this long-desired tool. Now, coming to the bug.

I just started working with PyTorch3D and was trying the tutorial from here: https://github.com/facebookresearch/pytorch3d/blob/master/docs/tutorials/deform_source_mesh_to_target_mesh.ipynb

I started with my own jupyter notebook to reproduce the code. However, when I tried to visualize the meshes, by calling the plot_pointcloud() function in the tutorial, I came across the following error:
plot_pointcloud(trg_mesh, "Target mesh")

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-39-1e1d27f1793b> in <module>
      3 # print(trg_mesh._N)
      4 # trg_mesh.valid
----> 5 plot_pointcloud(trg_mesh, "Target mesh")
      6 # plot_pointcloud(src_mesh, "Source mesh")

<ipython-input-30-fa31b9ded440> in plot_pointcloud(mesh, title)
      2     # Sample points uniformly from the surface of the mesh
      3     print(mesh)
----> 4     points = sample_points_from_meshes(mesh, 5000)
      5     x, y, z = points.clone().detach().cpu().squeeze().unbind(1)
      6     fig = plt.figure(figsize=(5, 5))

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/pytorch3d/ops/sample_points_from_meshes.py in sample_points_from_meshes(meshes, num_samples, return_normals)
     39           be filled with 0.
     40     """
---> 41     if meshes.isempty():
     42         raise ValueError("Meshes are empty.")
     43 

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/pytorch3d/structures/meshes.py in isempty(self)
    430             bool indicating whether there is any data.
    431         """
--> 432         return self._N == 0 or self.valid.eq(False).all()
    433 
    434     def verts_list(self):

RuntimeError: CUDA error: device-side assert triggered

I noticed the error was coming from the member mesh.valid. When I accessed that member directly from the script, I got a similar error.
trg_mesh.valid

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    392                         if cls is not object \
    393                                 and callable(cls.__dict__.get('__repr__')):
--> 394                             return _repr_pprint(obj, self, cycle)
    395 
    396             return _default_pprint(obj, self, cycle)

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    682     """A pprint that just redirects to the normal repr function."""
    683     # Find newlines and replace them with p.break_()
--> 684     output = repr(obj)
    685     lines = output.splitlines()
    686     with p.group():

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/torch/tensor.py in __repr__(self)
    157         # characters to replace unicode characters with.
    158         if sys.version_info > (3,):
--> 159             return torch._tensor_str._str(self)
    160         else:
    161             if hasattr(sys.stdout, 'encoding'):

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/torch/_tensor_str.py in _str(self)
    309                 tensor_str = _tensor_str(self.to_dense(), indent)
    310             else:
--> 311                 tensor_str = _tensor_str(self, indent)
    312 
    313     if self.layout != torch.strided:

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/torch/_tensor_str.py in _tensor_str(self, indent)
    207     if self.dtype is torch.float16 or self.dtype is torch.bfloat16:
    208         self = self.float()
--> 209     formatter = _Formatter(get_summarized_data(self) if summarize else self)
    210     return _tensor_str_with_formatter(self, indent, formatter, summarize)
    211 

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/torch/_tensor_str.py in __init__(self, tensor)
     81         if not self.floating_dtype:
     82             for value in tensor_view:
---> 83                 value_str = '{}'.format(value)
     84                 self.max_width = max(self.max_width, len(value_str))
     85 

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/torch/tensor.py in __format__(self, format_spec)
    407     def __format__(self, format_spec):
    408         if self.dim() == 0:
--> 409             return self.item().__format__(format_spec)
    410         return object.__format__(self, format_spec)
    411 

RuntimeError: CUDA error: device-side assert triggered

My configuration is:
Ubuntu: 18.04
Python: 3.6.10
Pytorch: 1.4.0
Pytorch3D: 0.1.1
CUDA: 10.1

Thanks!

gkioxari (Contributor) commented Mar 19, 2020

Hi @rahuldey91! Thank you for your kind words.

This issue has been reported before (see #82 and #63) and is likely due to NaNs in your meshes. Could you check for NaNs before you run the sampling?
In the meantime, I will add a check at the beginning of mesh sampling that raises a clearer error message!
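
A minimal sketch of such a NaN check, assuming the vertices are available as a list of (V, 3) float tensors such as the one returned by `verts_list()`. The helper name `find_nonfinite_verts` and the toy tensors are hypothetical and only illustrate the idea; this is not the check that was added to the library.

```python
import torch


def find_nonfinite_verts(verts_list):
    """Return (mesh_index, count) for every mesh whose vertices contain NaN or Inf.

    Hypothetical helper for illustration; `verts_list` is any list of float tensors.
    """
    bad = []
    for i, verts in enumerate(verts_list):
        # torch.isfinite is False for NaN and +/-Inf entries; count those.
        n_bad = int((~torch.isfinite(verts)).sum())
        if n_bad > 0:
            bad.append((i, n_bad))
    return bad


# Toy (V, 3) vertex tensors standing in for the output of mesh.verts_list().
clean = torch.zeros(4, 3)
dirty = torch.tensor([[0.0, 1.0, float("nan")], [float("inf"), 0.0, 0.0]])

bad = find_nonfinite_verts([clean, dirty])
print(bad)
```

An empty result means every vertex tensor is finite, so NaNs can be ruled out as the cause before sampling.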

@gkioxari gkioxari self-assigned this Mar 19, 2020
@nikhilaravi nikhilaravi added the bug Something isn't working label Mar 19, 2020
@gkioxari (Contributor)

I added a check that raises an error if non-finite values are passed (see 6c48ff6).

@rahuldey91 (Author)

Hi @gkioxari! Thanks for your quick response and for pointing out the related issues. I tried to check for NaNs in the mesh, but I got the same error even when calling trg_mesh.verts_list(). Then I noticed that my mesh was on device "cuda:7". I reran the code after changing the device to "cuda:0" and got the desired output without any errors. Could you help me understand why having the data on a device other than cuda:0 would produce an error?

@gkioxari (Contributor)

This shouldn't cause a problem. Note that we use these ops to train on multiple GPUs, e.g. when training Mesh R-CNN models with distributed training on 8 GPUs. Is it possible that your data was living on different devices, or that your GPU is faulty in some way? I can't think of other reasons why it would fail.
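
One way to test the "data on different devices" hypothesis is to group the tensors by the device they live on. The sketch below assumes the tensors of interest can be collected into a dict; the helper name `devices_of` and the toy tensors are hypothetical.

```python
import torch


def devices_of(named_tensors):
    """Map each distinct device (as a string) to the names of the tensors on it."""
    by_device = {}
    for name, tensor in named_tensors.items():
        by_device.setdefault(str(tensor.device), []).append(name)
    return by_device


# Toy tensors standing in for a mesh's vertices and faces (created on the CPU here).
tensors = {
    "verts": torch.zeros(4, 3),
    "faces": torch.zeros(2, 3, dtype=torch.int64),
}

placement = devices_of(tensors)
print(placement)
if len(placement) > 1:
    print("Tensors are spread over several devices:", sorted(placement))
```

If the result has more than one key, some tensor was left on a different device than the rest.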

@rahuldey91 (Author)

Here is my ipynb file to reproduce the error. If you change the device to device = torch.device("cuda:0"), it runs without errors. For any other GPU, it raises this error.
sphere_to_dolphin.zip

nikhilaravi (Contributor) commented Mar 20, 2020

@rahuldey91 are you using one GPU or multiple GPUs? If you are using a GPU other than the default (cuda:0), you may need to set it explicitly:

device = torch.device("cuda:7")
torch.cuda.set_device(device)
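
The two lines above can be wrapped in a small helper that also copes with machines that have fewer GPUs. This is only a sketch: the name `pick_device` and the CPU fallback are assumptions added for illustration, not part of PyTorch or PyTorch3D.

```python
import torch


def pick_device(index: int) -> torch.device:
    """Select GPU `index` as the current CUDA device, or fall back to the CPU.

    Hypothetical helper; the CPU fallback is an assumption of this sketch.
    """
    if torch.cuda.is_available() and 0 <= index < torch.cuda.device_count():
        device = torch.device(f"cuda:{index}")
        # Make this GPU the current device, as suggested above.
        torch.cuda.set_device(device)
        return device
    return torch.device("cpu")


device = pick_device(7)
# Create the data directly on the selected device.
verts = torch.zeros(4, 3, device=device)
print(device, verts.device)
```

Creating or moving every tensor with the returned `device` keeps the data and the current CUDA device in agreement.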

@rahuldey91 (Author)

Oh I see. That resolves the issue. You can go ahead and close it. Thanks.
