Codegen error on transform_replay #1538

Open · jjsjann123 opened this issue Mar 29, 2022 · 2 comments

@jjsjann123 (Collaborator) commented:

🐛 Describe the bug

The error I'm running into is:

root@f3d8903f445f:/raid/playground# PYTORCH_NVFUSER_DISABLE_FALLBACK=1 python animesh_repro.py

Traceback (most recent call last):
  File "animesh_repro.py", line 19, in <module>
    forward(*inps)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: it != replay_CasP.getReplay().end() INTERNAL ASSERT FAILED at "/raid/pytorch_10_1/torch/csrc/jit/codegen/cuda/transform_replay.cpp":491, please report a bug to PyTorch. Could not find axis, iS235{( ceilDiv(( i10 * i11 ), 1) )}, requested in replay.

Script for repro:

import torch

def forward(i0, i1, i2, i3, i4, i5, i6, i7):
  i_tb1 = torch.ops.aten.threshold_backward(i6, i7, 0)
  i18 = torch.ops.aten.view(i_tb1, [1, 1024, 128, 128])
  i19, i20, i21 = torch.ops.aten.native_batch_norm_backward(i18, i0, i1, i2, i3, i4, i5, False, 1e-5, [True, True, True])
  i22 = torch.ops.aten.view(i20, [16, 64])
  i_s2 = torch.ops.aten.sum(i22, 0)
  i24 = torch.ops.aten.view(i21, [16, 64])
  i_s1 = torch.ops.aten.sum(i24, 0)
  i26 = torch.ops.aten.view(i19, [16, 64, 128, 128])
  return (i26, i_s1, i_s2)

inps = [
  (torch.Size([1, 1024, 128, 128]), torch.float32),
  (torch.Size([1024]), torch.float32),
  (torch.Size([1024]), torch.float32),
  (torch.Size([1024]), torch.float32),
  (torch.Size([0]), torch.float32),
  (torch.Size([0]), torch.float32),
  (torch.Size([16, 64, 128, 128]), torch.float32),
  (torch.Size([16, 64, 128, 128]), torch.float32),
]
inps = [torch.randn(shape, dtype=dtype, device='cuda') for shape, dtype in inps]
forward = torch.jit.script(forward)
with torch.jit.fuser("fuser2"):
  forward(*inps)
  forward(*inps)
  forward(*inps)
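
As an optional sanity check (my addition, not part of the original report), the tail of the script could keep a reference to the un-scripted function and call it once outside the fuser context; if that eager call succeeds, the failure is isolated to the nvFuser ("fuser2") codegen path rather than the aten ops or shapes themselves:

# Sketch only (not from the original repro): run both the eager and the fused path.
eager_forward = forward                      # plain Python function, kept before scripting
scripted_forward = torch.jit.script(forward)

eager_forward(*inps)                         # eager run, no fusion involved
with torch.jit.fuser("fuser2"):
  scripted_forward(*inps)                    # nvFuser path; this is where the reported assert fires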

Versions

This reproduces on my devel branch at HEAD, commit 6df7b77b5ccf681694097a111ee0525f7bf1350f.

@jjsjann123 (Collaborator, Author) commented:

So this is one of the view issues (as Kevin pointed out to the team). Naoya tried Christian's view fix (I guess #1535), but it doesn't seem to help with this case.

Naoya is looking into the issue at the moment, so I'm putting his name on this one for now.
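
For reference, here is my reading of the view ops in the repro above (an annotation I'm adding, not something from the original comment): every view either merges 16*64 into the 1024-channel dimension or splits 1024 back into 16*64, which is presumably the reshape the transform replay is tripping over.

# Shape bookkeeping in the repro (illustrative annotation only):
#   i_tb1: [16, 64, 128, 128]   --view--> i18: [1, 1024, 128, 128]   (16*64 merged into 1024)
#   i20:   [1024]               --view--> i22: [16, 64]              (1024 split into 16*64)
#   i21:   [1024]               --view--> i24: [16, 64]              (1024 split into 16*64)
#   i19:   [1, 1024, 128, 128]  --view--> i26: [16, 64, 128, 128]    (1024 split into 16*64)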

@naoyam (Collaborator) commented Apr 19, 2022:

@jjsjann123 Is this still an issue? We've made a couple of improvements so far, including more robust handling of views and trivial reductions. As far as I remember, @rdspring1 said he's working on a scheduler fix, but I'm not sure what the current status is.
