[Bug] Conv2DTranspose with groups not working correctly #10223
Comments
Thanks, yes
Fixed in #10235. You shouldn't be using
Great, thanks for the quick response! I can confirm that it's converting from PyTorch correctly for my more complex model on your branch. Now I'm getting an AssertionError on

Are there plans to support grouped transposed convolutions on GPU? Is there a better workaround than splitting the groups manually?
Yeah, the cuda backend doesn't support groups. We should fix that, but I am not looking to do it; a PR is welcome. You can try the cpu backend to verify the result. If you are OK with using cuDNN, I can quickly enable groups support for cuDNN conv transpose 2d.
Yes, using cuDNN is fine with me, that would be great :)
Are you sure you can run your PT model on CUDA via cuDNN (use nvprof)? I'm getting

A WIP branch, https://github.com/apache/tvm/compare/main...masahi:conv2d-transpose-group-cudnn?expand=1, if you want to hack on it.
Hello @masahi, sorry for the slow response; I somehow missed the notification on this one. Thanks for enabling the op!

I took a look at running on your branch and was also getting BAD_PARAMs on both the above test case and my full model. After a bit of mucking around I noticed this change is incorrect: the argument is the conv_mode, which should be left as 1 (according to the main branch). With that reverted, I'm able to run both the test case and my larger model with grouped conv2d_transpose ops on the cuDNN backend 🎉 Sadly I'm still just a few FPS shy of my performance target, so I'll have to keep digging for speedups.

RE: support for groups in the regular cuda backend. Do you have a general idea of what kind of changes are necessary for that? I'm no expert, but I might be able to figure it out if it's just adapting similar code from grouped conv2d to work for grouped conv2d_transpose.

For good measure, the updated test code, which now works with the one-line change to your branch:

```python
import torch
import tvm.relay
from torch.nn.functional import conv_transpose2d
from tvm import relay
from tvm.contrib import graph_executor


class ModulatedConvTranspose2D(torch.nn.Module):
    def forward(self, x, w, s):
        B, C, H, W = x.shape
        I, O, KH, KW = w.shape
        # weight is different for each input in batch (this is why we want grouped conv transpose)
        w = w.unsqueeze(0) * s.reshape(B, 1, 1, 1, 1)
        w = w.reshape(B * I, O, KH, KW)
        x = x.reshape(1, B * C, H, W)
        x = conv_transpose2d(x, w, stride=(2, 2), padding=(1, 1), output_padding=(1, 1), groups=B)
        x = x.reshape(B, O, H * 2, W * 2)
        return x


with torch.inference_mode():
    device = "cuda"
    target = "cuda -libs=cudnn"
    dtype = torch.float16
    tvm_dtype = dtype.__repr__().split(".")[-1]

    b, c, h, w, k = 4, 512, 8, 16, 3
    inputs = torch.rand((b, c, h, w), dtype=dtype, device=device)
    weights = torch.rand((c, c // 2, k, k), dtype=dtype, device=device)
    styles = torch.rand((b,), dtype=dtype, device=device)

    torch_mod = torch.jit.trace(ModulatedConvTranspose2D().eval().to(device), (inputs, weights, styles))
    outputs_torch = torch_mod(inputs, weights, styles)
    print("Torch output shape", tuple(outputs_torch.shape))  # (4, 256, 16, 32)

    tvm_mod, tvm_params = relay.frontend.pytorch.from_pytorch(
        torch_mod,
        [
            ("inputs", (tuple(inputs.shape), tvm_dtype)),
            ("weights", (tuple(weights.shape), tvm_dtype)),
            ("styles", (tuple(styles.shape), tvm_dtype)),
        ],
    )

    with tvm.transform.PassContext(opt_level=10):
        lib = relay.build(tvm_mod, target=target, params=tvm_params)

    m = graph_executor.GraphModule(lib["default"](tvm.cuda()))
    m.run(
        inputs=tvm.nd.array(inputs.cpu(), device=tvm.cuda()),
        weights=tvm.nd.array(weights.cpu(), device=tvm.cuda()),
        styles=tvm.nd.array(styles.cpu(), device=tvm.cuda()),
    )
    print("TVM output shape ", m.get_output(0).numpy().shape)  # (4, 256, 16, 32)
```
Oops, good find! Can you send a PR? I can quickly merge it (if I do it, you need to wait until next week).

How does TVM + cuDNN compare to PT? Since you are running on fp16, I'd hope that we can use tensor cores, but I've never seen grouped convolution running on tensor cores. Also, CUTLASS is generally faster than cuDNN, but it doesn't support grouped or depthwise convolution afaik.
Yes, you can try adding

and update

You may try our auto-scheduler to see if it can beat cuDNN.
I'm not quite sure what the best way to benchmark/profile things is. I've been trying to use

At the moment the PyTorch models (vanilla, traced, optimize_for_inference) reach about 9-11 FPS and the TVM + cuDNN version is about 15-19 FPS. I'm hoping to get into the 25-30 FPS range. I'm trying to get more of the computation done in fp16, but I've run into issue #10397.

So far the autotvm tuner hasn't been successful for me: it took a couple of days to tune all of the ops with the default settings from the tutorial, and the result was actually slower than the untuned TVM + cuDNN version. I haven't looked too deeply into the auto-scheduler yet because I couldn't find a good tutorial on applying it to a large model (I think the only tutorials are for single ops?). I figured it would also be less effective when using cuDNN, which might reduce the flexibility of the scheduler, although I'm not sure if that's actually the case.
https://github.com/apache/tvm/tree/main/gallery/how_to/tune_with_autoscheduler has e2e examples of using the auto-scheduler. But yeah, I don't expect it to beat cuDNN, unless cuDNN's implementation of dgrad with groups is really poor.
I'm trying to convert a PyTorch model which makes use of `torch.nn.functional.conv_transpose2d` and am running into issues with my converter to the corresponding `tvm.relay.op.nn.conv2d_transpose` operation.

I've done a little monkey patching on the `PyTorchOpConverter`, as the operations that `torch.nn.functional.conv2d`/`conv_transpose2d` trace to (`aten::conv2d` and `aten::conv_transpose2d`) aren't covered by default. I've added functions to the `PyTorchOpConverter` to convert each one, so that I have access to `self.infer_shape(weight)` in the functions as follows:

**Converter implementation**

The implementations of the converters are adapted from `tvm.relay.frontend.pytorch.PyTorchOpConverter.convolution(inputs, input_types)` but updated to support the call signature of `torch.nn.functional.conv2d`/`conv_transpose2d`.

The problem I'm seeing is that `tvm.relay.op.nn.conv2d_transpose()` doesn't seem to respect the `groups` argument. When I print the inputs and outputs of the first four conv(_transpose) ops in my network, the PyTorch shapes are the following:

**PyTorch shapes**
While the TVM shapes are:

**TVM shapes**
Notice that the output shape of `tvm.relay.op.nn.conv2d_transpose()` does not have the correct number of channels (the output is as if `groups = 1`). This leads to the error in the next conv2d operation:

**Error traceback**
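For context on why the channel count comes out wrong: PyTorch's transposed-convolution weight layout is `(in_channels, out_channels // groups, kH, kW)`, so the true output-channel count is `weight.shape[1] * groups`, and an implementation that ignores `groups` produces only `weight.shape[1]` channels. A quick sketch of that arithmetic (the helper name is mine):

```python
def convt_out_channels(weight_shape, groups):
    # PyTorch conv_transpose2d weights are (C_in, C_out // groups, KH, KW),
    # so the total output channels are the second weight dim times groups.
    return weight_shape[1] * groups

# For example, weights of shape (2048, 256, 3, 3):
print(convt_out_channels((2048, 256, 3, 3), groups=4))  # 1024 channels
print(convt_out_channels((2048, 256, 3, 3), groups=1))  # 256: what a groups-ignoring path produces
```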
As a workaround, I've rewritten my conv_transpose2d converter to manually split the data and weights into groups, perform each transposed conv, and then concatenate them back. This converter does seem to give the correct output shape, although I haven't yet tested the outputs for correctness; I might have just gotten lucky with the shapes.
**Workaround converter implementation (manual grouping)**
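The idea behind the workaround, independent of the converter code: split the input channels and the weights into `groups` chunks, run an ordinary (ungrouped) transposed convolution on each chunk, and concatenate the results along the channel axis. A NumPy sketch of that equivalence, using a naive stride-1, no-padding transposed conv (all names here are mine, not Relay APIs):

```python
import numpy as np

def convt2d(x, w):
    # Naive single-group transposed conv, stride 1, no padding.
    # x: (C_in, H, W), w: (C_in, C_out, KH, KW) -> out: (C_out, H+KH-1, W+KW-1)
    c_in, h, wd = x.shape
    _, c_out, kh, kw = w.shape
    out = np.zeros((c_out, h + kh - 1, wd + kw - 1))
    for ci in range(c_in):
        for i in range(h):
            for j in range(wd):
                # each input pixel scatters a scaled copy of the kernel into the output
                out[:, i:i + kh, j:j + kw] += x[ci, i, j] * w[ci]
    return out

def grouped_convt2d(x, w, groups):
    # The workaround: split channels and weights, conv each group, concatenate.
    xs = np.split(x, groups, axis=0)
    ws = np.split(w, groups, axis=0)
    return np.concatenate([convt2d(xg, wg) for xg, wg in zip(xs, ws)], axis=0)
```

The split/concat result matches a single ungrouped transposed conv with a block-diagonal weight tensor, which is exactly what a grouped (transposed) convolution computes, so the shapes working out is not just luck as long as the per-group channel bookkeeping is right.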
**Expected behavior**

The `groups` argument of `tvm.relay.op.nn.conv2d_transpose` should work correctly, like `tvm.relay.op.nn.conv2d` does.

**Actual behavior**

The transposed convolution seems to only be applied to a single group.
**Environment**

- Ubuntu 20.04
- PyTorch 1.12.0.dev20220210
- TVM 0.9.dev525+g8aeb72265 (compiled from main a couple of hours ago)
- CUDA 11.4

**Steps to reproduce**