AssertionError: more than one group is unsupported on GPU #386

Open
zoeStartover opened this issue Jun 17, 2022 · 7 comments

@zoeStartover

I ran into an issue when using Conv2d in my model.

The assertion is as follows:
File "/home/data/anaconda3/anaconda/envs/mpc/lib/python3.7/site-packages/crypten/cuda/cuda_tensor.py", line 195, in __patched_conv_ops
), f"more than one group is unsupported on GPU (groups = {groups})"
AssertionError: more than one group is unsupported on GPU (groups = 256)

256 is the number of output channels of the Conv2d layer.
I define the layer as "self.conv3 = nn.Conv2d(128, 256, 5, 1, 2)".

How can I solve the problem?

@lvdmaaten
Member

Can you please provide a minimal repro so that I can reproduce this issue?

@sonnguyenasu

Hello,
I think the problem is with this line in the gradient calculation:

grad_kernel = input.conv2d(

Basically, on GPU you have to use groups=1, but this line in the conv2d gradient computation calls conv2d with groups > 1, which will cause this problem for every model that contains a convolution operation.
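
To make the connection concrete, here is a small plain-PyTorch sketch (not CrypTen's actual autograd code; the shapes and names are illustrative) of why the conv2d weight gradient is naturally expressed as a grouped convolution with groups > 1, which is exactly the kind of call that hits the GPU assertion:

import torch
import torch.nn.functional as F

B, Ci, Co, H, W, K = 2, 3, 4, 12, 12, 5
x = torch.randn(B, Ci, H, W, requires_grad=True)
w = torch.randn(Co, Ci, K, K, requires_grad=True)
grad_out = torch.randn(B, Co, H - K + 1, W - K + 1)

# Weight gradient via a grouped convolution: fold the batch into the channel
# dimension and convolve with groups = B * Ci (> 1 for any real model).
x_flat = x.detach().reshape(1, B * Ci, H, W)
g_flat = (
    grad_out.unsqueeze(1)
    .expand(B, Ci, Co, *grad_out.shape[2:])
    .reshape(B * Ci * Co, 1, *grad_out.shape[2:])
)
grad_w = (
    F.conv2d(x_flat, g_flat, groups=B * Ci)
    .reshape(B, Ci, Co, K, K)
    .sum(0)
    .permute(1, 0, 2, 3)
)

# Cross-check against autograd.
F.conv2d(x, w).backward(grad_out)
print(torch.allclose(grad_w, w.grad, atol=1e-4))  # True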

@zoeStartover
Author

Hello, I think the problem is with this line in the gradient calculation:

grad_kernel = input.conv2d(

Basically, on GPU you have to use groups=1, but this line in the conv2d gradient computation calls conv2d with groups > 1, which will cause this problem for every model that contains a convolution operation.

Unfortunately I no longer have the code snippet that caused this problem, so I didn't post it.
But I think you are right, thank you. Do you have any solution?

@Tobias512

Tobias512 commented Nov 24, 2022

Can you please provide a minimal repro so that I can reproduce this issue?

Hi,
I ran into the same issue. Here is some code that produces the error:

import torch
import torchvision
import crypten

class model_CNN(torch.nn.Module):
    def __init__(self):
        super(model_CNN, self).__init__()
        self.conv1 = torch.nn.Conv2d(1, 16, kernel_size=5, padding=0)
        self.fc1 = torch.nn.Linear(16 * 12 * 12, 100)
        self.fc2 = torch.nn.Linear(100, 10)
        
    def forward(self, x):
        out = self.conv1(x)
        out = torch.nn.functional.relu(out)
        out = torch.nn.functional.max_pool2d(out, 2)
        out = out.view(-1, 16 * 12 * 12)
        out = self.fc1(out)
        out = torch.nn.functional.relu(out)
        out = self.fc2(out)
        return out

if __name__ == "__main__":
    crypten.init()

    # load data
    train_data = torchvision.datasets.MNIST(root="./data", train=True, download=True, transform=torchvision.transforms.Compose([
         torchvision.transforms.ToTensor(),
         torchvision.transforms.Normalize((0.1307,), (0.3081,))
    ]))
    train_loader = torch.utils.data.DataLoader(train_data, batch_size=100, shuffle=False, pin_memory=True)
    data, labels = next(iter(train_loader))

    data_enc = crypten.cryptensor(data).cuda()
    label_enc = crypten.cryptensor(torch.nn.functional.one_hot(labels)).cuda()

    # load model
    model_plaintext = model_CNN()
    dummy_input = torch.empty(size=(1, 1, 28, 28))
    model = crypten.nn.from_pytorch(model_plaintext, dummy_input)
    model.encrypt()
    model.cuda()

    loss_fn = crypten.nn.CrossEntropyLoss()

    output = model(data_enc)
    loss = loss_fn(output, label_enc)
    loss.backward()  # AssertionError occurs during backward pass of Conv2D layer

The AssertionError occurs during the backward pass of the Conv2d layer in the model. Models without Conv2d layers work without problems.

@Tobias512

Hi,
I found a fix that makes the backward pass work. You need to change this function in crypten/cuda/cuda_tensor.py:

def __patched_conv_ops(op, x, y, *args, **kwargs):
        if "groups" in kwargs:
            groups = kwargs["groups"]
            assert (
                groups == 1
            ), f"more than one group is unsupported on GPU (groups = {groups})"
            del kwargs["groups"]

        bs, c, *img = x.size()
        c_out, c_in, *ks = y.size()
        kernel_elements = functools.reduce(operator.mul, ks)

        nb = 3 if kernel_elements < 256 else 4
        nb2 = nb**2

        x_encoded = CUDALongTensor.__encode_as_fp64(x, nb).data
        y_encoded = CUDALongTensor.__encode_as_fp64(y, nb).data

        repeat_idx = [1] * (x_encoded.dim() - 1)
        x_enc_span = x_encoded.repeat(nb, *repeat_idx)
        y_enc_span = torch.repeat_interleave(y_encoded, repeats=nb, dim=0)

        x_enc_span = x_enc_span.transpose_(0, 1).reshape(bs, nb2 * c, *img)
        y_enc_span = y_enc_span.reshape(nb2 * c_out, c_in, *ks)

        c_z = c_out if op in ["conv1d", "conv2d"] else c_in

        z_encoded = getattr(torch, op)(
            x_enc_span, y_enc_span, *args, **kwargs, groups=nb2
        )
        z_encoded = z_encoded.reshape(bs, nb2, c_z, *z_encoded.size()[2:]).transpose_(
            0, 1
        )
        return CUDALongTensor.__decode_as_int64(z_encoded, nb)

The function needs to be changed as follows:

def __patched_conv_ops(op, x, y, *args, **kwargs):
        if "groups" in kwargs:
            groups = kwargs["groups"]
            del kwargs["groups"]
        else:
            groups = 1

        bs, c, *img = x.size()
        c_out, c_in, *ks = y.size()
        kernel_elements = functools.reduce(operator.mul, ks)

        nb = 3 if kernel_elements < 256 else 4
        nb2 = nb**2

        x_encoded = CUDALongTensor.__encode_as_fp64(x, nb).data
        y_encoded = CUDALongTensor.__encode_as_fp64(y, nb).data

        repeat_idx = [1] * (x_encoded.dim() - 1)
        x_enc_span = x_encoded.repeat(nb, *repeat_idx)
        y_enc_span = torch.repeat_interleave(y_encoded, repeats=nb, dim=0)

        x_enc_span = x_enc_span.transpose_(0, 1).reshape(bs, nb2 * c, *img)
        y_enc_span = y_enc_span.reshape(nb2 * c_out, c_in, *ks)

        c_z = c_out if op in ["conv1d", "conv2d"] else c_in

        z_encoded = getattr(torch, op)(
            x_enc_span, y_enc_span, *args, **kwargs, groups=(nb2 * groups)
        )
        z_encoded = z_encoded.reshape(bs, nb2, c_z, *z_encoded.size()[2:]).transpose_(
            0, 1
        )
        return CUDALongTensor.__decode_as_int64(z_encoded, nb)

I successfully trained models with this fix using a GPU, so I am pretty sure the backpropagation is calculated correctly.
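
A quick way to sanity-check such a patch directly (a hedged sketch, assuming a CUDA device and that conv2d on a cryptensor forwards the groups keyword to the CUDA backend) is to compare an encrypted grouped convolution against its plaintext counterpart; the difference should be on the order of the fixed-point encoding error:

import torch
import crypten

crypten.init()

x = torch.randn(1, 4, 8, 8, device="cuda")
w = torch.randn(8, 2, 3, 3, device="cuda")  # groups=2: the 4 input channels are split into 2 groups

expected = torch.conv2d(x, w, groups=2)
actual = crypten.cryptensor(x).conv2d(w, groups=2).get_plain_text()
print((expected - actual).abs().max())  # should be close to zero (fixed-point error only)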

@kwmaeng91

Are there any updates on this? I am experiencing the same problem, and @Tobias512's solution does not work for me. It gives me a dimension mismatch, which I haven't looked into closely yet.

@Tobias512

I found a better solution by looking at CryptGPU. They implement it as follows:

https://github.com/jeffreysijuntan/CryptGPU/blob/2ff57b2b4d718f9665f4b2ac8245b0bcd7e65165/crypten/cuda/cuda_tensor.py#L183-L217

I changed the CrypTen implementation to:

@staticmethod
def __patched_conv_ops(op, x, y, *args, **kwargs):
        if "groups" in kwargs:
            groups = kwargs["groups"]
            #del kwargs["groups"]
        else:
            groups = 1

        bs, c, *img = x.size()
        c_out, c_in, *ks = y.size()
        kernel_elements = functools.reduce(operator.mul, ks)

        nb = 3 if kernel_elements < 256 else 4
        nb2 = nb**2

        x_encoded = CUDALongTensor.__encode_as_fp64(x, nb).data
        y_encoded = CUDALongTensor.__encode_as_fp64(y, nb).data

        repeat_idx = [1] * (x_encoded.dim() - 1)
        x_enc_span = x_encoded.repeat(nb, *repeat_idx)
        y_enc_span = torch.repeat_interleave(y_encoded, repeats=nb, dim=0)

        x_enc_span = x_enc_span.transpose_(0, 1).reshape(bs, nb2 * c, *img)
        y_enc_span = y_enc_span.reshape(nb2 * c_out, c_in, *ks)

        c_z = c_out if op in ["conv1d", "conv2d"] else c_in

        if "groups" in kwargs:
            kwargs["groups"] *= nb2
        else:
            kwargs["groups"] = nb2

        z_encoded = getattr(torch, op)(
            x_enc_span, y_enc_span, *args, **kwargs
        )

        groups = kwargs["groups"] // nb2 if op in ["conv_transpose1d", "conv_transpose2d"] else 1
        z_encoded = z_encoded.reshape(bs, nb2, c_z * groups, *z_encoded.size()[2:]).transpose_(
            0, 1
        )

        return CUDALongTensor.__decode_as_int64(z_encoded, nb)

It is nearly the same code as before, but the groups argument is set differently depending on whether groups is in kwargs, and the final reshape accounts for groups in the transposed-convolution case. @kwmaeng91 I hope this really fixes the bug.
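
That grouped transposed-convolution path (the case the c_z * groups reshape is there for, and which the earlier patch appears not to handle) is worth checking separately. A hedged sketch along the same lines as before, assuming a CUDA device and that conv_transpose2d on a cryptensor forwards the groups keyword:

import torch
import crypten

crypten.init()

x = torch.randn(1, 4, 8, 8, device="cuda")
# conv_transpose2d weight layout: (in_channels, out_channels // groups, kH, kW)
w = torch.randn(4, 2, 3, 3, device="cuda")  # groups=2 -> 4 output channels

expected = torch.conv_transpose2d(x, w, groups=2)
actual = crypten.cryptensor(x).conv_transpose2d(w, groups=2).get_plain_text()
print((expected - actual).abs().max())  # should be close to zero (fixed-point error only)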
