After adding a self-implemented layer normalization, the backward time of gradient_penalty became much larger #10

Closed
santisy opened this issue Jun 14, 2017 · 5 comments

santisy commented Jun 14, 2017

My implementation of layer-normalization is:

import torch
import torch.nn as nn
from torch.nn import Parameter


class Layer_Norm(nn.Module):

    def __init__(self, dim):
        super(Layer_Norm, self).__init__()
        self.dim = dim
        # Learnable gain and bias, initialized to 1 and 0 in init_weights().
        self.g = Parameter(torch.zeros(1, dim))
        self.b = Parameter(torch.zeros(1, dim))
        self.init_weights()

    def forward(self, input):
        # Per-sample mean over the feature dimension.
        miu = torch.sum(input, 1).unsqueeze(1) / self.dim
        input_minus_miu = input - miu.expand_as(input)
        # Per-sample standard deviation (no epsilon added).
        sigma = (torch.sum(input_minus_miu.pow(2), 1) / self.dim).sqrt().unsqueeze(1)
        # Normalize, then apply the learnable gain and bias.
        input = input_minus_miu * self.g.expand(input.size()) / sigma.expand_as(input) \
            + self.b.expand(input.size())

        return input

    def init_weights(self):
        self.g.data.fill_(1)
        self.b.data.fill_(0)

After plugging this in before ReLU, the backward pass of gradient_penalty slowed from 0.0075 s to 0.1149 s.

I compiled the source code from the master branch, commit deb0aef30cdaa78f9840bfa4a919ad206e8e73a7, and also modified the ReLU source code before compiling, following your instructions.
I am wondering whether this is because my implementation of layer normalization contains something not suitable for double backward?
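
For context, the timing in question is the backward pass through a WGAN-GP style gradient penalty with Layer_Norm plugged in before ReLU. A minimal sketch of such a setup, using a toy critic, arbitrary sizes, and the current torch.autograd API rather than the exact code of this repository:

import time
import torch
import torch.nn as nn

dim, batch = 512, 64
# Toy critic with Layer_Norm plugged in before ReLU.
critic = nn.Sequential(nn.Linear(dim, dim), Layer_Norm(dim), nn.ReLU(), nn.Linear(dim, 1))

real = torch.randn(batch, dim)
fake = torch.randn(batch, dim)
alpha = torch.rand(batch, 1)
interpolates = (alpha * real + (1 - alpha) * fake).requires_grad_(True)

scores = critic(interpolates)
# First backward with create_graph=True, so the penalty itself is differentiable.
grads = torch.autograd.grad(scores.sum(), interpolates, create_graph=True)[0]
gradient_penalty = ((grads.norm(2, dim=1) - 1) ** 2).mean()

start = time.time()
gradient_penalty.backward()  # this second backward is the part being timed
print('gradient_penalty backward: %.4fs' % (time.time() - start))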

caogang commented Jun 15, 2017

Have you confirmed that the value is wrong? You could use gradgradcheck to test your double backward instead of just looking at the value. gradgradcheck will soon be added in PR pytorch/pytorch#1643; you can write a check similar to the one in that PR.
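
For reference, a minimal sketch of such a check, written against torch.autograd.gradgradcheck as it later landed (the double-precision input and sizes below are arbitrary assumptions):

import torch
from torch.autograd import gradgradcheck

dim = 8
layer = Layer_Norm(dim).double()  # numerical gradient checks want float64
x = torch.randn(4, dim, dtype=torch.float64, requires_grad=True)

# Returns True only if both the backward and the double backward of the
# module are numerically correct (within tolerance).
print(gradgradcheck(layer, (x,)))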

santisy commented Jun 15, 2017

@caogang The result seems reasonable, but the time seems a bit too long. I'll test it with gradgradcheck later. Thanks for your reply.

caogang commented Jun 15, 2017

Oh, so your problem is the larger time cost after plugging in your Layer_Norm module? My intuition is that it may be caused by the expand or expand_as calls. Maybe you can test that.
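
For what it's worth, one way to test that hypothesis is to time a double backward through Layer_Norm in isolation and see how much of the 0.1149 s it accounts for. A minimal sketch, assuming arbitrary sizes and the current torch.autograd API:

import time
import torch

dim, batch = 512, 64
layer = Layer_Norm(dim)
x = torch.randn(batch, dim, requires_grad=True)

# First backward with create_graph=True, so the gradient itself is differentiable.
out = layer(x).sum()
grad_x, = torch.autograd.grad(out, x, create_graph=True)

# Time only the second backward, which differentiates through the
# expand/expand_as graph built above.
start = time.time()
grad_x.norm().backward()
print('double backward through Layer_Norm alone: %.4fs' % (time.time() - start))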

santisy commented Jun 15, 2017

@caogang Is there another way to get the same behavior as expand and expand_as here? The broadcasting mechanism still seems to have some problems (pytorch/pytorch#1787), and repeat costs even more, about 0.3 s.

caogang commented Jun 15, 2017

Yeah, using expand is probably the best method on the current branch. Just wait for the broadcasting feature to be merged, or you can contribute to the above PR. :)
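
For reference, once broadcasting is available the forward no longer needs expand or expand_as at all. A sketch of how it could then look, assuming broadcasting and the keepdim argument as they later landed in PyTorch:

    def forward(self, input):
        # The (N, 1) statistics broadcast automatically against the (N, dim) input.
        mu = input.mean(1, keepdim=True)
        sigma = (input - mu).pow(2).mean(1, keepdim=True).sqrt()
        # self.g and self.b have shape (1, dim) and broadcast as well.
        return (input - mu) * self.g / sigma + self.b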

caogang closed this as completed on Sep 13, 2017.