
Is SeLU alone having positive impact on accuracy? #5

Closed
jczaja opened this issue Jul 17, 2017 · 6 comments


jczaja commented Jul 17, 2017

Hi,

In the MNIST and CIFAR-10 tutorials, SELU is used together with alpha dropout, and the outcome of the experiments is that the SNN outperforms the ReLU- and ELU-based models. MNIST models such as LeNet can reach quite good accuracy without dropout or batch norm, so my question is: according to your observations, does SELU alone (no dropout, no batch norm) increase accuracy? What I mean is: I have a basic CNN that works on MNIST (convolutions, ReLU, fully connected layers, softmax). Assuming the weights are initialized and the input is normalized correctly, can I expect increased accuracy from switching to SELU?
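For concreteness, the baseline I mean and the change I am considering would look roughly like this (a hypothetical Keras-style sketch, not my actual code; the layer sizes are placeholders):

```python
import tensorflow as tf

def build_model(activation="relu", initializer="glorot_uniform"):
    # Basic CNN: convolutions, activation, fully connected, softmax
    # (no dropout, no batch norm); sizes are placeholders
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 5, activation=activation,
                               kernel_initializer=initializer,
                               input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 5, activation=activation,
                               kernel_initializer=initializer),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation=activation,
                              kernel_initializer=initializer),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

relu_model = build_model()                          # the model that already works
selu_model = build_model("selu", "lecun_normal")    # SELU alone, nothing else changed
```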

jczaja changed the title from "Is SeLU alone also" to "Is SeLU alone having positive impact on accuracy?" on Jul 17, 2017

gklambauer commented Jul 17, 2017

Hello jczaja,

No, just adding SELUs (with the corresponding normalization) will in general not improve the accuracy of your CNNs -- in fact, we were even a bit surprised that SELUs also work well for CNNs (not only for FNNs). Especially if you have a "working model" developed with ReLUs, you cannot expect it to work as well or even better with SELUs.

There are multiple reasons why this is the case:

  • The architectures were developed/optimized for ReLUs and are therefore biased towards ReLUs. SELUs can code more information and could lead to an overfitting network. Typically, architectures such as LeNet went through a large optimization process, and it would be a coincidence if the selected architectures also worked well with SELUs.
  • Convolutional and max-pooling layers could have effects that cannot be countered by SELUs alone.
  • Hyperparameters such as learning rates, dropout rates, regularization parameters, etc. were also optimized for ReLU networks and are therefore biased towards ReLUs.

That being said, there have been quite a number of successes where we just exchanged the activation function (and the initialization and dropout) and ended up with improved networks, e.g. SqueezeNet, the CIFAR-10 example in this repository, and some unnamed/in-house CNNs for biological data. This means that your strategy is definitely a possible way to go...
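Concretely, the exchange we mean is roughly the following (a Keras-style sketch, not the exact code of those examples; the layer size and dropout rates are placeholders):

```python
import tensorflow as tf

# A ReLU block as it typically appears in the original architectures ...
relu_block = [
    tf.keras.layers.Dense(256, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
]

# ... is exchanged for a SELU block: SELU activation, "lecun_normal"
# initialization, and alpha dropout instead of standard dropout
selu_block = [
    tf.keras.layers.Dense(256, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.AlphaDropout(0.05),
]
```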

Regards,
Günter


jczaja commented Jul 17, 2017

Hi gklambauer,

You mentioned that "Convolutional and max-pooling layers could have effects that cannot be countered by SELUs alone."

I understand that max-pooling changes the variance of the signal (increasing it) and may counter the normalization that SELU performs, since SELU normalizes the signal iteratively. But what are the problems with convolutions in terms of the normalization performed by SELU?
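For example, a quick numerical check of what I mean (a numpy sketch with standard-normal inputs and 2x2 pooling windows, just to see the statistics move away from zero mean / unit variance):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each row stands in for one 2x2 max-pooling window over a zero-mean, unit-variance signal
windows = rng.standard_normal((100_000, 4))
pooled = windows.max(axis=1)

print(windows.mean(), windows.var())   # ~0.0, ~1.0
print(pooled.mean(), pooled.var())     # statistics move away from the (0, 1) fixed point
```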


gklambauer commented Jul 17, 2017

Conv layers could be problematic for the central limit theorem, since only a few inputs are summed. However, those are "averaged" across the positions of the image/feature map, which could be beneficial again.
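A toy illustration of that point (a numpy sketch; the input distribution and fan-ins are arbitrary choices, not taken from any of our networks):

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(z):
    # 0 for a Gaussian; further from 0 means less Gaussian
    z = (z - z.mean()) / z.std()
    return (z ** 4).mean() - 3.0

# Weighted sums of k non-Gaussian inputs (zero-mean exponential as a stand-in for activations)
for k in (9, 1152):   # e.g. a 3x3x1 vs. a 3x3x128 receptive field
    x = rng.exponential(size=(100_000, k)) - 1.0
    w = rng.standard_normal(k) / np.sqrt(k)
    print(k, excess_kurtosis(x @ w))   # closer to 0 for larger k (central limit theorem)
```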


jczaja commented Jul 18, 2017

I have a question about the normalization of the input: is it really needed? I understand that making the input zero-mean is important, but is Var(x) = 1 needed? Don't Theorems 2 and 3 already give some bounds on the signal, provided that the input is zero-mean and the weights are initialized properly?


gklambauer commented Jul 18, 2017

Yes, as you stated, in theory, after a couple of fully-connected layers, the variance goes to one anyway. However, empirically, we found that scaling inputs to unit variance helps the network to learn faster. If you are thinking about ConvNets... there we typically use a global mean and variance for input normalization (as in our example https://github.com/bioinf-jku/SNNs/blob/master/SelfNormalizingNetworks_CNN_CIFAR10.ipynb).
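I.e., roughly the following (a numpy sketch, with random arrays standing in for the actual image data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Random arrays standing in for the CIFAR-10 images (the real example loads the dataset instead)
x_train = rng.integers(0, 256, size=(1000, 32, 32, 3)).astype(np.float32)
x_test = rng.integers(0, 256, size=(200, 32, 32, 3)).astype(np.float32)

# One global mean/std over all pixels and channels of the training set
mean, std = x_train.mean(), x_train.std()
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std   # reuse the training-set statistics for the test set
```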


jczaja commented Jul 18, 2017

Thanks very much for your work on SNNs and for the very detailed answers to my questions.

Regards,
Jacek
