Add example to compare RELU with SELU #6990

Merged: 13 commits into keras-team:master on Jun 16, 2017

Conversation

zafarali
Contributor

SELU was added in #6924.
This PR adds a script to compare RELU and SELU performance in an MLP.

image

@fchollet
Member

So SELU is significantly worse than ReLU on this example?

@zafarali
Contributor Author

Yes, which is also what @bigsnarfdude sees in bigsnarfdude/SELU_Keras_Tutorial.

@tboquet
Contributor

tboquet commented Jun 14, 2017

In the paper they use more layers to show that a self-normalizing neural net can work better. I guess you should add more layers to fully test the paper's claim.

@zafarali
Contributor Author

So I repeated with:

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.noise import AlphaDropout

# Four Dense(512) blocks, each followed by SELU and AlphaDropout(0.5);
# max_words and num_classes are defined earlier in the example.
model_selu = Sequential()
model_selu.add(Dense(512, input_shape=(max_words,)))
model_selu.add(Activation('selu'))
model_selu.add(AlphaDropout(0.5))
model_selu.add(Dense(512))
model_selu.add(Activation('selu'))
model_selu.add(AlphaDropout(0.5))
model_selu.add(Dense(512))
model_selu.add(Activation('selu'))
model_selu.add(AlphaDropout(0.5))
model_selu.add(Dense(512))
model_selu.add(Activation('selu'))
model_selu.add(AlphaDropout(0.5))
model_selu.add(Dense(num_classes))
model_selu.add(Activation('softmax'))

and I got no appreciable improvement:

image

@fchollet
Member

In theory, normalization is only useful when you have very deep networks; it helps with gradient propagation. Try less dropout, more layers, and longer training. You could also use smaller layers, but more of them.

What's the likelihood that we're dealing with an implementation bug?
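
For illustration, a deeper but narrower SELU stack along those lines could look like the sketch below (depth, width, and dropout rate are illustrative values, not a recommendation; max_words and num_classes are assumed to be defined as in the example):

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.noise import AlphaDropout

n_hidden = 8   # more layers...
units = 64     # ...but smaller ones
model = Sequential()
model.add(Dense(units, input_shape=(max_words,)))
model.add(Activation('selu'))
model.add(AlphaDropout(0.1))   # lighter dropout than 0.5
for _ in range(n_hidden - 1):
    model.add(Dense(units))
    model.add(Activation('selu'))
    model.add(AlphaDropout(0.1))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])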

@drauh
Contributor

drauh commented Jun 14, 2017

In the paper the dropout rate range is 0.05-0.1, inputs are standardized, the kernel initializer is lecun_uniform, and the optimizer is plain SGD. Maybe this can help...
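
For reference, a minimal sketch of that setup (taking lecun_uniform and plain SGD from the comment above; standardizing the example's bag-of-words matrices is one assumption about what "inputs are standardized" would mean here):

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.noise import AlphaDropout
from keras.optimizers import SGD

# Standardize inputs to zero mean / unit variance.
mean, std = x_train.mean(axis=0), x_train.std(axis=0) + 1e-7
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std

model = Sequential()
model.add(Dense(16, input_shape=(max_words,), kernel_initializer='lecun_uniform'))
model.add(Activation('selu'))
model.add(AlphaDropout(0.05))   # dropout rate in the 0.05-0.1 range
model.add(Dense(num_classes, kernel_initializer='lecun_uniform'))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(),
              metrics=['accuracy'])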

@zafarali
Contributor Author

zafarali commented Jun 14, 2017 via email

@zafarali
Contributor Author

The reduced size helped with the overfitting issue.

The reduced dropout and lecun_normal initializer helped it converge very fast and perform better than the regular MLP. I will make a few more modifications and paste a graph here.

@fchollet
Member

Try the baseline network (relu) with both lecun_normal and the default glorot_uniform, so we can see the impact of the initializer alone.
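
Something like the sketch below would isolate the initializer while holding the rest of the baseline fixed (x_train, y_train, batch_size, epochs, max_words and num_classes are assumed to come from the example; the 512-unit layer mirrors the snippet earlier in this thread):

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

histories = {}
for init in ['glorot_uniform', 'lecun_normal']:
    # Identical ReLU baseline; only the kernel initializer changes.
    model = Sequential()
    model.add(Dense(512, input_shape=(max_words,), kernel_initializer=init))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, kernel_initializer=init))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    histories[init] = model.fit(x_train, y_train,
                                batch_size=batch_size,
                                epochs=epochs,
                                validation_split=0.1)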

@bigsnarfdude

bigsnarfdude commented Jun 15, 2017

The reduced size helped with the overfitting issue.

The reduced dropout and lecun_normal initializer helped it converge very fast and perform better than the regular MLP. 

The latest code with: kernel_initializer='lecun_normal'

Graph below:

image

@fchollet
Member

Looks like it's overfitting pretty badly; the SELU version starts overfitting at epoch 1. Maybe increase dropout or reduce layer size?

@zafarali
Contributor Author

zafarali commented Jun 15, 2017

Thanks @bigsnarfdude!

So here is a side-by-side comparison across some of the parameters:

Effect of initializer: glorot_uniform vs lecun_normal

4 dense layers, 16 units each
relu activation
dropout = 0.5
image

Effect of activation: relu vs selu

4 dense layers, 16 units each
dropout = 0.5
initializer=lecun_normal

image

Effect of dropout type: Dropout(0.5) vs AlphaDropout(0.5) with relu

4 dense layers, 16 units each
relu activation
initializer=lecun_normal

image

Effect of dropout type + rate: Dropout(0.5) vs AlphaDropout(0.05) with relu

4 dense layers, 16 units each
relu activation
initializer=lecun_normal

image

"Recommended" structure of SNNs: Dropout(0.5) vs AlphaDropout(0.05) with selu

4 dense layers, 16 units each
selu activation
initializer=lecun_normal

image

"Recommended" structure of SNNs: Dropout(0.5) vs AlphaDropout(0.1) with selu

4 dense layers, 16 units each
selu activation
initializer=lecun_normal

image

Effect of dropout rate on relu vs selu

4 dense layers, 16 units each
dropout = 0.5 for relu network
dropout = 0.1 for selu network
initializer=lecun_normal

image

Effect of Dropout type on selu networks

4 dense layers, 16 units each
selu activation function
initializer=lecun_normal

image

Effect of initializer: glorot_uniform vs lecun_normal

4 dense layers, 16 units each
selu activation
alphadropout = 0.1

image
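
Since all of the variants above differ in only one or two settings, they could be reproduced with a small parameterized builder like the hypothetical helper below (build_mlp is my name, not the PR's; max_words and num_classes are assumed from the example):

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers.noise import AlphaDropout


def build_mlp(activation='relu', dropout_layer=Dropout, dropout_rate=0.5,
              kernel_initializer='glorot_uniform', n_dense=4, dense_units=16):
    # Defaults: 4 dense layers of 16 units, as in the experiments above.
    model = Sequential()
    model.add(Dense(dense_units, input_shape=(max_words,),
                    kernel_initializer=kernel_initializer))
    model.add(Activation(activation))
    model.add(dropout_layer(dropout_rate))
    for _ in range(n_dense - 1):
        model.add(Dense(dense_units, kernel_initializer=kernel_initializer))
        model.add(Activation(activation))
        model.add(dropout_layer(dropout_rate))
    model.add(Dense(num_classes))
    model.add(Activation('softmax'))
    return model


# e.g. the "recommended" SNN variant vs. the plain ReLU baseline:
snn = build_mlp('selu', AlphaDropout, 0.05, 'lecun_normal')
baseline = build_mlp('relu', Dropout, 0.5, 'glorot_uniform')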

@zafarali
Contributor Author

Graph for 222f613:

image

It does seem like these components synergize well together, and naive implementations are prone to dismissing it.

I still think the net is overfitting. What do you think @bigsnarfdude @fchollet?

@zafarali
Contributor Author

@fchollet I think there might be some value in making this a command-line executable script with default arguments (using argparse). That way, someone who wants to quickly compare two architectures (SELU vs. RELU) can do:

python examples/reuters_mlp_with_selu.py -d 4 -h 16 -a1 relu -a2 selu -d1 dropout -d2 alphadropout -dr1 0.5 -dr2 0.05 -i1 glorot_uniform -i2 lecun_normal

This will make the above graphs reproducible. Opinions?
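
A rough sketch of what that CLI could look like (flag names are illustrative long options rather than the exact short flags above, partly because -h is already taken by argparse's built-in --help):

import argparse

parser = argparse.ArgumentParser(
    description='Compare two MLP configurations (e.g. RELU vs SELU) on Reuters.')
parser.add_argument('--n-dense', type=int, default=4)
parser.add_argument('--units', type=int, default=16)
parser.add_argument('--activation1', default='relu')
parser.add_argument('--activation2', default='selu')
parser.add_argument('--dropout1', default='dropout', choices=['dropout', 'alphadropout'])
parser.add_argument('--dropout2', default='alphadropout', choices=['dropout', 'alphadropout'])
parser.add_argument('--dropout-rate1', type=float, default=0.5)
parser.add_argument('--dropout-rate2', type=float, default=0.05)
parser.add_argument('--init1', default='glorot_uniform')
parser.add_argument('--init2', default='lecun_normal')
args = parser.parse_args()

# args.n_dense, args.dropout_rate1, etc. would then feed the two model builders.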

@fchollet
Member

fchollet commented Jun 15, 2017

Certainly, it is good to fully parameterize the models in order to be able to run different configurations by changing only one variable. But the advantages of command-line arguments over just editing global variables at the beginning of the file are slim. I'd recommend just having a list of global parameters with reasonable defaults.

We could even make layer depth a configurable parameter.
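
For example, a handful of module-level defaults at the top of the script would cover it (a sketch; names and values here are illustrative, not the merged example's):

# Tunable parameters, kept as globals with reasonable defaults.
N_DENSE = 6                 # layer depth, itself configurable as suggested
DENSE_UNITS = 16
DROPOUT_RATE = 0.1
KERNEL_INITIALIZER = 'lecun_normal'
OPTIMIZER = 'sgd'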

@bigsnarfdude

@zafarali

RE: Graph for 222f613

The SELU loss with kernel_initializer='lecun_normal' appears more consistent with the results I found using the TF code provided with the paper. The SELU function definitely requires AlphaDropout and the right kernel initializer to find the "magic".

@zafarali
Contributor Author

Commit 86e16ff:

image

Network 1 results
Hyperparameters: {'dropout': <class 'keras.layers.core.Dropout'>, 'kernel_initializer': 'glorot_uniform', 'dropout_rate': 0.5, 'n_dense': 6, 'dense_units': 16, 'activation': 'relu', 'optimizer': 'adam'}
Test score: 1.93495889508
Test accuracy: 0.567230632235
Network 2 results
Hyperparameters: {'dropout': <class 'keras.layers.noise.AlphaDropout'>, 'kernel_initializer': 'lecun_normal', 'dropout_rate': 0.1, 'n_dense': 6, 'dense_units': 16, 'activation': 'selu', 'optimizer': 'sgd'}
Test score: 1.75557634412
Test accuracy: 0.614425645645
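
For reference, this is roughly how hyperparameter dicts like the ones printed above can drive a single generic builder; treating create_network's signature as matching the dict keys (plus num_classes, per the docstring fragments reviewed below) is an assumption:

from keras.layers import Dropout
from keras.layers.noise import AlphaDropout

network1 = {'n_dense': 6, 'dense_units': 16, 'activation': 'relu',
            'dropout': Dropout, 'dropout_rate': 0.5,
            'kernel_initializer': 'glorot_uniform', 'optimizer': 'adam'}
network2 = {'n_dense': 6, 'dense_units': 16, 'activation': 'selu',
            'dropout': AlphaDropout, 'dropout_rate': 0.1,
            'kernel_initializer': 'lecun_normal', 'optimizer': 'sgd'}

# Unpack each dict into the shared builder so the two runs differ only
# in the listed hyperparameters.
model1 = create_network(num_classes=num_classes, **network1)
model2 = create_network(num_classes=num_classes, **network2)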


@tboquet tboquet left a comment


Now this makes sense! I think most of the examples export graphs as PNG; could you also follow this convention?
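
A minimal way to follow that convention with matplotlib (a sketch, assuming history1 and history2 are the History objects returned by the two model.fit calls):

import matplotlib
matplotlib.use('Agg')   # non-interactive backend so the script can run headless
import matplotlib.pyplot as plt

plt.plot(history1.history['val_loss'], label='network 1 (relu)')
plt.plot(history2.history['val_loss'], label='network 2 (selu)')
plt.xlabel('epoch')
plt.ylabel('validation loss')
plt.legend()
plt.savefig('comparison_of_networks.png')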

'dropout': AlphaDropout,
'dropout_rate': 0.1,
'kernel_initializer': 'lecun_normal',
'optimizer': 'sgd'
Member


Not sure how fair the comparison is if it uses a different optimizer and a different dropout rate...

Contributor Author

@zafarali zafarali Jun 15, 2017


I can set the optimizer to sgd in both.

What do you recommend we do with dropout? To be fair, they are not comparable one-to-one anyway (i.e. Dropout(0.5) ≠ AlphaDropout(0.5)).


@bigsnarfdude bigsnarfdude Jun 15, 2017


They performed a grid search, so I'm wondering whether it's possible to do apples-to-apples comparisons, as it looks like we are hand-tuning the SELU/SNN parameters.

Best performing SNNs have 8 layers, compared to the runner-ups ReLU networks with layer normalization with 2 and 3 layers ... we preferred settings with a higher number of layers, lower learning rates and higher dropout rates

Contributor Author


Changed to sgd

@@ -0,0 +1,153 @@
'''
Member


Please start file with a one-line description ending with a period.

kernel_initializer: str. the initializer for the weights
optimizer: str/keras.optimizers.Optimizer. the optimizer to use
num_classes: int > 0. the number of classes to predict
max_words: int > 0. the maximum number of words per data point
Member


Docstring needs a # Returns section
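
For reference, a sketch of the requested layout using the Keras docstring convention of # Arguments / # Returns sections (defaults and the return description are placeholders, not necessarily the example's):

from keras.models import Sequential

max_words = 1000   # placeholder so the snippet stands alone


def create_network(kernel_initializer='lecun_normal',
                   optimizer='adam',
                   num_classes=1,
                   max_words=max_words):
    """Generic function to create a fully-connected neural network.

    # Arguments
        kernel_initializer: str. the initializer for the weights
        optimizer: str/keras.optimizers.Optimizer. the optimizer to use
        num_classes: int > 0. the number of classes to predict
        max_words: int > 0. the maximum number of words per data point

    # Returns
        A Keras model instance.
    """
    model = Sequential()
    # ... layers as in the diff under review ...
    return model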

optimizer='adam',
num_classes=1,
max_words=max_words):
"""Generic function to create a fully connect neural network
Member


"fully-connected". One-line description must end with a period.



score_model1 = model1.evaluate(x_test, y_test,
batch_size=batch_size, verbose=1)
Member


One line per keyword argument (to avoid overly long lines), same below
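
I.e., just reflowing the call quoted above (model1, x_test, y_test and batch_size as in the script):

score_model1 = model1.evaluate(x_test, y_test,
                               batch_size=batch_size,
                               verbose=1)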

@fchollet fchollet merged commit 8d5b2ce into keras-team:master Jun 16, 2017
@antonmbk
Contributor

antonmbk commented Jul 15, 2017

@zafarali, @bigsnarfdude
Not sure if anyone will see this comment, but I'm curious: why wasn't use_bias=False set on all but the last Dense layer in the selu network? Isn't the bias unnecessary due to the self-normalization? I may be mistaken, but I wanted to ask if anyone knows the answer.
