Add example to compare RELU with SELU #6990

Merged: 13 commits into keras-team:master on Jun 16, 2017

Conversation

zafarali
Contributor

SELU was added in #6924.
This PR adds a script to compare RELU and SELU performance in an MLP.

image

@fchollet
Member

So SELU is significantly worse than ReLU on this example?

@zafarali
Contributor Author

Yes, which is also what @bigsnarfdude sees in bigsnarfdude/SELU_Keras_Tutorial.

@tboquet
Contributor

tboquet commented Jun 14, 2017

In the paper they use more layers to show that a self-normalizing neural net can work better. I guess you should add more layers to fully test the paper's claim.

@zafarali
Contributor Author

So I repeated with:

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.noise import AlphaDropout

# Four Dense(512) blocks, each followed by SELU and AlphaDropout(0.5);
# max_words and num_classes are defined earlier in the example.
model_selu = Sequential()
model_selu.add(Dense(512, input_shape=(max_words,)))
model_selu.add(Activation('selu'))
model_selu.add(AlphaDropout(0.5))
model_selu.add(Dense(512))
model_selu.add(Activation('selu'))
model_selu.add(AlphaDropout(0.5))
model_selu.add(Dense(512))
model_selu.add(Activation('selu'))
model_selu.add(AlphaDropout(0.5))
model_selu.add(Dense(512))
model_selu.add(Activation('selu'))
model_selu.add(AlphaDropout(0.5))
model_selu.add(Dense(num_classes))
model_selu.add(Activation('softmax'))

and I got no appreciable improvement:

image

@fchollet
Member

In theory, normalization is only useful when you have very deep networks; it helps with gradient propagation. Try less dropout, more layers, and longer training. You could also use smaller layers, but more of them.

What's the likelihood that we're dealing with an implementation bug?
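
For illustration, a deeper but narrower SELU stack along those lines could look like the sketch below (depth, width, and dropout rate are illustrative values, not a recommendation; max_words and num_classes are assumed to be defined as in the example):

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.noise import AlphaDropout

n_hidden = 8   # more layers...
units = 64     # ...but smaller ones
model = Sequential()
model.add(Dense(units, input_shape=(max_words,)))
model.add(Activation('selu'))
model.add(AlphaDropout(0.1))   # lighter dropout than 0.5
for _ in range(n_hidden - 1):
    model.add(Dense(units))
    model.add(Activation('selu'))
    model.add(AlphaDropout(0.1))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])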

@drauh
Contributor

drauh commented Jun 14, 2017

In the paper the dropout rate range is 0.05-0.1, inputs are standardized, the kernel initializer is lecun_uniform, and the optimizer is plain SGD. Maybe this can help...
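
For reference, a minimal sketch of that setup (taking lecun_uniform and plain SGD from the comment above; standardizing the example's bag-of-words matrices is one assumption about what "inputs are standardized" would mean here):

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.noise import AlphaDropout
from keras.optimizers import SGD

# Standardize inputs to zero mean / unit variance.
mean, std = x_train.mean(axis=0), x_train.std(axis=0) + 1e-7
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std

model = Sequential()
model.add(Dense(16, input_shape=(max_words,), kernel_initializer='lecun_uniform'))
model.add(Activation('selu'))
model.add(AlphaDropout(0.05))   # dropout rate in the 0.05-0.1 range
model.add(Dense(num_classes, kernel_initializer='lecun_uniform'))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(),
              metrics=['accuracy'])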

@zafarali
Contributor Author

zafarali commented Jun 14, 2017 via email

@zafarali
Contributor Author

The reduced size helped with the overfitting issue.

The reduced dropout and lecun_normal initializer helped it converge very fast and perform better than the regular MLP. I will make a few more modifications and paste a graph here.

@fchollet
Member

Try the baseline network (relu) with both lecun_normal and the default glorot_uniform, so we can see the impact of the initializer alone.
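
Something like the sketch below would isolate the initializer while holding the rest of the baseline fixed (x_train, y_train, batch_size, epochs, max_words and num_classes are assumed to come from the example; the 512-unit layer mirrors the snippet earlier in this thread):

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

histories = {}
for init in ['glorot_uniform', 'lecun_normal']:
    # Identical ReLU baseline; only the kernel initializer changes.
    model = Sequential()
    model.add(Dense(512, input_shape=(max_words,), kernel_initializer=init))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, kernel_initializer=init))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    histories[init] = model.fit(x_train, y_train,
                                batch_size=batch_size,
                                epochs=epochs,
                                validation_split=0.1)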

@bigsnarfdude

bigsnarfdude commented Jun 15, 2017

The reduced size helped with the overfitting issue.

The reduced dropout and lecun_normal initializer helped it converge very fast and perform better than the regular MLP. 

The latest code with: kernel_initializer='lecun_normal'

Graph below:

image

@fchollet
Member

Looks like it's overfitting pretty badly; the SELU version starts overfitting at epoch 1. Maybe increase dropout or reduce layer size?

@zafarali
Contributor Author

zafarali commented Jun 15, 2017

Thanks @bigsnarfdude!

So here is a side-by-side comparison across some of the parameters:

Effect of initializer: glorot_uniform vs lecun_normal

4 dense layers, 16 units each
relu activation
dropout = 0.5
image

Effect of activation: relu vs selu

4 dense layers, 16 units each
dropout = 0.5
initializer=lecun_normal

image

Effect of dropout type: Dropout(0.5) vs AlphaDropout(0.5) with relu

4 dense layers, 16 units each
relu activation
initializer=lecun_normal

image

Effect of dropout type + rate: Dropout(0.5) vs AlphaDropout(0.05) with relu

4 dense layers, 16 units each
relu activation
initializer=lecun_normal

image

"Recommended" structure of SNNs: Dropout(0.5) vs AlphaDropout(0.05) with selu

4 dense layers, 16 units each
selu activation
initializer=lecun_normal

image

"Recommended" structure of SNNs: Dropout(0.5) vs AlphaDropout(0.1) with selu

4 dense layers, 16 units each
selu activation
initializer=lecun_normal

image

Effect of dropout rate on relu vs selu

4 dense layers, 16 units each
dropout = 0.5 for relu network
dropout = 0.1 for selu network
initializer=lecun_normal

image

Effect of Dropout type on selu networks

4 dense layers, 16 units each
selu activation function
initializer=lecun_normal

image

Effect of initializer: glorot_uniform vs lecun_normal

4 dense layers, 16 units each
selu activation
alphadropout = 0.1

image
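
Since all of the variants above differ in only one or two settings, they could be reproduced with a small parameterized builder like the hypothetical helper below (build_mlp is my name, not the PR's; max_words and num_classes are assumed from the example):

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers.noise import AlphaDropout


def build_mlp(activation='relu', dropout_layer=Dropout, dropout_rate=0.5,
              kernel_initializer='glorot_uniform', n_dense=4, dense_units=16):
    # Defaults: 4 dense layers of 16 units, as in the experiments above.
    model = Sequential()
    model.add(Dense(dense_units, input_shape=(max_words,),
                    kernel_initializer=kernel_initializer))
    model.add(Activation(activation))
    model.add(dropout_layer(dropout_rate))
    for _ in range(n_dense - 1):
        model.add(Dense(dense_units, kernel_initializer=kernel_initializer))
        model.add(Activation(activation))
        model.add(dropout_layer(dropout_rate))
    model.add(Dense(num_classes))
    model.add(Activation('softmax'))
    return model


# e.g. the "recommended" SNN variant vs. the plain ReLU baseline:
snn = build_mlp('selu', AlphaDropout, 0.05, 'lecun_normal')
baseline = build_mlp('relu', Dropout, 0.5, 'glorot_uniform')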

@zafarali
Contributor Author

Graph for 222f613:

image

It does seem like these components synergize well together, and naive implementations are prone to dismissing it.

I still think the net is overfitting. What do you think @bigsnarfdude @fchollet?

@zafarali
Contributor Author

@fchollet I think there might be some value in making this a command-line executable script with default arguments (using argparse). That way, someone who wants to quickly compare two architectures (SELU vs. RELU) can do:

python examples/reuters_mlp_with_selu.py -d 4 -h 16 -a1 relu -a2 selu -d1 dropout -d2 alphadropout -dr1 0.5 -dr2 0.05 -i1 glorot_uniform -i2 lecun_normal

This will make the above graphs reproducible. Opinions?
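
A rough sketch of what that CLI could look like (flag names are illustrative long options rather than the exact short flags above, partly because -h is already taken by argparse's built-in --help):

import argparse

parser = argparse.ArgumentParser(
    description='Compare two MLP configurations (e.g. RELU vs SELU) on Reuters.')
parser.add_argument('--n-dense', type=int, default=4)
parser.add_argument('--units', type=int, default=16)
parser.add_argument('--activation1', default='relu')
parser.add_argument('--activation2', default='selu')
parser.add_argument('--dropout1', default='dropout', choices=['dropout', 'alphadropout'])
parser.add_argument('--dropout2', default='alphadropout', choices=['dropout', 'alphadropout'])
parser.add_argument('--dropout-rate1', type=float, default=0.5)
parser.add_argument('--dropout-rate2', type=float, default=0.05)
parser.add_argument('--init1', default='glorot_uniform')
parser.add_argument('--init2', default='lecun_normal')
args = parser.parse_args()

# args.n_dense, args.dropout_rate1, etc. would then feed the two model builders.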

@fchollet
Member

fchollet commented Jun 15, 2017

Certainly, it is good to fully parameterize the models in order to be able to run different configurations by changing only one variable. But the advantages of command-line arguments over just editing global variables at the beginning of the file are slim. I'd recommend just having a list of global parameters with reasonable defaults.

We could even make layer depth a configurable parameter.
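
For example, a handful of module-level defaults at the top of the script would cover it (a sketch; names and values here are illustrative, not the merged example's):

# Tunable parameters, kept as globals with reasonable defaults.
N_DENSE = 6                 # layer depth, itself configurable as suggested
DENSE_UNITS = 16
DROPOUT_RATE = 0.1
KERNEL_INITIALIZER = 'lecun_normal'
OPTIMIZER = 'sgd'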

@bigsnarfdude

@zafarali

RE: Graph for 222f613

The SELU loss with kernel_initializer='lecun_normal' appears more consistent with the results I found using the TF code provided with the paper. The SELU function definitely requires AlphaDropout and the right kernel initializer to find the "magic".

@zafarali
Contributor Author

Commit 86e16ff:

image

Network 1 results
Hyperparameters: {'dropout': <class 'keras.layers.core.Dropout'>, 'kernel_initializer': 'glorot_uniform', 'dropout_rate': 0.5, 'n_dense': 6, 'dense_units': 16, 'activation': 'relu', 'optimizer': 'adam'}
Test score: 1.93495889508
Test accuracy: 0.567230632235
Network 2 results
Hyperparameters: {'dropout': <class 'keras.layers.noise.AlphaDropout'>, 'kernel_initializer': 'lecun_normal', 'dropout_rate': 0.1, 'n_dense': 6, 'dense_units': 16, 'activation': 'selu', 'optimizer': 'sgd'}
Test score: 1.75557634412
Test accuracy: 0.614425645645
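
For reference, this is roughly how hyperparameter dicts like the ones printed above can drive a single generic builder; treating create_network's signature as matching the dict keys (plus num_classes, per the docstring fragments reviewed below) is an assumption:

from keras.layers import Dropout
from keras.layers.noise import AlphaDropout

network1 = {'n_dense': 6, 'dense_units': 16, 'activation': 'relu',
            'dropout': Dropout, 'dropout_rate': 0.5,
            'kernel_initializer': 'glorot_uniform', 'optimizer': 'adam'}
network2 = {'n_dense': 6, 'dense_units': 16, 'activation': 'selu',
            'dropout': AlphaDropout, 'dropout_rate': 0.1,
            'kernel_initializer': 'lecun_normal', 'optimizer': 'sgd'}

# Unpack each dict into the shared builder so the two runs differ only
# in the listed hyperparameters.
model1 = create_network(num_classes=num_classes, **network1)
model2 = create_network(num_classes=num_classes, **network2)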


@tboquet tboquet left a comment


Now this makes sense! I think most of the examples export graphs as PNG; could you also follow this convention?
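
A minimal way to follow that convention with matplotlib (a sketch, assuming history1 and history2 are the History objects returned by the two model.fit calls):

import matplotlib
matplotlib.use('Agg')   # non-interactive backend so the script can run headless
import matplotlib.pyplot as plt

plt.plot(history1.history['val_loss'], label='network 1 (relu)')
plt.plot(history2.history['val_loss'], label='network 2 (selu)')
plt.xlabel('epoch')
plt.ylabel('validation loss')
plt.legend()
plt.savefig('comparison_of_networks.png')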

'dropout': AlphaDropout,
'dropout_rate': 0.1,
'kernel_initializer': 'lecun_normal',
'optimizer': 'sgd'
Member


Not sure how fair the comparison is if it uses a different optimizer and a different dropout rate...

Contributor Author

@zafarali zafarali Jun 15, 2017


I can set the optimizer to sgd in both.

What do you recommend we do with dropout? To be fair, they are not comparable one-to-one anyway (i.e. Dropout(0.5) ≠ AlphaDropout(0.5)).


@bigsnarfdude bigsnarfdude Jun 15, 2017


They performed a grid search, so I'm wondering whether it's possible to do apples-to-apples comparisons, as it looks like we are hand-tuning the SELU/SNN parameters.

Best performing SNNs have 8 layers, compared to the runner-ups ReLU networks with layer normalization with 2 and 3 layers ... we preferred settings with a higher number of layers, lower learning rates and higher dropout rates

Contributor Author


Changed to sgd

@@ -0,0 +1,153 @@
'''
Member


Please start file with a one-line description ending with a period.

kernel_initializer: str. the initializer for the weights
optimizer: str/keras.optimizers.Optimizer. the optimizer to use
num_classes: int > 0. the number of classes to predict
max_words: int > 0. the maximum number of words per data point
Member


Docstring needs a # Returns section
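
For reference, a sketch of the requested layout using the Keras docstring convention of # Arguments / # Returns sections (defaults and the return description are placeholders, not necessarily the example's):

from keras.models import Sequential

max_words = 1000   # placeholder so the snippet stands alone


def create_network(kernel_initializer='lecun_normal',
                   optimizer='adam',
                   num_classes=1,
                   max_words=max_words):
    """Generic function to create a fully-connected neural network.

    # Arguments
        kernel_initializer: str. the initializer for the weights
        optimizer: str/keras.optimizers.Optimizer. the optimizer to use
        num_classes: int > 0. the number of classes to predict
        max_words: int > 0. the maximum number of words per data point

    # Returns
        A Keras model instance.
    """
    model = Sequential()
    # ... layers as in the diff under review ...
    return model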

optimizer='adam',
num_classes=1,
max_words=max_words):
"""Generic function to create a fully connect neural network
Member


"fully-connected". One-line description must end with a period.



score_model1 = model1.evaluate(x_test, y_test,
batch_size=batch_size, verbose=1)
Member


One line per keyword argument (to avoid overly long lines), same below
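
I.e., just reflowing the call quoted above (model1, x_test, y_test and batch_size as in the script):

score_model1 = model1.evaluate(x_test, y_test,
                               batch_size=batch_size,
                               verbose=1)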

@fchollet fchollet merged commit 8d5b2ce into keras-team:master Jun 16, 2017
@antonmbk
Contributor

antonmbk commented Jul 15, 2017

@zafarali, @bigsnarfdude
Not sure if anyone will see this comment, but I'm curious: why wasn't use_bias=False set on all but the last Dense layer in the selu network? Isn't the bias unnecessary due to the self-normalization? I may be mistaken, but I wanted to ask if anyone knows the answer.
