More comparison with existing methods? #24

Closed
DonaldTsang opened this issue Jan 21, 2020 · 10 comments
Labels: question (Further information is requested)

Comments

@digantamisra98 added the "question" (Further information is requested) label on Jan 21, 2020
@digantamisra98 (Owner)

Hi. Thanks for raising the issue. Comparisons/benchmarks for Mish vs. Swish, GELU, and SELU have been presented in this paper/repository. PAU and X-Units are not on my priority list to compare against, but I can definitely run some experiments in the coming week. Additionally, xUnit is a block and not a function, so the more sensible approach would be to replace the non-linearity inside xUnit with Mish and compare the two variants.
Also, the other activations mentioned in the links you posted are not on my to-do list currently. Those activation functions are pretty exotic, and we usually compare against activation functions used in general practice. But if you'd like to benchmark against some of the activations in those lists that haven't been compared yet, feel free to report the results. Thanks!
I'll be closing the issue for now. Please re-open it at your own discretion.
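
As a rough illustration of the comparison suggested above, here is a minimal sketch, assuming PyTorch: Mish written as a drop-in module, plus a simplified xUnit-style gating block whose inner non-linearity can be swapped between ReLU and Mish. The `XUnitLike` wiring is only a hypothetical stand-in, not the exact architecture from the xUnit paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish(x) = x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

class XUnitLike(nn.Module):
    """Hypothetical gating block: only the inner activation changes
    between the two variants being compared."""
    def __init__(self, channels, act=None):
        super().__init__()
        self.act = act if act is not None else nn.ReLU()
        # depthwise conv producing the gating map
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)

    def forward(self, x):
        gate = torch.exp(-self.conv(self.act(x)) ** 2)  # Gaussian-style gate
        return x * gate

baseline = XUnitLike(64)              # ReLU inside the block
variant = XUnitLike(64, act=Mish())   # same block with Mish swapped in
```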

@DonaldTsang (Author) commented Jan 22, 2020

> Those activation functions are pretty exotic, and we usually compare against activation functions used in general practice.

It is their exotic nature that makes them interesting to compare against, as they may contain hidden information about what an optimal activation function should or should not look like. I would like to look into the issue as well.

For reference, one of the papers, "Searching for Activation Functions", has a GitHub repository at https://github.com/Neoanarika/Searching-for-activation-functions, and it might be possible to integrate that into the tests.

@digantamisra98 (Owner)

@DonaldTsang "Searching for Activation Functions" is the paper for Swish. All of my tests have compared Mish with Swish. What do you mean exactly by tests?

@DonaldTsang (Author) commented Jan 22, 2020

@digantamisra98 The paper itself does list other "exotic forms" (not Swish itself) in Table 2 that are not in the table in the README of Mish; I would assume that is due to differences in naming schemes? If it is not just a difference in naming schemes, and there are activation functions that could be integrated into the repo for benchmarks, that would be great.

@digantamisra98 (Owner)

@DonaldTsang The authors of that paper used a reinforcement learning algorithm to search the function space for the best possible non-linear function that qualifies as an activation function. Out of all the functions obtained in that search, Swish performed the best, and hence I used Swish as the comparison benchmark against Mish rather than the other activations the algorithm found in that paper.

@DonaldTsang (Author) commented Jan 22, 2020

@digantamisra98 So the other functions in https://github.com/Neoanarika/Searching-for-activation-functions/blob/master/src/rnn_controller.py#L22 might not be as useful or as common, but still worth exploring, I would assume? Or are you saying that the activation functions listed in the paper itself are "filler"?

@digantamisra98 (Owner)

@DonaldTsang The other activations found by the search in that paper were not as efficient as Swish.
Quoting from the paper's abstract itself:

> Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, f(x) = x · sigmoid(βx), which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets.
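
For concreteness, here is a minimal sketch of the quoted formula f(x) = x · sigmoid(βx) next to Mish, assuming PyTorch; β is treated as a fixed constant here, although the paper also considers a trainable β.

```python
import torch
import torch.nn.functional as F

def swish(x, beta=1.0):
    # Swish as quoted above: x * sigmoid(beta * x)
    return x * torch.sigmoid(beta * x)

def mish(x):
    # Mish for side-by-side reference: x * tanh(softplus(x))
    return x * torch.tanh(F.softplus(x))

x = torch.linspace(-5.0, 5.0, steps=11)
print(swish(x))
print(mish(x))
```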

@DonaldTsang (Author) commented Jan 22, 2020

But would they at least have some historical significance, as some are "close calls", e.g. (atan(x))**2 − x and cos(x) − x?
For the record though, max(x, tanh(x)) is basically ISRLU with tanh instead of ISRU, where tanh could also be switched out for atan or softsign.
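
As a rough numerical check of the comparison above, assuming PyTorch: max(x, tanh(x)) equals x for x ≥ 0 and tanh(x) for x < 0, i.e. an ISRLU-shaped curve whose negative branch is tanh(x) instead of the ISRU term x / sqrt(1 + a·x²).

```python
import torch

def isrlu(x, a=1.0):
    # identity for x >= 0, ISRU branch for x < 0
    return torch.where(x >= 0, x, x / torch.sqrt(1.0 + a * x * x))

def max_x_tanh(x):
    # identity for x >= 0, tanh branch for x < 0
    return torch.maximum(x, torch.tanh(x))

x = torch.linspace(-5.0, 5.0, steps=11)
print(isrlu(x))
print(max_x_tanh(x))
```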

@digantamisra98 (Owner)

@DonaldTsang My current work with Mish is more about Mean Field Theory: finding the Edge of Chaos and the Rate of Convergence for Mish. These are more relevant since they will help us understand more about what an ideal activation function looks like.
