Swish-1 and GELU activation for transformers #229

Merged: 3 commits merged into master from swish1 on Dec 6, 2017

Conversation

@mjdenkowski (Contributor) commented Nov 30, 2017:

This adds support for two activation types to transformers: Swish-1 and GELU.

The paper introducing Swish-1 reports that it beats ReLU on transformer models. The paper introducing GELU reports that GELU beats Swish-1, though that result was not explicitly tested on transformers.

Edit: If we want to avoid a bunch of sequential major version bumps, this can wait until #226 is ready to go and we can have a single bump for all transformer updates.
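
For context, here is a minimal sketch of how these two activations can be composed from existing MXNet symbol operators, following the Symbol-in/Symbol-out interface visible in the reviewed excerpt further down. The function name is illustrative and the tanh approximation for GELU is an assumption; the merged implementation may differ in detail.

```python
import math

import mxnet as mx


def activation(data: mx.sym.Symbol, act_type: str) -> mx.sym.Symbol:
    """Sketch: apply the named activation to a Symbol (output has the same shape as the input)."""
    if act_type == "relu":
        return mx.sym.Activation(data=data, act_type="relu")
    elif act_type == "swish1":
        # Swish-1: x * sigmoid(x)
        return data * mx.sym.Activation(data=data, act_type="sigmoid")
    elif act_type == "gelu":
        # GELU, tanh approximation:
        # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
        inner = math.sqrt(2.0 / math.pi) * (data + 0.044715 * data * data * data)
        return 0.5 * data * (1.0 + mx.sym.tanh(inner))
    else:
        raise ValueError("Unknown activation type: %s" % act_type)
```

Composed this way, Swish-1 adds only a sigmoid and an element-wise multiply to the graph, while GELU adds several more nodes, which is consistent with the relative overheads reported later in this thread.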

Pull Request Checklist

  • Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box).
  • Unit tests pass (pytest)
  • System tests pass (pytest test/system)
  • Passed code style checking (./pre-commit.sh or manual run of pylint & mypy)
  • You have considered writing a test
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  • Updated CHANGELOG.md

@mjdenkowski changed the title from "Swish-1 anf GELU activation for transformers" to "Swish-1 and GELU activation for transformers" on Nov 30, 2017.
@fhieber (Contributor) commented Nov 30, 2017:

+1 on syncing with #226 and #222 to not end up with Sockeye 1.inf :)

@fhieber (Contributor) left a review comment:

Cool change, thanks!

:param act_type: Type of activation.
:return: output Symbol with same shape as input.
"""
# TODO: Contribute these to MXNet? For now it appears that registered activation types must be implemented in C++.
@fhieber (Contributor) commented on the code above:

Have you observed increased memory use or decreased speed with GELU?
You could think about implementing these as custom operators to avoid creating multiple nodes in the computation graph, if they turn out to work better than ReLU.
At least for Swish, the backward implementation is easy, but the savings would be negligible there.
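
Purely for illustration (not part of this PR), the custom-operator route suggested above might look roughly like the following sketch for Swish-1, using MXNet's Python CustomOp API with an explicit backward pass; the operator name and registration details are assumptions.

```python
import mxnet as mx


class SwishOp(mx.operator.CustomOp):
    """Single-node Swish-1 (x * sigmoid(x)) with a hand-written backward pass."""

    def forward(self, is_train, req, in_data, out_data, aux):
        x = in_data[0]
        self.assign(out_data[0], req[0], x * mx.nd.sigmoid(x))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        x = in_data[0]
        sig = mx.nd.sigmoid(x)
        y = out_data[0]  # y = x * sigmoid(x), already computed in forward
        # d/dx [x * sigmoid(x)] = y + sigmoid(x) * (1 - y)
        self.assign(in_grad[0], req[0], out_grad[0] * (y + sig * (1.0 - y)))


@mx.operator.register("swish1_custom")
class SwishProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(SwishProp, self).__init__(need_top_grad=True)

    def list_arguments(self):
        return ["data"]

    def list_outputs(self):
        return ["output"]

    def infer_shape(self, in_shape):
        # Output shape equals input shape; no auxiliary states.
        return [in_shape[0]], [in_shape[0]], []

    def create_operator(self, ctx, shapes, dtypes):
        return SwishOp()


# Usage in a symbolic graph: mx.sym.Custom(data=some_symbol, op_type="swish1_custom")
```

Whether this actually saves time or memory would need to be measured; as the discussion below concludes, it is not worth pursuing until the composed version becomes a bottleneck, and a Python CustomOp still incurs a Python callback per forward/backward, so real savings would come from a C++ operator.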

@mjdenkowski (Contributor, Author) replied:

Compared to ReLU, decoding with Swish-1 takes 100.5% of the time and GELU takes 103.5%. Swish-1 uses the same amount of memory, while GELU uses 110%. Swish-1 is basically free, whereas GELU is more expensive but so far shows the best perplexity. If they work consistently better than ReLU, we should definitely look at performance optimization. I think this base implementation is fine for experiments.

@fhieber (Contributor) replied:

Sounds good, we can revisit this if and when it becomes a bottleneck.

@mjdenkowski (Contributor, Author) commented:

This should be up to date and ready to merge.

@fhieber merged commit 320553d into master on Dec 6, 2017.
@fhieber deleted the swish1 branch on Dec 6, 2017 at 07:12.