Usage of alpha #10
Hi, we defined a trainable parameter for each head in each layer. Here's a small code snippet that we used:
However, it's also possible to have a single alpha per layer, or per transformer block!
Thanks so much! @goncalomcorreia
Hello! How can I compute entmax_bisect when the size of alpha is larger than 1?
Hi,
May I know whether we need to define a new trainable parameter for the alpha value for each head in each layer? Could anyone kindly show a simple example of how it could be used in a standard transformer?
Thanks!