Add support for the Falcon new decoder architecture #253
Merged: shadeMe merged 10 commits into explosion:main from danieldk:maintenance/falcon-shared-decoder-layer on Jul 20, 2023
Conversation
The 40B Falcon model uses the so-called new decoder architecture. This change adds support for it, which necessitates a number of changes across the board:

* So far we supported either a uniform number of query/key/value heads or full sharing of the key/value heads. The new decoder architecture allows a configurable number of key/value heads, where the number of query heads is a multiple of the number of key/value heads. To support this, we replace the `QkvHeadSharing` enum with an `AttentionHeads` class that allows more flexible configurations. The attention layer is extended to support this new scenario.
* The new decoder architecture's transformer layer is much more canonical, allowing us to reuse the shared decoder layer. However, in contrast to the other decoders that use the shared layer, Falcon applies dropout after parallel attention. To accommodate more flexible dropout configurations, we introduce the `TransformerDropouts` class, which works similarly to the `TransformerLayerNorms` class, but for dropouts.
* Split the HF configuration parsing for Falcon into separate functions for the `RefinedWebModel` and `falcon` model types.

This change also adds two new models to test the new decoder architecture.
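The head-sharing scheme in the first bullet is what is now commonly called grouped-query attention: each key/value head serves a fixed-size group of query heads, with multi-head attention (equal head counts) and multi-query attention (a single key/value head) as the two extremes. A minimal sketch of the bookkeeping, using a hypothetical `AttentionHeadsSketch` stand-in rather than the PR's actual `AttentionHeads` API:

```python
from dataclasses import dataclass


@dataclass
class AttentionHeadsSketch:
    """Hypothetical stand-in for the AttentionHeads class added in this PR."""

    n_query_heads: int
    n_key_value_heads: int

    def __post_init__(self):
        # The new decoder architecture requires that query heads divide
        # evenly into groups, one group per key/value head.
        if self.n_query_heads % self.n_key_value_heads != 0:
            raise ValueError(
                "number of query heads must be a multiple of key/value heads"
            )

    @property
    def queries_per_kv_head(self) -> int:
        # Uniform heads (MHA): n_query_heads == n_key_value_heads, so one
        # query per group. Full sharing (MQA): n_key_value_heads == 1.
        # Grouped-query attention is everything in between.
        return self.n_query_heads // self.n_key_value_heads

    def kv_head_for_query(self, query_head: int) -> int:
        # Map a query head to the key/value head whose projections it uses.
        return query_head // self.queries_per_kv_head
```

For example, `AttentionHeadsSketch(8, 2)` gives four query heads per key/value head, and query head 5 attends with key/value head 1.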
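The dropout-placement difference in the second bullet can be illustrated with plain functions standing in for `nn.Dropout` modules; `TransformerDropoutsSketch` below is a hypothetical analogue of the `TransformerDropouts` class, not its real interface:

```python
from dataclasses import dataclass
from typing import Callable

# Stand-in for nn.Dropout: any function applied to a branch output.
DropoutFn = Callable[[float], float]


def identity(x: float) -> float:
    # "No dropout at this site."
    return x


@dataclass
class TransformerDropoutsSketch:
    """Hypothetical per-site dropout configuration."""

    attn_output: DropoutFn = identity
    ffn_output: DropoutFn = identity
    parallel_attn_output: DropoutFn = identity


def sequential_residual(x, attn, ffn, d: TransformerDropoutsSketch):
    # Canonical decoder layer: each branch is dropped out before its
    # residual addition.
    h = x + d.attn_output(attn(x))
    return h + d.ffn_output(ffn(h))


def parallel_residual(x, attn, ffn, d: TransformerDropoutsSketch):
    # Falcon-style parallel attention: both branches read the same input,
    # and a single dropout is applied after the branches are summed.
    return x + d.parallel_attn_output(attn(x) + ffn(x))
```

Bundling the dropouts in one configuration object lets the shared decoder layer cover both placements, mirroring how `TransformerLayerNorms` parameterizes layer-norm placement.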
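The configuration split in the third bullet amounts to dispatching on the checkpoint's `model_type` field. The sketch below uses hypothetical function names and assumes `n_head`/`n_head_kv` (RefinedWeb) and `num_attention_heads`/`num_kv_heads` (Falcon) as the config keys; treat both as assumptions rather than the PR's actual code:

```python
def convert_hf_config(hf_config: dict) -> dict:
    # Dispatch on the Hugging Face `model_type` field.
    model_type = hf_config["model_type"]
    if model_type == "RefinedWebModel":
        return _convert_refined_web_model_config(hf_config)
    if model_type == "falcon":
        return _convert_falcon_config(hf_config)
    raise ValueError(f"unsupported model type: {model_type}")


def _convert_refined_web_model_config(hf_config: dict) -> dict:
    # Assumed key names: older RefinedWeb checkpoints use `n_head` and
    # `n_head_kv`; a missing `n_head_kv` is read here as multi-query
    # attention (a single shared key/value head).
    return {
        "n_query_heads": hf_config["n_head"],
        "n_key_value_heads": hf_config.get("n_head_kv", 1),
    }


def _convert_falcon_config(hf_config: dict) -> dict:
    # Assumed key names for the `falcon` model type.
    return {
        "n_query_heads": hf_config["num_attention_heads"],
        "n_key_value_heads": hf_config.get("num_kv_heads", 1),
    }
```

Splitting the parsers keeps each model type's key names and defaults in one place instead of branching inside a single function.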
danieldk added the labels type/feature (Type: Feature), feat/model (Feature: models), and feat/layers (Feature: Layers) on Jul 19, 2023
shadeMe reviewed on Jul 19, 2023
shadeMe reviewed on Jul 19, 2023
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
…shared-decoder-layer
shadeMe approved these changes on Jul 20, 2023
LGTM! One minor fix.
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Types of change: Feature