Add support for the Falcon new decoder architecture #253

Merged

Conversation

@danieldk danieldk (Contributor) commented Jul 19, 2023

Description

The 40B Falcon model uses the so-called new decoder architecture. This change adds support for that architecture, which necessitates a number of changes across the board:

  • So far, we supported either a uniform number of query/key/value heads or full sharing of a single key/value head across all query heads. The new decoder architecture instead provides a configurable number of key/value heads, where the number of query heads is a multiple of the number of key/value heads. To support this, we replace the `QkvHeadSharing` enum with an `AttentionHeads` class that allows more flexible configurations (see the first sketch below). The attention layer is extended to support this new scenario.

  • The new decoder architecture's transformer layer is much more canonical, allowing us to reuse the shared decoder layer. However, in contrast to the other decoders that use the shared layer, Falcon puts the dropout after parallel attention. To accommodate more flexible dropout configurations, we introduce the `TransformerDropouts` class, which works like the `TransformerLayerNorms` class, but for dropouts (see the second sketch below).

  • Split the HF configuration parsing for Falcon into separate functions for the `RefinedWebModel` and `falcon` model types (see the third sketch below).

This change also adds two new models to test the new decoder architecture.
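
To make the head configuration concrete, here is a minimal sketch of what an `AttentionHeads`-style class could look like. The names, constructors, and validation are illustrative, not necessarily the exact API merged in this PR:

```python
from dataclasses import dataclass


@dataclass
class AttentionHeads:
    """Head configuration for multi-head attention.

    Replaces an enum of fixed sharing schemes with explicit head counts.
    """

    n_query_heads: int
    n_key_value_heads: int

    def __post_init__(self):
        if self.n_query_heads % self.n_key_value_heads != 0:
            raise ValueError(
                f"Number of query heads ({self.n_query_heads}) must be a "
                f"multiple of the number of key/value heads "
                f"({self.n_key_value_heads})."
            )

    @classmethod
    def uniform(cls, n_attention_heads: int) -> "AttentionHeads":
        # Vanilla multi-head attention: one key/value head per query head.
        return cls(n_attention_heads, n_attention_heads)

    @classmethod
    def multi_query(cls, n_query_heads: int) -> "AttentionHeads":
        # Multi-query attention: one key/value head shared by all query heads.
        return cls(n_query_heads, 1)

    @classmethod
    def key_value_broadcast(
        cls, *, n_query_heads: int, n_key_value_heads: int
    ) -> "AttentionHeads":
        # Grouped configuration as used by the Falcon new decoder
        # architecture: each key/value head serves a group of query heads.
        return cls(n_query_heads, n_key_value_heads)


# Falcon-40B uses 128 query heads and 8 key/value heads, so each key/value
# head is broadcast over 128 // 8 = 16 query heads.
heads = AttentionHeads.key_value_broadcast(n_query_heads=128, n_key_value_heads=8)
```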
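
Similarly, a minimal sketch of a `TransformerDropouts`-style container, assuming PyTorch modules; field names and constructors are again illustrative rather than the exact API:

```python
from typing import Optional

import torch.nn as nn


class TransformerDropouts(nn.Module):
    """Container for the dropouts used in a transformer layer.

    Analogous to ``TransformerLayerNorms``: each attribute holds the dropout
    applied at one point in the layer, with ``nn.Identity`` disabling it.
    """

    def __init__(
        self,
        *,
        attn_output_dropout: Optional[nn.Module] = None,
        ffn_output_dropout: Optional[nn.Module] = None,
        parallel_attn_dropout: Optional[nn.Module] = None,
    ):
        super().__init__()
        self.attn_output_dropout = attn_output_dropout or nn.Identity()
        self.ffn_output_dropout = ffn_output_dropout or nn.Identity()
        self.parallel_attn_dropout = parallel_attn_dropout or nn.Identity()

    @classmethod
    def layer_output_dropouts(cls, p: float) -> "TransformerDropouts":
        # Typical decoders: dropout after the attention and feed-forward
        # sublayer outputs.
        return cls(
            attn_output_dropout=nn.Dropout(p),
            ffn_output_dropout=nn.Dropout(p),
        )

    @classmethod
    def parallel_attention_dropout(cls, p: float) -> "TransformerDropouts":
        # Falcon-style parallel attention: a single dropout applied after
        # summing the attention and feed-forward outputs.
        return cls(parallel_attn_dropout=nn.Dropout(p))
```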
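
Finally, a sketch of how the HF configuration parsing could dispatch on the `model_type` field. The converter functions and the HF field names (`n_head`, `n_head_kv`, `num_attention_heads`, `num_kv_heads`) are illustrative assumptions about the two checkpoint formats, not a verbatim excerpt from this PR:

```python
from typing import Any, Callable, Dict


def _convert_refined_web_model_config(hf_config: Dict[str, Any]) -> Dict[str, Any]:
    # Older RefinedWebModel-style checkpoints.
    return {
        "n_query_heads": hf_config["n_head"],
        "n_key_value_heads": hf_config.get("n_head_kv", 1),
    }


def _convert_falcon_config(hf_config: Dict[str, Any]) -> Dict[str, Any]:
    # Newer falcon-style checkpoints.
    return {
        "n_query_heads": hf_config["num_attention_heads"],
        "n_key_value_heads": hf_config.get("num_kv_heads", 1),
    }


_CONFIG_CONVERTERS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "RefinedWebModel": _convert_refined_web_model_config,
    "falcon": _convert_falcon_config,
}


def convert_hf_config(hf_config: Dict[str, Any]) -> Dict[str, Any]:
    # Select the conversion function based on the checkpoint's model type.
    model_type = hf_config["model_type"]
    converter = _CONFIG_CONVERTERS.get(model_type)
    if converter is None:
        raise ValueError(f"Unsupported model type: {model_type}")
    return converter(hf_config)
```
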
Types of change

Feature

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@danieldk danieldk added the type/feature, feat/model, and feat/layers labels Jul 19, 2023
Review threads (all resolved):

  • curated_transformers/layers/attention.py
  • curated_transformers/layers/transformer.py
  • curated_transformers/models/falcon/_hf.py
  • curated_transformers/models/falcon/config.py
@shadeMe shadeMe (Collaborator) left a comment

LGTM! One minor fix.

Review thread on curated_transformers/layers/attention.py (resolved).
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
@shadeMe shadeMe merged commit 6de9b98 into explosion:main Jul 20, 2023
7 checks passed
@danieldk danieldk deleted the maintenance/falcon-shared-decoder-layer branch August 2, 2023 17:23