[GPT] Use flash-attention and enable dropout #40
Merged
comaniac merged 6 commits into awslabs:main on Feb 4, 2023
Conversation
szhengac reviewed on Feb 3, 2023
szhengac approved these changes on Feb 4, 2023
Contributor (Author)
Thanks @szhengac
To existing developers: please update your environments by referring to examples/README.md or Dockerfile.
Description
This PR updates epoi (used in the GPT schedule) to the latest version, which uses the flash-attention Triton kernel for better performance and adds attention dropout support. The GPT schedule is also updated to support different random seeds within a TP group for attention dropout. With this PR, Slapo 3D can align the loss value of ZeRO-3 even with dropout enabled.
epoi and xformers are updated to the latest compatible commits. flash_attention is introduced as a new dependency of GPT scheduling. xformers doesn't allow the CUTLASS kernel to take a bias, so we use a patch to fix it temporarily.
With all of the above changes, we can run GPT with 3D parallelism (DP, PP, TP) = (2, 2, 2).
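To make the attention-dropout behavior concrete, here is a minimal sketch. It uses PyTorch's built-in torch.nn.functional.scaled_dot_product_attention as a stand-in for the epoi flash-attention Triton kernel that this PR actually wires in, so the function choice, tensor shapes, and dropout probability below are illustrative assumptions rather than Slapo's API.

```python
# Sketch only: PyTorch's fused attention as a stand-in for the epoi
# flash-attention Triton kernel. Shapes and dropout_p are illustrative.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
batch, heads, seq_len, head_dim = 2, 12, 128, 64
q = torch.randn(batch, heads, seq_len, head_dim, device=device)
k = torch.randn_like(q)
v = torch.randn_like(q)

# dropout_p > 0 drops attention probabilities inside the fused kernel,
# which is the "attention dropout" this PR enables in the GPT schedule.
out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.1, is_causal=True)
print(out.shape)  # torch.Size([2, 12, 128, 64])
```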
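The per-rank seeding mentioned above can be pictured with the hypothetical helper below: each tensor-parallel rank offsets a shared base seed by its rank so its attention-dropout mask is independent. The helper name and the seeding scheme are assumptions for illustration, not the actual schedule code.

```python
import torch
import torch.nn.functional as F

def set_attention_dropout_seed(base_seed: int, tp_rank: int) -> None:
    """Hypothetical helper: offset the RNG seed by the tensor-parallel rank
    so each TP shard draws an independent attention-dropout mask."""
    torch.manual_seed(base_seed + tp_rank)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(base_seed + tp_rank)

# Two ranks of a 2-way TP group end up with different dropout masks.
x = torch.ones(2, 4)
for tp_rank in range(2):
    set_attention_dropout_seed(base_seed=2023, tp_rank=tp_rank)
    print(tp_rank, F.dropout(x, p=0.5))
```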
Checklist
cc @szhengac