
[GPT] Use flash-attention and enable dropout#40

Merged
comaniac merged 6 commits into awslabs:main from comaniac:gpt-triton
Feb 4, 2023

Conversation


@comaniac comaniac commented Feb 3, 2023

To existing developers: please update your environments by referring to examples/README.md or Dockerfile.

Description

This PR updates epoi, used in the GPT schedule, to the latest version, which uses a flash-attention Triton kernel for better performance and adds attention-dropout support. The GPT schedule is also updated to support different random seeds within a TP group for attention dropout. With this PR, Slapo 3D can match the loss values of ZeRO-3 even with dropout enabled.
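As a hedged illustration of the per-rank seeding idea (not Slapo's actual code), each rank in a tensor-parallel (TP) group can derive its own dropout seed from a shared base seed. The real schedule would seed the CUDA RNG (e.g. via torch.manual_seed); the stdlib RNG is used here only to keep the sketch self-contained, and the function name and offset scheme are illustrative:

```python
# Hypothetical sketch: per-rank dropout seeding within a TP group.
# Slapo's actual implementation may differ; torch.manual_seed would be
# used in practice instead of the stdlib random module.
import random

def set_attention_dropout_seed(base_seed: int, tp_rank: int) -> int:
    # Offset the shared base seed by the rank's position in the TP group,
    # so each rank draws a different dropout mask while staying reproducible.
    seed = base_seed + tp_rank
    random.seed(seed)
    return seed
```

Ranks in the same TP group thus apply independent dropout masks to their attention shards, which is what lets the loss line up with a non-TP baseline such as ZeRO-3.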

  1. Update GPT schedule dependencies. epoi and xformers are updated to the latest compatible commits. flash_attention is introduced as a new dependency of GPT scheduling.
  2. The latest xformers doesn't allow the CUTLASS kernel to take a bias. We use a patch to fix it temporarily.
  3. Dockerfile. The Dockerfile is updated, and the CI docker image has been updated to v0.02.
  4. Data loader. Use the gpt-neo tokenizer, which has a maximum token length of 2048.
  5. Data loader. Support a configurable sequence length.
  6. GPT. Support configurable dropout probabilities and maximum sequence length.
  7. GPT. When the GPU is sm_80 (e.g., A100), use the Triton flash-attention kernel.
  8. GPT. Use different random seeds for attention dropout.
  9. Utility. A simple utility to calculate the TFLOPS of a decoder model. It can be used like this:
# samples/sec, ngpu, seq_len, nlayers, hs, vocab_size
python3 -c "from slapo.utils.report import calc_decoder_tflops; print(calc_decoder_tflops(35, 8, 2048, 24, 2048, 50528))"
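For reference, here is a hedged sketch of the kind of arithmetic such a utility performs, based on the widely used Megatron-LM-style approximation (backward pass ≈ 2× forward, so total ≈ 3× forward FLOPs). The function name `calc_decoder_tflops_sketch` and the exact terms are our assumptions; the real `slapo.utils.report.calc_decoder_tflops` may differ:

```python
# Hedged sketch of decoder-model TFLOPS estimation; not the actual
# slapo.utils.report implementation.
def calc_decoder_tflops_sketch(samples_per_sec, ngpu, seq_len, nlayers,
                               hidden_size, vocab_size):
    # Per-sample FLOPs: 72*l*s*h^2 for the transformer layers (fwd+bwd),
    # with correction terms for attention over the sequence (s/(6h))
    # and the vocabulary projection (V/(12*l*h)).
    flops_per_sample = (
        72 * nlayers * seq_len * hidden_size ** 2
        * (1 + seq_len / (6 * hidden_size)
           + vocab_size / (12 * nlayers * hidden_size))
    )
    # Achieved TFLOPS per GPU.
    return samples_per_sec * flops_per_sample / ngpu / 1e12
```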

With all the above changes, we can use the following command to run GPT with 3D parallelism (DP, PP, TP) = (2, 2, 2):

deepspeed ./examples/gpt/deepspeed_hf.py --pmp 2 --tmp 2 \
--batch_size 128 --micro_batch_size 4 --model_name EleutherAI/gpt-neo-2.7B \
--iter_nums 170 --hidden-size 2048 --nlayers 24 --num-attn-heads 16 \
--dropout 0.1 --seq_len 1024

Here is another example with a longer sequence length:

deepspeed ./examples/gpt/deepspeed_hf.py --pmp 2 --tmp 2 \
--batch_size 32 --micro_batch_size 4 --model_name EleutherAI/gpt-neo-2.7B \
--iter_nums 170 --hidden-size 2048 --nlayers 24 --num-attn-heads 16 \
--dropout 0.1 --seq_len 2048
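The batch flags in these commands relate through data parallelism: with (DP, PP, TP) = (2, 2, 2) the run uses 8 GPUs, and the global batch is consumed in micro-batches per data-parallel rank via gradient accumulation. A small sketch of that arithmetic (the helper name is ours, not Slapo's):

```python
# Hypothetical helper relating the --batch_size and --micro_batch_size
# flags to gradient-accumulation steps under data parallelism.
def grad_accum_steps(global_batch: int, micro_batch: int, dp_degree: int) -> int:
    # global batch = micro batch * data-parallel degree * accumulation steps
    per_step = micro_batch * dp_degree
    assert global_batch % per_step == 0, "global batch must divide evenly"
    return global_batch // per_step

# For the two commands above (DP = 2):
#   --batch_size 128 --micro_batch_size 4  -> 16 accumulation steps
#   --batch_size 32  --micro_batch_size 4  ->  4 accumulation steps
```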

Checklist

  • PR's title starts with a category (e.g. [Bugfix], [Model], [Tutorial], etc.)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

cc @szhengac

@comaniac comaniac merged commit 79436ea into awslabs:main Feb 4, 2023
@comaniac comaniac deleted the gpt-triton branch February 4, 2023 02:29

comaniac commented Feb 4, 2023

Thanks @szhengac
