
Add Falcon support #499

Merged · 12 commits merged into main from falcon-new on Sep 3, 2023
Conversation

@borzunov (Collaborator) commented on Sep 2, 2023

This PR adds:

  • Support for models based on transformers.FalconModel (the in-library format for Falcon). Tested on Falcon-40B; see the usage sketch below.
  • CI tests for Falcon-RW-1B.
  • --throughput dry_run option to evaluate throughput and exit right away (implemented by @mryab).

Limitations:

  • Backward pass support is broken for now; it will be fixed in #500.

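To illustrate the new in-library Falcon support from the client side, here is a minimal sketch. It assumes the petals.AutoDistributedModelForCausalLM entry point and the public tiiuae/falcon-40b checkpoint; the prompt and generation settings are illustrative and not taken from this PR.

```python
# Minimal client-side sketch (assumptions: `pip install petals transformers` and a swarm
# that serves tiiuae/falcon-40b; the model name and settings are illustrative).
import torch
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "tiiuae/falcon-40b"  # any transformers.FalconModel-based repo should work

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tokenizer("Distributed inference lets a single GPU", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```

On the server side, the new flag can be combined with the usual entry point, e.g. `python -m petals.cli.run_server tiiuae/falcon-40b --throughput dry_run` (assuming the standard petals.cli.run_server module), to measure throughput and exit right away instead of serving blocks.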
@borzunov borzunov force-pushed the falcon-new branch 7 times, most recently from bcbd950 to 2137876 Compare September 3, 2023 10:46
@borzunov (Collaborator, Author) commented on Sep 3, 2023

Falcon-40B benchmarks

These were measured before 4537c77, which slows down inference by 1-2% (but is necessary to make MQA models work properly with the rest of Petals).

H100 (80 GB):

Sep 03 14:07:43.798 [INFO] Inference throughput: 728.4 tokens/sec per block (1 tokens/batch, NVIDIA H100 PCIe GPU, bfloat16, quantized to nf4)
Sep 03 14:07:57.270 [INFO] Forward pass throughput: 93138.6 tokens/sec per block (1024 tokens/batch, NVIDIA H100 PCIe GPU, bfloat16, quantized to nf4)

A100 (80 GB):

Sep 03 13:22:40.739 [INFO] Inference throughput: 710.3 tokens/sec per block (1 tokens/batch, NVIDIA A100-SXM4-80GB GPU, bfloat16, quantized to nf4)
Sep 03 13:22:50.803 [INFO] Forward pass throughput: 61680.6 tokens/sec per block (1024 tokens/batch, NVIDIA A100-SXM4-80GB GPU, bfloat16, quantized to nf4)

RTX 6000 Ada (48 GB):

Sep 03 15:14:46.634 [INFO] Inference throughput: 785.9 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 03 15:14:57.330 [INFO] Forward pass throughput: 62151.1 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
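For context, a back-of-the-envelope conversion from per-block numbers to end-to-end generation speed, assuming Falcon-40B's 60 transformer blocks were all hosted on one such GPU (both the layer count and the "speed ≈ per-block throughput / number of blocks" reading are assumptions, not stated in this PR):

```python
# Hedged estimate: end-to-end speed implied by the per-block inference throughput above,
# assuming 60 Falcon-40B blocks served sequentially on a single GPU.
NUM_BLOCKS = 60  # assumed Falcon-40B depth

per_block_inference_tps = {
    "H100 80GB": 728.4,
    "A100 80GB": 710.3,
    "RTX 6000 Ada 48GB": 785.9,
}

for gpu, tps in per_block_inference_tps.items():
    print(f"{gpu}: ~{tps / NUM_BLOCKS:.1f} tokens/sec if one server held all blocks")
# e.g. H100 80GB: ~12.1 tokens/sec if one server held all blocks
```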

@borzunov borzunov force-pushed the falcon-new branch 2 times, most recently from a8cb6dc to 72033d1 Compare September 3, 2023 14:34
@borzunov borzunov force-pushed the falcon-new branch 4 times, most recently from 872f80d to 2747255 Compare September 3, 2023 19:22
@borzunov borzunov marked this pull request as ready for review September 3, 2023 21:41
@borzunov borzunov merged commit dd4a323 into main Sep 3, 2023
11 checks passed
@borzunov borzunov deleted the falcon-new branch September 3, 2023 21:45
mryab added a commit that referenced this pull request Sep 4, 2023
This PR attempts to optimize the inference of Falcon models in the single-token setup by reducing the majority of Python overhead and making several assumptions about the setup. Specifically,

* Layer normalization, QKV projection (with splitting) and rotary embeddings are executed through CUDA graphs, which reduces most overhead related to small kernel launches (see the generic capture/replay sketch below).
* If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR.

The PR also adds a small test to ensure that the results of the block (without quantization) before and after the optimization indeed match.

Lastly, the pull request makes the backward pass work (as discussed in #499) by making the cached sin/cos tensors of RotaryEmbedding into buffers and disabling inference mode during their creation.
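For readers unfamiliar with the technique, below is a minimal, generic PyTorch capture/replay sketch of the kind of CUDA-graph usage described above. It is not the code from this commit: the module, shapes, and warm-up policy are placeholders, and the real implementation captures Falcon's layernorm/QKV/rotary path specifically.

```python
# Hedged sketch of CUDA-graph capture/replay in PyTorch (not the actual Falcon code).
# Idea: record the kernel sequence for one fixed-shape forward pass, then replay it,
# avoiding per-call Python and kernel-launch overhead in the single-token setting.
import torch

layer = torch.nn.Sequential(
    torch.nn.LayerNorm(512), torch.nn.Linear(512, 3 * 512)  # stand-in for LN + fused QKV
).cuda().eval()

static_input = torch.zeros(1, 1, 512, device="cuda")  # one token per batch, fixed shape

# Warm up on a side stream so autotuning work does not get captured into the graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        layer(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_output = layer(static_input)  # capture one fixed-shape forward pass

def fast_forward(x: torch.Tensor) -> torch.Tensor:
    # Replay the recorded kernels on new data by copying into the captured input buffer.
    static_input.copy_(x)
    graph.replay()
    return static_output.clone()

print(fast_forward(torch.randn(1, 1, 512, device="cuda")).shape)  # torch.Size([1, 1, 1536])
```

Pre-caching the rotary sin/cos tables up to INFERENCE_MAX_LENGTH, as the commit does, presumably fits the same pattern, since graph replay requires fixed shapes and pre-allocated buffers.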
d-popov pushed a commit to d-popov/petals-ai that referenced this pull request Sep 6, 2023
d-popov pushed a commit to d-popov/petals-ai that referenced this pull request Sep 6, 2023