
Implement pad as a CUDA kernel #860

Merged · 5 commits · Apr 19, 2023

Conversation

danieldk
Contributor

@danieldk danieldk commented Mar 10, 2023

Description

Ops.pad was a fairly slow operation on GPU: it iterated over all sequences and copied each sequence into the padded array, resulting in a large number of kernel launches. In the biaffine parser, padding the inputs was more costly than applying the biaffine layers.
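The per-sequence copy pattern described above can be sketched in plain NumPy. This is an illustrative sketch, not Thinc's actual `Ops.pad` implementation; the point is that on GPU, each slice assignment becomes a separate device copy (and hence a separate kernel launch) when the arrays are CuPy arrays.

```python
import numpy as np


def pad_naive(seqs):
    """Illustrative per-sequence padding: one copy per sequence.

    With CuPy arrays, each `out[i, :len(seq)] = seq` assignment would be
    a separate kernel launch. Hypothetical sketch, not Thinc's Ops.pad.
    """
    batch = len(seqs)
    max_len = max(len(s) for s in seqs)
    hidden = seqs[0].shape[1]
    out = np.zeros((batch, max_len, hidden), dtype=seqs[0].dtype)
    for i, seq in enumerate(seqs):
        out[i, : len(seq)] = seq  # one copy (kernel launch) per sequence
    return out


seqs = [np.ones((2, 3)), np.ones((4, 3))]
padded = pad_naive(seqs)
print(padded.shape)  # (2, 4, 3)
```

With hundreds of short sequences in a batch, the launch overhead of these small copies dominates the actual data movement.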

This change optimizes the pad op using a custom CUDA kernel. The kernel gets an array of pointers to the CuPy arrays that are provided as a list. The output array is then filled, parallelizing over the 'time steps'. This should provide the largest amount of parallelism, since we usually have n_steps * hidden_size elements to parallelize over.
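The kernel's indexing scheme can be simulated in NumPy rather than CUDA. This is a rough sketch of the approach under the stated assumptions (one logical thread per output element, flat indexing in the style of a CUDA grid-stride loop); the function name and loop structure are illustrative, not the actual kernel:

```python
import numpy as np


def pad_kernel_sim(seqs):
    """Simulate the kernel's parallelization: one logical thread per
    (sequence, step, feature) element of the output. Illustrative only.
    """
    lengths = [len(s) for s in seqs]
    batch, max_len = len(seqs), max(lengths)
    hidden = seqs[0].shape[1]
    out = np.zeros((batch, max_len, hidden), dtype=seqs[0].dtype)
    # Flat loop standing in for a CUDA grid over batch * max_len * hidden
    # threads; each iteration is the work one thread would do.
    for flat in range(batch * max_len * hidden):
        b, rem = divmod(flat, max_len * hidden)
        t, h = divmod(rem, hidden)
        if t < lengths[b]:          # inside the sequence: copy the value
            out[b, t, h] = seqs[b][t, h]
        # else: position is padding, leave the zero in place
    return out


padded = pad_kernel_sim([np.arange(6).reshape(2, 3),
                         np.zeros((4, 3)) + 1])
print(padded.shape)  # (2, 4, 3)
```

Because every output element is handled independently, a single launch covers the whole batch, instead of one launch per sequence.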

Before:

[Profiler screenshot: before-padding-change]

After:

[Profiler screenshot: padding-after-change]

Warning: please do not review yet! There are still some todo items and I still want to be able to rebase the branch.

  • Extend to int32 and int64, so that we can use this optimization in curated transformers as well.
  • Test by training an actual model.
  • Rebase to Thinc master.
  • Clean up some redundancies in the Ops.pad code.

Types of change

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@danieldk danieldk added the performance (Speed and memory use) and feat / ops (Backends and maths) labels Mar 10, 2023
@danieldk danieldk force-pushed the wip/gpu-pad branch 2 times, most recently from 5d60959 to db04e9d Compare March 11, 2023 10:32
@danieldk danieldk changed the base branch from v9 to master March 11, 2023 15:57
@danieldk danieldk changed the base branch from master to v9 March 11, 2023 15:57
@danieldk danieldk changed the base branch from v9 to master March 11, 2023 16:02
@danieldk danieldk marked this pull request as ready for review March 11, 2023 16:24
thinc/backends/_custom_kernels.py — 3 review threads (outdated, resolved)
@netlify

netlify bot commented Apr 19, 2023

👷 Deploy request for thinc-ai pending review.

Visit the deploys page to approve it

🔨 Latest commit: 4df1ba2

@shadeMe
Collaborator

shadeMe commented Apr 19, 2023

@explosion-bot please test_slow_gpu

@explosion-bot
Collaborator

explosion-bot commented Apr 19, 2023

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/thinc-slow-gpu-tests/builds/41

@shadeMe shadeMe merged commit 12c03cc into explosion:master Apr 19, 2023
shadeMe added a commit that referenced this pull request Apr 19, 2023
@danieldk danieldk deleted the wip/gpu-pad branch April 19, 2023 17:51