
Implement pad as a CUDA kernel #860

Merged · 5 commits · Apr 19, 2023

Conversation

danieldk
Contributor

@danieldk danieldk commented Mar 10, 2023

Description

Ops.pad was a fairly slow operation on GPU: it iterated over all sequences and copied each sequence into the padded array, resulting in a large number of kernel launches. In the biaffine parser, padding the inputs was more costly than applying the biaffine layers.
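The per-sequence copy pattern described above can be sketched in plain NumPy. This is an illustrative sketch, not Thinc's actual `Ops.pad` implementation; the point is that on GPU, each slice assignment becomes a separate device copy (and hence a separate kernel launch) when the arrays are CuPy arrays.

```python
import numpy as np


def pad_naive(seqs):
    """Illustrative per-sequence padding: one copy per sequence.

    With CuPy arrays, each `out[i, :len(seq)] = seq` assignment would be
    a separate kernel launch. Hypothetical sketch, not Thinc's Ops.pad.
    """
    batch = len(seqs)
    max_len = max(len(s) for s in seqs)
    hidden = seqs[0].shape[1]
    out = np.zeros((batch, max_len, hidden), dtype=seqs[0].dtype)
    for i, seq in enumerate(seqs):
        out[i, : len(seq)] = seq  # one copy (kernel launch) per sequence
    return out


seqs = [np.ones((2, 3)), np.ones((4, 3))]
padded = pad_naive(seqs)
print(padded.shape)  # (2, 4, 3)
```

With hundreds of short sequences in a batch, the launch overhead of these small copies dominates the actual data movement.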

This change optimizes the pad op using a custom CUDA kernel. The kernel gets an array of pointers to the CuPy arrays that are provided as a list. The output array is then filled, parallelizing over the 'time steps'. This should provide the largest amount of parallelism, since we usually have n_steps * hidden_size elements to parallelize over.
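The kernel's indexing scheme can be simulated in NumPy rather than CUDA. This is a rough sketch of the approach under the stated assumptions (one logical thread per output element, flat indexing in the style of a CUDA grid-stride loop); the function name and loop structure are illustrative, not the actual kernel:

```python
import numpy as np


def pad_kernel_sim(seqs):
    """Simulate the kernel's parallelization: one logical thread per
    (sequence, step, feature) element of the output. Illustrative only.
    """
    lengths = [len(s) for s in seqs]
    batch, max_len = len(seqs), max(lengths)
    hidden = seqs[0].shape[1]
    out = np.zeros((batch, max_len, hidden), dtype=seqs[0].dtype)
    # Flat loop standing in for a CUDA grid over batch * max_len * hidden
    # threads; each iteration is the work one thread would do.
    for flat in range(batch * max_len * hidden):
        b, rem = divmod(flat, max_len * hidden)
        t, h = divmod(rem, hidden)
        if t < lengths[b]:          # inside the sequence: copy the value
            out[b, t, h] = seqs[b][t, h]
        # else: position is padding, leave the zero in place
    return out


padded = pad_kernel_sim([np.arange(6).reshape(2, 3),
                         np.zeros((4, 3)) + 1])
print(padded.shape)  # (2, 4, 3)
```

Because every output element is handled independently, a single launch covers the whole batch, instead of one launch per sequence.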

Before:

[Profiler screenshot: before-padding-change]

After:

[Profiler screenshot: padding-after-change]

Warning: please do not review yet! There are still some todo items and I still want to be able to rebase the branch.

  • Extend to int32 and int64, so that we can use this optimization in curated transformers as well.
  • Test by training an actual model.
  • Rebase to Thinc master.
  • Clean up some redundancies in the Ops.pad code.

Types of change

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@danieldk danieldk added the performance (Speed and memory use) and feat / ops (Backends and maths) labels Mar 10, 2023
@danieldk danieldk force-pushed the wip/gpu-pad branch 2 times, most recently from 5d60959 to db04e9d Compare March 11, 2023 10:32
@danieldk danieldk changed the base branch from v9 to master March 11, 2023 15:57
@danieldk danieldk changed the base branch from master to v9 March 11, 2023 15:57
@danieldk danieldk changed the base branch from v9 to master March 11, 2023 16:02
@danieldk danieldk marked this pull request as ready for review March 11, 2023 16:24
thinc/backends/_custom_kernels.py — 3 review threads (outdated, resolved)
@netlify

netlify bot commented Apr 19, 2023

👷 Deploy request for thinc-ai pending review.

Visit the deploys page to approve it

🔨 Latest commit: 4df1ba2

@shadeMe
Collaborator

shadeMe commented Apr 19, 2023

@explosion-bot please test_slow_gpu

@explosion-bot
Collaborator

explosion-bot commented Apr 19, 2023

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/thinc-slow-gpu-tests/builds/41

@shadeMe shadeMe merged commit 12c03cc into explosion:master Apr 19, 2023
shadeMe added a commit that referenced this pull request Apr 19, 2023
@danieldk danieldk deleted the wip/gpu-pad branch April 19, 2023 17:51