Async gmem copy support on sm80+ #1619

Merged 81 commits into devel on May 24, 2022

Conversation

@shmsong commented Apr 25, 2022

This PR is separated out from #1440 to provide initial support for cp.async.

It introduces the preliminary use cases of cp.async (sm80+):

  • Introduce a resource string for cp.async and cp.async.wait.
  • Add a separate IR node for cp.async.wait.
  • Create separate sync-insertion logic for cp.async.wait so that it can be inserted independently of whether a block sync is inserted.
  • Include the predicate-inversion WAR (workaround) and related checks for cp.async initialization.
  • Support cp.async in double buffering; in particular, pipe LoadStoreOp down through the analysis and insert cp.async.wait into the double-buffer loop. (A sketch of the underlying instruction pattern follows below.)
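For readers unfamiliar with the instruction, here is a minimal, hand-written sketch of the sm80+ cp.async pattern being targeted; this is an illustration only, not nvFuser's generated code, and the kernel shape is invented for the example:

```
#include <cuda_runtime.h>

// Each thread asynchronously copies 16 bytes from global to shared memory,
// then waits for the copy to complete before using the data.
__global__ void asyncCopyKernel(const float4* gmem, float4* out) {
  __shared__ float4 smem[128];
  // cp.async takes the destination as a .shared state-space address.
  unsigned smemAddr = static_cast<unsigned>(
      __cvta_generic_to_shared(&smem[threadIdx.x]));
  // Issue the asynchronous 16-byte copy: gmem -> smem.
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n" ::"r"(smemAddr),
               "l"(gmem + threadIdx.x));
  // Commit outstanding copies into a group, then wait for all of them.
  asm volatile("cp.async.commit_group;\n" ::);
  asm volatile("cp.async.wait_all;\n" ::);
  __syncthreads();
  out[threadIdx.x] = smem[threadIdx.x];
}
```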

@shmsong (Author) commented May 7, 2022

These two C++ tests fail on my system:

```
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] NVFuserTest.FusionPersistentNormLocalShared_CUDA
[  FAILED  ] NVFuserTest.FusionSmemBlockGemmCacheDoubleBuffer_CUDA

[ RUN      ] NVFuserTest.FusionPersistentNormLocalShared_CUDA
unknown file: Failure
C++ exception with description "(dynamic_smem_size) < (available_dynamic_smem_without_reconfiguration + additional_dynamic_smem_available_through_reconfiguration) INTERNAL ASSERT FAILED at "../torch/csrc/jit/codegen/cuda/executor.cpp":571, please report a bug to PyTorch. The total shared memory allocation is larger than available memory. Dynamic size: 66048. Available size: 49136. Configured smem size: 49152. Device limit size: 65536

[ RUN      ] NVFuserTest.FusionSmemBlockGemmCacheDoubleBuffer_CUDA
unknown file: Failure
C++ exception with description "aten_output_tensor.allclose( fusion_output_tensor.to(aten_output_tensor.dtype()), tolerance_values.second, tolerance_values.first) INTERNAL ASSERT FAILED at "../torch/csrc/jit/codegen/cuda/test/test_gpu_validator.h":436, please report a bug to PyTorch.

Validation error in output 0 on line 20743 in file ../torch/csrc/jit/codegen/cuda/test/test_gpu.cpp.
  Detected abs error of: 3.25719
    absolute tolerance was set to 8.57934e-06
    and relative tolerance set to 8.57934e-08
```

Thanks for checking.

Both failures are caused/triggered by #1440, and the second one has something to do with ToT changes. Will put up more details and fix/track.

Need to go offline, will come back to the review discussions. Thanks for the initial feedback 👍

```
TEST_F(NVFuserTest, FusionDoubleBufferCpAsync1_CUDA) {
  // requires Ampere+ GPU
  if (!deviceMajorMinorCheck(8)) {
    GTEST_SKIP() << "skipping tests on pre-AMPERE GPUs";
```
Collaborator:

I'd prefer these arch-dependent tests to go through lowering and skip the rest, so that they are still exercised on older GPUs to the extent possible.

Collaborator:

There's also validation of arch versions at the beginning of lowering, which is inconvenient when I just want to see the generated code on my system with a Turing GPU. Perhaps another PYTORCH_NVFUSER_DISABLE flag to skip the validation?

Certain conditions must hold for the lowering passes to work, but there are also conditions that do not prevent them from going through, like this use of arch-specific instructions. Another example is non-constant allocations of register arrays. For debugging, I'd like a way to relax those conditions.

Collaborator:

I think it'd be better to have this test run through scheduling and then do ASSERT_ANY_THROW on lowering on pre-Ampere devices; see the sketch below.
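For illustration, a minimal sketch of the suggested structure; the fusion setup is elided, and the exact lowering entry point (`GpuLower` here) is an assumption based on the nvFuser API of the time:

```
TEST_F(NVFuserTest, FusionDoubleBufferCpAsync1_CUDA) {
  Fusion fusion;
  FusionGuard fg(&fusion);
  // ... build and schedule the fusion with cp.async double buffering ...

  if (!deviceMajorMinorCheck(8)) {
    // Pre-Ampere: scheduling succeeded above, but lowering must reject
    // the arch-specific cp.async usage.
    ASSERT_ANY_THROW(GpuLower(&fusion).kernel());
    return;
  }
  // Ampere+: compile, run, and validate as usual.
}
```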

Author:

Sure, it'd be a very helpful debug and test feature. Some extra assertions would be needed in that case to make sure we never send a device-mismatched kernel to nvrtc, not even for compilation.

Let's move this discussion and these changes into #1702.

Collaborator:

Are you still working on #1702?

Author:

Not really. Just rebasing after the master bump, and now is probably a good time to start merging.

I will need to propagate the changes in #1702 to #1440 and to this one, though.

```
//
// TODO: we may not need this sync at all if we didn't need
// to sync the block before. This information is in sync info.
// In this case shared mem double buffering "collapses" into
```
Collaborator:

Not sure what collapsing means.

Does this mean we don't need block sync when the async wait inserted above is sufficient?

Author:

Yes. I will shorten the comment, or even just make the change in this PR. Basically, the double-buffer sync should now look at the sync info to decide whether a sync needs to be inserted.

It's a generic double-buffering change, but if it's only a few lines I could just do it in this PR as well. (See the sketch of the loop shape after this comment.)
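For orientation, a hedged, hand-written sketch of the double-buffered main-loop shape under discussion; this is not nvFuser's generated code, and `loadTileAsync`/`computeTile` are hypothetical helpers:

```
// Prologue: prefetch tile 0 into smem slot 0.
loadTileAsync(smem[0], /*tile=*/0);  // issues cp.async instructions
asm volatile("cp.async.commit_group;\n" ::);

for (int i = 0; i < nTiles; ++i) {
  if (i + 1 < nTiles) {
    // Prefetch the next tile into the other slot while tile i is consumed.
    loadTileAsync(smem[(i + 1) % 2], /*tile=*/i + 1);
  }
  // Committing every iteration (possibly an empty group) keeps the
  // wait_group count uniform across iterations.
  asm volatile("cp.async.commit_group;\n" ::);
  // Allow at most one group in flight, so tile i's copy has completed.
  asm volatile("cp.async.wait_group 1;\n" ::);
  // The block sync is only needed when threads read smem written by other
  // threads; when it isn't, the async wait alone suffices and the sync
  // "collapses" away -- which is what the TODO above is getting at.
  __syncthreads();
  computeTile(smem[i % 2]);
}
```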

Collaborator:

Sounds good.

Collaborator:

This is the only part I still need to review. Let me know if you're done.

Author:

I think I added all I wanted to include about double buffering in this PR.

Will have a follow-up on double buffering for sure, to clear all the TODOs.

Author:

@naoyam This PR should be ready for re-review.

Will need to merge back other changes before merging this one, but I expect all remaining commits to be fully mechanical.

@naoyam (Collaborator) left a comment:

Thanks for addressing all my concerns!

@csarofeen (Owner) left a comment:

This PR is really cool and very clean now. Good work.

```
// Initialize buffer -> predicated load
// For memcpy async:
// If we initialized and then loaded (without sync) it would be undefined
// behavior.
```
Owner:

This is only supported for memcpy async, is that true? It's a little hard for me to tell from the comment exactly where the support works and where it doesn't, but I think SMEM + memcpy async is the only place this is usable.

Author:

Yes. This is only done for cp.async, and it is actually the only scenario I know of where this is required.

In follow-ups I think this can be removed for the prioritized cases as well, even for cp.async, but getting rid of it completely didn't seem possible. (A sketch of the inverted-predicate pattern follows below.)
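For illustration only, a hand-written sketch of the inverted-predicate initialization this thread discusses (not nvFuser's actual codegen; `inBounds`, `smemAddr`, `gmemPtr`, and `idx` are placeholder names). Instead of unconditionally zero-initializing the buffer and then issuing a predicated cp.async over it, the zero-fill is emitted under the inverse of the load predicate, so the plain smem store never races with the in-flight asynchronous copy:

```
if (inBounds) {
  // Valid elements: asynchronous 4-byte copy from gmem into smem.
  asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n" ::"r"(smemAddr),
               "l"(gmemPtr));
} else {
  // Inverted predicate: only out-of-bounds elements are zero-initialized.
  smem[idx] = 0.f;
}
```

Note that the else branch still writes to smem, which is why even elements outside the valid region must fall inside the allocated region, as the next review threads discuss.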

```
  kir::IrVisitor::handle(expr);
}

// Invert the predicate of given expr.
```
Owner:

The limitation here seems to be that we're treating anything outside the "valid" region as being the initialization region. This is fine as long as we never go outside the allocated region. However, what we'd likely want is predicates that are true only outside the "valid region" and within the "allocated region". Is this understanding correct?

Author:

Yes, that's right. For full reliability in all the cases we want to cover, we'd need to find precisely the region that's allocated but invalid. But as mentioned above, this is mostly for functional coverage; in most perf-critical cases we should be able to find alternatives to this, which will come in follow-ups on cp.async.

```
// input or its shared mem output, or both.
// Due to the WAR discussed in [Predicate Inversion for CpAsync],
// we currently cannot support use cases where both the gmem read
// and the smem write need to be predicated.
```
Owner:

This just requires a predicate for the GMEM read, no? I'm failing to see how this is different from the logic/limitation in my comment in lower_predicate.cpp. We would want to read valid GMEM locations, which would have to have valid associated SMEM write locations. So the "valid region" is the same for GMEM and SMEM and is guaranteed to be in "allocated memory". I think this is the same limitation; it just wasn't immediately obvious to me.

Author:

Yes, this is the same limitation; the assertion is put here because this is where we have access to the allocated-region info. If we have valid data we are guaranteed to have an allocated region, but due to the limitation discussed in lower_predicate.cpp, for cp.async we additionally require an allocated region for all active threads in this instruction.

I will still need to work through pathological cases in follow-ups; those are about functional coverage rather than perf-critical use cases.

torch/csrc/jit/codegen/cuda/test/test_gpu.cpp (review thread resolved)
Base automatically changed from ampere_mma_op to devel May 23, 2022 23:50
@shmsong merged commit dc670a2 into devel May 24, 2022
@shmsong deleted the ampere_cp_async_initial_support branch May 24, 2022
malfet pushed a commit to pytorch/pytorch that referenced this pull request Jun 8, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:
1. Initial support of cp.async and cp.async.wait: csarofeen#1619
2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: csarofeen#1643
3. Extending the infrastructure to support mma operators on turing and ampere arch: csarofeen#1440

Commits that are actually in this PR from the csarofeen branch:
```
* dd23252 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f Fix missing cooperative launch (#1726)
* dc670a2 Async gmem copy support on sm80+ (#1619)
* 5e6a8da Add turing mma support and test (#1643)
* d6d6b7d Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39 Mma op integration on ampere (#1440)
* fade8da patch python test for bfloat16 (#1724)
* 8fbd0b1 Fine-grained kernel profiling (#1720)
* 77c1b4f Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b More precise concretization analysis (#1719)
* f4d3630 Enable complex python tests (#1667)
* 4ceeee5 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830 Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7 updating_ci_machine (#1718)
* 56585c5 Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453 Allow using nvFuser on CUDA extension (#1701)
* 18bee67 Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: #78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Jun 8, 2022

Same devel-to-master sync as above (Pull Request resolved: #78244). Reviewed By: ejguan. Differential Revision: D36678948. Pulled By: davidberard98.
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022

Same devel-to-master sync (Pull Request resolved: pytorch/pytorch#78244) mirrored into jjsjann123/nvfuser, with the same commit list as above.
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022

Same devel-to-master sync (Pull Request resolved: pytorch/pytorch#78244) mirrored again, with the same commit list as above.