Async gmem copy support on sm80+ #1619

Merged 81 commits into devel on May 24, 2022

Conversation

@shmsong commented Apr 25, 2022

This PR is separated out from #1440 to provide initial support for cp.async.

It introduces the preliminary use cases of cp.async (sm80+):

  • Introduce a resource string for cp.async and cp.async.wait.
  • Add a separate IR node for cp.async.wait.
  • Create separate sync-insertion logic for cp.async.wait so that it can be inserted independently of whether a block sync is inserted.
  • Include the predicate-inversion WAR (workaround) and related checks for cp.async initialization.
  • Support cp.async in double buffering; in particular, pipe LoadStoreOp down through the analysis and insert cp.async.wait into the double-buffer loop. (A sketch of the underlying instruction pattern follows below.)
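For readers unfamiliar with the instruction, here is a minimal, hand-written sketch of the sm80+ cp.async pattern being targeted; this is an illustration only, not nvFuser's generated code, and the kernel shape is invented for the example:

```
#include <cuda_runtime.h>

// Each thread asynchronously copies 16 bytes from global to shared memory,
// then waits for the copy to complete before using the data.
__global__ void asyncCopyKernel(const float4* gmem, float4* out) {
  __shared__ float4 smem[128];
  // cp.async takes the destination as a .shared state-space address.
  unsigned smemAddr = static_cast<unsigned>(
      __cvta_generic_to_shared(&smem[threadIdx.x]));
  // Issue the asynchronous 16-byte copy: gmem -> smem.
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n" ::"r"(smemAddr),
               "l"(gmem + threadIdx.x));
  // Commit outstanding copies into a group, then wait for all of them.
  asm volatile("cp.async.commit_group;\n" ::);
  asm volatile("cp.async.wait_all;\n" ::);
  __syncthreads();
  out[threadIdx.x] = smem[threadIdx.x];
}
```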

@shmsong (Author) commented May 7, 2022

These two C++ tests fail on my system:

```
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] NVFuserTest.FusionPersistentNormLocalShared_CUDA
[  FAILED  ] NVFuserTest.FusionSmemBlockGemmCacheDoubleBuffer_CUDA

[ RUN      ] NVFuserTest.FusionPersistentNormLocalShared_CUDA
unknown file: Failure
C++ exception with description "(dynamic_smem_size) < (available_dynamic_smem_without_reconfiguration + additional_dynamic_smem_available_through_reconfiguration) INTERNAL ASSERT FAILED at "../torch/csrc/jit/codegen/cuda/executor.cpp":571, please report a bug to PyTorch. The total shared memory allocation is larger than available memory. Dynamic size: 66048. Available size: 49136. Configured smem size: 49152. Device limit size: 65536

[ RUN      ] NVFuserTest.FusionSmemBlockGemmCacheDoubleBuffer_CUDA
unknown file: Failure
C++ exception with description "aten_output_tensor.allclose( fusion_output_tensor.to(aten_output_tensor.dtype()), tolerance_values.second, tolerance_values.first) INTERNAL ASSERT FAILED at "../torch/csrc/jit/codegen/cuda/test/test_gpu_validator.h":436, please report a bug to PyTorch.

Validation error in output 0 on line 20743 in file ../torch/csrc/jit/codegen/cuda/test/test_gpu.cpp.
  Detected abs error of: 3.25719
    absolute tolerance was set to 8.57934e-06
    and relative tolerance set to 8.57934e-08
```

Thanks for checking.

Both failures are caused/triggered by #1440, and the second one has something to do with ToT changes. Will put up more details and fix/track.

Need to go offline, will come back to the review discussions. Thanks for the initial feedback 👍

```
TEST_F(NVFuserTest, FusionDoubleBufferCpAsync1_CUDA) {
  // requires Ampere+ GPU
  if (!deviceMajorMinorCheck(8)) {
    GTEST_SKIP() << "skipping tests on pre-AMPERE GPUs";
```
Collaborator:

I'd prefer these arch-dependent tests to go through lowering and skip the rest, so that they are still exercised on older GPUs to the extent possible.

Collaborator:

There's also validation of arch versions at the beginning of lowering, which is inconvenient when I just want to see the generated code on my system with a Turing GPU. Perhaps another PYTORCH_NVFUSER_DISABLE flag to skip the validation?

Certain conditions must hold for the lowering passes to work, but there are also conditions that do not prevent them from going through, like this use of arch-specific instructions. Another example is non-constant allocations of register arrays. For debugging, I'd like a way to relax those conditions.

Collaborator:

I think it'd be better to have this test run through scheduling and then do ASSERT_ANY_THROW on lowering on pre-Ampere devices; see the sketch below.
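For illustration, a minimal sketch of the suggested structure; the fusion setup is elided, and the exact lowering entry point (`GpuLower` here) is an assumption based on the nvFuser API of the time:

```
TEST_F(NVFuserTest, FusionDoubleBufferCpAsync1_CUDA) {
  Fusion fusion;
  FusionGuard fg(&fusion);
  // ... build and schedule the fusion with cp.async double buffering ...

  if (!deviceMajorMinorCheck(8)) {
    // Pre-Ampere: scheduling succeeded above, but lowering must reject
    // the arch-specific cp.async usage.
    ASSERT_ANY_THROW(GpuLower(&fusion).kernel());
    return;
  }
  // Ampere+: compile, run, and validate as usual.
}
```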

Author:

Sure, it'd be a very helpful debug and test feature. Some extra assertions would be needed in that case to make sure we never send a device-mismatched kernel to nvrtc, not even for compilation.

Let's move this discussion and these changes into #1702.

Collaborator:

Are you still working on #1702?

Author:

Not really. Just rebasing after the master bump, and now is probably a good time to start merging.

I will need to propagate the changes in #1702 to #1440 and to this one, though.

```
//
// TODO: we may not need this sync at all if we didn't need
// to sync the block before. This information is in sync info.
// In this case shared mem double buffering "collapses" into
```
Collaborator:

Not sure what collapsing means.

Does this mean we don't need block sync when the async wait inserted above is sufficient?

Author:

Yes. I will shorten the comment, or even just make the change in this PR. Basically, the double-buffer sync should now look at the sync info to decide whether a sync needs to be inserted.

It's a generic double-buffering change, but if it's only a few lines I could just do it in this PR as well. (See the sketch of the loop shape after this comment.)
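For orientation, a hedged, hand-written sketch of the double-buffered main-loop shape under discussion; this is not nvFuser's generated code, and `loadTileAsync`/`computeTile` are hypothetical helpers:

```
// Prologue: prefetch tile 0 into smem slot 0.
loadTileAsync(smem[0], /*tile=*/0);  // issues cp.async instructions
asm volatile("cp.async.commit_group;\n" ::);

for (int i = 0; i < nTiles; ++i) {
  if (i + 1 < nTiles) {
    // Prefetch the next tile into the other slot while tile i is consumed.
    loadTileAsync(smem[(i + 1) % 2], /*tile=*/i + 1);
  }
  // Committing every iteration (possibly an empty group) keeps the
  // wait_group count uniform across iterations.
  asm volatile("cp.async.commit_group;\n" ::);
  // Allow at most one group in flight, so tile i's copy has completed.
  asm volatile("cp.async.wait_group 1;\n" ::);
  // The block sync is only needed when threads read smem written by other
  // threads; when it isn't, the async wait alone suffices and the sync
  // "collapses" away -- which is what the TODO above is getting at.
  __syncthreads();
  computeTile(smem[i % 2]);
}
```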

Collaborator:

Sounds good.

Collaborator:

This is the only part I still need to review. Let me know if you're done.

Author:

I think I added all I wanted to include about double buffering in this PR.

Will have a follow-up on double buffering for sure, to clear all the TODOs.

Author:

@naoyam This PR should be ready for re-review.

Will need to merge back other changes before merging this one, but I expect all remaining commits to be fully mechanical.

@naoyam (Collaborator) left a comment:

Thanks for addressing all my concerns!

@csarofeen (Owner) left a comment:

This PR is really cool and very clean now. Good work.

```
// Initialize buffer -> predicated load
// For memcpy async:
// If we initialized and then loaded (without sync) it would be undefined
// behavior.
```
Owner:

This is only supported for memcpy async, is that true? It's a little hard for me to tell from the comment exactly where the support works and where it doesn't, but I think SMEM + memcpy async is the only place this is usable.

Author:

Yes. This is only done for cp.async, and it is actually the only scenario I know of where this is required.

In follow-ups I think this can be removed for the prioritized cases as well, even for cp.async, but getting rid of it completely didn't seem possible. (A sketch of the inverted-predicate pattern follows below.)
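For illustration only, a hand-written sketch of the inverted-predicate initialization this thread discusses (not nvFuser's actual codegen; `inBounds`, `smemAddr`, `gmemPtr`, and `idx` are placeholder names). Instead of unconditionally zero-initializing the buffer and then issuing a predicated cp.async over it, the zero-fill is emitted under the inverse of the load predicate, so the plain smem store never races with the in-flight asynchronous copy:

```
if (inBounds) {
  // Valid elements: asynchronous 4-byte copy from gmem into smem.
  asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n" ::"r"(smemAddr),
               "l"(gmemPtr));
} else {
  // Inverted predicate: only out-of-bounds elements are zero-initialized.
  smem[idx] = 0.f;
}
```

Note that the else branch still writes to smem, which is why even elements outside the valid region must fall inside the allocated region, as the next review threads discuss.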

```
  kir::IrVisitor::handle(expr);
}

// Invert the predicate of given expr.
```
Owner:

The limitation here seems to be that we're treating anything outside the "valid" region as being the initialization region. This is fine as long as we never go outside the allocated region. However, what we'd likely want is predicates that are true only outside the "valid region" and within the "allocated region". Is this understanding correct?

Author:

Yes, that's right. For full reliability in all the cases we want to cover, we'd need to find precisely the region that's allocated but invalid. But as mentioned above, this is mostly for functional coverage; in most perf-critical cases we should be able to find alternatives to this, which will come in follow-ups on cp.async.

```
// input or its shared mem output, or both.
// Due to the WAR discussed in [Predicate Inversion for CpAsync],
// we currently cannot support use cases where both the gmem read
// and the smem write need to be predicated.
```
Owner:

This just requires a predicate for the GMEM read, no? I'm failing to see how this is different from the logic/limitation in my comment in lower_predicate.cpp. We would want to read valid GMEM locations, which would have to have valid associated SMEM write locations. So the "valid region" is the same for GMEM and SMEM and is guaranteed to be in "allocated memory". I think this is the same limitation; it just wasn't immediately obvious to me.

Author:

Yes, this is the same limitation; the assertion is put here because this is where we have access to the allocated-region info. If we have valid data we are guaranteed to have an allocated region, but due to the limitation discussed in lower_predicate.cpp, for cp.async we additionally require an allocated region for all active threads in this instruction.

I will still need to work through pathological cases in follow-ups; those are about functional coverage rather than perf-critical use cases.

torch/csrc/jit/codegen/cuda/test/test_gpu.cpp (review thread resolved)
Base automatically changed from ampere_mma_op to devel May 23, 2022 23:50
@shmsong merged commit dc670a2 into devel May 24, 2022
@shmsong deleted the ampere_cp_async_initial_support branch May 24, 2022
malfet pushed a commit to pytorch/pytorch that referenced this pull request Jun 8, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:
1. Initial support of cp.async and cp.async.wait: csarofeen#1619
2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: csarofeen#1643
3. Extending the infrastructure to support mma operators on turing and ampere arch: csarofeen#1440

Commits that are actually in this PR from the csarofeen branch:
```
* dd23252 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f Fix missing cooperative launch (#1726)
* dc670a2 Async gmem copy support on sm80+ (#1619)
* 5e6a8da Add turing mma support and test (#1643)
* d6d6b7d Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39 Mma op integration on ampere (#1440)
* fade8da patch python test for bfloat16 (#1724)
* 8fbd0b1 Fine-grained kernel profiling (#1720)
* 77c1b4f Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b More precise concretization analysis (#1719)
* f4d3630 Enable complex python tests (#1667)
* 4ceeee5 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830 Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7 updating_ci_machine (#1718)
* 56585c5 Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453 Allow using nvFuser on CUDA extension (#1701)
* 18bee67 Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: #78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Jun 8, 2022

Same devel-to-master sync as above (Pull Request resolved: #78244). Reviewed By: ejguan. Differential Revision: D36678948. Pulled By: davidberard98.
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022

Same devel-to-master sync (Pull Request resolved: pytorch/pytorch#78244) mirrored into jjsjann123/nvfuser, with the same commit list as above.
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022

Same devel-to-master sync (Pull Request resolved: pytorch/pytorch#78244) mirrored again, with the same commit list as above.