Async gmem copy support on sm80+ #1619
Conversation
Thanks for checking. Both failures are caused by #1440, and the second one has something to do with TOT (top of tree). I will put in more details and fix/track them. I need to go offline now, but I will come back to the review discussions. Thanks for the initial feedback 👍
TEST_F(NVFuserTest, FusionDoubleBufferCpAsync1_CUDA) {
  // requires Ampere+ GPU
  if (!deviceMajorMinorCheck(8)) {
    GTEST_SKIP() << "skipping tests on pre-AMPERE GPUs";
I'd prefer that these arch-dependent tests go through the lowering and skip the rest, so that they are tested on older GPUs to the extent possible.
There's also validation of arch versions at the beginning of lowering, which is inconvenient when I just want to see the generated code on my system with a Turing GPU. Perhaps another PYTORCH_NVFUSER_DISABLE flag to skip the validation?
Certain conditions must hold to make the lowering passes work, but there are also other conditions that do not prevent them from going through, like this use of arch-specific instructions. Another example is non-constant allocations of register arrays. For debugging, I'd like a way to relax those conditions.
I think it'd be better to have this test run through the scheduling and then do ASSERT_ANY_THROW with the lowering on pre-Ampere devices.
Sure, it'd be a very helpful debug and test feature. Some extra assertions would be needed in this case to make sure we never send any device-mismatched kernel to nvrtc, not even for nvrtc compilation.
Let's move this discussion and these changes into #1702.
Are you still working on #1702?
//
// TODO: we may not need this sync at all if we didn't need
//  to sync the block before. This information is in sync info.
//  In this case shared mem double buffering "collapses" into
Not sure what collapsing means.
Does this mean we don't need block sync when the async wait inserted above is sufficient?
Yes, I will shorten the comment, or even just do it in this PR. The idea is basically that the double buffer sync pass should now look at the sync info to decide whether a sync should be inserted.
It's a generic double buffering change, but if it's only a few lines I could just do it in this PR as well.
Sounds good.
This is the only part I still need to review. Let me know if you're done.
I think I added all I wanted to include about double buffering in this PR.
I will have a follow-up on double buffering for sure, to clear all the TODOs.
@naoyam This PR should be ready for re-review.
I will need to merge back other changes before merging this one, but I expect all remaining commits to be fully mechanical.
Thanks for addressing all my concerns!
This PR is really cool and very clean now. Good work.
// Initialize buffer -> predicated load
// For memcpy async:
//  If we initialized and then loaded (without sync) it would be undefined
//  behavior.
This is only supported for memcpy async, is that true? It's a little hard for me to tell from the comment exactly where the support works and where it doesn't, but I think SMEM + memcpy async is the only place this is usable.
Yes. This is only done for cp.async, and this is actually the only scenario I know of where it is required.
In follow-ups I think this can be removed for the prioritized cases as well, even for cp.async, but getting rid of it completely didn't seem possible.
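To make the "initialize then load is undefined behavior" point above concrete, here is a hedged sketch (plain C++, not nvFuser's actual codegen; the function name is hypothetical) of why the predicate is inverted for cp.async rather than emitting an unconditional init followed by a predicated load:

```cpp
#include <cassert>
#include <vector>

// Synchronous pattern (fine for a plain ld.global):
//   buf[i] = 0;                        // init
//   if (in_bounds(i)) buf[i] = src[i]; // predicated load overwrites init
//
// With cp.async the copy is asynchronous, so the init store and the
// in-flight async copy could target the same smem address with no
// ordering between them: undefined behavior. Inverting the predicate
// makes each address written by exactly one statement:
//   if (in_bounds(i))  cp.async(buf + i, src + i);
//   else               buf[i] = 0;     // inverted predicate -> init path
std::vector<int> load_with_inverted_predicate(
    const std::vector<int>& src, int buffer_size) {
  std::vector<int> buf(buffer_size);
  for (int i = 0; i < buffer_size; ++i) {
    if (i < static_cast<int>(src.size())) {
      buf[i] = src[i]; // stands in for the cp.async transfer
    } else {
      buf[i] = 0;      // out-of-bounds lanes take the init path instead
    }
  }
  return buf;
}
```

Each buffer element is written exactly once, which is what makes the pattern safe even when the real load completes asynchronously.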
kir::IrVisitor::handle(expr);
}

// Invert the predicate of given expr.
The limitation here seems to be that we're predicating anything outside the "valid" regions as being the initialization regions. This is fine as long as we never go outside the allocated region. However, what we'd likely want are predicates that are true only outside the "valid region" and within the "allocated region". Is this understanding correct?
Yes, that's right. To be fully reliable in all the cases we want to cover, we'd need to find precisely the region that's allocated but invalid. But as mentioned above, this is mostly for functional coverage, and in most perf-critical cases we should be able to find alternatives to doing this, which will come in follow-ups on cp.async.
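The two regions discussed above can be sketched as simple predicates. This is a hedged illustration, not nvFuser code; the function names and extent parameters are hypothetical:

```cpp
#include <cassert>

// "valid"     -- index holds real data for this tile
// "allocated" -- index is backed by the shared-memory allocation
bool valid(int i, int valid_extent) {
  return i >= 0 && i < valid_extent;
}
bool allocated(int i, int alloc_extent) {
  return i >= 0 && i < alloc_extent;
}

// Negating only the valid-region predicate gives !valid(i), which can
// also cover indices outside the allocation. The precise predicate for
// the init (zero-fill) path is restricted to the allocated region:
bool init_predicate(int i, int valid_extent, int alloc_extent) {
  return allocated(i, alloc_extent) && !valid(i, valid_extent);
}
```

With `valid_extent <= alloc_extent`, the init path then touches only the allocated-but-invalid gap, never memory outside the allocation.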
// input or its shared mem output, or both.
// Due to the WAR discussed in [Predicate Inversion for CpAsync],
// we currently cannot support use cases where both the gmem read
// and the smem write need to be predicated.
This just requires a predicate for the GMEM read, no? I'm failing to see how this is different from the logic/limitation in my comment in lower_predicate.cpp. We would want to read valid GMEM locations, which would have to have valid associated SMEM write locations. So the "valid region" is the same for GMEM and SMEM and is guaranteed to be in "allocated memory". I think this is the same limitation; it just wasn't immediately obvious to me.
Yes, this is the same limitation, and the assertion is put here because here we have access to the allocated-region info. If we have valid data, we are guaranteed to have an allocated region; but due to the limitation discussed in lower_predicate.cpp, for cp.async we also require an allocated region for all active threads in this instruction.
I will still need to work through pathological cases in follow-ups, for functional coverage but not perf-critical use cases.
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:
1. Initial support of cp.async and cp.async.wait: csarofeen#1619
2. Emulate Ampere's mma 16816 with Turing's mma 1688, for a unified interface: csarofeen#1643
3. Extending the infrastructure to support mma operators on Turing and Ampere archs: csarofeen#1440

Commits actually in this PR from the csarofeen branch:
```
* dd23252 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f Fix missing cooperative launch (#1726)
* dc670a2 Async gmem copy support on sm80+ (#1619)
* 5e6a8da Add turing mma support and test (#1643)
* d6d6b7d Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39 Mma op integration on ampere (#1440)
* fade8da patch python test for bfloat16 (#1724)
* 8fbd0b1 Fine-grained kernel profiling (#1720)
* 77c1b4f Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b More precise concretization analysis (#1719)
* f4d3630 Enable complex python tests (#1667)
* 4ceeee5 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830 Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7 updating_ci_machine (#1718)
* 56585c5 Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453 Allow using nvFuser on CUDA extension (#1701)
* 18bee67 Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: #78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
This PR is separated from #1440 to address initial support of cp.async.
It introduces the preliminary use cases of cp.async (sm80+): the async copy is represented as a LoadStoreOp in the analysis, and cp.async.wait is inserted in the double buffer loop.