
Mma op integration on ampere #1440

Merged

shmsong merged 64 commits into devel from ampere_mma_op May 23, 2022
Conversation

@shmsong commented Feb 8, 2022

This PR is a continuation of #1439; the focus is on extending the infrastructure to support mma operators on the Turing and Ampere architectures, including:

  • Extension of the mma infrastructure to support Turing and Ampere mma (see the sketch after this list).
  • Minimal support for ldmatrix and cp.async to facilitate gemm on Turing and Ampere.
  • Predicate and sync logic to enable ldmatrix and cp.async (preliminary, limited support; a major cleanup is planned in a follow-up).
  • Larger gemm fusion examples on Turing (GemmGemm and GemmSoftmaxGemm).
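For orientation, a minimal CUDA sketch (ours, not code from this PR) of the Ampere-class mma.sync instruction that the extended mma support targets; register counts follow the PTX ISA for m16n8k16 with f16 inputs and f32 accumulate:

```
// Sketch only: one Ampere mma.sync.m16n8k16 issued by all 32 threads of a
// warp. A is 4 x .b32 registers (8 halves), B is 2 x .b32 (4 halves),
// C/D are 4 x .f32 per thread.
__device__ void mma_m16n8k16_f16f32(float (&cd)[4],
                                    const unsigned (&a)[4],
                                    const unsigned (&b)[2]) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
      : "+f"(cd[0]), "+f"(cd[1]), "+f"(cd[2]), "+f"(cd[3])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(b[0]), "r"(b[1]));
}
```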

@shmsong shmsong changed the title Mma op integration on turing/ampere [Do not merge] Mma op integration on turing/ampere Feb 10, 2022
@shmsong shmsong changed the title [Do not merge] Mma op integration on turing/ampere [Do not merge] WIP: Mma op integration on turing/ampere Feb 22, 2022
Base automatically changed from volta_mma_op to devel March 21, 2022 16:21
@csarofeen (Owner) left a comment

Looks really good :-)

std::stringstream ss;
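// Emits: reinterpret_cast<Array<dtype, vec_size, vec_size>*>(&val), i.e. the
// operand recast as a pointer to a vectorized Array type.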

ss << "reinterpret_cast<Array<" << dtype << "," << vec_size << ","
<< vec_size << ">*>(&" << gen(val) << ")";
csarofeen (Owner):

reinterpret_cast seemed quite unreliable; are you sure you don't want to make sure we thunk to PTX if we can?

shmsong (Author):

Yes eventually I think we want to explicitly specify which 4 registers to pass to ldmatrix, once we have the swizzle labeling part merged.
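For reference, a hedged sketch (ours) of what passing explicit registers to ldmatrix could look like via inline PTX; the helper name is hypothetical:

```
// Hypothetical helper: load four 8x8 b16 matrices from shared memory into
// four explicitly named destination registers.
__device__ void ldmatrix_x4(unsigned& r0, unsigned& r1, unsigned& r2,
                            unsigned& r3, const void* smem_ptr) {
  unsigned addr = static_cast<unsigned>(__cvta_generic_to_shared(smem_ptr));
  asm volatile(
      "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0,%1,%2,%3}, [%4];\n"
      : "=r"(r0), "=r"(r1), "=r"(r2), "=r"(r3)
      : "r"(addr));
}
```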

//! 2. direct output of a broadcast op following a ldmatrix op
//! Returns true if the tv is an immediate output of ldmatrix op
//!
//! TODO: this check is
csarofeen (Owner):

Nit: what's the TODO for?

shmsong (Author):

It's for eventually removing the pattern matching for ldmatrix. Completed the TODO comment. Thanks.

tv->split(-1, 4);
tv->split(-2, 2);

// 0 1 2 3 4
csarofeen (Owner):

Nit: 4?

shmsong (Author):

Cleaned up. Thanks!
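For context, a sketch (our annotation, not the PR's) of what the quoted schedule does to the iteration domain:

```
// Assume tv's domain is [I0, I1].
tv->split(-1, 4); // split the innermost axis by 4:  [I0, ceilDiv(I1, 4), 4]
tv->split(-2, 2); // split the new quotient axis:    [I0, ceilDiv(ceilDiv(I1, 4), 2), 2, 4]
```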

if (options.operand == MmaOptions::Operand::A) {
TORCH_INTERNAL_ASSERT(tv->nDims() >= 2);
// validation:
TORCH_INTERNAL_ASSERT(canValidateIsInnerDim(
csarofeen (Owner):

Nit: asserts should print something unless they should be impossible to reach.

shmsong (Author):

Added error messages to these checks. Thanks.
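For illustration, the kind of message that addresses the nit (a sketch; the exact wording added in the PR may differ):

```
TORCH_INTERNAL_ASSERT(
    tv->nDims() >= 2,
    "MMA operand A assumed to have at least 2 dimensions, found ",
    tv->nDims());
```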

@naoyam (Collaborator) commented May 9, 2022

Looks like some non-trivial changes have been made since I reviewed the last time. Let me take a look at them.

// T2 = 0; // init for exp1
// if(pred)
// T2 = T1 ... // exp1
// If we remove pred around expr1, as the way the pred removal
Collaborator:

This only happens when exp1 is a reduction-like op.

// T1[i] = T0[i] + ...
// Note that we'd be able to reuse buffer of T0 for T1 but
// if we initialize T1 we cannot do that and thus the
// kernel would not fit in smaller devices.
Collaborator:

I think it's fine to disable this case for now, but it looks like for this kind of pattern we would want to just initialize T0 and eliminate the predicates for T1 and T2, which would not interfere with the buffer reuse between T0 and T1.

shmsong (Author):

Yes. The holistic predicate/initialization analysis has been on the TODO list, and quite a few TODOs in this PR actually boil down to it. I will try to get to that in follow-ups.

@@ -213,7 +213,13 @@ void SyncMap::build(Fusion* fusion) {
// If consumer is parallelized with this type but producer is
// predicated redundant on this type. This parallel dimension
// is a RAW dimension. See test: FusionSeriaSmemWriteParallelRead1/2
if (c_id != nullptr && producer_redundant_types.get(parallel_type)) {
//
Collaborator:

Is this specific to this PR?

Collaborator:

Can you also please add a test for this specific pattern?

Collaborator:

Does this mean that whenever we have a redundant op, we would insert a sync immediately after it? Then what would happen if we have a chain of redundant exprs? In my naive quick thinking, it seems we should just add a sync when an expr chain becomes non-redundant.

Collaborator:

Why don't we extract this change out of this PR?

shmsong (Author) commented May 9, 2022:

The sync insertion wouldn't be right after the redundant op, which I believe currently only means a redundant write. The usual RAW sync insertion rule still applies here, i.e. a sync right before the values are used.

As noted in the comment and TODO, this sync wouldn't be needed if all use chains of this redundant write ended with another redundant shared/global write of the same redundant type. Handling that case would need additional traversal info somewhere.

Since this issue currently lives in devel, I just wanted to do a quick patch and then follow up.

shmsong (Author):

Removed from this PR and moved to #1684
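To make the discussed pattern concrete, a hedged standalone CUDA sketch (ours): a producer write that is predicated off for part of the block (redundant on TIDx) still needs the usual RAW sync right before the consuming read:

```
__global__ void redundant_write_raw(const float* in, float* out) {
  __shared__ float smem[32];
  if (threadIdx.x < 32) {
    smem[threadIdx.x] = in[threadIdx.x]; // producer: predicated to one warp
  }
  __syncthreads(); // RAW sync goes right before the use, not after the write
  out[threadIdx.x] = smem[threadIdx.x % 32]; // consumer parallelized on TIDx
}
```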

@shmsong (Author) commented May 23, 2022

Checked the perf delta on V100. It looks like the change in this PR didn't affect norm perf; the delta seems to be within noise. Below are the top benchmarks with slowdown > 2%, sorted in decreasing order by %.
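For concreteness, the slowdown columns are computed as follows (using the first row's values):

$$\text{slowdown}\,\% = \frac{t_\text{after} - t_\text{before}}{t_\text{before}} \times 100 = \frac{5.4737 - 5.0020}{5.0020} \times 100 \approx 9.43$$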

benchmark    runtime before (us)    runtime after (us)    slowdown (us) = after - before    slowdown (%) = 100 * (after - before) / before
NvFuserScheduler_Broadcast_Outer_fp32___GRAPH/NvFuserScheduler_Broadcast_Outer_fp32/64/160/manual_time 5.002035089 5.473661983 0.471626895 9.428700244
NvFuserScheduler_Broadcast_Inner_fp16___GRAPH/NvFuserScheduler_Broadcast_Inner_fp16/128/128/manual_time 5.258944758 5.727213319 0.468268561 8.904230458
NvFuserScheduler_Broadcast_Inner_fp16___GRAPH/NvFuserScheduler_Broadcast_Inner_fp16/64/320/manual_time 5.266786908 5.708947011 0.442160103 8.39525332
NvFuserScheduler_Softmax_BWD_Outer_fp32___GRAPH/NvFuserScheduler_Softmax_BWD_Outer_fp32/64/160/manual_time 5.966772077 6.440313975 0.473541897 7.936316175
NvFuserScheduler_Broadcast_Inner_fp32___GRAPH/NvFuserScheduler_Broadcast_Inner_fp32/64/160/manual_time 5.171204781 5.553243605 0.382038824 7.387810779
NvFuserScheduler_Softmax_BWD_Outer_fp32___GRAPH/NvFuserScheduler_Softmax_BWD_Outer_fp32/64/320/manual_time 6.066368353 6.500501974 0.434133622 7.156400606
NvFuserScheduler_BatchNorm_fp16___GRAPH/NvFuserScheduler_BatchNorm_fp16/8/32/8/manual_time 7.092909477 7.591659561 0.498750084 7.031671358
NvFuserScheduler_Broadcast_Outer_fp16___GRAPH/NvFuserScheduler_Broadcast_Outer_fp16/128/128/manual_time 4.981953668 5.312278102 0.330324433 6.630419614
NvFuserScheduler_Softmax_Outer_fp32___GRAPH/NvFuserScheduler_Softmax_Outer_fp32/64/320/manual_time 7.110007049 7.557707712 0.447700663 6.296768205
NvFuserScheduler_Softmax_BWD_Outer_fp16___GRAPH/NvFuserScheduler_Softmax_BWD_Outer_fp16/64/320/manual_time 6.503192134 6.910252507 0.407060373 6.259393305
NvFuserScheduler_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp16/2/8/64/manual_time 17.45210796 18.51110772 1.058999761 6.068033523
NvFuserScheduler_BatchNorm_fp32___GRAPH/NvFuserScheduler_BatchNorm_fp32/64/2/8/manual_time 8.152586811 8.647064557 0.494477746 6.065286483
NvFuserScheduler_Broadcast_Outer_fp16___GRAPH/NvFuserScheduler_Broadcast_Outer_fp16/64/320/manual_time 5.171121785 5.45804398 0.286922195 5.548548399
NvFuserScheduler_LayerNorm_fp16___GRAPH/NvFuserScheduler_LayerNorm_fp16/320/64/manual_time 6.009975374 6.328506612 0.318531237 5.300042303
NvFuserScheduler_Softmax_Outer_fp32___GRAPH/NvFuserScheduler_Softmax_Outer_fp32/128/512/manual_time 7.794475255 8.204124035 0.409648779 5.255629995
NvFuserScheduler_BatchNorm_nhwc_fp32___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp32/64/32/8/manual_time 22.37687499 23.54485723 1.167982235 5.219594939
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/8/152/28/manual_time 26.67603836 28.06582087 1.389782507 5.209853457
NvFuserScheduler_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp16/64/32/8/manual_time 17.56774887 18.44712116 0.879372293 5.00560601
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/128/48/14/manual_time 30.43901674 31.89958398 1.460567244 4.798339109
NvFuserScheduler_Softmax_Outer_fp16___GRAPH/NvFuserScheduler_Softmax_Outer_fp16/128/512/manual_time 8.137826347 8.525082262 0.387255915 4.758714411
NvFuserScheduler_Softmax_BWD_Inner_fp32___GRAPH/NvFuserScheduler_Softmax_BWD_Inner_fp32/64/160/manual_time 5.560485413 5.82074676 0.260261347 4.680550839
NvFuserScheduler_Softmax_Inner_fp32___GRAPH/NvFuserScheduler_Softmax_Inner_fp32/64/320/manual_time 6.110953211 6.394189223 0.283236012 4.634890857
NvFuserScheduler_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp16/8/2/256/manual_time 31.79447912 33.2583148 1.463835678 4.60405617
NvFuserScheduler_LayerNorm_fp32___GRAPH/NvFuserScheduler_LayerNorm_fp32/2/32768/manual_time 16.89424033 17.67205223 0.777811898 4.604006353
NvFuserScheduler_BatchNorm_fp32___GRAPH/NvFuserScheduler_BatchNorm_fp32/64/32/2/manual_time 6.901824689 7.21878047 0.316955781 4.592347609
NvFuserScheduler_LayerNorm_fp16___GRAPH/NvFuserScheduler_LayerNorm_fp16/2/32768/manual_time 17.81000135 18.60242855 0.792427206 4.449338272
NvFuserScheduler_BatchNorm_BWD_fp16___GRAPH/NvFuserScheduler_BatchNorm_BWD_fp16/64/8/8/manual_time 8.693300921 9.076493004 0.383192083 4.407900828
NvFuserScheduler_Reduction_Outer_fp16___GRAPH/NvFuserScheduler_Reduction_Outer_fp16/2/32768/manual_time 5.374971876 5.595072859 0.220100982 4.094923425
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/8/24/7/manual_time 7.79811889 8.113108218 0.314989329 4.039298875
NvFuserScheduler_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp16/512/32/8/manual_time 26.79418905 27.85602032 1.061831269 3.962916238
NvFuserScheduler_BatchNorm_fp32___GRAPH/NvFuserScheduler_BatchNorm_fp32/64/32/2/manual_time 6.861575892 7.132745882 0.27116999 3.952007448
NvFuserScheduler_BatchNorm_fp16___GRAPH/NvFuserScheduler_BatchNorm_fp16/64/64/2/manual_time 7.338797144 7.61793454 0.279137396 3.803585115
NvFuserScheduler_Softmax_Outer_fp16___GRAPH/NvFuserScheduler_Softmax_Outer_fp16/2/32768/manual_time 7.240792185 7.515533583 0.274741397 3.794355513
NvFuserScheduler_Reduction_Outer_fp16___GRAPH/NvFuserScheduler_Reduction_Outer_fp16/2/32768/manual_time 5.390028385 5.594306253 0.204277867 3.789921924
NvFuserScheduler_BatchNorm_nhwc_fp32___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp32/64/32/8/manual_time 22.1232427 22.95116442 0.827921725 3.742316334
NvFuserScheduler_BatchNorm_nhwc_BWD_fp32___GRAPH/NvFuserScheduler_BatchNorm_nhwc_BWD_fp32/2/8/64/manual_time 22.95460278 23.80705311 0.852450333 3.713635742
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/128/24/14/manual_time 32.98001716 34.19605037 1.21603321 3.687181861
NvFuserScheduler_BatchNorm_nhwc_fp32___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp32/64/128/2/manual_time 8.927148575 9.248310891 0.321162316 3.59759125
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/8/48/56/manual_time 44.79040662 46.35431856 1.563911941 3.491622557
NvFuserScheduler_Softmax_Inner_fp32___GRAPH/NvFuserScheduler_Softmax_Inner_fp32/32768/8/manual_time 26.35712594 27.24897227 0.89184633 3.383700986
NvFuserScheduler_Softmax_Inner_fp16___GRAPH/NvFuserScheduler_Softmax_Inner_fp16/512/128/manual_time 7.043253291 7.279535621 0.23628233 3.354732822
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/128/48/7/manual_time 21.23020293 21.92584322 0.695640288 3.276653975
NvFuserScheduler_LayerNorm_fp32___GRAPH/NvFuserScheduler_LayerNorm_fp32/16/32768/manual_time 22.98422661 23.73724392 0.753017315 3.276235167
NvFuserScheduler_LayerNorm_fp16___GRAPH/NvFuserScheduler_LayerNorm_fp16/8/32768/manual_time 21.84684497 22.5608758 0.714030834 3.268347606
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/32/48/28/manual_time 30.40066825 31.39157549 0.990907245 3.259491657
NvFuserScheduler_BatchNorm_fp16___GRAPH/NvFuserScheduler_BatchNorm_fp16/2/32/64/manual_time 9.581944702 9.892861686 0.310916984 3.244821312
NvFuserScheduler_LayerNorm_fp32___GRAPH/NvFuserScheduler_LayerNorm_fp32/160/64/manual_time 6.579429254 6.791462294 0.21203304 3.222666164
NvFuserScheduler_RMSNorm_fp32___GRAPH/NvFuserScheduler_RMSNorm_fp32/18/manual_time 7.047316274 7.274053241 0.226736967 3.217351938
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/128/24/7/manual_time 18.92254506 19.52410172 0.601556661 3.179047317
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/16/24/28/manual_time 20.00276538 20.59184767 0.589082292 2.945004255
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/16/152/14/manual_time 29.03755103 29.88435264 0.846801605 2.916229417
NvFuserScheduler_Softmax_BWD_Outer_fp32___GRAPH/NvFuserScheduler_Softmax_BWD_Outer_fp32/4096/128/manual_time 23.91025538 24.59746289 0.687207505 2.87411194
NvFuserScheduler_Softmax_BWD_Outer_fp32___GRAPH/NvFuserScheduler_Softmax_BWD_Outer_fp32/512/128/manual_time 8.130706863 8.363504293 0.23279743 2.863188083
NvFuserScheduler_LayerNorm_fp16___GRAPH/NvFuserScheduler_LayerNorm_fp16/512/512/manual_time 7.858435051 8.081246397 0.222811346 2.835314468
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/8/200/14/manual_time 13.1394418 13.51165307 0.372211269 2.832778395
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/32/200/7/manual_time 13.13224568 13.50133012 0.369084443 2.810520391
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/128/48/7/manual_time 28.48047811 29.27027039 0.789792282 2.773100506
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/32/48/14/manual_time 21.76956073 22.34836598 0.578805258 2.658782439
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/128/40/7/manual_time 27.60118928 28.33191623 0.730726956 2.647447358
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/8/40/28/manual_time 27.06604152 27.76387554 0.697834015 2.578264037
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/8/368/28/manual_time 43.91909506 45.02323386 1.104138801 2.514029033
NvFuserScheduler_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_BatchNorm_nhwc_BWD_fp16/64/2/64/manual_time 29.77471558 30.52282204 0.748106458 2.512556185
NvFuserScheduler_Softmax_BWD_Outer_fp32___GRAPH/NvFuserScheduler_Softmax_BWD_Outer_fp32/128/512/manual_time 7.171884772 7.35132013 0.179435358 2.501927513
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/128/48/28/manual_time 73.18557647 75.00250911 1.816932636 2.482637596
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/256/72/7/manual_time 37.28342204 38.20441504 0.920993002 2.47024804
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/8/184/28/manual_time 28.07316133 28.75276467 0.679603334 2.420829366
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/256/24/7/manual_time 19.8724716 20.34166613 0.469194529 2.361027548
NvFuserScheduler_Broadcast_Outer_fp16___GRAPH/NvFuserScheduler_Broadcast_Outer_fp16/512/512/manual_time 5.896683434 6.033131528 0.136448093 2.313980305
NvFuserScheduler_Softmax_Inner_fp16___GRAPH/NvFuserScheduler_Softmax_Inner_fp16/128/512/manual_time 6.734072974 6.889400882 0.155327908 2.306596749
NvFuserScheduler_LayerNorm_fp16___GRAPH/NvFuserScheduler_LayerNorm_fp16/128/128/manual_time 6.587796905 6.738291871 0.150494966 2.284450605
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/8/184/56/manual_time 64.55902375 66.01889247 1.459868715 2.261293047
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/512/40/14/manual_time 59.27420661 60.58163337 1.307426761 2.205726295
NvFuserScheduler_Softmax_BWD_Inner_fp16___GRAPH/NvFuserScheduler_Softmax_BWD_Inner_fp16/64/320/manual_time 5.462514708 5.582931763 0.120417055 2.204425285
NvFuserScheduler_BatchNorm_nhwc_fp32___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp32/64/8/8/manual_time 21.72911831 22.2063522 0.477233895 2.196287435
NvFuserScheduler_BatchNorm_nhwc_fp32___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp32/2/2/8/manual_time 7.79666854 7.967713608 0.171045068 2.193822496
NvFuserScheduler_BatchNorm_nhwc_BWD_fp32___GRAPH/NvFuserScheduler_BatchNorm_nhwc_BWD_fp32/2/8/2/manual_time 5.468709605 5.587764548 0.119054944 2.177020764
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/8/24/56/manual_time 32.7025543 33.40964586 0.707091558 2.162190609
NvFuserScheduler_BatchNorm_nhwc_fp32___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp32/8/32/8/manual_time 20.12773228 20.55721187 0.429479594 2.133770402
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/8/56/28/manual_time 21.69606443 22.15216065 0.456096222 2.102207168
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/128/152/7/manual_time 36.56043951 37.32853135 0.768091841 2.100882404
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/64/24/7/manual_time 16.78702667 17.13771011 0.350683448 2.08901466
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/64/40/28/manual_time 42.78770417 43.6695605 0.881856332 2.061004089
NvFuserScheduler_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_BatchNorm_nhwc_BWD_fp16/64/32/8/manual_time 24.35471892 24.85387302 0.499154096 2.049516966
NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16/256/152/7/manual_time 42.36686861 43.23246737 0.865598754 2.043102977
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/8/56/56/manual_time 43.44452544 44.33102026 0.88649482 2.040521357
NvFuserScheduler_LayerNorm_fp16___GRAPH/NvFuserScheduler_LayerNorm_fp16/8/262144/manual_time 37.85316688 38.62057491 0.767408039 2.027328498
NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16/64/56/7/manual_time 25.68954774 26.20834552 0.518797785 2.019489757
NvFuserScheduler_Softmax_Outer_fp16___GRAPH/NvFuserScheduler_Softmax_Outer_fp16/512/512/manual_time 10.96766924 11.18727728 0.219608032 2.002321795

@csarofeen (Owner) commented May 23, 2022

Agree, seems within noise. Go ahead and merge.

@shmsong shmsong merged commit 7093e39 into devel May 23, 2022
@shmsong shmsong deleted the ampere_mma_op branch May 23, 2022 23:50
malfet pushed a commit to pytorch/pytorch that referenced this pull request Jun 8, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:
1. Initial support of cp.async and cp.async.wait: csarofeen#1619
2. Emulate Ampere's mma 16816 with Turing's mma 1688, for a unified interface: csarofeen#1643
3. Extending the infrastructure to support mma operators on Turing and Ampere arch: csarofeen#1440

Commits actually in this PR from the csarofeen branch:
```
* dd23252 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f Fix missing cooperative launch (#1726)
* dc670a2 Async gmem copy support on sm80+ (#1619)
* 5e6a8da Add turing mma support and test (#1643)
* d6d6b7d Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39 Mma op integration on ampere (#1440)
* fade8da patch python test for bfloat16 (#1724)
* 8fbd0b1 Fine-grained kernel profiling (#1720)
* 77c1b4f Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b More precise concretization analysis (#1719)
* f4d3630 Enable complex python tests (#1667)
* 4ceeee5 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830 Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7 updating_ci_machine (#1718)
* 56585c5 Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453 Allow using nvFuser on CUDA extension (#1701)
* 18bee67 Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: #78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Jun 8, 2022
Pull Request resolved: #78244

Reviewed By: ejguan

Differential Revision: D36678948

Pulled By: davidberard98

fbshipit-source-id: 0ccde965acbd31da67d99c6adb2eaaa888948105
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022