Mma op integration on ampere #1440
Conversation
2b4f3b0 to edd43d9 (compare)
Looks really good :-)
std::stringstream ss;

ss << "reinterpret_cast<Array<" << dtype << "," << vec_size << ","
   << vec_size << ">*>(&" << gen(val) << ")";
The reinterpret_cast seemed quite unreliable; are you sure you don't want to make sure we thunk to PTX if we can?
Yes, eventually I think we want to explicitly specify which 4 registers to pass to ldmatrix, once we have the swizzle labeling part merged.
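For context, a minimal self-contained sketch of how such a vectorized pointer cast string could be assembled; `genVectorPointer`, its parameters, and the example operand names are hypothetical illustrations of the pattern quoted above, not nvfuser's actual codegen entry point:

```
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper (illustration only): builds the cast expression that
// reinterprets a scalar element as a pointer to a vectorized Array, using the
// same string-building pattern as the snippet quoted above.
std::string genVectorPointer(
    const std::string& val,   // already-generated operand, e.g. "T1[i]"
    const std::string& dtype, // element type name, e.g. "__half"
    int vec_size) {
  std::stringstream ss;
  ss << "reinterpret_cast<Array<" << dtype << "," << vec_size << ","
     << vec_size << ">*>(&" << val << ")";
  return ss.str();
}

int main() {
  // Prints: reinterpret_cast<Array<__half,8,8>*>(&T1[i])
  std::cout << genVectorPointer("T1[i]", "__half", 8) << std::endl;
  return 0;
}
```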
//! 2. direct output of a broadcast op following a ldmatrix op
//! Returns true if the tv is an immediate output of ldmatrix op
//!
//! TODO: this check is
Nit: what's the todo for?
It's for eventually removing the pattern matching for ldmatrix. I've completed the TODO comment. Thanks.
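For illustration, a minimal sketch of the check the quoted comment describes; the stand-in IR types below are hypothetical simplifications for this sketch, not nvfuser's actual classes:

```
// Hypothetical, simplified IR node types for illustration only.
struct Expr;
struct TensorView {
  Expr* definition = nullptr; // op producing this tensor, if any
};
struct Expr {
  enum class Kind { LdMatrix, Broadcast, Other } kind = Kind::Other;
  TensorView* input = nullptr; // a single input is enough for this sketch
};

// A tv is treated as an ldmatrix output if it is either
//  1. the immediate output of an ldmatrix op, or
//  2. the output of a broadcast op whose input is an immediate ldmatrix output.
bool isLdMatrixOutput(const TensorView* tv) {
  const Expr* def = tv->definition;
  if (def == nullptr) {
    return false;
  }
  if (def->kind == Expr::Kind::LdMatrix) {
    return true;
  }
  if (def->kind == Expr::Kind::Broadcast && def->input != nullptr &&
      def->input->definition != nullptr) {
    // Exactly one broadcast on top of an ldmatrix op is accepted.
    return def->input->definition->kind == Expr::Kind::LdMatrix;
  }
  return false;
}
```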
tv->split(-1, 4);
tv->split(-2, 2);

// 0 1 2 3 4
Nit: 4?
Cleaned up. Thanks!
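As a hedged illustration of what the two quoted splits do (the starting extent of 32 is chosen only for this example):

```
// Starting layout of the innermost domain (example extent):
//   [..., I(32)]
// tv->split(-1, 4) splits the last domain by 4:
//   [..., I(8), I(4)]
// tv->split(-2, 2) then splits the new second-to-last domain by 2:
//   [..., I(4), I(2), I(4)]
```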
if (options.operand == MmaOptions::Operand::A) {
  TORCH_INTERNAL_ASSERT(tv->nDims() >= 2);
  // validation:
  TORCH_INTERNAL_ASSERT(canValidateIsInnerDim(
Nit: Asserts should have something printed unless it should be impossible to reach.
Added error messages in these checks. Thanks.
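For example, TORCH_INTERNAL_ASSERT accepts trailing message arguments, so a message can be attached directly; the wording below is a hypothetical sketch, not necessarily what the PR ended up using:

```
// Hedged example only; the actual message text in the PR may differ.
TORCH_INTERNAL_ASSERT(
    tv->nDims() >= 2,
    "MMA operand A scheduling expects a tensor with at least 2 dimensions, but got ",
    tv->nDims());
```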
Looks like some non-trivial changes have been made since I reviewed the last time. Let me take a look at them.
// T2 = 0; // init for exp1
// if(pred)
//   T2 = T1 ... // exp1
// If we remove pred around expr1, as the way the pred removal
This only happens when exp1 is a reduction-like op.
// T1[i] = T0[i] + ...
// Note that we'd be able to reuse buffer of T0 for T1 but
// if we initialize T1 we cannot do that and thus the
// kernel would not fit in smaller devices.
I think it's fine to disable in this case for now, but it looks like for this kind of pattern we would want to just initialize T0 and eliminate the predicates for T1 and T2, which would not interfere with the buffer reuse between T0 and T1.
Yes. The holistic predicate/initialization analysis has been on the TODO list, and quite a few TODOs in this PR actually boil down to that. I will try to get to that in follow-ups.
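A hedged sketch of the pattern being discussed (T0/T1/T2 follow the names used above; indexing is illustrative only): initializing only T0 and dropping the predicates on T1 and T2 keeps the values well defined without forcing T1 to get a separately initialized buffer.

```
// Illustrative pseudocode, not generated kernel code.
// T0[i] = 0;                 // initialize only the producer
// if (pred)
//   T0[i] = in[i];           // guarded load
// T1[i] = T0[i] + ...;       // predicate eliminated: reads initialized T0
// T2[i] = T1[i] ...;         // predicate eliminated as well
//
// Since T1 itself is never separately initialized, the buffer reuse between
// T0 and T1 mentioned above is not blocked.
```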
@@ -213,7 +213,13 @@ void SyncMap::build(Fusion* fusion) {
      // If consumer is parallelized with this type but producer is
      // predicated redundant on this type. This parallel dimension
      // is a RAW dimension. See test: FusionSeriaSmemWriteParallelRead1/2
      if (c_id != nullptr && producer_redundant_types.get(parallel_type)) {
      //
Is this specific to this PR?
Can you also please add a test for this specific pattern?
Does this mean that whenever we have a redundant op, we would insert a sync immediately after it? Then what would happen if we have a chain of redundant exprs? In my naive quick thinking, it seems we should just add a sync when an expr chain becomes non-redundant.
Why don't we extract this change out of this PR?
The sync insertion wouldn't be right after the redundant op, which I believe currently only means a redundant write. The usual RAW sync insertion rule still applies here, i.e. a sync right before the values are used.
As noted in the comment and TODO, this sync wouldn't be needed if all use chains of this redundant write end with another redundant shared/global write of the same redundant type. That case would need some additional traversal info somewhere.
As this issue currently lives in devel, I just wanted to do a quick patch and then follow up.
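A hedged illustration of the redundant-write / parallel-read pattern this check covers (compare the FusionSeriaSmemWriteParallelRead tests referenced in the quoted comment); the kernel below uses hypothetical names and is illustrative, not nvfuser-generated code:

```
// Illustrative CUDA kernel, hypothetical names only.
__global__ void redundant_write_then_parallel_read(const float* T0, float* T3) {
  __shared__ float smem_T1[256];
  // Producer predicated "redundant" on TIDx: only thread 0 writes smem.
  if (threadIdx.x == 0) {
    for (int i = 0; i < 256; ++i) {
      smem_T1[i] = T0[i];
    }
  }
  // RAW sync right before the values are used, not right after the
  // redundant write expression itself.
  __syncthreads();
  // Consumer parallelized on TIDx reads values written by thread 0 only.
  T3[threadIdx.x] = smem_T1[threadIdx.x] + 1.0f;
}
```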
Removed from this PR and moved to #1684
Checked perf delta on V100. It looked like the change in this PR didn't affect norm perf; the delta seemed to be within noise range. Below are the top ones with slowdown > 2% (sorted in decreasing order by %).
Agree, seems within noise. Go ahead and merge.
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:
1. Initial support of cp.async and cp.async.wait: csarofeen#1619
2. Emulate Ampere's mma 16816 with Turing's mma 1688, for a unified interface: csarofeen#1643
3. Extending the infrastructure to support mma operators on Turing and Ampere arch: csarofeen#1440

Commits that are actually in this PR from the csarofeen branch:

```
* dd23252 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f Fix missing cooperative launch (#1726)
* dc670a2 Async gmem copy support on sm80+ (#1619)
* 5e6a8da Add turing mma support and test (#1643)
* d6d6b7d Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39 Mma op integration on ampere (#1440)
* fade8da patch python test for bfloat16 (#1724)
* 8fbd0b1 Fine-grained kernel profiling (#1720)
* 77c1b4f Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b More precise concretization analysis (#1719)
* f4d3630 Enable complex python tests (#1667)
* 4ceeee5 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830 Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7 updating_ci_machine (#1718)
* 56585c5 Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453 Allow using nvFuser on CUDA extension (#1701)
* 18bee67 Validate LOOP concrete IDs have complete IterDomains (#1676)
```

Pull Request resolved: #78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
This PR is a continuation of #1439; the focus is on extending the infrastructure to support mma operators on Turing and Ampere architectures, including: