[DO NOT MERGE] Tracking Devel #1208

Open · wants to merge 1,311 commits into base: master

Conversation

csarofeen (Owner)

No description provided.

@csarofeen mentioned this pull request Oct 23, 2021

@jjsjann123 (Collaborator) left a comment:

minor comments for myself to clean up the merge PR

@@ -75,6 +75,7 @@ _(aten, _expm1) \
_(aten, _fft_with_size) \
_(aten, _fill) \
_(aten, _floor) \
_(aten, _indexCopy) \
_(aten, _fused_dropout) \

Collaborator:
Note: don't need this... but it doesn't matter once we cherry-pick pytorch#63937

@@ -557,7 +557,6 @@ std::tuple<Tensor, Tensor, Tensor> _batch_norm_impl_index_backward(
}

// backward in inference mode is not supported in cudnn, fallback to native
// TODO: verify the same thing in miopen

Collaborator:
errr. upstream should have removed this comment by now

scale_bias_relu.cpp
utils.cpp
main.cpp)
add_executable(nvfuser_bench

Collaborator:
indentation

test/test_jit.py (Outdated)
@@ -10817,6 +10817,89 @@ def addmm_grad_test(b, x, w):
self.assertEqual(w.grad, w_ref.grad)
self.assertEqual(b.grad, b_ref.grad)

def test_layer_norm_grad(self):

Collaborator:
remove this test case.

@@ -0,0 +1,372 @@
#!/usr/bin/env python3

Collaborator:
Oops, remove this file.

@@ -4,6 +4,7 @@
#include <torch/csrc/jit/ir/alias_analysis.h>
#include <torch/csrc/jit/ir/ir.h>
#include <torch/csrc/jit/ir/node_hashing.h>
#include <torch/csrc/jit/jit_log.h>

Collaborator:
We could revert this. 😛

@@ -1301,6 +1301,10 @@ def forward(self, a) -> MyModule:
obj = obj.__original_fn
_rcb = _jit_internal.createResolutionCallbackFromClosure(obj)

# some functions are explicitly marked as not supported in script mode

Collaborator:
oops, this should go to upstream!

@csarofeen requested a review from mruberry as a code owner May 4, 2022 20:30
@csarofeen removed the request for review from mruberry August 26, 2022 18:20
// In PyTorch, reduction of a size-0 tensor is effectively creating a tensor
// filled with the init value.
auto maybe_full =
maybeFullInsteadOfReduction(uint_axes, init, tv, keep_dim, dtype);

Owner (author):
PyTorch, why are you so strange?
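
For context, a minimal plain-PyTorch sketch of the behavior this comment is reacting to (an illustration, not nvfuser code): reducing over a size-0 dimension yields a tensor filled with the reduction's init value, which is why the reduction can be replaced by a full-like op here.

```python
import torch

# Reduction over a size-0 dimension does not error; it returns a tensor
# filled with the reduction's identity (init) value.
x = torch.empty(0, 3)

print(torch.sum(x, dim=0))    # tensor([0., 0., 0.])  -- init value of sum is 0
print(torch.prod(x, dim=0))   # tensor([1., 1., 1.])  -- init value of prod is 1
```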

jjsjann123 and others added 25 commits November 9, 2022 09:25
* Refactoring of lower_alias_memory
* Look at all the loops rather than just the consumer IDs as there can be
loops not mapped to the consumer.
* add print debug for nvfuser

* refine dump exprs code: 1) rename the option 2) move duplicated logic into the function dumpExprsIfEnabled

Co-authored-by: Feiwen Zhu <mzhu@nvidia.com>
Add computeWith to interleave gmem accesses and computations
* Add Float IR node class

Represents a 32-bit floating-point scalar value. Not supported in
PyTorch, so it can't be used as an input to fusions
* Refactor scalar IR nodes (Int, Double and Bool)

Everything uses template class Scalar
* "Vectorize" sequential welford computations

Lift the predicated count division outside of the innermost loop if that
loop is exactly mapped with vectorized IDs and not a reduction domain.
Targeted to address outer-reduction grid welford tuning
liqiangxl and others added 30 commits February 28, 2023 17:27
Disables the index_select / gather Python tests since upstream backed out autodiff support for these ops (pytorch#95565).

We'll re-enable them when we re-merge the autodiff support, opt-in via an environment variable.
* Fix #2531

Changed ReplayTransformations to take bool parameters explicitly via
setter methods. This is to avoid accidentally passing those bool arguments
in the wrong order. More verbose, but safer.
Implements fundamental logic for multi-device support

Co-authored-by: shmsong <shisong@umich.edu>
Co-authored-by: Sergey Lebedev <sergeyle@nvidia.com>
Co-authored-by: snordmann <snordmann@nvidia.com>
Co-authored-by: Xiang Gao <qasdfgtyuiop@gmail.com>
Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
Added a new Python API, fd.ops.add_output(tensor, stride_order), where stride_order[i] gives the layout position of output axis i, larger values meaning faster-varying (smaller-stride) dimensions.

E.g., to produce an output in channels-last format, specify fd.ops.add_output(tensor_view, [0, 3, 1, 2]); an output with shape [N, C, H, W] will then have strides [H*W*C, 1, W*C, C]. (See the sketch after this entry.)

Implementation details:
It's currently done in a naive way. Since nvfuser doesn't support a user-specified stride order yet, we fake it by:

    adding a permute op on the outputs inside the generated kernel, to ensure that each output is stored in the requested memory layout;
    after the kernel has executed, permuting the corresponding output to undo the permutation inside the kernel, which yields the semantically correct output in the desired memory layout.
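
A rough, standalone sketch of the stride computation implied by this description (compute_strides is a hypothetical helper for illustration, not part of the nvfuser API):

```python
# Hypothetical helper: derive output strides from stride_order as described above,
# where a larger stride_order value marks a faster-varying (smaller-stride) axis.
def compute_strides(shape, stride_order):
    ndims = len(shape)
    # Visit axes from fastest (largest stride_order value) to slowest.
    fast_to_slow = sorted(range(ndims), key=lambda i: stride_order[i], reverse=True)
    strides = [0] * ndims
    running = 1
    for axis in fast_to_slow:
        strides[axis] = running
        running *= shape[axis]
    return strides

# Channels-last example from the description: shape [N, C, H, W], stride_order [0, 3, 1, 2]
N, C, H, W = 2, 8, 4, 4
assert compute_strides([N, C, H, W], [0, 3, 1, 2]) == [H * W * C, 1, W * C, C]
```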
Ampere tests running on pre-Ampere devices trigger a CI failure.
Fixing and improving indexing type handling
Fixes #2564

Co-authored-by: Jacob Hinkle <jhinkle@nvidia.com>
Fixes Python handling of expanded broadcast dimensions,
e.g. torch.randint(0, 1, (5, 5), device="cuda").bool().unsqueeze(-1).expand(5, 5, 5)

Changes the contiguity representation on the Python side.
computeContiguity returns an array whose length equals the tensor rank; elements can be True, False, or None, where None indicates that the given dimension is a broadcast dimension. (A sketch follows below.)
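
A standalone sketch of the contiguity representation just described (contiguity_like is a hypothetical helper, not the actual computeContiguity): an expanded broadcast dimension has stride 0 and is reported as None, while other dimensions get True/False depending on whether their stride matches a densely packed layout.

```python
import torch

# Hypothetical helper mirroring the representation described above: one entry per
# dimension; None marks an expanded broadcast (stride-0) dimension, True/False says
# whether the dimension's stride matches a densely packed (contiguous) layout.
def contiguity_like(t: torch.Tensor):
    sizes, strides = list(t.shape), list(t.stride())
    result = [None] * len(sizes)
    expected = 1
    for i in reversed(range(len(sizes))):
        if strides[i] == 0:              # expanded broadcast dimension
            result[i] = None
            continue
        result[i] = (strides[i] == expected)
        expected = strides[i] * sizes[i]
    return result

# CPU variant of the example above
t = torch.randint(0, 1, (5, 5)).bool().unsqueeze(-1).expand(5, 5, 5)
print(contiguity_like(t))                # [True, True, None]
```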
* Clean up compile-time and run-time index options
Fixes index_select on empty/scalar indices. Issues were found via the Python API.

Our stack should support empty tensors (numel() == 0), so the check on that was removed.
A scalar tensor should be used in place of a real Scalar via variable_name[0]; a quick patch was added for that.
---------

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
* Include non-view rfactor IDs in CA map rfactor ID sets
* fix tests for multicluster fusion
…s. (#2576)

Recomputation for each persistent use should be done after the
accumulation is done.

Currently, recomputation and replaceVal can be done redundantly. For
example, on A100, that happens with NvFuserScheduler_BatchNorm_fp32/64/32/256.