[PTX-MMA] Add full PTX MMA code generation support #9909

KnowingNothing · 2022-01-12T03:08:46Z

This change adds broad (though not exhaustive) PTX MMA code generation support for three generations of Tensor Cores: Volta, Turing, and Ampere. The generation logic is implemented mainly in ptx_mma.cc and should have no major influence on existing code. A test file is provided in tests/python/unittest/test_tir_ptx_mma.py. Known limitations, where further improvement is possible:

Correctness tests for int4 and binary MMA instructions are missing because NumPy has no support for int4 and binary kernels.
TF32 and bf16 instructions are supported, but no tests are provided because, as far as I know, these data types are not natively supported by TVM.
The implementation for binary MMA generates mma.sync.aligned.m16n8k256.row.col.s32.b1.b1.s32.and.popc for uint1 and mma.sync.aligned.m16n8k256.row.col.s32.b1.b1.s32.xor.popc for int1. This may not be the best choice.
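
For reference, a binary MMA result can still be checked in NumPy if the bits are kept unpacked, one 0/1 value per array element. The sketch below is only an illustration and not part of the PR: the helper name `binary_mma_reference` is hypothetical, and it assumes the `.row.col` layout means A is stored as M x K and B as N x K. It replaces the multiply with a bitwise AND or XOR and the accumulation with a population count along K.

```python
import numpy as np

# Instruction names quoted from the PR description; the dict itself is illustrative.
BINARY_MMA = {
    "uint1": "mma.sync.aligned.m16n8k256.row.col.s32.b1.b1.s32.and.popc",
    "int1": "mma.sync.aligned.m16n8k256.row.col.s32.b1.b1.s32.xor.popc",
}


def binary_mma_reference(a_bits, b_bits, op):
    """Hypothetical reference: a_bits is (M, K), b_bits is (N, K), both holding 0/1."""
    a = a_bits.astype(np.int32)[:, None, :]  # (M, 1, K)
    b = b_bits.astype(np.int32)[None, :, :]  # (1, N, K)
    if op == "and":
        combined = a & b
    elif op == "xor":
        combined = a ^ b
    else:
        raise ValueError(f"unsupported bit operation: {op!r}")
    # Population count along K, accumulated as 32-bit integers.
    return combined.sum(axis=-1, dtype=np.int32)


rng = np.random.default_rng(0)
a_bits = rng.integers(0, 2, size=(16, 256), dtype=np.uint8)
b_bits = rng.integers(0, 2, size=(8, 256), dtype=np.uint8)
c_and = binary_mma_reference(a_bits, b_bits, "and")  # shape (16, 8)
c_xor = binary_mma_reference(a_bits, b_bits, "xor")  # shape (16, 8)
```

A real test would additionally need to pack the bits into 32-bit words before handing them to the kernel; that step is omitted here.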

tests/python/unittest/test_tir_ptx_mma.py

src/target/source/codegen_cuda.cc

vinx13

Some minor issues, otherwise LGTM. There are some (unrelated) test errors on CI; could you try pushing again to restart the CI?

vinx13 · 2022-01-13T17:56:12Z

src/target/source/ptx_mma.h

+namespace tvm {
+namespace codegen {
+
+std::string PrintPTXAssembly(const std::string& shape, const std::string& A_layout,


maybe PrintMMAAssembly would be a better name

vinx13 · 2022-01-13T17:57:32Z

tests/python/unittest/test_tir_ptx_mma.py

+    golden = np.matmul(A_np.astype("float64"), B_np.astype("float64").T)
+
+    C_numpy = C_tvm.numpy()
+    from tvm import testing


this is not needed as tvm.testing is already imported at the beginning

vinx13 · 2022-01-13T18:01:53Z

src/target/source/ptx_mma.cc

+  /*
+   * TODO: add mma.m16n8k128
+   */
+  return "";


Suggested change

-  return "";
+  ICHECK(0);
+  throw;

If this is unreachable, just raise an error.

shingjan · 2022-01-14T09:45:13Z

tests/python/unittest/test_tir_ptx_mma.py

+    for i in range(4):
+        Accum[i] = T.float32(0)
+
+    for mma_multi_a_col in T.vectorized(4):


Thanks for the PR! I wonder if you could elaborate more on the necessity of the declarations of MultiA, MultiB and Accum buffers here. Do buffers like A, B and C not work within the MMA assembly code generated below?

To use MMA instructions, the multiplicands and the accumulator must be placed in registers; otherwise, the behavior is undefined. I tried invoking MMA instructions on global buffers (e.g., A, B, C) directly, and the results were all wrong.

shingjan · 2022-01-14T09:47:24Z

tests/python/unittest/test_tir_ptx_mma.py

+        MultiA[mma_multi_a_col] = A[
+            (tx % 32) // 4 + mma_multi_a_col // 2 * 8, (tx % 32) % 4 * 2 + mma_multi_a_col % 2
+        ]
+    for mma_multi_b_col in T.vectorized(4):


Maybe we can combine the three loops that initialize MultiA, MultiB, and possibly Accum, given that the loop ranges are the same.

It may be clearer to keep them separate: people who are not familiar with CUDA or MMA can then see that loading MultiA, loading MultiB, and initializing Accum are decoupled, which also matches the pattern of the code generated by TVM.
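
The index expression in the hunk above decides which element of the 16 x 8 tile A each thread loads into MultiA. A small host-side check, written here as an illustration and not taken from the PR, enumerates all 32 lanes and all 4 per-lane elements and counts how often each element of A is touched. The names `a_index` and `hits` are made up for this sketch.

```python
import numpy as np

M, K = 16, 8        # A tile shape used in the test (A_np is 16 x 8)
WARP_SIZE = 32
ELEMS_PER_LANE = 4  # extent of the T.vectorized(4) loop


def a_index(tx, i):
    """Index expression copied from the test: the A element read into MultiA[i] by thread tx."""
    lane = tx % 32
    return lane // 4 + i // 2 * 8, lane % 4 * 2 + i % 2


hits = np.zeros((M, K), dtype=np.int32)
for tx in range(WARP_SIZE):
    for i in range(ELEMS_PER_LANE):
        row, col = a_index(tx, i)
        hits[row, col] += 1

# Every entry of `hits` equal to 1 means the warp covers the tile exactly once.
print(hits.min(), hits.max())
```

With 32 lanes holding 4 elements each, the warp holds 128 values, which matches the 16 x 8 tile; the check confirms that the mapping assigns each element to exactly one (lane, register) pair.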

shingjan · 2022-01-14T09:48:21Z

tests/python/unittest/test_tir_ptx_mma.py

+            "fp16",
+            "fp32",
+            MultiA,
+            0,


Does the use of MultiA MultiB and Accum make the bias/offset here unnecessary?

Maybe there is another way to implement the interface. I followed the existing manner of tvm_mma_sync.

Generally, it is necessary if the buffer is larger than what mma requires.
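
As a plain NumPy analogy (not TVM code), the offset matters as soon as one local buffer holds more than one fragment. The sketch assumes the offset is counted in elements and that a thread's A fragment has 4 elements, as MultiA does in the test; the `fragment` helper is hypothetical.

```python
import numpy as np

FRAGMENT_ELEMS = 4  # per-thread A fragment size in the test (MultiA has 4 elements)

# Hypothetical: one local buffer that holds two A fragments back to back.
multi_a = np.arange(2 * FRAGMENT_ELEMS, dtype=np.float16)


def fragment(buffer, offset, size=FRAGMENT_ELEMS):
    """Return the slice an instruction would read when given (buffer, offset)."""
    return buffer[offset : offset + size]


first = fragment(multi_a, 0)                # elements 0..3
second = fragment(multi_a, FRAGMENT_ELEMS)  # elements 4..7
```

When the buffer is exactly one fragment long, as in the test, the only valid offset is 0, which is why it looks redundant there.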

shingjan · 2022-01-14T09:49:28Z

tests/python/unittest/test_tir_ptx_mma.py

+
+    A_np = np.random.uniform(-1, 1, [16, 8]).astype("float16")
+    B_np = np.random.uniform(-1, 1, [8, 8]).astype("float16")
+    C_np = np.random.uniform(-1, 1, [16, 8]).astype("float32")


Shouldn't the value of C_np be zeros?

It's OK to set C_np to random values, although the most standard way is to set C_np to zeros. The results are not affected by the initial value of C_np because the accumulators are always initialized to zeros.

Yeah, I am worried that the implication may confuse people.

I have changed it to np.zeros.

Note that random initialization does follow the convention we have in the TVM repo, so I don't think it confuses anybody. Changing to zeros should be good too, so I don't have a strong opinion.
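
The point about C_np can be illustrated with a NumPy stand-in for the kernel. This is a sketch, not the TVM kernel itself: `emulate_mma_test` is a hypothetical function that mirrors the structure described in this thread, where the accumulator starts at zero and C is only written, never read.

```python
import numpy as np


def emulate_mma_test(a, b, c_init):
    """NumPy stand-in for the kernel: accumulate from zero, then overwrite C."""
    c = c_init.copy()
    accum = np.zeros(c.shape, dtype=np.float32)  # mirrors Accum[i] = T.float32(0)
    accum += a.astype(np.float32) @ b.astype(np.float32).T
    c[...] = accum  # C is written, never read
    return c


rng = np.random.default_rng(0)
A_np = rng.uniform(-1, 1, [16, 8]).astype("float16")
B_np = rng.uniform(-1, 1, [8, 8]).astype("float16")
golden = np.matmul(A_np.astype("float64"), B_np.astype("float64").T)

from_random = emulate_mma_test(A_np, B_np, rng.uniform(-1, 1, [16, 8]).astype("float32"))
from_zeros = emulate_mma_test(A_np, B_np, np.zeros([16, 8], dtype="float32"))

# Same result regardless of what C held before the call.
np.testing.assert_array_equal(from_random, from_zeros)
np.testing.assert_allclose(golden, from_zeros, atol=1e-3, rtol=1e-3)
```

If the kernel instead loaded C into the accumulator before the MMA, the two runs would differ and the golden value would have to include C_np.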

junrushao · 2022-01-15T18:04:40Z

Some tests are failing (probably not relevant to this PR). Retriggering

junrushao · 2022-01-15T23:02:38Z

Failed again. @KnowingNothing would you mind checking the unittests also on your side?

KnowingNothing · 2022-01-16T11:21:34Z

It also failed on my local machine.

$ pytest tests/python/frontend/pytorch/qnn_test.py::test_serialized_modules
enabled targets: llvm; llvm -device=arm_cpu; cuda; cuda -model=unknown -libs=cudnn; nvptx; opencl; opencl -device=mali,aocl_sw_emu; opencl -device=intel_graphics
pytest marker:
============================================================== test session starts ===============================================================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/zchno/TVM/tvm-mirror-pr
collected 1 item

tests/python/frontend/pytorch/qnn_test.py Fatal Python error: Aborted

Current thread 0x00007f149c7e1740 (most recent call first):
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/torch/jit/_serialization.py", line 161 in load
  File "/home/zchno/TVM/tvm-mirror-pr/tests/python/frontend/pytorch/qnn_test.py", line 513 in test_serialized_modules
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/python.py", line 1641 in runtest
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/runner.py", line 255 in <lambda>
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
  File "/home/zchno/venv/prime/lib/python3.8/site-packages/_pytest/config/__init__.py", line 185 in console_main
  File "/home/zchno/venv/prime/bin/pytest", line 8 in <module>
Aborted (core dumped)

vinx13 · 2022-01-16T16:34:44Z

@KnowingNothing Can you try rebasing and testing again?

vinx13 · 2022-01-18T18:47:33Z

I checked the unit test and confirmed it is caused by this commit. It seems the error only happens when std::regex is used, probably because of C++ ABI incompatibility with libtorch.
@junrushao1994 Do you have more insights on this?

junrushao · 2022-01-18T19:08:34Z

@vinx13 My experience with std::regex is overwhelmingly negative. If it's the source of these bugs, let's consider other alternatives

junrushao · 2022-01-21T01:49:40Z

CC: @jinhongyii

KnowingNothing · 2022-01-24T03:11:24Z

I tried to replace std::regex with normal string operations. Hope this will work.
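
The thread does not show what the regular expression matched, so the following is only an illustration of the general idea, written in Python and not the C++ of ptx_mma.cc. It parses a shape string in the form used by the instruction names quoted above (for example m16n8k256) with plain string operations; the function name `parse_mma_shape` is hypothetical.

```python
def parse_mma_shape(shape):
    """Parse a shape string such as "m16n8k256" into (m, n, k) without regex."""
    m_pos, n_pos, k_pos = shape.find("m"), shape.find("n"), shape.find("k")
    # "m" must lead, and "n" must come before "k".
    if not (m_pos == 0 and 0 < n_pos < k_pos):
        raise ValueError(f"cannot parse MMA shape: {shape!r}")
    fields = (shape[1:n_pos], shape[n_pos + 1 : k_pos], shape[k_pos + 1 :])
    # Each field must be a non-empty run of digits.
    if not all(field.isdigit() for field in fields):
        raise ValueError(f"cannot parse MMA shape: {shape!r}")
    m, n, k = (int(field) for field in fields)
    return m, n, k


shape_mnk = parse_mma_shape("m16n8k256")
```

The C++ equivalent would rely on std::string::find and substr in the same way, which avoids pulling std::regex symbols into the shared library.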

junrushao · 2022-01-24T17:34:12Z

Thanks! This is huge

…y to warp memory (#10855) We already have PTX mma and mma.sp builtin support in #9909 and #10339 . However, we have not supported corresponding data movement builtins for these mma instructions, so the data movement would not be as fast as wmma. This PR brings the `ldmatrix` builtin, which is a native PTX warp-level instruction (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix), and we can use it to load several (1/2/4) 8x8 matrices from shared memory to warp memory.

KnowingNothing requested review from areusch, comaniac, icemelon, jroesch, junrushao, kparzysz-quic, masahi, merrymercy, tqchen, vinx13, yzhliu and ZihengJiang as code owners January 12, 2022 03:08

KnowingNothing force-pushed the add-ptx-mma branch 3 times, most recently from c97319a to d229a98 Compare January 12, 2022 03:55

vinx13 self-assigned this Jan 12, 2022

vinx13 reviewed Jan 12, 2022

tests/python/unittest/test_tir_ptx_mma.py (outdated)

vinx13 reviewed Jan 12, 2022

tests/python/unittest/test_tir_ptx_mma.py

vinx13 reviewed Jan 12, 2022

tests/python/unittest/test_tir_ptx_mma.py (outdated)

vinx13 reviewed Jan 12, 2022

src/target/source/codegen_cuda.cc (outdated)

KnowingNothing force-pushed the add-ptx-mma branch from d229a98 to 4da1629 Compare January 13, 2022 02:54

KnowingNothing requested a review from Hzfengsy as a code owner January 13, 2022 02:54

vinx13 reviewed Jan 13, 2022

KnowingNothing force-pushed the add-ptx-mma branch from 4da1629 to 2664115 Compare January 14, 2022 02:48

shingjan reviewed Jan 14, 2022

vinx13 approved these changes Jan 14, 2022

KnowingNothing force-pushed the add-ptx-mma branch from 2664115 to 54c5ca2 Compare January 15, 2022 05:06

KnowingNothing force-pushed the add-ptx-mma branch from 54c5ca2 to e568cc0 Compare January 17, 2022 16:04

[PTX-MMA] Add full PTX MMA code generation support

611a7ec

KnowingNothing force-pushed the add-ptx-mma branch from e568cc0 to 611a7ec Compare January 24, 2022 03:10

vinx13 approved these changes Jan 24, 2022

vinx13 merged commit d066441 into apache:main Jan 24, 2022

yuanfz98 pushed a commit to yuanfz98/tvm that referenced this pull request Jan 24, 2022

[PTX-MMA] Add full PTX MMA code generation support (apache#9909)

db16f3c

ylc pushed a commit to ylc/tvm that referenced this pull request Feb 16, 2022

[PTX-MMA] Add full PTX MMA code generation support (apache#9909)

b750606

yzh119 mentioned this pull request Feb 22, 2022

[PTX] Support mma.sp to use Sparse Tensor Cores and refactor mma codegen #10339

Merged

yzh119 mentioned this pull request Apr 1, 2022

[PTX] ldmatrix builtin to accelerate copying data from shared memory to warp memory #10855

Merged

masahi mentioned this pull request May 18, 2022

[TIR] Support tensorization using ldmatrix + MMA #11355

Merged

driazati mentioned this pull request Jul 14, 2022

TVM v0.9.0.rc0 Release Candidate Notes #12102

Closed
