[CODEGEN][CUDA] Fix vector load #5226

huochaitiantang · 2020-04-03T09:28:40Z

Fix the high/low bit order bug in __pack_half2.
Stop emitting a vector load as an extra statement ahead of the vector store, as in the currently generated code:

    int _1;
    int4 _2 = (make_int4)(
      ((((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) - 3))+(16*0), 
      ((((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) - 3))+(16*1), 
      ((((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) - 3))+(16*2), 
      ((((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) - 3))+(16*3));
    
    _1=(((signed char*)A)[_2.x] << 0);
    _1=_1 & ~(0x000000ff << 8) |(((signed char*)A)[_2.y] << 8);
    _1=_1 & ~(0x000000ff << 16) |(((signed char*)A)[_2.z] << 16);
    _1=_1 & ~(0x000000ff << 24) |(((signed char*)A)[_2.w] << 24);
    (( int*)(( signed char*)B + (((((int)blockIdx.x) * 88) + (((int)threadIdx.x) * 4)))))[0] = 
    (((((int)threadIdx.x) < 3) || (19 <= ((int)threadIdx.x))) ? (int)0 : _1);

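The code above is a padding kernel. The introduced variable _1 is computed unconditionally, whether or not _2.x, _2.y, _2.z, _2.w are valid indices into A, so threads that only write padding still read from A. Emit the following code instead, where the loads appear only in the non-padding branch: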

  int4 _1 = (make_int4)(
    ((((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) - 3))+(16*0), 
    ((((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) - 3))+(16*1), 
    ((((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) - 3))+(16*2), 
    ((((((int)blockIdx.x) * 64) + ((int)threadIdx.x)) - 3))+(16*3));

  (( int*)(( signed char*)B + (((((int)blockIdx.x) * 88) + (((int)threadIdx.x) * 4)))))[0] = 
    (((((int)threadIdx.x) < 3) || (19 <= ((int)threadIdx.x))) 
    ? (int)0 
    : ((0x000000ff << 0) & (((signed char*)A)[_1.x] << 0))|
      ((0x000000ff << 8) & (((signed char*)A)[_1.y] << 8))|
      ((0x000000ff << 16) & (((signed char*)A)[_1.z] << 16))|
      ((0x000000ff << 24) & (((signed char*)A)[_1.w] << 24)));
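
To make the difference concrete, here is a minimal host-side C++ sketch of the rewritten store. It is not TVM code: the function name `load_int8x4_or_zero` and its parameters are hypothetical, and the lanes are widened to unsigned before shifting so the example stays well defined in standard C++. It shows the two properties the new expression relies on: the input is read only in the non-padding branch, and the per-lane mask discards the sign-extension bits of negative values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of the rewritten store; not TVM code.
// Packs four signed 8-bit lanes of `a` into one 32-bit word, but only
// touches `a` when the element is not padding.
inline std::uint32_t load_int8x4_or_zero(const signed char* a,
                                         const int (&idx)[4],
                                         bool is_padding) {
  if (is_padding) {
    return 0u;  // padding branch: `a` is never dereferenced
  }
  assert(a != nullptr);
  std::uint32_t packed = 0u;
  for (int lane = 0; lane < 4; ++lane) {
    const int shift = 8 * lane;
    // Sign-extend like the generated code, then reinterpret as unsigned
    // so that the shift is well defined.
    const std::uint32_t widened =
        static_cast<std::uint32_t>(static_cast<std::int32_t>(a[idx[lane]]));
    // The mask removes the sign-extension bits above this lane's byte.
    packed |= (0xffu << shift) & (widened << shift);
  }
  return packed;
}
```

With lanes {1, -2, 3, -4} the bytes are 0x01, 0xFE, 0x03, 0xFC, so the packed word is 0xFC03FE01. In the padding case the function returns 0 even when the indices would fall outside the buffer.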

@vinx13, could you please help review? Thanks!

tqchen · 2020-04-04T01:48:25Z

also cc @wpan11nv @ZihengJiang

wpan11nv · 2020-04-06T18:12:11Z

src/target/source/literal/cuda_half_t.h

@@ -291,7 +291,7 @@ static inline __device__ __host__ unsigned
 __pack_half2(const half x, const half y) {
  unsigned v0 = *((unsigned short *)&x);
  unsigned v1 = *((unsigned short *)&y);
-  return (v0 << 16) | v1;
+  return (v1 << 16) | v0;


good catch!
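
As a rough illustration of why the operand order matters, the sketch below models the fixed function on the host with raw 16-bit patterns instead of `half`. The names `pack_u16_pair` and `view_pair_as_word` are hypothetical. It assumes the usual little-endian layout, in which the first of two consecutive 16-bit values occupies the low 16 bits of the enclosing 32-bit word.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for __pack_half2, using raw 16-bit patterns
// instead of `half` so that it runs on the host.
inline std::uint32_t pack_u16_pair(std::uint16_t x, std::uint16_t y) {
  const std::uint32_t v0 = x;
  const std::uint32_t v1 = y;
  return (v1 << 16) | v0;  // x in the low half, y in the high half
}

// Reference layout: two consecutive 16-bit values viewed as one 32-bit word.
// Agrees with pack_u16_pair only on a little-endian host (an assumption).
inline std::uint32_t view_pair_as_word(std::uint16_t x, std::uint16_t y) {
  const std::uint16_t pair[2] = {x, y};
  std::uint32_t word = 0;
  static_assert(sizeof(pair) == sizeof(word), "pair must fill the word");
  std::memcpy(&word, pair, sizeof(word));
  return word;
}

// Optional self-check for little-endian hosts.
inline void check_pack_matches_layout() {
  assert(pack_u16_pair(0x1234, 0xABCD) == view_pair_as_word(0x1234, 0xABCD));
}
```

The pre-fix order, `(v0 << 16) | v1`, would place x in the high half, which is the swap this hunk corrects.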

wpan11nv · 2020-04-06T18:14:43Z

src/target/source/codegen_c.cc

+void CodeGenC::PrintVecElemLoadExpr(
+    DataType t, int i, const std::string& value, std::ostream& os) {
+  CHECK_GT(t.lanes(), 1);
+  if (t.is_int() && t.bits() == 8) {


uint8 is supported now.

wpan11nv · 2020-04-06T18:16:31Z

tests/python/unittest/test_target_codegen_cuda.py

+                                    (0, 0)), mode='constant', constant_values=0)
+        tvm.testing.assert_allclose(b.asnumpy(), ref)
+
+    check_cuda("int8", 64, 16, 3, 4)


uint8 test?

Already added uint8 test.

wpan11nv · 2020-04-06T18:32:55Z

src/target/source/codegen_cuda.cc

+void CodeGenCUDA::PrintVecElemLoadExpr(
+    DataType t, int i, const std::string& value, std::ostream& os) {
+  CHECK_GT(t.lanes(), 1);
+  if (t.is_int() && t.bits() == 8) {


uint8 is supported now.
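
The body of PrintVecElemLoadExpr is not shown in the quoted hunk, so the following is only a sketch of how a per-lane expression in the style of the generated code from the PR description could be assembled. The helper `emit_packed_int8_load` and its signature are hypothetical; the only parts taken from the source are the lanes check and the textual shape of each mask-and-shift term.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not the TVM implementation: joins one
// mask-and-shift term per lane into a single packed-load expression.
inline std::string emit_packed_int8_load(
    const std::vector<std::string>& lane_values) {
  assert(lane_values.size() > 1);  // mirrors CHECK_GT(t.lanes(), 1)
  std::ostringstream os;
  for (std::size_t i = 0; i < lane_values.size(); ++i) {
    if (i != 0) os << "|";
    const std::size_t shift = 8 * i;
    // Each lane is shifted into its byte and masked to 8 bits.
    os << "((0x000000ff << " << shift << ") & (" << lane_values[i] << " << "
       << shift << "))";
  }
  return os.str();
}
```

Because the result is a single expression rather than a sequence of statements, it can be placed directly inside the conditional operand of the store.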

wpan11nv · 2020-04-06T18:34:20Z

tests/python/unittest/test_target_codegen_cuda.py

+        if not tvm.gpu(0).exist or not tvm.runtime.enabled("cuda"):
+            print("skip because cuda is not enabled..")
+            return
+


check if float16 is supported

Already checked.

huochaitiantang · 2020-04-07T09:03:20Z

Thanks for your review, @wpan11nv. The new commit adds uint8 support and also fixes the uint8 bug in the code generation of BroadcastNode.

wpan11nv · 2020-04-07T16:31:45Z

src/target/source/codegen_cuda.cc

    // make_int8x4
    const int64_t *p = as_const_int(op->value);
    CHECK(p);
    int64_t v = *p & 0xFF;
    v = (v << 24) | (v << 16) | (v << 8) | v;
-    os << "(int)" << v;
+    if (op->dtype.is_uint()) {


Why do we care about the signedness? This just downcasts to 32 bits.

TVM uses uint to store uint8x4 (see PrintType). With this check, the generated code reads uint x = (uint)y instead of uint x = (int)y. What do you think?

Can we keep it as is? I do not see benefits from this change. Otherwise the entire PR LGTM. Thanks!

I don't think it's necessary to revert this change if it's harmless. Given that CodeGenCUDA::PrintType generates "uint" for uint8x4, the change makes sense.
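
For reference, the effect under discussion can be sketched as follows. The quoted diff is cut off after the `if (op->dtype.is_uint())` line, so the branch bodies here are an assumption based on the reply above, and `broadcast_8x4_literal` is a hypothetical name rather than a TVM function.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical sketch of the 8-bit x4 broadcast literal. The cast prefix
// chosen for the unsigned case is assumed, not copied from the PR.
inline std::string broadcast_8x4_literal(std::int64_t value, bool is_uint) {
  std::int64_t v = value & 0xFF;             // keep the low byte only
  v = (v << 24) | (v << 16) | (v << 8) | v;  // replicate it into four bytes
  assert(v >= 0 && v <= 0xFFFFFFFFLL);       // always fits in 32 bits
  std::ostringstream os;
  os << (is_uint ? "(uint)" : "(int)") << v;
  return os.str();
}
```

Both variants emit the same 32-bit pattern. Only the cast differs, which is why the choice mainly affects whether the literal matches the declared type of the variable.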

huochaitiantang · 2020-04-11T11:30:09Z

@vinx13 could you help to review the code? Thanks!

vinx13 · 2020-04-14T03:35:47Z

Thanks @huochaitiantang @wpan11nv this is merged

* Fix high-low bit bug in __pack_half2
* Fix vector load
* Add unit8 support for PrintVecElemLoadExpr and BroadcastNode

tqchen assigned vinx13 Apr 4, 2020

wpan11nv reviewed Apr 6, 2020

huochaitiantang added 3 commits April 7, 2020 14:42

Fix high-low bit bug in __pack_half2

1558438

Fix vector load

ad2a944

Add unit8 support for PrintVecElemLoadExpr and BroadcastNode

78c1b62

wpan11nv reviewed Apr 7, 2020

wpan11nv approved these changes Apr 10, 2020

vinx13 approved these changes Apr 14, 2020

vinx13 merged commit d2e58ad into apache:master Apr 14, 2020

vinx13 added the status: accepted label Apr 14, 2020

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Apr 16, 2020

[CODEGEN][CUDA] Fix vector load (apache#5226)

f8718a1

zhiics pushed a commit to neo-ai/tvm that referenced this pull request Apr 17, 2020

[CODEGEN][CUDA] Fix vector load (apache#5226)

2498b62

dpankratz pushed a commit to dpankratz/incubator-tvm that referenced this pull request Apr 24, 2020

[CODEGEN][CUDA] Fix vector load (apache#5226)

9c52ac7

ZihengJiang mentioned this pull request Sep 25, 2020

TVM v0.7 Release Note Candidate #6486

Closed
