[TOPI][TIR][TE][x86] Extend x86 SIMD (u)int8 coverage for dense & conv2d #15918

Draft · cbalint13 wants to merge 1 commit into main from the x86-simd branch

Conversation

@cbalint13 (Contributor) commented Oct 11, 2023

This PR enhances x86 SIMD (u)int8 coverage for the dense and conv2d operators.
It extends current SIMD support with avx2 & ssse3, and adds a new set of non-overflowing SIMD methods.


Tracker:

  • PR for TIR call_{pure}_llvm_intrin pretty print.
  • PR for new TIR intrinsics
  • PR for ArrayIntImm node

This PR will hold only the TOPI part.


Changes:

[x86][TOPI]

  • Extends the current tensorizer set (now called the fast-math, overflowing one) with avx2 and ssse3.
  • Adds a new "precision" (non-overflowing) tensorizer set supporting avx512, avx2 and ssse3.
  • Enables proper operator data alignment to guarantee that any data (even the smallest) fits the SIMD vector width (see the small sketch below).
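
As a rough illustration of the alignment point (a minimal sketch, not code from this patch; the helper name is made up), padding a packed dimension up to the vector width looks like:

# Hypothetical helper, only illustrating the round-up; not part of this PR.
def pad_to_simd_width(k: int, vec_width: int = 16) -> int:
    """Round k up to a multiple of the SIMD vector width (in lanes)."""
    return ((k + vec_width - 1) // vec_width) * vec_width

assert pad_to_simd_width(3) == 16   # a tiny axis still fills one int8x16 vector
assert pad_to_simd_width(17) == 32  # anything larger spills into the next vector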

[TIR][LLVM]

  • Adds new TIR ops mapped to LLVM intrinsics: zextend, sextend, truncate for type conversions (a small usage sketch follows this list).
  • Adds new TIR ops mapped to LLVM intrinsics: vectorpermute, vectorshuffle for vector manipulation.
  • call_llvm_pure_intrin & call_llvm_intrin now hold the instruction name as a StringImm instead of an abstract IntImm.
  • Enables the TIR op atomic_add, mapped to the proper LLVM intrinsic and guaranteed (best-effort) to lower to a single instruction.
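
A minimal sketch of how these compose from Python, assuming this PR's new tir.zextend / tir.sextend ops are available; the intrinsic-call form mirrors the lowered samples further below, and the operand values here are arbitrary placeholders:

from tvm import tir

# Arbitrary placeholder vectors, just to make the expression well-formed.
data_u8x16 = tir.Broadcast(tir.IntImm("uint8", 1), 16)
weight_s8x16 = tir.Broadcast(tir.IntImm("int8", -1), 16)

# Widen the 8-bit lanes to 16 bits (new TIR ops proposed by this PR).
wide_u = tir.zextend("int16x16", data_u8x16)
wide_s = tir.sextend("int16x16", weight_s8x16)

# Multiply-add the low halves into int32x4 via the pmaddwd intrinsic,
# now referenced by its StringImm name.
acc_lo = tir.call_llvm_pure_intrin(
    "int32x4", "llvm.x86.sse2.pmadd.wd", tir.IntImm("uint32", 2),
    tir.vectorlow("void", wide_u), tir.vectorlow("void", wide_s))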

[TE]

  • Introduces a new ArrayIntImm expression for small immediate lists of integer constants.

[Target]

  • Introduces a flag -keys=cpu,fast-math to switch from the precise SIMD set (default) to the overflowing one (usage sketched below).
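
A minimal usage sketch (plain Target API; the key itself is the addition proposed here, as also seen in the benchmark log below):

import tvm

# Default: precise, non-overflowing SIMD tensorizers.
target_precise = tvm.target.Target("llvm -mcpu=haswell")

# Opt-in: overflowing fast-math tensorizers, safe only within the documented
# data/weight ranges (see the Notes section).
target_fast = tvm.target.Target("llvm -mcpu=haswell -keys=cpu,fast-math")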

Performance

For the new avx2 & ssse3 support, comparing the fast vs. precise SIMD sets:

$ python3 tests/python/contrib/test_gemm_acc32_simd.py
Task tensorized: {True } [llvm -mcpu=ivybridge                     ], running time: 3.655 ms, 587.58 Gops/s
Task tensorized: {True } [llvm -mcpu=ivybridge -keys=cpu,fast-math ], running time: 3.678 ms, 583.86 Gops/s
Task tensorized: {True } [llvm -mcpu=haswell                       ], running time: 3.708 ms, 579.09 Gops/s
Task tensorized: {True } [llvm -mcpu=haswell -keys=cpu,fast-math   ], running time: 3.668 ms, 585.52 Gops/s
Task tensorized: {False} [llvm -mcpu=ivybridge                     ], running time: 41.152 ms, 52.18 Gops/s
Task tensorized: {False} [llvm -mcpu=haswell                       ], running time: 41.194 ms, 52.13 Gops/s

Notes

  • Precision (non-fast-math) mode is now the default.
  • x86 amx and vnni schedules remain unchanged; their specific intrinsics never overflow.
  • zextend, sextend and truncate lower on x86 into single specialized instructions, e.g. punpcklwd & punpckhwd.
  • vectorpermute and vectorshuffle also lower on x86 into the appropriate single specialized instruction.
  • ArrayIntImm is for the new ops: tir.vectorpermute("int32x8", whatever_vector, [0, 1, 4, 5, 2, 3, 6, 7])
  • The fast-math mode will always warn the user (a small arithmetic illustration follows this list):
    Using `fast-math` may overflow, make sure ranges for either data is [0,128] or weight is [-64,+64]
  • The TIR printer (for debugging purposes) will now list the actual name of the instruction, not an abstract IntImm:
    {...} T.call_llvm_pure_intrin("int32x4", "llvm.x86.sse2.pmadd.wd", T.uint32(2) {....}
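
For intuition on the overflow warning above, a plain arithmetic check (not code from the patch): the fast-math path's vpmaddubsw multiplies pairs of u8 and s8 lanes and sums adjacent products into a saturating int16.

INT16_MAX = 32767

# Unconstrained worst case: two u8*s8 products summed into one int16 lane.
worst = 255 * 127 + 255 * 127        # 64770 -> saturates, wrong result
# Data limited to [0, 128] keeps the pair sum in range ...
safe_data = 128 * 127 + 128 * 127    # 32512 <= 32767
# ... as does limiting weights to [-64, +64].
safe_weight = 255 * 64 + 255 * 64    # 32640 <= 32767
assert worst > INT16_MAX and max(safe_data, safe_weight) <= INT16_MAX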

Samples

Lowering results for the ssse3 case.

The precise one:

@I.ir_module
class Module:
    @T.prim_func
    def tvmgen_default_fused_nn_contrib_dense_pack(
                    p0: T.Buffer((4, 4), "uint8"), 
                    p1: T.Buffer((1, 1, 4, 4), "int8"),
                    compute: T.Buffer((4, 4), "int32")):
        T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
        for i_inner in range(4):
            compute_1 = T.Buffer((16,), "int32", data=compute.data)
            compute_1[i_inner * 4:i_inner * 4 + 4] = T.Broadcast(0, 4)
            p0_1 = T.Buffer((16,), "uint8", data=p0.data)
            p1_1 = T.Buffer((16,), "int8", data=p1.data)
            compute_1[i_inner * 4:i_inner * 4 + 4] = 
              T.call_llvm_pure_intrin("int32x4", "llvm.x86.ssse3.phadd.d.128", T.uint32(2), 
                T.call_llvm_pure_intrin("int32x4", "llvm.x86.sse2.pmadd.wd", T.uint32(2), 
                  T.vectorlow("void", T.zextend("int16x16", 
                    T.reinterpret("int8x16", T.Broadcast(
                      T.reinterpret("int32", p0_1[i_inner * 4:i_inner * 4 + 4]), 4)))),
                        T.vectorlow("void", T.sextend("int16x16", p1_1[0:16]))),
                          T.call_llvm_pure_intrin("int32x4", "llvm.x86.sse2.pmadd.wd", T.uint32(2),
                            T.vectorhigh("void", T.zextend("int16x16", T.reinterpret("int8x16", 
                              T.Broadcast(T.reinterpret("int32", p0_1[i_inner * 4:i_inner * 4 + 4]), 4)))),
                                T.vectorhigh("void", T.sextend("int16x16", p1_1[0:16])))) 
                                  + compute_1[i_inner * 4:i_inner * 4 + 4]
0000000000001e90 <tvmgen_default_fused_nn_contrib_dense_pack_compute_>:
    1e90:	c4 e2 79 18 16       	vbroadcastss (%rsi),%xmm2
    1e95:	c5 f9 ef c0          	vpxor  %xmm0,%xmm0,%xmm0
    1e99:	c5 e9 68 d8          	vpunpckhbw %xmm0,%xmm2,%xmm3
    1e9d:	c4 e2 79 20 4a 08    	vpmovsxbw 0x8(%rdx),%xmm1
    1ea3:	c4 e2 79 30 e2       	vpmovzxbw %xmm2,%xmm4
    1ea8:	c4 e2 79 20 12       	vpmovsxbw (%rdx),%xmm2
    1ead:	c5 d9 f5 e2          	vpmaddwd %xmm2,%xmm4,%xmm4
    1eb1:	c5 e1 f5 d9          	vpmaddwd %xmm1,%xmm3,%xmm3
    1eb5:	c4 e2 59 02 eb       	vphaddd %xmm3,%xmm4,%xmm5
    {...}

define internal fastcc void @tvmgen_default_fused_nn_contrib_dense_pack_compute {
entry:
  {...}
  %3 = load i32, ptr %1, align 64, !tbaa !310
  %4 = insertelement <4 x i32> undef, i32 %3, i64 0
  %5 = bitcast <4 x i32> %4 to <16 x i8>
  %6 = shufflevector <16 x i8> %5, <16 x i8> poison, <16 x i32> 
           <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, 
            i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
  %7 = zext <16 x i8> %6 to <16 x i16>
  %8 = shufflevector <16 x i16> %7, <16 x i16> poison, <8 x i32>
           <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %9 = load <16 x i8>, ptr %2, align 64, !tbaa !312
  %10 = sext <16 x i8> %9 to <16 x i16>
  %11 = shufflevector <16 x i16> %10, <16 x i16> poison, <8 x i32>
             <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %12 = tail call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %8, <8 x i16> %11)
  %13 = shufflevector <16 x i16> %7, <16 x i16> poison, <8 x i32>
              <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %14 = shufflevector <16 x i16> %10, <16 x i16> poison, <8 x i32>
             <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %15 = tail call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %13, <8 x i16> %14)
  %16 = tail call <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32> %12, <4 x i32> %15)
  %17 = getelementptr inbounds i8, ptr %1, i64 4
  {...}

The fast-math one:

@I.ir_module
class Module:
    @T.prim_func
    def tvmgen_default_fused_nn_contrib_dense_pack(
                  p0: T.Buffer((4, 4), "uint8"), 
                  p1: T.Buffer((1, 1, 4, 4), "int8"), 
                  compute: T.Buffer((4, 4), "int32")):
        T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
        for i_inner in range(4):
            compute_1 = T.Buffer((16,), "int32", data=compute.data)
            compute_1[i_inner * 4:i_inner * 4 + 4] = T.Broadcast(0, 4)
            p0_1 = T.Buffer((16,), "uint8", data=p0.data)
            p1_1 = T.Buffer((16,), "int8", data=p1.data)
            compute_1[i_inner * 4:i_inner * 4 + 4] = 
              T.call_llvm_pure_intrin("int32x4", "llvm.x86.sse2.pmadd.wd", T.uint32(2), 
                T.call_llvm_pure_intrin("int16x8", "llvm.x86.ssse3.pmadd.ub.sw.128", T.uint32(2), 
                  T.reinterpret("int8x16", T.Broadcast(
                    T.reinterpret("int32", p0_1[i_inner * 4:i_inner * 4 + 4]), 4)), p1_1[0:16]), 
                      T.Broadcast(T.int16(1), 8)) + compute_1[i_inner * 4:i_inner * 4 + 4]
0000000000001e90 <tvmgen_default_fused_nn_contrib_dense_pack_compute_>:
    1e90:	c4 e2 79 18 06       	vbroadcastss (%rsi),%xmm0
    1e95:	c5 f9 6f 12          	vmovdqa (%rdx),%xmm2
    1e99:	c5 f9 6f 4a 10       	vmovdqa 0x10(%rdx),%xmm1
    1e9e:	c4 e2 79 04 da       	vpmaddubsw %xmm2,%xmm0,%xmm3
    1ea3:	c4 e2 79 18 05 e4 11 	vbroadcastss 0x11e4(%rip),%xmm0        # 3090 <_fini+0x620>
    1eaa:	00 00 
    1eac:	c5 e1 f5 d8          	vpmaddwd %xmm0,%xmm3,%xmm3

define internal fastcc void @tvmgen_default_fused_nn_contrib_dense_pack_compute_ {
entry:
  {...}
  %3 = load i32, ptr %1, align 64, !tbaa !310
  %4 = insertelement <4 x i32> undef, i32 %3, i64 0
  %5 = bitcast <4 x i32> %4 to <16 x i8>
  %6 = shufflevector <16 x i8> %5, <16 x i8> poison, <16 x i32>
            <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3,
            i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
  %7 = load <16 x i8>, ptr %2, align 64, !tbaa !312
  %8 = tail call <8 x i16> @llvm.x86.ssse3.pmadd.ub.sw.128(<16 x i8> %6, <16 x i8> %7)
  %9 = tail call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %8, <8 x i16> 
            <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>)
  %10 = getelementptr inbounds i8, ptr %1, i64 4
  {...}

Credits

There is a compact full x86 SIMD table guide here.
This work follows some suggestions from Intel's oneDNN int8 compute notes.

Next

(WIP) This work will be extended to metaschedule auto-tensorization.
(WIP) Will try to enable int4 (not native) using the best possible SIMD bit manipulation.


Cc: @masahi , @anijain2305, @jianyuh, @Qianshui-Jiang, @kparzysz-quic , @junrushao , @tqchen , @elvin-n , @vvchernov , @echuraev

@cbalint13 force-pushed the x86-simd branch 15 times, most recently from dc1b629 to fe30368 on October 13, 2023 10:22
@cbalint13 marked this pull request as ready for review on October 13, 2023 21:41
@ekalda (Contributor) commented Oct 16, 2023

Very interesting PR!

Introduce new ArrayIntImm expression for small immediate list of integer constants.

Can you not use AllocateConst? It's a bit awkward to introduce a whole new TIR node to support a small set of intrinsics.

The call_llvm_pure_intrin & call_llvm_intrin now holds instruction StringImm instead of IntImm abstract.

I like this a lot, would make the TIR much easier to reason about!

@cbalint13 (Contributor, Author) commented Oct 16, 2023

Very interesting PR!

Thank you @ekalda !

Introduce new ArrayIntImm expression for small immediate list of integer constants.

  • The main issue is that within LLVM some x86 (and other arches') instructions are not exposed at all.
  • So I had to look into adding zextend, sextend, truncate plus vectorpermute, vectorshuffle instead.

The good point is that these lower to exactly what is needed (even a single, optimized instruction) for the target arch (x86 here).

Can you not use AllocateConst? It's a bit awkward to introduce a whole new TIR node to support a small set of intrinsics.

Hmm no, AllocateConst looked a bit too complicated to me (not so "immediate").

  • There are already {Float,Int,String}Imm, so why not ArrayIntImm?
  • It is needed for simple arguments of TIR functions, not for representing real data (external data, buffer allocations).
  • See the simple usage from python: tir.vectorpermute("int32x8", whatever_vector, [0, 1, 4, 5, 2, 3, 6, 7])
  • See the lowered IR (llvm -mcpu=haswell) as {..} T.ArrayIntImm([0, 1, 4, 5, 2, 3, 6, 7]) {...}:
@I.ir_module
class Module:
    @T.prim_func
    def tvmgen_default_fused_nn_contrib_dense_pack(
           p0: T.Buffer((4, 8), "uint8"), 
           p1: T.Buffer((1, 2, 8, 4), "int8"), 
           compute: T.Buffer((4, 8), "int32")):
        T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
        for i_inner in range(4):
          compute_1 = T.Buffer((32,), "int32", data=compute.data)
          compute_1[i_inner * 8:i_inner * 8 + 8] = T.Broadcast(0, 8)
          for k_outer in range(2):
            p0_1 = T.Buffer((32,), "uint8", data=p0.data)
            p1_1 = T.Buffer((64,), "int8", data=p1.data)
            compute_1[i_inner * 8:i_inner * 8 + 8] =
            T.vectorpermute("int32x8", 
            T.call_llvm_pure_intrin("int32x8", "llvm.x86.avx2.phadd.d", T.uint32(2),
            T.call_llvm_pure_intrin("int32x8", "llvm.x86.avx2.pmadd.wd",  T.uint32(2),
            T.vectorlow("void", T.zextend("int16x32", T.reinterpret("int8x32", T.Broadcast(
            T.reinterpret("int32", p0_1[i_inner * 8 + k_outer * 4:i_inner * 8 + k_outer * 4 + 4]), 8)))), 
            T.vectorlow("void", T.sextend("int16x32", p1_1[k_outer * 32:k_outer * 32 + 32]))), 
            T.call_llvm_pure_intrin("int32x8", "llvm.x86.avx2.pmadd.wd", T.uint32(2),
            T.vectorhigh("void", 
            T.zextend("int16x32", T.reinterpret("int8x32",  T.Broadcast(
            T.reinterpret("int32", p0_1[i_inner * 8 + k_outer * 4:i_inner * 8 + k_outer * 4 + 4]), 8)))), 
            T.vectorhigh("void", 
            T.sextend("int16x32", p1_1[k_outer * 32:k_outer * 32 + 32])))), 
            T.ArrayIntImm([0, 1, 4, 5, 2, 3, 6, 7])) + compute_1[i_inner * 8:i_inner * 8 + 8]

The call_llvm_pure_intrin & call_llvm_intrin now holds instruction StringImm instead of IntImm abstract.

I like this a lot, would make the TIR much easier to reason about!

@ekalda

The work here (x86) is a pseudo kind of "scalable vector": having _m128, _m256, _m512, but as "hand-unrolled" ones.
I also follow your RFC related to scalable vectors; I am interested in similar ideas for the riscv64 "v" extension.

@ekalda (Contributor) commented Oct 16, 2023

  • So, I had to look into adding zextend, sextend, truncate plus the vectorpermute, vectorshuffle instead.

The good point is that these are lowered to exactly what is needed (even single insn, optim) for the target arch (x86 here).

Yes, all of these intrinsics are architecture independent in LLVM, so this is a great addition from the point of view of other backends as well.

Regarding using AllocateConst, I agree that it is designed (and named) with the intent of holding larger chunks of runtime data, but it is still essentially a TIR container for constants that you can query in the LLVM codegen.

  • There are already the {Float,Int,String}Imm so why not ArrayIntImm.

{Float,Int,String}Imm are not duplicating functionality :)

  • See the simple usage from python: tir.vectorpermute("int32x8", whatever_vector, [0, 1, 4, 5, 2, 3, 6, 7])
idx_vec = T.allocate_const([0, 1, 4, 5, 2, 3, 6, 7], "int32")
tir.vectorpermute("int32x8", whatever_vector, idx_vec)

is not more complex in my opinion :) I appreciate the convenience of printing the index array inplace in TIR though.

In general, I won't argue against this change if there is a wider consensus that this is a necessary addition. However, I think we have to think carefully about adding another mechanism into TIR for representing data arrays that is opaque to the memory planning. From what I can see, there are no restrictions on the size of the data it can hold, so it's rather susceptible to misuse.

I also follow your RFC related to scalable vectors, I am interested in similar ideas for the riscv64 "v" extension.

That's cool! Yes, I hope we can come up with a design that is going to work for all the scalable vector architectures out there. Feel free to chip in with your thoughts there!

@cbalint13 (Contributor, Author) commented:

idx_vec = T.allocate_const([0, 1, 4, 5, 2, 3, 6, 7], "int32")
tir.vectorpermute("int32x8", whatever_vector, idx_vec)

is not more complex in my opinion :) I appreciate the convenience of printing the index array inplace in TIR though.

Hmm ... that's really short!
I don't know why I found AllocateConst (and stayed convinced) to be something complicated for "immediate" use.

buffer_var = tir.Var("v", tvm.ir.PointerType(tvm.ir.PrimType("int32")))
ir_expected = tir.AllocateConst(
    buffer_var,
    "int32",
    [10],
    ndarray.array(np.asarray(data, "int32")),
    tir.Evaluate(1),
    annotations={},
)

I am also attracted to passing a plain, simple Python list().

In general, I won't argue against this change if there is a wider consensus that this is a necessary addition. However, I think we have to think carefully about adding another mechanism into TIR for representing data arrays that is opaque to the memory planning. From what I can see, there are no restrictions to the size of the data it can hold, so it's rather susceptible to misuse.

On the misuse part, yes, I also agree, but my understanding of AllocateConst is that it inherits many properties of buffers, while ArrayIntImm is a more restrained kind of object, treatable only as a plain "constant" within TIR.

Well, I will reconsider AllocateConst if there is no consensus.
Thanks for pointing this out with a succinct example!

@cbalint13 (Contributor, Author) commented Oct 16, 2023

idx_vec = T.allocate_const([0, 1, 4, 5, 2, 3, 6, 7], "int32")
tir.vectorpermute("int32x8", whatever_vector, idx_vec)

is not more complex in my opinion :) I appreciate the convenience of printing the index array inplace in TIR though.

Hmm ... that's really short! I don't know why I found AllocateConst (and stayed convinced) to be something complicated for "immediate" use.

I add these few lines here to show some more contrasting aspects, pro and con, of ArrayIntImm vs. AllocateConst.

  • Tried to stick with AllocateConst; by default it does not come cheap:
        idx_vec = tvm.tir.allocate_const([0, 1, 4, 5, 2, 3, 6, 7], "int32")
                  ^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: module 'tvm.tir' has no attribute 'allocate_const'. Did you mean: 'AllocateConst'?
    
    https://tvm.apache.org/docs/reference/api/python/tir.html#tvm.tir.AllocateConst
      idx_vec = tvm.tir.AllocateConst([0, 1, 4, 5, 2, 3, 6, 7], "int32")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    TypeError: AllocateConst.__init__() missing 3 required 
     positional arguments: 'extents', 'data_or_idx', and 'body'
    
  • It could be "aliased" or similar to get the simple form: tir.allocate_const([0, 1, 4, 5, 2, 3, 6, 7], "int32")
  • I also remembered a remark about the "not yet exposed" status of tir.allocate_const at this point in the codebase.

I remain open to using AllocateConst if there is no consensus on ArrayIntImm.

@ekalda (Contributor) commented Oct 19, 2023

Sorry for the delay on this, I was in training for two days -

  • Tried to stick with AllocateConst, by default it does not come at a cheap:

Ah oops, I had quoted the TVMScript interface there and also didn't realise all of these arguments are required. I suppose they could be made optional if that would make AllocateConst usable for a wider range of use cases.

  • Also remembered a remark regarding "the not exposed yet status" of tir.allocate_const at this point in the codebase.

There's an in-flight patch where @Lunderberg has done a great job integrating AllocateConst into the microTVM implementation (it just keeps failing the CI 🙈).

Also, it seems to me that there are a few core compiler changes in this patch that are needed for the new TOPI schedules, but are in essence target-independent changes that would warrant separate discussions. How do you feel about breaking this patch up into smaller patches? From eyeballing the patch, e.g.

  1. Changes to call_llvm_intrin and call_llvm_pure_intrin
  2. ArrayIntImm node
  3. New TIR intrinsics
  4. TOPI schedules

@cbalint13 (Contributor, Author) commented:

There's an in flight patch where @Lunderberg has done a great job in integrating the AllocateConst into microTVM implementation (it just keeps failing the CI 🙈).

Also, it seems to me that there are a few core compiler changes in this patch that are needed for the new TOPI schedules, but are in essence target independent changes that would warrant separate discussions.

I see now, thanks for pointing this out!

How do you feel about breaking this patch up into smaller patches? From eyeballing the patch, e.g.

  1. Changes to call_llvm_intrin and call_llvm_pure_intrin
  2. ArrayIntImm node
  3. New TIR intrinsics
  4. TOPI schedules

That was exactly what I was thinking, so I will split this up.
I created a tracking checklist in the first comment of this PR.

@Lunderberg (Contributor) commented:

Thank you for pointing out #15300. It's been on the back-burner for quite a while, as a precursor to (hopefully/finally) landing #14985. If anybody is more familiar with the tvm.contrib.ethosu implementation, any help resolving the breakages would be appreciated.

@cbalint13 marked this pull request as draft on October 24, 2023 00:16
ekalda pushed a commit that referenced this pull request on Oct 25, 2023
This allows printing of the real LLVM function name in the TIR printer.
Prior to this, a counter-intuitive T.int32() value was printed instead of the real name.
Changes

Before: T.call_llvm_pure_intrin("int32x4", T.uint32(62), T.uint32(0))
After: T.call_llvm_pure_intrin("int32x4", "llvm.donothing", T.uint32(0))

This is part of #15918 .