The C++ matmul kernels (aie2/mm.cc, aie2p/mm.cc) already have vectorized INT8
matmul templates (i8→i8, i8→i16, i8→i32, MAC shape 8x8x8) and compile flags
(-Di8_i8_ONLY, etc.), but the Python GEMM operator only accepts bf16 input.
This would wire up INT8 through the Python layer:
- design.py: add "i8" to dtype_in, "i8"/"i16"/"i32" to dtype_out,
add i8 MAC dims (8,8,8) to microkernel_mac_dim_map
- op.py: add i8 kernel flags, min tile sizes, skip bf16 emulation for int8
- reference.py: int8 golden reference
- test.py: int8 test cases
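A minimal sketch of what the design.py and reference.py pieces could look like. The dict entries and the `golden_matmul_i8` helper name are hypothetical (the existing bf16 MAC dims shown are an assumption, not taken from the source); the INT8 accumulation-in-int32-then-saturate behavior is also an assumption about what the kernels produce for narrow outputs:

```python
import numpy as np

# Hypothetical design.py addition: MAC dimensions keyed by input dtype.
# The (8, 8, 8) entry mirrors the INT8 MAC shape in aie2/mm.cc;
# the bf16 entry here is illustrative, not the actual existing value.
microkernel_mac_dim_map = {
    "bf16": (4, 8, 4),  # assumption: placeholder for the existing bf16 entry
    "i8": (8, 8, 8),    # new: INT8 MAC shape 8x8x8
}

def golden_matmul_i8(a: np.ndarray, b: np.ndarray,
                     dtype_out=np.int32) -> np.ndarray:
    """Hypothetical int8 golden reference for reference.py:
    accumulate in int32, then saturate when casting to a narrower
    output dtype (i8/i16)."""
    acc = a.astype(np.int32) @ b.astype(np.int32)
    if dtype_out == np.int32:
        return acc
    info = np.iinfo(dtype_out)
    return np.clip(acc, info.min, info.max).astype(dtype_out)
```

Whether the hardware saturates or wraps on the i8→i8 path would need to be checked against the kernel's output before locking in the reference.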
The NPU delivers roughly 50 TOPS of INT8 compute vs. ~3-5 TOPS for bf16, so
this would be a large throughput gain for quantized inference.
If I'm not missing anything, this should be straightforward; I'm already working on it.