The C++ matmul kernels (aie2/mm.cc, aie2p/mm.cc) already have vectorized INT8
matmul templates (i8→i8, i8→i16, i8→i32, MAC shape 8x8x8) and compile flags
(-Di8_i8_ONLY, etc.), but the Python GEMM operator only accepts bf16 input.
This would wire up INT8 through the Python layer:
- design.py: add "i8" to dtype_in, "i8"/"i16"/"i32" to dtype_out,
add i8 MAC dims (8,8,8) to microkernel_mac_dim_map
- op.py: add i8 kernel flags, min tile sizes, skip bf16 emulation for int8
- reference.py: int8 golden reference
- test.py: int8 test cases
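A minimal sketch of what the design.py and reference.py pieces could look like. The dict entries and the `golden_matmul_i8` helper name are hypothetical (the existing bf16 MAC dims shown are an assumption, not taken from the source); the INT8 accumulation-in-int32-then-saturate behavior is also an assumption about what the kernels produce for narrow outputs:

```python
import numpy as np

# Hypothetical design.py addition: MAC dimensions keyed by input dtype.
# The (8, 8, 8) entry mirrors the INT8 MAC shape in aie2/mm.cc;
# the bf16 entry here is illustrative, not the actual existing value.
microkernel_mac_dim_map = {
    "bf16": (4, 8, 4),  # assumption: placeholder for the existing bf16 entry
    "i8": (8, 8, 8),    # new: INT8 MAC shape 8x8x8
}

def golden_matmul_i8(a: np.ndarray, b: np.ndarray,
                     dtype_out=np.int32) -> np.ndarray:
    """Hypothetical int8 golden reference for reference.py:
    accumulate in int32, then saturate when casting to a narrower
    output dtype (i8/i16)."""
    acc = a.astype(np.int32) @ b.astype(np.int32)
    if dtype_out == np.int32:
        return acc
    info = np.iinfo(dtype_out)
    return np.clip(acc, info.min, info.max).astype(dtype_out)
```

Whether the hardware saturates or wraps on the i8→i8 path would need to be checked against the kernel's output before locking in the reference.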
The NPU delivers roughly 50 TOPS of INT8 compute vs. ~3-5 TOPS for bf16, so
this would be a large throughput gain for quantized inference.
If I'm not missing anything, this should be straightforward; I'm already working on it.