[GEMM codegen] Distribute Shared memory copy #303

Merged
@XG-zheng merged 15 commits into bytedance:main on Jun 20, 2024

Conversation

@Xinyu302 (Contributor) commented on Jun 2, 2024

Precondition: Check whether the computation can be vectorized. If it cannot, fall back to the non-vectorized algorithm.

  0. Convert linalg.copy to linalg.generic.

  1. Perform tiling. Let N be the total number of elements to copy and V the number of elements one vectorized access handles. Each tile covers V × (number of threads) elements, which is 8 × 128 in the current case, so the copy loop runs N / (V × number of threads) times. Once the thread block finishes a tile, it moves on to the next one (see the sketch after this list).

  2. Within each tile, distribute across threads: introduce threadIdx.x and threadIdx.y, compute the flattened 1D thread coordinate, and assign each thread a 1x8 slice to copy.

  3. Rewrite the vectorized copies as transfer_read and transfer_write operations.

  4. Unroll the outer loop.
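
A rough sanity check of the arithmetic above, as a minimal Python sketch (not part of the pass; the vector width, block size, and buffer shapes are taken from the IR dumps below):

# Copy-distribution arithmetic, assuming the shapes used in this PR's test case:
# block size [64, 2, 1] gives 128 threads, and one vectorized access moves 8 f16 elements.
VEC_WIDTH = 8
NUM_THREADS = 64 * 2 * 1  # __byteir_gemm_block_size__ = [64, 2, 1]

def vector_copies_per_thread(rows, cols):
    """Number of 1x8 vector copies each thread issues for one shared-memory buffer."""
    return (rows * cols) // (VEC_WIDTH * NUM_THREADS)

print(vector_copies_per_thread(128, 32))   # matrix A buffer  -> 4
print(vector_copies_per_thread(32, 128))   # matrix B buffer  -> 4
print(vector_copies_per_thread(128, 128))  # accumulator C    -> 16

These counts match the unrolled result below: four transfer_read/transfer_write pairs each for the A and B copies inside the K loop, and sixteen for the epilogue copy of the accumulator.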

@Xinyu302 marked this pull request as ready for review on June 4, 2024, 09:09
@Xinyu302 (Contributor, Author) commented on Jun 5, 2024

RUN: byteir-opt -gpu-distribute-shared-memory-copy gpu-distribute-shared-memory-copy.mlir --cse --canonicalize --fold-memref-alias-ops --canonicalize --cse

Result:

#map = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 16 + d3 * 32 + d1 floordiv 4)>
#map1 = affine_map<(d0)[s0] -> (d0 * 8 + s0 - (d0 floordiv 4) * 32)>
#map2 = affine_map<(d0, d1, d2) -> (d1 * 16 + d2 * 32 + d0 floordiv 4)>
#map3 = affine_map<(d0) -> (d0 * 8 - (d0 floordiv 4) * 32)>
#map4 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 16 + d3 * 32 + d1 floordiv 4 + 32)>
#map5 = affine_map<(d0, d1, d2) -> (d1 * 16 + d2 * 32 + d0 floordiv 4 + 32)>
#map6 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 16 + d3 * 32 + d1 floordiv 4 + 64)>
#map7 = affine_map<(d0, d1, d2) -> (d1 * 16 + d2 * 32 + d0 floordiv 4 + 64)>
#map8 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 16 + d3 * 32 + d1 floordiv 4 + 96)>
#map9 = affine_map<(d0, d1, d2) -> (d1 * 16 + d2 * 32 + d0 floordiv 4 + 96)>
#map10 = affine_map<(d0, d1, d2)[s0] -> (d1 * 4 + d2 * 8 + s0 + d0 floordiv 16)>
#map11 = affine_map<(d0, d1) -> (d0 * 128 + d1 * 8 - (d1 floordiv 16) * 128)>
#map12 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>
#map13 = affine_map<(d0) -> (d0 * 8 - (d0 floordiv 16) * 128)>
#map14 = affine_map<(d0, d1, d2)[s0] -> (d1 * 4 + d2 * 8 + s0 + d0 floordiv 16 + 8)>
#map15 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 8)>
#map16 = affine_map<(d0, d1, d2)[s0] -> (d1 * 4 + d2 * 8 + s0 + d0 floordiv 16 + 16)>
#map17 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 16)>
#map18 = affine_map<(d0, d1, d2)[s0] -> (d1 * 4 + d2 * 8 + s0 + d0 floordiv 16 + 24)>
#map19 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 24)>
#map20 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16)>
#map21 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 8)>
#map22 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 16)>
#map23 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 24)>
#map24 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 32)>
#map25 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 32)>
#map26 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 40)>
#map27 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 40)>
#map28 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 48)>
#map29 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 48)>
#map30 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 56)>
#map31 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 56)>
#map32 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 64)>
#map33 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 64)>
#map34 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 72)>
#map35 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 72)>
#map36 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 80)>
#map37 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 80)>
#map38 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 88)>
#map39 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 88)>
#map40 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 96)>
#map41 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 96)>
#map42 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 104)>
#map43 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 104)>
#map44 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 112)>
#map45 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 112)>
#map46 = affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16 + 120)>
#map47 = affine_map<(d0, d1, d2, d3) -> (d0 * 128 + d2 * 4 + d3 * 8 + d1 floordiv 16 + 120)>
module {
  func.func private @Unknown0(%arg0: memref<5376x2048xf16>, %arg1: memref<2048x5376xf16>) -> memref<5376x5376xf16> attributes {__byteir_gemm_block_size__ = [64, 2, 1], __byteir_gemm_pipeline_depth__ = 3 : i64, __byteir_gemm_tile_config__ = [128, 128, 32], __byteir_matmul_epilogue_fusion__} {
    %cst = arith.constant 0.000000e+00 : f16
    %c0 = arith.constant 0 : index
    %c2048 = arith.constant 2048 : index
    %c32 = arith.constant 32 : index
    %alloc = memref.alloc() : memref<5376x5376xf16>
    scf.forall (%arg2, %arg3) in (42, 42) {
      %0 = gpu.thread_id  x
      %1 = gpu.thread_id  y
      %2 = gpu.thread_id  z
      %alloca = memref.alloca() {__byteir_alloca_accumulator__} : memref<128x128xf16, #gpu.address_space<workgroup>>
      %alloca_0 = memref.alloca() {__byteir_alloca_matrix_b__} : memref<32x128xf16, #gpu.address_space<workgroup>>
      %alloca_1 = memref.alloca() {__byteir_alloca_matrix_a__} : memref<128x32xf16, #gpu.address_space<workgroup>>
      linalg.fill ins(%cst : f16) outs(%alloca : memref<128x128xf16, #gpu.address_space<workgroup>>)
      scf.for %arg4 = %c0 to %c2048 step %c32 {
        %53 = affine.apply #map(%arg2, %0, %1, %2)
        %54 = affine.apply #map1(%0)[%arg4]
        %55 = vector.transfer_read %arg0[%53, %54], %cst {in_bounds = [true, true]} : memref<5376x2048xf16>, vector<1x8xf16>
        %56 = affine.apply #map2(%0, %1, %2)
        %57 = affine.apply #map3(%0)
        vector.transfer_write %55, %alloca_1[%56, %57] {in_bounds = [true, true]} : vector<1x8xf16>, memref<128x32xf16, #gpu.address_space<workgroup>>
        %58 = affine.apply #map4(%arg2, %0, %1, %2)
        %59 = vector.transfer_read %arg0[%58, %54], %cst {in_bounds = [true, true]} : memref<5376x2048xf16>, vector<1x8xf16>
        %60 = affine.apply #map5(%0, %1, %2)
        vector.transfer_write %59, %alloca_1[%60, %57] {in_bounds = [true, true]} : vector<1x8xf16>, memref<128x32xf16, #gpu.address_space<workgroup>>
        %61 = affine.apply #map6(%arg2, %0, %1, %2)
        %62 = vector.transfer_read %arg0[%61, %54], %cst {in_bounds = [true, true]} : memref<5376x2048xf16>, vector<1x8xf16>
        %63 = affine.apply #map7(%0, %1, %2)
        vector.transfer_write %62, %alloca_1[%63, %57] {in_bounds = [true, true]} : vector<1x8xf16>, memref<128x32xf16, #gpu.address_space<workgroup>>
        %64 = affine.apply #map8(%arg2, %0, %1, %2)
        %65 = vector.transfer_read %arg0[%64, %54], %cst {in_bounds = [true, true]} : memref<5376x2048xf16>, vector<1x8xf16>
        %66 = affine.apply #map9(%0, %1, %2)
        vector.transfer_write %65, %alloca_1[%66, %57] {in_bounds = [true, true]} : vector<1x8xf16>, memref<128x32xf16, #gpu.address_space<workgroup>>
        %67 = affine.apply #map10(%0, %1, %2)[%arg4]
        %68 = affine.apply #map11(%arg3, %0)
        %69 = vector.transfer_read %arg1[%67, %68], %cst {in_bounds = [true, true]} : memref<2048x5376xf16>, vector<1x8xf16>
        %70 = affine.apply #map12(%0, %1, %2)
        %71 = affine.apply #map13(%0)
        vector.transfer_write %69, %alloca_0[%70, %71] {in_bounds = [true, true]} : vector<1x8xf16>, memref<32x128xf16, #gpu.address_space<workgroup>>
        %72 = affine.apply #map14(%0, %1, %2)[%arg4]
        %73 = vector.transfer_read %arg1[%72, %68], %cst {in_bounds = [true, true]} : memref<2048x5376xf16>, vector<1x8xf16>
        %74 = affine.apply #map15(%0, %1, %2)
        vector.transfer_write %73, %alloca_0[%74, %71] {in_bounds = [true, true]} : vector<1x8xf16>, memref<32x128xf16, #gpu.address_space<workgroup>>
        %75 = affine.apply #map16(%0, %1, %2)[%arg4]
        %76 = vector.transfer_read %arg1[%75, %68], %cst {in_bounds = [true, true]} : memref<2048x5376xf16>, vector<1x8xf16>
        %77 = affine.apply #map17(%0, %1, %2)
        vector.transfer_write %76, %alloca_0[%77, %71] {in_bounds = [true, true]} : vector<1x8xf16>, memref<32x128xf16, #gpu.address_space<workgroup>>
        %78 = affine.apply #map18(%0, %1, %2)[%arg4]
        %79 = vector.transfer_read %arg1[%78, %68], %cst {in_bounds = [true, true]} : memref<2048x5376xf16>, vector<1x8xf16>
        %80 = affine.apply #map19(%0, %1, %2)
        vector.transfer_write %79, %alloca_0[%80, %71] {in_bounds = [true, true]} : vector<1x8xf16>, memref<32x128xf16, #gpu.address_space<workgroup>>
        linalg.matmul {__byteir_gpu_tile_gemm_0, __byteir_mma__, __byteir_mma_level__ = "Threadblock", __byteir_target__ = "nv_sm_80"} ins(%alloca_1, %alloca_0 : memref<128x32xf16, #gpu.address_space<workgroup>>, memref<32x128xf16, #gpu.address_space<workgroup>>) outs(%alloca : memref<128x128xf16, #gpu.address_space<workgroup>>)
      }
      %3 = affine.apply #map12(%0, %1, %2)
      %4 = affine.apply #map13(%0)
      %5 = vector.transfer_read %alloca[%3, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %6 = affine.apply #map20(%arg2, %0, %1, %2)
      %7 = affine.apply #map11(%arg3, %0)
      vector.transfer_write %5, %alloc[%6, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %8 = affine.apply #map15(%0, %1, %2)
      %9 = vector.transfer_read %alloca[%8, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %10 = affine.apply #map21(%arg2, %0, %1, %2)
      vector.transfer_write %9, %alloc[%10, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %11 = affine.apply #map17(%0, %1, %2)
      %12 = vector.transfer_read %alloca[%11, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %13 = affine.apply #map22(%arg2, %0, %1, %2)
      vector.transfer_write %12, %alloc[%13, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %14 = affine.apply #map19(%0, %1, %2)
      %15 = vector.transfer_read %alloca[%14, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %16 = affine.apply #map23(%arg2, %0, %1, %2)
      vector.transfer_write %15, %alloc[%16, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %17 = affine.apply #map24(%0, %1, %2)
      %18 = vector.transfer_read %alloca[%17, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %19 = affine.apply #map25(%arg2, %0, %1, %2)
      vector.transfer_write %18, %alloc[%19, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %20 = affine.apply #map26(%0, %1, %2)
      %21 = vector.transfer_read %alloca[%20, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %22 = affine.apply #map27(%arg2, %0, %1, %2)
      vector.transfer_write %21, %alloc[%22, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %23 = affine.apply #map28(%0, %1, %2)
      %24 = vector.transfer_read %alloca[%23, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %25 = affine.apply #map29(%arg2, %0, %1, %2)
      vector.transfer_write %24, %alloc[%25, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %26 = affine.apply #map30(%0, %1, %2)
      %27 = vector.transfer_read %alloca[%26, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %28 = affine.apply #map31(%arg2, %0, %1, %2)
      vector.transfer_write %27, %alloc[%28, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %29 = affine.apply #map32(%0, %1, %2)
      %30 = vector.transfer_read %alloca[%29, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %31 = affine.apply #map33(%arg2, %0, %1, %2)
      vector.transfer_write %30, %alloc[%31, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %32 = affine.apply #map34(%0, %1, %2)
      %33 = vector.transfer_read %alloca[%32, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %34 = affine.apply #map35(%arg2, %0, %1, %2)
      vector.transfer_write %33, %alloc[%34, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %35 = affine.apply #map36(%0, %1, %2)
      %36 = vector.transfer_read %alloca[%35, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %37 = affine.apply #map37(%arg2, %0, %1, %2)
      vector.transfer_write %36, %alloc[%37, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %38 = affine.apply #map38(%0, %1, %2)
      %39 = vector.transfer_read %alloca[%38, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %40 = affine.apply #map39(%arg2, %0, %1, %2)
      vector.transfer_write %39, %alloc[%40, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %41 = affine.apply #map40(%0, %1, %2)
      %42 = vector.transfer_read %alloca[%41, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %43 = affine.apply #map41(%arg2, %0, %1, %2)
      vector.transfer_write %42, %alloc[%43, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %44 = affine.apply #map42(%0, %1, %2)
      %45 = vector.transfer_read %alloca[%44, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %46 = affine.apply #map43(%arg2, %0, %1, %2)
      vector.transfer_write %45, %alloc[%46, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %47 = affine.apply #map44(%0, %1, %2)
      %48 = vector.transfer_read %alloca[%47, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %49 = affine.apply #map45(%arg2, %0, %1, %2)
      vector.transfer_write %48, %alloc[%49, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
      %50 = affine.apply #map46(%0, %1, %2)
      %51 = vector.transfer_read %alloca[%50, %4], %cst {in_bounds = [true, true]} : memref<128x128xf16, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %52 = affine.apply #map47(%arg2, %0, %1, %2)
      vector.transfer_write %51, %alloc[%52, %7] {in_bounds = [true, true]} : vector<1x8xf16>, memref<5376x5376xf16>
    } {mapping = [#gpu.block<y>, #gpu.block<x>]}
    return %alloc : memref<5376x5376xf16>
  }
}
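
The affine maps above boil down to a flattened-thread index computation. A hedged decode in Python, assuming flat_tid = threadIdx.x + 64 * threadIdx.y + 128 * threadIdx.z for the [64, 2, 1] block (the helper name is hypothetical):

# Reproduces the per-thread (row, col) math encoded in the maps above.
# Because 64 and 128 are multiples of the per-row vector counts (4 and 16),
# the column maps can use threadIdx.x alone instead of the full flat id.
def thread_slice(tid_x, tid_y, tid_z, buf_cols, vec_width=8):
    flat_tid = tid_x + 64 * tid_y + 128 * tid_z
    vectors_per_row = buf_cols // vec_width      # 4 for the 128x32 A buffer, 16 for 32x128 and 128x128
    row = flat_tid // vectors_per_row            # e.g. #map2: d1 * 16 + d2 * 32 + d0 floordiv 4
    col = (tid_x % vectors_per_row) * vec_width  # e.g. #map3: d0 * 8 - (d0 floordiv 4) * 32
    return row, col

Maps such as #map and #map20 additionally add the block offset d0 * 128 (d0 being the scf.forall induction variable), and the +32 / +64 / ... variants come from the loop unrolling in step 4.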

@Xinyu302 changed the title from "[WIP][GEMM codegen] Distribute Shared memory copy" to "[GEMM codegen] Distribute Shared memory copy" on Jun 5, 2024
@Xinyu302 (Contributor, Author) commented on Jun 5, 2024

After step 0 (linalg.copy converted to linalg.generic):

scf.forall (%arg2, %arg3) in (42, 42) {
  %alloca = memref.alloca() {__byteir_alloca_accumulator__} : memref<128x128xf16, #gpu.address_space<workgroup>>
  %alloca_0 = memref.alloca() {__byteir_alloca_matrix_b__} : memref<32x128xf16, #gpu.address_space<workgroup>>
  %alloca_1 = memref.alloca() {__byteir_alloca_matrix_a__} : memref<128x32xf16, #gpu.address_space<workgroup>>
  %0 = affine.apply affine_map<(d0) -> (d0 * 128)>(%arg2)
  %1 = affine.apply affine_map<(d0) -> (d0 * 128)>(%arg3)
  %subview = memref.subview %alloc[%0, %1] [128, 128] [1, 1] : memref<5376x5376xf16> to memref<128x128xf16, strided<[5376, 1], offset: ?>>
  linalg.fill ins(%cst : f16) outs(%alloca : memref<128x128xf16, #gpu.address_space<workgroup>>)
  scf.for %arg4 = %c0 to %c2048 step %c32 {
    %subview_2 = memref.subview %arg0[%0, %arg4] [128, 32] [1, 1] : memref<5376x2048xf16> to memref<128x32xf16, strided<[2048, 1], offset: ?>>
    %subview_3 = memref.subview %arg1[%arg4, %1] [32, 128] [1, 1] : memref<2048x5376xf16> to memref<32x128xf16, strided<[5376, 1], offset: ?>>
    linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%subview_2 : memref<128x32xf16, strided<[2048, 1], offset: ?>>) outs(%alloca_1 : memref<128x32xf16, #gpu.address_space<workgroup>>) attrs =  {__byteir_load_matrix_a__, __internal_linalg_transform__ = "__byteir_copy_related_to_workgroup_memory__"} {
    ^bb0(%in: f16, %out: f16):
      linalg.yield %in : f16
    }
    linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%subview_3 : memref<32x128xf16, strided<[5376, 1], offset: ?>>) outs(%alloca_0 : memref<32x128xf16, #gpu.address_space<workgroup>>) attrs =  {__byteir_load_matrix_b__, __internal_linalg_transform__ = "__byteir_copy_related_to_workgroup_memory__"} {
    ^bb0(%in: f16, %out: f16):
      linalg.yield %in : f16
    }
    linalg.matmul {__byteir_gpu_tile_gemm_0, __byteir_mma__, __byteir_mma_level__ = "Threadblock", __byteir_target__ = "nv_sm_80"} ins(%alloca_1, %alloca_0 : memref<128x32xf16, #gpu.address_space<workgroup>>, memref<32x128xf16, #gpu.address_space<workgroup>>) outs(%alloca : memref<128x128xf16, #gpu.address_space<workgroup>>)
  }
  linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%alloca : memref<128x128xf16, #gpu.address_space<workgroup>>) outs(%subview : memref<128x128xf16, strided<[5376, 1], offset: ?>>) attrs =  {__byteir_store_matrix_c__, __internal_linalg_transform__ = "__byteir_copy_related_to_workgroup_memory__"} {
  ^bb0(%in: f16, %out: f16):
    linalg.yield %in : f16
  }
} {mapping = [#gpu.block<y>, #gpu.block<x>]}

@Xinyu302 (Contributor, Author) commented on Jun 5, 2024

After step 1 (tiling):

scf.forall (%arg2, %arg3) in (42, 42) {
  %alloca = memref.alloca() {__byteir_alloca_accumulator__} : memref<128x128xf16, #gpu.address_space<workgroup>>
  %alloca_0 = memref.alloca() {__byteir_alloca_matrix_b__} : memref<32x128xf16, #gpu.address_space<workgroup>>
  %alloca_1 = memref.alloca() {__byteir_alloca_matrix_a__} : memref<128x32xf16, #gpu.address_space<workgroup>>
  %0 = affine.apply affine_map<(d0) -> (d0 * 128)>(%arg2)
  %1 = affine.apply affine_map<(d0) -> (d0 * 128)>(%arg3)
  %subview = memref.subview %alloc[%0, %1] [128, 128] [1, 1] : memref<5376x5376xf16> to memref<128x128xf16, strided<[5376, 1], offset: ?>>
  linalg.fill ins(%cst : f16) outs(%alloca : memref<128x128xf16, #gpu.address_space<workgroup>>)
  scf.for %arg4 = %c0 to %c2048 step %c32 {
    %subview_8 = memref.subview %arg0[%0, %arg4] [128, 32] [1, 1] : memref<5376x2048xf16> to memref<128x32xf16, strided<[2048, 1], offset: ?>>
    %subview_9 = memref.subview %arg1[%arg4, %1] [32, 128] [1, 1] : memref<2048x5376xf16> to memref<32x128xf16, strided<[5376, 1], offset: ?>>
    %c32_10 = arith.constant 32 : index
    %c32_11 = arith.constant 32 : index
    %c0_12 = arith.constant 0 : index
    %c128_13 = arith.constant 128 : index
    %c32_14 = arith.constant 32 : index
    %c0_15 = arith.constant 0 : index
    %c32_16 = arith.constant 32 : index
    %c32_17 = arith.constant 32 : index
    scf.for %arg5 = %c0_12 to %c128_13 step %c32_14 {
      scf.for %arg6 = %c0_15 to %c32_16 step %c32_17 {
        %subview_26 = memref.subview %subview_8[%arg5, %arg6] [32, 32] [1, 1] : memref<128x32xf16, strided<[2048, 1], offset: ?>> to memref<32x32xf16, strided<[2048, 1], offset: ?>>
        %subview_27 = memref.subview %alloca_1[%arg5, %arg6] [32, 32] [1, 1] : memref<128x32xf16, #gpu.address_space<workgroup>> to memref<32x32xf16, strided<[32, 1], offset: ?>, #gpu.address_space<workgroup>>
        linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%subview_26 : memref<32x32xf16, strided<[2048, 1], offset: ?>>) outs(%subview_27 : memref<32x32xf16, strided<[32, 1], offset: ?>, #gpu.address_space<workgroup>>) attrs =  {__byteir_load_matrix_a__, __internal_linalg_transform__ = "copy_to_distribute"} {
        ^bb0(%in: f16, %out: f16):
          linalg.yield %in : f16
        }
      }
    }
    %c8_18 = arith.constant 8 : index
    %c128_19 = arith.constant 128 : index
    %c0_20 = arith.constant 0 : index
    %c32_21 = arith.constant 32 : index
    %c8_22 = arith.constant 8 : index
    %c0_23 = arith.constant 0 : index
    %c128_24 = arith.constant 128 : index
    %c128_25 = arith.constant 128 : index
    scf.for %arg5 = %c0_20 to %c32_21 step %c8_22 {
      scf.for %arg6 = %c0_23 to %c128_24 step %c128_25 {
        %subview_26 = memref.subview %subview_9[%arg5, %arg6] [8, 128] [1, 1] : memref<32x128xf16, strided<[5376, 1], offset: ?>> to memref<8x128xf16, strided<[5376, 1], offset: ?>>
        %subview_27 = memref.subview %alloca_0[%arg5, %arg6] [8, 128] [1, 1] : memref<32x128xf16, #gpu.address_space<workgroup>> to memref<8x128xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>
        linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%subview_26 : memref<8x128xf16, strided<[5376, 1], offset: ?>>) outs(%subview_27 : memref<8x128xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>) attrs =  {__byteir_load_matrix_b__, __internal_linalg_transform__ = "copy_to_distribute"} {
        ^bb0(%in: f16, %out: f16):
          linalg.yield %in : f16
        }
      }
    }
    linalg.matmul {__byteir_gpu_tile_gemm_0, __byteir_mma__, __byteir_mma_level__ = "Threadblock", __byteir_target__ = "nv_sm_80"} ins(%alloca_1, %alloca_0 : memref<128x32xf16, #gpu.address_space<workgroup>>, memref<32x128xf16, #gpu.address_space<workgroup>>) outs(%alloca : memref<128x128xf16, #gpu.address_space<workgroup>>)
  }
  %c8 = arith.constant 8 : index
  %c128 = arith.constant 128 : index
  %c0_2 = arith.constant 0 : index
  %c128_3 = arith.constant 128 : index
  %c8_4 = arith.constant 8 : index
  %c0_5 = arith.constant 0 : index
  %c128_6 = arith.constant 128 : index
  %c128_7 = arith.constant 128 : index
  scf.for %arg4 = %c0_2 to %c128_3 step %c8_4 {
    scf.for %arg5 = %c0_5 to %c128_6 step %c128_7 {
      %subview_8 = memref.subview %alloca[%arg4, %arg5] [8, 128] [1, 1] : memref<128x128xf16, #gpu.address_space<workgroup>> to memref<8x128xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>
      %subview_9 = memref.subview %subview[%arg4, %arg5] [8, 128] [1, 1] : memref<128x128xf16, strided<[5376, 1], offset: ?>> to memref<8x128xf16, strided<[5376, 1], offset: ?>>
      linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%subview_8 : memref<8x128xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>) outs(%subview_9 : memref<8x128xf16, strided<[5376, 1], offset: ?>>) attrs =  {__byteir_store_matrix_c__, __internal_linalg_transform__ = "copy_to_distribute"} {
      ^bb0(%in: f16, %out: f16):
        linalg.yield %in : f16
      }
    }
  }
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
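
A note on the tile sizes produced by the tiling step above (a quick check, assuming 128 threads per block and 8 elements per vector access): A is tiled 32x32 while B and C are tiled 8x128, so every copy tile holds exactly 128 × 8 = 1024 f16 elements and step 2 can hand each thread exactly one 1x8 slice per tile.

# Tile shapes read off the scf.for loops above; the dict keys are just labels.
NUM_THREADS, VEC_WIDTH = 128, 8
for name, (rows, cols) in {"A": (32, 32), "B": (8, 128), "C": (8, 128)}.items():
    assert rows * cols == NUM_THREADS * VEC_WIDTH, name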

@Xinyu302 (Contributor, Author) commented on Jun 5, 2024

After step 2 (thread distribution):

scf.forall (%arg2, %arg3) in (42, 42) {
  %0 = gpu.thread_id  x
  %1 = gpu.thread_id  y
  %2 = gpu.thread_id  z
  %3 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 * 64 + d2 * 128)>(%0, %1, %2)
  %alloca = memref.alloca() {__byteir_alloca_accumulator__} : memref<128x128xf16, #gpu.address_space<workgroup>>
  %alloca_0 = memref.alloca() {__byteir_alloca_matrix_b__} : memref<32x128xf16, #gpu.address_space<workgroup>>
  %alloca_1 = memref.alloca() {__byteir_alloca_matrix_a__} : memref<128x32xf16, #gpu.address_space<workgroup>>
  %4 = affine.apply affine_map<(d0) -> (d0 * 128)>(%arg2)
  %5 = affine.apply affine_map<(d0) -> (d0 * 128)>(%arg3)
  %subview = memref.subview %alloc[%4, %5] [128, 128] [1, 1] : memref<5376x5376xf16> to memref<128x128xf16, strided<[5376, 1], offset: ?>>
  linalg.fill ins(%cst : f16) outs(%alloca : memref<128x128xf16, #gpu.address_space<workgroup>>)
  scf.for %arg4 = %c0 to %c2048 step %c32 {
    %subview_8 = memref.subview %arg0[%4, %arg4] [128, 32] [1, 1] : memref<5376x2048xf16> to memref<128x32xf16, strided<[2048, 1], offset: ?>>
    %subview_9 = memref.subview %arg1[%arg4, %5] [32, 128] [1, 1] : memref<2048x5376xf16> to memref<32x128xf16, strided<[5376, 1], offset: ?>>
    %c32_10 = arith.constant 32 : index
    %c32_11 = arith.constant 32 : index
    %c0_12 = arith.constant 0 : index
    %c128_13 = arith.constant 128 : index
    %c32_14 = arith.constant 32 : index
    %c0_15 = arith.constant 0 : index
    %c32_16 = arith.constant 32 : index
    %c32_17 = arith.constant 32 : index
    scf.for %arg5 = %c0_12 to %c128_13 step %c32_14 {
      scf.for %arg6 = %c0_15 to %c32_16 step %c32_17 {
        %subview_26 = memref.subview %subview_8[%arg5, %arg6] [32, 32] [1, 1] : memref<128x32xf16, strided<[2048, 1], offset: ?>> to memref<32x32xf16, strided<[2048, 1], offset: ?>>
        %subview_27 = memref.subview %alloca_1[%arg5, %arg6] [32, 32] [1, 1] : memref<128x32xf16, #gpu.address_space<workgroup>> to memref<32x32xf16, strided<[32, 1], offset: ?>, #gpu.address_space<workgroup>>
        %c1 = arith.constant 1 : index
        %c8_28 = arith.constant 8 : index
        %6 = affine.apply affine_map<(d0) -> (d0 mod 4)>(%0)
        %c4 = arith.constant 4 : index
        %7 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 16 + d2 * 32 + d0 floordiv 4)>(%0, %1, %2)
        %c32_29 = arith.constant 32 : index
        %8 = affine.apply affine_map<(d0, d1, d2) -> ((d1 * 16 + d2 * 32 + d0 floordiv 4) floordiv 32)>(%0, %1, %2)
        %c0_30 = arith.constant 0 : index
        %c32_31 = arith.constant 32 : index
        %c1_32 = arith.constant 1 : index
        %c0_33 = arith.constant 0 : index
        %c32_34 = arith.constant 32 : index
        %c8_35 = arith.constant 8 : index
        %9 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 16 + d2 * 32 + d0 floordiv 4)>(%0, %1, %2)
        %10 = affine.apply affine_map<() -> (32)>()
        %11 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 4) * 32)>(%0)
        %12 = affine.apply affine_map<() -> (32)>()
        %13 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 16 + d2 * 32 + d0 floordiv 4)>(%0, %1, %2)
        %14 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 4) * 32)>(%0)
        %15 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 16 + d2 * 32 + d0 floordiv 4)>(%0, %1, %2)
        %16 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 4) * 32)>(%0)
        %subview_36 = memref.subview %subview_26[%13, %14] [1, 8] [1, 1] : memref<32x32xf16, strided<[2048, 1], offset: ?>> to memref<1x8xf16, strided<[2048, 1], offset: ?>>
        %subview_37 = memref.subview %subview_27[%15, %16] [1, 8] [1, 1] : memref<32x32xf16, strided<[32, 1], offset: ?>, #gpu.address_space<workgroup>> to memref<1x8xf16, strided<[32, 1], offset: ?>, #gpu.address_space<workgroup>>
        linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%subview_36 : memref<1x8xf16, strided<[2048, 1], offset: ?>>) outs(%subview_37 : memref<1x8xf16, strided<[32, 1], offset: ?>, #gpu.address_space<workgroup>>) attrs =  {__byteir_load_matrix_a__, __internal_linalg_transform__ = "copy_distributed"} {
        ^bb0(%in: f16, %out: f16):
          linalg.yield %in : f16
        }
      }
    }
    %c8_18 = arith.constant 8 : index
    %c128_19 = arith.constant 128 : index
    %c0_20 = arith.constant 0 : index
    %c32_21 = arith.constant 32 : index
    %c8_22 = arith.constant 8 : index
    %c0_23 = arith.constant 0 : index
    %c128_24 = arith.constant 128 : index
    %c128_25 = arith.constant 128 : index
    scf.for %arg5 = %c0_20 to %c32_21 step %c8_22 {
      scf.for %arg6 = %c0_23 to %c128_24 step %c128_25 {
        %subview_26 = memref.subview %subview_9[%arg5, %arg6] [8, 128] [1, 1] : memref<32x128xf16, strided<[5376, 1], offset: ?>> to memref<8x128xf16, strided<[5376, 1], offset: ?>>
        %subview_27 = memref.subview %alloca_0[%arg5, %arg6] [8, 128] [1, 1] : memref<32x128xf16, #gpu.address_space<workgroup>> to memref<8x128xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>
        %c1 = arith.constant 1 : index
        %c8_28 = arith.constant 8 : index
        %6 = affine.apply affine_map<(d0) -> (d0 mod 16)>(%0)
        %c16 = arith.constant 16 : index
        %7 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
        %c8_29 = arith.constant 8 : index
        %8 = affine.apply affine_map<(d0, d1, d2) -> ((d1 * 4 + d2 * 8 + d0 floordiv 16) floordiv 8)>(%0, %1, %2)
        %c0_30 = arith.constant 0 : index
        %c8_31 = arith.constant 8 : index
        %c1_32 = arith.constant 1 : index
        %c0_33 = arith.constant 0 : index
        %c128_34 = arith.constant 128 : index
        %c8_35 = arith.constant 8 : index
        %9 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
        %10 = affine.apply affine_map<() -> (8)>()
        %11 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 16) * 128)>(%0)
        %12 = affine.apply affine_map<() -> (128)>()
        %13 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
        %14 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 16) * 128)>(%0)
        %15 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
        %16 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 16) * 128)>(%0)
        %subview_36 = memref.subview %subview_26[%13, %14] [1, 8] [1, 1] : memref<8x128xf16, strided<[5376, 1], offset: ?>> to memref<1x8xf16, strided<[5376, 1], offset: ?>>
        %subview_37 = memref.subview %subview_27[%15, %16] [1, 8] [1, 1] : memref<8x128xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>> to memref<1x8xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>
        linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%subview_36 : memref<1x8xf16, strided<[5376, 1], offset: ?>>) outs(%subview_37 : memref<1x8xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>) attrs =  {__byteir_load_matrix_b__, __internal_linalg_transform__ = "copy_distributed"} {
        ^bb0(%in: f16, %out: f16):
          linalg.yield %in : f16
        }
      }
    }
    linalg.matmul {__byteir_gpu_tile_gemm_0, __byteir_mma__, __byteir_mma_level__ = "Threadblock", __byteir_target__ = "nv_sm_80"} ins(%alloca_1, %alloca_0 : memref<128x32xf16, #gpu.address_space<workgroup>>, memref<32x128xf16, #gpu.address_space<workgroup>>) outs(%alloca : memref<128x128xf16, #gpu.address_space<workgroup>>)
  }
  %c8 = arith.constant 8 : index
  %c128 = arith.constant 128 : index
  %c0_2 = arith.constant 0 : index
  %c128_3 = arith.constant 128 : index
  %c8_4 = arith.constant 8 : index
  %c0_5 = arith.constant 0 : index
  %c128_6 = arith.constant 128 : index
  %c128_7 = arith.constant 128 : index
  scf.for %arg4 = %c0_2 to %c128_3 step %c8_4 {
    scf.for %arg5 = %c0_5 to %c128_6 step %c128_7 {
      %subview_8 = memref.subview %alloca[%arg4, %arg5] [8, 128] [1, 1] : memref<128x128xf16, #gpu.address_space<workgroup>> to memref<8x128xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>
      %subview_9 = memref.subview %subview[%arg4, %arg5] [8, 128] [1, 1] : memref<128x128xf16, strided<[5376, 1], offset: ?>> to memref<8x128xf16, strided<[5376, 1], offset: ?>>
      %c1 = arith.constant 1 : index
      %c8_10 = arith.constant 8 : index
      %6 = affine.apply affine_map<(d0) -> (d0 mod 16)>(%0)
      %c16 = arith.constant 16 : index
      %7 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
      %c8_11 = arith.constant 8 : index
      %8 = affine.apply affine_map<(d0, d1, d2) -> ((d1 * 4 + d2 * 8 + d0 floordiv 16) floordiv 8)>(%0, %1, %2)
      %c0_12 = arith.constant 0 : index
      %c8_13 = arith.constant 8 : index
      %c1_14 = arith.constant 1 : index
      %c0_15 = arith.constant 0 : index
      %c128_16 = arith.constant 128 : index
      %c8_17 = arith.constant 8 : index
      %9 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
      %10 = affine.apply affine_map<() -> (8)>()
      %11 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 16) * 128)>(%0)
      %12 = affine.apply affine_map<() -> (128)>()
      %13 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
      %14 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 16) * 128)>(%0)
      %15 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
      %16 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 16) * 128)>(%0)
      %subview_18 = memref.subview %subview_8[%13, %14] [1, 8] [1, 1] : memref<8x128xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>> to memref<1x8xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>
      %subview_19 = memref.subview %subview_9[%15, %16] [1, 8] [1, 1] : memref<8x128xf16, strided<[5376, 1], offset: ?>> to memref<1x8xf16, strided<[5376, 1], offset: ?>>
      linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%subview_18 : memref<1x8xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>) outs(%subview_19 : memref<1x8xf16, strided<[5376, 1], offset: ?>>) attrs =  {__byteir_store_matrix_c__, __internal_linalg_transform__ = "copy_distributed"} {
      ^bb0(%in: f16, %out: f16):
        linalg.yield %in : f16
      }
    }
  }
} {mapping = [#gpu.block<y>, #gpu.block<x>]}

@Xinyu302 (Contributor, Author) commented on Jun 5, 2024

After step 3 (vectorization):

scf.forall (%arg2, %arg3) in (42, 42) {
  %0 = gpu.thread_id  x
  %1 = gpu.thread_id  y
  %2 = gpu.thread_id  z
  %3 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 * 64 + d2 * 128)>(%0, %1, %2)
  %alloca = memref.alloca() {__byteir_alloca_accumulator__} : memref<128x128xf16, #gpu.address_space<workgroup>>
  %alloca_0 = memref.alloca() {__byteir_alloca_matrix_b__} : memref<32x128xf16, #gpu.address_space<workgroup>>
  %alloca_1 = memref.alloca() {__byteir_alloca_matrix_a__} : memref<128x32xf16, #gpu.address_space<workgroup>>
  %4 = affine.apply affine_map<(d0) -> (d0 * 128)>(%arg2)
  %5 = affine.apply affine_map<(d0) -> (d0 * 128)>(%arg3)
  %subview = memref.subview %alloc[%4, %5] [128, 128] [1, 1] : memref<5376x5376xf16> to memref<128x128xf16, strided<[5376, 1], offset: ?>>
  linalg.fill ins(%cst : f16) outs(%alloca : memref<128x128xf16, #gpu.address_space<workgroup>>)
  scf.for %arg4 = %c0 to %c2048 step %c32 {
    %subview_8 = memref.subview %arg0[%4, %arg4] [128, 32] [1, 1] : memref<5376x2048xf16> to memref<128x32xf16, strided<[2048, 1], offset: ?>>
    %subview_9 = memref.subview %arg1[%arg4, %5] [32, 128] [1, 1] : memref<2048x5376xf16> to memref<32x128xf16, strided<[5376, 1], offset: ?>>
    %c32_10 = arith.constant 32 : index
    %c32_11 = arith.constant 32 : index
    %c0_12 = arith.constant 0 : index
    %c128_13 = arith.constant 128 : index
    %c32_14 = arith.constant 32 : index
    %c0_15 = arith.constant 0 : index
    %c32_16 = arith.constant 32 : index
    %c32_17 = arith.constant 32 : index
    scf.for %arg5 = %c0_12 to %c128_13 step %c32_14 {
      scf.for %arg6 = %c0_15 to %c32_16 step %c32_17 {
        %subview_26 = memref.subview %subview_8[%arg5, %arg6] [32, 32] [1, 1] : memref<128x32xf16, strided<[2048, 1], offset: ?>> to memref<32x32xf16, strided<[2048, 1], offset: ?>>
        %subview_27 = memref.subview %alloca_1[%arg5, %arg6] [32, 32] [1, 1] : memref<128x32xf16, #gpu.address_space<workgroup>> to memref<32x32xf16, strided<[32, 1], offset: ?>, #gpu.address_space<workgroup>>
        %c1 = arith.constant 1 : index
        %c8_28 = arith.constant 8 : index
        %6 = affine.apply affine_map<(d0) -> (d0 mod 4)>(%0)
        %c4 = arith.constant 4 : index
        %7 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 16 + d2 * 32 + d0 floordiv 4)>(%0, %1, %2)
        %c32_29 = arith.constant 32 : index
        %8 = affine.apply affine_map<(d0, d1, d2) -> ((d1 * 16 + d2 * 32 + d0 floordiv 4) floordiv 32)>(%0, %1, %2)
        %c0_30 = arith.constant 0 : index
        %c32_31 = arith.constant 32 : index
        %c1_32 = arith.constant 1 : index
        %c0_33 = arith.constant 0 : index
        %c32_34 = arith.constant 32 : index
        %c8_35 = arith.constant 8 : index
        %9 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 16 + d2 * 32 + d0 floordiv 4)>(%0, %1, %2)
        %10 = affine.apply affine_map<() -> (32)>()
        %11 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 4) * 32)>(%0)
        %12 = affine.apply affine_map<() -> (32)>()
        %13 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 16 + d2 * 32 + d0 floordiv 4)>(%0, %1, %2)
        %14 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 4) * 32)>(%0)
        %15 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 16 + d2 * 32 + d0 floordiv 4)>(%0, %1, %2)
        %16 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 4) * 32)>(%0)
        %subview_36 = memref.subview %subview_26[%13, %14] [1, 8] [1, 1] : memref<32x32xf16, strided<[2048, 1], offset: ?>> to memref<1x8xf16, strided<[2048, 1], offset: ?>>
        %subview_37 = memref.subview %subview_27[%15, %16] [1, 8] [1, 1] : memref<32x32xf16, strided<[32, 1], offset: ?>, #gpu.address_space<workgroup>> to memref<1x8xf16, strided<[32, 1], offset: ?>, #gpu.address_space<workgroup>>
        %c1_38 = arith.constant 1 : index
        %c8_39 = arith.constant 8 : index
        %c0_40 = arith.constant 0 : index
        %cst_41 = arith.constant 0.000000e+00 : f16
        %17 = vector.transfer_read %subview_36[%c0_40, %c0_40], %cst_41 : memref<1x8xf16, strided<[2048, 1], offset: ?>>, vector<1x8xf16>
        %cst_42 = arith.constant 0.000000e+00 : f16
        %18 = vector.transfer_read %subview_37[%c0_40, %c0_40], %cst_42 : memref<1x8xf16, strided<[32, 1], offset: ?>, #gpu.address_space<workgroup>>, vector<1x8xf16>
        %c0_43 = arith.constant 0 : index
        vector.transfer_write %17, %subview_37[%c0_43, %c0_43] : vector<1x8xf16>, memref<1x8xf16, strided<[32, 1], offset: ?>, #gpu.address_space<workgroup>>
      }
    }
    %c8_18 = arith.constant 8 : index
    %c128_19 = arith.constant 128 : index
    %c0_20 = arith.constant 0 : index
    %c32_21 = arith.constant 32 : index
    %c8_22 = arith.constant 8 : index
    %c0_23 = arith.constant 0 : index
    %c128_24 = arith.constant 128 : index
    %c128_25 = arith.constant 128 : index
    scf.for %arg5 = %c0_20 to %c32_21 step %c8_22 {
      scf.for %arg6 = %c0_23 to %c128_24 step %c128_25 {
        %subview_26 = memref.subview %subview_9[%arg5, %arg6] [8, 128] [1, 1] : memref<32x128xf16, strided<[5376, 1], offset: ?>> to memref<8x128xf16, strided<[5376, 1], offset: ?>>
        %subview_27 = memref.subview %alloca_0[%arg5, %arg6] [8, 128] [1, 1] : memref<32x128xf16, #gpu.address_space<workgroup>> to memref<8x128xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>
        %c1 = arith.constant 1 : index
        %c8_28 = arith.constant 8 : index
        %6 = affine.apply affine_map<(d0) -> (d0 mod 16)>(%0)
        %c16 = arith.constant 16 : index
        %7 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
        %c8_29 = arith.constant 8 : index
        %8 = affine.apply affine_map<(d0, d1, d2) -> ((d1 * 4 + d2 * 8 + d0 floordiv 16) floordiv 8)>(%0, %1, %2)
        %c0_30 = arith.constant 0 : index
        %c8_31 = arith.constant 8 : index
        %c1_32 = arith.constant 1 : index
        %c0_33 = arith.constant 0 : index
        %c128_34 = arith.constant 128 : index
        %c8_35 = arith.constant 8 : index
        %9 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
        %10 = affine.apply affine_map<() -> (8)>()
        %11 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 16) * 128)>(%0)
        %12 = affine.apply affine_map<() -> (128)>()
        %13 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
        %14 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 16) * 128)>(%0)
        %15 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
        %16 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 16) * 128)>(%0)
        %subview_36 = memref.subview %subview_26[%13, %14] [1, 8] [1, 1] : memref<8x128xf16, strided<[5376, 1], offset: ?>> to memref<1x8xf16, strided<[5376, 1], offset: ?>>
        %subview_37 = memref.subview %subview_27[%15, %16] [1, 8] [1, 1] : memref<8x128xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>> to memref<1x8xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>
        %c1_38 = arith.constant 1 : index
        %c8_39 = arith.constant 8 : index
        %c0_40 = arith.constant 0 : index
        %cst_41 = arith.constant 0.000000e+00 : f16
        %17 = vector.transfer_read %subview_36[%c0_40, %c0_40], %cst_41 : memref<1x8xf16, strided<[5376, 1], offset: ?>>, vector<1x8xf16>
        %cst_42 = arith.constant 0.000000e+00 : f16
        %18 = vector.transfer_read %subview_37[%c0_40, %c0_40], %cst_42 : memref<1x8xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>, vector<1x8xf16>
        %c0_43 = arith.constant 0 : index
        vector.transfer_write %17, %subview_37[%c0_43, %c0_43] : vector<1x8xf16>, memref<1x8xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>
      }
    }
    linalg.matmul {__byteir_gpu_tile_gemm_0, __byteir_mma__, __byteir_mma_level__ = "Threadblock", __byteir_target__ = "nv_sm_80"} ins(%alloca_1, %alloca_0 : memref<128x32xf16, #gpu.address_space<workgroup>>, memref<32x128xf16, #gpu.address_space<workgroup>>) outs(%alloca : memref<128x128xf16, #gpu.address_space<workgroup>>)
  }
  %c8 = arith.constant 8 : index
  %c128 = arith.constant 128 : index
  %c0_2 = arith.constant 0 : index
  %c128_3 = arith.constant 128 : index
  %c8_4 = arith.constant 8 : index
  %c0_5 = arith.constant 0 : index
  %c128_6 = arith.constant 128 : index
  %c128_7 = arith.constant 128 : index
  scf.for %arg4 = %c0_2 to %c128_3 step %c8_4 {
    scf.for %arg5 = %c0_5 to %c128_6 step %c128_7 {
      %subview_8 = memref.subview %alloca[%arg4, %arg5] [8, 128] [1, 1] : memref<128x128xf16, #gpu.address_space<workgroup>> to memref<8x128xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>
      %subview_9 = memref.subview %subview[%arg4, %arg5] [8, 128] [1, 1] : memref<128x128xf16, strided<[5376, 1], offset: ?>> to memref<8x128xf16, strided<[5376, 1], offset: ?>>
      %c1 = arith.constant 1 : index
      %c8_10 = arith.constant 8 : index
      %6 = affine.apply affine_map<(d0) -> (d0 mod 16)>(%0)
      %c16 = arith.constant 16 : index
      %7 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
      %c8_11 = arith.constant 8 : index
      %8 = affine.apply affine_map<(d0, d1, d2) -> ((d1 * 4 + d2 * 8 + d0 floordiv 16) floordiv 8)>(%0, %1, %2)
      %c0_12 = arith.constant 0 : index
      %c8_13 = arith.constant 8 : index
      %c1_14 = arith.constant 1 : index
      %c0_15 = arith.constant 0 : index
      %c128_16 = arith.constant 128 : index
      %c8_17 = arith.constant 8 : index
      %9 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
      %10 = affine.apply affine_map<() -> (8)>()
      %11 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 16) * 128)>(%0)
      %12 = affine.apply affine_map<() -> (128)>()
      %13 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
      %14 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 16) * 128)>(%0)
      %15 = affine.apply affine_map<(d0, d1, d2) -> (d1 * 4 + d2 * 8 + d0 floordiv 16)>(%0, %1, %2)
      %16 = affine.apply affine_map<(d0) -> (d0 * 8 - (d0 floordiv 16) * 128)>(%0)
      %subview_18 = memref.subview %subview_8[%13, %14] [1, 8] [1, 1] : memref<8x128xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>> to memref<1x8xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>
      %subview_19 = memref.subview %subview_9[%15, %16] [1, 8] [1, 1] : memref<8x128xf16, strided<[5376, 1], offset: ?>> to memref<1x8xf16, strided<[5376, 1], offset: ?>>
      %c1_20 = arith.constant 1 : index
      %c8_21 = arith.constant 8 : index
      %c0_22 = arith.constant 0 : index
      %cst_23 = arith.constant 0.000000e+00 : f16
      %17 = vector.transfer_read %subview_18[%c0_22, %c0_22], %cst_23 : memref<1x8xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>, vector<1x8xf16>
      %cst_24 = arith.constant 0.000000e+00 : f16
      %18 = vector.transfer_read %subview_19[%c0_22, %c0_22], %cst_24 : memref<1x8xf16, strided<[5376, 1], offset: ?>>, vector<1x8xf16>
      %c0_25 = arith.constant 0 : index
      vector.transfer_write %17, %subview_19[%c0_25, %c0_25] : vector<1x8xf16>, memref<1x8xf16, strided<[5376, 1], offset: ?>>
    }
  }
} {mapping = [#gpu.block<y>, #gpu.block<x>]}

@XG-zheng merged commit e5248af into bytedance:main on Jun 20, 2024. 3 checks passed.
Vremold added a commit that referenced this pull request Jul 4, 2024
  - 59c2bbb [compiler] fix DeviceGraphCluster for if op (#362)
  - 93b7671 [runtime] support bmm with crr rcr ccr layout (#350)
  - f24924f [compiler] remove cat aggressive mode (#361)
  - f187897 [torch-frontend] change setupBackendTypeConversion to set...
  - e5248af [GEMM codegen] Distribute Shared memory copy (#303)
  - d1d1fa9 [torch-frontend] update torch-mlir to c7d52f63b482b2c30f4...
  - 01098f8 [compiler] fix compilation on AArch64 platform (#358)
  - c576d6f [Gemm Codegen]add optimize-vector-transfer (#301)
  - 7fa4807 [e2e] add profiler entry for single stablehlo/mhlo file (...

GitOrigin-RevId: 59c2bbb