Improve performance for large dilation convolution #1887
Comments
Specific optimizations are necessary for dilated conv2d, and we do need some of them to skip the unnecessary computations.
I think you are using CPU backends. Currently, we always create an intermediate buffer for dilation (compute_root). We can use compute_inline + unroll to eliminate multiplications of zeros and large intermediate buffers. Significant performance gain can be obtained here.
I am using the GPU backend. I don't quite understand how compute_inline + unroll can solve this problem.
I will add a section to the tutorial later. |
The idea is to inline dilation rather than use a separate stage. Then we can unroll some axes and use the simplifier in tvm to remove the multiplications of zeros. This also avoids large intermediate buffers. You can follow the example below and change the implementation in TOPI.

import tvm
import topi
from tvm.contrib.util import get_lower_ir
from topi.nn.util import get_pad_tuple
from topi.util import equal_const_int
# args of a conv2d
N, H, W, CO, CI, KH, KW, strides, padding = 1, 14, 14, 512, 512, 3, 3, (1, 1), (1, 1)
# large dilation
dilation = (4, 4)

def current_schedule():
    """The current bad schedule, as a reference"""
    data = tvm.placeholder((N, CI, H, W), name='data')
    raw_kernel = tvm.placeholder((CO, CI, KH, KW), name='kernel')

    # dilate as a separate stage before conv
    dilated_kernel = topi.nn.dilate(raw_kernel, (1, 1) + dilation, name='DilatedKernel')
    conv = topi.nn.conv2d_nchw(data, dilated_kernel, strides, padding, 'float32')

    s = tvm.create_schedule([conv.op])
    pad_data = s[conv].op.input_tensors[0]
    s[pad_data].compute_inline()
    s[dilated_kernel].compute_inline()

    AA = s.cache_read(pad_data, 'shared', conv)
    BB = s.cache_read(dilated_kernel, 'shared', conv)

    ci = s[conv].op.reduce_axis[0]
    s[AA].compute_at(s[conv], ci)
    s[BB].compute_at(s[conv], ci)

    print(get_lower_ir(s))


def better_schedule():
    """The better schedule optimized for dilation"""
    data = tvm.placeholder((N, CI, H, W), name='data')
    raw_kernel = tvm.placeholder((CO, CI, KH, KW), name='kernel')

    dilate_args = (1, 1) + dilation

    def dilate_kernel(*indices):  # This function is the same as topi.nn.dilate, but inlined
        not_zero = []
        index_tuple = []
        for i in range(len(dilate_args)):
            if not equal_const_int(dilate_args[i], 1):
                index_tuple.append(indices[i] // dilate_args[i])
                not_zero.append((indices[i] % dilate_args[i]).equal(0))
            else:
                index_tuple.append(indices[i])
        if not_zero:
            not_zero = tvm.all(*not_zero)
            return tvm.select(not_zero, raw_kernel(*index_tuple), tvm.const(0.0, data.dtype))
        return raw_kernel(*index_tuple)

    kernel_h = (KH - 1) * dilation[0] + 1
    kernel_w = (KW - 1) * dilation[1] + 1

    # vanilla conv
    pad_top, pad_left, pad_down, pad_right = get_pad_tuple(padding, (kernel_h, kernel_w))
    out_height = (H - kernel_h + pad_top + pad_down) // strides[0] + 1
    out_width = (W - kernel_w + pad_left + pad_right) // strides[1] + 1
    pad_before = [0, 0, pad_top, pad_left]
    pad_after = [0, 0, pad_down, pad_right]
    pad_data = topi.nn.pad(data, pad_before, pad_after, name="pad_temp")

    rc = tvm.reduce_axis((0, CI), name='rc')
    ry = tvm.reduce_axis((0, kernel_h), name='ry')
    rx = tvm.reduce_axis((0, kernel_w), name='rx')
    conv = tvm.compute((N, CO, out_height, out_width),
                       lambda nn, ff, yy, xx: tvm.sum(
                           pad_data[nn, rc, yy * strides[0] + ry, xx * strides[1] + rx] *
                           dilate_kernel(ff, rc, ry, rx),  # call the inlined dilation function here
                           axis=[rc, ry, rx]))

    s = tvm.create_schedule([conv.op])
    pad_data = s[conv].op.input_tensors[0]
    s[pad_data].compute_inline()

    AA = s.cache_read(pad_data, 'shared', conv)
    BB = s.cache_read(raw_kernel, 'shared', conv)

    n, c, h, w = s[conv].op.axis
    ci, kh, kw = s[conv].op.reduce_axis
    s[AA].compute_at(s[conv], ci)
    s[BB].compute_at(s[conv], ci)

    # use unroll + simplifier to eliminate multiplications of zeros
    s[conv].unroll(kh)
    s[conv].unroll(kw)

    print(get_lower_ir(s))


print("Current Schedule")
current_schedule()
print("=================================")
print("Better Schedule")
better_schedule()

Output:

Current Schedule
// attr [compute] storage_scope = "global"
allocate compute[float32 * 1 * 512 * 8 * 8]
// attr [pad_temp.shared] storage_scope = "shared"
allocate pad_temp.shared[float32 * 1 * 1 * 9 * 9]
// attr [DilatedKernel.shared] storage_scope = "shared"
allocate DilatedKernel.shared[float32 * 1 * 1 * 9 * 9]
produce compute {
  for (ff, 0, 512) {
    for (yy, 0, 8) {
      for (xx, 0, 8) {
        compute[((((ff*8) + yy)*8) + xx)] = 0.000000f
        for (rc, 0, 512) {
          produce pad_temp.shared {
            for (ax2, 0, 9) {
              for (ax3, 0, 9) {
                pad_temp.shared[((ax2*9) + ax3)] = tvm_if_then_else((((((1 - ax2) <= yy) && (yy < (15 - ax2))) && ((1 - ax3) <= xx)) && (xx < (15 - ax3))), data[((((((yy*14) + xx) + (rc*196)) + (ax2*14)) + ax3) + -15)], 0.000000f)
              }
            }
          }
          // BAD: large useless intermediate buffer
          produce DilatedKernel.shared {
            for (ax2, 0, 9) {
              for (ax3, 0, 9) {
                DilatedKernel.shared[((ax2*9) + ax3)] = tvm_if_then_else((((ax2 % 4) == 0) && ((ax3 % 4) == 0)), kernel[((((((ff*512) + rc)*3) + (ax2/4))*3) + (ax3/4))], 0.000000f)
              }
            }
          }
          // BAD: many operands of the multiplications are zero, which means these multiplications are useless
          for (ry, 0, 9) {
            for (rx, 0, 9) {
              compute[((((ff*8) + yy)*8) + xx)] = (compute[((((ff*8) + yy)*8) + xx)] + (pad_temp.shared[((ry*9) + rx)]*DilatedKernel.shared[((ry*9) + rx)]))
            }
          }
        }
      }
    }
  }
}
=================================
Better Schedule
// attr [compute] storage_scope = "global"
allocate compute[float32 * 1 * 512 * 8 * 8]
// attr [pad_temp.shared] storage_scope = "shared"
allocate pad_temp.shared[float32 * 1 * 1 * 9 * 9]
// attr [kernel.shared] storage_scope = "shared"
allocate kernel.shared[float32 * 1 * 1 * 3 * 3]
produce compute {
  for (ff, 0, 512) {
    for (yy, 0, 8) {
      for (xx, 0, 8) {
        compute[((((ff*8) + yy)*8) + xx)] = 0.000000f
        for (rc, 0, 512) {
          produce pad_temp.shared {
            for (ax2, 0, 9) {
              for (ax3, 0, 9) {
                pad_temp.shared[((ax2*9) + ax3)] = tvm_if_then_else((((((1 - ax2) <= yy) && (yy < (15 - ax2))) && ((1 - ax3) <= xx)) && (xx < (15 - ax3))), data[((((((yy*14) + xx) + (rc*196)) + (ax2*14)) + ax3) + -15)], 0.000000f)
              }
            }
          }
          // GOOD: no extra buffer
          produce kernel.shared {
            for (ax2, 0, 3) {
              for (ax3, 0, 3) {
                kernel.shared[((ax2*3) + ax3)] = kernel[((((((ff*512) + rc)*3) + ax2)*3) + ax3)]
              }
            }
          }
          // GOOD: minimal computation
          compute[((((ff*8) + yy)*8) + xx)] = (compute[((((ff*8) + yy)*8) + xx)] + (pad_temp.shared[0]*kernel.shared[0]))
          compute[((((ff*8) + yy)*8) + xx)] = (compute[((((ff*8) + yy)*8) + xx)] + (pad_temp.shared[4]*kernel.shared[1]))
          compute[((((ff*8) + yy)*8) + xx)] = (compute[((((ff*8) + yy)*8) + xx)] + (pad_temp.shared[8]*kernel.shared[2]))
          compute[((((ff*8) + yy)*8) + xx)] = (compute[((((ff*8) + yy)*8) + xx)] + (pad_temp.shared[36]*kernel.shared[3]))
          compute[((((ff*8) + yy)*8) + xx)] = (compute[((((ff*8) + yy)*8) + xx)] + (pad_temp.shared[40]*kernel.shared[4]))
          compute[((((ff*8) + yy)*8) + xx)] = (compute[((((ff*8) + yy)*8) + xx)] + (pad_temp.shared[44]*kernel.shared[5]))
          compute[((((ff*8) + yy)*8) + xx)] = (compute[((((ff*8) + yy)*8) + xx)] + (pad_temp.shared[72]*kernel.shared[6]))
          compute[((((ff*8) + yy)*8) + xx)] = (compute[((((ff*8) + yy)*8) + xx)] + (pad_temp.shared[76]*kernel.shared[7]))
          compute[((((ff*8) + yy)*8) + xx)] = (compute[((((ff*8) + yy)*8) + xx)] + (pad_temp.shared[80]*kernel.shared[8]))
        }
      }
    }
  }
}
@merrymercy Thank you very much for your detailed explanation. I applied this change locally, and the time cost of my model dropped from 300+ ms to 50 ms, a great performance boost! However, I still have some questions:
I limited the extent of the ry and rx axes to the original kernel size KH and KW, and moved the dilation address calculation into the pad_data array indexing. So there is no need to call dilate_kernel, and unrolling is not necessary either. This code also makes the generated IR more intuitive for developers to understand.
I posted all my code here:
Output:
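A minimal sketch of the compute definition being described (illustrative only, reusing the shapes and the pad_data / raw_kernel names from the earlier example rather than the poster's exact code) might look like:

# Sketch (assumed, not the poster's exact code): reduce over the original KH x KW
# and fold the dilation into the pad_data index, so no dilate_kernel call is needed.
rc = tvm.reduce_axis((0, CI), name='rc')
ry = tvm.reduce_axis((0, KH), name='ry')
rx = tvm.reduce_axis((0, KW), name='rx')
conv = tvm.compute((N, CO, out_height, out_width),
                   lambda nn, ff, yy, xx: tvm.sum(
                       pad_data[nn, rc, yy * strides[0] + ry * dilation[0],
                                xx * strides[1] + rx * dilation[1]] *
                       raw_kernel[ff, rc, ry, rx],
                       axis=[rc, ry, rx]))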
Good observation. cache_read can only cache a contiguous region; this is a current limitation. We can add an explicit packing stage as a workaround.

import tvm
import topi
from tvm.contrib.util import get_lower_ir
from topi.nn.util import get_pad_tuple
from topi.util import equal_const_int
# args of a conv2d
N, H, W, CO, CI, KH, KW, strides, padding = 1, 14, 14, 512, 512, 3, 3, (1, 1), (1, 1)
# large dilation
dilation = (4, 4)

def current_schedule():
    """The current bad schedule, as a reference"""
    data = tvm.placeholder((N, CI, H, W), name='data')
    raw_kernel = tvm.placeholder((CO, CI, KH, KW), name='kernel')

    # dilate as a separate stage before conv
    dilated_kernel = topi.nn.dilate(raw_kernel, (1, 1) + dilation, name='DilatedKernel')
    conv = topi.nn.conv2d_nchw(data, dilated_kernel, strides, padding, 'float32')

    s = tvm.create_schedule([conv.op])
    pad_data = s[conv].op.input_tensors[0]
    s[pad_data].compute_inline()
    s[dilated_kernel].compute_inline()

    AA = s.cache_read(pad_data, 'shared', conv)
    BB = s.cache_read(dilated_kernel, 'shared', conv)

    ci = s[conv].op.reduce_axis[0]
    s[AA].compute_at(s[conv], ci)
    s[BB].compute_at(s[conv], ci)

    print(get_lower_ir(s))


def better_schedule():
    """The better schedule optimized for dilation"""
    data = tvm.placeholder((N, CI, H, W), name='data')
    raw_kernel = tvm.placeholder((CO, CI, KH, KW), name='kernel')

    dilate_args = (1, 1) + dilation

    def dilate_kernel(*indices):  # This function is the same as topi.nn.dilate, but inlined
        not_zero = []
        index_tuple = []
        for i in range(len(dilate_args)):
            if not equal_const_int(dilate_args[i], 1):
                index_tuple.append(indices[i] // dilate_args[i])
                not_zero.append((indices[i] % dilate_args[i]).equal(0))
            else:
                index_tuple.append(indices[i])
        if not_zero:
            not_zero = tvm.all(*not_zero)
            return tvm.select(not_zero, raw_kernel(*index_tuple), tvm.const(0.0, data.dtype))
        return raw_kernel(*index_tuple)

    kernel_h = (KH - 1) * dilation[0] + 1
    kernel_w = (KW - 1) * dilation[1] + 1

    # vanilla conv
    pad_top, pad_left, pad_down, pad_right = get_pad_tuple(padding, (kernel_h, kernel_w))
    out_height = (H - kernel_h + pad_top + pad_down) // strides[0] + 1
    out_width = (W - kernel_w + pad_left + pad_right) // strides[1] + 1
    pad_before = [0, 0, pad_top, pad_left]
    pad_after = [0, 0, pad_down, pad_right]
    pad_data = topi.nn.pad(data, pad_before, pad_after, name="pad_temp")

    ##### EXPLICIT PACKING #####
    packed_data = tvm.compute(
        (N, CI, out_height, out_width, KH, KW),
        lambda n, f, y, x, kh, kw:
            pad_data[n, f, y * strides[0] + kh * dilation[0], x * strides[1] + kw * dilation[1]],
        name='packed_data')

    rc = tvm.reduce_axis((0, CI), name='rc')
    ry = tvm.reduce_axis((0, KH), name='ry')
    rx = tvm.reduce_axis((0, KW), name='rx')
    conv = tvm.compute((N, CO, out_height, out_width),
                       lambda nn, ff, yy, xx: tvm.sum(
                           packed_data[nn, rc, yy, xx, ry, rx] *
                           raw_kernel[ff, rc, ry, rx], axis=[rc, ry, rx]))

    s = tvm.create_schedule([conv.op])
    s[pad_data].compute_inline()

    BB = s.cache_read(raw_kernel, 'shared', conv)

    n, c, h, w = s[conv].op.axis
    ci, kh, kw = s[conv].op.reduce_axis
    s[BB].compute_at(s[conv], ci)
    s[packed_data].compute_at(s[conv], ci)

    # use unroll + simplifier to eliminate multiplications of zeros
    s[conv].unroll(kh)
    s[conv].unroll(kw)

    print(get_lower_ir(s))


print("Current Schedule")
current_schedule()
print("=================================")
print("Better Schedule")
better_schedule()

Output:
@merrymercy Great idea! Based on your code, I made packed_data inlined and added a cache_read from it, so the unnecessary memory loads are eliminated while I can still cache the data for reuse.
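A minimal sketch of that change (illustrative, assuming it is applied to the better_schedule with explicit packing above, not the commenter's exact code):

# Sketch (assumed): inline the packing stage and cache_read from it instead, so no
# separate packed_data buffer is materialized but the gathered data is still reused.
s = tvm.create_schedule([conv.op])
s[pad_data].compute_inline()
s[packed_data].compute_inline()                    # packing is now computed on the fly

AA = s.cache_read(packed_data, 'shared', [conv])   # cache the packed (dilation-gathered) data
BB = s.cache_read(raw_kernel, 'shared', [conv])

ci, kh, kw = s[conv].op.reduce_axis
s[AA].compute_at(s[conv], ci)
s[BB].compute_at(s[conv], ci)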
Your contributions are welcome |
Eliminate unnecessary zero multiplications introduced by dilated kernel
@merrymercy I think there could be a similar issue in the int8 conv2d on CUDA. As for the limitation of cache_read, the redundant data loads can be minimized if I do some tiling so that the holes inside the loaded data also get used, right?
Some networks have convolutions with large dilation. DeepLab v3, for example, may use dilation values of 12, 24, and 36. The current implementation dilates the kernel to a bigger size and then performs a normal conv. But with a dilation of 24, the dilated kernel size becomes 49x49, which is quite large and quite likely to be the bottleneck of the whole network.
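For reference, the dilated kernel extent follows (K - 1) * dilation + 1, the same formula used for kernel_h and kernel_w in the snippets above; a quick check for a 3x3 kernel with dilation 24:

# worked example of the dilated kernel extent
KH, dilation = 3, 24
dilated_kh = (KH - 1) * dilation + 1
print(dilated_kh)  # 49, so the dilated kernel is 49x49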