Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undefined behavior introduced by reuse_at transformation in HeteroCL #185

Open
wyanzhao opened this issue Apr 21, 2023 · 6 comments
Open

Comments

@wyanzhao
Copy link
Collaborator

We are currently performing formal verification on HeteroCL, and we have discovered an undefined behavior introduced by the reuse_at transformation. The input code is as follows:

func.func @blur_x(%A: memref<10x10xf32>, %B: memref<8x10xf32>)
{
    %s = hcl.create_op_handle "s"
    %li = hcl.create_loop_handle %s, "i"
    %lj = hcl.create_loop_handle %s, "j"
    affine.for %i = 0 to 8 {
        affine.for %j = 0 to 10 {
            %tmp = affine.load %A[%i, %j] : memref<10x10xf32>
            %tmp1 = affine.load %A[%i+1, %j] : memref<10x10xf32>
            %tmp2 = affine.load %A[%i+2, %j] : memref<10x10xf32>
            %sum = arith.addf %tmp, %tmp1: f32
            %sum1 = arith.addf %sum, %tmp2: f32
            affine.store %sum1, %B[%i, %j] : memref<8x10xf32>
        } { loop_name = "j" }
    } { loop_name = "i", op_name = "s" }
    %buf = hcl.reuse_at(%A: memref<10x10xf32>, %li) -> memref<3x10xf32>
    return
}

After running hcl-opt -opt, the output code is:

  func.func @blur_x(%arg0: memref<10x10xf32>, %arg1: memref<8x10xf32>) {
    %0 = memref.alloc() {name = "s_reuse_0"} : memref<3x10xf32>
    affine.for %arg2 = 0 to 10 {
      affine.for %arg3 = 0 to 10 {
        %1 = affine.load %0[1, %arg3] : memref<3x10xf32>
        affine.store %1, %0[0, %arg3] : memref<3x10xf32>
        %2 = affine.load %0[2, %arg3] : memref<3x10xf32>
        affine.store %2, %0[1, %arg3] : memref<3x10xf32>
        %3 = affine.load %arg0[%arg2, %arg3] : memref<10x10xf32>
        affine.store %3, %0[2, %arg3] : memref<3x10xf32>
      } {spatial}
      affine.if #set0(%arg2) {
        affine.for %arg3 = 0 to 10 {
          %1 = affine.load %0[0, %arg3] : memref<3x10xf32>
          %2 = affine.load %0[1, %arg3] : memref<3x10xf32>
          %3 = affine.load %0[2, %arg3] : memref<3x10xf32>
          %4 = arith.addf %1, %2 : f32
          %5 = arith.addf %4, %3 : f32
          affine.store %5, %arg1[%arg2 - 2, %arg3] : memref<8x10xf32>
        } {loop_name = "j"}
      }
    } {loop_name = "i", op_name = "s"}
    return
  }

The issue occurs when HeteroCL allocates a new memory block (%0 = memref.alloc() {name = "s_reuse_0"} : memref<3x10xf32>) and reads from it without initializing it first (%1 = affine.load %0[1, %arg3] : memref<3x10xf32>.

In MLIR, reading uninitialized memory is considered undefined behavior. The input code only has undefined behavior of reading from A and B when they are uninitialized, which implies that the reuse_at transformation introduces new undefined behavior does not exist in the input code.

@chhzh123
Copy link
Member

Thanks for providing the issue. In this piece of code, those undefined values will not be used in actual computation, so if you plug in some example inputs, the output should be the same as the original program. Indeed, it would be better to initialize those arrays, but we have not figured out a clean way that will not introduce hardware overheads. Explicitly writing for loops to initialize increases latency.

@wyanzhao
Copy link
Collaborator Author

Thank you for your response. According to some research, writing an uninitialized value back to the same location can lead to undefined behavior on some specific backends, such as C-based compilers, as mentioned in this article: https://queue.acm.org/detail.cfm?id=3041020, please check this example in the article:

void f(void) {
  unsigned char x[1]; /* intentionally uninitialized */
  x[0] ^= x[0];
  printf("%d\n", x[0]);
  printf("%d\n", x[0]);
  return;
}

@chhzh123
Copy link
Member

Right, this is problematic. We need to discuss what is the best way to initialize the array. It seems MLIR does not provide shorthands like int a[10] = {0} to initialize all the elements to zero.

@wyanzhao
Copy link
Collaborator Author

It appears that using linalg.fill could be an effective solution for addressing the uninitialized memory issue in HeteroCL. Xilinx adopts this approach to initialize memref in their MLIR-AIR code, as demonstrated in the following documentation: https://xilinx.github.io/mlir-air/AIRTransformPasses.html. Here is an example from the documentation that illustrates the usage of linalg.fill to initialize a memref:

#map = affine_map<()[s0] -> (s0 * 32)>
module attributes {torch.debug_module_name = "mmult"} {
  func.func @forward(%arg0: memref<1024x1024xf32>, %arg1: memref<1024x1024xf32>, %arg2: memref<1024x1024xf32>) {
    %c1024 = arith.constant 1024 : index
    %cst = arith.constant 0.000000e+00 : f32
    %c32 = arith.constant 32 : index
    %c128 = arith.constant 128 : index
    %c0 = arith.constant 0 : index
    %c4 = arith.constant 4 : index
    %0 = memref.alloc() {alignment = 128 : i64} : memref<1024x1024xf32>
    %1 = memref.alloc() {alignment = 128 : i64} : memref<1024x1024xf32>
    linalg.fill ins(%cst : f32) outs(%0 : memref<1024x1024xf32>)
    memref.copy %0, %1 : memref<1024x1024xf32> to memref<1024x1024xf32>
  }
}

@zhangzhiru
Copy link

Just a note that in practical hardware design, we do not want to initialize a memory block if it does not affect the functionality. The overhead is too high. So adding a fill operation or an initializer is not the solution here.

@chhzh123 can we avoid reading from the uninitialized array in this specific case?

@chhzh123
Copy link
Member

It also requires some control logic to skip those uninitialized elements, which may also increase hardware complexity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants