Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ARM64] Clobbering x0-x15 Breaks Down on Clang #421

Closed
xrq-phys opened this issue Jul 12, 2020 · 4 comments
Closed

[ARM64] Clobbering x0-x15 Breaks Down on Clang #421

xrq-phys opened this issue Jul 12, 2020 · 4 comments
Labels

Comments

@xrq-phys
Copy link
Collaborator

xrq-phys commented Jul 12, 2020

Dear BLIS developers,

This seems a duplicate of #207 . Sorry for having opened another issue.
The current arm64 kernels and new armsve kernel play well with GCC but breaks down under Clang with:

kernels/armv8a/3/bli_gemm_armv8a_opt_4x4.c:73:1: fatal error: inline assembly requires more registers than available.

Toolchain:

  • LLVM 6.0 + Clang or
  • ARM Allinea Studio 20.1;

Possible Reason:
Number of input registers + Number of clobbers becomes larger than 30?

Workaround:
Comment out clobbering of x0-x15. These general-purpose registers can be discarded w.r.t. the AArch64 calling convention and results of C code lying between function entry and __asm__ volatile are saved to memory as input registers. There should be no hazard on doing this.
e.g.

kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c:1068:
 :// Register clobber list
> // "x0", "x1", "x2","x3","x4",
> // "x5", "x6", "x7", "x8",
> // "x9", "x10","x11","x12",
> // "x13","x14","x15",
  "x16","x17","x18","x19",       
  "x20","x21","x22","x23",
  "x24","x25","x26","x27",
  "v0", "v1", "v2", "v3",
  "v4", "v5", "v6", "v7",
  "v8", "v9", "v10","v11",
  "v12","v13","v14","v15",
  "v16","v17","v18","v19",
  "v20","v21","v22","v23",
  "v24","v25","v26","v27",
  "v28","v29","v30","v31"
...
kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c:2061:
:// Register clobber list
> //"x0","x1","x2","x3",
> //"x4","x5","x6",
> //"x7","x8","x9",
> //"x10","x11","x12","x13","x14",
  "x16","x17",
  "x20","x21","x22","x23","x24","x25","x26",
  "x27",       
  "v0","v1","v2",
  "v3","v4","v5",
  "v6","v7","v8",
  "v9","v10","v11",
  "v12","v13","v14",
  "v15","v16","v17","v18","v19",
  "v20","v21","v22","v23",
  "v24","v25","v26","v27",
  "v28","v29","v30","v31"

@devinamatthews
Copy link
Member

@xrq-phys Have you tested commenting out the clobbers (and if you do can you please send the generated assembly)? I would be concerned about this because I would have thought that the compiler took the general purpose nature of those registers into account when computing the register allocation, and so removing the clobbers would wither a) not change anything, or b) cause the compiler to actually use them and then something gets screwed up.

@xrq-phys
Copy link
Collaborator Author

@devinamatthews I'm afraid your concern is correct. Commenting out x0-15 in bli_gemm_armv8a_asm_d6x8.c:bli_dgemm_armv8a_asm_6x8 is giving me this after the __asm__ volatile block:

.DEND:


        //NO_APP
        str     x2, [sp, #24]           // 8-byte Folded Spill <<-- BROKEN
        str     x3, [sp, #16]           // 8-byte Folded Spill <<-- BROKEN
        ldp     x29, x30, [sp, #288]    // 8-byte Folded Reload
        ldp     x21, x20, [sp, #272]    // 8-byte Folded Reload
        ldp     x23, x22, [sp, #256]    // 8-byte Folded Reload
        ldp     x25, x24, [sp, #240]    // 8-byte Folded Reload
        ldp     x27, x26, [sp, #224]    // 8-byte Folded Reload
        ldr     x28, [sp, #208]         // 8-byte Folded Reload
        ldp     d9, d8, [sp, #192]      // 8-byte Folded Reload
        ldp     d11, d10, [sp, #176]    // 8-byte Folded Reload
        ldp     d13, d12, [sp, #160]    // 8-byte Folded Reload
        ldp     d15, d14, [sp, #144]    // 8-byte Folded Reload
        add     sp, sp, #304            // =304
        ret

Sorry for having proposed an incorrect workaround.

Still, one way to avoid this seems to be commenting out only x8-15 and compile with -O3, but that depends the program on the -O behavior thus maybe inappropriate to be put into such a project.

... I'll try to look for other solutions.

@devinamatthews
Copy link
Member

No problem, I have a feeling that either the kernel will have to be modified to use one less register, or maybe we will get lucky and one or more of the clobbers are not actually used in the kernel.

@devinamatthews
Copy link
Member

Fixed by #506

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants