Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REQUEST] Support for heterogeneous compilation #2244

Closed
jcranmer-intel opened this issue Oct 1, 2020 · 24 comments
Closed

[REQUEST] Support for heterogeneous compilation #2244

jcranmer-intel opened this issue Oct 1, 2020 · 24 comments
Labels
request Request for something

Comments

@jcranmer-intel
Copy link
Contributor

I've been looking at how to add support for Intel's DPC++ compiler, which involves heterogeneous compilation. This is going to be related to adding support for OpenACC (#2067) and the computecpp (#1339) issues. I'm opening up this issue to specifically target how to display the results of heterogeneous compilation.

There are four existing forms of heterogeneous compilers I'm aware of:

  • CUDA (already exists in CE, but only does device compilation)
  • OpenACC (I have no personal experience here)
  • OpenMP's #pragma omp target (this is valid in C, C++, and Fortran compilers)
  • SYCL

The SYCL and OpenMP offload mechanisms for Intel's compilers are pretty similar, and work by compiling the host code for x86, and then compiling the device code (usually to SPIR-V), and then bundling them with the same output file via clang-offload-bundler. The SPIR-V can further be compiled to CPU code via opencl-aot or GPU code via ocloc. Both of these tools are distributed via separate projects: opencl-aot uses your opencl drivers (although they can be coaxed to use a specific driver via environment variables), and ocloc comes via the Intel Graphics Compiler.

I've managed to get (barely) working flows for SYCL compilation in two different manners, although the code is a few months stale, and it is probably better to get a proper solution from scratch. One of the most frustrating issues is that a lot of the tooling involved here doesn't actually support the usage patterns that would make single command lines work, so I've had to resort to writing my own scripts.

Describe the solution you'd like

IMHO, the best option to move forward here is to provide a "host" assembly view as well as a "device" assembly view. When compiling the output of an executable, it's possible to detect that this output has both host code and device code. The tool necessary to do so is probably compiler-specific, although several compilers might reuse the same tool. This tool would take as input args the file to check for presence of heterogeneous code and a directory to extract host and device pieces to, and would output on stdout a JSON-formatted summary of which devices are present and the filename corresponding to that device code. There might be other per-file metadata necessary.

To handle targets like SPIR-V that can itself be compiled to other device targets, I propose a new set of configuration entries for handling device compilers that work kind of like the compiler entries. These device compilers would be orthogonal to the language compilers. Device compilers also have an ability to target particular target variants (e.g., the OpenCL driver for Intel can target SSE4.2, AVX, AVX2, or AVX-512), but I'm kind of unsettled as to whether these modes should be separate targets, suggested arguments for the device compiler, or maybe yet another specifiable drop-down.

In my prototype, I tried pushing the device compilers in the same general theme as regular compilers. However, this runs into two main issues. The first is that both opencl-aot and ocloc refuse to produce output in easy steps; I had to chain together several commands to get them to output something that the regular CE infrastructure could ingest. The second issue is that you tend to end up with binary-only or source-only outputs with these tools--the regular drop downs of the asm output view aren't really useful.

Describe alternatives you've considered

The other option I've prototyped is just supporting device output only (à la CUDA currently). While there are some options that control the ability of Intel's SYCL compiler to output device output only, the issues I ran into in the last paragraph of the previous section (most notably, the need to do several commands instead of a single compiler command) means that there was still a decent amount of effort needed in the JS code to actually support the option.

Additional context

I have no experience with gcc's side of things, and I haven't personally touched any CUDA code for almost a decade. It would be wonderful if someone knowledgeable could chime in with some explanation of the feasibility of doing heterogeneous code identification.

@partouf
Copy link
Contributor

partouf commented Oct 3, 2020

To give any kind of suggestion of what would be the best way of dealing with this, it would be good to have some examples of these various outputs and how they are generated.

In general, it's not really a big deal that something has to be chained. We don't do it often, but it's been used before in our code. (Most recently in this tool -> https://github.com/compiler-explorer/compiler-explorer/blob/master/lib/tooling/pvs-studio-tool.js#L103)

@jcranmer-intel
Copy link
Contributor Author

jcranmer-intel commented Oct 9, 2020

Let me run through some of the basic stuff I do with the SYCL compilation.

If you run dpcpp -S example.cpp, you get a bundled .s file that looks like this:

# __CLANG_OFFLOAD_BUNDLE____START__ sycl-spir64-unknown-unknown-sycldevice
BCÀÞ5^T^@^@^E^@^@^@b^L0$MY¾fmû´O^[È$D^A2^E^@!^L^@^@¨<8d>^@^@^K^B!^@^B^@^@^@^V^@^@^@^G<81>#<91>AÈ^DI^F^P29<92>^A<84>^L%^E^H^Y^^^D<8b>b<80>^\E^BB<92>^KBä^P2^T8^H^XK
[ ... ]
# __CLANG_OFFLOAD_BUNDLE____END__ sycl-spir64-unknown-unknown-sycldevice

# __CLANG_OFFLOAD_BUNDLE____START__ host-x86_64-unknown-linux-gnu
  .text
  .intel_syntax noprefix
[ ... ]
# __CLANG_OFFLOAD_BUNDLE____END__ host-x86_64-unknown-linux-gnu

(-c produces similar output files, except things are contained in ELF sections instead of using comments to identify begin/end sections, although it's slightly wonkier). The sycl section in both cases is an unoptimized LLVM bitcode file, in binary format, irrespective of any other common command-line options such as -O3 or -S.

The resulting .bc file can be converted into a SPIR-V file with llvm-spirv (although there is a textual format of SPIR-V, the llvm-spirv does not produce that format). In regular linking usage, the .spv file is then converted with clang-offload-wrapper into an object file containing data symbols containing the SPIR-V data (via a few more steps that aren't important), that is added to the link line.

Ahead-of-time compilation of the .spv can be done using the opencl-aot tool. This produces an elf file which can't be processed with GNU binutils (Filef format not recognized), but works with the llvm- variants. Of the resulting sections:

Sections:
Idx Name          Size     VMA              Type
  0               00000000 0000000000000000 
  1 .ocl.ir       000226b0 0000000000000000 
  2 .ocl.meta     000001d0 0000000000000000 
  3 .ocl.opt      00000e78 0000000000000000 
  4 .ocl.obj      00022c90 0000000000000000 
  5 .ocl.ver      00000004 0000000000000000 
  6               0000002e 0000000000000000 

.ocl.obj contains a nested ELF image, so running llvm-objcopy produces the real .elf file that you can disassemble with objdump -d as normal.

The GPU compilation pipeline instead involves fewer steps (just one ocloc to get a binary image), but the disassembly step breaks each kernel into a separate .asm output file, and the debugging information is apparently not represented in the output step. It's also lower priority for me, as I don't know all that much about the GPU output format.

Github is really picky about what it lets me upload, so here's a tarball of all the files produced or consumed by this script:

dpcpp -S default.cpp -O3 -g -mllvm --x86-asm-syntax=intel # Roughly the command line that CE would use
dpcpp -c default.cpp -O3 -g # Counterpart when compiled with -c, if you're curious

# Extract the kernel, convert to SPIR-V output
clang-offload-bundler --unbundle --inputs=default.s --outputs=kernel.bc --targets=sycl-spir64-unknown-unknown-sycldevice --type=s
llvm-spirv -o kernel.spv kernel.bc -spirv-debug-info-version=legacy

# SPIR-V -> CPU
opencl-aot --device=cpu -march=avx512 -o kernel.elf kernel.spv
llvm-objcopy --dump-section=.ocl.obj=kernel-real.elf kernel.elf
objdump -d kernel-real.elf -l --insn-width=16 -M intel > objdump-cpu.txt

# SPIR-V -> GPU
ocloc -device skl -output kernel-gpu.bin -output_no_suffix -file kernel.spv -spirv_input
mkdir dump-gpu
ocloc disasm -file kernel-gpu.bin -device skl -dump dump-gpu

(The last step is supposed to produce a .asm file in dump-gpu/, but there's a library loading error which I haven't tracked down yet. But, as I said, GPU output is less important))

@partouf
Copy link
Contributor

partouf commented Jul 13, 2022

I'm watching a talk about SYCL, so I got reminded about this.

It seems that the examples that @mattgodbolt gave in #2836 don't work anymore, and I'm not sure why https://godbolt.org/z/5nK4qsW63

Then I tried to use icx and use an example from a github repo, which seems to work except there's no device pane available and the binary mode gives linker errors (that we can get rid of, I think) https://godbolt.org/z/eeqGTK4sb

Some questions that come to mind:

  • Maybe we should add the SYCL things as a library, so the user doesn't have to input the include paths manually
  • I don't know what the best solution is for things like --cuda-path and I'm not sure where the spirv path is with the .bc files
  • It seems that Clang has some unbundling code (https://github.com/compiler-explorer/compiler-explorer/blob/main/lib/compilers/clang.js#L106), but ICX seems to have the wrong compilerType, I think we have to switch that to Clang

@partouf
Copy link
Contributor

partouf commented Sep 2, 2022

Some progress on the icx front. Here's a compilation with both assembly and device code. Downside: we'll need to disassemble the code somehow. And because it depends on the type of device, we'll need some fancy flexible logic. We also need to extract the binary so there's no loss of information like there is now.

Screenshot from 2022-09-02 23-12-13

@partouf
Copy link
Contributor

partouf commented Sep 2, 2022

https://clang.llvm.org/docs/ClangOffloadBundler.html#archive-unbundling mentions it's supposed to be compatible with ar, but that's not true.
And there's a clang-offload-extract application, but it's not clear on how to use it.
And there's an -unbundle option for clang-offload-bundler https://github.com/llvm/llvm-project/blob/main/clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp#L112 but that's also not clear.

@partouf
Copy link
Contributor

partouf commented Sep 2, 2022

$ clang-offload-bundler -list --type s --inputs output.s

results in

sycl-nvptx64-nvidia-cuda
host-x86_64-unknown-linux-gnu

And then $clang-offload-bundler -unbundle --type s --inputs output.s --outputs sycl.bc --targets sycl-nvptx64-nvidia-cuda outputs a file with the contents of the sycl-nvptx64-nvidia-cuda bits.

Not sure how to disassemble that. It starts with "BC" as seen in the screenshot, has some debugging file info in the middle and it ends with some symbols. (spirv-dis, objdump and nvdisasm don't work)

@partouf
Copy link
Contributor

partouf commented Sep 2, 2022

OH it's described here #2244 (comment)
Facepalm moment

@partouf
Copy link
Contributor

partouf commented Sep 2, 2022

my example doesn't produce a spirv thing so $ llvm-spirv -o kernel.spv sycl.bc -spirv-debug-info-version=legacy
gives InvalidTargetTriple: Expects spir-unknown-unknown or spir64-unknown-unknown. Actual target triple is nvptx64-nvidia-cuda

I can display some of the contents with $ /opt/compiler-explorer/clang-14.0.0/bin/llvm-bcanalyzer --dump sycl.bc... but it's a lot of ... stuff

@partouf
Copy link
Contributor

partouf commented Sep 2, 2022

... oh

$ /opt/compiler-explorer/clang-14.0.0/bin/llvm-dis sycl.bc
produces a sycl.ll with LLVM IR it seems

@partouf
Copy link
Contributor

partouf commented Sep 3, 2022

I can reproduce #2244 (comment) partially:

With icpx 2022.0.1.71 and extra arguments -fsycl -fsycl-targets=spir64, I can generate the same llvm-ir as in #4019

But I can also now run llvm-spirv -o kernel.spv kernel.bc -spirv-debug-info-version=legacy manually without errors, because it produced for a spirv target.

But I cannot run opencl-aot --device=cpu -march=avx512 -o kernel.elf kernel.spv, this will result in the following error:

OpenCL platform ID is empty
OpenCL platform name is empty
Failed to find any of these OpenCL platforms:
  Intel(R) OpenCL
  Intel(R) CPU Runtime for OpenCL(TM) Applications

I can also not find ocloc on my machine.

I did try spirv-dis on the spv file, like the spirv compiler does for openclc, but this fails pretty early in the process:

$ /opt/compiler-explorer/SPIRV-Tools-master/build/tools/spirv-dis kernel.spv 
; SPIR-V
; Version: 1.1
; Generator: Khronos LLVM/SPIR-V Translator; 14
; Bound: 4451
; Schema: 0
               OpCapability Addresses
               OpCapability Linkage
               OpCapability Kernel
               OpCapability Int64
               OpCapability GenericPointer
          %1 = OpExtInstImport "OpenCL.std"
error: 7: Invalid extended instruction import 'SPIRV.debug'

@rscohn2
Copy link
Contributor

rscohn2 commented Sep 3, 2022

The toolchains for this are confusing and not well documented. I will try to continue what you were doing with spirv. I think opencl-aot depends on an opencl runtime. Did you source the icx setvars.sh? If not then you need OCL_ICD_FILENAME + the following:

compiler.icx202210.libPath=/opt/compiler-explorer/intel-cpp-2022.1.0.137/compiler/latest/linux/compiler/lib/intel64_lin:/opt/compiler-explorer/intel-cpp-2022.1.0.137/compiler/latest/linux/compiler/lib/ia32_lin:/opt/compiler-explorer/intel-cpp-2022.1.0.137/compiler/latest/linux/lib:/opt/compiler-explorer/intel-cpp-2022.1.0.137/tbb/2021.6.0/lib/intel64/gcc4.8

@partouf
Copy link
Contributor

partouf commented Sep 3, 2022

The toolchains for this are confusing and not well documented. I will try to continue what you were doing with spirv. I think opencl-aot depends on an opencl runtime. Did you source the icx setvars.sh? If not then you need OCL_ICD_FILENAME + the following:

compiler.icx202210.libPath=/opt/compiler-explorer/intel-cpp-2022.1.0.137/compiler/latest/linux/compiler/lib/intel64_lin:/opt/compiler-explorer/intel-cpp-2022.1.0.137/compiler/latest/linux/compiler/lib/ia32_lin:/opt/compiler-explorer/intel-cpp-2022.1.0.137/compiler/latest/linux/lib:/opt/compiler-explorer/intel-cpp-2022.1.0.137/tbb/2021.6.0/lib/intel64/gcc4.8

Ahh, I did try a bunch of those but not all the paths in the ldPath, that does seem to work

The rest of the CPU part then also works (so opencl-aot + llvm-objcopy + objdump)

But isn't the CPU part exactly the same as the assembly stored under host-x86_64-unknown-linux-gnu, or is it different somehow..

@rscohn2
Copy link
Contributor

rscohn2 commented Sep 3, 2022

This works:

icpx -fsycl -O3 -fsycl-device-only -fno-sycl-use-bitcode default.cpp
/home/rscohn1/local/projects/compiler-explorer/SPIRV-Tools/build/tools/spirv-dis default-sycl-spir64-unknown-unknown.spv -o default.spirv.dis

It would require a separate compile for device. Is that ok? When I add -g, I don't seem to get the code anymore, just metadata.

@partouf
Copy link
Contributor

partouf commented Sep 3, 2022

This works:

icpx -fsycl -O3 -fsycl-device-only -fno-sycl-use-bitcode default.cpp
/home/rscohn1/local/projects/compiler-explorer/SPIRV-Tools/build/tools/spirv-dis default-sycl-spir64-unknown-unknown.spv -o default.spirv.dis

It would require a separate compile for device. Is that ok? When I add -g, I don't seem to get the code anymore, just metadata.

Oh interesting. I suspect it would be a bit intense for production. I'll give it some more thought.

@rscohn2
Copy link
Contributor

rscohn2 commented Sep 3, 2022

I will talk to the engineer that maintains the icpx driver to find a way to get spirv that can be disassembled without redundant compiles. I believe there are separate compiles for host and device code even when you use the single command.

@rscohn2
Copy link
Contributor

rscohn2 commented Sep 3, 2022

But isn't the CPU part exactly the same as the assembly stored under host-x86_64-unknown-linux-gnu, or is it different somehow.

That is the host code that launches the kernel. The compiler only generates spirv code for the device because it does not know what type of device you will be targeting.

@partouf
Copy link
Contributor

partouf commented Sep 3, 2022

Ooooh.. It looks like the -spirv-debug-info-version=legacy should probably just be --spirv-debug with the trunk version of llvm-spirv that we have. Then most of it can be disassembled.

The llvm-spir version from the icx installation seems to be bugged maybe?

I didn't test with -O3 earlier so the two variations are hard to compare atm, will have to do that to confirm if they're maybe the same.

@rscohn2
Copy link
Contributor

rscohn2 commented Sep 3, 2022

Another issue for you to think about is how the generation of device code should be integrated. People will want to see some combination of spirv, x86, gen GPU. For nvidia, I suppose it will be ptx + actual targets. For cpu & gpu, they might want to see the code for multiple architectures.

Driver compilation could be integrated under 'Add tool'. The driver needs to know what architecture you are targeting. I see that you already have a 'tool arguments' text box, so that would work well. ocloc, opencl-aot, llvm-spirv would be separate tools. For each tool, you could only see 1 architecture at a time.

Another path is to take advantage of the compiler's ahead of time compilation. In the original compile, you can specify actual targets for device code, and it will invoke the driver to compile spirv to binary and then pack all the targets into a single binary. All the options to control that could go to the single command line for the compiler. Then CE would have to unpack all the target code from the fat binary. I guess they could be shown in the '+' dropdown where you can select the single Device today. Can that handle the case where every compile populates the list with all the options? Ahead of time requires a fully linked binary with a main so it would not fit so well into the case where you are just studying one function.

'add tool' approach seems simpler. ahead of time might be more flexible but more complicated: unpacking fat binaries, discovering all the targets, creating entries in the drop down, forcing the fully linked binary.

What do you think?

@partouf
Copy link
Contributor

partouf commented Sep 3, 2022

I was just thinking about that yes.

I was considering maybe for simplicity's sake we could pass it on to a LLVM-IR editor and add more compilers to facilitate these things, but then again it would need to be an unfiltered version of the IR and it's now filtered by the same settings that you give the compiler. So we would have to always send the raw IR to the client as well.

Extra tool or dropdown in the device window or the compiler could be an option as well, but would greatly complicate our backend handling, I think.

But you're right that the current device view of showing only 1 device is also a complication.

I'll give it some more thought.

For now I think I'm going to merge #4019 as is. And then iterate over it. I'm ok with breaking the behavior if we can get something better later.

@rscohn2
Copy link
Contributor

rscohn2 commented Sep 3, 2022

Ooooh.. It looks like the -spirv-debug-info-version=legacy should probably just be --spirv-debug with the trunk version of llvm-spirv that we have. Then most of it can be disassembled.

I see something similar I add the -g flag:

icpx -g -fsycl -O3 -fsycl-device-only -fno-sycl-use-bitcode default.cpp
/home/rscohn1/local/projects/compiler-explorer/SPIRV-Tools/build/tools/spirv-dis default-sycl-spir64-unknown-unknown.spv -o default.dis

spirv-dis outputs a lot of stuff, but gets this error:

error: 1608: Invalid source language operand: 33

Does the trunk llvm-spirv avoid the error? I suspect intel has done some extensions and we don't have the right version of all the tools and options.

@partouf
Copy link
Contributor

partouf commented Sep 3, 2022

I end up with error: 1647: Id is 0 with today's llvm-spir build. But the output in stdout is still significant I think (but maybe im wrong)

@partouf
Copy link
Contributor

partouf commented Sep 3, 2022

For now I think I'm going to merge #4019 as is. And then iterate over it. I'm ok with breaking the behavior if we can get something better later.

This is now live to play around with

@jcranmer-intel
Copy link
Contributor Author

jcranmer-intel commented Sep 7, 2022

Running llvm-spirv --spirv-tools-dis will convert an LLVM bitcode file into a textual SPIR-V file, but it requires llvm-spirv to be compiled with the spirv-tools. I don't believe this is accessible via the clang driver interface though.

@partouf
Copy link
Contributor

partouf commented Nov 23, 2022

Has been live for a while, not sure why I didn't close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
request Request for something
Projects
None yet
Development

No branches or pull requests

3 participants