[REQUEST] Support for heterogeneous compilation #2244
To give any kind of suggestion about the best way of dealing with this, it would be good to have some examples of these various outputs and how they are generated. In general, it's not really a big deal that something has to be chained. We don't do it often, but it's been used before in our code, most recently in this tool: https://github.com/compiler-explorer/compiler-explorer/blob/master/lib/tooling/pvs-studio-tool.js#L103
Let me run through some of the basic stuff I do with the SYCL compilation. If you run `dpcpp -S`, the resulting assembly file is a bundle that still contains the device code; `clang-offload-bundler` can extract that piece, and `llvm-spirv` converts it to SPIR-V. Ahead-of-time compilation of the SPIR-V then produces actual CPU code via `opencl-aot`. The GPU compilation pipeline instead involves fewer steps (just one tool, `ocloc`). GitHub is really picky about what it lets me upload, so here's a tarball of all the files produced or consumed by this script:

```bash
dpcpp -S default.cpp -O3 -g -mllvm --x86-asm-syntax=intel # Roughly the command line that CE would use
dpcpp -c default.cpp -O3 -g # Counterpart when compiled with -c, if you're curious
# Extract the kernel, convert to SPIR-V output
clang-offload-bundler --unbundle --inputs=default.s --outputs=kernel.bc --targets=sycl-spir64-unknown-unknown-sycldevice --type=s
llvm-spirv -o kernel.spv kernel.bc -spirv-debug-info-version=legacy
# SPIR-V -> CPU
opencl-aot --device=cpu -march=avx512 -o kernel.elf kernel.spv
llvm-objcopy --dump-section=.ocl.obj=kernel-real.elf kernel.elf
objdump -d kernel-real.elf -l --insn-width=16 -M intel > objdump-cpu.txt
# SPIR-V -> GPU
ocloc -device skl -output kernel-gpu.bin -output_no_suffix -file kernel.spv -spirv_input
mkdir dump-gpu
ocloc disasm -file kernel-gpu.bin -device skl -dump dump-gpu
```

(The last step is supposed to produce a disassembly of the GPU kernel in `dump-gpu`.)
I'm watching a talk about SYCL, so I got reminded about this. It seems that the examples that @mattgodbolt gave in #2836 don't work anymore, and I'm not sure why: https://godbolt.org/z/5nK4qsW63. Then I tried icx with an example from a GitHub repo, which seems to work, except that there's no device pane available and the binary mode gives linker errors (which I think we can get rid of): https://godbolt.org/z/eeqGTK4sb. Some questions come to mind.
Some progress on the icx front. Here's a compilation with both assembly and device code. Downside: we'll need to disassemble the code somehow, and because that depends on the type of device, we'll need some fancy flexible logic. We also need to extract the binary so there's no loss of information like there is now.
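As a sketch of what that device-dependent logic might dispatch on (the filename and the fallback are illustrative assumptions; the magic numbers are the real SPIR-V and LLVM bitcode signatures):

```bash
# Hypothetical dispatch on the extracted device blob's format.
magic=$(xxd -p -l 4 device.bin)
case "$magic" in
  03022307) spirv-dis device.bin ;;   # SPIR-V magic 0x07230203, little-endian
  4243c0de) llvm-dis device.bin ;;    # LLVM bitcode magic "BC\xC0\xDE"
  *)        objdump -d device.bin ;;  # fall back to a native disassembler
esac
```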
https://clang.llvm.org/docs/ClangOffloadBundler.html#archive-unbundling mentions that it's supposed to be compatible with ar, but that's not true.
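For the record, this is roughly the documented flow that fails in practice (a sketch; the archive name and the exact target triple are assumptions, and the flag spelling follows the unbundle command used earlier in this thread):

```bash
# Per the linked docs, --type=a should unbundle device code from an ar archive
# produced by the offload toolchain; in practice this did not behave as described.
clang-offload-bundler --unbundle --type=a \
  --inputs=libkernels.a \
  --targets=sycl-spir64-unknown-unknown \
  --outputs=device.a
```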
Running that results in a file, and then I'm not sure how to disassemble it. It starts with "BC" as seen in the screenshot, has some debugging file info in the middle, and it ends with some symbols. (spirv-dis, objdump and nvdisasm don't work.)
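A likely explanation, sketched below: "BC" at the start of a file is the LLVM bitcode magic, so the extracted device code is LLVM IR rather than SPIR-V, which would explain why spirv-dis and the object disassemblers all fail. Following the pipeline from earlier in the thread (filenames are placeholders):

```bash
file kernel.bc                       # LLVM bitcode files begin with the "BC" magic
llvm-spirv -o kernel.spv kernel.bc   # bitcode -> SPIR-V, as in the earlier script
spirv-dis kernel.spv                 # now a SPIR-V disassembler can read it
```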
Oh, it's described here: #2244 (comment)
My example doesn't produce a SPIR-V thing, so I can display some of the contents with… oh.
I can reproduce #2244 (comment) partially with icpx 2022.0.1.71 and extra arguments. Some of the steps I can now run, but others I cannot, and one of the tools I also cannot find, although I did try.
The toolchains for this are confusing and not well documented. I will try to continue what you were doing with SPIR-V. I think opencl-aot depends on an OpenCL runtime. Did you source the icx setvars.sh? If not, then you need OCL_ICD_FILENAME plus some extra environment setup.
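The exact variables that followed didn't survive in this thread; purely as a hypothetical illustration of that kind of setup (paths assume a default oneAPI install, and the standard ICD loader variable is spelled OCL_ICD_FILENAMES):

```bash
# Hypothetical sketch only: point the OpenCL ICD loader at Intel's CPU runtime
# and make its shared libraries findable. Real paths vary by oneAPI version.
export OCL_ICD_FILENAMES=/opt/intel/oneapi/compiler/latest/linux/lib/x64/libintelocl.so
export LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin:$LD_LIBRARY_PATH
```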
Ahh, I did try a bunch of those, but not all the paths in the ldPath; that does seem to work. The rest of the CPU part then also works (so opencl-aot + llvm-objcopy + objdump). But isn't the CPU part exactly the same as the assembly stored under…?
This works, but it would require a separate compile for device. Is that ok? When I add -g, I don't seem to get the code anymore, just metadata.
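The command itself is missing above; as a hedged sketch of what a separate device-only compile can look like with DPC++ (`-fsycl-device-only` is a documented flag, but whether its default output is LLVM bitcode or SPIR-V has varied between releases, so treat the conversion step as an assumption):

```bash
# Sketch: compile only the SYCL device side, then convert and disassemble.
icpx -fsycl -fsycl-device-only -O2 -o kernel.bc default.cpp
llvm-spirv -o kernel.spv kernel.bc   # only needed if the output is bitcode
spirv-dis kernel.spv
```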
Oh interesting. I suspect it would be a bit intense for production. I'll give it some more thought.
I will talk to the engineer who maintains the icpx driver to find a way to get SPIR-V that can be disassembled without redundant compiles. I believe there are separate compiles for host and device code even when you use the single command.
That is the host code that launches the kernel. The compiler only generates SPIR-V code for the device because it does not know what type of device you will be targeting.
Ooooh... it looks like the llvm-spirv version from the icx installation is maybe bugged? I didn't test with -O3 earlier, so the two variations are hard to compare at the moment; I'll have to redo that to confirm whether they're actually the same.
Another issue for you to think about is how the generation of device code should be integrated. People will want to see some combination of SPIR-V, x86, and Gen GPU code. For NVIDIA, I suppose it will be PTX plus the actual targets. For CPU & GPU, they might want to see the code for multiple architectures.

Driver compilation could be integrated under 'Add tool'. The driver needs to know what architecture you are targeting, and I see that you already have a 'tool arguments' text box, so that would work well. ocloc, opencl-aot, and llvm-spirv would be separate tools, and for each tool you could only see one architecture at a time.

Another path is to take advantage of the compiler's ahead-of-time compilation. In the original compile, you can specify actual targets for device code, and the compiler will invoke the driver to compile the SPIR-V to binary and then pack all the targets into a single binary. All the options to control that could go on the single command line for the compiler. CE would then have to unpack all the target code from the fat binary; I guess the pieces could be shown in the '+' dropdown where you can select the single Device today. Can that handle the case where every compile populates the list with all the options? Note that ahead-of-time also requires a fully linked binary.

The 'add tool' approach seems simpler; ahead-of-time might be more flexible but more complicated: unpacking fat binaries, discovering all the targets, creating entries in the dropdown, forcing the fully linked binary. What do you think?
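For concreteness, a hedged sketch of the ahead-of-time route on a single command line (`spir64_x86_64` and `spir64_gen` are documented `-fsycl-targets` values; the backend device list is an assumption for illustration):

```bash
# Sketch: AOT-compile device code for CPU and Gen GPU and pack everything,
# host code included, into one fat binary that CE would then have to unpack.
icpx -fsycl -fsycl-targets=spir64_x86_64,spir64_gen \
     -Xsycl-target-backend=spir64_gen "-device skl" \
     -O2 default.cpp -o fat.bin
```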
I was just thinking about that, yes. I was considering that maybe, for simplicity's sake, we could pass it on to an LLVM-IR editor and add more compilers to facilitate these things. But then again, it would need to be an unfiltered version of the IR, and it's currently filtered by the same settings that you give the compiler, so we would have to always send the raw IR to the client as well.

An extra tool or a dropdown in the device window or the compiler could be an option too, but it would greatly complicate our backend handling, I think. You're right, though, that the current device view showing only one device is also a complication. I'll give it some more thought.

For now I think I'm going to merge #4019 as is, and then iterate on it. I'm ok with breaking the behavior if we can get something better later.
I see something similar when I add the…
spirv-dis outputs a lot of stuff, but ends with an error.
Does the trunk llvm-spirv avoid the error? I suspect Intel has done some extensions, and we don't have the right version of all the tools and options.
I end up with…
This is now live to play around with.
Running…
This has been live for a while; not sure why I didn't close it.
I've been looking at how to add support for Intel's DPC++ compiler, which involves heterogeneous compilation. This is going to be related to the OpenACC (#2067) and ComputeCpp (#1339) issues. I'm opening this issue specifically to target how to display the results of heterogeneous compilation.
There are four existing forms of heterogeneous compilers I'm aware of, among them OpenMP offloading via `#pragma omp target` (this is valid in C, C++, and Fortran compilers).

The SYCL and OpenMP offload mechanisms for Intel's compilers are pretty similar: they compile the host code for x86, compile the device code (usually to SPIR-V), and then bundle both into the same output file via `clang-offload-bundler`. The SPIR-V can further be compiled to CPU code via `opencl-aot` or GPU code via `ocloc`. Both of these tools are distributed via separate projects: `opencl-aot` uses your OpenCL drivers (although they can be coaxed to use a specific driver via environment variables), and `ocloc` comes via the Intel Graphics Compiler.

I've managed to get (barely) working flows for SYCL compilation in two different manners, although the code is a few months stale, and it is probably better to build a proper solution from scratch. One of the most frustrating issues is that a lot of the tooling involved here doesn't actually support the usage patterns that would make single command lines work, so I've had to resort to writing my own scripts.
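A quick way to see the bundling at work (a sketch; the marker string is what clang-offload-bundler uses for text bundles, but the exact triple it names is version-dependent):

```bash
# The -S output is a text bundle: host assembly plus device IR, delimited by
# __CLANG_OFFLOAD_BUNDLE__ markers that the unbundler keys on.
dpcpp -S default.cpp -o default.s
grep __CLANG_OFFLOAD_BUNDLE__ default.s
```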
Describe the solution you'd like
IMHO, the best option to move forward here is to provide a "host" assembly view as well as a "device" assembly view. When compiling to an executable, it's possible to detect that the output has both host code and device code. The tool necessary to do so is probably compiler-specific, although several compilers might reuse the same tool. This tool would take as input arguments the file to check for the presence of heterogeneous code and a directory to extract the host and device pieces to, and it would output on stdout a JSON-formatted summary of which devices are present and the filename corresponding to each device's code. There might be other per-file metadata necessary.
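Purely as an illustration of that proposed summary (every field name here is hypothetical; no such format exists yet):

```json
{
  "heterogeneous": true,
  "host": { "file": "out/host.s", "arch": "x86_64" },
  "devices": [
    { "name": "spir64", "kind": "spirv", "file": "out/kernel.spv" },
    { "name": "nvptx64", "kind": "ptx", "file": "out/kernel.ptx" }
  ]
}
```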
To handle targets like SPIR-V that can themselves be compiled to other device targets, I propose a new set of configuration entries for handling device compilers, which would work much like the existing compiler entries. These device compilers would be orthogonal to the language compilers. Device compilers also have the ability to target particular target variants (e.g., the OpenCL driver for Intel can target SSE4.2, AVX, AVX2, or AVX-512), but I'm kind of unsettled as to whether these modes should be separate targets, suggested arguments for the device compiler, or maybe yet another specifiable dropdown.
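If these mirrored CE's existing `.properties`-style compiler configuration, a hypothetical shape could be the following (the `devicecompilers` keys are invented here for illustration; CE has no such configuration today):

```
# Hypothetical sketch modeled on CE's compiler entries; none of these keys exist.
devicecompilers=oclaot:ocloc
devicecompiler.oclaot.exe=/opt/intel/oneapi/compiler/latest/linux/bin/opencl-aot
devicecompiler.oclaot.targets=sse4.2:avx:avx2:avx512
devicecompiler.ocloc.exe=/usr/local/bin/ocloc
devicecompiler.ocloc.targets=skl:icllp:tgllp
```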
In my prototype, I tried pushing the device compilers in the same general theme as regular compilers. However, this runs into two main issues. The first is that both `opencl-aot` and `ocloc` refuse to produce output in easy steps; I had to chain together several commands to get them to output something that the regular CE infrastructure could ingest. The second issue is that you tend to end up with binary-only or source-only outputs with these tools; the regular dropdowns of the asm output view aren't really useful.

Describe alternatives you've considered
The other option I've prototyped is just supporting device output only (à la CUDA currently). While there are some options that control the ability of Intel's SYCL compiler to output device code only, the issues I ran into in the last paragraph of the previous section (most notably, the need to run several commands instead of a single compiler command) mean that there was still a decent amount of effort needed in the JS code to actually support the option.
Additional context
I have no experience with gcc's side of things, and I haven't personally touched any CUDA code for almost a decade. It would be wonderful if someone knowledgeable could chime in with some explanation of the feasibility of doing heterogeneous code identification.