Segfault on TF tutorials #77
Hi @mirh, thanks for the report. Unfortunately without more information it's hard to say what's going wrong; essentially all this tells us is that AMD's OpenCL implementation is crashing while trying to build the program. I know that at least one version of the driver crashes when you try to use doubles on a device that does not support them, but I am sure there could be other reasons. What does the device code look like in this example?
It's the example you recommend here, nothing more.
@mirh what's your operating system and OpenCL library? On Linux, the Mesa OpenCL implementation "Clover" is quite buggy - I can only get OpenCL to work with the proprietary AMD code.
Yeah, I absolutely know. |
Great news, everyone! |
Ending? Could you elaborate, please? |
I meant, the instructions "on the top" of the stack trace were the same.
Hi @mirh, sorry it's taken me a while to get back to you. Getting back up to speed: you get build failures in Tensorflow, and likewise in the gaussian blur sample? Some small changes were made over the holidays, could you maybe try them?
Yes, exactly. No difference though.
Hmmmm... Depending on how much time you want to spend on this, you could start removing code from the kernel in the Gaussian Blur sample, to see if it ever passes the compilation... that would at least let us know what sort of construct is causing the failures. I can understand if you'd rather not do that though, as it is something of a slog.
Commenting this, then replacing |
So it doesn't like the maths functions? Huh, I wonder if we've changed something there... So I found some AMD hardware and tried the Gaussian myself. It failed. Turns out, when I made changes to it last, I added this: const auto pi = std::atan(1) * 4; to calculate the value of pi as a const. The return type of the function is... It's possible that somewhere in Tensorflow, we're using doubles - I don't really know the code at all, and I thought we were quite careful about that sort of thing. I can't look at this more tonight but might be able to look through the kernels on Monday to see if it's the same thing happening in there! Thanks for your help tracking this down. (If you're interested, you can use the "extract-ir" script in the SDK to see the SPIR code, which is where I tracked down the use of double).
OK, pushed here: 19be0cf |
Cool! Now all |
As I mentioned, unfortunately on this older AMD driver that we have to use it straight up crashes when you use double, anywhere in the code, even if it's not in the kernel you're trying to run. There's no way we can recover from that unfortunately 😄 It might be useful to warn at the compiler level but honestly I think it'd be far too noisy. Lots of hardware supports it, too, and lots of drivers don't segfault but instead report an error!
You are right. In other news |
Doubles and Halfs should be optional in TF. However, it seems like the mentioned PR is stuck. @jwlawson and I are introducing a config option that enables / disables half and double here: jwlawson/tensorflow@5ec5964 It should be in https://github.com/lukeiwanski/tensorflow/tree/dev/amd_gpu soon-ish. After that happens, someone needs to go through all registered Ops and use the TF_CALL_SYCL_NUMBER_TYPES macro for registration. @mirh would you like to give it a go? :)
Guess like. |
In the end, while I intended to look at the sycl files to see if |
Aaaand fixed. Thank you all. |
Aaaand.. It's here again (computecpp 0.7.0, tensorflow 1.8). |
Ok so.. lukeiwanski/tensorflow@0fc77bd with 0.6.1 still segfaults... |
Well, FML. But.. lukeiwanski/tensorflow@591d829 actually runs? Without even lukeiwanski/tensorflow#205 ?! I'll now try to bisect the last handful of commits, but I cannot understand what.. Build dependencies must have changed in this month or so to change code behavior? Same commits, different results. |
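For anyone following along, the bisection workflow used here is plain git bisect. Sketched below on a throwaway repository, where a marker file stands in for "build TensorFlow and run the tutorial" (the real run would rebuild and test at each step):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name bisect-demo

# Five commits; pretend everything from c3 onwards segfaults.
for i in 1 2 3 4 5; do
  echo "$i" > marker
  git add marker
  git commit -qm "c$i"
done

# bad = tip, good = first commit; the run script plays the role of the
# reproduce step, exiting non-zero when the "segfault" appears.
git bisect start HEAD HEAD~4
git bisect run sh -c 'test "$(cat marker)" -lt 3' >/dev/null
git log -1 --format='first bad: %s' refs/bisect/bad
```

`git bisect run` automates the good/bad marking: exit code 0 means good, anything else (except 125, which skips) means bad.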
Hi @mirh,
diff --git a/third_party/sycl/crosstool/CROSSTOOL.tpl b/third_party/sycl/crosstool/CROSSTOOL.tpl
index 3078b5b534..c62b5b93e1 100755
--- a/third_party/sycl/crosstool/CROSSTOOL.tpl
+++ b/third_party/sycl/crosstool/CROSSTOOL.tpl
@@ -175,6 +175,7 @@ toolchain {
cxx_flag: "-DEIGEN_HAS_CXX11_MATH=1"
cxx_flag: "-Wno-unused-variable"
cxx_flag: "-Wno-unused-const-variable"
+ cxx_flag: "-sycl-compress-name"
unfiltered_cxx_flag: "-Wno-builtin-macro-redefined"
unfiltered_cxx_flag: "-D__DATE__=\"redacted\""
This patch will make the compiler output kernel names that are hashed versions of the "true" kernel name, which means they will have a fixed length. Looking at the repo at the commits you mention, this does seem to be consistently one of the changes. Thanks so much for persevering with this!
I went further with bisecting, and indeed lukeiwanski/tensorflow@3cc8566 was the culprit. So.. progressing to whatever else might be. EDIT: fuck, I just noticed I put it in the *wrong* lines (local abi, instead of cross_target)
Well, I just put it on both, and it still didn't work (with the tip) |
If you have the time, that would be very useful! It's a shame that didn't work, I really thought we were onto something there...
I'm still doing my testing, but are you sure that option goes as a cxx_flag, rather than compiler_flag or linker_flag or something?
cxx_flags are passed to the compiler when compiling C++ source files (I say this because the compiler could be, for example, gcc -x c++, in some cases). It's definitely not a linker flag. In fact, that particular flag is specific to compute++, so no other compiler will be able to understand what it might mean. I'd be surprised if there were issues, because lots of flags get added in the same way by bazel, and if they didn't get added you would see many more compile errors (or sometimes runtime errors - std::cerr messages about "missing kernels" would indicate that this had happened). Technically there's a way to check - but it's a little bit involved (a nontrivial change, with some thought attached, I'm afraid). If you're willing to keep bisecting to find the point where it broke, that might be our new best chance!
Ok so, updates.. |
lukeiwanski/tensorflow@f850c60 is the breaking commit |
Oh - maybe it's the half-type code. @lukeiwanski halfs are a configurable option, right? Maybe we can turn it off... |
hmm that's a good point - disabling halfs might not be enough.. let me try something.
@Rbiessy could you take a look at this? I believe we need to add a check around the
That should do the trick: lukeiwanski/tensorflow#245 |
@mirh can you give it a spin at lukeiwanski/tensorflow@1e0dd42 ? |
It did it! (together with --sycl-compress-name) |
So.. Are you like doing anything for that? |
What do you mean? This option makes the compiler hash the name of the kernel and output that instead because of some buggy drivers, that's all. It makes debugging harder though so we tend not to enable it by default. |
Makes sense. |
Since it's a compiler flag, there's no way for the ComputeCpp runtime to identify which driver it is running on and change the flag accordingly. |
Lol, right. |
That's a good idea. We'll try to add it to the FAQ on our website - this issue is more likely to affect TensorFlow code, because of the frankly huge kernel names, but could strike anywhere.
|
Gdb backtrace with MNIST