OpenBLAS 0.3.8 issue on AMD Threadripper cpu #8747

Closed
shyrma opened this issue Mar 3, 2020 · 11 comments
Labels
Bug Bugs and problems High Priority

Comments

@shyrma
Contributor

shyrma commented Mar 3, 2020

OpenBLAS 0.3.8 crashes with SIGILL on an AMD Threadripper 3970X CPU.
We consistently get the following crash when running TFGraphTestAllHelper.log_determinant.rank3:

  • A fatal error has been detected by the Java Runtime Environment:
  • SIGILL (0x4) at pc=0x00007f86f33181ff, pid=130947, tid=0x00007f8839009700
  • C [libopenblas_nolapack.so.0+0x10a61ff] sgemm_kernel_direct+0x126f

This issue is a blocker for us, since we are basically unable to use the devbox for Java tests.

@raver119 raver119 added Bug Bugs and problems High Priority labels Mar 3, 2020
@saudet
Contributor

saudet commented Mar 3, 2020

I'm sure OpenBLAS 0.3.7 does the same. Work around that with OPENBLAS_CORETYPE, the same as was done for issue #4287.

@raver119
Contributor

raver119 commented Mar 4, 2020

Re: 0.3.7: I don't think so. We've been using this box for 3 months now, and I think we only ran into this problem after updating to 0.3.8. It's a pretty new issue there.

re: OPENBLAS_CORETYPE - nope, doesn't help.

@raver119
Contributor

raver119 commented Mar 4, 2020

@saudet you're right, 0.3.7 reproduces the same issue. So probably this crash is caused by the graph being added to resources.

I'll close this issue; I think we'll sort it out some other way.

@raver119 raver119 closed this as completed Mar 4, 2020
@saudet
Contributor

saudet commented Mar 4, 2020

Setting the OPENBLAS_CORETYPE environment variable to "Athlon" or "Core2" doesn't help? That's strange. That should disable pretty much all optimizations. Anyway, one more thing to try is to use MKL instead. The presets for OpenBLAS can load it with the system properties given here:
https://github.com/bytedeco/javacpp-presets/tree/master/openblas#documentation
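For anyone repeating the OPENBLAS_CORETYPE experiment from Java: a running JVM cannot change its own environment, so the variable has to be set before the process that loads OpenBLAS starts. The following is only a minimal sketch of one way to do that, by launching the test command as a child process. The class name `CoreTypeLauncher` and the helper `withCoreType` are hypothetical names made up for this illustration; only `ProcessBuilder` and the `OPENBLAS_CORETYPE` variable come from the JDK and from this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of JavaCPP or ND4J.
class CoreTypeLauncher {
    // Returns a ProcessBuilder for `command` whose environment sets
    // OPENBLAS_CORETYPE to the given value (e.g. "Core2" or "Athlon").
    static ProcessBuilder withCoreType(String coreType, List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(new ArrayList<>(command));
        pb.environment().put("OPENBLAS_CORETYPE", coreType);
        return pb;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: CoreTypeLauncher <coretype> <command...>");
            return;
        }
        // Everything after the first argument is the command to run.
        List<String> command = Arrays.asList(args).subList(1, args.length);
        Process p = withCoreType(args[0], command).inheritIO().start();
        // Propagate the child's exit code so a crash is still visible.
        System.exit(p.waitFor());
    }
}
```

Assuming the tests are started from the command line, this could be invoked as `java CoreTypeLauncher Core2 <your test command>`. Exporting the variable in the shell before starting the JVM has the same effect.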

@raver119
Contributor

raver119 commented Mar 4, 2020

Athlon doesn't help. We'll try Core2 now.

@shyrma
Contributor Author

shyrma commented Mar 4, 2020

I've just checked: same situation, no success unfortunately.

@saudet
Contributor

saudet commented Apr 29, 2020

This is probably the cause: OpenMathLib/OpenBLAS#2526

@raver119 raver119 reopened this Apr 29, 2020
@raver119
Contributor

Who's building OpenBLAS for JavaCPP?

saudet added a commit to bytedeco/javacpp-presets that referenced this issue Apr 30, 2020
@saudet
Contributor

saudet commented May 1, 2020

The changes from commit bytedeco/javacpp-presets@cb00315 have been deployed.
Please give it a try with 0.3.9-1.5.4-SNAPSHOT: http://bytedeco.org/builds/

@AlexDBlack
Contributor

ND4J tests (including those that were consistently crashing on this system) are confirmed passing on a 3990X with Linux (Ubuntu 20.04) with the following version modifications:
https://github.com/KonduitAI/deeplearning4j/compare/master...KonduitAI:ab_openblas_test?expand=1

@raver119
Contributor

raver119 commented May 1, 2020

Everything is fine on 3950 and 3970 as well 👍

@raver119 raver119 closed this as completed May 1, 2020

4 participants