versions (1.x.x) SIGSEGV in OSX ND4J CPU #8156

Closed
DavidGOrtega opened this issue Aug 27, 2019 · 52 comments
Labels: Bug (Bugs and problems), LIBND4J

Comments

@DavidGOrtega

Issue Description

SIGSEGV in OSX with all versions above 0.9.1

Version Information

Affects all the versions after 0.9.1

  • Deeplearning4j version

1.0.0-beta4
1.0.0-beta3
1.0.0-beta2
1.0.0-beta
1.0.0-alpha

  • Platform information (OS, etc)
    MacOS Catalina 10.15 Beta (19A487l)
    MacBook Pro (13-inch, 2018, Four Thunderbolt 3 Ports)
    2.3 GHz Intel Core i5
    16GB

Additional Information

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fff711010b4, pid=21672, tid=0x0000000000001103
#
# JRE version: Java(TM) SE Runtime Environment (8.0_202-b08) (build 1.8.0_202-b08)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.202-b08 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libc++abi.dylib+0x30b4]  _ZNK10__cxxabiv120__si_class_type_info27has_unambiguous_public_baseEPNS_19__dynamic_cast_infoEPvi+0x4
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/davidgortega/Projects/searchbox.ai/demos2/hs_err_pid21672.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>1.0.0-beta2</version>
</dependency>

@Test
public void amazingTest_then_Fail() {
    INDArray S = Nd4j.zeros(1, 1);
}

@saudet
Contributor

saudet commented Aug 28, 2019 via email

@petrychenko

I just updated my macOS to 10.15 (Catalina). It is not a beta anymore.
ND4J version: 1.0.0-beta4.
The latest AdoptOpenJDK 11 fails as @DavidGOrtega described (AdoptOpenJDK build 11.0.4+11, Aug 23).
But when I switched to the latest Oracle JDK 11, the problem went away. I'm not sure whether it is an ND4J problem.
I'm going to post it to the AdoptOpenJDK bug tracker.

@saudet
Contributor

saudet commented Oct 9, 2019

I see, it probably has something to do with some interaction between GCC/libstdc++ and Xcode's Clang/libc++. libnd4j has been relying on OpenMP, which Xcode doesn't provide, but @raver119 plans to move away from that. That would allow us to use Xcode only and probably work around these kinds of issues...

@saudet saudet added C++ LIBND4J Bug Bugs and problems labels Oct 9, 2019
@DavidGOrtega
Author

But when I switched to the latest Oracle JDK 11, the problem went away. I'm not sure whether it is an ND4J problem.

@petrychenko That's interesting. I'm not sure how it would affect this, but it's definitely worth a try.

My Java version, as seen in the previous log, is:

java version "1.8.0_202"
Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)

@raver119
Contributor

raver119 commented Oct 9, 2019

I upgraded to Catalina yesterday, so I'll check it out... But yes, as @saudet mentioned, we're working on a replacement for OpenMP right now, so we hope to switch to Apple Clang for macOS really soon.

@ff-will

ff-will commented Oct 22, 2019

Made the mistake of upgrading to 10.15 over the weekend and now I'm blocked because of this. I tried

  • Oracle JDK 8
  • OpenJDK 8
  • OpenJDK 11

with always the same result. :-(

@raver119
Contributor

raver119 commented Oct 22, 2019 via email

@PMacho

PMacho commented Oct 26, 2019

Latest AdoptOpenJDK 13 worked for me.

@mimkorn

mimkorn commented Oct 29, 2019

Latest AdoptOpenJDK 13 did not fix it for me.

@nimishatandon

I am facing a similar issue using nd4j 1.0.0-beta on a Linux server, where the server crashes while doing inference. Any suggestions on what could be happening there?
The issue is intermittent and I am unable to reproduce it.

@ff-will

ff-will commented Nov 2, 2019

@PMacho What does "worked for me" mean? According to my tests

  • the JVM does not crash anymore
  • but deeplearning4j still does not work on MacOS 10.15 (how could it?)

@PMacho

PMacho commented Nov 4, 2019

@PMacho What does "worked for me" mean? According to my tests

  • the JVM does not crash anymore
  • but deeplearning4j still does not work on MacOS 10.15 (how could it?)

Well, I didn't check for correct function. I only tried the quick start example. It didn't crash, and it started printing log output indicating that it was training the network.

@treo
Member

treo commented Nov 4, 2019

@nimishatandon that is a separate problem. The problem here is specifically about macOS. Please open a separate issue for your problem.

@raver119 raver119 removed the C++ label Nov 10, 2019
@gdagley

gdagley commented Nov 13, 2019

Is there any update on this? I had to upgrade to Catalina over the weekend for work 🙄 and now my project isn't working. If there is a fix I can try out, I would be willing to help test possible solutions.

@raver119
Contributor

The pull request is merged and snapshots are being built, so we'll be testing on random Macs in the next few days.

@raver119
Contributor

The PR was merged about an hour ago.

@raver119
Contributor

AFAIK, the issue is resolved. If, for some reason, you're still able to reproduce it, please reopen this issue or file a new one.

@ff-will

ff-will commented Nov 18, 2019

That's great! Thanks @raver119 !
I'm having some troubles with the GitHub UX though. Shouldn't there be a link to the commits, info about what branch this was merged to or a possible release version? Sorry for the stupid questions...

@raver119
Contributor

The fix is already on eclipse master and available in the daily snapshots.

@raver119
Contributor

As for a release: we hope it will be up early next week.

@zhangy10

@raver119 Thanks for that. Can I ask how to switch from beta5 to snapshots in the dl4j example projects? Should I just change the parent pom.xml and replace every 1.0.0-beta5 with 1.0.0-SNAPSHOT under <properties>? I tried that, but I still get this issue. Is there anything else I should update in the pom.xml? Many thanks.

@treo
Member

treo commented Nov 18, 2019

http://deeplearning4j.org/docs/latest/deeplearning4j-config-snapshots
This describes how to use snapshots in general.
You basically have to do both: add the appropriate snapshots repository and change the version to 1.0.0-SNAPSHOT.

@zhangy10

@treo Thanks for that, it works. I had forgotten to add <repositories> to the pom.xml.

I hope this fix makes it into a release soon. One should be careful when updating macOS. Anyway, thanks again.

@YEXINGZHE54

The problem still exists on 1.0.0-beta5 with Java 1.8 HotSpot.

@raver119
Contributor

raver119 commented Nov 21, 2019 via email

@alonsoir

No, it is not working in 1.0.0-beta5 with Java 1.8.

deeplearning4j/deeplearning4j-examples#927

@raver119
Contributor

Sure it doesn't. The fix was applied in snapshots only, and will also be available in the upcoming 1.0.0-beta6.

http://deeplearning4j.org/docs/latest/deeplearning4j-config-snapshots

@alonsoir

alonsoir commented Nov 29, 2019

I have tried the SNAPSHOT versions and it doesn't work. I have posted the log in #927.

OK, I am adding the repository tag. Testing it now.

This is the pom.xml

This is the stacktrace after running clean package -Djavacpp.platform=macosx-x86_64

@raver119
Contributor

Great. Please show your pom.xml and the output (or crash log) you got.

@alonsoir

I have updated the above comment with pom.xml and stacktrace. Thank you @raver119

@AlexDBlack
Contributor

@alonsoir I don't see the crashlog in the examples issue or here.
By crash log, we mean the file like this: /Users/aironman/gitProjects/deeplearning4j-examples/hs_err_pid13740.log

@alonsoir

Ok, got it.

https://pastebin.com/iuMEU0Jd

@AlexDBlack
Contributor

@alonsoir you're definitely still using 1.0.0-beta5, not snapshots.
You can see "1.0.0-beta5" in the crash log (filenames/paths); it'd have "1.0.0-SNAPSHOT" instead if you were on snapshots.

@raver119
Contributor

[libstdc++.6.dylib+0xd2e1] __dynamic_cast+0x71

This particular library comes from GCC, so you're definitely still using 1.0.0-beta5. The current macOS snapshots (and the upcoming release) are built with Clang.

Same with these libraries:

0x000000010e756000  /Users/aironman/.javacpp/cache/dl4j-examples-1.0.0-beta5-bin.jar/org/bytedeco/mkldnn/macosx-x86_64/libgcc_s.1.dylib
0x00000001576ce000  /Users/aironman/.javacpp/cache/dl4j-examples-1.0.0-beta5-bin.jar/org/bytedeco/mkldnn/macosx-x86_64/libgomp.1.dylib
0x000000015770d000  /Users/aironman/.javacpp/cache/dl4j-examples-1.0.0-beta5-bin.jar/org/bytedeco/mkldnn/macosx-x86_64/libstdc++.6.dylib

They are not used anymore.
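
The two checks described above (the version string embedded in the cached library paths, and the presence of GCC runtime libraries) can be automated with a short scan of the crash log. This is only a minimal sketch and not part of ND4J or JavaCPP: the class name `CrashLogCheck` and its methods are hypothetical, and the version pattern assumes the 1.0.0-alpha/beta/SNAPSHOT naming seen in this thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not an ND4J or JavaCPP API.
class CrashLogCheck {
    // Assumption: versions look like 1.0.0-beta5 or 1.0.0-SNAPSHOT, as in this thread.
    private static final Pattern VERSION =
            Pattern.compile("1\\.0\\.0-(?:SNAPSHOT|alpha|beta\\d*)");

    // GCC runtime libraries named in the comment above.
    private static final List<String> GCC_LIBS =
            Arrays.asList("libgcc_s", "libgomp", "libstdc++");

    /** Collects the distinct version strings that appear in the given log lines. */
    static Set<String> versionsIn(List<String> lines) {
        Set<String> found = new LinkedHashSet<>();
        for (String line : lines) {
            Matcher m = VERSION.matcher(line);
            while (m.find()) {
                found.add(m.group());
            }
        }
        return found;
    }

    /** Returns the log lines that mention one of the GCC runtime libraries. */
    static List<String> gccRuntimeLines(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            for (String lib : GCC_LIBS) {
                if (line.contains(lib + ".")) {
                    hits.add(line);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CrashLogCheck <path to hs_err_pid*.log>");
            return;
        }
        // ISO-8859-1 accepts any byte sequence, so odd bytes in the log do not abort the read.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        System.out.println("Versions found in paths: " + versionsIn(lines));
        for (String line : gccRuntimeLines(lines)) {
            System.out.println("GCC runtime library: " + line.trim());
        }
    }
}
```

If the reported versions include 1.0.0-beta5, or any GCC runtime line is printed, the run that crashed was most likely not using the snapshot build.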

@alonsoir

alonsoir commented Nov 29, 2019

OK, I thought that changing the parent's pom was enough. I will change every pom.xml in the related projects, that is, dl4j-examples and shared-utilities.
I first compiled and installed the shared-utilities jar file, then did the same in the dl4j-examples project.

There are compile errors.

I changed the shared-utilities pom.xml and the dl4j-examples pom.xml to use the SNAPSHOT version.

@alonsoir

alonsoir commented Nov 29, 2019

The TFGraphMapper.getInstance method is no longer present in nd4j-api-1.0.0-SNAPSHOT.

@AlexDBlack
Contributor

The TFGraphMapper.getInstance method is no longer present in nd4j-api-1.0.0-SNAPSHOT.

Just use the static methods - TFGraphMapper.importGraph etc

@alonsoir

Thank you @AlexDBlack, I will try to apply the fix. In the meantime, I will wait for the release of beta6.

Do you have a roadmap for the release of the stable version? Thank you.

@AlexDBlack
Contributor

Full 1.0.0 should be the next release after the upcoming 1.0.0-beta6 release.
At this point that should be somewhere around end of Q1 2020, but that could change.
Still a few more API changes required before we are happy to do the full 1.0.0, but we're getting close.

@paul-anasuya

any other quickfix @AlexDBlack ?

@elmodeer

elmodeer commented Dec 3, 2019

I am having the same issue, even with the snapshot version. Here is my pom.xml

I think the problem is that I am still running the beta5 version, but I don't know how. I have changed the version and added the repository tag, but looking at the log file I found the following entries.

0x000000012542b000 /Users/Hesham/.javacpp/cache/mkl-dnn-0.20-1.5.1-macosx-x86_64.jar/org/bytedeco/mkldnn/macosx-x86_64/libgcc_s.1.dylib
0x0000000125265000 /Users/Hesham/.javacpp/cache/mkl-dnn-0.20-1.5.1-macosx-x86_64.jar/org/bytedeco/mkldnn/macosx-x86_64/libgomp.1.dylib
0x00000001273a9000 /Users/Hesham/.javacpp/cache/mkl-dnn-0.20-1.5.1-macosx-x86_64.jar/org/bytedeco/mkldnn/macosx-x86_64/libstdc++.6.dylib
0x000000012763a000 /Users/Hesham/.javacpp/cache/mkl-dnn-0.20-1.5.1-macosx-x86_64.jar/org/bytedeco/mkldnn/macosx-x86_64/libiomp5.dylib
0x00000001277e3000 /Users/Hesham/.javacpp/cache/mkl-dnn-0.20-1.5.1-macosx-x86_64.jar/org/bytedeco/mkldnn/macosx-x86_64/libmklml.dylib
0x000000012ce20000 /Users/Hesham/.javacpp/cache/mkl-dnn-0.20-1.5.1-macosx-x86_64.jar/org/bytedeco/mkldnn/macosx-x86_64/libmkldnn.0.dylib

@raver119
Contributor

raver119 commented Dec 3, 2019

No, you don't:

   <dependency>
      <artifactId>nd4j-native</artifactId>
      <groupId>org.nd4j</groupId>
      <version>1.0.0-beta5</version>
    </dependency>
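
A leftover dependency like the one above is easy to miss in a multi-module pom. One way to double-check what is actually resolved is to inspect the JVM classpath at run time. The sketch below is a heuristic under stated assumptions, not an ND4J facility: the class name `ClasspathVersionCheck` is hypothetical, and it assumes that jar file names start with `nd4j` or `deeplearning4j` and carry the version, which does not hold for shaded or "bin" jars.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not an ND4J API.
class ClasspathVersionCheck {
    /**
     * Returns classpath entries that look like ND4J/DL4J jars
     * but do not have "SNAPSHOT" in their file name.
     */
    static List<String> nonSnapshotEntries(String classpath, String pathSeparator) {
        List<String> result = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(pathSeparator))) {
            String name = new File(entry).getName();
            // Assumption: artifact jars are named nd4j-*.jar or deeplearning4j-*.jar.
            boolean dl4jArtifact = name.startsWith("nd4j") || name.startsWith("deeplearning4j");
            if (dl4jArtifact && name.endsWith(".jar") && !name.contains("SNAPSHOT")) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path");
        List<String> stale = nonSnapshotEntries(classpath, File.pathSeparator);
        if (stale.isEmpty()) {
            System.out.println("No non-snapshot ND4J/DL4J jars found on the classpath.");
        } else {
            for (String entry : stale) {
                System.out.println("Non-snapshot jar on classpath: " + entry);
            }
        }
    }
}
```

Running this with the same classpath as the failing program lists any release jars that are still being pulled in. Maven's `mvn dependency:tree` gives the same information at build time.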

@AlexDBlack
Contributor

@paul-anasuya

any other quickfix @AlexDBlack ?

I'm not following. Do you mean: is there any fix other than switching to snapshots, waiting for the next release (coming very soon), or not using a Mac?
No, no other options, sorry.

@jlpassm

jlpassm commented Dec 4, 2019

I'm on a Mac and I installed Catalina. :-(

I got a clean compile via mvn and then started getting the errors others have found above.

I have tried switching my pom.xml from 1.0.0-beta5 to 1.0.0-SNAPSHOT.

Now I can no longer get a clean compile. I run into:

deeplearning4j/examples/modelimport/tensorflow/LoadTensorFlowMNISTMLP.java:[67,39] cannot find symbol (and a couple more like this in LoadTensorFlowMNISTMLP.java).

IntelliJ doesn't show any error in the editor. I am using java version "1.8.0_231".

Anybody see anything like this?

I'm scared to try to revert my Catalina to Mojave.

@gdagley

gdagley commented Dec 4, 2019

Did you also add

  <repositories>
    <repository>
      <id>snapshots-repo</id>
      <url>https://oss.sonatype.org/content/repositories/snapshots</url>
      <releases>
        <enabled>false</enabled>
      </releases>
      <snapshots>
        <enabled>true</enabled>
      </snapshots>
    </repository>
  </repositories>

I have been able to run on Catalina with the 1.0.0-SNAPSHOT using the info from http://deeplearning4j.org/docs/latest/deeplearning4j-config-snapshots

@jlpassm

jlpassm commented Dec 4, 2019

Yeah, I added that block to the pom.xml files where Maven complained. I also added this to the main pom.xml:
<java.version>1.8</java.version>
<nd4j.version>1.0.0-SNAPSHOT</nd4j.version>
<dl4j.version>1.0.0-SNAPSHOT</dl4j.version>
<datavec.version>1.0.0-SNAPSHOT</datavec.version>
<arbiter.version>1.0.0-SNAPSHOT</arbiter.version>
<rl4j.version>1.0.0-SNAPSHOT</rl4j.version>

I tried mvn clean install and mvn package -U outside of IntelliJ, and the Maven compile inside of IntelliJ. I'm pretty sure I am running with Java 1.8 everywhere.

It makes no sense to me that IntelliJ doesn't mark that line with an error but the mvn compile does.

@paul-anasuya

paul-anasuya commented Dec 6, 2019 via email

@raver119
Contributor

raver119 commented Dec 6, 2019

Please show your full pom.xml as a gist: https://gist.github.com/

@fanweihua

Does 1.0.0-beta6 fix this issue? I can't find anything related to this issue in the beta6 release notes.

@saudet
Contributor

saudet commented Dec 23, 2019

Yes, this should be fixed in 1.0.0-beta6.

@jlpassm

jlpassm commented Dec 23, 2019

The 1.0.0-beta6 release fixed the issue for me. Thanks to all involved, I really didn't want to roll back Catalina on my Mac.

@deeplearning4j deeplearning4j locked as resolved and limited conversation to collaborators Dec 23, 2019