Unable to run detections on multiple threads #6649

Closed
sjaiswal25 opened this Issue Oct 30, 2018 · 9 comments


sjaiswal25 commented Oct 30, 2018

I am trying to run detections on two images in parallel using YOLOv2 on a GTX 1050 Ti. I spawned two threads for this, and both threads call the same function to load the pre-trained model. I assume the crash has something to do with both threads trying to access the model. Below is a subset of the error log, which I think points to the root cause. The detailed error log is attached in the file below:
hs_err_pid7506.log

13:55:09.988 [Thread-0] INFO org.deeplearning4j.nn.modelimport.keras.Hdf5Archive - Unexpected end-of-input: was expecting closing '"' for name
 at [Source: java.io.StringReader@583425c9; line: 1, column: 40001]
# A fatal error has been detected by the Java Runtime Environment:
# SIGSEGV (0xb) at pc=0x00007f8fdddfaffc, pid=7506, tid=0x00007f8fde40c700
# JRE version: Java(TM) SE Runtime Environment (8.0_162-b12) (build 1.8.0_162-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.162-b12 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libhdf5.so.101+0x8effc] H5C_unprotect+0x29c
Failed to write core dump. Core dumps have been disabled.

Kindly help me resolve the issue.

agibsonccc closed this Oct 30, 2018

agibsonccc reopened this Oct 30, 2018

Member

agibsonccc commented Oct 30, 2018

@sjaiswal25 Saw the Gitter discussion from @AlexDBlack. I'll leave this open, but in general we don't support ndarrays being thread safe. Could you please post the hs_pid error.log from this too?

Contributor

raver119 commented Oct 30, 2018

Load model once, in one thread and then use ParallelInference.

Contributor

raver119 commented Oct 30, 2018

I mean - this crash happens in HDF5 reader, that's a bit too far from actual ND4J/DL4J code base. So probably something isn't thread safe there as well. So do what you've been told: load model in one thread, and then use ParallelInference.
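A minimal sketch of the "load once, then fan out" pattern described above, using only the Java standard library. `Model`, `loadModel`, and `detect` are hypothetical stand-ins, not DL4J classes; in DL4J the fan-out step would be handled by ParallelInference rather than a hand-rolled pool. The stand-in model is immutable, which is the only reason sharing it across workers is safe in this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: the model is loaded once on the calling thread, then shared with workers. */
final class LoadOnceSketch {

    /** Counts how many times the (hypothetical) loader ran. */
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    /** Hypothetical immutable model; stands in for a loaded network. */
    static final class Model {
        private final int offset;

        Model(int offset) {
            this.offset = offset;
        }

        /** Placeholder "inference": returns the input shifted by a constant. */
        int detect(int image) {
            return image + offset;
        }
    }

    /** Hypothetical loader; a real one would read the model file from disk. */
    static Model loadModel() {
        LOAD_COUNT.incrementAndGet();
        return new Model(100);
    }

    /** Loads the model once, then runs one detection task per image on a fixed pool. */
    static List<Integer> detectAll(List<Integer> images, int workers) {
        Model model = loadModel(); // single-threaded load, before any worker starts
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (Integer image : images) {
                Callable<Integer> task = () -> model.detect(image);
                pending.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : pending) {
                results.add(f.get()); // results come back in submission order
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("detection failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(detectAll(Arrays.asList(1, 2, 3), 2));
    }
}
```

The point of the sketch is only the ordering: no worker thread ever touches the loader, so a loader that is not thread safe is never entered concurrently.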


sjaiswal25 commented Oct 30, 2018

In https://gitter.im/deeplearning4j/deeplearning4j/archives/2018/01/18, Marcus Klang (@marcusklang) mentions that he used multithreading to run detections in parallel and then switched to ParallelInference (PI) for a better implementation. So I was curious to try it out. Anyway, I will try using PI, but could you please point me to an example of using PI in such a scenario (if one exists), or somewhere to get a head start? Thanks.

Member

AlexDBlack commented Oct 30, 2018

Right, I'm thinking that if this is just a threading issue in the HDF5 code (i.e., not our code), then we'll just put a lock/synchronized method around loading and call it a day.

@sjaiswal25 As for an example: there's not much to it... https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/inference/ParallelInferenceExample.java
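The lock/synchronized idea mentioned in this comment can be sketched roughly as below. This is an illustration under assumptions, not the change that went into DL4J: `UnsafeLoader` is a hypothetical stand-in for a native reader that must not be entered by two threads at once, and `loadGuarded` and `LOAD_LOCK` are names made up for this sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: serialize calls into a loader that is not thread safe. */
final class GuardedLoaderSketch {

    /** Hypothetical loader that records the peak number of threads inside it at once. */
    static final class UnsafeLoader {
        private final AtomicInteger active = new AtomicInteger();
        private final AtomicInteger maxActive = new AtomicInteger();

        String load(String path) throws InterruptedException {
            int now = active.incrementAndGet();
            maxActive.accumulateAndGet(now, Math::max);
            Thread.sleep(20); // simulate slow native I/O
            active.decrementAndGet();
            return "model:" + path;
        }

        int maxConcurrentCallers() {
            return maxActive.get();
        }
    }

    /** Single process-wide lock guarding every call into the unsafe loader. */
    private static final Object LOAD_LOCK = new Object();

    /** Only one thread at a time may be inside loader.load(). */
    static String loadGuarded(UnsafeLoader loader, String path) throws InterruptedException {
        synchronized (LOAD_LOCK) {
            return loader.load(path);
        }
    }

    /** Starts several threads that all load through the guard; returns the peak concurrency seen. */
    static int peakConcurrency(int threads) {
        UnsafeLoader loader = new UnsafeLoader();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final String path = "model-" + i + ".h5"; // hypothetical file names
            workers[i] = new Thread(() -> {
                try {
                    loadGuarded(loader, path);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for loaders", e);
        }
        return loader.maxConcurrentCallers();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrent loaders: " + peakConcurrency(4));
    }
}
```

With the guard in place the peak stays at 1 no matter how many threads call the loader. The trade-off is that loads become sequential, which is usually acceptable because loading happens once and inference is where the parallelism matters.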


sjaiswal25 commented Oct 30, 2018

Will try it :)

Member

AlexDBlack commented Oct 31, 2018

I was able to reproduce this. It looks like yes, the HDF5 library we use simply isn't thread safe.

AlexDBlack referenced this issue Oct 31, 2018

Merged

DL4J Fixes #6648

AlexDBlack added a commit that referenced this issue Oct 31, 2018

DL4J Fixes (#6648)
* Fix issue with bn mean/var updates being divided by minibatch

* Final batch norm fixes/tests

* #6635 Add exception when trying to use CSV/LineRecordReader without first initializing it

* #5577 Align SameDiff conv op same mode config names

* #6306 scala version suffix

* #6306 change scala versions script

* #6639 Fix KNN test issue

* #6649 add synchronization to avoid thread safety issues with hdf5 library

* Trigger CI

* Fix for conv3d TF import

* CuDNN fixes + remove outdated tests (mode now supported)

* Fix depthwise conv2d + add gradient check

* RNG seed for potentially flaky test

* Partial fix for spark test failures (broadcasts + multiple spark contexts)

lock bot commented Nov 30, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Nov 30, 2018
