
Dl4j - cuda stuck waiting at org.nd4j.nativeblas.Nd4jCuda$NativeOps.streamSynchronize #6518

Closed
bhorkar opened this issue Oct 3, 2018 · 14 comments

Comments

@bhorkar commented Oct 3, 2018

I have attached the jstack output:
dump_thread.log
pom.xml.txt
LSTMPrediction.java.txt

Scenario when this issue was observed:

  1. The RDD was exported to a directory (dataset_%.bin files).
  2. FileDataSetIterator is used to iterate over the data (attached in the .java file above); see the sketch after this list.
  3. The .fit operation gets stuck waiting on the input.
  4. CUDA 9.0 is used (the issue is also observed with CUDA 9.2). However, there is no issue with nd4j-native.
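
For reference, the overall pattern is roughly the following (a minimal sketch, not the attached code, assuming the FileDataSetIterator(File, int) constructor; the directory path, batch size, and buildNetwork() helper are placeholders, and the real configuration is in LSTMPrediction.java):

import java.io.File;

import org.deeplearning4j.datasets.iterator.file.FileDataSetIterator;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;

public class FitFromExportedDataSets {

    public static void main(String[] args) {
        // Directory holding the DataSet files previously exported from the RDD
        // (dataset_*.bin). The path is a placeholder.
        File exportDir = new File("/data/exported-datasets");

        // Iterate over the pre-saved DataSet objects; batch size 32 is arbitrary here.
        FileDataSetIterator iterator = new FileDataSetIterator(exportDir, 32);

        // Network configuration as in the attached LSTMPrediction.java.
        MultiLayerNetwork net = buildNetwork();

        // With nd4j-native this completes; with nd4j-cuda the call hangs inside
        // Nd4jCuda$NativeOps.streamSynchronize (see the attached jstack dump).
        net.fit(iterator);
    }

    private static MultiLayerNetwork buildNetwork() {
        // Placeholder: see the attached LSTMPrediction.java for the actual config.
        throw new UnsupportedOperationException("see LSTMPrediction.java");
    }
}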

Platform:
$ uname -a
Linux sr-mlgpu05 4.15.0-29-generic #31-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux

@raver119 (Contributor) commented Oct 3, 2018

Of course there are no issues with nd4j-native: no CUDA methods are used there.

Is your issue reproducible without Spark?

@raver119 (Contributor) commented Oct 3, 2018

Also, what's your GPU model name?

@bhorkar (Author) commented Oct 3, 2018

I tried using the same pom.xml file and the MnistClassifier.java example on this machine. Same issue, so it has nothing to do with Spark etc. Most likely related to the GPU?

GPU: V100.
nvidia-smi
Wed Oct 3 11:46:41 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48 Driver Version: 390.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:3B:00.0 Off | 0 |
| N/A 35C P0 43W / 250W | 1653MiB / 32510MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:D8:00.0 Off | 0 |
| N/A 34C P0 35W / 250W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 17704 C java 1642MiB |

@raver119 (Contributor) commented Oct 4, 2018

Please try a different initial weights distribution, and tell me what happens.
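
For context, changing the initial weights distribution in a DL4J layer configuration looks roughly like this (a minimal sketch; the LSTM layer type, sizes, and the normal-distribution parameters are placeholders, not taken from the attached config):

import org.deeplearning4j.nn.conf.distribution.NormalDistribution;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.weights.WeightInit;

public class WeightInitVariants {

    // XAVIER init, as used in the attached configuration.
    static LSTM xavierLayer(int nIn, int nOut) {
        return new LSTM.Builder()
                .nIn(nIn).nOut(nOut)
                .weightInit(WeightInit.XAVIER)
                .build();
    }

    // A different initial weights distribution to test: an explicit normal distribution.
    static LSTM normalDistLayer(int nIn, int nOut) {
        return new LSTM.Builder()
                .nIn(nIn).nOut(nOut)
                .weightInit(WeightInit.DISTRIBUTION)
                .dist(new NormalDistribution(0.0, 0.01))
                .build();
    }
}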

@bhorkar (Author) commented Oct 4, 2018 (comment minimized)

@raver119 (Contributor) commented Oct 4, 2018

I'm afraid that's a bug which was considered fixed... Probably still not.

raver119 self-assigned this Oct 4, 2018

@bhorkar (Author) commented Oct 4, 2018 (comment minimized)

@raver119 (Contributor) commented Oct 4, 2018

The bug affects one RNG op and V100 devices.
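
For context, the RNG path can be exercised in isolation, independent of the training pipeline, to see whether a plain random-initialization op completes on the device (a minimal sketch; the 512x512 shape is arbitrary):

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class RngSmokeTest {

    public static void main(String[] args) {
        // A single random-uniform array; on the nd4j-cuda backend this runs on the GPU.
        INDArray w = Nd4j.rand(new int[]{512, 512});

        // Force the result to be computed and printed.
        System.out.println("sum = " + w.sumNumber());
    }
}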

@raver119 (Contributor) commented Oct 10, 2018

Hm. That's not the same bug.

@raver119 (Contributor) commented Oct 10, 2018

You've been using simple XAVIER as weight init, right?

@bhorkar (Author) commented Oct 10, 2018 (comment minimized)

@raver119 (Contributor) commented Oct 11, 2018

Thanks for highlighting this problem; the issue has been fixed.

raver119 closed this Oct 11, 2018

@AlexDBlack (Contributor) commented Oct 12, 2018

FYI: I've temporarily reverted the fix PR.
Once the fix has been re-applied, you can access it via snapshots:
https://deeplearning4j.org/docs/latest/deeplearning4j-config-snapshots
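
For reference, pulling snapshot builds generally means adding the Sonatype snapshots repository to pom.xml and switching the DL4J/ND4J version property to the -SNAPSHOT version listed on that page (a minimal sketch; the repository id and the exact version strings are placeholders, see the linked docs):

<!-- a minimal pom.xml sketch, not the full build file -->
<repositories>
    <repository>
        <id>sonatype-snapshots</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>

<properties>
    <!-- placeholders: use the snapshot version listed in the linked documentation -->
    <dl4j.version>1.0.0-SNAPSHOT</dl4j.version>
    <nd4j.version>1.0.0-SNAPSHOT</nd4j.version>
</properties>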

@lock (bot) commented Nov 11, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Nov 11, 2018
