
Deadlock during training with OMP_NUM_THREADS >= 8 #7637

Closed
tschut opened this issue Apr 29, 2019 · 27 comments

5 participants

@tschut commented Apr 29, 2019

Issue Description

When training a CNN on text classification, training hangs when using OMP_NUM_THREADS >= 8. For lower num_threads the performance increases almost linearly:
OMP_NUM_THREADS | Batches/sec
1 | 2.117
2 | 3.815
4 | 7.006
6 | 9.539

The (simple) network:

        MultiLayerConfiguration config = new NeuralNetConfiguration.Builder()
                .weightInit(WeightInit.RELU)
                .activation(Activation.LEAKYRELU)
                .updater(new Adam(0.01))
                .convolutionMode(ConvolutionMode.Same)
                .l2(0.001)
                .list()
                .layer(new ConvolutionLayer.Builder()
                        .kernelSize(3, 50)
                        .stride(1, 50)
                        .nIn(1)
                        .nOut(100)
                        .build())
                .layer(new GlobalPoolingLayer.Builder()
                        .poolingType(PoolingType.MAX)
                        .dropOut(0.7)
                        .build())
                .layer(new OutputLayer.Builder()
                        .lossFunction(LossFunctions.LossFunction.MCXENT)
                        .activation(Activation.SOFTMAX)
                        .nIn(100)
                        .nOut(AgeGroup.values().length - 1)
                        .build())
                .build();

Output of kill -3 in this gist: https://gist.github.com/tschut/730ebeff7039baed44e52d623c841334.

Version Information

  • snapshot version of DL4J and ND4J
  • running on CPU (no GPU) on Ubuntu 18.04
$ uname -a
Linux gpu-instance2 4.15.0-1029-gcp #31-Ubuntu SMP Thu Mar 21 09:40:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • processor info
$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) CPU @ 2.00GHz
Stepping:            3
CPU MHz:             2000.180
BogoMIPS:            4000.36
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            56320K
NUMA node0 CPU(s):   0-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat arch_capabilities
@tschut (Author) commented Apr 29, 2019

After setting PerformanceListener to run every iteration I see that it appears to always hang after the 5th iteration:

11:19:53.945 [main] INFO  org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
11:19:54.194 [main] INFO  org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for NativeOps: 8
11:19:54.261 [main] INFO  org.nd4j.nativeblas.Nd4jBlas - Number of threads used for BLAS: 8
11:19:54.265 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CPU]; OS: [Linux]
11:19:54.266 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Cores: [32]; Memory: [14.7GB];
11:19:54.266 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [MKL]
11:19:56.383 [main] INFO  o.d.m.e.loader.WordVectorSerializer - Projected memory use for model: [8.69 MB]
11:20:01.049 [main] INFO  c.l.p.t.c.a.AgeGroupClassifierCNN - Building network
11:20:01.092 [main] INFO  o.d.nn.multilayer.MultiLayerNetwork - Starting MultiLayerNetwork with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
11:20:01.127 [main] INFO  o.d.e.t.BaseEarlyStoppingTrainer - Starting early stopping training
11:20:01.902 [main] INFO  o.d.o.listeners.PerformanceListener - ETL: 0 ms; iteration 0; iteration time: 783 ms; samples/sec: 81.737; batches/sec: 1.277; 
11:20:02.073 [main] INFO  o.d.o.listeners.PerformanceListener - ETL: 0 ms; iteration 1; iteration time: 171 ms; samples/sec: 374.269; batches/sec: 5.848; 
11:20:02.220 [main] INFO  o.d.o.listeners.PerformanceListener - ETL: 0 ms; iteration 2; iteration time: 146 ms; samples/sec: 438.356; batches/sec: 6.849; 
11:20:02.329 [main] INFO  o.d.o.listeners.PerformanceListener - ETL: 0 ms; iteration 3; iteration time: 108 ms; samples/sec: 592.593; batches/sec: 9.259; 
11:20:02.436 [main] INFO  o.d.o.listeners.PerformanceListener - ETL: 0 ms; iteration 4; iteration time: 107 ms; samples/sec: 598.131; batches/sec: 9.346; 
11:20:02.512 [main] INFO  o.d.o.listeners.PerformanceListener - ETL: 0 ms; iteration 5; iteration time: 76 ms; samples/sec: 842.105; batches/sec: 13.158;

I've run the program ~10 times and it always hangs after the 5th iteration. When setting OMP_NUM_THREADS to 6 it runs fine (no other change, just setting the env variable).

@raver119 (Contributor) commented Apr 29, 2019

And it always hangs in the same place?

@tschut (Author) commented Apr 29, 2019

It would appear so, yes. Because of that I'd think it was something with the data, but then I'd expect problems when running with lower num_threads too...

@tschut (Author) commented Apr 29, 2019

I can try running multiple sessions with kill -3 to see if it's exactly the same place, but based on console output it looks like it is.

@AlexDBlack (Member) commented Apr 29, 2019

Just to double check it is MKL-DNN and not something in our code before that, can you also see if it deadlocks with the following?
Nd4jCpu.Environment.getInstance().setUseMKLDNN(false);
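For context, that call goes before any network is built or any op executes. A minimal fragment along those lines (a sketch: it assumes the nd4j-native backend, which provides `org.nd4j.nativeblas.Nd4jCpu`, and the `config` variable from the issue description):

```java
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.nativeblas.Nd4jCpu;

// Disable MKL-DNN globally, before any ops execute, to rule it out as the cause
Nd4jCpu.Environment.getInstance().setUseMKLDNN(false);

// ...then build and train the network as usual; `config` is the
// MultiLayerConfiguration from the issue description
MultiLayerNetwork net = new MultiLayerNetwork(config);
net.init();
```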

@tschut (Author) commented Apr 29, 2019

Nope, no more deadlock after adding that. To be 100% sure I removed the line again, rebuilt, and the deadlock is back in exactly the same place.

@raver119 (Contributor) commented Apr 29, 2019

So, it's MKL-DNN then...

@AlexDBlack (Member) commented Apr 30, 2019

I checked mkl-dnn github issues, didn't find anything related...
As far as release notes, this is the closest I could find - also not useful - https://github.com/intel/mkl-dnn/releases

There's a small chance that somehow we're using the library incorrectly (though, we have quite thorough tests to validate correctness and that it matches the built-in implementation). I'd like to rule that out first... we might have to isolate this and report it upstream.

@tschut Can you provide a complete minimal example we can run?
I've tried to reproduce locally based on the config (I can't reproduce) but I don't know the actual input sizes you are using. Thus my inability to reproduce might just be different input sizes, or just that it's not reproducible on my system (Windows, 5960x, 8C/16T).
https://gist.github.com/AlexDBlack/fc14f2eef7244fe8eecc3690137e89eb

@tschut (Author) commented Apr 30, 2019

I think it's going to be very difficult to get to a minimal, isolated example. What I tried this morning:

  • Changing my random seed --> it still deadlocks, but in a different iteration. So, it's related to the input data.
  • Changing minibatch-size from 64 to 1 --> no more deadlock. So, it's somehow related to the combination of things in a minibatch.

I'm using CnnSentenceDataSetIterator, so if I understand correctly the shape of the minibatch depends on the data. Then again, why wouldn't it deadlock with minibatch-size = 1?

Tried minibatch size of 8 --> deadlock. Minibatch size of 7 --> deadlock. Minibatch size of 2 --> deadlock.

I can try adding some debug statements to CnnSentenceDataSetIterator to find out the exact input size for the problematic minibatch, maybe that'll help?

@AlexDBlack (Member) commented Apr 30, 2019

I can try adding some debug statements to CnnSentenceDataSetIterator to find out the exact input size for the problematic minibatch, maybe that'll help?

That would definitely help. But you don't need to modify CnnSentenceDataSetIterator - just write a simple DataSetPreProcessor to print shapes, and add it to the iterator.
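A shape-printing preprocessor along those lines might look like this (a sketch: the class name and the `trainIter` variable are illustrative, and it assumes the DL4J/ND4J dependencies are on the classpath):

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.api.DataSet;
import org.nd4j.linalg.dataset.api.DataSetPreProcessor;

// Prints the shape info of every minibatch before it reaches the network
public class ShapePrintingPreProcessor implements DataSetPreProcessor {
    @Override
    public void preProcess(DataSet dataSet) {
        print("features", dataSet.getFeatures());
        print("labels", dataSet.getLabels());
        print("featuresMask", dataSet.getFeaturesMaskArray());
        print("labelsMask", dataSet.getLabelsMaskArray());
    }

    private void print(String name, INDArray arr) {
        // Mask arrays may be null
        System.out.println(name + ": " + (arr == null ? "null" : arr.shapeInfoToString()));
    }
}
```

Attach it with `trainIter.setPreProcessor(new ShapePrintingPreProcessor());` before calling fit.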

@tschut (Author) commented Apr 30, 2019

Output:

10:30:12,726 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]
10:30:12,727 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]
10:30:12,728 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [jar:file:/home/tschut/wammes-classifier/target/classifier-1.0-SNAPSHOT-bin.jar!/logback.xml]
10:30:12,742 |-INFO in ch.qos.logback.core.joran.spi.ConfigurationWatchList@627551fb - URL [jar:file:/home/tschut/wammes-classifier/target/classifier-1.0-SNAPSHOT-bin.jar!/logback.xml] is not of type file
10:30:17.622 [main] INFO  org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
10:30:17.886 [main] INFO  org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for NativeOps: 8
10:30:17.960 [main] INFO  org.nd4j.nativeblas.Nd4jBlas - Number of threads used for BLAS: 8
10:30:17.965 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CPU]; OS: [Linux]
10:30:17.965 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Cores: [32]; Memory: [14.7GB];
10:30:17.965 [main] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [MKL]
10:30:20.369 [main] INFO  o.d.m.e.loader.WordVectorSerializer - Projected memory use for model: [8.69 MB]
10:30:25.305 [main] INFO  c.l.p.t.c.a.AgeGroupClassifierCNN - Building network
10:30:25.348 [main] INFO  o.d.nn.multilayer.MultiLayerNetwork - Starting MultiLayerNetwork with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
10:30:25.386 [main] INFO  o.d.e.t.BaseEarlyStoppingTrainer - Starting early stopping training
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,368,50],  Stride: [18400,18400,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,368,1],  Stride: [368,368,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,512,50],  Stride: [25600,25600,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,512,1],  Stride: [512,512,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,512,50],  Stride: [25600,25600,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,512,1],  Stride: [512,512,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,512,50],  Stride: [25600,25600,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,512,1],  Stride: [512,512,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,512,50],  Stride: [25600,25600,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,512,1],  Stride: [512,512,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,512,50],  Stride: [25600,25600,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,512,1],  Stride: [512,512,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
10:30:26.149 [main] INFO  o.d.o.listeners.PerformanceListener - ETL: 0 ms; iteration 0; iteration time: 773 ms; samples/sec: 82.794; batches/sec: 1.294; 
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,368,50],  Stride: [18400,18400,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,368,1],  Stride: [368,368,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,443,50],  Stride: [22150,22150,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,443,1],  Stride: [443,443,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
10:30:26.265 [main] INFO  o.d.o.listeners.PerformanceListener - ETL: 0 ms; iteration 1; iteration time: 116 ms; samples/sec: 551.724; batches/sec: 8.621; 
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,436,50],  Stride: [21800,21800,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,436,1],  Stride: [436,436,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,469,50],  Stride: [23450,23450,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,469,1],  Stride: [469,469,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,376,50],  Stride: [18800,18800,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,376,1],  Stride: [376,376,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
10:30:26.386 [main] INFO  o.d.o.listeners.PerformanceListener - ETL: 0 ms; iteration 2; iteration time: 121 ms; samples/sec: 528.926; batches/sec: 8.264; 
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,403,50],  Stride: [20150,20150,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,403,1],  Stride: [403,403,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,350,50],  Stride: [17500,17500,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,350,1],  Stride: [350,350,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,419,50],  Stride: [20950,20950,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,419,1],  Stride: [419,419,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
10:30:26.510 [main] INFO  o.d.o.listeners.PerformanceListener - ETL: 0 ms; iteration 3; iteration time: 123 ms; samples/sec: 520.325; batches/sec: 8.130; 
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,441,50],  Stride: [22050,22050,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,441,1],  Stride: [441,441,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
10:30:26.626 [main] INFO  o.d.o.listeners.PerformanceListener - ETL: 0 ms; iteration 4; iteration time: 115 ms; samples/sec: 556.522; batches/sec: 8.696; 
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,512,50],  Stride: [25600,25600,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,512,1],  Stride: [512,512,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null
10:30:26.703 [main] INFO  o.d.o.listeners.PerformanceListener - ETL: 0 ms; iteration 5; iteration time: 76 ms; samples/sec: 842.105; batches/sec: 13.158; 
dataSet.getFeatures().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,402,50],  Stride: [20100,20100,50,1]
dataSet.getLabels().shapeInfoToString(): Rank: 2, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,4],  Stride: [4,1]
dataSet.getFeaturesMaskArray().shapeInfoToString(): Rank: 4, DataType: FLOAT, Offset: 0, Order: c, Shape: [64,1,402,1],  Stride: [402,402,1,1]
dataSet.getLabelsMaskArray().shapeInfoToString(): null

I'm out of my league here... is this helpful?

@tschut (Author) commented Apr 30, 2019

Btw, this is the output up until it deadlocks; there's no more output after this.

@AlexDBlack (Member) commented Apr 30, 2019

Yes, that's perfect, thanks.

@AlexDBlack (Member) commented Apr 30, 2019

I'm not able to reproduce this locally unfortunately (Windows 10, 5960x)
I've tried setting OMP_NUM_THREADS to 4, 8, 12 and 16, runs fine with any/all.

@tschut Mind running this and seeing if it reproduces your problem? (The array shapes should be the same as what you posted)
https://gist.github.com/AlexDBlack/fc14f2eef7244fe8eecc3690137e89eb

@tschut (Author) commented Apr 30, 2019

Yes! Your code also deadlocks after the 6th iteration @AlexDBlack. Must be hardware or os specific. Is there a prize for most esoteric bug found? 😉

I'm running on a Google Cloud Compute VM, should be pretty simple to set one up and reproduce it there.

@AlexDBlack (Member) commented Apr 30, 2019

Is there a prize for most esoteric bug found?

Does frustration and disappointment count as a prize? 😛

I'm running on a Google Cloud Compute VM, should be pretty simple to set one up and reproduce it there.

Yeah, we're trying to reproduce on Azure (a lot easier for us than GCC).
What was the exact VM model?

@tschut (Author) commented Apr 30, 2019

Machine type
custom (32 vCPUs, 60 GB memory)
CPU platform
Intel Skylake

Also see the output of lscpu at the top of this thread.

Does frustration and disappointment count as a prize? 😛

I suppose not 😆

@saudet (Member) commented May 1, 2019

@tschut Can you make sure you have nd4j-native-avx512 in your class path? Skylake can apparently deadlock if we try to make it execute some old MMX/SSE instructions alongside AVX-512.

@tschut (Author) commented May 1, 2019

I can't find anything with that name in the uberjar. Also, when I reconfigured my VM to run on Haswell instead of Skylake, the deadlock didn't occur, so that seems to confirm this is indeed the issue.

@tschut (Author) commented May 1, 2019

Correction: you can't specify an exact architecture, only a minimum. It was pure 'luck' that when I started it this morning I got Haswell instead of Skylake. After specifying Skylake as the minimum, the deadlock was back.

@raver119 (Contributor) commented May 1, 2019

Please use the avx512 classifier for the nd4j backend and tell us what happens.

@tschut (Author) commented May 1, 2019

You mean for nd4j-native, right? I did this:

        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-native</artifactId>
            <version>${dl4j.version}</version>
            <classifier>avx512</classifier>
            <scope>compile</scope>
        </dependency>

Resulting in:

Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.deeplearning4j.nn.conf.MultiLayerConfiguration$Builder.build(MultiLayerConfiguration.java:701)
        at org.deeplearning4j.nn.conf.NeuralNetConfiguration$ListBuilder.build(NeuralNetConfiguration.java:268)
        at com.luiwammes.pc.trainer.cnn.agegroup.Temp.main(Temp.java:50)
Caused by: java.lang.RuntimeException: org.nd4j.linalg.factory.Nd4jBackend$NoAvailableBackendException: Please ensure that you have an nd4j backend on your classpath. Please see: http://nd4j.org/getstarted.html
        at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5768)
        at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:202)
        ... 3 more
Caused by: org.nd4j.linalg.factory.Nd4jBackend$NoAvailableBackendException: Please ensure that you have an nd4j backend on your classpath. Please see: http://nd4j.org/getstarted.html
        at org.nd4j.linalg.factory.Nd4jBackend.load(Nd4jBackend.java:213)
        at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5765)
        ... 4 more

Build didn't complain, so that's weird.

@raver119 (Contributor) commented May 1, 2019

For nd4j-native you'll need a different classifier, and not compile scope.

@treo (Member) commented May 1, 2019

The dependency should look more like:

        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-native</artifactId>
            <version>${dl4j.version}</version>
            <classifier>linux-x86_64-avx512</classifier>
        </dependency>

@tschut (Author) commented May 1, 2019

Ah thanks, I was really confused there for a moment. Also, the previous build did fail; I made a mistake with Maven profiles, so never mind my previous comment. Going to test this now.

@tschut (Author) commented May 1, 2019

It's still deadlocking :(

mvn-shade reports

[INFO] Including org.nd4j:nd4j-native:jar:linux-x86_64-avx512:1.0.0-SNAPSHOT in the shaded jar.

So I'm pretty sure the Maven config is correct now for this. Also, linux-x86_64-avx512/libnd4jcpu.so is now included in my jar, so I suppose it should pick that up.

Running lsof -p PID: https://gist.github.com/tschut/c852f0d53b4b4b10683b7645096aad67. That shows it's loaded the avx512 libraries.

@raver119 (Contributor) commented May 9, 2019

We have added a temporary workaround that allows disabling MKL-DNN for specific operations.

@raver119 closed this May 9, 2019
