NN fails to learn #13

Closed
alexandermorozov opened this issue Mar 11, 2016 · 8 comments

@alexandermorozov

After the first iteration NN output gets stuck at value 9:

leaf-examples$ cargo run --release --  mnist linear --batch-size 10 
     Running `target/release/leaf-examples mnist linear --batch-size 10`
Last sample: Prediction: 0, Target: 3 | Accuracy 1/10 = 10.00%
Last sample: Prediction: 9, Target: 4 | Accuracy 2/20 = 10.00%
Last sample: Prediction: 9, Target: 3 | Accuracy 3/30 = 10.00%
Last sample: Prediction: 9, Target: 1 | Accuracy 4/40 = 10.00%
Last sample: Prediction: 9, Target: 3 | Accuracy 7/50 = 14.00%
Last sample: Prediction: 9, Target: 4 | Accuracy 9/60 = 15.00%
Last sample: Prediction: 9, Target: 1 | Accuracy 9/70 = 12.86%
Last sample: Prediction: 9, Target: 9 | Accuracy 10/80 = 12.50%
Last sample: Prediction: 9, Target: 6 | Accuracy 11/90 = 12.22%
Last sample: Prediction: 9, Target: 5 | Accuracy 11/100 = 11.00%
Last sample: Prediction: 9, Target: 9 | Accuracy 12/110 = 10.91%
Last sample: Prediction: 9, Target: 2 | Accuracy 13/120 = 10.83%
Last sample: Prediction: 9, Target: 3 | Accuracy 13/130 = 10.00%
Last sample: Prediction: 9, Target: 7 | Accuracy 14/140 = 10.00%
Last sample: Prediction: 9, Target: 4 | Accuracy 14/150 = 9.33%

This happens nearly always, and the output is invariably 9 (excluding the first iteration). If a second process is started while the first is still running, the second NN learns and reaches about 90% accuracy in all 3 models. The first process can also be an example/benchmark from the main leaf repo or some other program that makes heavy use of CUDA. I've tried various combinations, and it looks like the things that matter are: 1) it should allocate a sizable chunk of memory, and 2) it should overwrite that chunk with something. It doesn't matter whether the first program is still running or stopped.

This behavior could be explained if the memory where the coefficients are stored isn't initialized by leaf with small random values and just contains junk left over from other programs. If the junk is good enough, the NN learns, but suboptimally. If it isn't, the NN gets stuck. The next run gets the same memory, so once it's stuck, it's stuck for good. But it would be a security hole if CUDA doesn't zero memory allocations, so my guess may be completely wrong.

Here is my setup:

  • rustc 1.8.0-beta.1 (facbfdd71 2016-03-02),
  • Debian stretch (9),
  • CUDA 7.0,
  • cuDNN v3, v4 (tried both),
  • GeForce GTX 960 / 4 GB RAM; I bought it just yesterday, but the hardware seems solid and memtestG80 doesn't show any problems,
  • up-to-date checkout of leaf-examples with the Hyper stuff commented out (linking fails on Debian 9 due to an OpenSSL API mismatch).

I guess I'll try to code a small, simple NN over the weekend and check the coefficients at different computation stages.

Edit: formatting

@hobofan
Member

hobofan commented Mar 12, 2016

But it would be a security hole if CUDA doesn't zero memory allocations, so my guess may be completely wrong.

As unintuitive as it seems, that's actually the case, and that behaviour recently got some more exposure (https://charliehorse55.wordpress.com/2016/01/09/how-nvidia-breaks-chrome-incognito/).

However, that shouldn't have any impact on the way Leaf learns.
When the network is created, the weights of the Linear and Convolution layers are randomly initialized (see https://github.com/autumnai/leaf/blob/master/src/layers/common/linear.rs#L100), so the initial state of the memory shouldn't really matter. Maybe there is a problem with the filled weights not being synchronized correctly?

Generally I would assume that when one of the examples doesn't learn, it's due to bad hyperparameters (batch size, learning rate, etc.), but your findings are certainly interesting and I'll look into it.
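
For reference, here is a rough sketch (not Leaf's actual code; see the linear.rs link above for that) of the kind of small random initialization meant here. If values like these reach the weight buffer before the first forward pass, whatever junk the allocation contained shouldn't matter:

// Sketch only: Glorot/Xavier-style uniform init of a Linear layer's weights
// on the host. The exact scheme Leaf uses may differ; the point is that every
// weight becomes a small finite value that overwrites whatever was in the buffer.
fn init_weights(fan_in: usize, fan_out: usize, seed: u64) -> Vec<f32> {
    let limit = (6.0f64 / (fan_in + fan_out) as f64).sqrt() as f32;
    let mut state = seed;
    (0..fan_in * fan_out)
        .map(|_| {
            // tiny LCG, just to keep the sketch dependency-free
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            let unit = (state >> 11) as f64 / (1u64 << 53) as f64; // in [0, 1)
            ((unit * 2.0 - 1.0) as f32) * limit
        })
        .collect()
}

fn main() {
    // 784 inputs, 10 outputs, as in the MNIST linear example
    let weights = init_weights(784, 10, 42);
    assert!(weights.iter().all(|w| w.is_finite()));
    println!("first weights: {:?}", &weights[..5]);
}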

@KodrAus

KodrAus commented Mar 13, 2016

I'm getting the same results on my setup:

target/release/leaf-examples mnist linear --batch-size 10
Last sample: Prediction: 2, Target: 3 | Accuracy 1/10 = 10.00%
Last sample: Prediction: 9, Target: 4 | Accuracy 2/20 = 10.00%
Last sample: Prediction: 9, Target: 3 | Accuracy 3/30 = 10.00%
Last sample: Prediction: 9, Target: 1 | Accuracy 4/40 = 10.00%
Last sample: Prediction: 9, Target: 3 | Accuracy 7/50 = 14.00%
Last sample: Prediction: 9, Target: 4 | Accuracy 9/60 = 15.00%
Last sample: Prediction: 9, Target: 1 | Accuracy 9/70 = 12.86%
Last sample: Prediction: 9, Target: 9 | Accuracy 10/80 = 12.50%
Last sample: Prediction: 9, Target: 6 | Accuracy 11/90 = 12.22%
...
CUDA version 7.5.18
rustc 1.9.0-nightly
cudnn v4
Nvidia GTX Titan X

@hobofan
Member

hobofan commented Mar 14, 2016

I didn't get around to it over the weekend, but I was able to run it now, and it learned correctly on the first try:

cargo run --release --  mnist linear --batch-size 10 
   Compiling collenchyma v0.0.8
   Compiling collenchyma-nn v0.3.4
   Compiling collenchyma-blas v0.2.0
   Compiling leaf v0.2.0
   Compiling leaf-examples v0.1.0 (file:///home/hobofan/autumn/leaf-examples)
     Running `target/release/leaf-examples mnist linear --batch-size 10`
target/release/leaf-examples: /opt/cuda/lib64/libOpenCL.so.1: no version information available (required by target/release/leaf-examples)
Last sample: Prediction: 2, Target: 3 | Accuracy 1/10 = 10.00%
Last sample: Prediction: 3, Target: 4 | Accuracy 3/20 = 15.00%
Last sample: Prediction: 4, Target: 3 | Accuracy 4/30 = 13.33%
Last sample: Prediction: 1, Target: 1 | Accuracy 7/40 = 17.50%
Last sample: Prediction: 0, Target: 3 | Accuracy 10/50 = 20.00%
Last sample: Prediction: 2, Target: 4 | Accuracy 12/60 = 20.00%
Last sample: Prediction: 9, Target: 1 | Accuracy 15/70 = 21.43%
Last sample: Prediction: 0, Target: 9 | Accuracy 21/80 = 26.25%
Last sample: Prediction: 6, Target: 6 | Accuracy 26/90 = 28.89%
Last sample: Prediction: 0, Target: 5 | Accuracy 29/100 = 29.00%
Last sample: Prediction: 4, Target: 9 | Accuracy 33/110 = 30.00%
Last sample: Prediction: 3, Target: 2 | Accuracy 40/120 = 33.33%
Last sample: Prediction: 1, Target: 3 | Accuracy 46/130 = 35.38%
Last sample: Prediction: 7, Target: 7 | Accuracy 52/140 = 37.14%
Last sample: Prediction: 5, Target: 4 | Accuracy 56/150 = 37.33%
Last sample: Prediction: 7, Target: 8 | Accuracy 63/160 = 39.38%
Last sample: Prediction: 9, Target: 9 | Accuracy 69/170 = 40.59%

Rust 1.7.0-stable
CUDA version 7.5.17
cuDNN v4
NVIDIA GT 750M (2GB RAM)

EDIT:

It also works on my other machine:
Rust 1.5.0-stable
CUDA version 7.5.17
cuDNN v4
NVIDIA Titan X

@KodrAus

KodrAus commented Mar 14, 2016

Hmm, I'll try using the same CUDA and Rust versions as you and see if it changes my results. Will edit with details.

EDIT: No combination of driver or cuda versions seems to work for me:

Ubuntu 15.10
Rust 1.7.0 Stable

nvidia-361.28 (os)
nvidia-352.79 (prop)
nvidia-352.63 (prop)

@MarcoPolo

Same results here (not learning and always predicting 9).

Machine info:

Rust 1.7 stable
Ubuntu 14.04
CUDA v7.5.17
cuDNN v4
nvidia GTX 680

@alexandermorozov
Author

Yesterday I got it to learn on the first try with the linear net. A second run also worked. Then I switched to conv, and it always returned 9. After that, subsequent runs of the linear net returned 9 too.

I've simplified this example a bit by reducing the input dimension to 1 and autogenerating the training samples; the code is here. It shows the same behaviour -- sometimes it gets stuck, sometimes it doesn't. The effect doesn't depend on the number of layers or the batch size -- I get the same thing with only one linear layer and batch_size=1. In the cases where it gets stuck, the output of the nll layer contains some sensible values after the first generation, but in later generations it degrades to all NaNs, even with learning_rate=0, when the values shouldn't change at all.
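
For anyone who wants to reproduce this without MNIST, here is a purely hypothetical sketch of the kind of autogenerated 1-D training data described above (the actual linked test code may generate its samples differently):

// Hypothetical sketch (not the actual linked test code): autogenerate
// (input, label) pairs with input dimension 1, where the label is trivially
// determined by the input. A net that can't fit data like this is broken.
fn generate_batch(batch_size: usize, step: usize) -> (Vec<f32>, Vec<usize>) {
    let mut inputs = Vec::with_capacity(batch_size);
    let mut labels = Vec::with_capacity(batch_size);
    for i in 0..batch_size {
        let digit = (step * batch_size + i) % 10; // target class 0..=9
        inputs.push(digit as f32 / 10.0);         // single scalar input in [0.0, 0.9]
        labels.push(digit);
    }
    (inputs, labels)
}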

I'm currently looking into how to dump intermediate values and weights to find out when they turn into NaNs. I've got a bit more time now, so hopefully I'll figure it out this time.
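
A minimal sketch of the kind of check such a dump needs, assuming the weights and layer outputs can be read back into a host-side f32 slice (reading them back from the GPU is the Leaf/Collenchyma-specific part and isn't shown here):

// Sketch: scan a host-side buffer for NaN/Inf and report where the first
// bad value appears, to narrow down which layer/generation poisons the net.
fn check_finite(name: &str, values: &[f32]) -> bool {
    match values.iter().position(|v| !v.is_finite()) {
        Some(idx) => {
            println!(
                "{}: first non-finite value {} at index {} (of {} elements)",
                name, values[idx], idx, values.len()
            );
            false
        }
        None => true,
    }
}

fn main() {
    let weights = vec![0.01f32, -0.02, f32::NAN, 0.03];
    check_finite("linear1.weights", &weights);
}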

@KodrAus

KodrAus commented Mar 20, 2016

@alexandermorozov I'm getting the same NaN results as you on your test code; so far I haven't been able to get any nets to learn.

On another note, I had to add a build.rs to your test code to get it to link the cu* libraries properly on my machine. How have you got CUDA set up on your machine?

@alexandermorozov
Author

I'm getting the same NaN results as you on your test code; so far I haven't been able to get any nets to learn.

You can try to start two tasks simultaneously. That generally works for me: the second task learns more often than not. Though it's difficult to tell whether the net works as expected: half of the neurons might be dead and the net may still learn somewhat.

On another note, I had to add a build.rs to your test code to get it to link the cu* libraries properly on my machine. How have you got CUDA set up on your machine?

I'm on Debian testing; the common CUDA packages are installed from the distro repos. The libcudnn.so* files are manually placed in /usr/local/lib and cudnn.h in /usr/local/include. More importantly, Rust switched its linker from ld to ld.gold about 3 months ago, and ld.gold doesn't search /usr/local/lib by default, so an environment variable has to be set like this: export LIBRARY_PATH="/usr/local/lib". If that doesn't help, can you post the error message or the contents of your build.rs? It may be better to open another issue to stay on topic here.
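
For reference, a minimal build.rs sketch that achieves the same thing as the LIBRARY_PATH workaround by adding the directory to the native library search path (the /usr/local/lib path is an assumption based on the layout described above):

// build.rs (sketch): tell Cargo/rustc where to find manually installed
// libraries such as libcudnn.so when the linker doesn't search that
// directory by default.
fn main() {
    // Assumes the cuDNN libraries live in /usr/local/lib, as described above.
    println!("cargo:rustc-link-search=native=/usr/local/lib");
}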

homu added a commit to autumnai/leaf that referenced this issue Mar 23, 2016
fix/sgd: initialize weight gradient history with zeroes

The SGD solver used uninitialized history tensors. If they contained NaNs, the
whole network got poisoned after the first generation, even if momentum
was set to zero. This patch prefills the gradient history with zeros.

FIX: autumnai/leaf-examples#13
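
A small sketch of why an uninitialized history tensor poisons training even with momentum set to zero: in IEEE 754 arithmetic, 0.0 * NaN is still NaN, so once a NaN is read from the history it propagates into the weights on every subsequent update. Prefilling the history with zeros avoids this. (This is an illustrative standalone example, not the solver code from the patch.)

// Sketch: an SGD-with-momentum update where the history buffer starts out
// containing NaN, as uninitialized GPU memory can.
fn sgd_step(weights: &mut [f32], history: &mut [f32], grads: &[f32], lr: f32, momentum: f32) {
    for i in 0..weights.len() {
        history[i] = momentum * history[i] + lr * grads[i]; // 0.0 * NaN == NaN
        weights[i] -= history[i];                           // NaN spreads to the weight
    }
}

fn main() {
    let mut poisoned_w = vec![0.1f32, 0.2];
    let mut zeroed_w = poisoned_w.clone();
    let mut poisoned_hist = vec![f32::NAN, f32::NAN]; // "uninitialized" history
    let mut zeroed_hist = vec![0.0f32, 0.0];          // the fix: prefill with zeros
    let grads = vec![0.01f32, 0.01];

    sgd_step(&mut poisoned_w, &mut poisoned_hist, &grads, 0.1, 0.0);
    sgd_step(&mut zeroed_w, &mut zeroed_hist, &grads, 0.1, 0.0);
    println!("with NaN history:  {:?}", poisoned_w); // [NaN, NaN]
    println!("with zero history: {:?}", zeroed_w);   // small, finite updates
}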