NN fails to learn #13
Comments
As unintuitive as it seems, that's actually the case, and that behaviour recently got some more exposure (https://charliehorse55.wordpress.com/2016/01/09/how-nvidia-breaks-chrome-incognito/). However, that shouldn't have any impact on the way Leaf learns. Generally I would assume that when one of the examples doesn't learn it's due to bad hyperparameters (batch size, learning rate, etc.), but your findings certainly are interesting and I'll look into it.
I'm getting the same results on my setup:
I didn't get around to it on the weekend but was able to run it now, and it learned correctly and from the first try:
Rust 1.7.0-stable

EDIT: It also works with my other machine:
Hmm, I'll try using the same CUDA and Rust versions as you and see if it changes my results. Will edit with details.

EDIT: No combination of driver or CUDA versions seems to work for me:
Same results here (not learning and always predicting 9). Machine info: Rust 1.7 stable
Yesterday I got it to learn from the first try with […]. I've simplified this example a bit by reducing the input dimension to […]. I'm currently looking into how to dump intermediate values and weights to find out when they turn to NaNs. I've got a bit more time now; hopefully I'll figure it out this time.
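For reference, here's a minimal sketch (plain Rust, no Leaf APIs) of the kind of NaN check I have in mind, run over a weight buffer after it has been copied back to the host as a flat f32 slice; how that slice is obtained from Leaf/Collenchyma is left out:

```rust
/// Report the first NaN in a host-side weight buffer, if any.
fn first_nan_index(weights: &[f32]) -> Option<usize> {
    weights.iter().position(|w| w.is_nan())
}

fn report_layer(name: &str, weights: &[f32]) {
    match first_nan_index(weights) {
        Some(i) => println!("{}: first NaN at index {} of {}", name, i, weights.len()),
        None => println!("{}: no NaNs in {} values", name, weights.len()),
    }
}

fn main() {
    // Dummy values standing in for weights dumped after a forward/backward pass.
    let weights = vec![0.01_f32, -0.02, f32::NAN, 0.5];
    report_layer("linear1", &weights);
}
```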
@alexandermorozov I'm getting the same […]. On another note, I had to add a […].
You can try to start two tasks simultaneously. It generally works for me: the second task learns more often than not. Though it's difficult to tell if the net works as expected: half of the neurons might be dead and the net may still learn somewhat.
I'm on Debian testing; common CUDA packages are installed from the distro repos.
fix/sgd: initialize weight gradient history with zeroes

The SGD solver used uninitialized history tensors. If they contained NaNs, the whole network got poisoned after the first generation even if momentum was set to zero. This patch prefills the gradient history with zeros.

FIX: autumnai/leaf-examples#13
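To spell out why the zero-fill matters: with IEEE floats, 0.0 * NaN is still NaN, so NaN junk in the history survives even a zero momentum and leaks into the weights on the first update. A minimal sketch of the momentum update over plain slices (not Leaf's actual solver code, just the rule it implements):

```rust
/// SGD-with-momentum step over flat slices: h = momentum * h + lr * g; w -= h.
fn sgd_step(weights: &mut [f32], grads: &[f32], history: &mut [f32], lr: f32, momentum: f32) {
    for ((w, &g), h) in weights.iter_mut().zip(grads).zip(history.iter_mut()) {
        // If `*h` starts out as NaN junk, `momentum * *h` is still NaN
        // even when momentum == 0.0, and the NaN spreads into the weight.
        *h = momentum * *h + lr * g;
        *w -= *h;
    }
}

fn main() {
    let grads = vec![0.01_f32; 4];

    // Correct: history prefilled with zeros, as in the fix.
    let mut weights = vec![0.1_f32; 4];
    let mut history = vec![0.0_f32; 4];
    sgd_step(&mut weights, &grads, &mut history, 0.1, 0.0);
    println!("zeroed history   -> {:?}", weights);

    // Broken: NaN junk in the history poisons the weights on the first step.
    let mut weights = vec![0.1_f32; 4];
    let mut history = vec![f32::NAN; 4];
    sgd_step(&mut weights, &grads, &mut history, 0.1, 0.0);
    println!("NaN-junk history -> {:?}", weights);
}
```

The first case prints slightly adjusted weights; the second prints all NaNs.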
After the first iteration the NN output gets stuck at the value 9:
This happens nearly always, and the output is invariably 9 (excluding the first iteration). If a second process is started while the first is still running, then the second NN learns, and in all 3 models it reaches about 90% accuracy. Actually, the first process may be an example/benchmark from the main leaf repo or some other program that heavily uses CUDA. I've tried various combinations, and it looks like the things that matter are: 1) it should allocate a sizable memory chunk, 2) it should overwrite it with something. It doesn't matter whether the first program is running or in a stopped state.
This behavior could be explained if the memory where the coefficients are stored isn't initialized by Leaf with small random values and just contains junk left over from other programs. If the junk is good enough, the NN learns, but suboptimally. If it's not, it's stuck. The next time the program gets the same memory, so if it's stuck, it stays stuck for good. But it would be a security hole if CUDA didn't zero memory allocations, so my guess may be completely wrong.
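If that guess is right, one way to rule it out would be to fill the coefficient buffer on the host with small random values before it's uploaded to the device, instead of relying on whatever a fresh allocation happens to contain. A sketch of that idea only (using the rand crate, 0.8 assumed; the device upload is not shown, and this isn't Leaf's actual initialization code):

```rust
use rand::Rng; // rand = "0.8" (assumed)

/// Fill a host-side buffer with small uniform random values so learning
/// never depends on junk left behind in a fresh GPU allocation.
fn init_weights(len: usize, scale: f32) -> Vec<f32> {
    let mut rng = rand::thread_rng();
    (0..len).map(|_| rng.gen_range(-scale..scale)).collect()
}

fn main() {
    let weights = init_weights(8, 0.05);
    println!("initial weights: {:?}", weights);
    // The filled buffer would then be uploaded to the device here.
}
```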
Here is my setup:
I guess I'll try to code a small, simple NN over the weekend and check the coefficients at different computation stages.
Edit: formatting