
non-deterministic behavior on the GPU? #399

Open
yoavg opened this issue Mar 23, 2017 · 7 comments

@yoavg
Contributor

commented Mar 23, 2017

I am unsure why, but we observe some non-deterministic behavior when running on the GPU.
The dynet-seed is set (as well as the python seed and the numpy seed), yet the losses in different runs start to diverge after a while (by the 100th update it almost surely happens).

This does not happen with the same code on the CPU. Maybe there is a race condition somewhere?

See the following gist:
https://gist.github.com/yoavg/510154a959e76370627a81da173cec9b

python not_deterministic.py > t1
python not_deterministic.py > t2
diff t1 t2
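
For reference, setting all three seeds looks roughly like this (a minimal sketch, not the exact gist code; the --dynet-seed flag has to be in sys.argv before dynet is imported, and the seed value is a placeholder):

import sys
import random
import numpy as np

SEED = 42  # placeholder value
sys.argv.extend(['--dynet-seed', str(SEED)])  # dynet reads its flags at import time
import dynet as dy

random.seed(SEED)     # python seed
np.random.seed(SEED)  # numpy seed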

@neubig

Contributor

commented Mar 23, 2017

We've observed this before as well, and confirmed that it also affects other libraries such as Theano. My guess is that because computation on GPUs is inherently parallel, you get small rounding differences that depend on the order in which operations complete, e.g. when some threads finish earlier than others. We haven't 100% confirmed that this is the cause, but it seems to make sense.

@yoavg

Contributor Author

commented Mar 23, 2017

Why should computation order matter?
The only thing I can think of is that we have a race condition when computing the gradients for the same node from different parents, and thus skip an update.

@neubig

Contributor

commented Mar 27, 2017

Well, the idea is that somewhere we might do a "reduce" operation, summing over a vector of elements (such as in softmax), and because this is done in parallel, the result might differ depending on the order in which the additions are performed, due to numerical precision issues. I honestly haven't thought this through well enough to know whether it's a feasible explanation, so your interpretation could also be correct.

The one thing I do know is that we've noticed this before and it also happened in Theano, so whatever the cause is, it is likely affecting other toolkits as well.
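
To illustrate why the order of additions matters at all: floating-point addition is not associative, so summing the same numbers in a different order can give a slightly different result. A minimal numpy sketch, independent of any particular dynet kernel:

import numpy as np

np.random.seed(0)
x = np.random.rand(100000).astype(np.float32)

s_forward = np.float32(0.0)
for v in x:            # one fixed left-to-right order
    s_forward += v

s_reverse = np.float32(0.0)
for v in x[::-1]:      # the same numbers, summed in reverse order
    s_reverse += v

print(s_forward, s_reverse, s_forward == s_reverse)  # the sums typically differ in the low bits

On a GPU the grouping of partial sums depends on how work is split across threads, so the same non-associativity can show up even with a fixed seed.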

@tianjianjiang


commented Feb 7, 2018

I've been googling related issues and found that the most problematic part is cuDNN.
Since @neubig's Japanese is really good, here's a post in Japanese (my apologies to @yoavg):
https://qiita.com/TokyoMickey/items/cc8cd43545f2656b1cbd
One quick example, much like what @neubig explained: when dropout is applied randomly in parallel on the cuDNN stack, the masked instances can differ between runs. Some TensorFlow users noticed that deterministic results can be obtained if dropout is set to 1.0, which is impractical and merely a test.

Some people claimed that Caffe has found a solution (torch/cunn#84) while Torch has a partial one (soumith/cudnn.torch#270).

A somewhat related workaround for reduce operations can be found in an English post, though:
https://www.twosigma.com/insights/a-workaround-for-non-determinism-in-tensorflow
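
The gist of that workaround, as I understand it, is to replace a non-deterministic parallel reduction with a matrix/vector product, which goes through a deterministic GEMM-style kernel. A rough sketch of the same idea in dynet-style Python; whether dy.dot_product actually runs deterministically on a given GPU backend is an assumption that would need to be verified:

import numpy as np
import dynet as dy

dy.renew_cg()
x_value = np.random.rand(1000)
x = dy.inputTensor(x_value)
ones = dy.inputTensor(np.ones(1000))

sum_via_reduce = dy.sum_elems(x)          # parallel reduction
sum_via_matmul = dy.dot_product(x, ones)  # same sum, computed as a dot product
print(sum_via_reduce.value(), sum_via_matmul.value())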

@neubig added the minor bug label and removed the need more info label on Mar 23, 2018

@neubig

Contributor

commented Mar 23, 2018

Thanks for the info @tianjianjiang !

I think the best thing would probably be to at least have some indication of when behavior is non-deterministic, so we can point to something in the documentation when people ask questions. This would also help us identify any operations where we might be doing something wrong.

In order to do this, we would have to create tests where we run operations multiple times on the GPU with the same values and compare whether they give exactly the same result or not. I think this shouldn't be too hard to create.
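
Something along these lines, presumably; a minimal sketch of such a check (the helper name check_op_deterministic is just illustrative):

import numpy as np
import dynet as dy

def check_op_deterministic(op, vector_size=10000, n_repeats=100, seed=42):
    # Run `op` repeatedly on the identical input and check bit-for-bit equality of the results.
    np.random.seed(seed)
    x_value = np.random.rand(vector_size)
    results = np.zeros(n_repeats)
    for i in range(n_repeats):
        dy.renew_cg()
        results[i] = op(dy.inputTensor(x_value)).value()
    return bool(np.all(results == results[0]))

print(check_op_deterministic(dy.sum_elems))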

@tianjianjiang


commented Mar 28, 2018

You're very welcome @neubig !

I have only briefly checked CuPy (for Chainer), and it tests cuDNN and randomness:
https://github.com/cupy/cupy/pull/82/files

BTW, FYI:
http://www.theregister.co.uk/2018/03/21/nvidia_titan_v_reproducibility/

@pmichel31415

Collaborator

commented Apr 17, 2018

Update on this: using the functionality introduced in my PR #1351, I ran this code:

import sys
import numpy as np
sys.argv.extend(['--dynet-devices', 'CPU,GPU:0'])
import dynet as dy
SEED = 42

np.random.seed(SEED)
VECTOR_VALUE = np.random.uniform()

def test_randomness_sum(seed, vector_size, n_repeats):
    # Sum a fixed random vector n_repeats times and report the std dev across runs.
    results = np.zeros(n_repeats)
    for i in range(n_repeats):
        dy.reset_random_seed(seed)
        dy.renew_cg()
        np.random.seed(SEED)
        x_value = np.random.rand(vector_size)
        x_value /= x_value.sum()
        x = dy.inputTensor(x_value)
        results[i] = dy.sum_elems(x).value()
    print('Std dev for size %d over %d trials: %.2e' % (vector_size, n_repeats, results.std()))

def test_randomness_dropout(seed, vector_size, n_repeats):
    # Same check, but with dropout (fixed dynet seed) applied before the sum.
    results = np.zeros(n_repeats)
    for i in range(n_repeats):
        dy.reset_random_seed(seed)
        dy.renew_cg()
        np.random.seed(SEED)
        x_value = np.random.rand(vector_size)
        x_value /= x_value.sum()
        x = dy.inputTensor(x_value)
        x = dy.dropout(x, 0.5)
        results[i] = dy.sum_elems(x).value()
    print('Std dev for size %d over %d trials: %.2e' % (vector_size, n_repeats, results.std()))

def test_randomness_param(seed, vector_size, n_repeats):
    # Same check on the sum of a freshly initialized parameter vector (fixed dynet seed).
    results = np.zeros(n_repeats)
    pc = dy.ParameterCollection()
    for i in range(n_repeats):
        dy.reset_random_seed(seed)
        dy.renew_cg()
        x = pc.add_parameters(vector_size)
        results[i] = dy.sum_elems(x).value()
    print('Std dev for size %d over %d trials: %.2e' % (vector_size, n_repeats, results.std()))

print('\n*** Testing stochasticity in sum_elems ***')
for exponent in range(1, 6):
    test_randomness_sum(SEED, 10**exponent, 100)
print('\n*** Testing stochasticity in dropout+sum_elems ***')
for exponent in range(1, 6):
    test_randomness_dropout(SEED, 10**exponent, 100)
print('\n*** Testing stochasticity in param init+sum_elems ***')
for exponent in range(1, 6):
    test_randomness_param(SEED, 10**exponent, 100)

Here are the results on 2 GPUs (Titan X and 1080 Ti):

[dynet] initializing CUDA
[dynet] Request for 1 specific GPU ...
[dynet] Device Number: 0
[dynet]   Device name: TITAN X (Pascal)
[dynet]   Memory Clock Rate (KHz): 5005000
[dynet]   Memory Bus Width (bits): 384
[dynet]   Peak Memory Bandwidth (GB/s): 480.48
[dynet]   Memory Free (GB): 12.6133/12.7885
[dynet]
[dynet] Device(s) selected: 0
[dynet] random seed: 710623261
[dynet] allocating memory: 512MB
[dynet] memory allocation done.

*** Testing stochasticity in sum_elems ***
Std dev for size 10 over 100 trials: 0.00e+00
Std dev for size 100 over 100 trials: 0.00e+00
Std dev for size 1000 over 100 trials: 0.00e+00
Std dev for size 10000 over 100 trials: 2.55e-08
Std dev for size 100000 over 100 trials: 5.57e-08

*** Testing stochasticity in dropout+sum_elems ***
Std dev for size 10 over 100 trials: 0.00e+00
Std dev for size 100 over 100 trials: 0.00e+00
Std dev for size 1000 over 100 trials: 2.92e-08
Std dev for size 10000 over 100 trials: 3.06e-08
Std dev for size 100000 over 100 trials: 6.68e-08

*** Testing stochasticity in param init+sum_elems ***
Std dev for size 10 over 100 trials: 0.00e+00
Std dev for size 100 over 100 trials: 0.00e+00
Std dev for size 1000 over 100 trials: 2.74e-08
Std dev for size 10000 over 100 trials: 1.18e-07
Std dev for size 100000 over 100 trials: 5.87e-08
[dynet] initializing CUDA
[dynet] Request for 1 specific GPU ...
[dynet] Device Number: 0
[dynet]   Device name: GeForce GTX 1080 Ti
[dynet]   Memory Clock Rate (KHz): 5505000
[dynet]   Memory Bus Width (bits): 352
[dynet]   Peak Memory Bandwidth (GB/s): 484.44
[dynet]   Memory Free (GB): 11.5463/11.7215
[dynet]
[dynet] Device(s) selected: 0
[dynet] random seed: 3091469477
[dynet] allocating memory: 512MB
[dynet] memory allocation done.

*** Testing stochasticity in sum_elems ***
Std dev for size 10 over 100 trials: 0.00e+00
Std dev for size 100 over 100 trials: 0.00e+00
Std dev for size 1000 over 100 trials: 0.00e+00
Std dev for size 10000 over 100 trials: 2.00e-08
Std dev for size 100000 over 100 trials: 5.30e-08

*** Testing stochasticity in dropout+sum_elems ***
Std dev for size 10 over 100 trials: 0.00e+00
Std dev for size 100 over 100 trials: 0.00e+00
Std dev for size 1000 over 100 trials: 4.86e-08
Std dev for size 10000 over 100 trials: 3.30e-08
Std dev for size 100000 over 100 trials: 5.34e-08

*** Testing stochasticity in param init+sum_elems ***
Std dev for size 10 over 100 trials: 0.00e+00
Std dev for size 100 over 100 trials: 0.00e+00
Std dev for size 1000 over 100 trials: 2.40e-08
Std dev for size 10000 over 100 trials: 7.75e-08
Std dev for size 100000 over 100 trials: 6.56e-08

For comparison, here's the output on CPU (deterministic, as expected):

*** Testing stochasticity in sum_elems ***
Std dev for size 10 over 100 trials: 0.00e+00
Std dev for size 100 over 100 trials: 0.00e+00
Std dev for size 1000 over 100 trials: 0.00e+00
Std dev for size 10000 over 100 trials: 0.00e+00
Std dev for size 100000 over 100 trials: 0.00e+00

*** Testing stochasticity in dropout+sum_elems ***
Std dev for size 10 over 100 trials: 0.00e+00
Std dev for size 100 over 100 trials: 0.00e+00
Std dev for size 1000 over 100 trials: 0.00e+00
Std dev for size 10000 over 100 trials: 0.00e+00
Std dev for size 100000 over 100 trials: 0.00e+00

*** Testing stochasticity in param init+sum_elems ***
Std dev for size 10 over 100 trials: 0.00e+00
Std dev for size 100 over 100 trials: 0.00e+00
Std dev for size 1000 over 100 trials: 0.00e+00
Std dev for size 10000 over 100 trials: 0.00e+00
Std dev for size 100000 over 100 trials: 0.00e+00