This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Dropout inconsistency bug #16705

Open
sxjscience opened this issue Nov 1, 2019 · 9 comments

@sxjscience
Member

sxjscience commented Nov 1, 2019

In the following script, we should obtain the same dropout mask regardless of nrepeat, but currently the result depends on nrepeat. Note that I've turned off cuDNN dropout by setting cudnn_off=True.

import mxnet as mx
import numpy as np
import random
from numpy.testing import assert_allclose

base_y_np = None

for nrepeat in [1, 2, 3, 4]:
    seed = 123
    mx.random.seed(seed)
    np.random.seed(seed)
    random.seed(seed)

    x = mx.nd.ones((3, 3), ctx=mx.gpu())
    # Warm-up Dropout calls outside autograd; their number varies with nrepeat.
    for _ in range(nrepeat):
        y = mx.nd.Dropout(x, cudnn_off=True)
    # The recorded Dropout should yield the same mask regardless of nrepeat.
    with mx.autograd.record():
        y = mx.nd.Dropout(x, cudnn_off=True)
        y_np = y.asnumpy()
    if base_y_np is None:
        base_y_np = y_np
    else:
        assert_allclose(base_y_np, y_np)

Output:

Not equal to tolerance rtol=1e-07, atol=0

Mismatch: 55.6%
Max absolute difference: 2.
Max relative difference: 1.
 x: array([[0., 2., 0.],
       [0., 0., 2.],
       [0., 2., 0.]], dtype=float32)
 y: array([[2., 2., 0.],
       [0., 2., 0.],
       [0., 0., 2.]], dtype=float32)

If we set nrepeat to the same value in every iteration, the result is consistent:

import mxnet as mx
import numpy as np
import random
from numpy.testing import assert_allclose

base_y_np = None
ctx = mx.gpu()

for nrepeat in [3, 3, 3, 3]:
    seed = 123
    mx.random.seed(seed)
    np.random.seed(seed)
    random.seed(seed)

    x = mx.nd.ones((3, 3), ctx=ctx)
    for _ in range(nrepeat):
        y = mx.nd.Dropout(x, cudnn_off=True)
    with mx.autograd.record():
        y = mx.nd.Dropout(x, cudnn_off=True)
        y_np = y.asnumpy()
    if base_y_np is None:
        base_y_np = y_np
    else:
        assert_allclose(base_y_np, y_np)
@sxjscience sxjscience added the Bug label Nov 1, 2019
@sxjscience
Member Author

Also, I've confirmed that the CPU side does not have this problem.
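For reference, here is the same reproduction with the context switched to CPU. This is my own restatement of the script above (only ctx changes), not code from this thread; per this comment, the assertion passes there.

import mxnet as mx
import numpy as np
import random
from numpy.testing import assert_allclose

base_y_np = None
ctx = mx.cpu()  # only the context differs from the GPU script

for nrepeat in [1, 2, 3, 4]:
    seed = 123
    mx.random.seed(seed)
    np.random.seed(seed)
    random.seed(seed)

    x = mx.nd.ones((3, 3), ctx=ctx)
    for _ in range(nrepeat):
        y = mx.nd.Dropout(x, cudnn_off=True)
    with mx.autograd.record():
        y = mx.nd.Dropout(x, cudnn_off=True)
        y_np = y.asnumpy()
    if base_y_np is None:
        base_y_np = y_np
    else:
        assert_allclose(base_y_np, y_np)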

@DickJC123
Contributor

What behavior do we expect from a model that has two Dropouts, where no seeds have been set explicitly in advance? Are the dropout patterns identical or different?

If the answer is 'different', then I would think that by setting the seeds in advance, the two-Dropout model would have repeatable behavior, but the two Dropouts would still differ from each other.
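For concreteness, a minimal sketch of the two-Dropout case being asked about (my own illustration, not code from this thread): two Dropouts applied back to back after one explicit seed.

import mxnet as mx

mx.random.seed(123)
x = mx.nd.ones((3, 3), ctx=mx.gpu())
with mx.autograd.record():
    y1 = mx.nd.Dropout(x, cudnn_off=True)   # first dropout
    y2 = mx.nd.Dropout(y1, cudnn_off=True)  # second dropout
# If the answer is 'different', y1 and y2 use different masks, while re-running
# this whole block after the same seed should reproduce both of them.
print(y1)
print(y2)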

Also, feel free @sxjscience to chime in on the discussion of PR #16532.

@sxjscience
Member Author

@DickJC123 The answer should be different because these two dropouts should share the same internal random number generator and the random state will be updated accordingly.

The inconsistency bug mentioned in this issue, however, is not exactly related to the seeding problem.

For example, consider the following script:

import mxnet as mx
mx.random.seed(123)
x = mx.nd.ones((10, 10))

y = mx.nd.Dropout(x, cudnn_off=True)
with mx.autograd.record():
   y = mx.nd.Dropout(x, cudnn_off=True)

The first y = mx.nd.Dropout(x, cudnn_off=True) is not surrounded by autograd, and should not update the random state. However, in the current implementation (https://github.com/apache/incubator-mxnet/blob/bb6305d11d4383af2022e53ad94d6a1d5d93cb00/src/operator/nn/dropout-inl.h#L495), the rand() function is still called when the node is constructed. Thus, running y = mx.nd.Dropout(x, cudnn_off=True) outside the training loop will still interfere with the random state.

This means the following two code snippets will produce different results:

  • Case 1
import mxnet as mx
mx.random.seed(123)
x = mx.nd.ones((3, 3), ctx=mx.gpu())

y = mx.nd.Dropout(x, cudnn_off=True)
with mx.autograd.record():
   y = mx.nd.Dropout(x, cudnn_off=True)
print(y)
[[0. 2. 0.]
 [0. 0. 2.]
 [0. 2. 0.]]
<NDArray 3x3 @gpu(0)>
  • Case 2
import mxnet as mx
mx.random.seed(123)
x = mx.nd.ones((3, 3), ctx=mx.gpu())

with mx.autograd.record():
   y = mx.nd.Dropout(x, cudnn_off=True)
print(y)
[[0. 0. 2.]
 [0. 0. 2.]
 [0. 2. 0.]]
<NDArray 3x3 @gpu(0)>

@sxjscience
Member Author

@DickJC123 You may see that I've manually set cudnn_off=True. Also, I think #16532 will solve this problem.

@xidulu
Contributor

xidulu commented Nov 5, 2019

Clearly, dropout in inference mode affects the random state:

>>> mx.random.seed(123)
>>> mx.nd.Dropout(x, cudnn_off=True)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
<NDArray 3x3 @gpu(0)>
>>> mx.random.uniform(shape=(2,2),ctx=mx.gpu(0))

[[0.6512425  0.11220306]
 [0.86499107 0.68052745]]
<NDArray 2x2 @gpu(0)>
>>> mx.random.seed(123)
>>> mx.random.uniform(shape=(2,2),ctx=mx.gpu(0))

[[0.9423294  0.68506277]
 [0.19981462 0.60299706]]
<NDArray 2x2 @gpu(0)>
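The same observation as a self-contained check (a sketch of my own, consistent with the REPL output above): draw a reference uniform sample after seeding, then reseed, run an inference-mode Dropout, and draw again.

import mxnet as mx
from numpy.testing import assert_allclose

ctx = mx.gpu()
x = mx.nd.ones((3, 3), ctx=ctx)

mx.random.seed(123)
ref = mx.random.uniform(shape=(2, 2), ctx=ctx).asnumpy()

mx.random.seed(123)
mx.nd.Dropout(x, cudnn_off=True)  # inference mode: the output is just x
out = mx.random.uniform(shape=(2, 2), ctx=ctx).asnumpy()

# Per the output above, this fails on the affected builds because the
# inference-mode Dropout consumed GPU random state.
assert_allclose(ref, out)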

@sxjscience
Member Author

sxjscience commented Nov 5, 2019

With the help of @xidulu, we have located the root cause of the issue:

The bug is triggered because we have multiple parallel GPU random resources: https://github.com/apache/incubator-mxnet/blob/c583e44816a5e383493f35e69daaa92a47e40e39/src/resource.cc#L93-L94

When we create a new Dropout Node, we will attach a random resource to the node: https://github.com/apache/incubator-mxnet/blob/c583e44816a5e383493f35e69daaa92a47e40e39/src/operator/nn/dropout.cc#L148-L164

Since there are multiple random resources, we select one in a round-robin fashion. Each resource has its own seed, which results in the inconsistent behavior. https://github.com/apache/incubator-mxnet/blob/c583e44816a5e383493f35e69daaa92a47e40e39/src/resource.cc#L344-L351
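To see why round-robin selection over independently seeded generators makes the mask depend on nrepeat, here is a plain-Python analogy (my own sketch; the copy count of 4 and all names are illustrative, not taken from the MXNet source):

import numpy as np

NUM_COPIES = 4  # stands in for MXNET_GPU_PARALLEL_RAND_COPY

def new_dropout_mask(generators, cursor):
    # Pick a generator round-robin, as happens when a Dropout node is created.
    gen = generators[cursor % NUM_COPIES]
    return (gen.uniform(size=(3, 3)) > 0.5).astype(np.float32), cursor + 1

for nrepeat in [1, 2, 3]:
    # Re-seeding resets every copy, but each copy still has its own stream.
    generators = [np.random.RandomState(seed) for seed in range(NUM_COPIES)]
    cursor = 0
    for _ in range(nrepeat):
        _, cursor = new_dropout_mask(generators, cursor)  # warm-up calls
    mask, cursor = new_dropout_mask(generators, cursor)
    # The final mask differs with nrepeat because the warm-ups advanced the cursor.
    print(nrepeat, mask)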

The simplest fix is to use a single GPU random generator. Thus, setting os.environ['MXNET_GPU_PARALLEL_RAND_COPY'] = '1' works around the problem:

import os

os.environ['MXNET_GPU_PARALLEL_RAND_COPY'] = '1'

import mxnet as mx
import numpy as np
import random
from numpy.testing import assert_allclose

base_y_np = None

for nrepeat in [1, 2, 3, 4]:
    seed = 123
    mx.random.seed(seed)
    np.random.seed(seed)
    random.seed(seed)

    x = mx.nd.ones((3, 3), ctx=mx.gpu())
    for _ in range(nrepeat):
        y = mx.nd.Dropout(x, cudnn_off=True)
    with mx.autograd.record():
        y = mx.nd.Dropout(x, cudnn_off=True)
        y_np = y.asnumpy()
    if base_y_np is None:
        base_y_np = y_np
    else:
        assert_allclose(base_y_np, y_np)

@samskalicky
Contributor

@zachgk assign @larroy

@zachgk
Contributor

zachgk commented Nov 11, 2019

@larroy Please comment if you want to take this issue.

@larroy
Contributor

larroy commented Nov 18, 2019

Hi
