Conversation
LGTM. Thank you! : )
@wkcn Thanks for the review! On local GPU, running the following passed:
However, this test failed on CI with both. I will try to reproduce the CI failure locally first, or add this to the nightly tests, where fewer nosetests are executed at the same time. I suspect other nosetests running in parallel on the CI workers affect the result.
Force-pushed from 1cc69b4 to 2f48295
I am able to reproduce the CI failure locally now by running the following on a P3.8x instance with the DLAMI. Result:
However, running the test standalone with the same seed in the CI environment passed:
Hi @roywei, could you please add a unit test for multi-GPU? The Dropout results on different GPUs should be different.
Thanks a lot : )
src/operator/nn/dropout-inl.h
Outdated
Tensor<xpu, 1, unsigned>(reinterpret_cast<unsigned *>(workspace_ptr),
                         Shape1(1), s);
prnd->GetRandInt(random_number);
uint64_t seed_ = 17 + reinterpret_cast<uint64_t>(&random_number[0]) % 4096;
Why does it need the modulus operator %? The modulus makes seed_ fall between 0+17 and 4096+17.
I'm just keeping the original logic here:
https://github.com/roywei/incubator-mxnet/blob/master/src/operator/nn/dropout-inl.h#L95
and here:
https://github.com/roywei/incubator-mxnet/blob/master/src/operator/nn/dropout-inl.h#L491
while fixing the seed so that it respects the MXNet seed.
I think it should be uint64_t seed_ = 17 + static_cast<uint64_t>(random_number[0]) % 4096;, because the type of random_number[0] is unsigned.
https://github.com/apache/incubator-mxnet/blob/master/3rdparty/mshadow/mshadow/tensor.h#L591
This will give a segfault during dropout. Also, why would dropout on multi-GPU return different results? I thought the seed is fixed globally, so dropout on different GPUs will use the same seed and thus return the same result?
Sorry that I didn't express it clearly. If different GPUs use the same seed, the dropout results on different GPUs should be the same. When training a model, do GPUs use different random seeds?
I believe the GPU unit tests are running on instances with 1 GPU. I will try to move the entire test to the nightly tests, which use P3 instances with 4 GPUs; I can add a multi-GPU test there. Hopefully the seed can be properly fixed with fewer parallel jobs on the CI workers.
from mxnet.test_utils import assert_almost_equal


def test_dropout_with_seed():
Add the with_seed annotation.
I'm manually choosing a random seed and setting it before each forward, so the with_seed decorator will not take effect. See comment: #16532 (comment)
In line 29 you're generating a random number to feed the seed, though, so that generator needs to be fed with a seed as well.
CI runs on g3.8xlarge with 2 GPUs.
src/operator/nn/dropout-inl.h
Outdated
Tensor<xpu, 1, unsigned>(reinterpret_cast<unsigned *>(workspace_ptr),
                         Shape1(1), s);
prnd->GetRandInt(random_number);
uint64_t seed_ = 17 + static_cast<uint64_t>(random_number[0]) % 4096;
The tensor is on the GPU; we need to explicitly copy it back to the CPU using cudaMemcpy. You might get a garbage value if you just access the GPU memory address directly.
tests/nightly/test_dropout.py
Outdated
with mx.autograd.record():
    result2 = dropout(data1)

mx.random.seed(seed, ctx=mx.gpu(0))
Should it be mx.random.seed(seed, ctx=mx.gpu(1))?
I'm trying to fix the seed on gpu 0 only, so gpu 1 still has a random seed and result3 and result2 will be different. Otherwise, I would need to create a different seed2 and fix it on gpu 1, which would have the same effect (different seeds on gpu 0 and gpu 1 lead result2 and result3 to be different).
updated to use a different seed on gpu1
src/operator/nn/dropout-inl.h
Outdated
// copy generated random int to cpu
unsigned data = 0;
CUDA_CALL(cudaMemcpy(&data, &random_number[0], sizeof(unsigned), cudaMemcpyDeviceToHost));
uint64_t seed_ = 17 + static_cast<uint64_t>(data) % 4096;
@ptrendx @DickJC123 any concern for the fix?
So there are multiple problems with this:
- Why are you trying to seed the dropout every time you do a forward? I believe it will not actually do anything, because after the first call get_cudnn_dropout_desc calls cudnnRestoreDropoutDescriptor, which seems to ignore the seed value. If you actually want to seed the dropout every time (via cudnnSetDropoutDescriptor), that would be super costly, as seeding a random number generator is way more expensive than actually using it.
- You should never use cudaMemcpy in operator code. If you really need to copy values from GPU to CPU, you should use cudaMemcpyAsync followed by cudaStreamSynchronize. The difference is that cudaMemcpy synchronizes on all streams (so it waits on all GPU activity and prevents all subsequent work on all worker threads), whereas cudaStreamSynchronize waits only on the stream that you pass as an argument (so other GPU workers are not affected).
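For illustration only, a minimal sketch of the async copy pattern described above; error checking is omitted, and device_ptr / stream are placeholders for whatever device pointer and operator stream the real code would use:

#include <cuda_runtime.h>

// Sketch only: copy a single unsigned int from device to host without
// stalling other GPU workers. `device_ptr` and `stream` are assumed to be
// a valid device pointer and the operator's own CUDA stream.
unsigned CopyScalarToHost(const unsigned* device_ptr, cudaStream_t stream) {
  unsigned host_value = 0;
  // Asynchronous copy enqueued on this operator's stream only.
  cudaMemcpyAsync(&host_value, device_ptr, sizeof(unsigned),
                  cudaMemcpyDeviceToHost, stream);
  // Wait only for this stream; other streams keep running, unlike a plain
  // cudaMemcpy on the default stream, which serializes with other blocking streams.
  cudaStreamSynchronize(stream);
  return host_value;
}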
What about having the following logic inside get_cudnn_dropout_desc:

if (!state_space->handle.size) {
  request = ResourceManager::Get()->Request(cpu_ctx, ResourceRequest::kRandom);
  seed = request.GetRandInt();
  CUDNN_CALL(cudnnSetDropoutDescriptor(..., seed));
} else {
  // use a dummy seed (e.g. 0) for cudnnRestoreDropoutDescriptor
}

and remove the seed argument from get_cudnn_dropout_desc?
On a separate note, I see 34 occurrences of cudaMemcpy in various places (mainly operators) in the codebase; we probably need to do some cleanup:
src/kvstore/kvstore_utils.cu: CUDA_CALL(cudaMemcpy(sort_output_ptr, dptr, sort_output_bytes,
src/kvstore/kvstore_utils.cu: CUDA_CALL(cudaMemcpy(&num_selected_out, num_selected_ptr, num_selected_bytes,
src/ndarray/ndarray_function.cu: CUDA_CALL(cudaMemcpy(&nnr_out, &row_flg[num_rows-1], sizeof(dim_t),
src/operator/contrib/adamw.cu: CUDA_CALL(cudaMemcpy(&scale, scale_blob.dptr<DType>(), sizeof(DType),
src/operator/contrib/boolean_mask.cu: CUDA_CALL(cudaMemcpy(&valid_num, &prefix_sum[idx_size - 1], sizeof(int32_t),
src/operator/contrib/index_array.cu: CUDA_CALL(cudaMemcpy(workspace.dptr_, cpu_workspace.data(), sizeof(int64_t) * (2 * naxes),
src/operator/contrib/index_array.cu: CUDA_CALL(cudaMemcpy(workspace.dptr_, inshape.data(), sizeof(dim_t) * ndim,
src/operator/contrib/multi_proposal.cu: FRCNN_CUDA_CHECK(cudaMemcpy(&mask_host[0],
src/operator/contrib/multi_proposal.cu: FRCNN_CUDA_CHECK(cudaMemcpy(workspace_proposals.dptr_, &anchors[0],
src/operator/contrib/multi_proposal.cu: FRCNN_CUDA_CHECK(cudaMemcpy(keep, &_keep[0], sizeof(int) * _keep.size(),
src/operator/contrib/proposal.cu: FRCNN_CUDA_CHECK(cudaMemcpy(&mask_host[0],
src/operator/contrib/proposal.cu: FRCNN_CUDA_CHECK(cudaMemcpy(workspace_proposals.dptr_,
src/operator/contrib/proposal.cu: FRCNN_CUDA_CHECK(cudaMemcpy(&cpu_im_info[0], im_info.dptr_,
src/operator/contrib/proposal.cu: FRCNN_CUDA_CHECK(cudaMemcpy(keep, &_keep[0], sizeof(int) * _keep.size(),
src/operator/numpy/np_boolean_mask_assign.cu: CUDA_CALL(cudaMemcpy(&valid_num, &prefix_sum[mask_size], sizeof(size_t),
src/operator/numpy/np_nonzero_op.cu: CUDA_CALL(cudaMemcpy(&valid_num, &prefix_sum[in_size - 1], sizeof(int32_t),
src/operator/numpy/np_nonzero_op.cu: CUDA_CALL(cudaMemcpy(out.data().dptr<int64_t>(), &temp, sizeof(int64_t),
src/operator/numpy/np_unique_op.cu: CUDA_CALL(cudaMemcpy(&valid_num, thrust::raw_pointer_cast(&prefix_sum[input_size - 1]),
src/operator/numpy/np_unique_op.cu: CUDA_CALL(cudaMemcpy(&valid_num, thrust::raw_pointer_cast(&prefix_sum[temp_shape[0] - 1]),
src/operator/numpy/np_unique_op.cu: CUDA_CALL(cudaMemcpy(outputs[0].data().dptr<DType>(), inputs[0].data().dptr<DType>(),
src/operator/numpy/random/dist_common.cu:CUDA_CALL(cudaMemcpy(dst, src, sizeof(float), cudaMemcpyDeviceToHost));
src/operator/numpy/random/dist_common.cu:CUDA_CALL(cudaMemcpy(dst, src, sizeof(double), cudaMemcpyDeviceToHost));
src/operator/numpy/random/np_multinomial_op.cu: CUDA_CALL(cudaMemcpy(&pvals_[0], input, sizeof(DType) * prob_length,
src/operator/rnn-inl.h: CUDA_CALL(cudaMemcpy(sequence_length_cpu_itype, sequence_length_ptr_gpu,
src/operator/tensor/cast_storage-inl.cuh: CUDA_CALL(cudaMemcpy(&nnr, &row_flg[num_rows - 1], sizeof(dim_t), cudaMemcpyDeviceToHost));
src/operator/tensor/cast_storage-inl.cuh: CUDA_CALL(cudaMemcpy(&nnz, &(indptr[num_rows]), sizeof(IType), cudaMemcpyDeviceToHost));
src/operator/tensor/dot-inl.cuh: CUDA_CALL(cudaMemcpy(&nnr, nnr_ptr, nnr_bytes, cudaMemcpyDeviceToHost));
src/operator/tensor/dot-inl.cuh: CUDA_CALL(cudaMemcpy(&nnr_out, &row_flg_out[num_cols_l-1], sizeof(dim_t),
src/operator/tensor/elemwise_binary_op_basic.cu: CUDA_CALL(cudaMemcpy(&nnr_out, &common_row_table[num_rows-1], sizeof(nnvm::dim_t),
src/operator/tensor/indexing_op.cu: CUDA_CALL(cudaMemcpy(&is_valid, is_valid_ptr, sizeof(char),
src/operator/tensor/indexing_op.cu: CUDA_CALL(cudaMemcpy(&nnr, grad_row_idx + data_size, sizeof(RType),
src/operator/tensor/indexing_op.cu: CUDA_CALL(cudaMemcpy(&nnr, &prefix_sum[num_rows-1], sizeof(dim_t),
src/operator/tensor/matrix_op.cu: CUDA_CALL(cudaMemcpy(&nnr, &out_indptr[indptr_len-1], sizeof(RType),
src/operator/tensor/square_sum.cu: CUDA_CALL(cudaMemcpy(&is_diff, is_diff_ptr, sizeof(int32_t), cudaMemcpyDeviceToHost));
@reminisce @haojin2 can we fix the cudaMemcpy calls in the numpy ops? They impact GPU performance.
@eric-haibin-lin let's create an issue for tracking
I already created one #16583
Force-pushed from 8e030d2 to ec298ec
Passed the nightly test locally. Result:
Just noticed our ... I have verified locally that this test passes with my PR. The test is still flaky though (failed within 10000 runs; use MXNET_TEST_SEED=107821594 to reproduce).
Force-pushed from d9d3b84 to 4dec974
With the updated implementation, the dropout seed comes from the CPU, so setting the seed on a particular GPU won't work (it did not work before, either).
@ptrendx any concerns about the new fix?
CUDNN_CALL(cudnnSetDropoutDescriptor(dropout_desc_, s->dnn_handle_,
                                     param_.p,  // discard probability
                                     dropout_states, dropout_bytes,
                                     seed_));
                                     0));
At a minimum, the comment should be updated. The way these calls work is:

cudnnSetDropoutDescriptor(..., dropout_states==NULL, ...)  // Set dropout probability and seed, leave states alone.
cudnnSetDropoutDescriptor(..., dropout_states!=NULL, ...)  // Set dropout probability and seed, init states based on these values.
cudnnRestoreDropoutDescriptor()                            // Set dropout probability, seed, and states ptr from provided args.
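For reference, a hedged sketch of those three call patterns against the cuDNN API; the handle, descriptor, and state-buffer names are placeholders, the seed values are illustrative, and error checking (MXNet's CUDNN_CALL) is omitted:

#include <cudnn.h>

// Sketch of the three usage patterns for a cuDNN dropout descriptor.
// `handle`, `desc`, `states`, and `state_bytes` are assumed to already exist.
void DropoutDescriptorPatterns(cudnnHandle_t handle,
                               cudnnDropoutDescriptor_t desc,
                               void* states, size_t state_bytes) {
  const float p = 0.5f;  // discard probability

  // 1) states == NULL: only probability and seed are recorded on the
  //    descriptor; the existing RNG states are left alone.
  cudnnSetDropoutDescriptor(desc, handle, p,
                            /*states=*/nullptr, 0, /*seed=*/1234ULL);

  // 2) states != NULL: probability and seed are set and the state buffer
  //    is (re)initialized from them -- the costly path.
  cudnnSetDropoutDescriptor(desc, handle, p,
                            states, state_bytes, /*seed=*/1234ULL);

  // 3) Restore: point the descriptor at an already-initialized state buffer;
  //    the states are not re-initialized, so the effective RNG sequence is
  //    whatever the states already encode.
  cudnnRestoreDropoutDescriptor(desc, handle, p,
                                states, state_bytes, /*seed=*/0ULL);
}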
OK, and I found another issue: the seed is fixed during resource initialization. So once a dropout layer is created, even if we don't fix the seed, each dropout result will be the same.
If we want each forward to produce a different result while the random result still respects the MXNet random seed, the only solution is to get the MXNet seed during each forward. Is that right?
New test case:

@with_seed()
def test_dropout_with_seed():
    info = np.iinfo(np.int32)
    seed = np.random.randint(info.min, info.max)
    _test_dropout(mx.cpu(), seed)
    _test_dropout(mx.gpu(), seed)
    _test_dropout(mx.cpu())
    _test_dropout(mx.gpu())

def _test_dropout(ctx, seed=None):
    data = mx.nd.ones((100, 100), ctx=ctx)
    dropout = mx.gluon.nn.Dropout(0.5)

    if seed is not None:
        mx.random.seed(seed)
    with mx.autograd.record():
        result1 = dropout(data)

    if seed is not None:
        mx.random.seed(seed)
    with mx.autograd.record():
        result2 = dropout(data)

    if seed is not None:
        # dropout with a fixed seed should return the same result
        assert_almost_equal(result1.asnumpy(), result2.asnumpy())
    else:
        # seed not fixed, results should be different
        with assert_raises(AssertionError):
            assert_almost_equal(result1.asnumpy(), result2.asnumpy())
Dick is looking into it, but I don't think the sentence "even if we don't fix seed, each dropout result will be the same" is true. What is true, though, is that the dropout as implemented now does not respond to a change of seed on the MXNet side after the initial creation of the op. Seeding each forward would destroy performance, so how about a solution like this: if you know that you will cache the value of the seed (as in the dropout descriptor resource case), every time you get the descriptor it should internally ask the random resource "did the user set a new seed?" (which could be implemented as a set in the random resource that keeps track of who has already asked, reset whenever the user calls mxnet.random.seed). If the answer is "no", no reseeding is required; if the answer is "yes", the dropout descriptor should be reseeded with a new value.
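A minimal sketch of the bookkeeping suggested above, assuming a hypothetical seed "epoch" counter on the random resource; the type and member names are illustrative, not the actual MXNet API:

#include <atomic>
#include <cstdint>

// Hypothetical sketch: the random resource bumps a counter every time the
// user calls mxnet.random.seed(); consumers that cache a seed (e.g. the
// cuDNN dropout descriptor) remember which epoch they last saw and reseed
// only when the epoch has changed.
struct RandomResourceSketch {
  std::atomic<uint64_t> seed_epoch{0};
  void OnUserSeed(uint64_t /*new_seed*/) { seed_epoch.fetch_add(1); }
};

struct CachedDropoutDescSketch {
  uint64_t seen_epoch = ~0ULL;  // force a reseed on first use
  bool NeedsReseed(const RandomResourceSketch& rnd) {
    uint64_t current = rnd.seed_epoch.load();
    if (current != seen_epoch) {
      seen_epoch = current;
      return true;   // caller should reseed via cudnnSetDropoutDescriptor
    }
    return false;    // fast path: keep the cached descriptor and states
  }
};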
Echoing @ptrendx's comment, we should implement SetSeed() for the cudnn resource in https://github.com/apache/incubator-mxnet/blob/master/src/resource.cc#L174-L201
@roywei I've been playing with a variant of your proposed test in which I set the seed to two different values for the two models and expect the results to be different. This fails, because the results are identical even with the differing seeds. The two models each get their own GPU random resource, but the two are seeded by CPU random number generators that are identical. The problem here is that the CPU RNGs are not responding to mx.random.seed(), and instead have their seed set to 0. The reason is that CPU RNGs are requested from the ResourceManager, and the ResourceManager is a thread-local variable. The main Python thread (performing the mx.random.seed()) only affects the global_seed_ data member of its own ResourceManager instance, which does not affect the seeds of the CPU RNGs requested from the worker thread's ResourceManager.
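As a generic illustration of the thread-local issue described above (plain C++, not MXNet code): seeding the main thread's thread_local generator never touches the worker thread's instance, which keeps its default seed.

#include <iostream>
#include <random>
#include <thread>

// Generic analogy for the ResourceManager issue: each thread gets its own
// thread_local generator, so seeding it from the main thread (the analogue
// of mx.random.seed()) does not touch the worker thread's copy.
thread_local std::mt19937 tls_rng{0};  // every thread starts from seed 0

int main() {
  tls_rng.seed(42);  // "seeds" only the main thread's instance
  std::cout << "main thread draw:   " << tls_rng() << "\n";

  std::thread worker([] {
    // This is a different thread_local object, still seeded with 0.
    std::cout << "worker thread draw: " << tls_rng() << "\n";
  });
  worker.join();
  return 0;
}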
Yeah, this is going to be tricky to fix solidly, particularly considering models with multiple dropouts and RNNs. If all the GPU RNGs of a worker indeed share the same RNG state, then only the seed from the first RNG in a model affects the initialization (the other seeds are ignored). Also, I believe it is non-deterministic which GPU worker handles an operator during execution, so this represents a problem (it could be the cause of some low-probability test failures). I believe moving the GPU RNG state to a resource was motivated by the high initialization overhead, which is particularly painful for imperative models. Would it be possible to add a seed argument at the Python level to the dropout and RNN operators, with the understanding that by setting the seed, the operator gets its own RNG state (at some memory and initialization expense)? Not setting a seed would grab the global resource; I'm not sure how the determinism would work, but it would be fast.
Closing in favor of #17547.
Description
fix #15662