This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add axes support to Dropout for variational dropout in NLP #9931

Merged
merged 8 commits into apache:master on Mar 6, 2018

Conversation

@zhanghang1989 (Contributor) commented Mar 1, 2018:

Add axes support to dropout for variational dropout in NLP.

  1. Tests pending.
  2. The MKL part hasn't been updated.

@szha Could you test this implementation? Pinging @yzhliu for the MKL implementation.
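
For context, variational dropout reuses a single dropout mask along the chosen axes (for example, the same mask for every time step of a sequence in NLP) instead of sampling a fresh mask per element. The following is a rough numpy sketch of the intended semantics of the new axes argument; the helper name and signature are illustrative assumptions, not the MXNet implementation.

import numpy as np

def variational_dropout(x, pkeep, axes=(), rng=np.random.default_rng()):
    # Hypothetical reference semantics: the mask is shared (broadcast) along `axes`;
    # empty `axes` reduces to ordinary elementwise (inverted) dropout.
    mask_shape = tuple(1 if ax in axes else dim for ax, dim in enumerate(x.shape))
    mask = (rng.random(mask_shape) < pkeep) / pkeep  # keep with prob pkeep, scale by 1/pkeep
    return x * mask  # broadcasting reuses the mask along `axes`

# (batch, seq_len, hidden) input with the mask shared across the time axis:
x = np.ones((2, 5, 4))
y = variational_dropout(x, pkeep=0.6, axes=(1,))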

Description

(Brief description on what this PR is about)

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
    • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
    • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
    • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
    • For user-facing API changes, the API doc string has been updated.
    • For new C++ functions in header files, their functionality and arguments are documented.
    • For new examples, a README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable.
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with it.

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@cjolivier01 (Member) left a comment:

What's the speed difference between the old and new default-axes implementations (CPU, GPU, MKL)? It can be measured with the dropout_perf test.

@szha self-assigned this on Mar 1, 2018
LOG(FATAL) << "NDim too large "; \
}

inline int BinaryBroadcastShapeCompact(const TShape& lshape, const TShape& rshape,

Member:
These look like a copy from src/operator/tensor/elemwise_binary_broadcast_op.h. Can we avoid copying the code?

*new_lshape = TShape(odim);
*new_rshape = TShape(odim);
*new_oshape = TShape(odim);
index_t bl = oshape.ndim() - lshape.ndim();

Member:
nit: const index_t
Should we add a CHECK() for oshape.ndim() >= lshape.ndim()?

@cjolivier01 (Member):
If MKL doesn't support the non-default axes behavior, it should skip using MKL for non-default axes, similarly to how batch norm doesn't use MKL or cuDNN for a non-default channel axis.

if (dshape.ndim() == 0) return false;
out_shape->clear();
out_shape->push_back(dshape);
if (param.axes.ndim() != 0) {

Member:
can be removed

Contributor Author:
The axes parameter can be empty for normal dropout :)

for (int i = 1; i < length; ++i) {
inc(&coord, oshape, &lidx, lstride, &ridx, rstride);
// When tuning, don't actually run the op, since it's not going to be tuned against
// the actual op we'll eventually be using

Member:
I can't quite get the point of this comment.

in_data[dropout::kData].dptr<DType>(),
mask.dptr<DType>(), out.dptr<DType>());
});
}
}

Member:
If MKL is enabled, will the broadcast not happen?

Contributor Author:
I haven't updated the MKL code for variational dropout (enabling axes). I need help with the MKL part.

Member:
For non-default axes, you can fall back to this op instead of the MKL op.

mask.dptr<DType>(),
this->pkeep_);
if (req[0] != kNullOp) {
// broardcast mul

Member:
Typo: "broardcast" should be "broadcast".

const real_t pkeep) {
RNG_KERNEL_LOOP(xpu, DType, id, gen, N, step, {
const real_t rand_num = static_cast<real_t>(genImpl.uniform());
mask_out[i] = mshadow_op::threshold::Map<real_t>(rand_num, pkeep) * (1.0f / pkeep);
dropout_out[i] = input_data[i] * mask_out[i];

Member:
I am not saying it needs to be done, but have you considered merging this operation with the other kernel, perhaps by deriving from broadcast_kernel or passing a modified version of the mul OP to broadcast_kernel?
Making two full passes over the memory is going to cause a performance hit, due to both OMP overhead and CPU cache effects.
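
To make the concern concrete, below is a rough Python sketch (an explicit loop for illustration, not MXNet code or anything performance-tuned) of two separate kernels versus one fused kernel that draws, thresholds, scales and multiplies each element in a single traversal.

import numpy as np

rng = np.random.default_rng(0)

def dropout_two_pass(x, pkeep):
    # Pass 1: Bernoulli mask kernel (one full traversal writing the mask).
    mask = (rng.random(x.shape) < pkeep) / pkeep
    # Pass 2: elementwise multiply kernel (another full traversal of x and mask).
    return x * mask, mask

def dropout_fused(x, pkeep):
    # One traversal: the random draw, threshold, scale and multiply all happen
    # inside a single loop body, which is what a fused kernel would do.
    out = np.empty_like(x)
    mask = np.empty_like(x)
    xf, outf, maskf = x.ravel(), out.ravel(), mask.ravel()
    for i in range(xf.size):
        maskf[i] = (rng.random() < pkeep) / pkeep
        outf[i] = xf[i] * maskf[i]
    return out, mask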

Contributor Author:
I am not sure I understand you clearly.
I separated the original dropout kernel into two parts: 1) BernoulliKernel and 2) broadcast mul,
so that we can enable axes support for variational dropout.

Thx

Member:
Right. What's the performance impact of using two kernels instead of one?

Contributor Author:
Thx @cjolivier01. I get your point about efficiency. I have added a condition check for standard dropout, which has the same efficiency when no axes are provided:
https://github.com/apache/incubator-mxnet/pull/9931/files#diff-4aea2cc24c0bb4e8e48face9faf4aa26R252
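
Restated as a condensed Python sketch (illustrative names, not the C++ in this PR): when axes is empty the operator takes the original standard-dropout branch, so the default path keeps its old behavior and cost, and only a non-empty axes goes through the reduced-mask-plus-broadcast route.

import numpy as np

rng = np.random.default_rng()

def dropout_forward(x, pkeep, axes=()):
    if len(axes) == 0:
        # Standard dropout: full-shape mask, mirroring the pre-change behavior.
        mask = (rng.random(x.shape) < pkeep) / pkeep
        return x * mask
    # Variational dropout: mask sampled on the reduced shape,
    # then broadcast-multiplied over the input.
    mask_shape = tuple(1 if ax in axes else d for ax, d in enumerate(x.shape))
    mask = (rng.random(mask_shape) < pkeep) / pkeep
    return x * mask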

@@ -337,6 +336,7 @@ class DropoutOp {
real_t pkeep_;
/*! \brief Dropout mode */
dropout::DropoutOpMode mode_;
TShape axes;

Member:
nit: member variable name should end in an underscore

Contributor Author:
Thx 👍

@szha (Member) left a comment:
Looks good to me.

@zhanghang1989 could you provide some performance numbers for the speed difference before and after the change?

Pinging @cjolivier01 and @yzhliu for a final review. I intend to merge this as soon as possible, so I will wait for either your approval or three days for lazy consensus, whichever is earlier.

@zhanghang1989 (Contributor Author):
@szha The performance of dropout should be the same as before, when no axes are given.

@cjolivier01 (Member):
If you have a request for changes from a committer, you can't merge per Apache guidelines.

@cjolivier01 (Member):
What is the performance impact of these changes for default-axes behavior compared to the older code?

@szha (Member) commented Mar 6, 2018:
Hang suggested that if axes is empty, the behavior is exactly the same as before the change: https://github.com/apache/incubator-mxnet/pull/9931/files#diff-4aea2cc24c0bb4e8e48face9faf4aa26R252

@szha (Member) commented Mar 6, 2018:
Do the guidelines explain how committers deal with stale reviews?

@cjolivier01 (Member):
Apache says:

"A code-modification proposal may be stopped dead in its tracks by a -1 vote by a qualified voter. This constitutes a veto, and it cannot be overruled nor overridden by anyone. Vetos stand until and unless withdrawn by their casters."

I am guessing for a "stale" review (stale I would imagine > 2 months old?), a death certificate of said committer would be useful.

@szha (Member) commented Mar 6, 2018:
Haha OK. Jokes aside, did Hang sufficiently address your concern?

@cjolivier01 (Member):
Yeah, I'm good.

@yzhliu merged commit 40de6ab into apache:master on Mar 6, 2018
@marcoabreu (Contributor) commented Mar 6, 2018:
Sorry to be a bit late here, but this PR has no test coverage. Could you please elaborate, @yzhliu @szha?

@szha (Member) commented Mar 6, 2018:
@zhanghang1989 is working on it.

@zhanghang1989 (Contributor Author):
I am creating another PR for the unit tests.

@marcoabreu (Contributor):
Next time please make sure to have code changes and tests within the same PR instead of splitting them.

@piiswrong (Contributor):
@yzhliu @zhanghang1989
Please make sure we don't merge code without test coverage next time.

@zhanghang1989 (Contributor Author):
👍 Got it. My bad.

jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
* add axes support to dropout for variational dropout, test pending, mkl part hasn't been updated

* avoid copy code

* fix typo

* consider non broadcast case

* fix backward

* avoid mkl

* rm redundent

* condition check for standard dropout
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018