
MKL-DNN integration: request for reviews #7931

Closed
wants to merge 97 commits

Conversation

@ykim362 (Contributor) commented Sep 18, 2017

This PR is a beta version, posted for code review and experimentation. There are several known issues, which are being debugged.

If this version is built with the 'USE_MKL2017' and 'USE_MKL2017_EXPERIMENTAL' flags, it provides the same functionality and performance as the current MKLML release. If it is built with the 'USE_MKLDNN' flag, it goes through the new code path (the MKL-DNN integration), as sketched below.
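
To make the flag behavior concrete, here is a minimal, self-contained C++ sketch of how such make flags typically gate code paths at compile time. The macro and function names below are illustrative assumptions, not the PR's actual symbols.

// Illustrative only: how USE_MKLDNN / USE_MKL2017 make flags commonly
// translate into compile-time dispatch. All names are hypothetical.
#include <cstdio>

static void RunMKLDNNConv()    { std::puts("MKL-DNN path"); }
static void RunMKLMLConv()     { std::puts("MKLML (MKL2017) path"); }
static void RunReferenceConv() { std::puts("reference CPU path"); }

void ConvForward() {
#if defined(USE_MKLDNN)
  RunMKLDNNConv();        // the new code path this PR adds
#elif defined(USE_MKL2017)
  RunMKLMLConv();         // behavior of the current MKLML release
#else
  RunReferenceConv();     // default CPU fallback
#endif
}

int main() { ConvForward(); }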

MKL-DNN

A new open-source deep learning library providing DNN kernels optimized for Intel Architecture (IA).
https://github.com/01org/mkl-dnn

Advantages

More functionality

New functionality will mainly be added to MKL-DNN rather than to the MKLML library. Two examples:

  1. Fused RNN cell (to be added)
  2. int8 inference (to be added)

Performance optimization

As of Sep. 18, 2017. All units are img/sec.

On a 20-core Skylake machine (6148):
  • AlexNet inference (BS 256): 1474 (MKLML) --> 1568 (MKL-DNN)
  • Inception-BN inference (BS 32): 454 (MKLML) --> 483 (MKL-DNN)

On KNL (7250):
  • ResNet-50 inference (BS 32): 99 (MKLML) --> 116 (MKL-DNN)

Known issues

  • Convergence (ResNet training)

Contributors for this PR

@ashokei @karkadad @louisfeng @adstraw

@ykim362 (Contributor, Author) commented Oct 18, 2017

@piiswrong We have fixed the convergence issue in ResNet. There were some problems in the Conv and Batch Norm layers. We have also added further optimizations for more speed-ups. The MKL-DNN version is now, on average, 15% faster than MKLML (MKL2017) for both inference and training.

@ykim362 (Contributor, Author) commented Oct 18, 2017

@szha @piiswrong MKL-DNN doesn't support the fp64 (double) data type. Do you think this is an issue? The library team is focusing more on adding lower-precision types.

* @param req
* @param out_data
*/
template<typename xpu, typename DType>

Contributor: Why do you need xpu for any MKLDNN functions? Doesn't the code always run on CPU?

Reply: This is a good point. I was following the convention of the other MKLDNN operators for consistency. I can change this to cpu only.

Reply: After discussing with Young and Ashok, we would like to keep this template parameter to support future Intel devices; we may support devices other than the traditional CPU in the future (see the sketch below).
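
To make the trade-off concrete, a minimal, self-contained sketch (all names here are hypothetical stand-ins, not the PR's actual code):

// Hypothetical sketch: keeping xpu as a template parameter costs
// nothing today and leaves room for a second device tag later.
struct cpu {};           // stands in for mshadow::cpu
struct intel_device {};  // hypothetical future Intel device tag

template<typename xpu, typename DType>
void MKLDNNOpForward() {
  // Only xpu = cpu is meaningful today; a future device would need
  // just a new instantiation, not a signature change at every
  // registration and call site.
}

int main() {
  MKLDNNOpForward<cpu, float>();  // the only instantiation in use today
}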


if (req[0] == kNullOp) return;

Stream<xpu> *s = ctx.get_stream<xpu>();

Contributor: It doesn't seem the stream is used anywhere.

Reply: This is true right now. We may need the stream when we support tensor layouts besides nchw; I can remove it for now (see the sketch below).
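
A small self-contained sketch of the point being made; OpContext and Stream below are stub stand-ins for the real mxnet/mshadow types:

template<typename xpu> struct Stream {};

struct OpContext {
  template<typename xpu>
  Stream<xpu>* get_stream() const { return nullptr; }
};

struct cpu {};

template<typename xpu>
void Forward(const OpContext& ctx) {
  Stream<xpu>* s = ctx.get_stream<xpu>();  // fetched here...
  (void)s;  // ...but never handed to an MKL-DNN primitive, so the line
            // can be dropped until a non-nchw path actually needs it.
}

int main() {
  OpContext ctx;
  Forward<cpu>(ctx);
}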

.set_attr<FInferStorageType>("FInferStorageType",
ElemwiseStorageType<2, 1, true, false, false>) \
.set_attr<FCompute>("FCompute<cpu>", MKLDNNElementWiseAddCompute<cpu>) \
.set_attr<FComputeEx>("FComputeEx<cpu>", ElemwiseBinaryOp::ComputeEx<cpu, mshadow::op::plus>)

Contributor: Is the MKLDNN implementation only defined for FCompute?

Reply: FComputeEx is for NDArray inputs, while all current MKLDNN operators support only TBlobs, and the main benefit of the MKLDNN elementwise-sum operator comes from composing with other MKLDNN operators (see the sketch below). With the upcoming sparse tensor support, we may need to make some adjustments.
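
For context, a self-contained sketch of the two interfaces being contrasted. The signatures are paraphrased from MXNet's operator attribute types, with stub types so it compiles standalone; treat the details as approximate rather than the exact headers.

#include <functional>
#include <vector>

// Stub types so this sketch stands alone; in MXNet these are the real
// nnvm::NodeAttrs, OpContext, TBlob, NDArray, and OpReqType.
struct NodeAttrs {};
struct OpContext {};
struct TBlob {};    // dense tensor view -- what the MKLDNN kernels consume
struct NDArray {};  // can carry non-default storage (e.g. sparse)
enum OpReqType { kNullOp, kWriteTo, kWriteInplace, kAddTo };

// FCompute: dense TBlob inputs/outputs -- the path where the MKLDNN
// elementwise-add kernel is registered.
using FCompute = std::function<void(const NodeAttrs&, const OpContext&,
                                    const std::vector<TBlob>&,
                                    const std::vector<OpReqType>&,
                                    const std::vector<TBlob>&)>;

// FComputeEx: NDArray inputs/outputs, so the storage type can differ
// from default -- which is why that path falls back to the generic
// ElemwiseBinaryOp::ComputeEx in the registration above.
using FComputeEx = std::function<void(const NodeAttrs&, const OpContext&,
                                      const std::vector<NDArray>&,
                                      const std::vector<OpReqType>&,
                                      const std::vector<NDArray>&)>;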

@szha (Member) commented Oct 19, 2017

@ykim362 could you verify if #8196 is fixed?

@ykim362 (Contributor, Author) commented Oct 19, 2017

@szha Sure, I am looking into it.

@szha (Member) commented Oct 19, 2017

@ykim362 BTW, is the fix in mklml_lnx_2018.0.20170908.tgz? Does it make sense to upgrade the library for the MKL2017 use case? Many people are using the MKL build (with MKL2017 and the experimental flag on).

@ykim362 (Contributor, Author) commented Oct 20, 2017

@szha MKL-DNN also utilizes MKL2017, so it would be useful to update MKL2017 (2018) as well. The fix itself, though, would be in the MXNet code; I am still investigating it.

@jesterhazy (Contributor) commented

Is there an update on this issue? We are keen to include MKL in an upcoming project.

@ykim362 (Contributor, Author) commented Oct 30, 2017

@piiswrong @szha @sbodenstein MKL-DNN now officially supports OSX, so we no longer need to worry about this issue.
https://github.com/01org/mkl-dnn/releases/tag/v0.11

@sbodenstein (Contributor) commented

@ykim362: thanks! I saw that.

BTW, do you have any estimate of when this PR will be ready?

@ykim362 (Contributor, Author) commented Nov 7, 2017

@sbodenstein From my understanding, this PR is not going to be merged directly; it will be merged as part of another revision with sparse tensor support (MKL storage). @piiswrong Is this correct?

@sbodenstein (Contributor) commented

@ykim362: do you know whether bugs like the ResNet convergence bug are still unsolved with MKL-DNN v0.11?

@leezu (Contributor) commented Nov 20, 2017

@sbodenstein MKL with v0.11 is quite buggy. I often got inf values during training for no obvious reason (i.e., training is stable without MKL and on GPU). I'm not sure about v0.12 because of #8280.

(That is, MKL is buggy even without MKL-DNN.)

@piiswrong (Contributor) commented

Closing, since @zheng-da is making a new PR for this.

@piiswrong closed this Dec 12, 2017