Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix horovod build failure when mxnet is built from source #15213

Merged
merged 11 commits into from
Jun 16, 2019

Conversation

apeforest
Copy link
Contributor

@apeforest apeforest commented Jun 11, 2019

Description

Fix #15211

The problem was due to the missing mkldnn_version.h file under include/mkldnn.

The mkldnn_version.h is generated at compile time by MKLDNN library. It is linked to 3rdparty/mkldnn/build/include when mxnet is built using make, the default build system currently supported in MXNet.

An earlier PR #14899 fixed the issue when mxnet is installed by pip. This PR is to fix the build error when mxnet is built from source.

Thanks @TaoLv for rootcausing the problem and his suggestions.

@apeforest
Copy link
Contributor Author

@PatricZhao @yuxihu @TaoLv Please help to review.

@TaoLv
Copy link
Member

TaoLv commented Jun 11, 2019

Seems cmake is not happy with the change :(

@apeforest
Copy link
Contributor Author

apeforest commented Jun 11, 2019

Seems cmake is not happy with the change :(

Unfortunately this is a hotfix that can unblock user from building horovod with mxnet from source. I agree we need a better solution in the long run to solve the dual building system issue and exporting header files. For 1.5 release, we can ask users to build using make, which I think is still the de facto build system of mxnet in 1.5 release.

Copy link
Member

@yuxihu yuxihu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. Please change mxnet_version.h to mkldnn_version.h in the PR description.

@@ -0,0 +1 @@
../../3rdparty/mkldnn/include/mkldnn.h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to link this file? We only miss mkldnn_version.h, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier it was linked to the entire 3rdparty/mkldnn/include folder. However it does not contain mkldnn_version.h there. In this change, I only link to required files instead of linking to the entire folder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you verify if it works? I remember mkldnn.h may depend on other header files in the original folder.

Copy link
Contributor Author

@apeforest apeforest Jun 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I could verify it works fine in my local branch.

make -j$(nproc)
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
pip install --no-cache-dir horovod

Here is the files in my include:

(mxnet) ubuntu@ip-172-31-7-88:~/src/mxnet$ ls include/mkldnn/
mkldnn.h  mkldnn_version.h

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. Please retrigger the CI.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the including chain is: ndarray.h -> mkldnn.hpp -> mkldnn.h -> mkldnn_types.h and mkldnn_version.h.

@roywei
Copy link
Member

roywei commented Jun 12, 2019

I don't think re-trigger CI is going to pass, is there any way to fix CMake errors in CI?

@apeforest
Copy link
Contributor Author

apeforest commented Jun 12, 2019

@TaoLv This is not working because there are two issues in CI:
(1) ASAN is using Cmake and it will break
(2) CI lint complains about mxnet_version.h. Since it's a generated file at compile time, I cannot fix it.

After some more thoughts, I think a more elegant fix may be modifying the make script in MKLDNN module such that it copies mkldnn_version.h to the include folder after compilation is complete? This way it solves problem for both cmake and make, and it does not cause any CI issue in MXNet.

This problem is not limited to horovod. Any downstream project that needs to build mxnet from source will fail due to this. I think this is currently a blocker for MXNet release.

@PatricZhao Your advice is highly appreciated. Thanks!

@marcoabreu
Copy link
Contributor

My two cents are that we must not break make as well as cmake. Both are supported and should be kept.

@TaoLv
Copy link
Member

TaoLv commented Jun 13, 2019

@apeforest I think we need add an install target to the Makefile and copy libraries and including files to the install destination. Downstream projects like horovod should find libraries and headers there, not in the source code of mxnet or mkldnn. @szha @marcoabreu

install:
       mkdir -p $(INSTALL_PREFIX)/lib/mxnet
       mkdir -p $(INSTALL_PREFIX)/include/mxnet
       cp -rf lib/* $(INSTALL_PREFIX)/lib/mxnet/
       cp -Lrf include/* $(INSTALL_PREFIX)/include/mxnet/

       if [ -e "lib/libmkldnn.so.0" ]; then \
               mkdir -p $(INSTALL_PREFIX)/include/mxnet/mkldnn; \
               cp -rf 3rdparty/mkldnn/build/install/include/* $(INSTALL_PREFIX)/include/mxnet/mkldnn/; \
       fi

@apeforest
Copy link
Contributor Author

@TaoLv There is an issue with this approach:
When Horovod builds with mxnet, it actually uses mxnet.libinfo.find_include_path() to look for the header file location. With your suggested change, how can mxnet know the include path at runtime?

@apeforest apeforest requested a review from szha as a code owner June 14, 2019 00:08
@TaoLv
Copy link
Member

TaoLv commented Jun 14, 2019

My two cents:

  1. If you really want to copy files in source code folder, you can do that in the mkldnn.mk file. It will help to leave MXNet Makefile clean.
  2. The change will generate new folder and file under include/mxnet folder which will be shown as untracked files when call git status and the new folder and file cannot be removed by make clean.

@apeforest
Copy link
Contributor Author

@TaoLv Thanks for your suggestion.

  1. I updated mkldnn.mk to leave Makefile clean
  2. I updated .gitignore file so that you would not see untracked files in include/mkldnn
  3. It can be cleaned by make clean.

Please help to review again.

@vandanavk
Copy link
Contributor

@mxnet-label-bot add [pr-awaiting-review, Distributed]

@marcoabreu marcoabreu added Distributed pr-awaiting-review PR is waiting for code review labels Jun 14, 2019
@roywei
Copy link
Member

roywei commented Jun 15, 2019

Thanks for the fix!! I have verified this works. Please trigger CI again.
mxnet:

sudo make -j$(nproc) USE_CUDA=1 USE_CUDNN=1 USE_CUDA_PATH=/usr/local/cuda/ USE_NCCL=1

horovod:

 HOROVOD_GPU_ALLREDUCE=NCCL  pip install horovod --user --no-cache-dir -U

result:

Collecting horovod
  Downloading https://files.pythonhosted.org/packages/42/f8/0a2fedf45122d8a1b2dbd573e737ccb32cd0776aa4c4b157d3f18b9ff0ca/horovod-0.16.4.tar.gz (2.6MB)
    100% |████████████████████████████████| 2.6MB 87.4MB/s 
Requirement already satisfied, skipping upgrade: cloudpickle in /usr/local/lib/python2.7/dist-packages (from horovod) (1.0.0)
Requirement already satisfied, skipping upgrade: psutil in /home/ubuntu/.local/lib/python2.7/site-packages (from horovod) (5.6.3)
Requirement already satisfied, skipping upgrade: six in /usr/local/lib/python2.7/dist-packages (from horovod) (1.10.0)
Requirement already satisfied, skipping upgrade: cffi>=1.4.0 in /home/ubuntu/.local/lib/python2.7/site-packages (from horovod) (1.12.3)
Requirement already satisfied, skipping upgrade: pycparser in /home/ubuntu/.local/lib/python2.7/site-packages (from cffi>=1.4.0->horovod) (2.19)
Installing collected packages: horovod
  Running setup.py install for horovod ... done
Successfully installed horovod-0.16.4

git status

git status
On branch bugfix/horovod_build
Your branch is up-to-date with 'origin/bugfix/horovod_build'.
nothing to commit, working directory clean```

`sudo make clean` also removes it

Copy link
Contributor

@pengzhao-intel pengzhao-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great that the horovod issue is fixed :)

@TaoLv please review the PR again

@@ -68,6 +68,9 @@
# Git submodules under different licenses
'3rdparty',

# 3rdparty headerfiles under different licenses
'include/mkldnn',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain? MKL-DNN is also under Apache-2.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine mkldnn would be licensed to intel instead of ASF even though the license is apache 2.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

license check would fail if this is not whitelisted.

Copy link
Member

@TaoLv TaoLv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve now. But just for the record, I don't really like this solution but take it as a hot fix to make horovod happy and to unblock the 1.5.0 release. We need consider a better way to expose headers to downstream projects. Looking for headers from source code is definitely not a good choice and is not guaranteed - that's why horovod's build was broken when MKL-DNN is updated.

@@ -1 +0,0 @@
../3rdparty/mkldnn/include
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just change this to the install path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we do not need to export all the header files under the install path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's simpler that way. Otherwise next time if there's any change mkldnn header we need to go through this all over again.

@roywei
Copy link
Member

roywei commented Jun 16, 2019

I have created feature request to better address this #15255

Can we merge this PR if it all looks good?

@apeforest apeforest added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-review PR is waiting for code review labels Jun 16, 2019
@szha szha merged commit 85aaa3a into apache:master Jun 16, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* fix horovod build when mxnet is built from source

* copy header file in mkldnn

* copy mkldnn_version header file to include

* remove unnecessary line

* fix the untracked file

* whitelist mkldnn headerfile

* exlude lint path
@apeforest apeforest deleted the bugfix/horovod_build branch August 23, 2019 17:09
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Distributed pr-awaiting-merge Review and CI is complete. Ready to Merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MXNet with MKLDNN failed to install horovod
8 participants