SINGA-487 Include NCCL and MPICH in conda build #624
@dcslin, have you encountered the same error?
@dcslin FYI
I have locked the versions to nccl 2.4.8 and mpich 3.3.2, because newer versions of these dependencies may change their APIs in the future. After including nccl and mpich, the build can be done with -DUSE_DIST=ON. There is another problem in the conda build due to the absence of a conda dnnl package, but I am not going to add dnnl in this PR. Therefore, I temporarily use -DUSE_DNNL=OFF. So the purpose of this PR is just to add nccl 2.4.8 and mpich 3.3.2 to conda and turn on -DUSE_DIST=ON.
I thought I had fixed the deprecated module multiple times...
@chrishkchris, is the error due to dnnl missing? If yes, then the solution is to build a dnnl conda package. Any other errors?
1bd2c48 to 6da1845
@dcslin
The Travis build is always CPU-only, as there is no GPU in Travis.
You may be able to build CUDA code in Travis without a GPU; see this example project. However, Travis will not be able to run or test it. With GitHub Actions, it is possible to build and test with GPU support using self-hosted runners, so SINGA can be tested on the SINGA GPU servers at NUS using GitHub Actions self-hosted runners. This means we would not need Travis or Jenkins if we switched to GitHub Actions, which can replace both of them with simpler and better features.
Sounds great!
5968efe to d112000
I have added the folder singa/dist and made the corresponding conda config settings (see the code change). However, I don't know how to solve the error encountered in Travis CI. This time it returned the message "cudnn is undefined". Is there anywhere else I need to change?
Thanks, I will try it first.
As I said, Travis has no GPUs, hence it will not build the dist package.
Yes, I will use Travis to build the CPU version only, so I am learning how to switch off cudnn, nccl and mpich in the Travis CPU build.
If there is no CUDA, the condition check
15fed97 to e4b656e
I simplified the selection logic to use only one environment variable, CUDA.
If we need two environment variables (CUDA and DIST) to determine the selection logic, I can change it.
Yes, sure. I will change it to use two environment variables: CUDA (9.0 or 10.0) and DIST (set or not set).
I changed the selection logic to use the two environment variables.
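A minimal sketch of the two-variable selection logic, assuming the recipe folders are the singa/cpu, singa/gpu and singa/dist folders mentioned in this PR and that DIST only matters when CUDA is set (Travis has no GPUs, so it never builds the dist package). The function name select_recipe is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose a conda recipe folder from two environment
# variables, CUDA (e.g. 9.0 or 10.0; unset for a CPU build) and DIST
# (set or not set). Folder names are assumptions based on this PR.
select_recipe() {
  if [ -z "${CUDA:-}" ]; then
    # No CUDA (e.g. Travis): CPU-only recipe, so cudnn, nccl and mpich
    # are not needed at all.
    echo "singa/cpu"
  elif [ -n "${DIST:-}" ]; then
    # CUDA plus DIST: the recipe that also pulls in nccl and mpich.
    echo "singa/dist"
  else
    # CUDA without DIST: plain GPU recipe.
    echo "singa/gpu"
  fi
}

# Usage example: print the recipe chosen for the current environment.
select_recipe
```

Keying the CPU case on CUDA alone means a stray DIST setting on a GPU-less CI machine still yields the CPU recipe.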
tool/conda/dist/meta.yaml
# under the License.
#

{% set version = "2.1.0.dev" %}
@dcslin, do we need to remove the hard-coded version?
We can remove {% set version = "2.1.0.dev" %}.
Assuming the latest git tag is always available and reliable, we could replace all {{ version }} with {{ environ.get('GIT_DESCRIBE_TAG') }}.
One side effect is that in the Travis build we need to run git fetch --unshallow to get all the tags (including the latest) from the remote into the build environment before the build starts.
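A minimal sketch of that Travis-side step, assuming a bash shell, a git checkout and git 2.15 or newer (needed for git rev-parse --is-shallow-repository). The helper names are hypothetical. git fetch --unshallow exits with an error on a repository that is already complete, so the sketch checks first; the printed tag is the value GIT_DESCRIBE_TAG is expected to correspond to, although conda-build's exact derivation may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a CI job whose clone may be shallow
# (Travis clones with limited depth by default), so tags can be missing.

# Fetch full history and tags only when the clone is shallow;
# "git fetch --unshallow" fails on a complete repository.
ensure_full_history() {
  if [ "$(git rev-parse --is-shallow-repository)" = "true" ]; then
    git fetch --unshallow --tags
  fi
}

# Print the most recent tag reachable from HEAD, i.e. the version that
# {{ environ.get('GIT_DESCRIBE_TAG') }} is expected to pick up.
latest_tag() {
  git describe --tags --abbrev=0
}
```

In a CI script, ensure_full_history would run before conda build, and echoing "$(latest_tag)" is a quick way to confirm which version the recipe will receive.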
I have modified the code (singa/cpu, singa/gpu, singa/dist) based on @dcslin's suggestion:
- remove {% set version = "2.1.0.dev" %}.
- replace {{ version }} with {{ environ.get('GIT_DESCRIBE_TAG') }}.
If this is not OK, I can revert one commit to undo this.
tool/conda/singa/build.sh
 mkdir build
 cd build
 cmake -DCMAKE_INSTALL_PREFIX=$PREFIX -DUSE_CUDA=$USE_CUDA \
-  -DUSE_PYTHON3=ON -DUSE_MKLDNN=ON -DCMAKE_OSX_SYSROOT=${CONDA_BUILD_SYSROOT} ..
+  -DUSE_PYTHON3=ON -DUSE_DNNL=OFF -DUSE_DIST=$USE_DIST -DCMAKE_OSX_SYSROOT=${CONDA_BUILD_SYSROOT} ..
USE_DNNL is a variable whose value depends on the environment. It should not be hard-coded.
DNNL is necessary for CNNs, so I will change this to -DUSE_DNNL=ON after the DNNL package is ready. (In the past, -DUSE_MKLDNN=ON was the default.)
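A minimal sketch of how build.sh could read USE_DNNL from the environment instead of hard-coding it, keeping OFF as the default until a DNNL conda package exists. The helper name cmake_feature_flags and all default values are assumptions; the flag names are the ones in the build.sh diff above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch for tool/conda/singa/build.sh: take feature switches
# from the environment, falling back to assumed defaults (CPU-only, no
# distributed training, and no DNNL until a DNNL conda package exists).
cmake_feature_flags() {
  printf '%s\n' "-DUSE_CUDA=${USE_CUDA:-OFF} -DUSE_PYTHON3=ON -DUSE_DNNL=${USE_DNNL:-OFF} -DUSE_DIST=${USE_DIST:-OFF}"
}

# In build.sh it could be used like this (left commented out here):
#   mkdir build && cd build
#   cmake -DCMAKE_INSTALL_PREFIX=$PREFIX $(cmake_feature_flags) \
#     -DCMAKE_OSX_SYSROOT=${CONDA_BUILD_SYSROOT} ..
cmake_feature_flags
```

Once the DNNL package is ready, switching it on becomes a matter of setting USE_DNNL=ON in the build environment rather than editing the script.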
I just resolved the conflict, and the Travis CPU build is successful. Some test cases in test_onnx_backend fail with "ImportError: libprotobuf.so.20: cannot open shared object file: No such file or directory"; this may be solved later. So this PR concerning NCCL and MPICH should be ready for merge. Thanks!
OK, I updated the versions of the dependent conda packages so that SINGA and the ONNX backend can coexist (#631). This PR is ready for merge.
@joddiy, is there any big change from ONNX 1.5 to 1.6?
The operator set version of ONNX 1.5 is 10, and for 1.6 it is 11. There are no big changes, but some operators changed: some operator attributes have been moved to inputs, such as the min and max values of Clip. This was a big problem for my previous implementation; however, I reconstructed the backend and frontend last week, so it is fine for now to change from ONNX 1.5 to 1.6. Do we need to?
If there are dependent libs compatible with 1.5, then we have to change to 1.6.
(Email reply to Joddiy Zhang's comment of Mon, Mar 23, 2020, 11:49 AM.)
OK, it is only a few changes for me, not too many. Please change to ONNX 1.6.
When I restore the numpy version, the Ubuntu build passes, but the macOS build does not.
It seems that 1.16.5 is okay; I will wait for Travis CI.
numpy 1.16.5 is OK.
Ready for merge.
It seems that adding nccl and mpich to the SINGA conda build is okay, but we need to check further and add other things, such as the Python "deprecated" module.