Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SINGA-487 Include NCCL and MPICH in conda build #624

Merged
merged 9 commits into from Mar 25, 2020

Conversation

chrishkchris
Copy link
Contributor

@chrishkchris chrishkchris commented Mar 9, 2020

Seems that adding nccl and mpich is okay in conda build of singa, but need to check further and add other thing such as python "deprecated"

======================================================================
ERROR: test_tensor (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: test_tensor
Traceback (most recent call last):
  File "/root/miniconda/conda-bld/singa_1583767754024/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/python3.6/unittest/loader.py", line 428, in _find_test_path
    module = self._get_module_from_name(name)
  File "/root/miniconda/conda-bld/singa_1583767754024/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/python3.6/unittest/loader.py", line 369, in _get_module_from_name
    __import__(name)
  File "/root/miniconda/conda-bld/singa_1583767754024/test_tmp/test/python/test_tensor.py", line 24, in <module>
    from singa import tensor
  File "/root/miniconda/conda-bld/singa_1583767754024/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/python3.6/site-packages/singa/tensor.py", line 58, in <module>
    from deprecated import deprecated
ModuleNotFoundError: No module named 'deprecated'


======================================================================
FAIL: test_conv2d (test_mkldnn.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/miniconda/conda-bld/singa_1583767754024/test_tmp/test/python/test_mkldnn.py", line 78, in test_conv2d
    self.assertAlmostEqual(4.0, _dW[0], places=5)
AssertionError: 4.0 != 0.040000003 within 5 places

----------------------------------------------------------------------
Ran 13 tests in 0.003s

FAILED (failures=1, errors=12)
TEST CONV2D FORWARD
TEST CONV2D DATA BACKWARD
TEST CONV2D WEIGHT BACKWARD
+ exit 0

Resource usage statistics from testing singa:
   Process count: 1
   CPU time: unavailable
   Memory: 3.0M
   Disk usage: 656B
   Time elapsed: 0:00:02.1

TEST END: /root/miniconda/conda-bld/linux-64/singa-2.1.0.dev-cudnn7.3.1_cuda10.0_py36.tar.bz2
Renaming work directory,  /root/miniconda/conda-bld/singa_1583767754024/work  to  /root/miniconda/conda-bld/singa_1583767754024/work_moved_singa-2.1.0.dev-cudnn7.3.1_cuda10.0_py36_linux-64_main_build_loop
# Automatic uploading is disabled
# If you want to upload package(s) to anaconda.org later, type:

anaconda upload /root/miniconda/conda-bld/linux-64/singa-2.1.0.dev-cudnn7.3.1_cuda10.0_py36.tar.bz2

# To have conda build upload to anaconda.org automatically, use
# $ conda config --set anaconda_upload yes

anaconda_upload is not set.  Not uploading wheels: []
####################################################################################
Resource usage summary:

Total time: 0:07:23.3
CPU usage: sys=0:00:06.5, user=0:02:17.0
Maximum memory usage observed: 777.8M
Total disk usage observed (not including envs): 252.4K


####################################################################################
Source and build intermediates have been left in /root/miniconda/conda-bld.
There are currently 4 accumulated.
To remove them, you can run the ```conda build purge``` command
root@3c17fd6cb72e:~/dcsysh/singa/tool/conda/singa#

@nudles
Copy link
Member

nudles commented Mar 10, 2020

@dcslin encountered the same error?

@chrishkchris
Copy link
Contributor Author

@dcslin FYI
One error should be solved after adding python deprecated module in conda
Another error is due to without the conda dnnl
i am not sure if there are other errors
by the way i success build the conda but only test error

@chrishkchris chrishkchris changed the title [WIP] SINGA-487 Include NCCL and MPICH in conda build SINGA-487 Include NCCL and MPICH in conda build Mar 10, 2020
@chrishkchris chrishkchris marked this pull request as ready for review March 10, 2020 14:48
@chrishkchris
Copy link
Contributor Author

@nudles

I have locked versions of nccl 2.4.8 and mpich 3.3.2, because it is possible that newer versions of dependencies may have API change in the future

After including nccl and mpich, the build can be done with -DUSE_DIST=ON.

There is other problem in conda build due to the inexistence of conda dnnl, but I am not going to add dnnl in this PR. Therefore, I temporary use -DUSE_DNNL=OFF.

So the purpose of this PR is just to add nccl 2.4.8 and mpich 3.3.2 in conda and turn on -DUSE_DIST=ON

@chrishkchris

This comment has been minimized.

@dcslin
Copy link
Member

dcslin commented Mar 11, 2020

@dcslin FYI
One error should be solved after adding python deprecated module in conda
Another error is due to without the conda dnnl
i am not sure if there are other errors
by the way i success build the conda but only test error

I thought i fixed deprecated module multiple times...

@chrishkchris

This comment has been minimized.

@nudles
Copy link
Member

nudles commented Mar 11, 2020

@chrishkchris is the error due to dnnl missing? if yes, then the solution is to build a dnnl conda package. any other errors?

@chrishkchris

This comment has been minimized.

@chrishkchris

This comment has been minimized.

@chrishkchris chrishkchris changed the title SINGA-487 Include NCCL and MPICH in conda build [WIP] SINGA-487 Include NCCL and MPICH in conda build Mar 12, 2020
@chrishkchris
Copy link
Contributor Author

@dcslin
Now it is better that travis-ci test pass in ubuntu.
The remaining work is to exclude nccl and mpich for CPU only version
I am thinking how to exclude them in CPU only version

@chrishkchris

This comment has been minimized.

@chrishkchris chrishkchris changed the title [WIP] SINGA-487 Include NCCL and MPICH in conda build SINGA-487 Include NCCL and MPICH in conda build Mar 12, 2020
@nudles
Copy link
Member

nudles commented Mar 12, 2020

travis build is always cpu-only as there is no GPU in travis.
I suggest to have a separate package for distributed training on GPUs like singa-dist. There is no need to compile singa-dist on CPU-only machines. The default singa packages should be compiled without mpich and nccl.

@chrishkchris

This comment has been minimized.

@moazreyad
Copy link
Contributor

travis build is always cpu-only as there is no GPU in travis.

You may be able to build cuda in Travis without GPU, see this example project. However, travis will not be able to run or test it.

With Github Actions, it is possible to build and test with GPU support with self-hosted runners. So SINGA can be tested on the singa GPU servers at NUS using Github Actions self-hosted runners. This means we do not need Travis or Jenkins if we switch to Github Actions which can replace both of them with much easier and better features.

@chrishkchris

This comment has been minimized.

@nudles
Copy link
Member

nudles commented Mar 12, 2020

travis build is always cpu-only as there is no GPU in travis.

You may be able to build cuda in Travis without GPU, see this example project. However, travis will not be able to run or test it.

With Github Actions, it is possible to build and test with GPU support with self-hosted runners. So SINGA can be tested on the singa GPU servers at NUS using Github Actions self-hosted runners. This means we do not need Travis or Jenkins if we switch to Github Actions which can replace both of them with much easier and better features.

sounds great!

@nudles
Copy link
Member

nudles commented Mar 12, 2020

Sorry, please let me clarify:

1. yes, travis is for cpu only, that’s why I created a separate folder tool/conda/singa/travis to NOT include the gpu related library cudnn, nccl, and mpich. This folder is only used by travis for cpu version build.

2. I added the nccl and mpich in the meta file of original folder tool/conda/singa, which is not for travis and used for gpu version of singa conda build

3. Currently this PR should be adding nccl into singa-gpu. Meanwhile, it is also good to separate singa-dist and singa-gpu
   Thanks a lot!
  1. since travis is for cpu-only, why do you create another folder for it? nothing needs to be changed. the scripts for travis building are here
  2. we only need to add a folder conda/dist like conda/cpu and conda/gpu for building singa-dist, and update the scripts in conda/singa/ accordingly. meta.yml of conda supports if like this line.

@chrishkchris

This comment has been minimized.

@chrishkchris
Copy link
Contributor Author

I have added the folder singa/dist and do the corresponding conda config setting (can view the code change). However, I don't know how to solve the error encountered in travis CI. This time it returned message "cudnn is undefined". Anywhere else I need to change?

tool/conda/singa/conda_build_config.yaml Outdated Show resolved Hide resolved
tool/conda/singa/meta.yaml Outdated Show resolved Hide resolved
tool/conda/singa/meta.yaml Outdated Show resolved Hide resolved
@chrishkchris
Copy link
Contributor Author

thanks, I try first

@nudles
Copy link
Member

nudles commented Mar 13, 2020

as I said, travis has no GPUs, hence it will not buil the dist package..

@chrishkchris
Copy link
Contributor Author

as I said, travis has no GPUs, hence it will not buil the dist package..

yes, I will use travis to build CPU version only, so I am learning how to switch off the cudnn, nccl, mpich in the travis cpu build

@nudles
Copy link
Member

nudles commented Mar 13, 2020

if there is no CUDA, the condition check # will fail, then the corresponding build will not happen.

@chrishkchris

This comment has been minimized.

@chrishkchris chrishkchris changed the title SINGA-487 Include NCCL and MPICH in conda build [WIP] SINGA-487 Include NCCL and MPICH in conda build Mar 13, 2020
@chrishkchris
Copy link
Contributor Author

chrishkchris commented Mar 13, 2020

I simplified the selection logic to use only one environment variable CUDA:

  1. export CUDA = 9.0, then use CUDA9.0 without nccl and moich
  2. export CUDA = 10.0, then use CUDA10.0 without nccl and mpich
  3. export CUDA = DIST, then use CUDA10.0 WITH nccl and mpich
  4. without export CUDA, then use CPU only

If we need two environment variables (CUDA and DIST) to determine the selection logic, I can change it.

@chrishkchris
Copy link
Contributor Author

Yes, sure, I will change to use two env variables: CUDA=9.0 or 10.0; DIST (set or not set)

@chrishkchris
Copy link
Contributor Author

I changed the selection logic to two env variables:

  1. CUDA = 9.0 or 10.0
  2. DIST = OFF or ON
    If the CUDA is not set, then will use CPU
    However, for singa-gpu or singa-dist, both the env variables CUDA and DIST should be defined, otherwise there is error (I tried to change the logic before to let DIST unset for default DIST=OFF, but this is not successful in conda_build_config.yaml)

@chrishkchris chrishkchris changed the title [WIP] SINGA-487 Include NCCL and MPICH in conda build SINGA-487 Include NCCL and MPICH in conda build Mar 17, 2020
tool/conda/dist/README.md Outdated Show resolved Hide resolved
# under the License.
#

{% set version = "2.1.0.dev" %}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcslin need to remove the hardcode version?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove {% set version = "2.1.0.dev" %}.

Assume that git latest tag is always available and reliable, then we could replace all {{ version }} with {{ environ.get('GIT_DESCRIBE_TAG') }}.

One side effect is in travis build, we need to git fetch --unshallow to get all the tags(including latest) from remote to build environment before build starts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have modified the codes (singa/cpu, singa/gpu, singa/dist) based on @dcslin suggestion:

  1. remove {% set version = "2.1.0.dev" %}.
  2. replace {{ version }} with {{ environ.get('GIT_DESCRIBE_TAG') }}.

If this is not ok, I can reverse one commit to cancel this

mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=$PREFIX -DUSE_CUDA=$USE_CUDA \
-DUSE_PYTHON3=ON -DUSE_MKLDNN=ON -DCMAKE_OSX_SYSROOT=${CONDA_BUILD_SYSROOT} ..
-DUSE_PYTHON3=ON -DUSE_DNNL=OFF -DUSE_DIST=$USE_DIST -DCMAKE_OSX_SYSROOT=${CONDA_BUILD_SYSROOT} ..
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

USE_DNNL is a variable whose value depends on the environment. It should not be hard coded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DNNL is necessary for CNN, so will change to -DUSE_DNNL=ON after DNNL package is ready. (In the past -DUSE_MKLDNN=ON at default)

tool/conda/singa/conda_build_config.yaml Show resolved Hide resolved
@chrishkchris

This comment has been minimized.

@chrishkchris
Copy link
Contributor Author

I just resolved the conflict, and the travis CPU build is successful.

some test case error on test_onnx_backend "ImportError: libprotobuf.so.20: cannot open shared object file: No such file or directory" may be solved later.

So this PR concerning NCCL and MPICH should be ready for merge. Thanks!

@chrishkchris
Copy link
Contributor Author

Ok, I updated the versions of dependence conda package to make singa and onnx backend coexists #631. This PR is ready for merge.

@nudles
Copy link
Member

nudles commented Mar 21, 2020

@joddiy is there any big change from onnx 1.5 to 1.6?

@joddiy
Copy link
Member

joddiy commented Mar 23, 2020

@joddiy is there any big change from onnx 1.5 to 1.6?

The operator's version of onnx 1.5 is 10 and for 1.6 is 11, not big changes but some about operators, some operators attributes have been moved to its inputs, such as clip, the max and min values. It's big trouble for my previous implement, however, I re-constructed the backend and frontend last week, so it's fine for now to change from onnx 1.5 to 1.6. Do we need to?

@nudles
Copy link
Member

nudles commented Mar 23, 2020 via email

@joddiy
Copy link
Member

joddiy commented Mar 23, 2020

if there are dependent libs compatible with 1.5, then we have to change to 1.6..

On Mon, Mar 23, 2020 at 11:49 AM Joddiy Zhang @.***> wrote: @joddiy https://github.com/joddiy is there any big change from onnx 1.5 to 1.6? The operator's version of onnx 1.5 is 10 and for 1.6 is 11, not big changes but some about operators, some operators attributes have been moved to its inputs, such as clip, the max and min values. It's big trouble for my previous implement, however, I re-constructed the backend and frontend last week, so it's fine for now to change from onnx 1.5 to 1.6. Do we need to? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#624 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA47DR7WPMPIU5BK35OPCYLRI3L5FANCNFSM4LEL3EFQ .

Ok, little changes for me, not too much, please change to onnx 1.6.

@chrishkchris
Copy link
Contributor Author

When I restore numpy version, ubuntu build pass, but macos build does not pass.
So I may further search for any better version of numpy

@chrishkchris
Copy link
Contributor Author

seems that 1.16.5 is okay, I will wait for travis ci

@chrishkchris
Copy link
Contributor Author

numpy 1.16.5 is ok

@chrishkchris
Copy link
Contributor Author

Ready for merge

@nudles nudles merged commit 38fa20b into apache:dev Mar 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants