[CD] dynamic libmxet pipeline fix + small fixes #16966

Merged
szha merged 4 commits into apache:master from perdasilva:cd_dynlib_fix on Dec 10, 2019

@perdasilva
Contributor

perdasilva commented Dec 3, 2019

Description

MKL builds for the dynamic libmxnet are failing because of a change in a previous PR, which deleted mx_mkldnn_deps from the Jenkins file even though an underlying import still needs it.

I also noticed a couple of inconsistencies:

  1. USE_NVTX=1 was not set for the cuda 9.0 make configuration
  2. ubuntu_gpu_cu101 was using an older cudnn version
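
Inconsistencies like these, where one CUDA variant is missing a flag that its siblings set, could in principle be caught by a small check over the per-variant make configurations. The following is only a sketch under stated assumptions: the variant names, the sample file contents, and the helper names (`parse_flags`, `variants_missing_flag`) are hypothetical and are not taken from the repository.

```python
import re

# Hypothetical excerpts of per-variant make configurations. The real files
# in the repository may be named and laid out differently.
SAMPLE_CONFIGS = {
    "cu90": "USE_CUDA=1\nUSE_CUDNN=1\n",
    "cu100": "USE_CUDA=1\nUSE_CUDNN=1\nUSE_NVTX=1\n",
    "cu101": "USE_CUDA=1\nUSE_CUDNN=1\nUSE_NVTX=1\n",
}


def parse_flags(text):
    """Return a dict of NAME -> value for simple `NAME = value` lines."""
    flags = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        match = re.match(r"^([A-Z0-9_]+)\s*:?=\s*(.*)$", line)
        if match:
            flags[match.group(1)] = match.group(2).strip()
    return flags


def variants_missing_flag(configs, name, expected):
    """List the variants whose config does not set `name` to `expected`."""
    return sorted(
        variant
        for variant, text in configs.items()
        if parse_flags(text).get(name) != expected
    )


missing = variants_missing_flag(SAMPLE_CONFIGS, "USE_NVTX", "1")
print(missing)
```

With the sample data above, the check reports the one variant that lacks `USE_NVTX=1`. Running it against the real configuration files would require adapting the parsing to their actual syntax.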

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, API doc string has been updated.
      • For new C++ functions in header files, their functionalities and arguments are documented.
      • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
      • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
@perdasilva perdasilva requested a review from szha as a code owner Dec 3, 2019
@TaoLv

Member

TaoLv commented Dec 3, 2019

Thank you for the fix @perdasilva. Where can I find the broken status of CD?

@zachgk
zachgk approved these changes Dec 3, 2019
@perdasilva perdasilva force-pushed the perdasilva:cd_dynlib_fix branch from 4c2d606 to 4a4d0e3 Dec 5, 2019
@perdasilva

Contributor Author

perdasilva commented Dec 5, 2019

@TaoLv sorry for the delay in responding. CD runs on a daily cadence here. If you have access, you can also test changes to the CD pipeline on Jenkins dev: update the configuration for this job so that it points to your repository and your branch. That gives you a dry run of CD.

@perdasilva

Contributor Author

perdasilva commented Dec 5, 2019

@DickJC123, I'm trying to fix CD and I think it's been failing since the fuse op PR. Do you have any idea why it could be failing for the cuda 9.0 builds?

======================================================================
ERROR: test_operator_gpu.test_batchnorm_training
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 1830, in test_batchnorm_training
    check_batchnorm_training('default')
  File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 1769, in check_batchnorm_training
    check_numeric_gradient(test, in_location, mean_std, numeric_eps=1e-2, rtol=0.16, atol=1e-2)
  File "/work/mxnet/python/mxnet/test_utils.py", line 1101, in check_numeric_gradient
    symbolic_grads = {k:executor.grad_dict[k].asnumpy() for k in grad_nodes}
  File "/work/mxnet/python/mxnet/test_utils.py", line 1101, in <dictcomp>
    symbolic_grads = {k:executor.grad_dict[k].asnumpy() for k in grad_nodes}
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 2532, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [21:10:06] src/operator/fusion/fused_op.cu:558: Check failed: compileResult == NVRTC_SUCCESS (6 vs. 0) : NVRTC Compilation failed. Please set environment variable MXNET_USE_FUSION to 0.
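
The error message itself suggests setting the environment variable MXNET_USE_FUSION to 0. A minimal sketch of doing that for a child process is shown below. It assumes Python 3.7+ (the traceback above is from Python 2.7), and the helper names `env_with_fusion_disabled` and `run_with_fusion_disabled` are hypothetical. The command is a stand-in that only echoes the variable, where in practice it would be the test runner invocation. Setting the variable before the process starts is assumed to be the safest way to make sure the library sees it. Note that this only sidesteps the failing NVRTC compilation; it does not explain why the fused kernel fails to compile on cuda 9.0.

```python
import os
import subprocess
import sys


def env_with_fusion_disabled(base_env=None):
    """Return a copy of the environment with MXNET_USE_FUSION set to "0"."""
    env = dict(os.environ if base_env is None else base_env)
    env["MXNET_USE_FUSION"] = "0"
    return env


def run_with_fusion_disabled(cmd):
    """Run `cmd` in a child process that sees MXNET_USE_FUSION=0."""
    return subprocess.run(
        cmd,
        env=env_with_fusion_disabled(),
        capture_output=True,
        text=True,
        check=False,
    )


# Stand-in command: prints the variable as seen by the child process.
result = run_with_fusion_disabled(
    [sys.executable, "-c", "import os; print(os.environ['MXNET_USE_FUSION'])"]
)
print(result.stdout.strip())
```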
@perdasilva perdasilva force-pushed the perdasilva:cd_dynlib_fix branch from 4a4d0e3 to 52e0767 Dec 5, 2019
@perdasilva perdasilva changed the title [CD] dynamic libmxet pipeline fix [WIP][CD] dynamic libmxet pipeline fix Dec 5, 2019
@DickJC123

Contributor

DickJC123 commented Dec 5, 2019

I'll work with @ptrendx to resolve this.

@perdasilva perdasilva force-pushed the perdasilva:cd_dynlib_fix branch from 52e0767 to 9e987a3 Dec 9, 2019
@perdasilva perdasilva changed the title [WIP][CD] dynamic libmxet pipeline fix [CD] dynamic libmxet pipeline fix + small fixes Dec 9, 2019
@perdasilva

Contributor Author

perdasilva commented Dec 9, 2019

@DickJC123 I've created an issue to track this problem: #17020 - thanks again for looking into it

@perdasilva perdasilva force-pushed the perdasilva:cd_dynlib_fix branch from 9e987a3 to ec02673 Dec 10, 2019
@szha szha merged commit 60f77f5 into apache:master Dec 10, 2019
11 checks passed
ci/jenkins/mxnet-validation/centos-cpu Job succeeded
ci/jenkins/mxnet-validation/centos-gpu Job succeeded
ci/jenkins/mxnet-validation/clang Job succeeded
ci/jenkins/mxnet-validation/edge Job succeeded
ci/jenkins/mxnet-validation/miscellaneous Job succeeded
ci/jenkins/mxnet-validation/sanity Job succeeded
ci/jenkins/mxnet-validation/unix-cpu Job succeeded
ci/jenkins/mxnet-validation/unix-gpu Job succeeded
ci/jenkins/mxnet-validation/website Job succeeded
ci/jenkins/mxnet-validation/windows-cpu Job succeeded
ci/jenkins/mxnet-validation/windows-gpu Job succeeded
6 participants