[NumPy] enable large tensor in np #18368

szha · 2020-05-19T20:20:26Z

mxnet-bot · 2020-05-19T20:20:28Z

Hey @szha, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [centos-gpu, windows-gpu, miscellaneous, clang, unix-cpu, website, unix-gpu, centos-cpu, sanity, edge, windows-cpu]

Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

szha · 2020-07-03T21:14:18Z

@TaoLv @PatricZhao large tensor build for MKL builds seems to be failing. See the CI checks.

TaoLv · 2020-07-04T14:44:17Z

@szha, do you mean the errors below? As I mentioned in #18645 (comment), we also need to enable MKL_USE_ILP64 in the cmake line when large tensor support (LTS) is enabled. For more information, see https://software.intel.com/content/www/us/en/develop/documentation/mkl-macos-developer-guide/top/linking-your-application-with-the-intel-math-kernel-library/linking-in-detail/linking-with-interface-libraries/using-the-ilp64-interface-vs-lp64-interface.html. I'm afraid that a similar problem may also exist in the mshadow code: https://github.com/apache/incubator-mxnet/blob/master/3rdparty/mshadow/mshadow/dot_engine-inl.h#L317.

[2020-07-03T21:00:50.972Z] /work/mxnet/src/operator/contrib/transformer.cc:143:31: error: narrowing conversion of 'm' from 'mxnet::index_t {aka long int}' to 'int' inside { } [-Werror=narrowing]
[2020-07-03T21:00:50.972Z]    MKL_INT p_m[GROUP_SIZE] = {m};
[2020-07-03T21:00:50.972Z]                                ^
[2020-07-03T21:00:50.972Z] /work/mxnet/src/operator/contrib/transformer.cc:144:31: error: narrowing conversion of 'n' from 'mxnet::index_t {aka long int}' to 'int' inside { } [-Werror=narrowing]
[2020-07-03T21:00:50.972Z]    MKL_INT p_n[GROUP_SIZE] = {n};
[2020-07-03T21:00:50.972Z]                                ^
[2020-07-03T21:00:50.972Z] /work/mxnet/src/operator/contrib/transformer.cc:145:31: error: narrowing conversion of 'k' from 'mxnet::index_t {aka long int}' to 'int' inside { } [-Werror=narrowing]
[2020-07-03T21:00:50.972Z]    MKL_INT p_k[GROUP_SIZE] = {k};
[2020-07-03T21:00:50.972Z]                                ^
[2020-07-03T21:00:50.972Z] /work/mxnet/src/operator/contrib/transformer.cc:146:35: error: narrowing conversion of 'lda' from 'mxnet::index_t {aka long int}' to 'int' inside { } [-Werror=narrowing]
[2020-07-03T21:00:50.972Z]    MKL_INT p_lda[GROUP_SIZE] = {lda};
[2020-07-03T21:00:50.972Z]                                    ^
[2020-07-03T21:00:50.972Z] /work/mxnet/src/operator/contrib/transformer.cc:147:35: error: narrowing conversion of 'ldb' from 'mxnet::index_t {aka long int}' to 'int' inside { } [-Werror=narrowing]
[2020-07-03T21:00:50.972Z]    MKL_INT p_ldb[GROUP_SIZE] = {ldb};
[2020-07-03T21:00:50.972Z]                                    ^
[2020-07-03T21:00:50.972Z] /work/mxnet/src/operator/contrib/transformer.cc:148:35: error: narrowing conversion of 'ldc' from 'mxnet::index_t {aka long int}' to 'int' inside { } [-Werror=narrowing]
[2020-07-03T21:00:50.972Z]    MKL_INT p_ldc[GROUP_SIZE] = {ldc};
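For readers unfamiliar with the error: the log shows `mxnet::index_t` as a 64-bit `long int` being placed into `MKL_INT`, which is a 32-bit `int` unless the ILP64 interface is selected. The Python sketch below only mimics the two widths with `ctypes` to show what would happen to a large dimension if the conversion were allowed. The helper names `fits_in_int32` and `truncate_to_int32` are hypothetical and are not part of MXNet or MKL.

```python
import ctypes

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def fits_in_int32(value: int) -> bool:
    """Return True if value is representable as a 32-bit signed int."""
    return INT32_MIN <= value <= INT32_MAX


def truncate_to_int32(value: int) -> int:
    """Mimic an unchecked 64-bit to 32-bit conversion.

    ctypes integer types do no overflow checking, so the value wraps
    the same way a silent narrowing conversion would.
    """
    return ctypes.c_int32(value).value


if __name__ == "__main__":
    # Hypothetical matrix dimension just past the 32-bit limit.
    m = 2**31
    print(fits_in_int32(m))      # False
    print(truncate_to_int32(m))  # -2147483648
```

A dimension that wraps to a negative or too-small value would be passed to the BLAS call unnoticed, which is why the build treats the narrowing as an error rather than a warning.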

szha · 2020-07-04T21:07:11Z

@TaoLv thanks. I will wait for that PR to be resolved.

sandeep-krishnamurthy · 2020-07-16T19:26:06Z

@access2rohit - Please help review. Thanks.

szha · 2020-07-17T05:01:01Z

@sandeep-krishnamurthy the CI is currently stuck on the build issue that @TaoLv pointed out. @access2rohit's PR with the build fix is needed to address the issue, but it seems that there has been no progress in #18645. What's the plan?

sandeep-krishnamurthy · 2020-07-17T06:16:07Z

As discussed in issue #17331 (comment), as a first step we are focused on OpenBLAS updates only, since it is the primary BLAS engine we ship on PyPI. MKL updates are next.

python/mxnet/numpy/multiarray.py

tests/python/unittest/test_np_large_array.py

access2rohit · 2020-07-17T19:13:32Z

@szha
Can we move the file tests/python/unittest/test_np_large_array.py to the nightly folder instead? Keeping these tests in unittest will significantly slow down the CI and may result in timeouts, since we are allocating tensors with over 4.3 billion elements. Even if the CI doesn't time out right now, it definitely will once all numpy ops are added.
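To put the size in perspective, here is a rough back-of-the-envelope sketch of the memory one such tensor needs. It assumes that "over 4.3 billion elements" means roughly 2**32 elements (just past what a 32-bit index can address) and uses the standard item sizes for each dtype. The helper names `tensor_bytes` and `to_gib` are hypothetical, not MXNet APIs.

```python
import math

# Bytes per element for a few common dtypes.
ITEMSIZE = {"float16": 2, "float32": 4, "float64": 8, "int64": 8}


def tensor_bytes(shape, dtype="float32"):
    """Bytes needed to hold a dense tensor of the given shape and dtype."""
    return math.prod(shape) * ITEMSIZE[dtype]


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # Assumed size: 2**32 elements, about 4.3 billion.
    shape = (2**32,)
    for dtype in ("float16", "float32", "float64"):
        print(dtype, to_gib(tensor_bytes(shape, dtype)), "GiB")
    # float16 8.0 GiB
    # float32 16.0 GiB
    # float64 32.0 GiB
```

Each test that allocates an input and an output of this size needs a multiple of these figures, which is consistent with the concern about running them on every CI build.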

szha · 2020-07-17T19:14:56Z

@sandeep-krishnamurthy I'm not asking about MKL as a feature request. I'm saying that the current build is broken and needs a fix.

access2rohit · 2020-07-17T19:15:35Z

Overall functionality seems fine. Can you run all the tests in the file tests/python/unittest/test_np_large_array.py using pytest and paste the output in the comments section, so we can be sure that the nightly CD pipeline doesn't break?

szha · 2020-07-17T19:15:38Z

@access2rohit will do. We still need to fix the CI for enabling large tensor first.

access2rohit · 2020-07-17T19:18:52Z

> @sandeep-krishnamurthy I'm not asking about MKL as a feature request. I'm saying that the current build is broken and needs a fix.

Basically, I will disable the large tensor build with MKL in CI when enabling large tensor support by default, and that should fix it (since we are currently focusing on making large tensor work with OpenBLAS). For now, you can remove your change that makes USE_INT64_TENSOR_SIZE=ON the default in CMakeLists.txt, and your PR should pass CI. Sounds good?

access2rohit

LGTM! BTW, codecov is failing. Does it block us from merging the PR?

szha · 2020-07-20T16:40:47Z

@access2rohit thanks for the review. No, codecov isn't enforced yet.

szha force-pushed the np_int64 branch 4 times, most recently from 3df1764 to c801113, May 21, 2020 22:36

szha force-pushed the np_int64 branch from c801113 to 0e9bcac, July 3, 2020 20:38

szha mentioned this pull request Jul 10, 2020

[mxnet 2.0] [item 2.4] Turning on large tensor support by default #17331

Open

access2rohit reviewed Jul 17, 2020

python/mxnet/numpy/multiarray.py (resolved)

access2rohit reviewed Jul 17, 2020

python/mxnet/numpy/multiarray.py (resolved)

access2rohit reviewed Jul 17, 2020

python/mxnet/numpy/multiarray.py (resolved)

access2rohit reviewed Jul 17, 2020

tests/python/unittest/test_np_large_array.py (outdated, resolved)

szha force-pushed the np_int64 branch from 0e9bcac to 512d3da, July 19, 2020 21:33

szha added 3 commits July 19, 2020 14:34

enable default large tensor in np

4d405b7

revert cmake change

5fafcfd

move test_np_large_array.py to nightly

c20e41b

szha force-pushed the np_int64 branch from 512d3da to c20e41b, July 19, 2020 21:34

szha changed the title ~~[WIP] enable large tensor in np~~ [NumPy] enable large tensor in np Jul 19, 2020

szha marked this pull request as ready for review July 19, 2020 21:35

access2rohit approved these changes Jul 20, 2020

szha merged commit bf26bcc into apache:master Jul 20, 2020

szha deleted the np_int64 branch July 20, 2020 18:43

sxjscience mentioned this pull request Jul 29, 2020

[Numpy] Fix SQuAD + Fix GLUE downloading dmlc/gluon-nlp#1280

Merged

1 task
