Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-1185] [WIP] Support large array in several operators #13191

Closed
wants to merge 46 commits into from

Conversation

apeforest
Copy link
Contributor

@apeforest apeforest commented Nov 8, 2018

Description

This PR fixed the large array issue (#13036, #13070) in the following operators:
ndarray.ones
ndarray.zeros
ndarray.sum
ndarray.arange
ndarray.slice
ndarray.random.uniform
ndarray.empty

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Support large arrays (> 3billion elements) in certain operators
  • nightly tests on arrays of 5 billion elements

Comments

  • More PRs in the future to support other operators
  • Need a thorough performance test
  • Rely on a defined data type index_t which is int64_t for now. It can be tuned based on various platforms.

@@ -215,6 +215,7 @@ def _load_lib():
# type definitions
mx_uint = ctypes.c_uint
mx_float = ctypes.c_float
mx_long = ctypes.c_longlong
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be longlong or just long?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be longlong. c_long is the same as c_int in python ctypes.

Copy link
Contributor

@samskalicky samskalicky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@yuxihu yuxihu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One question: what performance implication do we have to change the data type here? memory usage?

@anirudhacharya
Copy link
Member

anirudhacharya commented Nov 9, 2018

@mxnet-label-bot add [pr-awaiting-review]

@apeforest
Copy link
Contributor Author

@yuxihu That's a good question. We have not yet run a full performance test. @wkcn did some initial test and did not see much performance variation between int32_t and int64_t: https://github.com/wkcn/c_performance

@wkcn
Copy link
Member

wkcn commented Nov 10, 2018

@apeforest Could you please add the test? #13070
Thank you!

@apeforest
Copy link
Contributor Author

@apeforest Could you please add the test? #13070
Thank you!

Added: test_ndarray_empty()

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Nov 12, 2018
@lanking520
Copy link
Member

@apeforest
Copy link
Contributor Author

@lanking520 @andrewfayres After offline discussions, I have updated the PR by limiting the changes in scala package to its JNI interface and leave a TODO item to systematically update Scala code to support large array. Please help to review the code again since there are still a few scala-unit test failure. Thanks!

Copy link
Member

@yuxihu yuxihu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dim_t, size_t, index_t are used in different places within the change. It is unclear to me when to use each one. It is going to be very difficult for others to follow on future changes. I think we need detailed documentation on how to choose data type for operators in order to support large array. It should be also mentioned the in PR description.

In addition, only a few operators are changed to support large array. This make our code base a mix of int/index_t for operators. This can be confusing. Please document and we may need a more systematic way of handling large array instead of hot fixes to a few operators.

@@ -354,8 +354,8 @@ inline std::vector<std::string> SafeGetListNames(const Rcpp::List& src) {
* \param rshape The dimension in R
* \return A internal vector representation of shapes in mxnet.
*/
inline std::vector<mx_uint> Dim2InternalShape(const Rcpp::Dimension &rshape) {
std::vector<mx_uint> shape(rshape.size());
inline std::vector<dim_t> Dim2InternalShape(const Rcpp::Dimension &rshape) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: what is the type of mx_uint and dim_t?

static_cast<size_t>(N), static_cast<size_t>(omp_threads))) {
for (int i = 0; i < N; ++i) {
N, static_cast<size_t>(omp_threads))) {
for (size_t i = 0; i < N; ++i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why size_t here and index_t on L549?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because omp parallel block requires index to be signed integer, while size_t is unsigned.

@apeforest apeforest changed the title [MXNET-1185] Support large array in several operators [MXNET-1185] [WIP] Support large array in several operators Nov 27, 2018
@apeforest
Copy link
Contributor Author

Change this PR to [WIP] because we want to merge a smaller part first (#13418)

@apeforest
Copy link
Contributor Author

@mxnet-label-bot update [pr-work-in-progress]

@marcoabreu marcoabreu added pr-work-in-progress PR is still work in progress and removed pr-awaiting-review PR is waiting for code review labels Nov 27, 2018
@roywei
Copy link
Member

roywei commented Dec 11, 2018

@apeforest do you plan to still use this PR for part 2? If not maybe close it first.
Thanks!

@apeforest
Copy link
Contributor Author

@roywei We can close this PR for now. I will reopen it when it comes to all the language binding support. Thanks!

@apeforest apeforest closed this Dec 11, 2018
@apeforest apeforest deleted the bugfix/large-array branch February 25, 2020 18:34
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-work-in-progress PR is still work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants