Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Static alloc for hybridblock #11320

Merged
merged 2 commits into from Jun 20, 2018
Merged

Static alloc for hybridblock #11320

merged 2 commits into from Jun 20, 2018

Conversation

piiswrong
Copy link
Contributor

This reverts commit 9b8eb56.

Description

(Brief description on what this PR is about)

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@piiswrong piiswrong requested a review from szha as a code owner June 16, 2018 05:44
@ThomasDelteil
Copy link
Contributor

@piiswrong could you please respond to the concerns that I listed there: #11318 ?

I might not be familiar enough with the C++ engine, but I think the python API exposed can be improved as it doesn't guard against undefined behaviour. I also think the "bulk_size" argument documentation can be improved to be expressed in more abstract terms to help users pick a value without having to read the C++ code.

Thanks!

@tqchen
Copy link
Member

tqchen commented Jun 16, 2018

@ThomasDelteil can you copy your list of concerns over and make comments on the code? It is good to comment on the code and request changes, so that these points can be discussed

Copy link
Contributor

@ThomasDelteil ThomasDelteil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of transparency, it would be good to benchmark this change in % speed improvements on a standard model like resnet50 from the model zoo with and without the optimisation flags.

Statically allocate memory to improve speed. Memory usage may increase.
static_shape : bool, default False
Optimize for invariant input shapes between iterations. Must also
set static_alloc to True. Change of input shapes is still allowed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we assert that static_alloc is set if static_shape is set so that we don't open the door to undefined behaviours?

if static_shape:
    assert static_alloc, "static_alloc must be True if static_shape is True"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ThomasDelteil please click request changes when commenting

set static_alloc to True. Change of input shapes is still allowed
but slower.
forward_bulk_size : int, default 15
Segment size of bulk execution during forward pass.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the doc about these bulk_size arguments be extended to be expressed in terms of memory / compute trade-offs to help users pick a value that work best given their model/hardware setup?

@tqchen
Copy link
Member

tqchen commented Jun 16, 2018

I think the specific comment of @ThomasDelteil on the documents are actionable and should be addressed. The time difference in terms of resnet50, however, would take a long time to setup and test each time, and only meaningful if the training lasts longer on the big dataset.

So in this case, we shall trust the decision of @piiswrong and move forward with this change

@ThomasDelteil
Copy link
Contributor

@tqchen, thanks I will explicitly hit "Request Change" next time.

Here is an easy to reproduce benchmark script https://gist.github.com/ThomasDelteil/b9ee7171c18e2db3ba47cb4c0439f116, I can run it after the PR is merged and report the results.

Here is the current timing (bug included -> it is fixed in this PR)

static_alloc:True, static_shape:True, time:68.9774 

static_alloc:True, static_shape:False, time:62.6213 

static_alloc:False, static_shape:True, time:62.3503 

static_alloc:False, static_shape:False, time:62.9120 

@tqchen
Copy link
Member

tqchen commented Jun 16, 2018

@ThomasDelteil a few things thoughts, please feel free to comment

  • Bench-marking against mnist is a good reference but may not reflect what will actually going on
  • When benchmark, always skip the first iteration(there is cost of cuda bootstrap, and so on)
  • It is helpful to provide such regression figure, but never the less, let us still trust the reviewer's decision on final merge given that they understand the trade-off and the detail of the code

Thanks

@ThomasDelteil
Copy link
Contributor

ThomasDelteil commented Jun 16, 2018

  • agreed that MNIST might not be representative of real-world workload. I have left a commented out "image.resize_short" if we feel like bigger ndarray might show off better the optimization gains
  • good catch, I have added a first forward call before starting timing.
  • I think you misunderstood my intent with this benchmark. It's not a "let's revert if it is slower". I understand that this work is a foundation for @zheng-da and @eric-haibin-lin on-going PRs and work on control flow operators, and I am looking forward to those! I interact a lot with MXNet users, the wording of the modification to the API is that these are optimizations. I want to be able to cite figures when recommending such and such flags. 1% faster, 10% faster on this specific case, etc. Regardless of this particular PR, I think generally when writing code to optimize speed, it is a well accepted practice to benchmark a before and after to make sure there hasn't been any regression? I think a description of the PR would have been welcome, to better understand the intent of the author. Right now the only description of intent is the code which states "Statically allocate memory to improve speed.".

@zheng-da
Copy link
Contributor

zheng-da commented Jun 16, 2018

@ThomasDelteil thanks for your benchmark script. I modified your script to exclude the first epoch and can verify that the fix in https://github.com/apache/incubator-mxnet/pull/11310/files, which is included in this PR, has fixed the performance regression issue, as we expected. I also insert printing and verify that the memory allocation works exactly as it should be.

@tqchen
Copy link
Member

tqchen commented Jun 16, 2018

@ThomasDelteil I understand that when possible, bench-marking the improvements bought by the change is a good bonus. But requiring to do so for every PR introduces burdens of engineering. So it is not about what is right, but more about what trade-off we could make in here. In this case, given the expertise of @piiswrong and reviewers of this area, and the common sense that static allocation will general results in an improvement, I feel such benchmark could be optional. But never-the-less we really appreciate that you make one :)

@szha
Copy link
Member

szha commented Jun 16, 2018

@szhengac could you comment on the speed improvement of static memory allocation in your transformer models?

@szhengac
Copy link
Contributor

In my experiment on gluon, this static_alloc option is very important for inputs with variable shapes. Without setting this to true, I cannot observe any speed up with multi-gpu training. When I set this to true, a nearly linear speed up can be obtained.

@szha
Copy link
Member

szha commented Jun 17, 2018

Thanks, @szhengac

with mx.autograd.record():
y = net(x)
y.backward()
mx.nd.waitall()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do wait_to_read here instead of waitall. As mentioned here: https://mxnet.incubator.apache.org/architecture/exception_handling.html rethrowing exceptions as part of wait all is not supported since it is expected to be used only with benchmark code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anirudh2290 do you think this could have been the cause for the flakiness of test_hybrid_static_memory_switching as reported here #11171?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i doubt it. i dont see why not rethrowing the exception would cause seg fault, the ops are still executed in both cases. the flakiness is also with test_hybrid_static_memory which is not using waitall.

@ThomasDelteil
Copy link
Contributor

Thanks @szhengac for testing out the optimisation and glad to see it is bringing such drastic improvements on multi-gpu variable-sized input cases, especially useful for training large NLP language models indeed!
Thanks @zheng-da for confirming the performance regression is solved now.
@tqchen thanks for your comments. To clarify, I wasn't requesting that every PR should be tested for speed regression, merely hinting that contributions that target speed improvements benefit from a benchmark to 1. quantify such improvements and 2. make sure no regression were introduced. See #11124 for a good example of what I mean.

@chinakook
Copy link
Contributor

It's complicated. I think some autotune method is better than user setting static_alloc by themselves.

@zheng-da
Copy link
Contributor

I agree that it's important to provide some benchmark results for a target use case for such a PR. It might be difficult to ensure no performance regression is introduced for all other cases. Sometimes we still need to rely on other users to report problems caused by this kind of fundamental changes.

@zheng-da
Copy link
Contributor

I ran the failed test thousands of times to collect a few failures. I checked the core dumps. It seems it always fails exactly in the same place. Please check my comment in #11171. It's unclear why there is such a memory error. Since it always fails in the same operator, it's less likely that the error was introduced by this PR.

@piiswrong piiswrong force-pushed the static branch 2 times, most recently from 48a9b94 to f874184 Compare June 19, 2018 03:40
Copy link
Contributor

@marcoabreu marcoabreu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, if it's only happening in the pooling operator, we should go ahead and merge this PR. The issue could then be investigated asynchronously.
Thanks a lot for iterating on this PR and addressing the reviews and fixing the regression. Also a big thanks to all the reviewers!

@marcoabreu marcoabreu merged commit a7ea976 into apache:master Jun 20, 2018
asitstands added a commit to asitstands/incubator-mxnet that referenced this pull request Jun 22, 2018
asitstands added a commit to asitstands/incubator-mxnet that referenced this pull request Jun 23, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* Revert "Revert "Static alloc for hybridblock (apache#11313)" (apache#11318)"

This reverts commit 9b8eb56.

* fix
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* Revert "Revert "Static alloc for hybridblock (apache#11313)" (apache#11318)"

This reverts commit 9b8eb56.

* fix
szha pushed a commit that referenced this pull request May 24, 2019
* Fix broken build with cython 0.28

* Fix setup.py to be compatible with cython 0.28

* Fix broken cython ndarray module

* Revised comments

* Replace hard coded library path with one obtained by find_lib_path

* Add documentation for MXNET_ENABLE_CYTHON and MXNET_ENFORCE_CYTHON

* Add cython build to CI

* Fix for cython CI

* Adjust python environment for cython CI

* Add make variables to set python executable

* Fix typo

* Fix nnvm include path

* Does not use ccache for cython

* Fix issues with the wildcards in the library list in Jenkinsfile

* Fix issues with the wildcards in the library list in Jenkinsfile (continued)

* Fix issues with the wildcards in the library list in Jenkinsfile (continued)

* Intentionally introduce a bug to check that the tests actually runwith cython

* Remove the intentionally introduced bug

* Update installation doc

* Retrigger CI

* Run cython CI in ubuntu environment instead of CentOS environment

* Commit a missed file

* Fix a bug in check_cython

* Refine environments for cython CI

* Restore unrelated changes

* Fix a problem occurring when the cython modules for python 2 and 3 are built successively

* Trigger CI

* Pin the cython version in the CI

* Catch up #11320

* Remove optional arguments unused after #11320

* Trigger CI

* Trigger CI

* Trigger CI

* Remove unnecessary stype argument from the NDArrayBase constructor

* Revise confusing initialization of `_ndarray_cls`

* Add cython build for python3 in CI

* Fix misplaced cython build in CI

* Adjust CI environments for cython

* Fix invalid path for cython generated .so files in cmake build

* Revert invalid fix

* Revise docs

* Revise check_cython

* Temporaily use make instead of cmake for debugging

* Temporal changes for debugging

* Temporal changes for debugging

* Temporaily use ctypes instead of cython modules for debugging

* Temporaily disable ccache for debugging

* Temporaily use make (DEV = 0) instead of cmake for debugging

* Temporaily disable cudnn for debugging

* Restore temporal changes

* Temporarily disable coverage report

* Adapt to Jenkinsfile_utils

* Adapt to Jenkinsfile_utils (cont.)

* Restore unrelated changes

* Restore temporal changes

* Resolving conflict

* Test with the cmake build is removed

* Add MXNET_ENABLE_CYTHON=0 to tensorrt test

* Fix typo

* Trigger CI

* Trigger CI

* Adapt to Jenkinsfile refactoring

* Adapt to Jenkinsfile refactoring (cont.)

* Trigger CI

* Trigger CI

* Stash missing cython modules

* Trigger CI

* CMake build of cython modules without unit tests

* Fix typo

* Trigger CI

* Fix a mistake introduced in merging process

* trigger test

* Update Jenkinsfile_utils.groovy

* Trigger CI

* Trigger CI

* Trigger CI

* Trigger tests

* Trigger tests

* Trigger tests

* Trigger tests

* Trigger tests

* Trigger tests

* Trigger tests

* Trigger tests
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Fix broken build with cython 0.28

* Fix setup.py to be compatible with cython 0.28

* Fix broken cython ndarray module

* Revised comments

* Replace hard coded library path with one obtained by find_lib_path

* Add documentation for MXNET_ENABLE_CYTHON and MXNET_ENFORCE_CYTHON

* Add cython build to CI

* Fix for cython CI

* Adjust python environment for cython CI

* Add make variables to set python executable

* Fix typo

* Fix nnvm include path

* Does not use ccache for cython

* Fix issues with the wildcards in the library list in Jenkinsfile

* Fix issues with the wildcards in the library list in Jenkinsfile (continued)

* Fix issues with the wildcards in the library list in Jenkinsfile (continued)

* Intentionally introduce a bug to check that the tests actually runwith cython

* Remove the intentionally introduced bug

* Update installation doc

* Retrigger CI

* Run cython CI in ubuntu environment instead of CentOS environment

* Commit a missed file

* Fix a bug in check_cython

* Refine environments for cython CI

* Restore unrelated changes

* Fix a problem occurring when the cython modules for python 2 and 3 are built successively

* Trigger CI

* Pin the cython version in the CI

* Catch up apache#11320

* Remove optional arguments unused after apache#11320

* Trigger CI

* Trigger CI

* Trigger CI

* Remove unnecessary stype argument from the NDArrayBase constructor

* Revise confusing initialization of `_ndarray_cls`

* Add cython build for python3 in CI

* Fix misplaced cython build in CI

* Adjust CI environments for cython

* Fix invalid path for cython generated .so files in cmake build

* Revert invalid fix

* Revise docs

* Revise check_cython

* Temporaily use make instead of cmake for debugging

* Temporal changes for debugging

* Temporal changes for debugging

* Temporaily use ctypes instead of cython modules for debugging

* Temporaily disable ccache for debugging

* Temporaily use make (DEV = 0) instead of cmake for debugging

* Temporaily disable cudnn for debugging

* Restore temporal changes

* Temporarily disable coverage report

* Adapt to Jenkinsfile_utils

* Adapt to Jenkinsfile_utils (cont.)

* Restore unrelated changes

* Restore temporal changes

* Resolving conflict

* Test with the cmake build is removed

* Add MXNET_ENABLE_CYTHON=0 to tensorrt test

* Fix typo

* Trigger CI

* Trigger CI

* Adapt to Jenkinsfile refactoring

* Adapt to Jenkinsfile refactoring (cont.)

* Trigger CI

* Trigger CI

* Stash missing cython modules

* Trigger CI

* CMake build of cython modules without unit tests

* Fix typo

* Trigger CI

* Fix a mistake introduced in merging process

* trigger test

* Update Jenkinsfile_utils.groovy

* Trigger CI

* Trigger CI

* Trigger CI

* Trigger tests

* Trigger tests

* Trigger tests

* Trigger tests

* Trigger tests

* Trigger tests

* Trigger tests

* Trigger tests
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet