Static alloc for hybridblock #11320

piiswrong · 2018-06-16T05:44:18Z

This reverts commit 9b8eb56.

Description

(Brief description on what this PR is about)

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

ThomasDelteil · 2018-06-16T16:36:55Z

@piiswrong could you please respond to the concerns that I listed there: #11318 ?

I might not be familiar enough with the C++ engine, but I think the python API exposed can be improved as it doesn't guard against undefined behaviour. I also think the "bulk_size" argument documentation can be improved to be expressed in more abstract terms to help users pick a value without having to read the C++ code.

Thanks!

tqchen · 2018-06-16T17:50:29Z

@ThomasDelteil can you copy your list of concerns over and make comments on the code? It is good to comment on the code and request changes, so that these points can be discussed

ThomasDelteil

In the interest of transparency, it would be good to benchmark this change in % speed improvements on a standard model like resnet50 from the model zoo with and without the optimisation flags.

ThomasDelteil · 2018-06-16T18:00:48Z

python/mxnet/gluon/block.py

+            Statically allocate memory to improve speed. Memory usage may increase.
+        static_shape : bool, default False
+            Optimize for invariant input shapes between iterations. Must also
+            set static_alloc to True. Change of input shapes is still allowed


can we assert that static_alloc is set if static_shape is set so that we don't open the door to undefined behaviours?

if static_shape: assert static_alloc, "static_alloc must be True if static_shape is True"

@ThomasDelteil please click request changes when commenting

ThomasDelteil · 2018-06-16T18:03:05Z

python/mxnet/gluon/block.py

+            set static_alloc to True. Change of input shapes is still allowed
+            but slower.
+        forward_bulk_size : int, default 15
+            Segment size of bulk execution during forward pass.


Can the doc about these bulk_size arguments be extended to be expressed in terms of memory / compute trade-offs to help users pick a value that work best given their model/hardware setup?

tqchen · 2018-06-16T18:23:20Z

I think the specific comment of @ThomasDelteil on the documents are actionable and should be addressed. The time difference in terms of resnet50, however, would take a long time to setup and test each time, and only meaningful if the training lasts longer on the big dataset.

So in this case, we shall trust the decision of @piiswrong and move forward with this change

ThomasDelteil · 2018-06-16T19:07:26Z

@tqchen, thanks I will explicitly hit "Request Change" next time.

Here is an easy to reproduce benchmark script https://gist.github.com/ThomasDelteil/b9ee7171c18e2db3ba47cb4c0439f116, I can run it after the PR is merged and report the results.

Here is the current timing (bug included -> it is fixed in this PR)

static_alloc:True, static_shape:True, time:68.9774 

static_alloc:True, static_shape:False, time:62.6213 

static_alloc:False, static_shape:True, time:62.3503 

static_alloc:False, static_shape:False, time:62.9120

tqchen · 2018-06-16T21:12:50Z

@ThomasDelteil a few things thoughts, please feel free to comment

Bench-marking against mnist is a good reference but may not reflect what will actually going on
When benchmark, always skip the first iteration(there is cost of cuda bootstrap, and so on)
It is helpful to provide such regression figure, but never the less, let us still trust the reviewer's decision on final merge given that they understand the trade-off and the detail of the code

Thanks

ThomasDelteil · 2018-06-16T21:39:42Z

agreed that MNIST might not be representative of real-world workload. I have left a commented out "image.resize_short" if we feel like bigger ndarray might show off better the optimization gains
good catch, I have added a first forward call before starting timing.
I think you misunderstood my intent with this benchmark. It's not a "let's revert if it is slower". I understand that this work is a foundation for @zheng-da and @eric-haibin-lin on-going PRs and work on control flow operators, and I am looking forward to those! I interact a lot with MXNet users, the wording of the modification to the API is that these are optimizations. I want to be able to cite figures when recommending such and such flags. 1% faster, 10% faster on this specific case, etc. Regardless of this particular PR, I think generally when writing code to optimize speed, it is a well accepted practice to benchmark a before and after to make sure there hasn't been any regression? I think a description of the PR would have been welcome, to better understand the intent of the author. Right now the only description of intent is the code which states "Statically allocate memory to improve speed.".

zheng-da · 2018-06-16T21:48:42Z

@ThomasDelteil thanks for your benchmark script. I modified your script to exclude the first epoch and can verify that the fix in https://github.com/apache/incubator-mxnet/pull/11310/files, which is included in this PR, has fixed the performance regression issue, as we expected. I also insert printing and verify that the memory allocation works exactly as it should be.

tqchen · 2018-06-16T21:55:14Z

@ThomasDelteil I understand that when possible, bench-marking the improvements bought by the change is a good bonus. But requiring to do so for every PR introduces burdens of engineering. So it is not about what is right, but more about what trade-off we could make in here. In this case, given the expertise of @piiswrong and reviewers of this area, and the common sense that static allocation will general results in an improvement, I feel such benchmark could be optional. But never-the-less we really appreciate that you make one :)

szha · 2018-06-16T23:34:27Z

@szhengac could you comment on the speed improvement of static memory allocation in your transformer models?

szhengac · 2018-06-17T00:30:24Z

In my experiment on gluon, this static_alloc option is very important for inputs with variable shapes. Without setting this to true, I cannot observe any speed up with multi-gpu training. When I set this to true, a nearly linear speed up can be obtained.

szha · 2018-06-17T02:39:27Z

Thanks, @szhengac

anirudh2290 · 2018-06-17T02:50:38Z

tests/python/unittest/test_gluon.py

+    with mx.autograd.record():
+        y = net(x)
+        y.backward()
+    mx.nd.waitall()


can we do wait_to_read here instead of waitall. As mentioned here: https://mxnet.incubator.apache.org/architecture/exception_handling.html rethrowing exceptions as part of wait all is not supported since it is expected to be used only with benchmark code.

@anirudh2290 do you think this could have been the cause for the flakiness of test_hybrid_static_memory_switching as reported here #11171?

i doubt it. i dont see why not rethrowing the exception would cause seg fault, the ops are still executed in both cases. the flakiness is also with test_hybrid_static_memory which is not using waitall.

ThomasDelteil · 2018-06-17T03:43:20Z

Thanks @szhengac for testing out the optimisation and glad to see it is bringing such drastic improvements on multi-gpu variable-sized input cases, especially useful for training large NLP language models indeed!
Thanks @zheng-da for confirming the performance regression is solved now.
@tqchen thanks for your comments. To clarify, I wasn't requesting that every PR should be tested for speed regression, merely hinting that contributions that target speed improvements benefit from a benchmark to 1. quantify such improvements and 2. make sure no regression were introduced. See #11124 for a good example of what I mean.

chinakook · 2018-06-17T13:40:30Z

It's complicated. I think some autotune method is better than user setting static_alloc by themselves.

zheng-da · 2018-06-17T16:53:15Z

I agree that it's important to provide some benchmark results for a target use case for such a PR. It might be difficult to ensure no performance regression is introduced for all other cases. Sometimes we still need to rely on other users to report problems caused by this kind of fundamental changes.

zheng-da · 2018-06-17T23:54:53Z

I ran the failed test thousands of times to collect a few failures. I checked the core dumps. It seems it always fails exactly in the same place. Please check my comment in #11171. It's unclear why there is such a memory error. Since it always fails in the same operator, it's less likely that the error was introduced by this PR.

marcoabreu

Agree, if it's only happening in the pooling operator, we should go ahead and merge this PR. The issue could then be investigated asynchronously.
Thanks a lot for iterating on this PR and addressing the reviews and fixing the regression. Also a big thanks to all the reviewers!

…11318)" This reverts commit 9b8eb56.

* Revert "Revert "Static alloc for hybridblock (apache#11313)" (apache#11318)" This reverts commit 9b8eb56. * fix

* Fix broken build with cython 0.28 * Fix setup.py to be compatible with cython 0.28 * Fix broken cython ndarray module * Revised comments * Replace hard coded library path with one obtained by find_lib_path * Add documentation for MXNET_ENABLE_CYTHON and MXNET_ENFORCE_CYTHON * Add cython build to CI * Fix for cython CI * Adjust python environment for cython CI * Add make variables to set python executable * Fix typo * Fix nnvm include path * Does not use ccache for cython * Fix issues with the wildcards in the library list in Jenkinsfile * Fix issues with the wildcards in the library list in Jenkinsfile (continued) * Fix issues with the wildcards in the library list in Jenkinsfile (continued) * Intentionally introduce a bug to check that the tests actually runwith cython * Remove the intentionally introduced bug * Update installation doc * Retrigger CI * Run cython CI in ubuntu environment instead of CentOS environment * Commit a missed file * Fix a bug in check_cython * Refine environments for cython CI * Restore unrelated changes * Fix a problem occurring when the cython modules for python 2 and 3 are built successively * Trigger CI * Pin the cython version in the CI * Catch up #11320 * Remove optional arguments unused after #11320 * Trigger CI * Trigger CI * Trigger CI * Remove unnecessary stype argument from the NDArrayBase constructor * Revise confusing initialization of `_ndarray_cls` * Add cython build for python3 in CI * Fix misplaced cython build in CI * Adjust CI environments for cython * Fix invalid path for cython generated .so files in cmake build * Revert invalid fix * Revise docs * Revise check_cython * Temporaily use make instead of cmake for debugging * Temporal changes for debugging * Temporal changes for debugging * Temporaily use ctypes instead of cython modules for debugging * Temporaily disable ccache for debugging * Temporaily use make (DEV = 0) instead of cmake for debugging * Temporaily disable cudnn for debugging * Restore temporal changes * Temporarily disable coverage report * Adapt to Jenkinsfile_utils * Adapt to Jenkinsfile_utils (cont.) * Restore unrelated changes * Restore temporal changes * Resolving conflict * Test with the cmake build is removed * Add MXNET_ENABLE_CYTHON=0 to tensorrt test * Fix typo * Trigger CI * Trigger CI * Adapt to Jenkinsfile refactoring * Adapt to Jenkinsfile refactoring (cont.) * Trigger CI * Trigger CI * Stash missing cython modules * Trigger CI * CMake build of cython modules without unit tests * Fix typo * Trigger CI * Fix a mistake introduced in merging process * trigger test * Update Jenkinsfile_utils.groovy * Trigger CI * Trigger CI * Trigger CI * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests

* Fix broken build with cython 0.28 * Fix setup.py to be compatible with cython 0.28 * Fix broken cython ndarray module * Revised comments * Replace hard coded library path with one obtained by find_lib_path * Add documentation for MXNET_ENABLE_CYTHON and MXNET_ENFORCE_CYTHON * Add cython build to CI * Fix for cython CI * Adjust python environment for cython CI * Add make variables to set python executable * Fix typo * Fix nnvm include path * Does not use ccache for cython * Fix issues with the wildcards in the library list in Jenkinsfile * Fix issues with the wildcards in the library list in Jenkinsfile (continued) * Fix issues with the wildcards in the library list in Jenkinsfile (continued) * Intentionally introduce a bug to check that the tests actually runwith cython * Remove the intentionally introduced bug * Update installation doc * Retrigger CI * Run cython CI in ubuntu environment instead of CentOS environment * Commit a missed file * Fix a bug in check_cython * Refine environments for cython CI * Restore unrelated changes * Fix a problem occurring when the cython modules for python 2 and 3 are built successively * Trigger CI * Pin the cython version in the CI * Catch up apache#11320 * Remove optional arguments unused after apache#11320 * Trigger CI * Trigger CI * Trigger CI * Remove unnecessary stype argument from the NDArrayBase constructor * Revise confusing initialization of `_ndarray_cls` * Add cython build for python3 in CI * Fix misplaced cython build in CI * Adjust CI environments for cython * Fix invalid path for cython generated .so files in cmake build * Revert invalid fix * Revise docs * Revise check_cython * Temporaily use make instead of cmake for debugging * Temporal changes for debugging * Temporal changes for debugging * Temporaily use ctypes instead of cython modules for debugging * Temporaily disable ccache for debugging * Temporaily use make (DEV = 0) instead of cmake for debugging * Temporaily disable cudnn for debugging * Restore temporal changes * Temporarily disable coverage report * Adapt to Jenkinsfile_utils * Adapt to Jenkinsfile_utils (cont.) * Restore unrelated changes * Restore temporal changes * Resolving conflict * Test with the cmake build is removed * Add MXNET_ENABLE_CYTHON=0 to tensorrt test * Fix typo * Trigger CI * Trigger CI * Adapt to Jenkinsfile refactoring * Adapt to Jenkinsfile refactoring (cont.) * Trigger CI * Trigger CI * Stash missing cython modules * Trigger CI * CMake build of cython modules without unit tests * Fix typo * Trigger CI * Fix a mistake introduced in merging process * trigger test * Update Jenkinsfile_utils.groovy * Trigger CI * Trigger CI * Trigger CI * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests

piiswrong requested a review from szha as a code owner June 16, 2018 05:44

eric-haibin-lin approved these changes Jun 16, 2018

View reviewed changes

haojin2 approved these changes Jun 16, 2018

View reviewed changes

ThomasDelteil reviewed Jun 16, 2018

View reviewed changes

reminisce approved these changes Jun 16, 2018

View reviewed changes

zheng-da approved these changes Jun 16, 2018

View reviewed changes

szha approved these changes Jun 17, 2018

View reviewed changes

anirudh2290 reviewed Jun 17, 2018

View reviewed changes

ThomasDelteil approved these changes Jun 18, 2018

View reviewed changes

piiswrong force-pushed the static branch 2 times, most recently from 48a9b94 to f874184 Compare June 19, 2018 03:40

marcoabreu approved these changes Jun 19, 2018

View reviewed changes

piiswrong added 2 commits June 19, 2018 10:44

Revert "Revert "Static alloc for hybridblock (apache#11313)" (apache#…

d8b017d

…11318)" This reverts commit 9b8eb56.

fix

48711c2

piiswrong force-pushed the static branch from f874184 to 48711c2 Compare June 19, 2018 17:44

marcoabreu merged commit a7ea976 into apache:master Jun 20, 2018

asitstands added a commit to asitstands/incubator-mxnet that referenced this pull request Jun 22, 2018

Catch up apache#11320

96ed71f

asitstands added a commit to asitstands/incubator-mxnet that referenced this pull request Jun 23, 2018

Remove optional arguments unused after apache#11320

991dd8c

zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018

Static alloc for hybridblock (apache#11320)

67b0605

* Revert "Revert "Static alloc for hybridblock (apache#11313)" (apache#11318)" This reverts commit 9b8eb56. * fix

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018

Static alloc for hybridblock (apache#11320)

c1dfc7a

* Revert "Revert "Static alloc for hybridblock (apache#11313)" (apache#11318)" This reverts commit 9b8eb56. * fix

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Static alloc for hybridblock #11320

Static alloc for hybridblock #11320

piiswrong commented Jun 16, 2018

ThomasDelteil commented Jun 16, 2018

tqchen commented Jun 16, 2018 •

edited

ThomasDelteil left a comment

ThomasDelteil Jun 16, 2018

tqchen Jun 16, 2018

ThomasDelteil Jun 16, 2018

tqchen commented Jun 16, 2018 •

edited

ThomasDelteil commented Jun 16, 2018

tqchen commented Jun 16, 2018

ThomasDelteil commented Jun 16, 2018 •

edited

zheng-da commented Jun 16, 2018 •

edited

tqchen commented Jun 16, 2018

szha commented Jun 16, 2018

szhengac commented Jun 17, 2018

szha commented Jun 17, 2018

anirudh2290 Jun 17, 2018

ThomasDelteil Jun 17, 2018

anirudh2290 Jun 17, 2018

ThomasDelteil commented Jun 17, 2018

chinakook commented Jun 17, 2018

zheng-da commented Jun 17, 2018

zheng-da commented Jun 17, 2018

marcoabreu left a comment

Static alloc for hybridblock #11320

Static alloc for hybridblock #11320

Conversation

piiswrong commented Jun 16, 2018

Description

Checklist

Essentials

Changes

Comments

ThomasDelteil commented Jun 16, 2018

tqchen commented Jun 16, 2018 • edited

ThomasDelteil left a comment

Choose a reason for hiding this comment

ThomasDelteil Jun 16, 2018

Choose a reason for hiding this comment

tqchen Jun 16, 2018

Choose a reason for hiding this comment

ThomasDelteil Jun 16, 2018

Choose a reason for hiding this comment

tqchen commented Jun 16, 2018 • edited

ThomasDelteil commented Jun 16, 2018

tqchen commented Jun 16, 2018

ThomasDelteil commented Jun 16, 2018 • edited

zheng-da commented Jun 16, 2018 • edited

tqchen commented Jun 16, 2018

szha commented Jun 16, 2018

szhengac commented Jun 17, 2018

szha commented Jun 17, 2018

anirudh2290 Jun 17, 2018

Choose a reason for hiding this comment

ThomasDelteil Jun 17, 2018

Choose a reason for hiding this comment

anirudh2290 Jun 17, 2018

Choose a reason for hiding this comment

ThomasDelteil commented Jun 17, 2018

chinakook commented Jun 17, 2018

zheng-da commented Jun 17, 2018

zheng-da commented Jun 17, 2018

marcoabreu left a comment

Choose a reason for hiding this comment

tqchen commented Jun 16, 2018 •

edited

tqchen commented Jun 16, 2018 •

edited

ThomasDelteil commented Jun 16, 2018 •

edited

zheng-da commented Jun 16, 2018 •

edited