Conversation
@piiswrong could you please respond to the concerns that I listed there: #11318 ? I might not be familiar enough with the C++ engine, but I think the python API exposed can be improved as it doesn't guard against undefined behaviour. I also think the "bulk_size" argument documentation can be improved to be expressed in more abstract terms to help users pick a value without having to read the C++ code. Thanks! |
@ThomasDelteil can you copy your list of concerns over and make comments on the code? It is good to comment on the code and request changes, so that these points can be discussed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interest of transparency, it would be good to benchmark this change in % speed improvements on a standard model like resnet50 from the model zoo with and without the optimisation flags.
Statically allocate memory to improve speed. Memory usage may increase. | ||
static_shape : bool, default False | ||
Optimize for invariant input shapes between iterations. Must also | ||
set static_alloc to True. Change of input shapes is still allowed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we assert that static_alloc is set if static_shape is set so that we don't open the door to undefined behaviours?
if static_shape:
assert static_alloc, "static_alloc must be True if static_shape is True"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ThomasDelteil please click request changes when commenting
python/mxnet/gluon/block.py
Outdated
set static_alloc to True. Change of input shapes is still allowed | ||
but slower. | ||
forward_bulk_size : int, default 15 | ||
Segment size of bulk execution during forward pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the doc about these bulk_size
arguments be extended to be expressed in terms of memory / compute trade-offs to help users pick a value that work best given their model/hardware setup?
I think the specific comment of @ThomasDelteil on the documents are actionable and should be addressed. The time difference in terms of resnet50, however, would take a long time to setup and test each time, and only meaningful if the training lasts longer on the big dataset. So in this case, we shall trust the decision of @piiswrong and move forward with this change |
@tqchen, thanks I will explicitly hit "Request Change" next time. Here is an easy to reproduce benchmark script https://gist.github.com/ThomasDelteil/b9ee7171c18e2db3ba47cb4c0439f116, I can run it after the PR is merged and report the results. Here is the current timing (bug included -> it is fixed in this PR)
|
@ThomasDelteil a few things thoughts, please feel free to comment
Thanks |
|
@ThomasDelteil thanks for your benchmark script. I modified your script to exclude the first epoch and can verify that the fix in https://github.com/apache/incubator-mxnet/pull/11310/files, which is included in this PR, has fixed the performance regression issue, as we expected. I also insert printing and verify that the memory allocation works exactly as it should be. |
@ThomasDelteil I understand that when possible, bench-marking the improvements bought by the change is a good bonus. But requiring to do so for every PR introduces burdens of engineering. So it is not about what is right, but more about what trade-off we could make in here. In this case, given the expertise of @piiswrong and reviewers of this area, and the common sense that static allocation will general results in an improvement, I feel such benchmark could be optional. But never-the-less we really appreciate that you make one :) |
@szhengac could you comment on the speed improvement of static memory allocation in your transformer models? |
In my experiment on gluon, this static_alloc option is very important for inputs with variable shapes. Without setting this to true, I cannot observe any speed up with multi-gpu training. When I set this to true, a nearly linear speed up can be obtained. |
Thanks, @szhengac |
with mx.autograd.record(): | ||
y = net(x) | ||
y.backward() | ||
mx.nd.waitall() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do wait_to_read here instead of waitall. As mentioned here: https://mxnet.incubator.apache.org/architecture/exception_handling.html rethrowing exceptions as part of wait all is not supported since it is expected to be used only with benchmark code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anirudh2290 do you think this could have been the cause for the flakiness of test_hybrid_static_memory_switching
as reported here #11171?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i doubt it. i dont see why not rethrowing the exception would cause seg fault, the ops are still executed in both cases. the flakiness is also with test_hybrid_static_memory which is not using waitall
.
Thanks @szhengac for testing out the optimisation and glad to see it is bringing such drastic improvements on multi-gpu variable-sized input cases, especially useful for training large NLP language models indeed! |
It's complicated. I think some autotune method is better than user setting static_alloc by themselves. |
I agree that it's important to provide some benchmark results for a target use case for such a PR. It might be difficult to ensure no performance regression is introduced for all other cases. Sometimes we still need to rely on other users to report problems caused by this kind of fundamental changes. |
I ran the failed test thousands of times to collect a few failures. I checked the core dumps. It seems it always fails exactly in the same place. Please check my comment in #11171. It's unclear why there is such a memory error. Since it always fails in the same operator, it's less likely that the error was introduced by this PR. |
48a9b94
to
f874184
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, if it's only happening in the pooling operator, we should go ahead and merge this PR. The issue could then be investigated asynchronously.
Thanks a lot for iterating on this PR and addressing the reviews and fixing the regression. Also a big thanks to all the reviewers!
* Revert "Revert "Static alloc for hybridblock (apache#11313)" (apache#11318)" This reverts commit 9b8eb56. * fix
* Revert "Revert "Static alloc for hybridblock (apache#11313)" (apache#11318)" This reverts commit 9b8eb56. * fix
* Fix broken build with cython 0.28 * Fix setup.py to be compatible with cython 0.28 * Fix broken cython ndarray module * Revised comments * Replace hard coded library path with one obtained by find_lib_path * Add documentation for MXNET_ENABLE_CYTHON and MXNET_ENFORCE_CYTHON * Add cython build to CI * Fix for cython CI * Adjust python environment for cython CI * Add make variables to set python executable * Fix typo * Fix nnvm include path * Does not use ccache for cython * Fix issues with the wildcards in the library list in Jenkinsfile * Fix issues with the wildcards in the library list in Jenkinsfile (continued) * Fix issues with the wildcards in the library list in Jenkinsfile (continued) * Intentionally introduce a bug to check that the tests actually runwith cython * Remove the intentionally introduced bug * Update installation doc * Retrigger CI * Run cython CI in ubuntu environment instead of CentOS environment * Commit a missed file * Fix a bug in check_cython * Refine environments for cython CI * Restore unrelated changes * Fix a problem occurring when the cython modules for python 2 and 3 are built successively * Trigger CI * Pin the cython version in the CI * Catch up #11320 * Remove optional arguments unused after #11320 * Trigger CI * Trigger CI * Trigger CI * Remove unnecessary stype argument from the NDArrayBase constructor * Revise confusing initialization of `_ndarray_cls` * Add cython build for python3 in CI * Fix misplaced cython build in CI * Adjust CI environments for cython * Fix invalid path for cython generated .so files in cmake build * Revert invalid fix * Revise docs * Revise check_cython * Temporaily use make instead of cmake for debugging * Temporal changes for debugging * Temporal changes for debugging * Temporaily use ctypes instead of cython modules for debugging * Temporaily disable ccache for debugging * Temporaily use make (DEV = 0) instead of cmake for debugging * Temporaily disable cudnn for debugging * Restore temporal changes * Temporarily disable coverage report * Adapt to Jenkinsfile_utils * Adapt to Jenkinsfile_utils (cont.) * Restore unrelated changes * Restore temporal changes * Resolving conflict * Test with the cmake build is removed * Add MXNET_ENABLE_CYTHON=0 to tensorrt test * Fix typo * Trigger CI * Trigger CI * Adapt to Jenkinsfile refactoring * Adapt to Jenkinsfile refactoring (cont.) * Trigger CI * Trigger CI * Stash missing cython modules * Trigger CI * CMake build of cython modules without unit tests * Fix typo * Trigger CI * Fix a mistake introduced in merging process * trigger test * Update Jenkinsfile_utils.groovy * Trigger CI * Trigger CI * Trigger CI * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests
* Fix broken build with cython 0.28 * Fix setup.py to be compatible with cython 0.28 * Fix broken cython ndarray module * Revised comments * Replace hard coded library path with one obtained by find_lib_path * Add documentation for MXNET_ENABLE_CYTHON and MXNET_ENFORCE_CYTHON * Add cython build to CI * Fix for cython CI * Adjust python environment for cython CI * Add make variables to set python executable * Fix typo * Fix nnvm include path * Does not use ccache for cython * Fix issues with the wildcards in the library list in Jenkinsfile * Fix issues with the wildcards in the library list in Jenkinsfile (continued) * Fix issues with the wildcards in the library list in Jenkinsfile (continued) * Intentionally introduce a bug to check that the tests actually runwith cython * Remove the intentionally introduced bug * Update installation doc * Retrigger CI * Run cython CI in ubuntu environment instead of CentOS environment * Commit a missed file * Fix a bug in check_cython * Refine environments for cython CI * Restore unrelated changes * Fix a problem occurring when the cython modules for python 2 and 3 are built successively * Trigger CI * Pin the cython version in the CI * Catch up apache#11320 * Remove optional arguments unused after apache#11320 * Trigger CI * Trigger CI * Trigger CI * Remove unnecessary stype argument from the NDArrayBase constructor * Revise confusing initialization of `_ndarray_cls` * Add cython build for python3 in CI * Fix misplaced cython build in CI * Adjust CI environments for cython * Fix invalid path for cython generated .so files in cmake build * Revert invalid fix * Revise docs * Revise check_cython * Temporaily use make instead of cmake for debugging * Temporal changes for debugging * Temporal changes for debugging * Temporaily use ctypes instead of cython modules for debugging * Temporaily disable ccache for debugging * Temporaily use make (DEV = 0) instead of cmake for debugging * Temporaily disable cudnn for debugging * Restore temporal changes * Temporarily disable coverage report * Adapt to Jenkinsfile_utils * Adapt to Jenkinsfile_utils (cont.) * Restore unrelated changes * Restore temporal changes * Resolving conflict * Test with the cmake build is removed * Add MXNET_ENABLE_CYTHON=0 to tensorrt test * Fix typo * Trigger CI * Trigger CI * Adapt to Jenkinsfile refactoring * Adapt to Jenkinsfile refactoring (cont.) * Trigger CI * Trigger CI * Stash missing cython modules * Trigger CI * CMake build of cython modules without unit tests * Fix typo * Trigger CI * Fix a mistake introduced in merging process * trigger test * Update Jenkinsfile_utils.groovy * Trigger CI * Trigger CI * Trigger CI * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests * Trigger tests
This reverts commit 9b8eb56.
Description
(Brief description on what this PR is about)
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments