Fix Cached_op with static_shape=true #15298

ZhennanQin · 2019-06-21T02:18:25Z

Description

Should address #15281

@pengzhao-intel @TaoLv @junrushao1994 @zheng-da

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

junrushao

LGTM. Thanks for the fix :-)

ZhennanQin · 2019-06-25T06:14:59Z

#15297 should be fixed with this PR.

pengzhao-intel · 2019-06-25T13:23:19Z

@anirudh2290 @roywei please take a look and review.

roywei · 2019-06-25T15:17:28Z

The segfault and core dump are fixed.
On the sockeye side, I'm still getting 8 unit test failures with python3 setup.py test:


============================================= 8 failed, 536 passed in 42.53 seconds ==============================================
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/home/ubuntu/incubator-mxnet/python/mxnet/_ctypes/ndarray.py", line 51, in __del__
AttributeError: 'NoneType' object has no attribute 'MXNDArrayFree'

All failures seem to occur on the sockeye side.

        # Bucket sentences as padded np arrays
>       for source, target in zip(source_list, target_sentences):
E       TypeError: zip argument #1 must support iteration

anirudh2290 · 2019-06-25T17:12:46Z

src/nnvm/legacy_op_util.cc

-      }
-      CHECK_EQ(outputs.size(), in_grad_.size());
-      for (size_t i = 0; i < outputs.size(); ++i) in_grad_[i] = outputs[i];
-      bwd_init_ = true;


This caching was first removed in #14738. I think this has performance implications, since we are no longer caching the TBlobs. Is the use case similar? Is this caused by the split operator?

ZhennanQin

When legacy ops are used in a Cached_op, this caching is not correct: even with static_alloc=true and static_shape=true, the input or output TBlobs may change if they are also the inputs or outputs of the Cached_op.

Consider a small case where the end user hybridizes only one legacy op. Its input is then the Cached_op's input, and likewise its output is the Cached_op's output. The end user may pass different NDArrays to this Cached_op on each call, so the cached TBlobs are no longer correct.

anirudh2290

Okay, thanks for clarifying!
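
The stale-buffer problem described in this thread can be illustrated with a toy model. This is a minimal sketch in plain Python, not MXNet code: `ToyLegacyOp`, `cache_buffers`, and `run_twice` are hypothetical names, and Python lists stand in for TBlobs. It assumes only what the thread states, namely that the operator remembers the buffers from its first call while the caller may supply different buffers later.

```python
class ToyLegacyOp:
    """Toy stand-in for a legacy operator wrapper (hypothetical, not MXNet)."""

    def __init__(self, cache_buffers):
        self.cache_buffers = cache_buffers
        self._cached_in = None
        self._cached_out = None

    def forward(self, inputs, outputs):
        if self.cache_buffers:
            # Buggy variant: remember the buffers seen on the first call
            # and keep using them on every later call.
            if self._cached_in is None:
                self._cached_in, self._cached_out = inputs, outputs
            inputs, outputs = self._cached_in, self._cached_out
        # The "kernel": double each element and write it to the output buffer.
        for i, value in enumerate(inputs):
            outputs[i] = 2 * value


def run_twice(cache_buffers):
    """Call the op twice, passing different buffers on the second call."""
    op = ToyLegacyOp(cache_buffers)
    first_in, first_out = [1, 2, 3], [0, 0, 0]
    second_in, second_out = [10, 20, 30], [0, 0, 0]
    op.forward(first_in, first_out)
    op.forward(second_in, second_out)
    return first_out, second_out


stale = run_twice(cache_buffers=True)   # second output buffer is never written
fresh = run_twice(cache_buffers=False)  # each call uses the buffers it was given
print(stale)
print(fresh)
```

With caching enabled, the second call silently reads and writes the first call's buffers, so the caller's second output is left untouched. Without caching, each call operates on the buffers it receives, which is the behavior this thread argues for.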

ZhennanQin · 2019-06-26T08:48:07Z

@roywei
On my side, sockeye has only 2 failures after this fix:

test/unit/test_data_io.py::test_parallel_sample_iter FAILED
test/unit/test_data_io.py::test_sharded_parallel_sample_iter FAILED

Those failures are also reproducible before the problematic PR (09202f7) was merged.

So I think those sockeye failures are not related to that PR.

roywei

I have verified that this resolves the sockeye failure. The remaining test failures should be fixed on the sockeye side; they are not related to cached op. Thanks for the fix!

pengzhao-intel

LGTM

pengzhao-intel · 2019-06-26T23:07:03Z

@ZhennanQin Please verify the performance of this PR with our internal tests and NLP tests.
If everything is OK, I will merge this soon.

anirudh2290

Thanks for the quick fix

ZhennanQin · 2019-06-27T02:43:31Z

@pengzhao-intel I tested symbolic and gluon inference speed as well as BERT, and everything seems to work fine.

pengzhao-intel · 2019-06-27T03:08:18Z

Thanks, merging now.

pengzhao-intel · 2019-06-27T03:08:52Z

Please backport this fix to the r1.5 branch.

sandeep-krishnamurthy · 2019-06-27T05:41:42Z

> @ZhennanQin Please verify the performance of this PR with our internal tests and NLP tests.
> If everything is OK, I will merge this soon.

@pengzhao-intel - I would be very interested to learn more about the internal tests and benchmark setup you have. The main motivation is to see whether some of these tests should be brought to Nightly CI.

pengzhao-intel · 2019-06-27T08:26:06Z

> > @ZhennanQin Please verify the performance of this PR with our internal tests and NLP tests.
> > If everything is OK, I will merge this soon.
>
> @pengzhao-intel - I would be very interested to learn more about the internal tests and benchmark setup you have. The main motivation is to see whether some of these tests should be brought to Nightly CI.

@sandeep-krishnamurthy It's a good idea :) We have a set of models and test their latency and throughput on each CI run, so we can guarantee the performance of FP32 and INT8. Currently, the 2nd generation scalable processor is available in EC2 on C5.12xlarge and C5.24xlarge.
Thus, it would be good to move our tests (public models) into the nightly build. Is it possible to set up these two types of instances for the nightly run?

sandeep-krishnamurthy · 2019-06-27T18:09:50Z

> > > @ZhennanQin Please verify the performance of this PR with our internal tests and NLP tests.
> > > If everything is OK, I will merge this soon.
> >
> > @pengzhao-intel - I would be very interested to learn more about the internal tests and benchmark setup you have. The main motivation is to see whether some of these tests should be brought to Nightly CI.
>
> @sandeep-krishnamurthy It's a good idea :) We have a set of models and test their latency and throughput on each CI run, so we can guarantee the performance of FP32 and INT8. Currently, the 2nd generation scalable processor is available in EC2 on C5.12xlarge and C5.24xlarge.
> Thus, it would be good to move our tests (public models) into the nightly build. Is it possible to set up these two types of instances for the nightly run?

Thanks @pengzhao-intel - I will create a GitHub issue to discuss this with the community members who help with CI and other activities around benchmarks and performance tests.

pengzhao-intel mentioned this pull request Jun 21, 2019

Sockeye failure with MXNet #15297

Open

junrushao approved these changes Jun 23, 2019

Roshrini added the pr-awaiting-review label Jun 23, 2019

Fix

ac55c06

ZhennanQin force-pushed the fix_cached_op_again branch from ecf77b3 to ac55c06 Compare June 25, 2019 05:56

run ci

cbe9dc7

anirudh2290 reviewed Jun 25, 2019

roywei approved these changes Jun 26, 2019

pengzhao-intel approved these changes Jun 26, 2019

anirudh2290 approved these changes Jun 26, 2019

pengzhao-intel merged commit 582489c into apache:master Jun 27, 2019

ZhennanQin deleted the fix_cached_op_again branch June 27, 2019 04:28

roywei pushed a commit to roywei/incubator-mxnet that referenced this pull request Jun 27, 2019

Fix Cached_op with static_shape=true (apache#15298)

2de0b44

* Fix * run ci

roywei mentioned this pull request Jun 27, 2019

[backport 1.5.x]Fix Cached_op with static_shape=true (#15298) #15380

Merged

szha pushed a commit that referenced this pull request Jun 27, 2019

[backport 1.5.x]Fix Cached_op with static_shape=true (#15298) (#15380)

75a9e18

* Fix Cached_op with static_shape=true (#15298) * Fix * run ci * trigger

roywei mentioned this pull request Jul 19, 2019

Gluon Inference failed #15281

Closed
