Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory fixes. Resolves #10867, and resolves #14080 #14372

Merged
merged 14 commits into from Mar 28, 2019

Conversation

Projects
None yet
5 participants
@andrewfayres
Copy link
Contributor

andrewfayres commented Mar 8, 2019

Description

There were a few memory leaks in executor and some instances of where ResourceScope was causing problems. This should fix those known issues.

To elaborate on the ResourceScope issue: there are some times when classes such as Executor creates other NativeResources (usually NDArrays) after the Executor has been instantiated. The executor then has a dependency on that new NativeResource. Since these happen at different times, they could potentially be created at different ResourceScope levels. The result is an unstable state where exiting the scope would cause a crash. Fixed this by adding a method to NativeResource which allows moving a resource to the same scope as another. This allows the parent class to control the scope of their dependencies.

Resolves #10867, and resolves #14080.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@andrewfayres andrewfayres requested review from nswamy and yzhliu as code owners Mar 8, 2019

@andrewfayres andrewfayres changed the title Memory fixes. Resolves #11926, and resolves #14080 Memory fixes. Resolves #10867, and resolves #14080 Mar 8, 2019

@vandanavk

This comment has been minimized.

Copy link
Contributor

vandanavk commented Mar 8, 2019

@mxnet-label-bot add [Memory, Scala, pr-awaiting-review]

@andrewfayres

This comment has been minimized.

Copy link
Contributor Author

andrewfayres commented Mar 9, 2019

NeuralStyle on gpu seems to be failing. I'll have to dig into why. I also think I've thought of a better way to deal with the interdependency of symbol/executor than what is being done right now so I'll update that as well.

@nswamy

This comment has been minimized.

Copy link
Member

nswamy commented Mar 9, 2019

If I understood correct(from the description), the problem is Executor creates some NDArrays that gets disposed before the executor?
I am wondering how are these NDArrays getting released before the destruction of the executor?

@andrewfayres

This comment has been minimized.

Copy link
Contributor Author

andrewfayres commented Mar 9, 2019

@nswamy This is because there is no guarantee that they are created at the same ResourceScope level. For example, the Executor gets created and returned to the user. The user then opens up a ResourceScope to do inference. In this scope, the executor might end up creating new resources for itself. Those resources are then in the scope level of the user's ResourceScope and get disposed when that closes. This leaves the Executor without the dependencies it created.

The solution is to have objects like executor put any dependencies that they create into the same ResourceScope as themselves. This way they are all coupled together. I thought about using moveToOuterScope but there's no way to know how many levels of scope could be between the two.

@andrewfayres andrewfayres changed the title Memory fixes. Resolves #10867, and resolves #14080 [WIP] Memory fixes. Resolves #10867, and resolves #14080 Mar 9, 2019

@nswamy

This comment has been minimized.

Copy link
Member

nswamy commented Mar 11, 2019

@andrewfayres, thanks for taking care of this, Its a really nice find :)

I find the design of executor less than desirable(where it creates states in a method dynamically) after the
the object is created.
Anyway, here is what we discussed offline:

  • moveToScopeOf needs cognizant use of it after the creation of the resources
  • calling moveToScopeOf on a large amount of objects will have performance impact.

My suggestion is to create a couple methods in ResourceScope.
ResourceScope.setScope(scope: ResourceScope)
ResourceScope.resetScope()
We can document these methods to be used when you want to hold NativeResources as state variables inside another NativeResource, in this case the Executor.

Explore to automatically detect(possibly reflection) creation of NativeResource within another NativeResource.
Ideal would have been to pass the scope in which to create to a native resource, unfortunately that would require changing every API of NDArray, Symbol, etc., since most of them are exposed as static methods.

I am open to hear any other interesting ideas that is less intrusive.

@andrewfayres

This comment has been minimized.

Copy link
Contributor Author

andrewfayres commented Mar 11, 2019

I came up with a different solution which we (@nswamy and I) discussed offline. Essentially, ResourceScope.using already has an optional parameter for an existing scope. A few changes minor changes should let us reuse that.

This approach should be more consistent and intuitive. I'll update the PR.

@lanking520

This comment has been minimized.

Copy link
Contributor

lanking520 commented Mar 19, 2019

Any updates?

@andrewfayres

This comment has been minimized.

Copy link
Contributor Author

andrewfayres commented Mar 19, 2019

I was on call last week and didn't have time to pick this back up unfortunately. I started working on it again yesterday and finished the code changes. I'm going to do testing today and if all goes well I'll have an update this afternoon.

Ayres added some commits Mar 21, 2019

Ayres
Show resolved Hide resolved scala-package/core/src/main/scala/org/apache/mxnet/ResourceScope.scala Outdated
@@ -179,6 +175,16 @@ object ResourceScope {
}
}

private[mxnet] def usingIfScopeExists[A](scope: Option[ResourceScope])(body: => A): A = {

This comment has been minimized.

Copy link
@nswamy

nswamy Mar 21, 2019

Member

why do you need this new method?

This comment has been minimized.

Copy link
@andrewfayres

andrewfayres Mar 21, 2019

Author Contributor

Yeah so originally I wanted to use the .using method directly but ran into some problems. There are times when we need to add new native resources to the same scope of the parent but there's no guarantee that the parent is actually in an existing scope.

When this happens it leads to a few issues. First, the parent is in None scope which we cannot pass to ResourceScope.using. Second, we still want to execute the body and allocate all the new resources but don't want to want to make a new ResourceScope because all those new resources will disappear with the new scope.

Alternatives that I thought of were: 1.) to have the caller check whether or not it was in a scope and handle it appropriately. This is ugly and puts the onus on the callers in multiple places. 2.) Changing the using method to work with a None scope. I opted not to do this because it complicates that method and I believe would require changing the method parameters which I didn't want to do. 3.) Changing the default scope to be something other than None. This is probably a reasonable solution. Maybe we have some kind of base scope or something similar. That's likely to be a fairly significant change in both the design and behavior of this class.

@andrewfayres

This comment has been minimized.

Copy link
Contributor Author

andrewfayres commented Mar 21, 2019

Found a new bug last night where many of the optimizers cause a seg fault when used with FeedForward in 1.4.0. Surprisingly, no one has filed an issue for this so I went ahead and did so. It's fixed in this PR.

Resolves #14498

@andrewfayres andrewfayres changed the title [WIP] Memory fixes. Resolves #10867, and resolves #14080 Memory fixes. Resolves #10867, and resolves #14080 Mar 21, 2019

@andrewfayres

This comment has been minimized.

Copy link
Contributor Author

andrewfayres commented Mar 21, 2019

CI seems to be experiencing unrelated failures. I'll restart this later once they appear to be resolved.

@nswamy

This comment has been minimized.

Copy link
Member

nswamy commented Mar 23, 2019

I am surprised that the tests did not catch the seg-faults, I ran the code through several examples for many days and did not see the seg-faults.

@nswamy

This comment has been minimized.

Copy link
Member

nswamy commented Mar 23, 2019

Feedforward is to be deprecated and Module should be used instead, we never deprecated that module, may be we should consider deprecating that.

Ayres

@andrewfayres andrewfayres force-pushed the andrewfayres:reshape_leak branch from 7ae3c3c to 33c113c Mar 27, 2019

@nswamy

This comment has been minimized.

Copy link
Member

nswamy commented Mar 27, 2019

summary of offline discussion:

  • The method is private, this needs to be re-evaluate if we can do without adding scope twice into threadlocal. (ResourceScope:L162)
  • There is a bug in using method that closes the passed in scope(it should not), we will track a JIRA to track and resolve.
  • Also noting that checking the map could impact the performance, with assumption that we won't have too many ResourceScope in action on each thread at at time, at this time this solution is acceptable.
@nswamy

nswamy approved these changes Mar 27, 2019

Ayres
@nswamy

This comment has been minimized.

Copy link
Member

nswamy commented Mar 28, 2019

Thanks for nailing down this issue, great Job 💯 Please link the JIRA we discussed for fixing the scope closing if passed

@nswamy nswamy merged commit 102b46f into apache:master Mar 28, 2019

12 checks passed

ci/jenkins/mxnet-validation/centos-cpu Job succeeded
Details
ci/jenkins/mxnet-validation/centos-gpu Job succeeded
Details
ci/jenkins/mxnet-validation/clang Job succeeded
Details
ci/jenkins/mxnet-validation/edge Job succeeded
Details
ci/jenkins/mxnet-validation/miscellaneous Job succeeded
Details
ci/jenkins/mxnet-validation/sanity Job succeeded
Details
ci/jenkins/mxnet-validation/unix-cpu Job succeeded
Details
ci/jenkins/mxnet-validation/unix-gpu Job succeeded
Details
ci/jenkins/mxnet-validation/website Job succeeded
Details
ci/jenkins/mxnet-validation/windows-cpu Job succeeded
Details
ci/jenkins/mxnet-validation/windows-gpu Job succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@andrewfayres

This comment has been minimized.

Copy link
Contributor Author

andrewfayres commented Mar 28, 2019

Link to Jira for looking into & fixing the ResourceScope.using closing scopes that are passed in.

https://issues.apache.org/jira/browse/MXNET-1376

vdantu added a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019

Memory fixes. Resolves apache#10867, and resolves apache#14080 (apach…
…e#14372)

* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test

lanking520 added a commit to lanking520/incubator-mxnet that referenced this pull request Apr 1, 2019

Memory fixes. Resolves apache#10867, and resolves apache#14080 (apach…
…e#14372)

* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test

ZhennanQin added a commit to ZhennanQin/incubator-mxnet that referenced this pull request Apr 3, 2019

Memory fixes. Resolves apache#10867, and resolves apache#14080 (apach…
…e#14372)

* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test

pengxin99 pushed a commit to pengxin99/incubator-mxnet that referenced this pull request Apr 4, 2019

Memory fixes. Resolves apache#10867, and resolves apache#14080 (apach…
…e#14372)

* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test

nswamy added a commit that referenced this pull request Apr 5, 2019

Memory fixes. Resolves #10867, and resolves #14080 (#14372)
* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test

lanking520 added a commit that referenced this pull request Apr 5, 2019

Memory fixes. Resolves #10867, and resolves #14080 (#14372) (#14586)
* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.