Add AMP patching of npi ops in _api_internal module #19488
Conversation
Hey @mk-61, thanks for submitting the PR.
CI supported jobs: [unix-gpu, centos-cpu, windows-cpu, windows-gpu, centos-gpu, miscellaneous, website, clang, edge, unix-cpu, sanity]
@mxnet-bot run ci [centos-cpu]
Jenkins CI successfully triggered: [centos-cpu]
We may need to add the example in #19463 as a test case.
@mxnet-bot run ci [centos-cpu, centos-gpu, unix-cpu, unix-gpu]
Jenkins CI successfully triggered: [unix-cpu, unix-gpu, centos-cpu, centos-gpu]
@mxnet-bot run ci [centos-gpu, unix-gpu]
Jenkins CI successfully triggered: [unix-gpu, centos-gpu]
@sandeep-krishnamurthy Could someone help with the CI test failure? The failure seems unrelated to this PR.
@leezu, yes, the failures do appear to be related to this PR, yet in a way I don't fully understand. I tried running these tests locally, and it appears that running test_amp_init.py changes something in global state that causes some other tests to fail. That is understandable, in a way, since the test calls amp.init() and performs some Python monkey patching - that's how AMP works. That's why I put this test into a separate module. Apparently, this is not enough, as it is not fully isolated. So I can suggest the following ways to fix this, from most to least desirable, but for the first two I'll need help from someone more familiar with pytest and how it's integrated into our CI:
The last option is not ideal, since we do expect to add more tests that require calling amp.init().
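The leak described above can be illustrated with a small, self-contained sketch. The names here are hypothetical stand-ins, not MXNet internals - the point is only that replacing a callable in a shared module object is visible to every test that imports that module afterwards:

```python
import types

# Stand-in for a shared module whose ops AMP would patch.
fake_mod = types.SimpleNamespace(op=lambda x: x)

def amp_like_init(mod):
    # Replace the op in the shared namespace, the way amp.init()
    # monkey-patches MXNet ops. There is no corresponding "undo".
    original = mod.op
    mod.op = lambda x: original(x) * 2  # "patched" behavior

amp_like_init(fake_mod)

# Any later test using the same module object now sees the patched op,
# even if it never asked for AMP:
assert fake_mod.op(3) == 6
```

Running the test in a separate process sidesteps this, because each process gets its own copy of the patched module.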
mx.npx.set_np(*flags)
@pytest.fixture(scope='module')
Can we try to narrow the scope to function?
It doesn't matter. The only reason I put scope='module' here is that amp.init() is called once per module. But even that doesn't matter much, since all subsequent calls would be a no-op.
Regardless of what you declare in pytest, amp.init() cannot be undone, at least with the current API.
Should we add a utility to disable AMP? If it is too much work to add quickly, you may add a separate pytest call in the runtime_functions.sh file for the AMP test files.
In fact, AMP in PyTorch is implemented as a context manager.
I agree it's a nifty API to enable/disable AMP with a context manager, but it would definitely require a separate, non-trivial change.
For now, to unblock this PR, I've moved the amp_init tests into a separate process.
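For reference, one possible shape for such a context-manager API is sketched below. This is purely illustrative - it is not MXNet's API, and the patched namespace and ops are hypothetical - but it shows the key property the current amp.init() lacks: the patch is undone on exit:

```python
import contextlib
import types

# Hypothetical ops namespace standing in for the patched MXNet module.
ops = types.SimpleNamespace(op=lambda x: x)
_original_op = ops.op

@contextlib.contextmanager
def amp_scope():
    ops.op = lambda x: _original_op(x) + 1  # patch on entry
    try:
        yield
    finally:
        ops.op = _original_op               # always restore on exit

with amp_scope():
    assert ops.op(1) == 2  # patched inside the scope
assert ops.op(1) == 1      # original behavior restored afterwards
```

With this shape, tests would no longer leak patched state into each other, since every `with amp_scope():` block cleans up after itself.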
@mk-61 would you like to help create a tracking issue for that change? It sounds like something we should do before the stable 2.0 release
@leezu - for now, I've just added an item here - #18896 - I keep it as a TODO list for AMP, of sorts. Still, I can create a separate issue if you prefer to track it that way.
About adding it before the stable 2.0 release, well... To be clear - it's not just a matter of exposing something existing via a new API. Also, I'd like to clarify our plans about #18697 first, since it affects pretty much everything AMP-related.
When do we expect the stable 2.0 release? We still have some other, higher-priority items to fix.
Finally, my hope was that if a new API is going to be a strict superset of the existing one - i.e., we'll still keep the amp.init() call, but also add a way to turn it off - maybe it's acceptable to add it later? Or, to put it differently, maybe it's less of an issue to add new APIs, as long as we are not removing anything?
> @leezu - for now, I've just added an item here - #18896 - I keep it as a TODO list for AMP, of sorts. Still, I can create a separate issue if you prefer to track it that way.

That's fine. Thank you.

> About adding it before the stable 2.0 release, well... To be clear - it's not just a matter of exposing something existing via a new API. Also, I'd like to clarify our plans about #18697 first, since it affects pretty much everything AMP-related.

Using a graph pass may address the current issue of global state. A potential downside is that a graph pass will only work with hybrid models.

> When do we expect the stable 2.0 release? We still have some other, higher-priority items to fix.

We'd like to create an Alpha release soon. There is no date for a stable release yet.

> Finally, my hope was that if a new API is going to be a strict superset of the existing one - i.e., we'll still keep the amp.init() call, but also add a way to turn it off - maybe it's acceptable to add it later? Or, to put it differently, maybe it's less of an issue to add new APIs, as long as we are not removing anything?

It's fine to change the API in 2.0, if we feel it should be improved.
Thank you guys for your help @leezu @sxjscience.
LGTM
Description
Apparently, some NumPy ops are registered in the mxnet.ndarray.numpy._api_internal module, in addition to *._internal. This PR implements AMP patching of such ops there as well.
Fixes #19463.
Checklist
Essentials
@ptrendx