Add AMP patching of npi ops in _api_internal module #19488
Conversation
Hey @mk-61, thanks for submitting the PR.
CI supported jobs: [unix-gpu, centos-cpu, windows-cpu, windows-gpu, centos-gpu, miscellaneous, website, clang, edge, unix-cpu, sanity]
@mxnet-bot run ci [centos-cpu]
Jenkins CI successfully triggered: [centos-cpu]
We may need to add the example in #19463 as a test case.
@mxnet-bot run ci [centos-cpu, centos-gpu, unix-cpu, unix-gpu]
Jenkins CI successfully triggered: [unix-cpu, unix-gpu, centos-cpu, centos-gpu]
@mxnet-bot run ci [centos-gpu, unix-gpu]
Jenkins CI successfully triggered: [unix-gpu, centos-gpu]
@sandeep-krishnamurthy Could someone help with the CI test failure? The failure seems unrelated to this PR.
@leezu, yes, the failures do appear to be related to this PR, yet in a way I don't fully understand. I tried running these tests locally, and it appears that running test_amp_init.py changes something in global state that causes some other tests to fail. That is understandable, in a way, since the test calls amp.init() and performs some Python monkey patching - that's how AMP works. That's why I put this test into a separate module. Apparently, this is not enough, as it is not fully isolated. So I can suggest the following ways to fix this, from most to least desirable, but for the first two I'll need help from someone more familiar with pytest and how it's integrated into our CI:
The last option is not ideal, since we do expect to add more tests that require calling amp.init().
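The leak described above can be illustrated with a small, self-contained sketch. The names here are hypothetical stand-ins, not MXNet internals - the point is only that replacing a callable in a shared module object is visible to every test that imports that module afterwards:

```python
import types

# Stand-in for a shared module whose ops AMP would patch.
fake_mod = types.SimpleNamespace(op=lambda x: x)

def amp_like_init(mod):
    # Replace the op in the shared namespace, the way amp.init()
    # monkey-patches MXNet ops. There is no corresponding "undo".
    original = mod.op
    mod.op = lambda x: original(x) * 2  # "patched" behavior

amp_like_init(fake_mod)

# Any later test using the same module object now sees the patched op,
# even if it never asked for AMP:
assert fake_mod.op(3) == 6
```

Running the test in a separate process sidesteps this, because each process gets its own copy of the patched module.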
mx.npx.set_np(*flags)
@pytest.fixture(scope='module')
Can we try to narrow the scope to function?
It doesn't matter. The only reason I put scope='module' here is that amp.init() is called once per module. But even that doesn't matter much, since all subsequent calls would be a no-op.
Regardless of what you declare in pytest, amp.init() cannot be undone, at least with the current API.
Should we add a utility to disable AMP? If it is too much work to add quickly, you may add a separate pytest call in the runtime_functions.sh file for the AMP test files.
In fact, AMP in PyTorch is implemented as a context manager.
I agree it's a nifty API to enable/disable AMP with a context manager, but it would definitely require a separate, non-trivial change.
For now, to unblock this PR, I've moved the amp_init tests into a separate process.
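For reference, one possible shape for such a context-manager API is sketched below. This is purely illustrative - it is not MXNet's API, and the patched namespace and ops are hypothetical - but it shows the key property the current amp.init() lacks: the patch is undone on exit:

```python
import contextlib
import types

# Hypothetical ops namespace standing in for the patched MXNet module.
ops = types.SimpleNamespace(op=lambda x: x)
_original_op = ops.op

@contextlib.contextmanager
def amp_scope():
    ops.op = lambda x: _original_op(x) + 1  # patch on entry
    try:
        yield
    finally:
        ops.op = _original_op               # always restore on exit

with amp_scope():
    assert ops.op(1) == 2  # patched inside the scope
assert ops.op(1) == 1      # original behavior restored afterwards
```

With this shape, tests would no longer leak patched state into each other, since every `with amp_scope():` block cleans up after itself.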
@mk-61 would you like to help create a tracking issue for that change? It sounds like something we should do before the stable 2.0 release
@leezu - for now, I've just added an item here - #18896 - I keep it as a TODO list for AMP, of sorts. Still, I can create a separate issue if you prefer to track it that way.
About adding it before the stable 2.0 release, well... To be clear - it's not just a matter of exposing something existing via a new API. Also, I'd like to clarify our plans about #18697 first, since it affects pretty much everything AMP-related.
When do we expect the stable 2.0 release? We still have some other, higher-priority items to fix.
Finally, my hope was that if a new API is going to be a strict superset of the existing one - i.e., we'll still keep the amp.init() call, but also add a way to turn it off - maybe it's acceptable to add it later? Or, to put it differently, maybe it's less of an issue to add new APIs, as long as we are not removing anything?
> @leezu - for now, I've just added an item here - #18896 - I keep it as a TODO list for AMP, of sorts. Still, I can create a separate issue if you prefer to track it that way.

That's fine. Thank you.

> About adding it before the stable 2.0 release, well... To be clear - it's not just a matter of exposing something existing via a new API. Also, I'd like to clarify our plans about #18697 first, since it affects pretty much everything AMP-related.

Using a graph pass may address the current issue of global state. A potential downside is that a graph pass will only work with hybrid models.

> When do we expect the stable 2.0 release? We still have some other, higher-priority items to fix.

We'd like to create an Alpha release soon. There is no date for a stable release yet.

> Finally, my hope was that if a new API is going to be a strict superset of the existing one - i.e., we'll still keep the amp.init() call, but also add a way to turn it off - maybe it's acceptable to add it later? Or, to put it differently, maybe it's less of an issue to add new APIs, as long as we are not removing anything?

It's fine to change the API in 2.0, if we feel it should be improved.
Thank you guys for your help @leezu @sxjscience.
LGTM
Description
Apparently, some NumPy ops are registered in the mxnet.ndarray.numpy._api_internal module, in addition to *._internal. This PR implements AMP patching of such ops there as well.
Fixes #19463.
Checklist
Essentials
@ptrendx