V1.8.x #19262

leezu · 2020-10-01T03:47:44Z

Use magit-cherry mode to check for commits present in v1.7.x but missing from v1.8.x. - indicates that the commit is also present in v1.8.x, whereas + indicates that the commit is missing.

There are a couple of false positives (declared missing but actually present), as those commits were forward ported to v1.8.x in a squashed form. I went through all + marked commits and applied them to the v1.8.x branch.

The only conflict to resolve was in 8c7c2f1, which had to take into account the change made by ce0a518.
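The `+`/`-` check described above can also be scripted. The following is a minimal sketch, assuming the comparison is exported as text in the format printed by `git cherry -v v1.8.x v1.7.x` (one `<marker> <sha> <subject>` line per commit). The names `parse_cherry` and `likely_false_positives`, the sample hashes, and the sample subjects are hypothetical and not taken from this PR. The false-positive heuristic simply looks for the same `(#NNNN)` PR reference in the target branch's log text, which is only a rough stand-in for the manual review done here.

```python
import re
from typing import List, NamedTuple


class CherryEntry(NamedTuple):
    missing: bool  # True for "+", False for "-"
    sha: str
    subject: str


def parse_cherry(output: str) -> List[CherryEntry]:
    """Parse lines of the form '<marker> <sha> <subject>'."""
    entries = []
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        parts = line.split(" ", 2)
        if len(parts) < 2 or parts[0] not in ("+", "-"):
            raise ValueError(f"unexpected line: {raw!r}")
        subject = parts[2] if len(parts) == 3 else ""
        entries.append(CherryEntry(parts[0] == "+", parts[1], subject))
    return entries


# Matches PR references such as "(#18482)" or "(apache#18482)".
PR_REF = re.compile(r"#(\d+)\)")


def likely_false_positives(entries: List[CherryEntry], target_log: str) -> List[CherryEntry]:
    """Return '+' entries whose PR number also occurs in the target branch log."""
    ported = set(PR_REF.findall(target_log))
    return [
        e for e in entries
        if e.missing and ported & set(PR_REF.findall(e.subject))
    ]


# Hypothetical sample data, for illustration only.
sample = """\
- 1111111 Fix something (#101)
+ 2222222 Backport something else (#102)
+ 3333333 Another fix (#103)
"""
target_log = "Squashed backports\n\n* Backport something else (#102)\n"

entries = parse_cherry(sample)
missing = [e for e in entries if e.missing]
suspects = likely_false_positives(entries, target_log)
```

Commits that remain in `missing` but not in `suspects` would be the candidates to apply to the target branch; anything in `suspects` still needs a manual look, since a matching PR number does not prove the change was ported completely.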

mxnet-bot · 2020-10-01T03:47:49Z

Hey @leezu , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [website, centos-cpu, edge, miscellaneous, windows-cpu, unix-gpu, clang, sanity, unix-cpu, centos-gpu, windows-gpu]

Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.
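The trigger syntax above is simple enough to generate programmatically. This is a small sketch, assuming the job names listed in the bot message are the complete set; the helper name `ci_trigger_comment` is hypothetical and is not part of mxnet-bot.

```python
from typing import Optional, Sequence

# Job names copied from the bot message above.
SUPPORTED_JOBS = [
    "website", "centos-cpu", "edge", "miscellaneous", "windows-cpu",
    "unix-gpu", "clang", "sanity", "unix-cpu", "centos-gpu", "windows-gpu",
]


def ci_trigger_comment(jobs: Optional[Sequence[str]] = None) -> str:
    """Build the comment text that asks mxnet-bot to rerun CI jobs."""
    if not jobs:
        return "@mxnet-bot run ci [all]"
    unknown = [job for job in jobs if job not in SUPPORTED_JOBS]
    if unknown:
        raise ValueError(f"unsupported CI jobs: {unknown}")
    return "@mxnet-bot run ci [" + ", ".join(jobs) + "]"


all_jobs = ci_trigger_comment()
one_job = ci_trigger_comment(["unix-gpu"])
two_jobs = ci_trigger_comment(["unix-gpu", "windows-gpu"])
```

Here `one_job` has the same form as the retrigger comments posted later in this thread.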

sxjscience · 2020-10-01T03:56:09Z

Also CC @samskalicky @sandeep-krishnamurthy

sandeep-krishnamurthy · 2020-10-01T04:46:46Z

Thank you so much @leezu

sxjscience · 2020-10-01T04:48:23Z

In fact, should we ensure that each release needs to include all the commits of the previous release? @samskalicky @sandeep-krishnamurthy

samskalicky · 2020-10-01T04:52:28Z

> In fact, should we ensure that each release needs to include all the commits of the previous release? @samskalicky @sandeep-krishnamurthy

I like the idea of comparing against the previous release to ensure no PRs are missing (comparing backwards against all previous releases seems a bit too much, though). But there's already a lot of work required of the release manager: https://cwiki.apache.org/confluence/display/MXNET/Release+Process . It's already a significant time commitment.

Maybe we should consider adding instructions for committers/release managers to verify that commits to release branches (i.e. 1.7.x or 1.8.x) are just cherry-picks/ports of PRs from the base branch (i.e. 1.x), to avoid the problem in the first place.
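One way such a verification step might look is sketched below. It assumes commit subjects are available as plain strings (for example from `git log --format=%s <branch>`) and that a cherry-picked commit keeps the original `(#NNNN)` PR reference in its subject, as in the `(apache#18632) (apache#18703)` title in this PR. The function name `commits_without_base_pr` and the sample subjects are hypothetical.

```python
import re
from typing import Iterable, List

# Matches PR references such as "(#18482)" or "(apache#18482)".
PR_REF = re.compile(r"#(\d+)\)")


def commits_without_base_pr(
    release_subjects: Iterable[str], base_subjects: Iterable[str]
) -> List[str]:
    """Return release-branch subjects that share no PR number with the base branch."""
    base_prs = set()
    for subject in base_subjects:
        base_prs.update(PR_REF.findall(subject))
    return [
        subject for subject in release_subjects
        if not base_prs & set(PR_REF.findall(subject))
    ]


# Hypothetical sample data, for illustration only.
base = ["Fix A (#201)", "Add B (#202)"]
release = [
    "Fix A (#201) (#210)",            # cherry-pick of #201, backport PR #210
    "Hotfix only on release (#211)",  # never landed on the base branch
    "Manual change without PR",
]
flagged = commits_without_base_pr(release, base)
```

Anything in `flagged` would need a manual check. Subject matching is only a heuristic: squashed or retitled ports can hide a PR number, so this complements rather than replaces a patch-level comparison such as `git cherry`.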

sxjscience · 2020-10-01T04:55:24Z

@samskalicky I agree. We should revise the guideline for cherry-picking commits. That means we need to cherry-pick both to the specific branch and also to 1.x / 2.x in the future.

leezu · 2020-10-02T03:57:19Z

@mxnet-bot run ci [unix-gpu]

mxnet-bot · 2020-10-02T03:57:23Z

Jenkins CI successfully triggered : [unix-gpu]

leezu · 2020-10-02T13:43:28Z

@mxnet-bot run ci [unix-gpu]

mxnet-bot · 2020-10-02T13:43:32Z

Jenkins CI successfully triggered : [unix-gpu]

samskalicky · 2020-10-02T18:57:51Z

@leezu how do you wanna handle this? Should we merge this PR so we have a single commit we can cherry-pick to v1.x? Or do you wanna open a separate PR there first before merging this one? Just want to make sure we don't merge this into a feature branch without having it in the main v1.x branch again...

And what reviews do you need before merging this PR?

* Fix einsum gradient (apache#18482)
* [v1.7.x] Backport PRs of numpy features (apache#18653)
  * add zero grad for npi_unique (apache#18080)
  * fix np.clip scalar input case (apache#17788)
  * fix true_divide (apache#18393)

  Co-authored-by: Hao Jin <hjjn.amzn@gmail.com>
  Co-authored-by: Xi Wang <xidulu@gmail.com>
* [v1.7.x] backport mixed type binary ops to v1.7.x (apache#18649)
  * Fix Windows GPU CI (apache#17962)

    Update Windows CI to use VS 2019 and enable x64 bit toolchain. Previously we are using an older 32 bit toolchain causing OOM errors during linking. Switching to x64 bit toolchain on the older VS version previously used by the CI was attempted in apache#17912 and did not work. Update to Cuda 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up build as ninja-build is now preinstalled. Remove logic to install cmake 3.16 on every PR as cmake 3.17 is now preinstalled. Add build retrials due to cuda thrust + VS2019 flakyness.

    Co-authored-by: vexilligera <vexilligera@gmail.com>
  * backport mixed type

  Co-authored-by: Leonard Lausen <lausen@amazon.com>
  Co-authored-by: vexilligera <vexilligera@gmail.com>
* revise activations (apache#18700)
* [v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes (apache#18632) (apache#18703)
  * Fix the monitor_callback invalid issue during calibration with variable input shapes
  * retrigger CI
  * Add UT for monitor check and disable codecov

  Co-authored-by: Tao Lv <tao.a.lv@intel.com>
* Fail build_windows.py if all retries failed (apache#18177)
* Update to thrust 1.9.8 on Windows (apache#18218)
  * Update to thrust 1.9.8 on Windows
  * Remove debug logic
* Re-enable build retries on MSVC (apache#18230)

  Updating thrust alone did not help. Similar issues (though less often) still occur with updated thrust, and also with nvidia cub. Tracked upstream at NVIDIA/thrust#1090

Co-authored-by: Ke Han <38852697+hanke580@users.noreply.github.com>
Co-authored-by: Xingjian Shi <xshiab@connect.ust.hk>
Co-authored-by: Hao Jin <hjjn.amzn@gmail.com>
Co-authored-by: Xi Wang <xidulu@gmail.com>
Co-authored-by: Yijun Chen <chenyijun0902@gmail.com>
Co-authored-by: vexilligera <vexilligera@gmail.com>
Co-authored-by: ciyong <ciyong.chen@intel.com>
Co-authored-by: Tao Lv <tao.a.lv@intel.com>

hanke580 and others added 5 commits October 1, 2020 03:24

* Fix einsum gradient (apache#18482)

3fe38da

[v1.7.x] Backport PRs of numpy features (apache#18653)

7c904a9

* add zero grad for npi_unique (apache#18080) * fix np.clip scalar input case (apache#17788) * fix true_divide (apache#18393) Co-authored-by: Hao Jin <hjjn.amzn@gmail.com> Co-authored-by: Xi Wang <xidulu@gmail.com>

revise activations (apache#18700)

508de27

leezu requested review from aaronmarkham, anirudh2290, eric-haibin-lin, marcoabreu and szha as code owners October 1, 2020 03:47

leezu added 3 commits October 1, 2020 17:03

Fail build_windows.py if all retries failed (apache#18177)

a41ec7e

Update to thrust 1.9.8 on Windows (apache#18218)

122cc18

* Update to thrust 1.9.8 on Windows * Remove debug logic

Re-enable build retries on MSVC (apache#18230)

9fd0ce1

Updating thrust alone did not help. Similar issues (though less often) still occur with updated thrust, and also with nvidia cub. Tracked upstream at NVIDIA/thrust#1090

leezu added the pr-awaiting-review label (PR is waiting for code review) Oct 2, 2020

sxjscience approved these changes Oct 2, 2020

samskalicky merged commit 371b312 into apache:v1.8.x Oct 2, 2020

samskalicky mentioned this pull request Oct 2, 2020

Backport PRs in v1.7.x missing from v1.x #19281

Merged

leezu deleted the v1.8.x branch October 5, 2020 15:34
