This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.7.x] backport mixed type binary ops to v1.7.x #18649

Merged
merged 2 commits into from
Jul 5, 2020

Conversation

yijunc
Contributor

@yijunc yijunc commented Jul 1, 2020

Description

Backport mixed type binary ops to v1.7.x branch (Mentioned in #18641 #18648 #18653)
Mainly code for #18250 and #18523
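For reference, "mixed type binary ops" here means that `mxnet.np` binary operators accept operands of different dtypes and promote the result, following NumPy-style semantics. A minimal sketch of the behavior being backported, using plain NumPy as a stand-in (the exact promotion rules implemented in the original PRs may differ in detail):

```python
import numpy as np

# Plain NumPy used as a stand-in for mxnet.np: mixed-type binary ops
# combine operands of different dtypes and pick the result dtype by
# type promotion instead of requiring both inputs to match.
a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.array([1, 2, 3], dtype=np.int64)

out = a + b            # mixed float32 + int64 operands
print(out.dtype)       # float64 under NumPy's promotion rules
print(out.tolist())    # [2.0, 4.0, 6.0]
```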

@mxnet-bot

Hey @BenjaminCHEN2016, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [centos-cpu, website, windows-cpu, unix-gpu, windows-gpu, edge, miscellaneous, unix-cpu, centos-gpu, clang, sanity]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@yijunc yijunc force-pushed the backport_v1.7_18523 branch 2 times, most recently from 306fcad to 454d401 Compare July 2, 2020 08:52
@yijunc yijunc changed the title backport mixed type to 1.7 backport mixed type to 1.7.x Jul 2, 2020
@yijunc yijunc changed the title backport mixed type to 1.7.x [v1.7.x] backport mixed type binary ops to v1.7.x Jul 2, 2020
@yijunc
Contributor Author

yijunc commented Jul 2, 2020

@sxjscience @ciyongch

@ciyongch
Contributor

ciyongch commented Jul 2, 2020

Thank you @BenjaminCHEN2016 for backporting these fixes. If I understand correctly, this is the only remaining PR that fixes the numpy operators and targets the 1.7 release, am I right? @sxjscience
Can you also help test the latest v1.7.x code base with this PR applied, to make sure it works as expected and has no other dependencies? Thanks!

@ciyongch
Contributor

ciyongch commented Jul 3, 2020

@BenjaminCHEN2016 @sxjscience there's a build error on the Windows platform, shown below; please help take a look. It would be great if it could be solved within 24 hours. Thanks a lot!

[2020-07-02T13:28:03.699Z] c:\jenkins_slave\workspace\build-gpu\src\operator\tensor\elemwise_binary_broadcast_op.h(237) : fatal error C1002: compiler is out of heap space in pass 2
[2020-07-02T13:28:03.699Z] jom: C:\jenkins_slave\workspace\build-gpu\build\CMakeFiles\mxnet_52.dir\build.make [CMakeFiles\mxnet_52.dir\src\operator\numpy\np_elemwise_broadcast_op.cc.obj] Error 1

@yijunc
Contributor Author

yijunc commented Jul 3, 2020

@ciyongch I will try to resolve this issue today.

@yijunc
Contributor Author

yijunc commented Jul 3, 2020

I think this might be a compiler issue? I tried it on my Windows machine and it compiles. @ciyongch @sxjscience

@ciyongch
Contributor

ciyongch commented Jul 3, 2020

Hi @BenjaminCHEN2016, it looks more like a compilation error, probably caused by the current complex expressions (or something similar) running the compiler out of memory, rather than a defect in the compiler itself.
I remember there was a discussion about compilation memory usage here, which reported that several numpy operator files consume a large memory footprint during compilation.
But I'm curious: if that is the case, why doesn't the master branch hit a similar CI issue? Is there any additional or different change in this patch compared to the original PRs?
Ping @wkcn, @leezu to see if you have any suggestions for this case, thanks!

@yijunc
Contributor Author

yijunc commented Jul 3, 2020

@leezu Is the CI different between the master and v1.7.x branches? It seems that the Windows CI on v1.7.x is an older version?

@wkcn
Member

wkcn commented Jul 3, 2020

Hi @ciyongch, the issue is related to the compiler version.
I have only built MXNet with GCC 6, GCC 7, and GCC 10; GCC 10 takes more than 11 GB of memory.

If you are using Visual Studio, could you try adding /Zm100 /GX to the additional options under Property / C/C++ / All Options to enlarge the heap space?
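For a CMake-driven build like the CI's, the equivalent of the Visual Studio property above can be passed on the configure command line. This is only a sketch: the /Zm value is illustrative, and selecting the 64-bit hosted compiler (`-Thost=x64`) is the change that the later Windows CI fix relied on to avoid C1002 heap exhaustion:

```shell
REM Illustrative Windows (cmd) configure step, not the CI's exact command:
REM   /Zm200     - scale up the compiler's internal memory allocation
REM   -Thost=x64 - use the 64-bit-hosted cl.exe, so the compiler itself
REM                is not limited to a 32-bit heap
cmake -G "Visual Studio 16 2019" -A x64 -Thost=x64 ^
      -DCMAKE_CXX_FLAGS="/Zm200" ..
```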

@ciyongch
Contributor

ciyongch commented Jul 3, 2020

Thanks for the information @wkcn. According to @BenjaminCHEN2016, the failure only happens in the MXNet Windows CI pipeline, not in his local environment. The concern is: if that is the case and the PR still cannot pass CI, do we still need to include it in v1.7.0? @sandeep-krishnamurthy @szha @sxjscience

@szha
Member

szha commented Jul 3, 2020

Yes, we still need the change. It sounds like we may have missed backporting some Windows CI changes. cc @ChaiBapchya to help clarify the issues with the Windows CI.

@ChaiBapchya
Contributor

Windows CI issues in the master branch in late March were fixed by these PRs:

  1. Fix Windows GPU CI #17962
  • Upgrade VS from 2015 to 2019
  • Use pre-installed CMake 3.17 [instead of installing 3.16]
  • Upgrade from a 32-bit to a 64-bit toolchain
  • CUDA 10.2 [VS 2019 requires 10.2]
  2. Fail build_windows.py if all retries failed #18177
  • Fail the build if all Windows retries fail
  3. Re-enable build retries on MSVC #18230
  • Windows CUDA [thrust issue]

Infrastructure-wise, all branches run on the same infra [same AMI, same instance types, etc.].
Code-wise, I'm not sure whether these PRs have been cherry-picked into other branches [specifically v1.7.x in this case].
That said, looking at the past 15 CI runs on merged commits in v1.7.x, not one has failed on windows-cpu or windows-gpu.

@leezu any thoughts on backporting your Windows-CI-specific fixes [which targeted master in the above-mentioned PRs]?

yijunc and others added 2 commits July 4, 2020 06:52
Update Windows CI to use VS 2019 and enable x64 bit toolchain. Previously we are using an older 32 bit toolchain causing OOM errors during linking. Switching to x64 bit toolchain on the older VS version previously used by the CI was attempted in apache#17912 and did not work. Update to Cuda 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up build as ninja-build is now preinstalled. Remove logic to install cmake 3.16 on every PR as cmake 3.17 is now preinstalled. Add build retrials due to cuda thrust + VS2019 flakyness.

Co-authored-by: vexilligera <vexilligera@gmail.com>
@yijunc
Contributor Author

yijunc commented Jul 4, 2020

@ChaiBapchya Thanks!
I applied #17962 to bump the Windows compiler to a 64-bit toolchain, and now the CI is working as expected.
@sxjscience @ciyongch I think it is ready to be merged now.

@yijunc
Contributor Author

yijunc commented Jul 4, 2020

@mxnet-bot run ci [unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu]

@yijunc
Contributor Author

yijunc commented Jul 4, 2020

@mxnet-bot run ci [unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@ciyongch
Contributor

ciyongch commented Jul 5, 2020

Thanks a lot @BenjaminCHEN2016 for your effort in getting this patch to pass all the CI tests. We now have everything we need for the 1.7 release.
@leezu @ChaiBapchya @sxjscience @szha @TaoLv @pengzhao-intel please help review and merge; then I will trigger the nightly test and prepare for rc0. Thanks!

@szha szha merged commit 477affe into apache:v1.7.x Jul 5, 2020
@samskalicky
Contributor

@ciyongch @szha any reason this wasn't committed to the 1.x branch?

@szha
Member

szha commented Sep 18, 2020

I think it was just missed.

@DickJC123
Contributor

DickJC123 commented Sep 25, 2020

A person might reasonably assume that v1.8 == v1.7 plus select commits from the 1.x branch. Will this PR be added to v1.x and v1.8, or will v1.8 and other future 1.x releases be missing this functionality?

I'm basically trying to cherry-pick commits on top of my "v1.7-ish" repo to get to v1.8. Do you think I should revert this PR commit on my repo to minimize future integration conflicts?

@ciyongch
Contributor

If I remember correctly, this PR only contains the minimal bug fixes required for v1.7, rather than the full set of features/bug fixes from the master branch. If that is the case, it would be better to pull the full version of the fixes into v1.x as well as v1.8.x. Ping @BenjaminCHEN2016 to help confirm.

leezu added a commit to leezu/mxnet that referenced this pull request Oct 1, 2020
* Fix Windows GPU CI (apache#17962)

* backport mixed type

Co-authored-by: Leonard Lausen <lausen@amazon.com>
Co-authored-by: vexilligera <vexilligera@gmail.com>
samskalicky pushed a commit that referenced this pull request Oct 2, 2020
* * Fix einsum gradient (#18482)

* [v1.7.x] Backport PRs of numpy features (#18653)

* add zero grad for npi_unique (#18080)

* fix np.clip scalar input case (#17788)

* fix true_divide (#18393)

Co-authored-by: Hao Jin <hjjn.amzn@gmail.com>
Co-authored-by: Xi Wang <xidulu@gmail.com>

* [v1.7.x] backport mixed type binary ops to v1.7.x (#18649)

* Fix Windows GPU CI (#17962)

Update Windows CI to use VS 2019 and enable x64 bit toolchain. Previously we are using an older 32 bit toolchain causing OOM errors during linking. Switching to x64 bit toolchain on the older VS version previously used by the CI was attempted in #17912 and did not work. Update to Cuda 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up build as ninja-build is now preinstalled. Remove logic to install cmake 3.16 on every PR as cmake 3.17 is now preinstalled. Add build retrials due to cuda thrust + VS2019 flakyness.

Co-authored-by: vexilligera <vexilligera@gmail.com>

* backport mixed type

Co-authored-by: Leonard Lausen <lausen@amazon.com>
Co-authored-by: vexilligera <vexilligera@gmail.com>

* revise activations (#18700)

* [v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes (#18632) (#18703)

* Fix the monitor_callback invalid issue during calibration with variable input shapes

* retrigger CI

* Add UT for monitor check and disable codecov

Co-authored-by: Tao Lv <tao.a.lv@intel.com>

* Fail build_windows.py if all retries failed (#18177)

* Update to thrust 1.9.8 on Windows (#18218)

* Update to thrust 1.9.8 on Windows

* Remove debug logic

* Re-enable build retries on MSVC (#18230)

Updating thrust alone did not help. Similar issues (though less often) still
occur with updated thrust, and also with nvidia cub. Tracked upstream at
NVIDIA/thrust#1090

Co-authored-by: Ke Han <38852697+hanke580@users.noreply.github.com>
Co-authored-by: Xingjian Shi <xshiab@connect.ust.hk>
Co-authored-by: Hao Jin <hjjn.amzn@gmail.com>
Co-authored-by: Xi Wang <xidulu@gmail.com>
Co-authored-by: Yijun Chen <chenyijun0902@gmail.com>
Co-authored-by: vexilligera <vexilligera@gmail.com>
Co-authored-by: ciyong <ciyong.chen@intel.com>
Co-authored-by: Tao Lv <tao.a.lv@intel.com>
samskalicky pushed a commit to samskalicky/incubator-mxnet that referenced this pull request Oct 2, 2020
samskalicky added a commit that referenced this pull request Oct 3, 2020