This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[WIP] Windows dev environment configuration, update install instructions from source in the docs #17808

Closed
wants to merge 37 commits

Conversation

vexilligera
Contributor

@vexilligera vexilligera commented Mar 11, 2020

Description

Taking over #17206 with a fix for #17635; updates the toolchains to VS2019 and CUDA 10.2.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

Comment on lines 52 to 55
'openblas': 'https://windows-post-install.s3-us-west-2.amazonaws.com/OpenBLAS-windows-v0_2_19.zip',
'opencv': 'https://windows-post-install.s3-us-west-2.amazonaws.com/opencv-windows-4.1.2-vc14_vc15.zip',
'cudnn': 'https://windows-post-install.s3-us-west-2.amazonaws.com/cudnn-9.2-windows10-x64-v7.4.2.24.zip',
'nvdriver': 'https://windows-post-install.s3-us-west-2.amazonaws.com/nvidia_display_drivers_398.75_server2016.zip',
Contributor

Who owns the windows-post-install bucket?

Contributor Author

I currently don't have access to windows-post-install, so I opened another bucket (mxnet-windows-build). It would be easy to merge them.

Contributor

mxnet-ci-dev

logging.info("Perl install complete")


def install_clang():
Contributor

Why do we need clang when compiling with Visual Studio?

Contributor Author

They probably need to support the TVM op build on Windows as well, which may need this.

logging.info("Visual studio install complete.")


def install_cmake():
Contributor

Unused

ci/windows_dev_env/windows_deps_headless_installer.py (outdated review thread, resolved)
```
cmake -G "Visual Studio 15 2017 Win64" -T cuda=9.2,host=x64 -DUSE_CUDA=1 -DUSE_CUDNN=1 -DUSE_NVRTC=1 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_BLAS=open -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_LIST=Common -DCUDA_TOOLSET=9.2 -DCUDNN_INCLUDE=C:\cuda\include -DCUDNN_LIBRARY=C:\cuda\lib\x64\cudnn.lib "C:\incubator-mxnet"
.\setup.ps1
```
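With the toolchain update in this PR, a roughly equivalent invocation for VS2019 and CUDA 10.2 might look like the sketch below (assuming CUDA 10.2 is installed and cuDNN is unpacked to C:\cuda; the exact flags are illustrative and not taken from this PR):

```
cmake -G "Visual Studio 16 2019" -A x64 -T cuda=10.2,host=x64 -DUSE_CUDA=1 -DUSE_CUDNN=1 -DUSE_NVRTC=1 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_BLAS=open -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_LIST=Common -DCUDA_TOOLSET=10.2 -DCUDNN_INCLUDE=C:\cuda\include -DCUDNN_LIBRARY=C:\cuda\lib\x64\cudnn.lib "C:\incubator-mxnet"
```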
Contributor

Is it feasible to provide short non-blackbox instructions to install MXNet?

For example, if users have a standard installation of visual studio, can the installation experience be pretty similar to https://mxnet.apache.org/get_started/ubuntu_setup#build-the-mxnet-shared-library-from-source Step 2 and Step 3?

Contributor Author

Yes, we can explicitly use the cmd command invoked from ci/build_windows.py

@vexilligera vexilligera changed the title Windows dev environment configuration, update install instructions from source in the docs [WIP] Windows dev environment configuration, update install instructions from source in the docs Mar 11, 2020
@aaronmarkham
Contributor

Let me know when this is ready, and I'll test it out.

@leezu
Contributor

leezu commented Mar 19, 2020

@vexilligera what is the status of this PR?

@vexilligera
Contributor Author

vexilligera commented Mar 19, 2020

> @vexilligera what is the status of this PR?

I'm trying to update the AMI to VS2019 and CUDA 10.2 and there's a bug with dmlc-core, probably caused by different MSVC standards. I'm working on that. If it's an issue with MSVC then I will try to use Clang-cl for building.

@vexilligera
Contributor Author

Making it public is redistribution and thus license infringement.

I may also need to upload the VS2019 web installer as in install_vs, since the original download page redirects to the file URL and I couldn't find a short URL like the one for VS2017.

@ChaiBapchya
Contributor

ChaiBapchya commented Mar 26, 2020

Updated links for cudnn & VS2019 [private bucket: restricted access]

@leezu
Contributor

leezu commented Mar 26, 2020

@mxnet-bot run ci [windows-cpu, windows-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-cpu, windows-gpu]

@ChaiBapchya
Contributor

Retriggering... windows

@ChaiBapchya
Contributor

It's failing only for the windows-cpu MKLDNN+MKL part, with the error
"can't find specified path"

```
[2020-03-27T18:52:40.264Z] "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsx86_amd64.bat" && cmake -G Ninja -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=cl -DUSE_CUDA=OFF 
[2020-03-27T18:52:40.264Z] -DUSE_CUDNN=OFF -DENABLE_CUDA_RTC=OFF -DUSE_OPENCV=ON -DOpenCV_RUNTIME=vc15 -DOpenCV_ARCH=x64 -DUSE_OPENMP=ON -DUSE_BLAS=mkl -DUSE_LAPACK=ON -DUSE_DIST_KVSTORE=OFF -DUSE_MKL_IF_AVAILABLE=ON 
[2020-03-27T18:52:40.264Z] -DUSE_MKLDNN=ON -DCMAKE_BUILD_TYPE=Release C:\jenkins_slave\workspace\build-cpu-mkldnn-mkl
```

@larroy is this related to the path not being on one line, or is the path itself incorrect?

@ChaiBapchya
Contributor

ChaiBapchya commented Mar 27, 2020

Turns out the path mentioned in ci/build_windows.py is correct

'VS 2019': r'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsx86_amd64.bat'

That is exactly where the bat file is.

[verified this by launching an instance using the updated windows AMI]

However, this path, "C:\jenkins_slave\workspace\build-cpu-mkldnn-mkl", doesn't exist on the AMI. Joe fixed it and updated the AMI; testing on that.

@vexilligera
Contributor Author

vexilligera commented Mar 28, 2020

WIN_GPU_MKLDNN is having a flaky issue similar to pytorch/pytorch#25393

We also need to update 3rdparty/dmlc-core to the latest version to support VS2019.

@vexilligera
Contributor Author

@mxnet-bot run ci [windows-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-gpu]

@vexilligera
Contributor Author

@mxnet-bot run ci [unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@vexilligera
Contributor Author

@mxnet-bot run ci [windows-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-gpu]

@ChaiBapchya
Contributor

ChaiBapchya commented Mar 29, 2020

@marcoabreu Even in this case, @vexilligera is unable to reproduce the error locally, and WIN_GPU and WIN_GPU_MKLDNN are flaky. How do you suggest we proceed in trying to resolve the thrust issue on CI?

Fail : http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17808/13/pipeline
Pass : http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17808/12/pipeline

As pointed out in the pytorch issue pytorch/pytorch#25393, they face a similar problem with their Windows CI.

@ChaiBapchya
Contributor

@vexilligera as discussed offline, let's try testing the WIN_GPU and WIN_GPU_MKLDNN builds locally 10 times each (since one run takes 20-30 min) to establish some baseline... (ideally we would have tried 100 times, but given the resource and time constraints)

@vexilligera
Contributor Author

vexilligera commented Mar 29, 2020

> @vexilligera as discussed offline, let's try testing the WIN_GPU and WIN_GPU_MKLDNN builds locally 10 times each (since one run takes 20-30 min) to establish some baseline... (ideally we would have tried 100 times, but given the resource and time constraints)

In my local tests, WIN_GPU_MKLDNN is much more flaky than WIN_GPU: all WIN_GPU builds passed, while about 1/3 of the WIN_GPU_MKLDNN builds failed, based on my historical test data.

@ChaiBapchya suggests introducing a maximum retry count to circumvent this flaky issue, as pytorch has done in pytorch/pytorch#35375 (see the sketch after this comment).

@haojin2 suggests removing the WIN_GPU_MKLDNN test entirely, since MKLDNN doesn't make much sense when we are running on GPU, and the GPU+MKLDNN case is covered on other platforms.
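For illustration, a bounded-retry wrapper around the flaky build step could look roughly like the sketch below (a hypothetical helper; the actual ci/build_windows.py change may differ):

```python
import logging
import subprocess


def run_with_retries(cmd, max_retries=3):
    """Run a build command, retrying up to max_retries times if it fails."""
    ret = 1
    for attempt in range(1, max_retries + 1):
        ret = subprocess.call(cmd, shell=True)
        if ret == 0:
            return 0
        logging.warning("Build attempt %d/%d failed with exit code %d, retrying...",
                        attempt, max_retries, ret)
    return ret


# Hypothetical usage: retry the flaky WIN_GPU build up to 3 times.
# run_with_retries(r'python ci\build_windows.py -f WIN_GPU', max_retries=3)
```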

@ChaiBapchya
Contributor

We can get rid of the WIN_GPU_MKLDNN tests altogether, but that still leaves us with the flakiness of WIN_GPU, as can be seen in these builds.
For roughly the same code in this PR and the same Windows AMI, below are the results so far:

| WIN_GPU | WIN_GPU_MKLDNN | Build Number | Link |
| --- | --- | --- | --- |
| ✖︎ | ✔︎ | 15 | http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17808/15/pipeline |
| ✖︎ | ✔︎ | 14 | http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17808/14/pipeline |
| ✔︎ | ✔︎ | 12 | http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17808/12/pipeline |
| ✔︎ | ✖︎ | 13 | http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17808/13/pipeline |

Of course, your local tests have a different story to tell...

@vexilligera
Contributor Author

@larroy do you have any idea how to solve the Windows error 126 on CI? Basically the binaries aren't in the search path, and dlopen fails because of that.
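For context, error 126 on Windows generally means a dependent DLL could not be found. A minimal sketch of extending the DLL search path before libmxnet.dll is loaded (the directory paths below are assumptions, not the CI's actual layout):

```python
import os

# Hypothetical locations of the dependent DLLs (CUDA, OpenBLAS, OpenCV).
dll_dirs = [
    r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin",
    r"C:\Program Files\OpenBLAS-windows-v0_2_19\bin",
    r"C:\Program Files\opencv\build\x64\vc15\bin",
]

# LoadLibrary resolves dependencies via PATH; Python 3.8+ additionally
# honours directories registered with os.add_dll_directory.
os.environ["PATH"] = os.pathsep.join(dll_dirs) + os.pathsep + os.environ.get("PATH", "")
for d in dll_dirs:
    if hasattr(os, "add_dll_directory") and os.path.isdir(d):
        os.add_dll_directory(d)
```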

@marcoabreu
Contributor

I'll have a look at the logs when I'm at my computer

Just to be clear about two things:

  1. Retries are fine for now, but must be removed before merging.
  2. It is not an option to remove GPU MKLDNN. It does make sense for cases where an operator is not supported on the GPU or the user wants to run some parts on CPU. Just because it breaks doesn't mean we can remove it to ease our situation.

@leezu
Contributor

leezu commented Apr 2, 2020

Connectivity errors appear to be fixed now by switching to g4 instances (with faster CPU and network).

Removed the debug statements and rebased on latest master.

@leezu
Contributor

leezu commented Apr 2, 2020

@marcoabreu Do you still stand by your veto?

"Retries are fine for now, but must be removed before merging."

I suggest we go ahead and merge this with the retries. We don't have control over the thrust + Visual Studio incompatibility and may need to wait for a fix on NVIDIA's or Microsoft's side. It doesn't seem worth blocking all MXNet development until the thrust + Visual Studio bug is fixed, given it's an open issue affecting other projects as well: pytorch/pytorch#25393

The issue is tracked on our side in #17935

CMakeLists.txt Outdated
@@ -163,6 +163,8 @@ if(MSVC)
add_definitions(-DDMLC_STRICT_CXX11)
add_definitions(-DNOMINMAX)
set(CMAKE_C_FLAGS "/MP")
+# report an accurate value for recent C++ language standards support
+set(CMAKE_CXX_FLAGS "/Zc:__cplusplus")
Contributor

@ChaiBapchya @vexilligera you added this line, but it doesn't have any effect as it is overwritten in the line below

Contributor

set(flag ${flag} "secondval")

does append, right? I thought we were appending to CMAKE_CXX_FLAGS.

Contributor

the typo!!!! CMAKE_CXX_FLAGS instead of CMAKE_C_FLAGS

Contributor Author

My bad. I thought it would append...
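For reference, a minimal sketch of appending the flag instead of overwriting the variable (assuming the intent is simply to add /Zc:__cplusplus to the existing C++ flags):

```cmake
# Append /Zc:__cplusplus to the existing C++ flags rather than replacing them,
# so flags set earlier are preserved.
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /Zc:__cplusplus")
```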

Contributor

@marcoabreu marcoabreu left a comment

On one hand the argument is that the build has to be unblocked ASAP, but at the same time 700 changed lines are introduced.

I'd propose rethinking what exactly a "hotfix" is and coming back with a minimal changeset which resolves the issue. THEN we can discuss other changes. But it's not okay to bring in a rushed refactor under the umbrella of fixing stuff.

ci/build_windows.py (resolved review thread)
if 'CUDA_PATH' not in os.environ:
-    os.environ["CUDA_PATH"] = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v9.2"
+    os.environ["CUDA_PATH"] = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2"
Contributor

Same here

Contributor

What is your concern?

Contributor

We didn't agree on increasing the minimum cuda version

Contributor

We agreed on using VS 2019. CUDA 10 is a requirement for using VS 2019.

Contributor

Are you sure about that? Usually Visual Studio can be upgraded by installing the VC++ toolset, which then makes it possible to compile other CUDA versions. I'm not aware that this is categorically excluded.


@@ -38,8 +38,9 @@

from util import config_logging

-DOCKER_STOP_TIMEOUT_SECONDS = 3
+DOCKER_STOP_TIMEOUT_SECONDS = 10
Contributor

Unrelated change


def install_mkl():
logging.info("Installing MKL 2019.3.203...")
file_path = download("http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/15247/w_mkl_2019.3.203.exe")
Contributor

Move links and similar things into constants
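One possible shape for that refactor, as a sketch (the constant name and layout are assumptions, not what the script actually does):

```python
import logging

# All third-party download URLs gathered into one module-level mapping,
# so the individual install_* functions no longer hard-code links.
DEPS = {
    'mkl': 'http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/15247/w_mkl_2019.3.203.exe',
}


def install_mkl():
    logging.info("Installing MKL 2019.3.203...")
    # download() refers to the existing helper in windows_deps_headless_installer.py
    file_path = download(DEPS['mkl'])
    return file_path
```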

@marcoabreu
Contributor

Connectivity errors were never an issue with the Windows slaves. It sounds like you've changed something as part of generating the AMIs. Thus, I'm hesitant to just accept a bunch more changes (which have not been made transparent) in order to fix a previously created issue.

Please just revert things back to what they were and stop putting duct tape over things. The build used to work and the slaves used to be stable. But I have only seen attempts to change things and no root-causing (I know about the 2GB limit). Some action resulted in our build breaking, and the only sane choice is to revert whatever was done and then redo these efforts.

I will not accept multiple layers of fixes which could have been avoided if the changes had been properly developed and tested in a separate environment. At the moment, it seems like open-heart surgery on a production system.

@leezu leezu mentioned this pull request Apr 2, 2020
@leezu
Contributor

leezu commented Apr 2, 2020

Discussion continued at #17962 (comment)

@vexilligera
Contributor Author

Closing since this is now fixed in #17962

@leezu
Contributor

leezu commented Apr 17, 2020

@vexilligera it's not fixed. We need a script to generate the working AMI. #17962 is only a hotfix.

Will you be working on the automated setup or is someone else taking over the work? Please clarify. Thank you
