This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.x] Backport Unittest tolerance handling improvements (#18694). Also test seeding (#18762). #19148

Merged
12 commits merged into apache:v1.x on Sep 17, 2020

Conversation

@DickJC123 (Contributor) commented Sep 15, 2020

Description

This backport prepares MXNet 1.8 to be built against CUDA 11 and cuDNN 8 and to run on A100 GPUs, which use TensorFloat-32 (TF32) math for float32 operations by default. See PR #18694 for full details.
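For context, TF32 keeps the float32 exponent range but carries only about 10 bits of mantissa, so float32 results computed on A100 GPUs need looser comparison tolerances than on earlier architectures. Below is a minimal sketch of the dtype- and context-aware default-tolerance idea behind this backport; the helper name pick_tols and the specific values are illustrative only, not the actual default_tols()/effective_dtype() implementation in test_utils.py:

    import numpy as np
    import mxnet as mx

    # Illustrative sketch only: pick rtol/atol from the dtype, then loosen the
    # float32 tolerances when the data ran on a GPU that may use TF32 math
    # (~10 mantissa bits).  The backported test_utils.py uses its own
    # default_tols()/effective_dtype() helpers with different values.
    def pick_tols(dtype, ctx):
        base = {np.float16: (1e-2, 1e-2),
                np.float32: (1e-4, 1e-5),
                np.float64: (1e-6, 1e-8)}
        rtol, atol = base.get(np.dtype(dtype).type, (1e-4, 1e-5))
        if np.dtype(dtype) == np.float32 and ctx.device_type == 'gpu':
            # Treat float32-on-GPU results as if they had TF32 precision.
            rtol, atol = 1e-2, 1e-3
        return rtol, atol

    print(pick_tols(np.float32, mx.cpu()))   # tight float32 tolerances
    print(pick_tols(np.float32, mx.gpu(0)))  # loosened for possible TF32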

While developing the original PR, I fixed numerous other CI issues that had been keeping me from a passing CI run. By the time that PR was accepted, I was still working on a couple of additional fixes, which became the follow-up PR #18762, "Improve test seeding and robustness in test_numpy_interoperability.py". To help this backport pass CI, it includes that follow-up as well.
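The seeding improvements are about reproducibility: each test draws its own seed, logs it prominently enough to survive CI log filtering, and drains outstanding GPU work in teardown so failures are attributed to the right test. A simplified sketch of that pattern follows; it is a stand-in for the with_seed() decorator in MXNet's test code, not its exact implementation (which, among other things, also honors the MXNET_TEST_SEED environment variable):

    import functools
    import logging
    import random

    import numpy as np
    import mxnet as mx

    def with_seed(seed=None):
        """Seed the python/numpy/mxnet RNGs for one test and log the seed used."""
        def decorator(test_fn):
            @functools.wraps(test_fn)
            def wrapper(*args, **kwargs):
                this_seed = seed if seed is not None else random.randint(0, 2**31 - 1)
                # WARNING level so the seed is visible in default CI logs and a
                # failing test can be replayed with the same seed.
                logging.warning('Setting test np/mx/python random seeds to %d', this_seed)
                np.random.seed(this_seed)
                mx.random.seed(this_seed)
                random.seed(this_seed)
                try:
                    return test_fn(*args, **kwargs)
                finally:
                    # Drain outstanding async work so any failure is attributed
                    # to this test rather than a later one.
                    mx.nd.waitall()
            return wrapper
        return decorator

    @with_seed()
    def test_something_random():
        x = mx.nd.random.uniform(shape=(4,))
        assert (x.asnumpy() >= 0).all()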

@samskalicky @anirudh2290 @ChaiBapchya @ptrendx

Checklist

Essentials

  • [x] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc.)
  • [x] Changes are complete (i.e. I finished coding on this PR)
  • [x] All changes have test coverage
  • [x] Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

DickJC123 and others added 10 commits September 14, 2020 19:26
* Add sm arch 80 to Makefile

* Add TF32 to cuBLAS GEMMs

Signed-off-by: Serge Panev <spanev@nvidia.com>

* Add CUDA version guards

Signed-off-by: Serge Panev <spanev@nvidia.com>

* Remove useless TF32 for double and old CUDA version

Signed-off-by: Serge Panev <spanev@nvidia.com>

* Factorize VERSION_ADJUSTED_TF32_MATH

Signed-off-by: Serge Panev <spanev@nvidia.com>

* Add TF32 considerations to test_util.py:check_consistency()

* Bypass test_gluon_gpu.py:test_large_models if gmem >32GB

* Default tols in assert_almost_equal() now a function of dtype and ctx

* Expand types listed by default_tols()

* Fix pylint

* All with_seed() tests to waitall in teardown

* Elevate MXNET_TEST_SEED logging to WARNING

* Revert test_gluon_gpu.py:test_rnn_layer to default tols

* Fix test_gluon_model_zoo_gpu.py::test_inference and test_operator_gpu.py::test_np_linalg_{solve,tensorinv}

* test_numpy_interoperability.py to not fix seed for rest of CI

* Further fix to test_np_linalg_tensorinv

* Fix test_gluon_data.py:test_dataloader_context when run on 1-GPU system.

* Fix test_operator_gpu.py::test_embedding_with_type

* Fix test_operator_gpu.py::{test_*convolution_large_c,test_np_linalg_tensorsolve}

* Remove unneeded print() from test_numpy_interoperability.py

* Unify tol handling of check_consistency() and assert_almost_equal().  Test tweaks.

* Add tol handling of assert_almost_equal() with number args

* Add tol handling of bool comparisons

* Fix test_numpy_op.py::test_np_random_rayleigh

* Fix test_operator_gpu.py::test_batchnorm_with_type

* Fix test_gluon.py::test_sync_batchnorm in cpu selftest

* Improve unittest failure reporting

* Add to robustness of test_operator_gpu.py::test_embedding_with_type

* Check_consistency() to use equal backward gradients for increased test robustness

* Fix test_operator_gpu.py::test_{fully_connected,gemm}.  Add default_numeric_eps().

* test_utils.py fix for numeric gradient calc

* Reinstate rtol=1e-2 for test_operator.py::test_order

* Remove auto-cast of check_consistency() input data to least precise dtype (not needed)

* Fix test_operator.py::test_{reciprocal,cbrt,rcbrt}_op

* Expand default float64 numeric_eps for test_operator_gpu.py::test_softmin

* Fix segfault-on-error of @Retry decorator. Add test isolation.

* assert_almost_equal() to handle a,b scalars

* Fix test_operator_gpu.py::test_gluon_{mvn,mvn_v1} race

* Fix test_operator_gpu.py::test_flatten_slice_after_conv via scale

* Remove test_utils.py:almost_equal_ignore_nan()

* Fix sample vs. pop variance issue with test_numpy_op.py::test_npx_batch_norm

* Expose test_utils.py:effective_dtype() and use to fix test_operator_gpu.py::test_np_linalg_svd

* Fix true_divide int_array / int_scalar -> float_array to honor np_default_dtype

* Try test_elemwise_binary_ops serial to avoid pytest worker crash

* Fix (log_)softmax backward on empty ndarray

* Temporarily log all CI seeds to troubleshoot seed non-determinism

* Revert "Temporarily log all CI seeds to troubleshoot seed non-determinism"

This reverts commit f60eff2.

* Temp log all CI seeds to troubleshoot unwanted seed determinism

* Revert "Add sm arch 80 to Makefile"

This reverts commit f9306ce.

* Same fix of sample vs. pop variance issue, now with test_operator_gpu.py::test_batchnorm

* Revert "Temp log all CI seeds to troubleshoot unwanted seed determinism"

This reverts commit ff328ef.

* Marking test_sparse_dot_grad with garbage_expected after teardown error

* Fix flakiness of test_gluon_probability{_v1,_v2}.py::test_gluon_kl{_v1,}

* Temp skip of test_aggregate_duplication on gpu

* Add seeding to test_{numpy,}_contrib_gluon_data_vision.py.  Make created files unique.

* Add ndarray module isolation to help debug test_bbox_augmenters worker crash

* Marking test_sparse_square_sum serial after pytest worker crash

* Fix flakiness of test_gluon_probability{_v1,_v2}.py::test_half_cauchy{_v1,}

Co-authored-by: Serge Panev <spanev@nvidia.com>
Co-authored-by: Bart Gawrych <gawrych.bartlomiej@intel.com>
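Taken together, the tolerance-related commits above converge on a single comparison path: assert_almost_equal() and check_consistency() both derive default rtol/atol from the effective dtype of the data and the context it ran on, rather than from hard-coded per-test values. A hedged usage sketch follows; the call patterns match conventions in MXNet's test files, but the exact defaults come from the backported test_utils.py and may differ:

    import numpy as np
    import mxnet as mx
    from mxnet.test_utils import assert_almost_equal, check_consistency

    a = mx.nd.random.uniform(shape=(3, 4), dtype='float32')
    b = a + 1e-7

    # With no explicit rtol/atol, the defaults are derived from the dtypes of the
    # inputs (and, on GPU, from whether float32 math may have run as TF32).
    assert_almost_equal(a.asnumpy(), b.asnumpy())

    # check_consistency() compares one symbol across contexts/dtypes using the
    # same tolerance machinery; explicit rtol/atol still override the defaults.
    sym = mx.sym.FullyConnected(num_hidden=8, name='fc')
    ctx_list = [{'ctx': mx.cpu(), 'fc_data': (2, 16), 'type_dict': {'fc_data': np.float32}},
                {'ctx': mx.cpu(), 'fc_data': (2, 16), 'type_dict': {'fc_data': np.float64}}]
    check_consistency(sym, ctx_list)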
@mxnet-bot

Hey @DickJC123 , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [website, unix-gpu, sanity, centos-gpu, windows-gpu, edge, miscellaneous, windows-cpu, clang, unix-cpu, centos-cpu]
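For example, to rerun only the Windows GPU pipeline after a flaky failure, the author would comment:

    @mxnet-bot run ci [windows-gpu]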


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@samskalicky (Contributor)

[2020-09-15T04:31:26.709Z] [ 99%] Linking CXX shared library mxnet_52.dll
[2020-09-15T04:37:33.250Z] LINK: command "C:\PROGRA~2\MICROS~1.0\VC\bin\X86_AM~1\link.exe /nologo @CMakeFiles\mxnet_52.dir\objects1.rsp /out:mxnet_52.dll /implib:mxnet_52.lib /pdb:C:\jenkins_slave\workspace\build-gpu\build\mxnet_52.pdb /dll /version:0.0 /machine:x64 /INCREMENTAL:NO /OPT:REF /OPT:ICF -LIBPATH:C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\lib\x64 3rdparty\mkldnn\src\dnnl.lib C:\Program Files\OpenBLAS-windows-v0_2_19\lib\libopenblas.dll.a C:\Program Files\opencv\x64\vc14\lib\opencv_world412.lib C:\Program Files\opencv\x64\vc14\lib\opencv_world412.lib C:\Program Files\opencv\x64\vc14\lib\opencv_world412.lib C:\Program Files\opencv\x64\vc14\lib\opencv_world412.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cudnn.lib cuda.lib 3rdparty\dmlc-core\dmlc.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cudart.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cufft.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cublas.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cusolver.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cusparse.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\curand.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\nvrtc.lib C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cuda.lib cudadevrt.lib cudart_static.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTFILE:mxnet_52.dll.manifest" failed (exit code 1102) with the following output:
[2020-09-15T04:37:33.250Z]    Creating library mxnet_52.lib and object mxnet_52.exp
[2020-09-15T04:37:33.250Z] LINK : fatal error LNK1102: out of memory

Do we need to enable compression?

@szha merged commit ce0a518 into apache:v1.x on Sep 17, 2020