
Unittest tolerance handling improvements #18694

Merged
Merged 61 commits into apache:master on Jul 19, 2020

Conversation

@DickJC123 (Contributor) commented Jul 11, 2020

Description

This PR consolidates and standardizes the way our data-comparison routines in test_utils.py handle tolerances. Over time, the number of these routines has grown and their approaches have diverged. For example, two of the most frequently used routines are:

def assert_almost_equal(a, b, rtol=None, atol=None, …):
    rtol = get_rtol(rtol)
    atol = get_atol(atol)
    …

and

def check_consistency(sym, …, tol=None, …):
    if tol is None:
        tol = {np.dtype(np.float16): 1e-1,
               np.dtype(np.float32): 1e-3,
               np.dtype(np.float64): 1e-5,
               …}
    …

As shown, assert_almost_equal() offers separate arguments for specifying relative and absolute tolerances (by default rtol, atol = 1e-5, 1e-20), but these defaults do not vary with dtype. In contrast, check_consistency() has only a single tol argument, which it uses for both atol and rtol, but its default is dtype-dependent.

This PR unifies the two approaches by having both check_consistency() and assert_almost_equal() support rtol and atol, drawing on the same set of dtype-dependent defaults. The goal is for the testing framework to make sensible tolerance choices for our unittest data comparisons whenever possible, rather than burdening each test developer with doing so. In effect, this PR addresses the existing TODO in test_utils.py:

def get_rtol(rtol=None):
    """Get default numerical threshold for regression test."""
    # _TODO: get from env variable, different threshold might
    # be needed for different device and dtype
    return 1e-5 if rtol is None else rtol
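
A minimal sketch of the consolidated, dtype-keyed tolerance selection (the names _DEFAULT_RTOLS, _DEFAULT_ATOLS, and get_tols() below are illustrative only; the identifiers and default values actually used in test_utils.py may differ):

import numpy as np

# Hypothetical dtype-keyed defaults; the values shipped in test_utils.py may differ.
_DEFAULT_RTOLS = {np.dtype(np.float16): 1e-2,
                  np.dtype(np.float32): 1e-5,
                  np.dtype(np.float64): 1e-7}
_DEFAULT_ATOLS = {np.dtype(np.float16): 1e-4,
                  np.dtype(np.float32): 1e-7,
                  np.dtype(np.float64): 1e-9}

def get_tols(a, b, rtol=None, atol=None):
    """Return (rtol, atol); a tolerance left as None falls back to a default
    keyed off the less precise of the two input dtypes."""
    # A comparison can only be as tight as its least precise operand allows.
    loosest = max((np.dtype(a.dtype), np.dtype(b.dtype)),
                  key=lambda dt: _DEFAULT_RTOLS.get(dt, 1e-5))
    rtol = _DEFAULT_RTOLS.get(loosest, 1e-5) if rtol is None else rtol
    atol = _DEFAULT_ATOLS.get(loosest, 1e-20) if atol is None else atol
    return rtol, atol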

While I believe setting tolerances via an env var is not appropriate, test tolerances should be a function of the dtype and device (i.e. context). A first application of this concept is support for the newly announced A100 GPU with its TensorFloat-32 (TF32) mode of computation. By default, the A100 rounds the mantissa of float32 GEMM and convolution inputs to float16 precision. Therefore, although an operator's inputs and outputs might be float32, the "effective dtype" for the purposes of tolerance selection is float16. With the tolerance-handling logic consolidated, this PR can easily incorporate a context-aware effective_dtype(dat) routine, so that unittests can run 32-bit models and seamlessly apply the appropriate tolerances for each context, be it A100 or non-A100.
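
A rough sketch of how such a routine might look (gpu_sm_arch() is a hypothetical helper standing in for a compute-capability query; the PR's actual detection logic may differ):

import numpy as np

def gpu_sm_arch(ctx):
    """Hypothetical stand-in: a real implementation would query the device's
    compute capability (sm_80 and newer enable TF32 by default)."""
    return 0

def effective_dtype(dat):
    """Dtype to use for tolerance selection, given the data's dtype and context."""
    ctx = dat.context
    # With TF32 active, float32 GEMM/convolution inputs are rounded to a
    # float16-precision mantissa, so float16 tolerances are the honest choice
    # even though the operator i/o dtype says float32.
    if (np.dtype(dat.dtype) == np.float32 and ctx.device_type == 'gpu'
            and gpu_sm_arch(ctx) >= 80):
        return np.dtype(np.float16)
    return np.dtype(dat.dtype)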

This PR was developed on an in-house CI system that runs on many GPU architectures, including the A100. As part of the PR development, some unittests were modified so that all tests now pass on the A100. These tests had typically hard-coded a tolerance appropriate for a true-float32 test. The fix was to drop the fixed tolerance, allowing this PR's adaptive, context- and dtype-dependent tolerance selection to take over, as in the sketch below.
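
A hypothetical before/after of such a fix (out and expected stand for a test's actual and reference ndarrays; this is not a specific diff from this PR):

from mxnet.test_utils import assert_almost_equal

# Before: tolerance hard-coded for true-float32 arithmetic; too tight on an
# A100, where TF32 lowers the effective precision of float32 computation.
assert_almost_equal(out.asnumpy(), expected.asnumpy(), rtol=1e-5, atol=1e-6)

# After: pass the ndarrays directly and omit the tolerances, letting the
# framework pick defaults from the effective dtype and context.
assert_almost_equal(out, expected)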

Thanks to @Kh4L for supplying the GEMM-flag adaptations for TF32 that are part of this PR.
Thanks also to @drivanov for his prior efforts to adapt assert_almost_equal() to work directly on ndarrays, thereby enabling context-dependent default tolerance selection.

@ptrendx @eric-haibin-lin @marcoabreu

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • [x] The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • [x] Changes are complete (i.e. I finished coding on this PR)
  • [x] All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • [x] Code is well-documented:
  • For user-facing API changes, the API doc string has been updated.
  • For new C++ functions in header files, their functionality and arguments are documented.
  • For new examples, a README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

DickJC123 and others added 30 commits July 10, 2020 15:45
@xidulu (Contributor) left a comment

LGTM regarding the changes to test_gluon_probability

@szha (Member) commented Jul 19, 2020

@DickJC123 thanks for the change and for fixing the issues you found along the way. Remember to mark the issues this PR resolves in the description.

@szha szha merged commit 146b49e into apache:master Jul 19, 2020
@szha (Member) commented Jul 19, 2020

Thanks for the fixes, @DickJC123. They are really helpful. I did a code review and examined the analysis included in each of them as well as the specific fixes.

@DickJC123 (Contributor, Author) commented Jul 19, 2020

Thanks @szha! You probably saw that I've been struggling to get a passing CI, running into and fixing many issues unrelated to my PR along the way.

@szha szha added this to the v1.8.0 milestone Aug 23, 2020
DickJC123 added a commit to DickJC123/mxnet that referenced this pull request Sep 15, 2020
* Add sm arch 80 to Makefile

* Add TF32 to cuBLAS GEMMs

Signed-off-by: Serge Panev <spanev@nvidia.com>

* Add CUDA version guards

Signed-off-by: Serge Panev <spanev@nvidia.com>

* Remove useless TF32 for double and old CUDA version

Signed-off-by: Serge Panev <spanev@nvidia.com>

* Factorize VERSION_ADJUSTED_TF32_MATH

Signed-off-by: Serge Panev <spanev@nvidia.com>

* Add TF32 considerations to test_util.py:check_consistency()

* Bypass test_gluon_gpu.py:test_large_models if gmem >32GB

* Default tols in assert_almost_equal() now a function of dtype and ctx

* Expand types listed by default_tols()

* Fix pylint

* All with_seed() tests to waitall in teardown

* Elevate MXNET_TEST_SEED logging to WARNING

* Revert test_gluon_gpu.py:test_rnn_layer to default tols

* Fix test_gluon_model_zoo_gpu.py::test_inference and test_operator_gpy.py::test_np_linalg_{solve,tensorinv}

* test_numpy_interoperability.py to not fix seed for rest of CI

* Further fix to test_np_linalg_tensorinv

* Fix test_gluon_data.py:test_dataloader_context when run on 1-GPU system.

* Fix test_operator_gpu.py::test_embedding_with_type

* Fix test_operator_gpu.py::{test_*convolution_large_c,test_np_linalg_tensorsolve}

* Remove unneeded print() from test_numpy_interoperability.py

* Unify tol handling of check_consistency() and assert_almost_equal().  Test tweeks.

* Add tol handling of assert_almost_equal() with number args

* Add tol handling of bool comparisons

* Fix test_numpy_op.py::test_np_random_rayleigh

* Fix test_operator_gpu.py::test_batchnorm_with_type

* Fix test_gluon.py::test_sync_batchnorm in cpu selftest

* Improve unittest failure reporting

* Add to robustness of test_operator_gpu.py::test_embedding_with_type

* Check_consistency() to use equal backward gradients for increased test robustness

* Fix test_operator_gpu.py::test_{fully_connected,gemm}.  Add default_numeric_eps().

* test_utils.py fix for numeric gradient calc

* Reinstate rtol=1e-2 for test_operator.py::test_order

* Remove auto-cast of check_consistency() input data to least precise dtype (not needed)

* Fix test_operator.py::test_{reciprocol,cbrt,rcbrt}_op

* Expand default float64 numeric_eps for test_operator_gpu.py::test_sofmin

* Fix segfault-on-error of @Retry decorator. Add test isolation.

* assert_almost_equal() to handle a,b scalars

* Fix test_operator_gpu.py::test_gluon_{mvn,mvn_v1} race

* Fix test_operator_gpu.py::test_flatten_slice_after_conv via scale

* Remove test_utils.py:almost_equal_ignore_nan()

* Fix sample vs. pop variance issue with test_numpy_op.py::test_npx_batch_norm

* Expose test_utils.py:effective_dtype() and use to fix test_operator_gpu.py::test_np_linalg_svd

* Fix true_divide int_array / int_scalar -> float_array to honor np_default_dtype

* Try test_elemwise_binary_ops serial to avoid pytest worker crash

* Fix (log_)softmax backward on empty ndarray

* Temporarily log all CI seeds to troubleshoot seed non-determinism

* Revert "Temporarily log all CI seeds to troubleshoot seed non-determinism"

This reverts commit f60eff2.

* Temp log all CI seeds to troubleshoot unwanted seed determinism

* Revert "Add sm arch 80 to Makefile"

This reverts commit f9306ce.

* Same fix of sample vs. pop variance issue, now with test_operator_gpu.py::test_batchnorm

* Revert "Temp log all CI seeds to troubleshoot unwanted seed determinism"

This reverts commit ff328ef.

* Marking test_sparse_dot_grad with garbage_expected after teardown error

* Fix flakiness of test_gluon_probability{_v1,_v2}.py::test_gluon_kl{_v1,}

* Temp skip of test_aggregate_duplication on gpu

* Add seeding to test_{numpy,}_contrib_gluon_data_vision.py.  Make created files unique.

* Add ndarray module isolation to help debug test_bbox_augmenters worker crash

* Marking test_sparse_square_sum serial after pytest worker crash

* Fix flakiness of test_gluon_probability{_v1,_v2}.py::test_half_cauchy{_v1,}

Co-authored-by: Serge Panev <spanev@nvidia.com>
Co-authored-by: Bart Gawrych <gawrych.bartlomiej@intel.com>
szha pushed a commit that referenced this pull request Sep 17, 2020
…so test seeding (#18762). (#19148)

* Add sm arch 80 to Makefile

* Unittest tolerance handling improvements (#18694)

* Fix test_gluon_data.py:test_dataloader_context when run on 1-GPU system.

* Remove pytest decorators introduced in error

* Fix test_forward.py:test_consistency

* Fix test_numpy_op.py tests

* Improve test seeding in test_numpy_interoperablity.py (#18762)

* Fix test_numpy_op.py:test_np_random_{beta,chisquare}

* Reduce problem sizes with test_optimizer.py:test_multilamb

* Skip test_gluon_gpu.py:test_fused_{lstm,gpu}_layer, fix test_rnn_cells, for fp16 contexts

* Trigger CI

Co-authored-by: Serge Panev <spanev@nvidia.com>
Co-authored-by: Bart Gawrych <gawrych.bartlomiej@intel.com>