TEST: 1.20.x + blas variants #227

Closed
wants to merge 4 commits into from

Conversation

h-vetinari
Member

Following the same scheme as #196, but for the 1.20 branch. Should not be merged due to conda/conda-build#3947.

@conda-forge-linter

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@h-vetinari h-vetinari marked this pull request as draft February 10, 2021 21:27
@h-vetinari
Member Author

h-vetinari commented Feb 10, 2021

Update with 1.20.1

From 9 failures out of 92, there are now 3 failures out of 64 (-16 cpython3.6 runs, -12 pypy36 runs)
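The run counts quoted above can be cross-checked with a quick calculation. This is only a sketch: the variable names are illustrative, and the figures are the ones stated in this comment.

```python
# Figures quoted in the comment above; the names are illustrative only.
runs_before = 92
dropped_cpython36 = 16
dropped_pypy36 = 12

# Dropping the cpython3.6 and pypy36 jobs shrinks the CI matrix.
runs_after = runs_before - dropped_cpython36 - dropped_pypy36

failures_before, failures_after = 9, 3
rate_before = failures_before / runs_before
rate_after = failures_after / runs_after

print(runs_after)                                 # 64
print(f"{rate_before:.1%} -> {rate_after:.1%}")   # 9.8% -> 4.7%
```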

The good news:

  • win + blis passes 🥳 🚀

The bad news:

  • pypy still segfaulting on ppc + openblas

Details

lib before after
numpy 1.19.5 1.20.1
libblas 3.9.0-7 3.9.0-8
blis 0.8.0-1 0.8.0-1
openblas 0.3.12-pthreads-1 0.3.12-pthreads-1
mkl 2020.4-304 2020.4-304
netlib 3.9.0-3 3.9.0-3

variant | before | after
--- | --- | ---
linux + ppc64le + pypy | segfault | segfault only on openblas
win + blis | first passed test-suite with py38 only, for reasons unclear; 11-13 failures otherwise | passes 🥳
win | 3 failures due to "The process tried to write to a nonexistent pipe." across blis/mkl/openblas | same error returned twice on mkl builds

variant blis mkl netlib openblas sum*
linux / x86 ✔️ ✔️ ✔️ ✔️ -
linux / aarch ✔️ ✔️ -
linux / ppc64le ✔️ ✔️ (cpython) / ❌ (pypy) 1F
osx / x86 ✔️ ✔️ ✔️ ✔️ -
osx / arm ✔️ ✔️ -
win ✔️ ✔️ (py39) / ❌ (py37, py38) ✔️ ✔️ 2F
sum* - 2F - 1F 3F

* sum of Failures (out of a total of 64 CI combinations being tested)

Build logs:
Azure
Drone
Travis

ppc + openblas + pypy: SEGFAULT

Hard cut in the log:

lib/tests/test_format.py::test_bad_header PASSED                         [ 59%]
lib/tests/test_format.py::test_large_file_support PASSED                 [ 59%]

win + mkl + cpython 3.7 / 3.8: hard error
  File "D:\bld\numpy_1612991625372\_test_env\lib\site-packages\_pytest\_io\terminalwriter.py", line 155, in write
    self._file.write(msg)
OSError: [Errno 22] Invalid argument
##[error]Cmd.exe exited with code '1'.
The process tried to write to a nonexistent pipe.

@h-vetinari h-vetinari changed the title WIP: 1.20.x + blas variants TEST: 1.20.x + blas variants Feb 10, 2021
@h-vetinari h-vetinari closed this May 8, 2021
@h-vetinari h-vetinari reopened this May 8, 2021
@h-vetinari
Member Author

Update for 1.20.2 & new blas builds

From 3 failures out of 64, there are now 20 failures, where 18 are (most likely) due to an openblas bug.

The bad news:

  • one test fails under openblas for all arches / OSes / python versions, related to nan-handling. Not sure if this comes from numpy or openblas, but since numpy only had a patch-release, I'm guessing this is on the openblas-side. CC @martin-frbg
  • segfault under ppc + pypy remains
  • one blis-run regressed under windows (and in a flaky manner as well...)

Details

lib before after
numpy 1.20.1 1.20.2
libblas 3.9.0-8 3.9.0-9
blis 0.8.0-1 0.8.1-0
openblas 0.3.12-pthreads-1 0.3.15-pthreads-0
mkl 2020.4-304 2021.2-389
netlib 3.9.0-3 3.9.0-5

variant | before | after
--- | --- | ---
linux + ppc64le + openblas + pypy | segfault | segfault remains 😒
win + blis | passed | 11 failures for py38-only for first run, 14 failures for py37-only on rerun
win + mkl | 2 failures due to "The process tried to write to a nonexistent pipe." | happened once again out of 6 runs (incl. restart)

variant blis mkl netlib openblas sum*
linux / x86 ✔️ ✔️ ✔️ 4F
linux / aarch ✔️ 4F
linux / ppc64le ✔️ 4F
osx / arm ✔️ ❌** 2F
osx / x86 ✔️ ✔️ ✔️ 4F
win / x86 ✔️ (py38, py39) / ❌ (py37) ✔️ ✔️ 4F
sum* 1F - - 19F 20F

* sum of Failures (out of a total of 64 CI combinations being tested)
** tests not run for osx-arm, but only reasonable assumption is that these would fail as well

Build logs:
Azure (& previous run)
Drone
Travis (& previous run)

linux (all arches) / osx / win + openblas: 1 failure numpy.linalg.tests.test_linalg.TestCond.test_nan
=================================== FAILURES ===================================
______________________________ TestCond.test_nan _______________________________

self = <numpy.linalg.tests.test_linalg.TestCond object at 0x7f3853dbc7d0>

    def test_nan(self):
        # nans should be passed through, not converted to infs
        ps = [None, 1, -1, 2, -2, 'fro']
        p_pos = [None, 1, 2, 'fro']
    
        A = np.ones((2, 2))
        A[0,1] = np.nan
        for p in ps:
>           c = linalg.cond(A, p)

[...]/numpy/linalg/tests/test_linalg.py:777: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
<__array_function__ internals>:6: in cond
    ???
[...]/numpy/linalg/linalg.py:1765: in cond
    s = svd(x, compute_uv=False)
<__array_function__ internals>:6: in svd
    ???
[...]/numpy/linalg/linalg.py:1672: in svd
    s = gufunc(a, signature=signature, extobj=extobj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

err = 'invalid value', flag = 8

    def _raise_linalgerror_svd_nonconvergence(err, flag):
>       raise LinAlgError("SVD did not converge")
E       numpy.linalg.LinAlgError: SVD did not converge

linux + ppc + pypy: SEGFAULT
lib/tests/test_format.py::test_bad_header PASSED                         [ 60%]
lib/tests/test_format.py::test_large_file_support PASSED                 [ 60%]
/home/conda/feedstock_root/build_artifacts/numpy_1620502851667/test_tmp/run_test.sh: line 9:  2294 Killed                  pytest --verbose --pyargs numpy -k "not (_not_a_real_test or test_einsum_sums_cfloat64 or test_loss_of_precision or test_large_zip or test_may_share_memory_easy_fuzz or test_may_share_memory_harder_fuzz or test_unary_ufunc_call_fuzz or test_count_nonzero_all or test_diophantine_fuzz or test_generalized_sq_cases or test_may_share_memory_harder_fuzz or test_large_zip)" --durations=0
Tests failed for numpy-1.20.2-py37h9e3b4ae_0.tar.bz2 - moving package to /home/conda/feedstock_root/build_artifacts/broken

win + blis + cpython 3.7: 14 failures
=========================== short test summary info ===========================
FAILED core/tests/test_multiarray.py::TestMatmul::test_dot_equivalent[args4]
FAILED core/tests/test_multiarray.py::TestMatmul::test_matmul_object - Assert...
FAILED linalg/tests/test_linalg.py::TestSolve::test_sq_cases - AssertionError...
FAILED linalg/tests/test_linalg.py::TestSolve::test_generalized_sq_cases - As...
FAILED linalg/tests/test_linalg.py::TestInv::test_sq_cases - AssertionError: ...
FAILED linalg/tests/test_linalg.py::TestInv::test_generalized_sq_cases - Asse...
FAILED linalg/tests/test_linalg.py::TestPinv::test_nonsq_cases - AssertionErr...
FAILED linalg/tests/test_linalg.py::TestPinv::test_generalized_sq_cases - Ass...
FAILED linalg/tests/test_linalg.py::TestPinv::test_generalized_nonsq_cases - ...
FAILED linalg/tests/test_linalg.py::TestDet::test_sq_cases - AssertionError: ...
FAILED linalg/tests/test_linalg.py::TestDet::test_generalized_sq_cases - Asse...
FAILED linalg/tests/test_linalg.py::TestMatrixPower::test_power_is_minus_one[dt13]
FAILED linalg/tests/test_linalg.py::TestCholesky::test_basic_property - Asser...
FAILED linalg/tests/test_regression.py::TestRegression::test_lstsq_complex_larger_rhs
= 14 failed, 13540 passed, 716 skipped, 20 xfailed, 1 xpassed, 229 warnings in 409.45s (0:06:49) =

win + blis + cpython 3.8 (first run): 11 failures
=========================== short test summary info ===========================
FAILED core/tests/test_multiarray.py::TestMatmul::test_dot_equivalent[args4]
FAILED linalg/tests/test_linalg.py::TestSolve::test_sq_cases - AssertionError...
FAILED linalg/tests/test_linalg.py::TestSolve::test_generalized_sq_cases - As...
FAILED linalg/tests/test_linalg.py::TestInv::test_sq_cases - AssertionError: ...
FAILED linalg/tests/test_linalg.py::TestInv::test_generalized_sq_cases - Asse...
FAILED linalg/tests/test_linalg.py::TestPinv::test_nonsq_cases - AssertionErr...
FAILED linalg/tests/test_linalg.py::TestPinv::test_generalized_sq_cases - Ass...
FAILED linalg/tests/test_linalg.py::TestDet::test_sq_cases - AssertionError: ...
FAILED linalg/tests/test_linalg.py::TestDet::test_generalized_sq_cases - Asse...
FAILED linalg/tests/test_linalg.py::TestMatrixPower::test_power_is_minus_one[dt13]
FAILED linalg/tests/test_linalg.py::TestCholesky::test_basic_property - Asser...
= 11 failed, 13547 passed, 714 skipped, 20 xfailed, 1 xpassed, 227 warnings in 470.35s (0:07:50) =

@martin-frbg

svd test failure is caused by a change in NaN handling within LAPACK 3.9.1 xGESDD that was merged in 0.3.15, see OpenMathLib/OpenBLAS#3225
What do I need to know to reproduce the ppc64le segfault with pypy?
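The NaN behaviour described here can be probed with a small script modelled on the failing test_nan case. This is only a sketch: `cond_with_nan` is a hypothetical helper, and whether it returns nan or reports a LinAlgError depends on the installed numpy and BLAS/LAPACK builds.

```python
import numpy as np


def cond_with_nan(p=None):
    """Condition number of a 2x2 matrix containing one NaN.

    Hypothetical helper mirroring the setup of TestCond.test_nan.
    Returns a float on success, or a string describing the LinAlgError.
    """
    A = np.ones((2, 2))
    A[0, 1] = np.nan
    try:
        # p=None goes through the SVD, the code path shown in the traceback.
        return float(np.linalg.cond(A, p))
    except np.linalg.LinAlgError as exc:
        return f"LinAlgError: {exc}"


for p in [None, 1, -1, 2, -2, "fro"]:
    print(p, cond_with_nan(p))
```

A run that prints "LinAlgError: SVD did not converge" for p=None would match the failure logged above; a run that prints nan would match the behaviour the test expects.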

@mattip

mattip commented May 9, 2021

What do I need to know to reproduce the ppc64le segfault with pypy?

It most likely is a PyPy problem, the ppc64le version is not widely used and there may be bugs in the ppc64le JIT backend. @h-vetinari could you add a pypy --jit off variant to check that hypothesis?

@h-vetinari
Member Author

@mattip: @h-vetinari could you add a pypy --jit off variant to check that hypothesis?

Took a bit longer than I hoped, because I remembered that Isuru had already done this for a previous scipy PR (conda-forge/scipy-feedstock@da63fd6) and it needed a bit of understanding & refactoring the run_test.py machinery.

Also unskipped some tests (aside from being good hygiene to try removing old skips occasionally, I also didn't want to port them to the new format if unnecessary), so it could be that "new" failures arise.
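For illustration, a skip list can be turned into a pytest `-k` expression like the one in the logged test command earlier in this thread, where test_large_zip and test_may_share_memory_harder_fuzz each appear twice. This is a minimal sketch, not the feedstock's actual run_test.py; `build_k_expression` is a hypothetical helper.

```python
def build_k_expression(skips):
    """Return a pytest -k expression that deselects the given test names.

    Hypothetical helper; duplicates are dropped while keeping first-seen order.
    """
    unique = list(dict.fromkeys(skips))
    if not unique:
        return ""
    return "not (" + " or ".join(unique) + ")"


# Test names taken from the logged pytest invocation, including a duplicate.
skips = ["test_large_zip", "test_may_share_memory_harder_fuzz", "test_large_zip"]
expr = build_k_expression(skips)
print(expr)  # not (test_large_zip or test_may_share_memory_harder_fuzz)
```

Keeping the names in a list makes it easier to remove an old skip and see whether the test still fails.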

@h-vetinari
Member Author

@mattip @martin-frbg
Turning off the JIT resolved the segfault with linux-ppc + pypy + openblas.
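As a rough sketch of what such a variant might look like, the test command can be assembled so that `--jit off` (the PyPy option named above) is only passed for the pypy + ppc64le combination. The helper name and its parameters are assumptions for illustration; the real recipe wiring is not shown in this thread.

```python
import platform
import sys


def numpy_test_command(implementation, machine, executable="python"):
    """Build an argument list for running the numpy test suite.

    Hypothetical helper: adds PyPy's `--jit off` option only when the
    interpreter is PyPy and the machine is ppc64le.
    """
    cmd = [executable]
    if implementation == "pypy" and machine == "ppc64le":
        cmd += ["--jit", "off"]
    cmd += ["-m", "pytest", "--pyargs", "numpy"]
    return cmd


# Command for the interpreter and machine this script is running on.
print(numpy_test_command(sys.implementation.name, platform.machine(), sys.executable))
```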

@h-vetinari
Member Author

svd test failure is caused by a change in NaN handling within LAPACK 3.9.1 xGESDD that was merged in 0.3.15, see xianyi/OpenBLAS#3225

And thanks a lot for stopping by so quickly! 😊

@h-vetinari
Member Author

@martin-frbg: What do I need to know to reproduce the ppc64le segfault with pypy?

@mattip: It most likely is a PyPy problem, the ppc64le version is not widely used and there may be bugs in the ppc64le JIT backend. @h-vetinari could you add a pypy --jit off variant to check that hypothesis?

@h-vetinari: Turning off the JIT resolved the segfault with linux-ppc + pypy + openblas.

Just for reference / discussion, if the pypy jit on ppc is to blame, I don't understand why it works with ppc + pypy + netlib, but not with ppc + pypy + openblas.

@martin-frbg

I have no idea about the inner workings of the pypy jit, but maybe it is simply running out of stack space with the default ulimit on ppc?
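One way to look into the stack-space hypothesis would be to print the stack limit from inside the test environment. This is a sketch using Python's standard `resource` module (Unix only); `stack_limit` is a hypothetical helper, and the thread does not confirm that the limit was the cause.

```python
import resource


def stack_limit():
    """Return the (soft, hard) stack size limits as readable strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

    def fmt(value):
        # RLIM_INFINITY means no limit is enforced.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value // 1024} KiB"

    return fmt(soft), fmt(hard)


soft, hard = stack_limit()
print(f"stack soft limit: {soft}, hard limit: {hard}")
```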

@h-vetinari
Member Author

Update for 1.20.3

From 20 failures out of 64 (18 of which due to numpy/numpy#18914), there is now only 1 (flaky) failure.

The good news:

  • segfault under ppc + pypy is gone, even when using the jit 🥳

The bad news:

  • win+blis runs remain flaky, occasionally failing with ~10 test failures that are numerically critical (e.g. producing all-nan where non-nan results are expected, or producing [[0, 0], [0, 0]] instead of [[1, 0], [0, 1]])

Other notable things:

  • investigated failures around using pytest-xdist on windows; those failures were flaky (mostly 4-5 failures due to memory errors and broken pipes) and didn't (seem to) appear when setting OPENBLAS_NUM_THREADS=1. Because of this flakiness, I'm disregarding those extra CI jobs for now. See this CI run for some examples.
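For reference, the OPENBLAS_NUM_THREADS=1 setting mentioned above has to be in the environment before the library is loaded, so one simple approach is to set it for a child process. This is a sketch under that assumption; `run_single_threaded` is a hypothetical helper, and the child command here only echoes the variable back.

```python
import os
import subprocess
import sys


def run_single_threaded(args):
    """Run a command with OPENBLAS_NUM_THREADS=1 added to its environment."""
    env = dict(os.environ, OPENBLAS_NUM_THREADS="1")
    return subprocess.run(args, env=env, capture_output=True, text=True)


# Stand-in child process that reports the value it sees.
result = run_single_threaded(
    [sys.executable, "-c", "import os; print(os.environ['OPENBLAS_NUM_THREADS'])"]
)
print(result.stdout.strip())  # 1
```

In a real job the argument list would be the pytest invocation instead of the echo command.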

Details

lib before after
numpy 1.20.2 1.20.3
libblas 3.9.0-9 3.9.0-9
blis 0.8.1-0 0.8.1-0
openblas 0.3.15-pthreads-0 0.3.15-pthreads-1
mkl 2021.2-389 2021.2-389
netlib 3.9.0-5 3.9.0-5
pypy 7.3.4-4(?) 7.3.4-4

variant | before | after
--- | --- | ---
linux + ppc64le + openblas + pypy | segfault | passes 🥳
win + blis | 11 failures for py38-only for first run, 14 failures for py37-only on rerun | 12 failures for py37-only
win + mkl | Occasional failures due to "The process tried to write to a nonexistent pipe." | did not reoccur 🥳

variant blis mkl netlib openblas sum*
linux / x86 ✔️ ✔️ ✔️ ✔️ -
linux / aarch ✔️ ✔️ -
linux / ppc64le ✔️ ✔️ -
osx / arm ✔️ ✔️ -
osx / x86 ✔️ ✔️ ✔️ ✔️ -
win / x86 ✔️ / ❌ ✔️ ✔️ ✔️ 1F
sum* 1F - - - 1F

* sum of Failures (out of a total of 64 CI combinations being tested)

Build logs:
Azure (previously)
Drone (previously & originally)
Travis (previously & originally)

win + blis + cpython 3.7: 12 failures
=========================== short test summary info ===========================
FAILED core/tests/test_multiarray.py::TestMatmul::test_dot_equivalent[args4]
FAILED core/tests/test_multiarray.py::TestMatmul::test_matmul_object - Assert...
FAILED linalg/tests/test_linalg.py::TestSolve::test_sq_cases - AssertionError...
FAILED linalg/tests/test_linalg.py::TestSolve::test_generalized_sq_cases - As...
FAILED linalg/tests/test_linalg.py::TestInv::test_sq_cases - AssertionError: ...
FAILED linalg/tests/test_linalg.py::TestInv::test_generalized_sq_cases - Asse...
FAILED linalg/tests/test_linalg.py::TestPinv::test_generalized_sq_cases - Ass...
FAILED linalg/tests/test_linalg.py::TestPinv::test_generalized_nonsq_cases - ...
FAILED linalg/tests/test_linalg.py::TestDet::test_sq_cases - AssertionError: ...
FAILED linalg/tests/test_linalg.py::TestDet::test_generalized_sq_cases - Asse...
FAILED linalg/tests/test_linalg.py::TestMatrixPower::test_power_is_minus_one[dt13]
FAILED linalg/tests/test_linalg.py::TestCholesky::test_basic_property - Asser...
= 12 failed, 13576 passed, 716 skipped, 1 deselected, 20 xfailed, 1 xpassed, 229 warnings in 321.30s (0:05:21) =

@rgommers
Contributor

That's looking pretty good!

@h-vetinari
Member Author

h-vetinari commented Aug 7, 2021

Time to close this one I think... Work continues in #237.

@h-vetinari h-vetinari closed this Aug 7, 2021
@h-vetinari h-vetinari deleted the 1.20_blas_vars branch August 7, 2021 18:45
@h-vetinari h-vetinari restored the 1.20_blas_vars branch April 6, 2022 21:48
@h-vetinari h-vetinari deleted the 1.20_blas_vars branch April 6, 2022 21:56
@h-vetinari h-vetinari restored the 1.20_blas_vars branch April 12, 2022 22:24
@h-vetinari h-vetinari reopened this Apr 12, 2022
@h-vetinari h-vetinari changed the base branch from master to numpy120 April 12, 2022 22:24
@h-vetinari
Member Author

Revival (new PyPy builds and BLAS updates): all green except PPC

Due to rebuilding 1.20 for pypy3.8/3.9, not to mention several relevant BLAS (& infrastructure) changes, I thought I'd revive this PR for one last update.

From 1 failure out of 64 runs, we're now at 12 failures (PPC-only) out of 108 runs.

Notable

  • Added accelerate BLAS flavour on osx
  • Testing against PyPy 3.8 and 3.9 added everywhere but for osx-arm
  • Big bumps for openblas, blis & MKL
  • In the meantime, the previous blis errors have been tracked down and fixed

Details

variant | before | after
--- | --- | ---
win + blis | 12 failures | fixed 🥳
linux + ppc | ... | test failures due to emulation problems (on azure)
win + pypy | ... | two spurious failures in test_closing_fid, resolved by restart
osx + pypy | ... | one spurious failure in test_may_share_memory_easy_fuzz, resolved by restart

lib | before | after | updated version | updated build
--- | --- | --- | --- | ---
numpy | 1.20.3 | 1.20.3 | |
libblas | 3.9.0-9 | 3.9.0-14 | | X
blis | 0.8.1-0 | 0.9.0-0 | X |
openblas | 0.3.15-pthreads-1 | 0.3.20-pthreads-0 | X |
mkl | 2021.2-389 | 2022.0.1-803 | X |
netlib | 3.9.0-5 | 3.9.0-5 | |
pypy | 7.3.4-4 | 7.3.9-1 (pypy38/39), 7.3.7-3 (pypy37) | X |
qemu-user-static | ? | 6.1.0-8 | |

variant accelerate blis mkl netlib openblas sum*
linux / x86 ✔️ ✔️ ✔️ ✔️ -
linux / aarch ✔️ ✔️ -
linux / ppc64le ✖️ ✖️ 12F
osx / arm ✔️ ✔️ ✔️ -
osx / x86 ✔️ ✔️ ✔️ ✔️ ✔️ -
win / x86 ✔️ ✔️ ✔️ ✔️ -
sum* - - - 6F 6F 12F

* sum of Failures (out of a total of 108 CI combinations being tested)

Build logs:
Azure

6 participants