Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pick up new blis, mkl & openblas builds #74

Merged
merged 5 commits into from
Jul 28, 2021
Merged

Conversation

h-vetinari
Copy link
Member

@h-vetinari h-vetinari commented Jul 8, 2021

@conda-forge-linter
Copy link

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@h-vetinari
Copy link
Member Author

@isuruf, if you don't have any objections (and you don't merge it yourself), I'm planning to merge this in ~24h

@h-vetinari
Copy link
Member Author

I'm considering to wait for openblas 0.3.16 with this (which should hopefully happen this weekend), to avoid duplicating traffic for such a key package. Thoughts?

@h-vetinari h-vetinari changed the title pick up new blis & mkl builds pick up new blis, mkl & openblas builds Jul 12, 2021
@h-vetinari
Copy link
Member Author

@martin-frbg, thanks for the release of openblas 0.3.16!

Just checking if you would have an idea what might be happening here on OSX+x86-64:

The following tests FAILED:
	 22 - LAPACK-xlintstd_dtest_in (Failed)
	 24 - LAPACK-xeigtstd_nep_in (Failed)
	 25 - LAPACK-xeigtstd_sep_in (Failed)
	 26 - LAPACK-xeigtstd_se2_in (Failed)
	 27 - LAPACK-xeigtstd_dec_in (Failed)
	 28 - LAPACK-xeigtstd_ded_in (Failed)
	 29 - LAPACK-xeigtstd_dgg_in (Failed)
	 30 - LAPACK-xeigtstd_dgd_in (Failed)
	 37 - LAPACK-xeigtstd_dbb_in (Failed)
	 38 - LAPACK-xeigtstd_glm_in (Failed)
	 39 - LAPACK-xeigtstd_gqr_in (Failed)
	 40 - LAPACK-xeigtstd_gsv_in (Failed)
	 41 - LAPACK-xeigtstd_csd_in (Failed)
	 42 - LAPACK-xeigtstd_lse_in (Failed)
	 85 - LAPACK-xlintstds_dstest_in (Failed)

Lots of functions seem to complain about illegal parameters (one example, more in log):

Test OUTPUT:
 ** On entry to DSGESVIEpsilonNo transposeAll parameter number  1 had an illegal value
 ** On entry to DSGESVIEpsilonNo transposeAll parameter number  2 had an illegal value
 ** On entry to DSGESVIEpsilonNo transposeAll parameter number  4 had an illegal value
 ** On entry to DSGESVIEpsilonNo transposeAll parameter number  7 had an illegal value
 ** On entry to DSGESVIEpsilonNo transposeAll parameter number  9 had an illegal value

@martin-frbg
Copy link

No idea - looks like an almost total corruption of the argument lists of the affected functions, almost like a C/Fortran interface mismatch.
The only changes to LAPACK interfaces in 0.3.16 were to disable multithreading for small problem sizes. For BLAS, I introduced several fast paths for conditions that do not require allocation of a memory buffer (i.e. no transposing, no strides other than 1) but if anything went wrong there it should already have blown up the basic tests (and I do not see any reason why OSX should behave differently than other x86_64 environments, there certainly is little os-specific coding for or against OSX in the BLAS kernels). There is no OSX/x86_64 locally or in the gcc compile farm so I will have to resort to Azure CI for reproducing as well.

@h-vetinari
Copy link
Member Author

h-vetinari commented Jul 12, 2021

Thanks for taking a look!

I also don't understand what's happening, though on a long shot (if it's not the C/Fortran interface), I've seen some weird failures with some blas libraries on Skylake agents in Azure recently (or at least, the difference between failing & passing runs was the presence/absence of AVX512{CD,F,_SKX}), but AFAIK those are not being used for OSX runs.

@isuruf
Copy link
Member

isuruf commented Jul 12, 2021

@h-vetinari, @martin-frbg, we have access to a osx-x86_64 machine if you'd like access. Send me your SSH key and I'll add you.

@martin-frbg
Copy link

Sadly the problem seems to be related to my OpenBLAS PR 3250 (adding a shortcut path in interface/gemv.c that does not allocate a buffer when trans=0, incx=1,incy=1) although I do not see where in the dgemv_n kernel any access attempt could
occur under these circumstances.

@h-vetinari
Copy link
Member Author

Sadly the problem seems to be related to my OpenBLAS PR 3250 (adding a shortcut path in interface/gemv.c that does not allocate a buffer when trans=0, incx=1,incy=1) although I do not see where in the dgemv_n kernel any access attempt could
occur under these circumstances.

I'd like to try if the reversion helps here, but this is already one package too far (meta-package for BLAS variants, rather than the openblas build). It sounds like we should be running the LAPACK test suite also when building on https://github.com/conda-forge/openblas-feedstock... Looking at the build recipe, it looks like this all comes from the upstream Makefile - is there a flag to enable the LAPACK tests in utest?

@martin-frbg
Copy link

No flag in utest, you would need to run make lapack-test after the build

Comment on lines 51 to 59
if [[ "${blas_impl}" == "openblas" && "$(uname)" == MINGW* ]]; then
# I'm not sure why these fail only on Windows. Skip for now.
# Still present as of openblas 0.3.16
SKIP_TESTS="${SKIP_TESTS}|LAPACK-xeigtsts_sgd_in|LAPACK-xeigtsts_glm_in|LAPACK-xeigtsts_gsv_in|LAPACK-xeigtsts_lse_in|LAPACK-xeigtstd_dgd_in"
SKIP_TESTS="${SKIP_TESTS}|LAPACK-xeigtstd_glm_in|LAPACK-xeigtstd_gsv_in|LAPACK-xeigtstd_lse_in|LAPACK-xlintstc_ctest_in|LAPACK-xlintstrfc_ctest_rfp_in"
SKIP_TESTS="${SKIP_TESTS}|LAPACK-xeigtstc_sep_in|LAPACK-xeigtstc_se2_in|LAPACK-xeigtstc_ced_in|LAPACK-xeigtstc_cgd_in|LAPACK-xeigtstc_csb_in"
SKIP_TESTS="${SKIP_TESTS}|LAPACK-xeigtstc_csg_in|LAPACK-xeigtstc_glm_in|LAPACK-xeigtstc_gsv_in|LAPACK-xeigtstc_lse_in|LAPACK-xeigtstz_zgd_in"
SKIP_TESTS="${SKIP_TESTS}|LAPACK-xeigtstz_glm_in|LAPACK-xeigtstz_gsv_in|LAPACK-xeigtstz_lse_in|BLAS-xblat1c"
fi
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martin-frbg
Not sure you were aware of this, but there's currently a bunch of tests that fail on openblas & windows (in the peculiar way that conda-forge builds this blas-metapackage).

Hoping to get the lapack tests in conda-forge/openblas-feedstock#122 running for windows as well, then we should see how much of that is "native" (and can hopefully be fixed eventually).

Nothing for you to do right now, just FYI.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not aware of this. This is specifically with mingw gcc/gfortran and has been happening for a number of previous releases as well ? And do the failures maniffest as "failed to pass the threshold" or do the test codes crash ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had let everything run through on a debug commit that I rebased out, but it seems Azure deleted the logs already. I'll let it run again.

This is indeed with mingw + gcc (in contrast to the openblas recipe), which might already explain the differences. @isuruf knows this better than I do though, that part of the recipe here hasn't been touched in quite a while.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error log is here. From what I can see, mostly segfaults, illegal arguments and the like.

Could be something related to libomp / libgomp again?

It's worth noting that this runs the vanilla netlib test suite (for 3.9.0) but linked against openblas. Not sure if there are any relevant divergences there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errors of the form *** Illegal value of parameter number 1 not detected by can be ignored, because the way the lapack tests are set up, they need to override the implementation of xerbla_ in the shared library. In Linux and OSX, the local copy in the test override the xerbla_ implementation in the shared library, but in windows it doesn't.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@isuruf, the thing I don't understand is that - if it was only about xerbla - why is the set of failures not exactly the same between the 4 combinations of {win, osx} + {mkl, openblas}?

@h-vetinari h-vetinari removed the automerge Merge the PR when CI passes label Jul 15, 2021
@h-vetinari
Copy link
Member Author

@isuruf
Unfortunately, this is going to need some version of Reference-LAPACK/lapack#597, it seems. osx+openblas is still failing with the failure OMP Error 178: Function sem_post failed that Martin had observed in the openblas PR (but that hadn't failed the build there, for some reason, probably due to the way the makefile is set up...?)

@isuruf
Copy link
Member

isuruf commented Jul 27, 2021

That was an issue in llvm-openmp 12. We need to apply llvm/llvm-project@f1b9ce2

@h-vetinari
Copy link
Member Author

Thanks for chasing this down! I wanted to tackle this, but saw you already did: conda-forge/openmp-feedstock#51

@isuruf isuruf added the automerge Merge the PR when CI passes label Jul 27, 2021
@h-vetinari
Copy link
Member Author

h-vetinari commented Jul 27, 2021

Also saw conda-forge/conda-forge-pinning-feedstock#1727 (and that you have a rerender open in #76)... Should this be rerendered too before merging?

@h-vetinari h-vetinari removed the automerge Merge the PR when CI passes label Jul 27, 2021
@h-vetinari
Copy link
Member Author

I removed of the automerge-label now, but purely to make it possible to react at all to the rerender question above (since the CI could would have finished in the next ~15min) - I'm fine with merging either way (but would tend to rerender so that we get the right llvm-openmp runtime constraint)

@isuruf
Copy link
Member

isuruf commented Jul 28, 2021

@conda-forge-admin, rerender

@github-actions
Copy link
Contributor

Hi! This is the friendly automated conda-forge-webservice.
I tried to rerender for you, but it looks like there was nothing to do.

@isuruf isuruf merged commit 98cdd41 into conda-forge:master Jul 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants