LAPACK tests are failing with OpenBLAS-0.3.20 and GCC-11.3.0 #16380

Closed
maxim-masterov opened this issue Oct 7, 2022 · 26 comments

@maxim-masterov
Collaborator

Creating this issue to properly log all the progress.

How it started

It was observed that a VASP6 installation with foss/2022a led to inaccurate results. After some digging, the culprit was found: the DGGEV subroutine from LAPACK. To simplify debugging, we isolated the LAPACK tests from the official netlib distribution (3.10.1) and ran them using different combinations of compiler flags and OpenBLAS versions.

What we have

The following tests are performed on AMD EPYC ROME (zen2 architecture):

  • OpenBLAS/0.3.15-GCC-10.3.0 (taken from foss/2021a) results in ~130 failed tests:
[wimr@int1 OUTPUT]$ grep failed foss-2021a-openblas-0.3.15/* | grep -v "error exits"
foss-2021a-openblas-0.3.15/ced.out: CEV:    4 out of  1096 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/ced.out: CVX:   24 out of  5196 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/cgd.out: CGV drivers:      5 out of   1092 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/cgd.out: CGV drivers:      6 out of   1092 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zed.out: ZEV:    8 out of  1100 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zed.out: ZVX:   36 out of  5208 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zgd.out: ZGV drivers:     25 out of   1092 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zgd.out: ZGV drivers:     22 out of   1092 tests failed to pass the threshold 
  • OpenBLAS/0.3.20-GCC-11.3.0 (taken from foss/2022a) results in ~4.2k failed tests
[wimr@int1 OUTPUT]$ grep failed foss-2022a-openblas-0.3.20/* | grep -v "error exits"
foss-2022a-openblas-0.3.20/ced.out: CEV:   30 out of  1122 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/ced.out: CVX:  194 out of  5366 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgd.out: CGV drivers:    129 out of   1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgd.out: CGV drivers:    135 out of   1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgd.out: CGS drivers:    123 out of   1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgd.out: CGS drivers:    126 out of   1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgg.out: CGG:  119 out of  2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgg.out: CGG:  115 out of  2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgg.out: CGG:  117 out of  2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgg.out: CGG:  116 out of  2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgd.out: DGS drivers:    129 out of   1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgd.out: DGS drivers:    129 out of   1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgd.out: DGV drivers:    166 out of   1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgd.out: DGV drivers:    171 out of   1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgg.out: DGG:  143 out of  2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgg.out: DGG:  146 out of  2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgg.out: DGG:  163 out of  2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgg.out: DGG:  150 out of  2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgd.out: SGS drivers:    144 out of   1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgd.out: SGS drivers:    132 out of   1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgd.out: SGV drivers:    173 out of   1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgd.out: SGV drivers:    186 out of   1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgg.out: SGG:  153 out of  2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgg.out: SGG:  140 out of  2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgg.out: SGG:  147 out of  2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgg.out: SGG:  150 out of  2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/zed.out: ZEV:   50 out of  1142 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/zed.out: ZVX:  296 out of  5468 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/zgd.out: ZGV drivers:     52 out of   1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/zgd.out: ZGV drivers:     50 out of   1092 tests failed to pass the threshold
  • OpenBLAS/0.3.15-GCC-11.3.0 (new build) results in ~4.2k failed tests
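
For reference, the approximate totals quoted in the list above can be obtained by summing the per-line counts in the grep output. The following is only a minimal sketch: `count_failures` and `sample` are names made up for this illustration, and the sample text is the OpenBLAS/0.3.15-GCC-10.3.0 output shown above.

```python
import re

# Matches lines such as:
#   "CEV:    4 out of  1096 tests failed to pass the threshold"
FAILED_RE = re.compile(
    r"(\d+)\s+out of\s+(\d+)\s+tests failed to pass the threshold"
)


def count_failures(text):
    """Return (failed, total) summed over all matching lines in text."""
    failed = total = 0
    for line in text.splitlines():
        # Mirror the `grep -v "error exits"` filter used above.
        if "error exits" in line:
            continue
        match = FAILED_RE.search(line)
        if match:
            failed += int(match.group(1))
            total += int(match.group(2))
    return failed, total


sample = """\
foss-2021a-openblas-0.3.15/ced.out: CEV:    4 out of  1096 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/ced.out: CVX:   24 out of  5196 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/cgd.out: CGV drivers:      5 out of   1092 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/cgd.out: CGV drivers:      6 out of   1092 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zed.out: ZEV:    8 out of  1100 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zed.out: ZVX:   36 out of  5208 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zgd.out: ZGV drivers:     25 out of   1092 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zgd.out: ZGV drivers:     22 out of   1092 tests failed to pass the threshold
"""

failed, total = count_failures(sample)
print(f"{failed} failed out of {total} reported tests")
```

For the sample above this gives 130 failed tests, which matches the "~130" figure.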

@zao got similar results on a Ryzen 9 3900X (zen2 desktop) when building the LAPACK tests with the full foss/2022a and a buildenv that picked up FlexiBLAS as the USE_OPTIMIZED_BLAS implementation:

[easybuild@eb-rocky8 build-lapack-ob-0.3.20-benv]$ grep failed TESTING/testing_results.txt
 SGG:  163 out of  2184 tests failed to pass the threshold
 SGG:  159 out of  2184 tests failed to pass the threshold
 SGG:  162 out of  2184 tests failed to pass the threshold
 SGG:  155 out of  2184 tests failed to pass the threshold
 SGS drivers:    144 out of   1560 tests failed to pass the threshold
 SGS drivers:    159 out of   1560 tests failed to pass the threshold
 SGV drivers:    180 out of   1092 tests failed to pass the threshold
 SGV drivers:    184 out of   1092 tests failed to pass the threshold
  STFSM auxiliary routine:     1 out of  7776 tests failed to pass the threshold
 DGG:  161 out of  2184 tests failed to pass the threshold
 DGG:  150 out of  2184 tests failed to pass the threshold
 DGG:  166 out of  2184 tests failed to pass the threshold
 DGG:  151 out of  2184 tests failed to pass the threshold
 DGS drivers:    135 out of   1560 tests failed to pass the threshold
 DGS drivers:    156 out of   1560 tests failed to pass the threshold
 DGV drivers:    174 out of   1092 tests failed to pass the threshold
 DGV drivers:    172 out of   1092 tests failed to pass the threshold
 CEV:   30 out of  1122 tests failed to pass the threshold
 CVX:  194 out of  5366 tests failed to pass the threshold
 CGG:  122 out of  2184 tests failed to pass the threshold
 CGG:  118 out of  2184 tests failed to pass the threshold
 CGG:  129 out of  2184 tests failed to pass the threshold
 CGG:  121 out of  2184 tests failed to pass the threshold
 CGV drivers:    135 out of   1092 tests failed to pass the threshold
 CGV drivers:    121 out of   1092 tests failed to pass the threshold
 CGS drivers:    126 out of   1560 tests failed to pass the threshold
 CGS drivers:    135 out of   1560 tests failed to pass the threshold
 ZHS:    1 out of  1764 tests failed to pass the threshold
 ZHS:    1 out of  1764 tests failed to pass the threshold
 ZHS:    1 out of  1764 tests failed to pass the threshold
 ZHS:    1 out of  1764 tests failed to pass the threshold
 ZEV:   50 out of  1142 tests failed to pass the threshold
 ZVX:  296 out of  5468 tests failed to pass the threshold
 ZGV drivers:     54 out of   1092 tests failed to pass the threshold
 ZGV drivers:     39 out of   1092 tests failed to pass the threshold

The main question: are the failing tests caused by FlexiBLAS or by the optimization flags?

Update 1

From @zao :
Stripping -ftree-vectorize from the build flags that buildenv sets (leaving -O2 -march=native) makes it behave, so it's probably the better vectorizer in GCC11 lifting up some latent problem in OpenBLAS. It wouldn't be the first time...

STFSM auxiliary routine:     1 out of  7776 tests failed to pass the threshold
 CEV:    4 out of  1096 tests failed to pass the threshold
 CVX:   24 out of  5196 tests failed to pass the threshold
 CGV drivers:      5 out of   1092 tests failed to pass the threshold
 CGV drivers:      5 out of   1092 tests failed to pass the threshold
 ZEV:    8 out of  1100 tests failed to pass the threshold
 ZVX:   36 out of  5208 tests failed to pass the threshold
 ZGV drivers:     26 out of   1092 tests failed to pass the threshold
 ZGV drivers:     26 out of   1092 tests failed to pass the threshold

Update 2

From @zao
I've set up a fresh environment on a Haswell machine, got the same grade of broken outcome as on our zen2 so not µarch-dependent. Steps:

Make and install a buildenv-default-GCC-11.3.0.eb
$ ml GCC/11.3.0 OpenBLAS/0.3.20 CMake/3.23.1
$ ml buildenv  # defines all the various flags variables to "-O2 -ftree-vectorize -march=native"
$ tar xf v3.10.1.tar.gz  # extract LAPACK sources
$ cmake -B build-tests lapack-3.10.1/ -DUSE_OPTIMIZED_BLAS=ON -DBUILD_TESTING=ON -DBLAS_LIBRARIES=$EBROOTOPENBLAS/lib/libopenblas.so
$ cmake --build build-tests -j 4 && cmake --build build-tests -t test
$ (cd lapack-3.10.1; ./lapack_testing.py; grep failed TESTING/testing_results.txt)

Update 3

From @zao

Ran some exhaustive tests on zen2 from GCC 9.5.0 through GCC 12.2.0 with OpenBLAS 0.3.20. It's not looking great.
I'll try to provide data later, but it seems that starting with GCC 12 we get elevated test error rates even without -ftree-vectorize, while builds with the flag have fewer categories of test errors than GCC 11 does.
Interestingly enough, even on the 9.5 and 10 series there are slightly different error counts with and without the flag. I don't know enough about this test suite to tell whether any errors at all are a problem.

Update 4

I got the following number of numerical errors using lapack_testing.py -p x -t eig from the LAPACK distribution:

Build with GCC-11.3:

  • -O2 -march=znver2 -funroll-all-loops -fno-math-errno -ftree-vectorize : 4090
  • -O2 -march=znver2 -funroll-all-loops -fno-math-errno : 136
  • -O2 -march=znver2 -fno-math-errno : 136
  • -O2 -fno-math-errno : 7

Build with GCC-10.3:

  • -O2 -march=znver2 -funroll-all-loops -fno-math-errno -ftree-vectorize : 136

Every OpenBLAS version was built manually using the GCC/11.3.0 or GCC/10.3.0 module (no FlexiBLAS involved).
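
To make the GCC-11.3 numbers above easier to read, here is a small sketch that orders the four flag sets from fewest to most flags and prints how many extra numerical errors appear when each flag is added. The names `results` and `flag_deltas` are hypothetical, and the differences are only a rough attribution: the flags may interact, so the increments are not necessarily independent.

```python
# Numerical error counts reported above for the GCC-11.3 builds,
# ordered from the smallest flag set to the largest.
results = [
    ("-O2 -fno-math-errno", 7),
    ("-O2 -march=znver2 -fno-math-errno", 136),
    ("-O2 -march=znver2 -funroll-all-loops -fno-math-errno", 136),
    ("-O2 -march=znver2 -funroll-all-loops -fno-math-errno -ftree-vectorize", 4090),
]


def flag_deltas(results):
    """Return [(added_flags, extra_errors)] between consecutive flag sets."""
    deltas = []
    for (prev_flags, prev_errors), (flags, errors) in zip(results, results[1:]):
        added = sorted(set(flags.split()) - set(prev_flags.split()))
        deltas.append((" ".join(added), errors - prev_errors))
    return deltas


for added, extra in flag_deltas(results):
    print(f"{added:20s} {extra:+d}")
```

With these numbers, -march=znver2 accounts for 129 extra errors, -funroll-all-loops for none, and -ftree-vectorize for 3954.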

Way to reproduce:

$ wget https://github.com/Reference-LAPACK/lapack/archive/refs/tags/v3.10.1.tar.gz
$ tar -xf v3.10.1.tar.gz
$ cd lapack-3.10.1
$ cp make.inc.example make.inc
$
$ # Modify make.inc by removing paths to BLASLIB, CBLASLIB, TMGLIB and LAPACKELIB
$ # Change LAPACKLIB to, e.g. $(EBROOTOPENBLAS)/lib/libopenblas.so
$
$ cd TESTING
$ make 
$ cd .. 
$ ./lapack_testing.py -p x -t eig
@boegel
Member

boegel commented Oct 8, 2022

@akesandgren Any thoughts on this?

@boegel
Member

boegel commented Oct 8, 2022

@maxim-masterov Did you check whether building OpenBLAS/0.3.20-GCC-11.3.0 without -ftree-vectorize fixes the problems you are seeing with VASP?

@boegel boegel added this to the next release (4.6.2?) milestone Oct 8, 2022
@boegel
Member

boegel commented Oct 8, 2022

W.r.t. things looking worse with GCC 12, that's probably because the auto-vectorizer is enabled by default there, see also https://www.phoronix.com/news/GCC-12-Auto-Vec-O2 (hat tip @zao)

@martin-frbg

Maybe related to https://github.com/xianyi/OpenBLAS/pull/3745/files resp. the crashes on OSX that disabling tree-vectorize fixes

@martin-frbg

Seems the added -march=znver2 is the main source of numerical errors, at least with the current develop branch and gcc-12.1. (Not too keen on testing/patching outdated OpenBLAS releases). This would suggest that the deviations result from gcc itself choosing particular instructions/sequences when compiling plain C or Fortran code.
NB with the LAPACK testsuite it is important to read the testing_results.txt to see the magnitude of the errors reported as most thresholds are low

@martin-frbg

Update: the errors look significant (1e6 and above), but the majority already arise from compiling just the netlib-derived LAPACK part with gfortran -march=znver2

@akesandgren
Contributor

One quick note on LAPACK testing. It is imperative to compile things under TESTING and MATGEN with -O0.
Otherwise compilers are likely to introduce errors where there are none... and to compare with a -O0 compiled libblas.
If that still returns errors, then you have a compiler (or possibly LAPACK code) problem.

@martin-frbg

martin-frbg commented Oct 8, 2022

Also things like Reference-LAPACK/lapack#679, where the tests expect the exact same result as from their own non-optimized BLAS...

@maxim-masterov
Collaborator Author

@maxim-masterov Did you check whether building OpenBLAS/0.3.20-GCC-11.3.0 without -ftree-vectorize fixes the problems you are seeing with VASP?

Not with OpenBLAS/0.3.20-GCC-11.3.0. A colleague of mine built a newer version, OpenBLAS/0.3.21-GCC-11.3.0, without the -ftree-vectorize flag and used it to build VASP6/6.3.2-foss-2022a. He said that this version gave plausible results in VASP.

@martin-frbg

OpenBLAS 0.3.21 (vs 0.3.20) updated the copy of Reference-LAPACK to 3.10.1 plus fixes, which may have fixed your use case of GGEV through changes therein. I am not immediately aware of any other change that would have affected thread safety or resilience against more aggressive optimisation, but unfortunately I am not equipped to test either VASP or EPYC (although I am a computational chemist by training - now self-employed in an unrelated sector)

@maxim-masterov
Collaborator Author

maxim-masterov commented Oct 11, 2022

Some more results. To pre-empt questions like "why didn't we use the internal LAPACK tests available with OpenBLAS releases, rather than the LAPACK tests taken directly from netlib?", I downloaded OpenBLAS-0.3.20 and compiled it with GCC/11.3.0 on the AMD EPYC ROME (zen2) machine. Then I built the netlib LAPACK tests that ship with the OpenBLAS release.

To change the optimization flags I modified the Makefile.rule file in the root folder of the unpacked OpenBLAS sources. The variables that I used were: COMMON_OPT, FCOMMON_OPT, NO_AVX, NO_AVX2, and NO_AVX512. Playing around with this set of variables allowed me to change the compiler flags used to build both OpenBLAS and the LAPACK tests.

Every OpenBLAS version and every set of LAPACK tests was built from scratch in a new folder after unpacking the OpenBLAS tarball (so, no make clean).

Steps to reproduce:

$ module purge
$ module load 2022 GCC/11.3.0
$ wget https://github.com/xianyi/OpenBLAS/archive/refs/tags/v0.3.20.tar.gz
$ tar -xf v0.3.20.tar.gz
$ cd OpenBLAS-0.3.20
$ vim Makefile.rule
   # modify optimization flags
$ make -j 32
...
$ make PREFIX=${PWD}/install install
...
$ cd lapack-netlib/TESTING
$ make -j 32
$ cd .. && python3 ./lapack_testing.py -t eig -p x

Results

The lists of flags from below are taken from the log, therefore there are some repetitions, e.g. two -O0 flags in one line.
All tests were performed using the lapack_testing.py script available in the LAPACK distribution from netlib. I tested only eigensolvers, since they caused the inaccuracy problem originally discovered in VASP (as indicated in the first comment).

1

Flags: gfortran -O0 -O0 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1

Output:

			-->   LAPACK TESTING SUMMARY  <--
		Processing LAPACK Testing output found in the TESTING directory
SUMMARY             	nb test run 	numerical error   	other error  
================   	===========	=================	================  
REAL             	889893		0	(0.000%)	0	(0.000%)	
DOUBLE PRECISION	889893		0	(0.000%)	0	(0.000%)	
COMPLEX          	336649		0	(0.000%)	0	(0.000%)	
COMPLEX16         	336649		0	(0.000%)	0	(0.000%)	

--> ALL PRECISIONS	2453084		0	(0.000%)	0	(0.000%)

2

Flags: -O2 -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1

Output:

			-->   LAPACK TESTING SUMMARY  <--
		Processing LAPACK Testing output found in the TESTING directory
SUMMARY             	nb test run 	numerical error   	other error  
================   	===========	=================	================  
REAL             	889893		0	(0.000%)	0	(0.000%)	
DOUBLE PRECISION	889893		0	(0.000%)	0	(0.000%)	
COMPLEX          	336649		0	(0.000%)	0	(0.000%)	
COMPLEX16         	336649		0	(0.000%)	0	(0.000%)	

--> ALL PRECISIONS	2453084		0	(0.000%)	0	(0.000%)

3

Flags: -O2 -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2

Output:

			-->   LAPACK TESTING SUMMARY  <--
		Processing LAPACK Testing output found in the TESTING directory
SUMMARY             	nb test run 	numerical error   	other error  
================   	===========	=================	================  
REAL             	886365		3	(0.000%)	0	(0.000%)	
DOUBLE PRECISION	889893		0	(0.000%)	0	(0.000%)	
COMPLEX          	336649		0	(0.000%)	0	(0.000%)	
COMPLEX16         	336649		0	(0.000%)	0	(0.000%)	

--> ALL PRECISIONS	2449556		3	(0.000%)	0	(0.000%)

4

Flags: -O2 -funroll-all-loops -fno-math-errno -O2 -funroll-all-loops -fno-math-errno -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2

Output:

			-->   LAPACK TESTING SUMMARY  <--
		Processing LAPACK Testing output found in the TESTING directory
SUMMARY             	nb test run 	numerical error   	other error  
================   	===========	=================	================  
REAL             	886365		3	(0.000%)	0	(0.000%)	
DOUBLE PRECISION	889893		0	(0.000%)	0	(0.000%)	
COMPLEX          	336649		0	(0.000%)	0	(0.000%)	
COMPLEX16         	336649		0	(0.000%)	0	(0.000%)	

--> ALL PRECISIONS	2449556		3	(0.000%)	0	(0.000%)

5

Flags: -O2 -funroll-all-loops -fno-math-errno -ftree-vectorize -O2 -funroll-all-loops -fno-math-errno -ftree-vectorize -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2

Output:

			-->   LAPACK TESTING SUMMARY  <--
		Processing LAPACK Testing output found in the TESTING directory
SUMMARY             	nb test run 	numerical error   	other error  
================   	===========	=================	================  
REAL             	848925		1187	(0.140%)	0	(0.000%)	
DOUBLE PRECISION	875853		1153	(0.132%)	0	(0.000%)	
COMPLEX          	322609		985	(0.305%)	0	(0.000%)	
COMPLEX16         	336649		0	(0.000%)	0	(0.000%)	

--> ALL PRECISIONS	2384036		3325	(0.139%)	0	(0.000%)

6

Flags: -O2 -funroll-all-loops -fno-math-errno -march=znver2 -O2 -funroll-all-loops -fno-math-errno -march=znver2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2

Output:

			-->   LAPACK TESTING SUMMARY  <--
		Processing LAPACK Testing output found in the TESTING directory
SUMMARY             	nb test run 	numerical error   	other error  
================   	===========	=================	================  
REAL             	889893		0	(0.000%)	0	(0.000%)	
DOUBLE PRECISION	889893		0	(0.000%)	0	(0.000%)	
COMPLEX          	328369		38	(0.012%)	0	(0.000%)	
COMPLEX16         	328369		94	(0.029%)	0	(0.000%)	

--> ALL PRECISIONS	2436524		132	(0.005%)	0	(0.000%)

To me, these results show that the more aggressive implicit vectorisation we use, the more LAPACK tests fail with OpenBLAS-0.3.20 and GCC-11.3.0.
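
To compare runs like the six above without reading the tables by eye, the summary rows can be parsed mechanically. This is a minimal sketch, assuming the summary layout is exactly as printed in this comment; `parse_summary` and `summary` are names invented for the example, and the sample is the output of run 5.

```python
import re

# One summary row: name, tests run, numerical errors (pct), other errors (pct).
ROW_RE = re.compile(
    r"^(?:-->\s*)?(REAL|DOUBLE PRECISION|COMPLEX16|COMPLEX|ALL PRECISIONS)\s+"
    r"(\d+)\s+(\d+)\s+\(\s*[\d.]+%\)\s+(\d+)\s+\(\s*[\d.]+%\)"
)


def parse_summary(text):
    """Map each precision to (tests_run, numerical_errors, other_errors)."""
    rows = {}
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            name, run, num_err, other_err = match.groups()
            rows[name] = (int(run), int(num_err), int(other_err))
    return rows


summary = """\
SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    848925          1187    (0.140%)        0       (0.000%)
DOUBLE PRECISION        875853          1153    (0.132%)        0       (0.000%)
COMPLEX                 322609          985     (0.305%)        0       (0.000%)
COMPLEX16               336649          0       (0.000%)        0       (0.000%)

--> ALL PRECISIONS      2384036         3325    (0.139%)        0       (0.000%)
"""

rows = parse_summary(summary)
for name, (run, num_err, other_err) in rows.items():
    print(f"{name:18s} {num_err:6d} numerical errors out of {run}")
```

The per-precision numerical errors in this sample add up to the 3325 reported on the ALL PRECISIONS line, which is a cheap sanity check on the parsing.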

Also, I think that test #1 should indicate that there are no problems with the compiler itself, since it uses -O0 to compile both OpenBLAS and the LAPACK tests. Do I understand that correctly, @akesandgren?

@boegel
Member

boegel commented Oct 12, 2022

@maxim-masterov Can you check if you are seeing the same problems for OpenBLAS-0.3.20-GCC-11.2.0.eb too?

I plan to open a PR for the relevant OpenBLAS easyconfigs to disable the use of -ftree-vectorize where it's required, so we can include those updated easyconfigs in the upcoming EasyBuild release (v4.6.2), that's the best we can do short term I think...

Longer term, we should enhance the OpenBLAS easyblock to more carefully check the result of the tests being run (and perhaps also expand the set of tests being run).

@bartoldeman
Contributor

This is an issue with the reference LAPACK, nothing to do with OpenBLAS in principle, since you can get the same errors with reference LAPACK combined with BLIS.
I'm doing some digging to see what file / which files are miscompiled in LAPACK.

The -O0 is necessary for some files indeed, and we need to be careful here, since
https://github.com/xianyi/OpenBLAS/blob/eece0dfd143013ca6572a8d3750af159209eb019/Makefile#L38
doesn't filter -ftree-vectorize. But unfortunately just making sure LAPACK_NOOPT is correct doesn't fix this issue.

@martin-frbg

Crude fix could be to change line 281 in the toplevel OpenBLAS Makefile

       -@echo "override FFLAGS      = $(LAPACK_FFLAGS)" >> $(NETLIB_LAPACK_DIR)/make.inc

to add -fno-tree-vectorize after the LAPACK_FFLAGS. (Incidentally, changing the Makefiles in lapack-netlib/TESTING and its LIN and EIG subdirectories to ensure use of -O0 and no fancy flags did not appear to make a difference in my tests)
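
As an illustration only, the crude fix described above could be scripted roughly as follows. This is a sketch, not a tested patch for OpenBLAS: the function name `append_no_tree_vectorize` is made up, and it matches the line by its content rather than by line number, since line 281 may differ between OpenBLAS versions. It relies on GCC normally letting a later -fno-... option override an earlier -f... option.

```python
def append_no_tree_vectorize(makefile_text):
    """Append -fno-tree-vectorize after $(LAPACK_FFLAGS) on the FFLAGS echo line."""
    marker = '"override FFLAGS'
    old = "$(LAPACK_FFLAGS)"
    new = "$(LAPACK_FFLAGS) -fno-tree-vectorize"
    patched = []
    for line in makefile_text.splitlines(keepends=True):
        # Only touch the echo line quoted above, and only once (idempotent).
        if marker in line and old in line and "-fno-tree-vectorize" not in line:
            line = line.replace(old, new, 1)
        patched.append(line)
    return "".join(patched)


original = (
    '\t-@echo "override FFLAGS      = $(LAPACK_FFLAGS)" '
    ">> $(NETLIB_LAPACK_DIR)/make.inc\n"
)
print(append_no_tree_vectorize(original), end="")
```

To apply it for real, one would read the toplevel Makefile, pass its contents through the function, and write the result back before running make.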

@maxim-masterov
Collaborator Author

maxim-masterov commented Oct 13, 2022

@boegel here are some results from OpenBLAS-0.3.20 built with GCC/11.2.0. I used the same procedure as before #16380 (comment).

The test command: python3 ./lapack_testing.py -t eig -p x

1

Flags: -O0 -O0 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1

Output:

                        -->   LAPACK TESTING SUMMARY  <--
                Processing LAPACK Testing output found in the TESTING directory
SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    889893          0       (0.000%)        0       (0.000%)
DOUBLE PRECISION        889893          0       (0.000%)        0       (0.000%)
COMPLEX                 336649          0       (0.000%)        0       (0.000%)
COMPLEX16               336649          0       (0.000%)        0       (0.000%)

--> ALL PRECISIONS      2453084         0       (0.000%)        0       (0.000%)

2

Flags: -O2 -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1

Output:

                        -->   LAPACK TESTING SUMMARY  <--
                Processing LAPACK Testing output found in the TESTING directory
SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    889893          0       (0.000%)        0       (0.000%)
DOUBLE PRECISION        889893          0       (0.000%)        0       (0.000%)
COMPLEX                 336649          0       (0.000%)        0       (0.000%)
COMPLEX16               336649          0       (0.000%)        0       (0.000%)

--> ALL PRECISIONS      2453084         0       (0.000%)        0       (0.000%)

3

Flags: -O2 -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2

Output:

                        -->   LAPACK TESTING SUMMARY  <--
                Processing LAPACK Testing output found in the TESTING directory
SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    886365          3       (0.000%)        0       (0.000%)
DOUBLE PRECISION        889893          0       (0.000%)        0       (0.000%)
COMPLEX                 336649          0       (0.000%)        0       (0.000%)
COMPLEX16               336649          0       (0.000%)        0       (0.000%)

--> ALL PRECISIONS      2449556         3       (0.000%)        0       (0.000%)

4

Flags: -O2 -funroll-all-loops -fno-math-errno -O2 -funroll-all-loops -fno-math-errno -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2

Output:

                        -->   LAPACK TESTING SUMMARY  <--
                Processing LAPACK Testing output found in the TESTING directory
SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    886365          3       (0.000%)        0       (0.000%)
DOUBLE PRECISION        889893          0       (0.000%)        0       (0.000%)
COMPLEX                 336649          0       (0.000%)        0       (0.000%)
COMPLEX16               336649          0       (0.000%)        0       (0.000%)

--> ALL PRECISIONS      2449556         3       (0.000%)        0       (0.000%)

5

Flags: -O2 -funroll-all-loops -fno-math-errno -ftree-vectorize -O2 -funroll-all-loops -fno-math-errno -ftree-vectorize -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2

Output:

                        -->   LAPACK TESTING SUMMARY  <--
                Processing LAPACK Testing output found in the TESTING directory
SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    848925          1187    (0.140%)        0       (0.000%)
DOUBLE PRECISION        875853          1153    (0.132%)        0       (0.000%)
COMPLEX                 322609          985     (0.305%)        0       (0.000%)
COMPLEX16               336649          0       (0.000%)        0       (0.000%)

--> ALL PRECISIONS      2384036         3325    (0.139%)        0       (0.000%)

6

Flags: -O2 -funroll-all-loops -fno-math-errno -march=znver2 -O2 -funroll-all-loops -fno-math-errno -march=znver2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2

Output:

                        -->   LAPACK TESTING SUMMARY  <--
                Processing LAPACK Testing output found in the TESTING directory
SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    889893          0       (0.000%)        0       (0.000%)
DOUBLE PRECISION        889893          0       (0.000%)        0       (0.000%)
COMPLEX                 328369          38      (0.012%)        0       (0.000%)
COMPLEX16               328369          94      (0.029%)        0       (0.000%)

--> ALL PRECISIONS      2436524         132     (0.005%)        0       (0.000%)

7

Built OpenBLAS using OpenBLAS-0.3.20-GCC-11.2.0.eb and ran LAPACK tests downloaded from netlib.
Flags used to compile OpenBLAS: -O2 -ftree-vectorize -O2 -mavx2 -fno-math-errno
Flags used to compile LAPACK tests: -O2 -frecursive

Output:

			-->   LAPACK TESTING SUMMARY  <--
		Processing LAPACK Testing output found in the TESTING directory
SUMMARY             	nb test run 	numerical error   	other error  
================   	===========	=================	================  
REAL             	850647		1187	(0.140%)	4	(0.000%)	
DOUBLE PRECISION	877585		1153	(0.131%)	4	(0.000%)	
COMPLEX          	323912		985	(0.304%)	8	(0.002%)	
COMPLEX16         	336859		4	(0.001%)	8	(0.002%)	

--> ALL PRECISIONS	2389003		3329	(0.139%)	24	(0.001%)

@bartoldeman
Contributor

I found miscompilation in dhgeqz.f, specifically this loop:
https://github.com/Reference-LAPACK/lapack/blob/28f7e8309608b92aaec2e2556d4b25d758ccada9/SRC/dhgeqz.f#L828
I'm getting that down to a much smaller test case (now a ~70 line standalone Fortran code) for a GCC bug report, after confirming with a few compiler versions.

@bartoldeman
Contributor

bartoldeman commented Oct 13, 2022

subroutine dlartg( f, g, s, r )
  implicit none
  double precision :: f, g, r, s
  double precision :: d, p

  d = sqrt( f*f + g*g )
  p = 1.d0 / d
  if( abs( f ) > 1 ) then
     s = g*sign( p, f )
     r = sign( d, f )
  else
     s = g*sign( p, f )
     r = sign( d, f )
  end if
end subroutine

subroutine dhgeqz( n, h, t )
  implicit none
  integer            n
  double precision   h( n, * ), t( n, * )
  integer            jc
  double precision   c, s, temp, temp2, tempr
  temp2 = 10d0
  call dlartg( 10d0, temp2, s, tempr )
  c = 0.9d0
  s = 1.d0
  do jc = 1, n
     temp = c*h( 1, jc ) + s*h( 2, jc )
     h( 2, jc ) = -s*h( 1, jc ) + c*h( 2, jc )
     h( 1, jc ) = temp
     temp2 = c*t( 1, jc ) + s*t( 2, jc )
     ! t(2,2)=-s*t(1,2)+c*t(2,2)=-0.9*0+1*0=0
     t( 2, jc ) = -s*t( 1, jc ) + c*t( 2, jc )
     t( 1, jc ) = temp2
  enddo
end subroutine dhgeqz

program test
  implicit none
  double precision h(2,2), t(2,2)  
  h = 0
  t(1,1) = 1
  t(2,1) = 0
  t(1,2) = 0
  t(2,2) = 0
  call dhgeqz( 2, h, t )
  print *,t(2,2)
end program test
$ gfortran -O2 -ftree-vectorize -march=core-avx2 dhgeqz2.f90; ./a.out 
  -1.0000000000000000     
$ gfortran -Wall -O2 dhgeqz2.f90; ./a.out 
   0.0000000000000000

This is with GCC 11.3; GCC 9.3 doesn't fail.

I will check a few more compiler versions...
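
As a cross-check of the expected value, the rotation loop from the test case can be replayed in plain Python, where no vectorizer is involved. This is a sketch of the arithmetic only: the dlartg call is left out because it does not feed into t in the reduced test case, and `rotate_rows` is a name invented here.

```python
def rotate_rows(h, t, c=0.9, s=1.0):
    """Apply the same 2-row rotation as the Fortran loop, column by column."""
    n = len(h[0])
    for jc in range(n):
        temp = c * h[0][jc] + s * h[1][jc]
        h[1][jc] = -s * h[0][jc] + c * h[1][jc]
        h[0][jc] = temp
        temp2 = c * t[0][jc] + s * t[1][jc]
        t[1][jc] = -s * t[0][jc] + c * t[1][jc]
        t[0][jc] = temp2
    return h, t


# Indexed as [row][column], zero-based: t[1][1] corresponds to t(2,2).
h = [[0.0, 0.0], [0.0, 0.0]]
t = [[1.0, 0.0], [0.0, 0.0]]
rotate_rows(h, t)
print(t[1][1])
```

This prints 0.0, in line with the build without -ftree-vectorize. Note that -1.0, the value printed by the miscompiled binary, is what the correct computation stores in t(2,1).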

@bartoldeman
Contributor

Submitted
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107254
GCC 10 & 12 don't fail either for this particular case, newest 11.3.1 20221007 prerelease fails.

@boegel
Member

boegel commented Oct 14, 2022

@bartoldeman So long story short, we should avoid using -ftree-vectorize for OpenBLAS installed with GCC 11.x, for now?

@boegel
Member

boegel commented Oct 14, 2022

I've opened a PR for the OpenBLAS easyblock that adds support for opting in to running the LAPACK test suite and for catching too many failing tests, see easybuilders/easybuild-easyblocks#2801

We should also update the most recent OpenBLAS easyconfigs to i) disable the use of -ftree-vectorize, ii) opt-in to running the LAPACK tests using run_lapack_tests = True + setting a sufficiently low max. number of failing tests due to numerical errors (150 should be OK for now it seems);
edit: done in #16406

@bartoldeman
Contributor

@boegel a conservative and easy fix is to disable -ftree-vectorize for both OpenBLAS and FlexiBLAS (since FlexiBLAS also includes reference LAPACK, and that's used if you use FlexiBLAS with BLIS).

A more targeted fix is to only compile the LAPACK (Fortran) parts of OpenBLAS and FlexiBLAS with -fno-tree-vectorize (using a patch or sed or ideally a buildopt if possible). This way loops written in the core (C) parts of those still benefit from vectorization optimizations.

The GCC bug is making progress though, it's already fixed on trunk, I'll check if that patch is trivially backported. In the GCC bug it's also mentioned that -mprefer-vector-width=128 works around it, so that's another possible avenue.

@martin-frbg

@bartoldeman
Contributor

I tested this myself with reference LAPACK 3.10.1 + BLIS, with the following in LAPACK's make.inc:

FFLAGS = -O2 -frecursive -ftree-vectorize -march=znver2 -fno-math-errno
BLASLIB      = $(EBROOTBLIS)/lib/libblis.so

and backported the GCC patch (https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=9ed4a849afb5b18b462bea311e7eee454c2c9f68); it only needs .cc changed to .c in the filenames.

The number of failures is a lot lower though not quite at zero (they could come from BLIS as well, to check).

Before

SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    870201          1351    (0.155%)        0       (0.000%)
DOUBLE PRECISION        870211          1313    (0.151%)        0       (0.000%)
COMPLEX                 314120          1272    (0.405%)        0       (0.000%)
COMPLEX16               325975          444     (0.136%)        0       (0.000%)

--> ALL PRECISIONS      2380507         4380    (0.184%)        0       (0.000%)

After

SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    883149          46      (0.005%)        0       (0.000%)
DOUBLE PRECISION        883159          48      (0.005%)        0       (0.000%)
COMPLEX                 327068          271     (0.083%)        0       (0.000%)
COMPLEX16               327067          377     (0.115%)        0       (0.000%)

--> ALL PRECISIONS      2420443         742     (0.031%)        0       (0.000%)

With OpenBLAS-0.3.21, similar procedure as above, patched compiler:
FCOMMON_OPT = -frecursive -O2 -funroll-all-loops -fno-math-errno -ftree-vectorize -O2 -funroll-all-loops -fno-math-errno -ftree-vectorize -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2 -g -march=znver2

                Processing LAPACK Testing output found in the TESTING directory
SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    891615          0       (0.000%)        0       (0.000%)
DOUBLE PRECISION        891625          0       (0.000%)        0       (0.000%)
COMPLEX                 329504          272     (0.083%)        0       (0.000%)
COMPLEX16               322447          392     (0.122%)        0       (0.000%)

--> ALL PRECISIONS      2435191         664     (0.027%)        0       (0.000%)

So the only failures left are in complex. Certainly a LOT better, but I'm still going to check whether those complex failures are worrying.

@bartoldeman
Contributor

A patch for GCC 11.3.0 is here:
#16411
it'll probably apply to 12.x and 11.2 as well (not tested yet).

In the last testing output above, almost all the complex tests use CGEEV and related functions with and without computation of eigenvectors (eigenvalues are computed in both cases) and compare the eigenvalues; in the longer explanation you can see that as "result 5" or "test(5)" failing. If the eigenvalues are not numerically exactly the same, the test fails, even if they are extremely close. It'll take some time to sort those out, but this shouldn't have real-world significance.

I believe a test such as
8 = | W(e.vects.) - W(no e.vects.) | / ( |W| ulp )
used elsewhere in the LAPACK tests is more appropriate there.
There is exactly one test that uses this check that fails (for znep), but also quite small:
Matrix order= 10, type=18, seed=3919,3149,1497,2385, result 8 is 21.32
and the threshold value is 20 in nep.in.
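
The difference between the two kinds of check can be sketched as follows. This is not the LAPACK test code: `exact_match` and `ulp_scaled_diff` are hypothetical helpers, ulp is assumed to be the double-precision machine epsilon, and |W| is taken as the largest eigenvalue magnitude, which may differ from the norm the test suite actually uses.

```python
import sys

# Assumption: ulp is approximated by double-precision machine epsilon.
ULP = sys.float_info.epsilon


def exact_match(w_vec, w_novec):
    """Strict check: every eigenvalue must be bitwise-equal in value."""
    return all(a == b for a, b in zip(w_vec, w_novec))


def ulp_scaled_diff(w_vec, w_novec):
    """Relative check in the spirit of |W(vec) - W(novec)| / (|W| ulp)."""
    diff = max(abs(a - b) for a, b in zip(w_vec, w_novec))
    scale = max(max(abs(a) for a in w_vec), max(abs(b) for b in w_novec))
    if scale == 0.0:
        return 0.0 if diff == 0.0 else float("inf")
    return diff / (scale * ULP)


# Two eigenvalue sets that differ only by a rounding-level perturbation.
w_vec = [complex(1.0, 0.5), complex(2.0, -0.5)]
w_novec = [complex(1.0, 0.5), complex(2.0 + 4 * ULP, -0.5)]

threshold = 20.0  # threshold value quoted above from nep.in
ratio = ulp_scaled_diff(w_vec, w_novec)
print("exact match:", exact_match(w_vec, w_novec))
print("ulp-scaled check passes:", ratio < threshold)
```

In this example the strict comparison reports a mismatch while the scaled difference stays well below the threshold, which is the situation described above for eigenvalues that are very close but not identical.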

@bartoldeman
Contributor

Upstream issue:
Reference-LAPACK/lapack#732

@boegel
Member

boegel commented Oct 18, 2022

Thanks to the changes in #16406, we are now running the LAPACK tests for recent OpenBLAS easyconfigs, and too many failing LAPACK tests (> 150) will lead to an installation error.

Note that the enhanced OpenBLAS easyblock from easybuilders/easybuild-easyblocks#2801 (which adds support for running the LAPACK tests and checking on the results) is required, and that the patch for GCC 11.x + 12.x that was added in #16411 is also required to ensure a low number of failing LAPACK tests due to numerical errors, so both GCCcore and OpenBLAS need to be reinstalled...
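
To illustrate the kind of check described above, here is a minimal sketch that reads the ALL PRECISIONS line of the LAPACK testing summary and fails when the number of numerical errors exceeds a limit. It is not the code from the easyblock PR: `check_lapack_failures` and its `max_failing` parameter are hypothetical names, and only the limit of 150 comes from this thread.

```python
import re

ALL_RE = re.compile(r"ALL PRECISIONS\s+(\d+)\s+(\d+)\s+\(")


def check_lapack_failures(summary_text, max_failing=150):
    """Return the numerical error count, or raise if it exceeds max_failing."""
    match = ALL_RE.search(summary_text)
    if match is None:
        raise ValueError("no ALL PRECISIONS line found in LAPACK test summary")
    num_errors = int(match.group(2))
    if num_errors > max_failing:
        raise RuntimeError(
            f"too many failing LAPACK tests: {num_errors} > {max_failing}"
        )
    return num_errors


ok = "--> ALL PRECISIONS      2436524         132     (0.005%)        0       (0.000%)"
bad = "--> ALL PRECISIONS      2384036         3325    (0.139%)        0       (0.000%)"

print(check_lapack_failures(ok))
try:
    check_lapack_failures(bad)
except RuntimeError as err:
    print("installation would fail:", err)
```

The two sample lines are taken from the summaries earlier in this issue: the run with -march=znver2 only (132 errors) passes, while the run with -ftree-vectorize (3325 errors) is rejected.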
