Netlib BLAS test xblat3d using BLIS on Intel Broadwell incorrectly signals IEEE_UNDERFLOW_FLAG IEEE_DENORMAL #486

Closed
akesandgren opened this issue Mar 12, 2021 · 23 comments · Fixed by #544 or easybuilders/easybuild-easyconfigs#14018

@akesandgren

akesandgren commented Mar 12, 2021

Building netlib blas test and linking with BLIS signals an incorrect IEEE_UNDERFLOW_FLAG IEEE_DENORMAL warning when running xblat3d on Intel Broadwell.

Flags for building BLAS test routines:
-O0 -frecursive -std=legacy -mieee-fp -fno-trapping-math -fno-math-errno -march=native

BLIS built by EasyBuild in gobff/2020b toolchain.

./xblat3d < dblat3.in
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL

This does not happen with refblas or OpenBLAS/0.3.12.

It does not happen when running on AMD EPYC 7302P, nor on Skylake.

@fgvanzee
Member

Sorry to hear about these errors, @akesandgren.

> Building netlib blas test and linking with BLIS signals an incorrect IEEE_UNDERFLOW_FLAG IEEE_DENORMAL warning when running xblat3d on Intel Broadwell.

Can you clarify what you mean by "netlib blas test"? Are you referring to test drivers in netlib LAPACK?

> It does not happen when running on AMD EPYC 7302P, nor on Skylake.

Skylake or SkylakeX?

It may also be helpful if you can tell us which operation is triggering the error.

@akesandgren
Author

netlib blas test == BLAS/TESTING when downloading LAPACK from netlib.
And it's SkylakeX

I haven't had time to dig into the details before, but a quick one-routine-at-a-time test reveals that it is DGEMM.

@devinamatthews
Member

I'm not able to reproduce this on Xeon E5-2680 v4 with the dblat3 driver included with BLIS and feenableexcept(FE_UNDERFLOW) added. I suppose I can try the Netlib Fortran driver.
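
For checking which call raises a flag without enabling traps, the sticky status flags can be cleared before the call and read back afterwards. The following is a minimal sketch, assuming x86-64 with SSE arithmetic; it reads the MXCSR register through the `<xmmintrin.h>` helpers because the denormal-operand flag has no standard `<fenv.h>` macro. `suspect_kernel`, `benign_kernel`, `sse_flags_raised_by` and `report` are hypothetical names for illustration and are not part of BLIS or the Netlib driver.

```c
#include <stdio.h>
#include <xmmintrin.h>  /* _MM_GET_EXCEPTION_STATE, _MM_SET_EXCEPTION_STATE, _MM_EXCEPT_* */

/* Hypothetical stand-in for a suspect routine: the product underflows to
   zero and the caller never uses the result. */
void suspect_kernel(void)
{
    volatile double tiny = 1e-200;          /* volatile blocks constant folding */
    volatile double discarded = tiny * tiny;
    (void)discarded;
}

/* Hypothetical stand-in for a well-behaved routine: exact arithmetic only. */
void benign_kernel(void)
{
    volatile double a = 1.0, b = 2.0;
    volatile double sum = a + b;
    (void)sum;
}

/* Clear the sticky SSE exception flags, run fn, and return the flags it raised. */
unsigned int sse_flags_raised_by(void (*fn)(void))
{
    _MM_SET_EXCEPTION_STATE(0);
    fn();
    return _MM_GET_EXCEPTION_STATE();
}

/* Print the two flags that gfortran reported in this issue. */
void report(const char *name, unsigned int flags)
{
    printf("%s: underflow=%d denormal=%d\n", name,
           (flags & _MM_EXCEPT_UNDERFLOW) != 0,
           (flags & _MM_EXCEPT_DENORM) != 0);
}
```

Called from a test driver, `report("suspect", sse_flags_raised_by(suspect_kernel))` would be expected to show the underflow flag set, while `benign_kernel` should leave both flags clear. In practice the function pointer would wrap a single BLAS call such as DGEMM.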

@devinamatthews
Member

Can't reproduce with the Netlib driver with -ffpe-trap=underflow either (and no FP warnings printed).

@devinamatthews
Member

Can you provide more details about your processor, environment (OS, compiler), BLIS configuration, and flags used for the Netlib driver?

@devinamatthews
Member

Duh, dblat3d. Let me try that one.

@devinamatthews
Member

OK, AFAICT dblat3d is the input file for dblat3 so I was in the right place. We're back to needing more details as above.

@akesandgren
Author

Ah, forgot about this one. Will try to provide more info soon.

@akesandgren
Author

akesandgren commented Sep 15, 2021

Hardware: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
OS: irrelevant
Compiler: GCC 10.2.0
BLIS: 0.8.0

git clone https://github.com/akesandgren/lapack.git
cd lapack
git checkout blis-test
cp make.inc.blis-test make.inc

This sets the correct flags for the testing routines and makes them link with -lblis.
Make sure the directory containing libblis is in LD_LIBRARY_PATH (or update make.inc to add the appropriate -L flag).

cd BLAS/TESTING
make run

This will spew out:

./xblat3d < dblat3.in
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL

If you do not see this then we need to dive into details on how BLIS is built.

@devinamatthews
Member

I was finally able to reproduce this and tracked it down to a bug that was fixed in 0.8.1. Easy fix!

@akesandgren
Author

OK, can you give me the commit or describe the fix? I'm curious by nature, and about BLAS-related stuff in particular...

@devinamatthews
Member

It's the 0.8.1 tag, or you can check out the current HEAD. The fix has to do with not reading one element past the end of the rows/columns of small matrices (in a way which cannot result in a segfault, but can introduce a NaN or Inf value into a horizontal addition). The particular commit for the fix is b43dae9.
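
The mechanism described above can be illustrated in scalar C. This is only a sketch under assumptions, not the BLIS kernel or the fix in b43dae9: the row has 3 valid elements stored in a 4-element buffer, the junk value in the last slot is assumed to be a subnormal (the real junk could be anything, including NaN or Inf), and all function names are hypothetical. It assumes x86-64 with SSE arithmetic.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _MM_GET_EXCEPTION_STATE, _MM_SET_EXCEPTION_STATE, _MM_EXCEPT_* */

/* Row of 3 valid values followed by one junk slot. The junk is assumed to be
   the smallest subnormal double; volatile keeps the loads at run time. */
static volatile double row_buf[4] = { 1.0, 2.0, 3.0, 0x1p-1074 };
static volatile double sink;  /* keeps the sum from being optimized away */

/* Sums a fixed group of 4 values, as a vector horizontal add would, even when
   only 3 of them belong to the matrix. The read stays inside the buffer, so
   it cannot segfault, but the junk slot takes part in the arithmetic. */
double hsum4_overread(const volatile double *p)
{
    return (p[0] + p[1]) + (p[2] + p[3]);
}

/* Sums only the n valid values and never touches the tail. */
double hsum_valid(const volatile double *p, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
}

/* Clears the sticky SSE flags, sums the row one way or the other, and
   returns the flags that the summation raised. */
unsigned int flags_after_sum(int overread)
{
    _MM_SET_EXCEPTION_STATE(0);
    sink = overread ? hsum4_overread(row_buf) : hsum_valid(row_buf, 3);
    return _MM_GET_EXCEPTION_STATE();
}
```

Here `flags_after_sum(1)` is expected to report the denormal flag even though the returned sum is still 6.0, which mirrors how a correct result can still come with a spurious flag; `flags_after_sum(0)` touches only valid elements and should leave the flag clear.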

@akesandgren
Author

Ahh reading out-of-bounds... was kind of guessing at that, not the first time :-)
Was the first bug report I sent to GotoBLAS way back :-)

@akesandgren
Author

akesandgren commented Sep 16, 2021

Nope, still getting

./xblat3d < dblat3.in
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL

with BLIS 0.8.1, now built with GCC 10.3.0.
Still doing exactly the same as above.

We build BLIS with this configure line:

./configure --prefix=/hpc2n/eb/software/BLIS/0.8.1-GCCcore-10.3.0 --enable-cblas --enable-threading=openmp --enable-shared CC="$CC" auto

and with

CFLAGS='-O2 -ftree-vectorize -march=native -fno-math-errno'

@devinamatthews
Member

Are those CFLAGS for BLIS or xblat3d?

@akesandgren
Author

The

CFLAGS='-O2 -ftree-vectorize -march=native -fno-math-errno'

is what we build BLIS with.
xblat3d flags can be seen in my lapack repo, but to shortcut, they are:

FFLAGS_NOOPT = -O0 -frecursive -mieee-fp -fno-trapping-math -fno-math-errno -std=legacy -march=native

I build the BLAS/TESTING with FFLAGS_NOOPT since it is the testing code and it should not have any optimizations done.

b-an02 [~/support-hpc2n/ake/blis-testing/lapack/BLAS/TESTING]$ make xblat3d
gfortran -c -o dblat3.o dblat3.f -O0 -frecursive -mieee-fp -fno-trapping-math -fno-math-errno -std=legacy -march=native
gfortran -O0 -frecursive -mieee-fp -fno-trapping-math -fno-math-errno -std=legacy -march=native -mieee-fp -o xblat3d dblat3.o -lblis

@devinamatthews
Member

Found it. This one isn't a read-past-the-end, but a place where some vestigial code does a horizontal add on junk data and then throws it away.

@akesandgren
Author

Are you sure? I still get the same if I add that commit on top of 0.8.1. Are there more commits between 0.8.1 and that commit that I need?

@devinamatthews
Member

It's possible that was only the first location of several issues. Let me delve deeper.

@devinamatthews
Member

That's what you get for fixing bugs in 5-minute spurts between other work :).

@akesandgren
Author

Know the problem well...

@devinamatthews
Member

OK, I think I really fixed it this time.

@akesandgren
Author

Yes, with those two commits on top of 0.8.1 the problem is finally gone.

Thanks, awaiting 0.8.2 eagerly...
