aobasis: fallback to dgemm if libxsmm kernel unavailable for contraction #1629

Merged
merged 2 commits into from Sep 13, 2021

dev-zero
Contributor

This fixes segfaults occurring on Intel Westmere-EP.
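
The fallback named in the PR title can be illustrated with a minimal sketch in C. This is not the code from the PR (CP2K's aobasis code is Fortran and calls LIBXSMM and BLAS); every name below (`dispatch_kernel`, `kernel_2x2x2`, `gemm_fallback`, `contract`, and the `jit_available` flag) is hypothetical and only stands in for a JIT dispatcher that may return NULL and for a generic dgemm call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel type: computes C += A * B for one fixed shape,
 * all matrices stored column-major. */
typedef void (*gemm_kernel_t)(const double *a, const double *b, double *c);

/* Hypothetical specialized kernel for m = n = k = 2. */
static void kernel_2x2x2(const double *a, const double *b, double *c) {
  for (int j = 0; j < 2; ++j)
    for (int p = 0; p < 2; ++p)
      for (int i = 0; i < 2; ++i)
        c[i + 2 * j] += a[i + 2 * p] * b[p + 2 * j];
}

/* Hypothetical stand-in for a JIT dispatcher. It returns NULL when no
 * kernel is available; `jit_available` simulates that condition. */
static gemm_kernel_t dispatch_kernel(int m, int n, int k, int jit_available) {
  if (jit_available && m == 2 && n == 2 && k == 2) return kernel_2x2x2;
  return NULL;
}

/* Generic triple loop standing in for a BLAS dgemm call (C += A * B). */
static void gemm_fallback(int m, int n, int k,
                          const double *a, const double *b, double *c) {
  for (int j = 0; j < n; ++j)
    for (int p = 0; p < k; ++p)
      for (int i = 0; i < m; ++i)
        c[i + m * j] += a[i + m * p] * b[p + k * j];
}

/* Use the specialized kernel if one was dispatched, otherwise fall back.
 * Returns 1 if the specialized kernel ran, 0 if the fallback ran. */
int contract(int m, int n, int k,
             const double *a, const double *b, double *c, int jit_available) {
  assert(a != NULL && b != NULL && c != NULL);
  gemm_kernel_t kernel = dispatch_kernel(m, n, k, jit_available);
  if (kernel != NULL) {
    kernel(a, b, c);
    return 1;
  }
  gemm_fallback(m, n, k, a, b, c); /* never call through a NULL pointer */
  return 0;
}
```

The point of the pattern is only the NULL check before the call: calling through an unchecked NULL function pointer is the kind of error that shows up as a segfault.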

@hfp
Member

hfp commented Aug 24, 2021

Thank you for catching this and fixing it!

I consider this a bug. In all recent LIBXSMM versions (we decided this at some point), a JIT request never returns a NULL pointer as long as the usual preconditions are met (alpha, beta requirements, etc.). It could be a bug related to requesting an SSE kernel. I will probably try to reproduce this to ensure it does not happen in our next release.

Which version of LIBXSMM exposed this problem? I guess 1.16.1 ...

@dev-zero
Contributor Author

dev-zero commented Aug 24, 2021

@marci73 this fixes the regtest segfaults on tcopt9
@hfp do you maybe have an idea why libxsmm_dmmcall fails? This is on a Xeon X5670 (Westmere-EP).

After this change, all regtests except one pass (see below):

/data/tiziano/cp2k/regtesting/local/psmp/TEST-local-psmp-2021-08-23_16-39-55/xTB/regtest-2/HF-field.inp.out
 EWALD| Spline interpolation order                                             6


 TOTAL NUMBERS AND MAXIMUM NUMBERS

  Total number of            - Atomic kinds:                                   2
                             - Atoms:                                          2
                             - Shell sets:                                     4
                             - Shells:                                         4
                             - Primitive Cartesian functions:                 28
                             - Cartesian basis functions:                      6
                             - Spherical basis functions:                      6

  Maximum angular momentum of the orbital basis functions:                     1


 SCF PARAMETERS         Density guess:                                    ATOMIC
                        --------------------------------------------------------
                        max_scf:                                              90
                        max_scf_history:                                       0
                        max_diis:                                              4
                        --------------------------------------------------------
                        eps_scf:                                        1.00E-08
                        eps_scf_history:                                0.00E+00
                        eps_diis:                                       1.00E-01
                        eps_eigval:                                     1.00E-05
                        --------------------------------------------------------
                        level_shift [a.u.]:                                 0.00
                        --------------------------------------------------------
                        No outer SCF

 Number of electrons:                                                          8
 Number of occupied orbitals:                                                  4
 Number of molecular orbitals:                                                 4

 Number of orbital functions:                                                  6
 Number of independent orbital functions:                                      6

 Extrapolation method: initial_guess


 SCF WAVEFUNCTION OPTIMIZATION

  ----------------------------------- OT ---------------------------------------
  Minimizer      : DIIS                : direct inversion
                                         in the iterative subspace
                                         using   7 DIIS vectors
                                         safer DIIS on
  Preconditioner : FULL_S_INVERSE      : cholesky inversion of S
  Precond_solver : DEFAULT
  stepsize       :    0.15000000                  energy_gap     :    0.20000000
  eps_taylor     :   0.10000E-15                  max_taylor     :             4
  ----------------------------------- OT ---------------------------------------

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 OT DIIS     0.15E+00    0.3     0.11499159        -5.5872057609 -5.59E+00
     2 OT DIIS     0.15E+00    0.2     0.04706187        -5.6208600397 -3.37E-02
     3 OT DIIS     0.15E+00    0.2     0.01196325        -5.6273877771 -6.53E-03
     4 OT DIIS     0.15E+00    0.3     0.00907899        -5.6278399241 -4.52E-04
     5 OT DIIS     0.15E+00    0.4     0.03374885        -5.6252633476  2.58E-03
     6 OT DIIS     0.15E+00    0.7     0.02793426        -5.6261717291 -9.08E-04
 Charge outside chemical range:  Kind Atom=  1     1   Limit=1.00  Charge= -1.19
 Charge outside chemical range:  Kind Atom=  2     2   Limit=1.00  Charge=  1.19

 *** WARNING in xtb_matrices.F:1152 :: Atomic charges outside chemical   ***
 *** range were detected. Switch-off CHECK_ATOMIC_CHARGES keyword in the ***
 *** &xTB section if you want to force to continue the calculation.      ***


 *******************************************************************************
 *   ___                                                                       *
 *  /   \                                                                      *
 * [ABORT]                                                                     *
 *  \___/                               xTB Charges                            *
 *    |                                                                        *
 *  O/|                                                                        *
 * /| |                                                                        *
 * / \                                                     xtb_matrices.F:1155 *
 *******************************************************************************


 ===== Routine Calling Stack =====

            8 build_xtb_ks_matrix
            7 rebuild_ks_matrix
            6 qs_ks_update_qs_env
            5 scf_env_do_scf_inner_loop
            4 scf_env_do_scf
            3 qs_energies
            2 qs_forces
            1 CP2K
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
EXIT CODE:  1  MEANING:  RUNTIME FAIL

@dev-zero
Contributor Author

dev-zero commented Aug 24, 2021

> Thank you for catching this and fixing it!
>
> I consider this a bug. In all recent LIBXSMM versions (we decided this at some point), a JIT request never returns a NULL pointer as long as the usual preconditions are met (alpha, beta requirements, etc.). It could be a bug related to requesting an SSE kernel. I will probably try to reproduce this to ensure it does not happen in our next release.

OK, so should we rather use CPASSERT once this is fixed? Or enable the CPWARN I have commented out instead of falling back silently? The problem is that the number of warnings will be huge.

For reproducing the issue: Cubic_RPA_H2O_standard.inp is one of the cases where we saw the segfaults (even with a single rank).

> Which version of LIBXSMM exposed this problem? I guess 1.16.1 ...

Yes, this is with libxsmm-1.16.1.
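
One possible middle ground between a silent fallback and a flood of warnings is to warn only on the first fallback and keep a counter for a summary at the end. The sketch below is in C for illustration only; it is not CP2K code, `note_fallback` and `fallback_total` are hypothetical names, and the plain counter assumes single-threaded use (an OpenMP build would need an atomic update).

```c
#include <stdio.h>

/* Number of times the generic fallback was taken so far.
 * Assumption: single-threaded access; not safe under OpenMP as written. */
static unsigned long fallback_count = 0;

/* Record one fallback. Prints a warning only for the first occurrence.
 * Returns 1 if a warning was printed, 0 otherwise. */
int note_fallback(FILE *out) {
  ++fallback_count;
  if (fallback_count == 1) {
    fprintf(out, "WARNING: no specialized kernel available, "
                 "falling back to generic GEMM (reported once)\n");
    return 1;
  }
  return 0;
}

/* Total number of fallbacks, e.g. for a one-line summary at shutdown. */
unsigned long fallback_total(void) {
  return fallback_count;
}
```

This keeps the output to one line per run while still making the fallback visible; whether that is preferable to an assertion depends on whether a NULL kernel is treated as a library bug or as an expected condition.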

@hfp
Member

hfp commented Aug 24, 2021

> For reproducing the issue: Cubic_RPA_H2O_standard.inp is one of the cases where we saw the segfaults (even with a single rank).

Thank you for sharing the reproducer! I am on it (as a side task). I am using LIBXSMM_TARGET=wsm instead of running on native Westmere (hard to get hands on such an old system ;-).

@hfp
Member

hfp commented Aug 24, 2021

I have a hard time reproducing the problem. The output (single rank) looks like:

LIBXSMM_VERSION: release-1.16.1 (25182208)
WSM/DP      TRY    JIT    STA    COL
   0..13     88     88      0      0
  14..23    104    104      0      0
  24..64      4      4      0      0
Registry and code: 10 MB + 1 MB (gemm=196 gemv=26)

This is LIBXSMM's termination message showing statistics about the generated kernels, which looks fine (aside from terminating after success). This is a debug build of CP2K (PSMP). I can try other builds if you think they would be a better match. I used the GNU compiler to build CP2K/master, etc.

I wonder if you can reach out (PM) and help with reactivating access to the UZH Portal?

@dev-zero
Contributor Author

> This is LIBXSMM's termination message showing statistics about the generated kernels, which looks fine (aside from terminating after success). This is a debug build of CP2K (PSMP). I can try other builds if you think they would be a better match. I used the GNU compiler to build CP2K/master, etc.
>
> I wonder if you can reach out (PM) and help with reactivating access to the UZH Portal?

Sure, account reactivated and mail with information sent :)

@hfp
Member

hfp commented Aug 25, 2021

I have root-caused the problem, and the result suggests that this PR should perhaps not be merged. Essentially, LIBXSMM detects SSE4 (via CPUID), which includes checking whether the OS permits using the extension (for example, saving state with the XSAVE instruction on a context switch). The OS does not seem to permit using SSE4 on this specific system (I may take a deeper look at why that is). However, the current LIBXSMM master changed this behavior to use the extension anyway, specifically in the case of SSE4 (I think we came across such situations at least with some VMs). There are now two options:

  • You can get LIBXSMM 1.16.2, which contains a minor change to permit SSE4 and thereby avoids the NULL kernel in this case.
  • You can use LIBXSMM_TARGET=wsm when running on this specific system.

The former option assumes that, after 20 years of SSE extensions, any OS will support/use XSAVE even if this is not signaled correctly. With the latter option, LIBXSMM takes the requested code path without further moderation (LIBXSMM_VERBOSE=1 emits a warning).
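
For readers unfamiliar with this kind of detection, the following C sketch shows how a CPU feature bit and an OS-support bit can be read from CPUID leaf 1. It is not LIBXSMM's detection logic, and the thread does not state which bits LIBXSMM inspects; the sketch only assumes the documented layout of ECX for leaf 1 (bit 20 = SSE4.2, bit 27 = OSXSAVE) and the `__get_cpuid` helper from the GCC/Clang `<cpuid.h>` header. The type and function names are hypothetical.

```c
#include <stdio.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helper header, x86 only */
#endif

#define CPUID1_ECX_SSE42   (1u << 20) /* CPU implements SSE4.2 */
#define CPUID1_ECX_OSXSAVE (1u << 27) /* OS has enabled XSAVE/XRSTOR */

typedef struct {
  int sse42;   /* hardware reports SSE4.2 */
  int osxsave; /* OS reports XSAVE-based state management */
} cpu_caps_t;

/* Decode the two bits of interest from the ECX value of CPUID leaf 1. */
cpu_caps_t decode_leaf1_ecx(unsigned int ecx) {
  cpu_caps_t caps;
  caps.sse42 = (ecx & CPUID1_ECX_SSE42) != 0;
  caps.osxsave = (ecx & CPUID1_ECX_OSXSAVE) != 0;
  return caps;
}

/* Query the running CPU. Returns 1 on success, 0 if CPUID leaf 1 is not
 * available or the build target is not x86. */
int query_cpu_caps(cpu_caps_t *caps) {
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
  *caps = decode_leaf1_ecx(ecx);
  return 1;
#else
  (void)caps;
  return 0;
#endif
}
```

A dispatcher that requires both bits would refuse to generate SSE4 code on a system that reports the first but not the second, whereas one that trusts the hardware bit alone would proceed.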

@hfp
Member

hfp commented Aug 25, 2021

I have prepared LIBXSMM 1.16.2. The above-mentioned option (LIBXSMM_TARGET=wsm) remains a viable workaround as well.

@hfp
Member

hfp commented Aug 31, 2021

I have released LIBXSMM 1.16.2, so you can decide which of the above solutions you prefer. LIBXSMM_TARGET=wsm on that machine also works with LIBXSMM 1.16.1 (now the previous release).

@dev-zero
Contributor Author

dev-zero commented Sep 8, 2021

@oschuett can you please update the libxsmm tarball on the mirror? It seems port 22 is now closed on sham.cp2k.org.

@oschuett
Member

oschuett commented Sep 8, 2021

Voilà: https://www.cp2k.org/static/downloads/libxsmm-1.16.2.tar.gz

> It seems port 22 is now closed on sham.cp2k.org.

You were probably banned by fail2ban - try again.

@dev-zero dev-zero merged commit 03cba78 into cp2k:master Sep 13, 2021
@dev-zero dev-zero deleted the bugfix/westmere branch September 13, 2021 14:53