Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add experimental support for DLA-Future eigensolver #3143

Merged
merged 40 commits into from
Dec 18, 2023

Conversation

RMeli
Copy link
Member

@RMeli RMeli commented Dec 1, 2023

This PR introduces DLA-Future, a task-based GPU-enabled dense linear algebra library based on pika (a C++ library for concurrency and parallelism), with focus on eigenvalue problems.

The library is still in development, with focus on performance improvements. Therefore, the integration should be considered experimental.

Correctness

Modification to CP2K to use DLA-Future for all matrices of size 3x3 and larger while running the regression tests:

diff --git a/src/input_cp2k_global.F b/src/input_cp2k_global.F
index 146660c2f..fc8eaedb1 100644
--- a/src/input_cp2k_global.F
+++ b/src/input_cp2k_global.F
@@ -215,7 +215,7 @@ CONTAINS
                           "the DLA-Future library. The eigensolver from the ScaLAPACK library is used as fallback "// &
                           "for all smaller cases", &
                           usage="DLAF_NEIGVEC_MIN 512", &
-                          default_i_val=1024)
+                          default_i_val=3)
       CALL section_add_keyword(section, keyword)
       CALL keyword_release(keyword)
 #endif

CP2K regression tests:

mkdir build && cd build

cmake \
    -DCP2K_SCALAPACK_VENDOR=MKL \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
    -DCP2K_ENABLE_REGTESTS=ON \
    -DCP2K_USE_MPI_F08=ON \
    -DCP2K_USE_DLAF=ON \
    ..

cd ..

export PATH=$(spack -e ~/git/my-spack/envs/local/cp2k-dlaf/mkl-mt-mpich location -i pika)/bin:$PATH
export CP2K_DATA_DIR=~/git/cp2k/data/

./tests/do_regtest.py local psmp \
    --mpiranks 1 --ompthreads 12 --num_gpus 0 \
    --mpiexec "mpiexec --bind-to core:12 pika-bind --print-bind --"

Results:

------------------------------- Summary --------------------------------
Number of FAILED  tests 1
Number of WRONG   tests 0
Number of CORRECT tests 2941
Total number of   tests 2942

Summary: correct: 2941 / 2942; failed: 1; 26min
Status: FAILED

*************************** Testing ended ******************************

The one failed test is QS/regtest-pao-2/H2O_pao_equi.inp but it appears to be a sporadic failure. Re running with --restrictdir QS/regtest-pao-2 concluded successfully. This particular problem is tracked in RMeli#4, and will be further investigated.

Performance

Performance optimisation of DLA-Future is still an ongoing effort and the library performance will improve with time. For this PR, I benchmarked version 0.3.1 of the library (with pika version 0.20.0). I added comparisons with ELPA and ScaLAPACK, when available.

Benchmark results are reported for Daint (multi-core: daint-mc, P100: daint-gpu) and Alps (multi-core: clariden-mc, A100: clariden-nvgpu, Mi200: clariden-amdgpu).

Benchmarks: QS/H2O-512.inp and QS/H2O-1024.inp, modified to use standard diagonalization.

Daint

Missing points are due to not enough memory being available for the particular configuration.

MC

eigensolver_h2o-512_daint-mc
eigensolver_h2o-1024_daint-mc

GPU

eigensolver_h2o-512_daint-gpu
eigensolver_h2o-1024_daint-gpu

Alps (Clariden)

Missing points are due to MPI (cray-mpich) issues we are encountering with large matrices on multiple nodes. We have seen the same issues on LUMI (with the DLA-Future miniapps) and we are currently investigating the issue. Unfortunately we don't have further details at this time.

Fatal error in PMPI_Sendrecv: Other MPI error, error stack:
PMPI_Sendrecv(249).............: MPI_Sendrecv(sbuf=0x149e80010080, scount=104857600, MPI_DOUBLE, dest=8, stag=0, rbuf=0x149eb2010080, rcount=104857600, MPI_DOUBLE, src=8, rtag=0, comm=0x84000005, status=0x7fffe40d4590) failed
MPID_Progress_wait(193)........: 
MPIDI_Progress_test(89)........: 
MPIDI_OFI_handle_cq_error(1062): OFI poll failed (ofi_events.c:1064:MPIDI_OFI_handle_cq_error:Input/output error - NO_TRANSLATION)
events.c:1064:MPIDI_OFI_handle_cq_error:Input/output error - NO_TRANSLATION)

Despite these issues, the experimental integration will facilitate further testing and development, as well as early feedback from users.

AMD GPUs

eigensolver_h2o-512_clariden-amdgpu
eigensolver_h2o-1024_clariden-amdgpu

NVidia GPUs

eigensolver_h2o-512_clariden-nvgpu
eigensolver_h2o-1024_clariden-nvgpu

Spack

Support for cp2k@master +dlaf build_system=cmake is already included in Spack (see spack/spack#40702).

Locally, I used this Spack environment to build CP2K and run the regression tests:

spack:
  specs:
  - cp2k@master +libxc +libint +cosma +spglib +dlaf +mpi_f08 build_system=cmake
  packages:
    all:
      require:
        - any_of: [+mpi, "@:"]
        - any_of: [~cuda, "@:"]
        - any_of: [~rocm, "@:"]
    mpi:
      require:
        - mpich
    mpich:
      require:
        - "@4.1: device=ch3 netmod=tcp"
    blas:
      require:
        - intel-mkl
    lapack:
      require:
        - intel-mkl
    scalapack:
      require:
        - intel-mkl
    intel-mkl:
      require:
        - threads=openmp
    cosma:
      require: 
        - +scalapack 
        - +shared
    dbcsr:
      require:
        - +mpi_f08
    dla-future:
      require:
        - "@0.3.1"
    pika:
      require: 
        - "@0.20:"
  view: false
  concretizer:
    unify: true

Credits

While not reflected in the commit history, @mtaillefumier and @msimberg contributed to this PR, especially with early integration of the Cholesky decomposition and general comments/suggestions/reviews.

@oschuett
Copy link
Member

oschuett commented Dec 4, 2023

Woohoo, it's awesome to see DLA-F on the finish line :-)

Do you think it would be possible to also add an installer script for the toolchain? I know it's tedious, but without it we can't test it.

Regarding the performance we found that a lot of time is also spend in pdtrmm. So, for cuSOLVERMp we're now looking into calling their general eigensolver.

@RMeli
Copy link
Member Author

RMeli commented Dec 4, 2023

Do you think it would be possible to also add an installer script for the toolchain? I know it's tedious, but without it we can't test it.

I never used the toolchain beforehand (Spack + CMake works great ;) ), so I'll have to look into it. Before spending time on this, could you please share what are the long-term plans for the toolchain? Is https://github.com/cp2k/cp2k-toolchain still relevant?

As an alternative, would it be possible to add a CI build based on #3075? cp2k+dlaf is already supported in Spack upstream. With such container in CI it would be easy to simply update the local Spack package and add the +dlaf variant.

Regarding the performance we found that a lot of time is also spend in pdtrmm. So, for cuSOLVERMp we're now looking into calling their general eigensolver.

That's good to know, thanks for the information. Version 0.3.1 of DLA-Future already provides a C API for the general eigensolver, so that should be reasonably easy to add too.

@oschuett
Copy link
Member

oschuett commented Dec 4, 2023

Before spending time on this, could you please share what are the long-term plans for the toolchain?

Long-term I really hope that we can ditch the toolchain in favor of Spack. However, we're not there yet. For example, building the dependencies with Spack takes currently twice as long as with the toolchain.

As an alternative, would it be possible to add a CI build based on #3075? cp2k+dlaf is already supported in Spack upstream. With such container in CI it would be easy to simply update the local Spack package and add the +dlaf variant.

I'm working on that and might have something in a few days. However, even then you'll be missing out one all the other test configurations that we run for the dashboard.

Is https://github.com/cp2k/cp2k-toolchain still relevant?

No, it's not. Thanks for the reminder. I just took it private. Likewise, I'm planing to remove the tools/toolchain-spack directory.

Version 0.3.1 of DLA-Future already provides a C API for the general eigensolver, so that should be reasonably easy to add too.

Cool! It'll require a bit of refactoring on CP2K's side as well. I'm not sure yet when I'll get to it though.

@RMeli
Copy link
Member Author

RMeli commented Dec 7, 2023

However, we're not there yet.

Probably a discussion for somewhere else, but is there any way we can help accelerate this process?

I'm working on that and might have something in a few days.

That's awesome! I'd suggest to use this when it's ready, to avoid the PR sitting here for long. Supporting DLA-Future via the toolchain would indeed be quite tedious and if the toolchain is bound to go soon(-ish) anyways, it's probably not time well spent.

Do you think this would be a reasonable way forward? If really needed, we can eventually think about toolchain support later on.

@oschuett
Copy link
Member

Probably a discussion for somewhere else, but is there any way we can help accelerate this process?

Yes, thanks for asking :-) Let's discuss offline.

I'd suggest to use this when it's ready, to avoid the PR sitting here for long.

I've just merged #3149, which migrates our first CI test onto Spack.

@RMeli RMeli mentioned this pull request Dec 11, 2023
@RMeli RMeli force-pushed the dlafuture branch 2 times, most recently from acaa7a4 to 6e1cc33 Compare December 11, 2023 17:14
@RMeli
Copy link
Member Author

RMeli commented Dec 12, 2023

@oschuett it is not obvious to me why the Regtest sdgb CI build keeps failing. Do you have any pointers?

src/fm/cp_fm_diag.F Show resolved Hide resolved
src/environment.F Outdated Show resolved Hide resolved
src/f77_interface.F Outdated Show resolved Hide resolved
@oschuett
Copy link
Member

I've run into those segfaults from the CMake test before. The -DCP2K_USE_MPI_F08=ON seems to trigger a bug in mp_waitall which corrupts the memory and crashes mp_timeset.

@RMeli
Copy link
Member Author

RMeli commented Dec 12, 2023

The -DCP2K_USE_MPI_F08=ON seems to trigger a bug in mp_waitall which corrupts the memory and crashes mp_timeset.

Thanks for the additional information. I was indeed trying to figure that out, since using MPICH 4.1 installed with Spack did work. Do you prefer to have MPICH installed with apt or spack?

@oschuett
Copy link
Member

If possible I'd like to use the apt package because it's faster to install.

@oschuett
Copy link
Member

One can now browse the full output of failed builds from the artifacts: make.out.

Turns out the actual problem is:

makedep error: Multiple source files with the same basename: cp_dlaf_utils

Unfortunately, I created a merge conflict in the process.

@RMeli
Copy link
Member Author

RMeli commented Dec 12, 2023

Thanks @oschuett! (No worries at all for the conflict.)

RMeli and others added 14 commits December 12, 2023 20:39
* DLA-Future (#1)

* More cmake fixes

Signed-off-by: Dr. Mathieu Taillefumier <mathieu.taillefumier@free.fr>

* Add dla-future support for cholesky

Signed-off-by: Mathieu Taillefumier <mathieu.taillefumier@free.fr>

* More work on cmake

* New Keyword: MIN_PAIR_LIST_RADIUS can be used to force non-zero blocks (cp2k#2511)

in KS and Overlap matrix.

* Regtesting: Fix DeprecationWarning

* Docker: Do not flag slow tests in minimal build

* PAO: Add equivariant parametrization

* PAO: Use correct distribution in pao_calc_AB_equi

* PAO: Add missing mp_sum to pao_calc_AB_equi

* Regtesting: Fix issue with Python 3.6

* Consider target CPU in OpenBLAS toolchain build

This should partially fix cp2k#2517

* Update track imag density (#6)

* PAO: Fix unused parameter and CMake build

* Keep compatibility with bash v3

* Add info for sourceable arch files and links to HowTos

* Refactor allocations of imaginary parts for RTP (cp2k#2531)

* unify allocation of imaginary parts of matrix_h_im and matrix_ks_im with their real counterparts

* refactor allocations of imaginary parts for RTP

* clean-up

* More work on cmake

* FIST: Add NequIP equivariant neural network potentials

* FIST: Fix CMake build and conventions for NequIP

* Remove unused parameters

* Remove unused label from beta_gamma_psi.F

* Toolchain: Add -Werror=unused-parameter and -Werror=unused-label

* Docker: Fix typo in cmake variable and add check for warnings

* Regtesting: Introduce tests/UNIT_TESTS file

* Let nequip_unittest fail when libtorch is missing

* Remove some more unused parameters

* Regtesting: Add Valgrind option

* Fix various simple Valgrind issues

* update keyword descriptions for RTP

* fix strings

* Fix typo in arch/Linux-intel-x86_64.psmp 

Shared part is not really tested afaict, but for the sake of completeness let's fix this typo.

* Add more target mappings for gcc to OpenBLAS

* Fix funky libxsmm library order

* fix libxc

* cleanup dlaf cholesky

* dirty water: dirty version working for H2O-32

* dlaf

* make preffiy

* Revert "make preffiy"

This reverts commit d27b375.

* revert fftw3 change

* add some dlaf output

* remove copies using new eigensolver api

* port @msimberg improvements

* clang formatting

* add improvements suggested by @msimberg

* rename cholesky_dlaf to dlaf

* remove single-threades scope from cp2k, it is now in dlaf

* general cleanup

* clang format

* further cleanup

* remove unused single precision eigensolver

* remove MKL-specific single threaded blas/lapack scope

* use new dlaf C API

* fix dlaf calls

* improve cuSOLVER-stype integration

* make cp2k compilable without dlaf

* print pika binding

* refactoring and cleanup

* remove local grid creation/free

* zero eigenvalues buffer

* cleanup

* add missing licence from c file

* refactoring

* cholesky wrapper

* cholesky wrapper

* add scalapack fallback and falback parameter dlaf_neigvec_min

* make pretty

* add unit test for dlaf

* remove spurious file

* fix usage of CP2K as a library

* remove initialization and pika threads

* add dlaf timing without upper to full

* actually fix mpi

* remove string utilities

* prettify

* Revert "prettify"

This reverts commit cd1e89a.

* cleanup:

* remove redistribution

* manual cleanup

* prettify cmake

* make pretty exluding dlaf.F

* add some details to install and __DLAF so that make pretty does not complain

* prettify install

* more consistent cabort

* remove intrinsic

* Revert "remove intrinsic"

This reverts commit 2458e00.

* format doc

* cleanup

* revert some changes to MPI init

* cleanup

* prettify

* remove redistribution

* pretty-test

* pretty-test

* pretty-test

* pretty-test

* pretty-test

* pretty-test

* pretty-test

* Revert "pretty-test"

This reverts commit f7fd23c.

* Revert "Revert "pretty-test""

This reverts commit b9a9879.

* Revert "pretty-test"

This reverts commit d582f3f.

* pretty-test

* pretty-test

* pretty-test

* pretty-test

* Revert "pretty-test"

This reverts commit b88b373.

* pretty-test

* remove snippet

* add timers -- workaround for cp2k#3071

* prettify

* doc cleanup

* Update src/fm/cp_dlaf_utils.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* Update src/fm/cp_fm_dlaf_api.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* Update src/fm/cp_dlaf_utils.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* Update src/fm/cp_dlaf_utils.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* Update src/fm/cp_dlaf_utils.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* remove ugly workaround

* cmake config

* line wrap install

* Update src/input_cp2k_global.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* gitignore

* add guards

* remove pika:print-bind in favour of environment variable

* __dlaf

* make pretty

* Update src/fm/cp_fm_dlaf_api.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* actually fix cmake

* fix cmake conflicts

* Update src/fm/cp_dlaf_utils.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* Update src/fm/cp_fm_dlaf_api.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* Update src/fm/cp_dlaf_utils.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* Update src/fm/cp_dlaf_utils.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* Update src/fm/cp_dlaf_utils.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* remove ugly workaround

* cmake config

* line wrap install

* Update src/input_cp2k_global.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* gitignore

* add guards

* remove pika:print-bind in favour of environment variable

* __dlaf

* make pretty

* Update src/fm/cp_fm_dlaf_api.F

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

* actually fix cmake

* make pretty

* Revert "MPI: Specify imported routines and constants with MPI_F08"

This reverts commit aecd330.

* Revert "Revert "MPI: Specify imported routines and constants with MPI_F08""

This reverts commit f224465.

* fix MPI

* initialize HIP/CUDA before MPI

* initialize HIP/CUDA before MPI

* better mpi init

---------

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
@RMeli
Copy link
Member Author

RMeli commented Dec 13, 2023

If possible I'd like to use the apt package because it's faster to install.

Unfortunately, after some testing, both mpich installed from apt and from Spack result in issues. This is something I have observed locally too (but never on HPC platforms with cray-mpich). Using mpich device=ch3 circumvents the issue.

@oschuett
Copy link
Member

oschuett commented Dec 13, 2023

Using mpich device=ch3 circumvents the issue.

In the toolchain we recently switched from ch3 to ch4 (#2805). So, presumable there are issue with ch3 elsewhere. In any case, we can sort that out later. Let's get this PR merged :-)

@RMeli RMeli requested a review from oschuett December 13, 2023 10:48
}

// Cholesky decomposition (double)
// Wrapper without uplo parameter, avoids passing characters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing strings to C is actually rather straightforward. Maybe you don't need these wrappers?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strings (char*) yes, but I was having issues with passing single characters (I'm not a Fortran expert and all the information I could find was for strings...). The DLAF C API requires a single character, not a string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DLAF C API requires a single character, not a string.

The Fortran type for a single character is CHARACTER - note the absence of the len= argument.

The C bindings for passing a single character as value is:

CHARACTER(kind=C_CHAR), VALUE  :: uplo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC pdpotrf is only called with 'U' anyways, that's why I went with this easier solution instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I used, and did not seem work:

Fascinating! This feels like a bug.

I eventually got it to work by passing an integer, which should technically not be a hack.

   subroutine foo_char_rubbish(aa)
      character :: aa

      interface
         subroutine foo1_c(aaa) bind(C, name="foo1")
            import:: C_SIGNED_CHAR
            INTEGER(kind=C_SIGNED_CHAR), value :: aaa
         end subroutine
      end interface

      call foo1_c(IACHAR(aa, C_SIGNED_CHAR))
   end subroutine

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice workaround. I added it to the reproducer for my future self. ;)

Do you prefer to have this or what I did so far, given that only 'U' is used anyways? We can always come back to it if 'L' is needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's always good to remove unnecessary code. So, personally I'd remove cp_dlaf_utils.c. Furthermore, I'd merge cp_dlaf_utils_api.F into cp_fm_dlaf_api.F to have that single interface module.

However, since this affects only the DLAF code, I'll leave it up to you.

Otherwise this PR looks good to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used your workaround (thanks!) to remove the wrapper functions. I also removed the .c file as a whole, as requested. I'm not entirely convinced that passing char** directly from Fortran is better than a simple wrapper in this case, but I think it's alright.

The separation of the two .F is to avoid circular dependencies.

src/mpiwrap/message_passing.F Show resolved Hide resolved
src/mpiwrap/message_passing.F Outdated Show resolved Hide resolved
src/f77_interface.F Outdated Show resolved Hide resolved
@RMeli RMeli requested a review from oschuett December 14, 2023 16:29
}

// Cholesky decomposition (double)
// Wrapper without uplo parameter, avoids passing characters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's always good to remove unnecessary code. So, personally I'd remove cp_dlaf_utils.c. Furthermore, I'd merge cp_dlaf_utils_api.F into cp_fm_dlaf_api.F to have that single interface module.

However, since this affects only the DLAF code, I'll leave it up to you.

Otherwise this PR looks good to me.

@oschuett oschuett merged commit 6ed950b into cp2k:master Dec 18, 2023
34 checks passed
@RMeli RMeli mentioned this pull request Dec 18, 2023
@RMeli RMeli deleted the dlafuture branch December 19, 2023 22:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants