Discussion on DBCSR #815

Closed
Schroedingers0216 opened this issue Jul 3, 2024 · 43 comments

@Schroedingers0216

Schroedingers0216 commented Jul 3, 2024

I am writing to seek your assistance. When running CP2K simulations on the AMD MI50 platform with the DBCSR backend set to HIP, the execution time is longer than using the CPU. Using HIPprof to examine the API calls, I noticed that the results show a large number of H2D (Host to Device) transfers but no kernel launches. Normally, the call flow should be H2D -> LaunchKernel -> D2H (Device to Host). I would like to understand why there are so many H2D transfers and where in the code this is occurring. Below, I have attached the JSON file for you to open in chrome://tracing.

Thank you.

@hfp
Member

hfp commented Jul 3, 2024

If possible, can you share the input file and perhaps the profile output when running the workload? The profile output is what contains the timings printed by CP2K at the end. What's clear already is that this is not only about DBCSR but also about CP2K's GRID components (collocate/integrate), and perhaps even some PW, etc.

Regarding, "H2D -> LaunchKernel -> D2H" - this is idealized assuming only a single transfer/array is the input of such kernel and in turn for the output/result as well.

@Schroedingers0216
Author

I tried setting the DBCSR backend to other options, and I didn't find a large number of H2D transfers in HIPprof. Therefore, I believe DBCSR is causing the issue. Additionally, it might be due to the transpose_d kernel. I couldn't locate the specific code responsible for the numerous H2D transfers. Below, I have attached the test file and output file. Thank you. @hfp
test.tar.gz

@hfp
Member

hfp commented Jul 4, 2024

For the record, if there are "unnecessary" data transfers, i.e., transfers that could be combined or avoided, this issue applies to all backends as well as all GPUs/vendors. The hint about the transposes might be a first step.

@zhl201226 you may try the DBCSR_RUN_ON_GPU=0 environment variable and recapture the GPU profile. This environment variable disables DBCSR on GPUs even if GPU support is compiled into the application (and leaves the other uses of GPUs in CP2K intact).
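A minimal sketch of how such a run could be driven from the shell, assuming a typical MPI launch (binary name, rank count, and file names are placeholders, not taken from this thread):

export DBCSR_RUN_ON_GPU=0     # disable only DBCSR's GPU path; the other GPU users in CP2K stay active
mpirun -np 8 ./cp2k.psmp test.inp > test_no_dbcsr_gpu.out
# recapture the HIP profile of this run and compare its H2D traffic against the original run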

@hfp
Member

hfp commented Jul 4, 2024

Looking at CP2K's profile, local GEMMs (cp_fm_gemm) consume ~25% of the time to solution on this system (just as a note). However, multiply_cannon* and dbcsr_mm_hostdrv_process are interesting. Given that dbcsr_mm_hostdrv_process is relatively high, it seems there is a sizeable portion of fallbacks happening. Given the previous implementation, the fallbacks may be accompanied by transfers without actually launching a kernel.

@Schroedingers0216
Author

I have identified that the H2D issue occurs in the dbcsr_mm_accdrv_process module. Is this module dividing the data into small chunks for transfer? Could they be merged into larger chunks? Additionally, I previously did not use ACC to accelerate DBCSR, but it seems to be taking longer now, so I am not sure whether DBCSR_RUN_ON_GPU=0 is effective. Could you please provide more optimization suggestions?

@hfp
Member

hfp commented Jul 4, 2024

Sorry, I guess DBCSR_RUN_ON_GPU is only supported in the most recent, if not unreleased, version. This was not meant as an optimization suggestion but rather as a way to systematically rule out or implicate DBCSR. Your example input is worth looking at for contributors.

@Schroedingers0216
Author

How do I contact contributors?
@hfp

@hfp
Member

hfp commented Jul 4, 2024

Just give it some time; they will see this open issue ;-)

@Schroedingers0216
Author

Just give it some time; they will see this open issue ;-)

thank you :-)

@hfp
Member

hfp commented Jul 4, 2024

( Side note, GLOBAL| CPU model name does not show up in the log ;-)

@hfp
Member

hfp commented Jul 4, 2024

Regarding the test input, it's missing the restart file for the SCF initial guess. Commenting it out starts from an unreasonable guess and fails in the Cholesky decomposition.

@Schroedingers0216
Author

( Side note, GLOBAL| CPU model name does not show up in the log ;-)

"By the way, using DBCSR_RUN_ON_GPU=0 did not significantly improve performance. The CPU model name has been hidden for other reasons, but I can provide it if needed."
(screenshot attached)

@Schroedingers0216
Author

Regarding the test input, it's missing the restart file for the SCF initial guess. Commenting it out starts from an unreasonable guess and fails in the Cholesky decomposition.

This restart file is too large to upload. Is there another way to send it to you?

@hfp
Member

hfp commented Jul 4, 2024

Hmm, others may have the same request so Dropbox or something like this comes to mind. My e-mail is my . name @ intel . com.

@Schroedingers0216
Author

Hmm, others may have the same request so Dropbox or something like this comes to mind. My e-mail is my . name @ intel . com.

I have already sent it to you via email. Thank you.

@hfp
Member

hfp commented Jul 4, 2024

Hmm, others may have the same request so Dropbox or something like this comes to mind. My e-mail is my . name @ intel . com.

I have already sent it to you via email. thank you

( Let's see, the e-mail did not arrive yet; perhaps size restrictions )

@Schroedingers0216
Author

Hmm, others may have the same request so Dropbox or something like this comes to mind. My e-mail is my . name @ intel . com.

I have already sent it to you via email. thank you

( Let's see, the e-mail did not arrive yet perhaps size restrictions )

I have resent it to my.name@intel.com. Please check it. Best regards

@hfp
Member

hfp commented Jul 8, 2024

I have resent it to my.name@intel.com. Please check it. Best regards

Literally? I envisioned my.name would be my name taken from https://github.com/hfp (hans.pabst). Sorry for the confusion.

@Schroedingers0216
Author

I have resent it to my.name@intel.com. Please check it. Best regards

Literally? I envisioned my.name would be my name taken from https://github.com/hfp (hans.pabst). Sorry for the confusion.

Sure, I also sent an email to hans.pabst@intel.com, and my email address is zhanghl20126@gmail.com.

@alazzaro
Member

I am writing to seek your assistance. When running CP2K simulations on the AMD MI50 platform with the DBCSR backend set to HIP, the execution time is longer than using the CPU. Using HIPprof to examine the API calls, I noticed that the results show a large number of H2D (Host to Device) transfers but no kernel launches. Normally, the call flow should be H2D -> LaunchKernel -> D2H (Device to Host). I would like to understand why there are so many H2D transfers and where in the code this is occurring. Below, I have attached the JSON file for you to open in chrome://tracing.

Thank you.

The important CP2K timers for your execution are the following:

grid_integrate_task_list         340.326
grid_collocate_task_list         377.996
multiply_multrec                 523.278
cp_fm_syevd_base                 637.115
cp_fm_redistribute_end           639.139
dbcsr_mm_hostdrv_process        1229.836
cp_gemm_cosma                   2335.899
CP2K_Total                      8183.616

Now, I would assume you are running COSMA on the GPU, so you cannot gain more there.
Then I see cp_fm_syevd_base; I am not sure if ELPA can give some benefit. The same may be the case for https://github.com/eth-cscs/DLA-Future. The grid parts are already running on the GPU.

Concerning DBCSR, the important part is the DBCSR kernel output:

 -------------------------------------------------------------------------------
 -                                                                             -
 -                                DBCSR STATISTICS                             -
 -                                                                             -
 -------------------------------------------------------------------------------
 COUNTER                                    TOTAL       BLAS       SMM       ACC
 flops     1 x     1 x     1                 3610       0.0%    100.0%      0.0%
 flops     1 x     1 x     5                19040       0.0%    100.0%      0.0%
...
 flops total                       537.243062E+12       0.0%     96.5%      3.5%
 flops max/rank                     35.731363E+12       0.0%     96.5%      3.5%
 matmuls inhomo. stacks                         0       0.0%      0.0%      0.0%
 matmuls total                        22844500215       0.0%     98.8%      1.2%
 number of processed stacks               3196393       0.0%     92.7%      7.3%
 average stack size                                     0.0    7614.1    1217.7

Basically, 98.8% of the block multiplications are running on the CPU (SMM column) and only 1.2% on the GPU (ACC column). The reason is that your kernel sizes are not present in the GPU tuning parameters. There are several ways to improve the situation (in order of preference; see the sketch after this list for option 2):

  1. Run the tuning procedure for the parameters you are interested in and contribute them to the current list.
  2. You can try to set export DBCSR_MM_DENSE=1; the list of kernels should change and possibly more kernels will run on the GPU.
  3. Use the latest DBCSR (v2.7.0-rc2), which provides a default GPU kernel when the tuned kernels are not available.
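As a rough illustration of option 2 (a sketch only; the launch command and file names are placeholders), one could rerun with the variable set and then check how the ACC column of the DBCSR STATISTICS block changes:

export DBCSR_MM_DENSE=1       # switch DBCSR to its dense multiplication treatment (option 2 above)
mpirun -np 8 ./cp2k.psmp test.inp > test_dense.out
grep -A 4 "flops total" test_dense.out    # the ACC share should grow if more kernels run on the GPU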

@Schroedingers0216
Author

I am writing to seek your assistance. When running CP2K simulations on the AMD MI50 platform with the DBCSR backend set to HIP, the execution time is longer than using the CPU. Using HIPprof to examine the API calls, I noticed that the results show a large number of H2D (Host to Device) transfers but no kernel launches. Normally, the call flow should be H2D -> LaunchKernel -> D2H (Device to Host). I would like to understand why there are so many H2D transfers and where in the code this is occurring. Below, I have attached the JSON file for you to open in chrome://tracing.
Thank you.

The important CP2K timers for your execution are the following:

grid_integrate_task_list         340.326
grid_collocate_task_list         377.996
multiply_multrec                 523.278
cp_fm_syevd_base                 637.115
cp_fm_redistribute_end           639.139
dbcsr_mm_hostdrv_process        1229.836
cp_gemm_cosma                   2335.899
CP2K_Total                      8183.616

Now, I would assume you are running COSMA on the GPU, so you cannot gain more there. Then I see cp_fm_syevd_base; I am not sure if ELPA can give some benefit. The same may be the case for https://github.com/eth-cscs/DLA-Future. The grid parts are already running on the GPU.

Concerning DBCSR, the important part is the DBCSR kernel output:

 -------------------------------------------------------------------------------
 -                                                                             -
 -                                DBCSR STATISTICS                             -
 -                                                                             -
 -------------------------------------------------------------------------------
 COUNTER                                    TOTAL       BLAS       SMM       ACC
 flops     1 x     1 x     1                 3610       0.0%    100.0%      0.0%
 flops     1 x     1 x     5                19040       0.0%    100.0%      0.0%
...
 flops total                       537.243062E+12       0.0%     96.5%      3.5%
 flops max/rank                     35.731363E+12       0.0%     96.5%      3.5%
 matmuls inhomo. stacks                         0       0.0%      0.0%      0.0%
 matmuls total                        22844500215       0.0%     98.8%      1.2%
 number of processed stacks               3196393       0.0%     92.7%      7.3%
 average stack size                                     0.0    7614.1    1217.7

Basically, 98.8% of the block multiplications are running on the CPU (SMM column) and only 1.2% on the GPU (ACC column). The reason is that your kernel sizes are not present in the GPU tuning parameters. There are several ways to improve the situation (in order of preference):

  1. Run the tuning procedure for the parameters you are interested in and contribute them to the current list.
  2. You can try to set export DBCSR_MM_DENSE=1; the list of kernels should change and possibly more kernels will run on the GPU.
  3. Use the latest DBCSR (v2.7.0-rc2), which provides a default GPU kernel when the tuned kernels are not available.

I will debug based on your suggestions later, but since that process is relatively long, I will close the issue for now. Thank you very much.

Schroedingers0216 changed the title from "CP2K performs poorly on AMD platforms when using the DBCSR HIP backend." to "CP2K performs poorly on AMD platforms" on Jul 11, 2024
@hfp
Member

hfp commented Jul 11, 2024

1. Run the [tuning procedure](https://cp2k.github.io/dbcsr/develop/page/3-developer-guide/3-programming/2-accelerator-backend/2-libsmm_acc/index.html) for the parameters you are interested and contribute to the current list.

2. You can try to set `export DBCSR_MM_DENSE=1`, you can see that the list of kernels should change and possibly more kernels will run on the GPU

3. Use the latest DBCSR (v2.7.0-rc2) which provides a default GPU kernel when the tuned kernels are not available.

I am sure the OpenCL backend can be mixed with HIP as well (just like with CUDA). However, I have not spent any time exercising this. It comes down to support in the build system on CP2K's side. In any case, I will keep HIP in mind when taking on this task (it is still open for me to get DBM/DBT and OpenCL-based DBCSR into CP2K's CMake).

@Schroedingers0216
Author

Sorry, but I have to reopen this issue.

1. When using the default GPU kernel, the dbcsr_mm_accdrv_process module is called very frequently, with the following call counts:
dbcsr_mm_accdrv_process 148432 18.9 90.392 111.997 148.383 177.363
I have to suspect that this is the main reason for the performance issues. In contrast, when using CUDA, there is no record of this function in the final list.

2. During kernel training, there are a large number of VMFault errors. I have tried making modifications, but there have been no significant improvements. How should I resolve this issue?
Invalid address access: 0x343b05325000, Error code: 1.

KERNEL VMFault !!!! <<<<<<

PID: 8097, SIGNAL: 0 !!!! <<<<<<
=========> HOSTQUEUE <0x1b59b0f0>: VMFault HSA QUEUE ANALYSIS <=========
HOSTQUEUE <0x1b59b0f0>: get hsa queue W/R ptr: write index: 62961, read index: 62957
HOSTQUEUE <0x1b59b0f0>: >>>>>>>> DUMP KERNEL AQL PACKET <<<<<<<<<
HOSTQUEUE <0x1b59b0f0>: header: 2818
HOSTQUEUE <0x1b59b0f0>: setup: 3
HOSTQUEUE <0x1b59b0f0>: workgroup: x:128, y:1, z:1
HOSTQUEUE <0x1b59b0f0>: grid: x:128128, y:1, z:1
HOSTQUEUE <0x1b59b0f0>: group_segment_size: 2240
HOSTQUEUE <0x1b59b0f0>: private_segment_size: 136
HOSTQUEUE <0x1b59b0f0>: kernel_object: 47532725616576

HOSTQUEUE <0x1b59b0f0>: device id: 0

HOSTQUEUE <0x1b59b0f0>: >>>>>>>> FIND MATCH KERNEL COMMAND <<<<<<<<<
HOSTQUEUE <0x1b59b0f0>: kernel name: _Z20smm_acc_dnt_largeDB2ILi32ELi32ELi32ELi6ELi2ELi4ELi6ELi128ELi16ELi4EEvPKiiPKdS3_Pd

HOSTQUEUE <0x1b59b0f0>: >>>>>>>> DUMP KERNEL ARGS: size: 40 <<<<<<<<<

00 00 c0 05 3b 2b 00 00 85 3e 00 00 00 00 00 00
00 00 20 f5 3a 2b 00 00 00 00 00 00 3b 2b 00 00
00 00 20 05 3b 2b 00 00

HOSTQUEUE <0x1b59b0f0>: >>>>>>>> DUMP KERNEL ARGS PTR INFO <<<<<<<<<
HOSTQUEUE <0x1b59b0f0>: ptr arg index: 0, ptr: 0x2b3b05c00000
HOSTQUEUE <0x1b59b0f0>: origin ptr: 0x2b3b05c00000, size byte: 192060

HOSTQUEUE <0x1b59b0f0>: ptr arg index: 2, ptr: 0x2b3af5200000
HOSTQUEUE <0x1b59b0f0>: origin ptr: 0x2b3af5200000, size byte: 81920000

HOSTQUEUE <0x1b59b0f0>: ptr arg index: 3, ptr: 0x2b3b00000000
HOSTQUEUE <0x1b59b0f0>: origin ptr: 0x2b3b00000000, size byte: 81920000

HOSTQUEUE <0x1b59b0f0>: ptr arg index: 4, ptr: 0x2b3b05200000
HOSTQUEUE <0x1b59b0f0>: origin ptr: 0x2b3b05200000, size byte: 8192000

=========> HOSTQUEUE <0x1b1dbb60>: VMFault HSA QUEUE ANALYSIS <=========
params 6969 / 9136

@alazzaro @hfp

@alazzaro
Member

alazzaro commented Aug 8, 2024

In my comment I gave some suggestions, especially on DBCSR. Since then, the new DBCSR and CP2K are out (2024.2), have you tried it?

@Schroedingers0216
Author

In my comment I gave some suggestions, especially on DBCSR. Since then, the new DBCSR and CP2K are out (2024.2), have you tried it?

Yes, I have tried all the suggestions you gave, but the results are not so ideal, especially for the two points above.

@alazzaro
Member

alazzaro commented Aug 8, 2024

Please post the 2 CP2K logs (CUDA and HIP). There is no reason why the two calls of the function should be different.

@Schroedingers0216
Author

Please post the 2 CP2K logs (CUDA and HIP). There is no reason why the two calls of the function should be different.

CP2K.log
The call of the "dbcsr_mm_accdrv_process" module only appears in HIP.

@Schroedingers0216
Author

Please post the 2 CP2K logs (CUDA and HIP). There is no reason why the two calls of the function should be different.

CP2K.log The call of the "dbcsr_mm_accdrv_process" module only appears in HIP.

Additionally, the vmfault error is preventing me from training a suitable kernel.

@alazzaro
Member

alazzaro commented Aug 8, 2024

Please post the 2 CP2K logs (CUDA and HIP). There is no reason why the two calls of the function should be different.

CP2K.log The call of the "dbcsr_mm_accdrv_process" module only appears in HIP.

You don't need the training part if you are using the new CP2K 2024.2.
Could you add the DBCSR statistics to the log?

@Schroedingers0216
Author

Please post the 2 CP2K logs (CUDA and HIP). There is no reason why the two calls of the function should be different.

CP2K.log The call of the "dbcsr_mm_accdrv_process" module only appears in HIP.

You don't need the training part if you are using the new CP2K 2024.2. Could you add the DBCSR statistics to the log?

DBCSR.log
Sure, here is the DBCSR.log.

@alazzaro
Member

alazzaro commented Aug 8, 2024

I can analyze your initial log and check the timers (first column is HIP, second is CUDA):

make_images                       10.387    18.123
multiply_cannon_loop              16.520     3.574
xc_rho_set_and_dset_create        18.409    18.188
xc_vxc_pw_create                  42.752    44.334
yz_to_x                           76.430   113.945
dbcsr_mm_accdrv_process          111.997     -----
dbcsr_complete_redistribute      125.758   130.837
x_to_yz                          135.309   200.783
multiply_cannon_sync_h2d         184.832   180.573
pbe_lsd_eval                     192.631   179.861
mp_alltoall_z22v                 217.510   358.638
multiply_multrec                 232.394   151.375
mp_waitall_1                     295.217   419.426
mp_alltoall_d11v                 329.255   155.790
mp_waitany                       336.633     -----
grid_collocate_task_list         369.034   113.191
grid_integrate_task_list         438.427   315.792
cp_fm_syevd_base                 830.135   439.105
cp_fm_redistribute_end           832.095   443.968
cp_gemm_cosma                   2744.748  1164.564
CP2K_Total                      7833.532  5368.958

I take COSMA as a reference for the HIP vs. CUDA performance, i.e. 2744.748/1164.564 = 2.4x.

Now we can analyze DBCSR. The main call is:

dbcsr_multiply_generic          1282.394   961.339

Here the ratio is much less than 2.4x, so I would say the two versions are compatible.
Then, the next call to check is:

multiply_cannon_loop             798.414   663.672

Still, they are compatible. Another step below, we have:

multiply_multrec                 401.995   215.853

Here we see something compatible with the COSMA performance ratio, which is still OK.

The fact that you don't see dbcsr_mm_accdrv_process in the CUDA version may be because the function's contribution is negligible with respect to the callee timer, so it is filtered from the output. You can try to change the threshold (see https://manual.cp2k.org/trunk/CP2K_INPUT/GLOBAL/TIMINGS.html#CP2K_INPUT.GLOBAL.TIMINGS.THRESHOLD).
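For completeness, a sketch of how lowering that threshold could look in the CP2K input file (the value 1.0E-5 is only an illustrative choice; see the linked manual page for the default and the exact semantics):

&GLOBAL
  &TIMINGS
    ! report timers down to a smaller fraction of the total run time, so that
    ! small contributions such as dbcsr_mm_accdrv_process are not filtered out
    THRESHOLD 1.0E-5
  &END TIMINGS
&END GLOBAL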

@alazzaro
Member

alazzaro commented Aug 8, 2024

DBCSR.log

OK, I would assume that the CUDA version doesn't use the latest CP2K, since some of the kernels are still executed on the CPU, e.g.:

 COUNTER                                    TOTAL       BLAS       SMM       ACC
 flops     1 x     1 x     1                 3968       0.0%    100.0%      0.0%
 flops     1 x     1 x     5                21920       0.0%    100.0%      0.0%
 flops     5 x     1 x     1                27400       0.0%    100.0%      0.0%
 flops     5 x     5 x     1                53400       0.0%    100.0%      0.0%
 flops     5 x     1 x     5               217200       0.0%    100.0%      0.0%
 flops     5 x     5 x    10               450000       0.0%      0.0%    100.0%
 flops     5 x    10 x     5               952000       0.0%      0.0%    100.0%
 flops    32 x     1 x     1              6856704       0.0%    100.0%      0.0%
 flops     1 x    32 x     1              6856704       0.0%    100.0%      0.0%
 flops     1 x     1 x    32              6856704       0.0%    100.0%      0.0%
 flops     5 x    13 x     1              7057180       0.0%    100.0%      0.0%
 flops    13 x     5 x     1              7359040       0.0%    100.0%      0.0%
 flops     1 x     1 x    26              8206848       0.0%    100.0%      0.0%
 flops     1 x     1 x    13              8206848       0.0%    100.0%      0.0%
 flops    26 x     1 x     1             10258560       0.0%    100.0%      0.0%
 flops    13 x     1 x     1             10258560       0.0%    100.0%      0.0%
 flops     5 x    26 x     1             11609520       0.0%    100.0%      0.0%
 flops    26 x     5 x     1             11708320       0.0%    100.0%      0.0%
 flops    32 x     1 x     5             37393920       0.0%    100.0%      0.0%
 flops     1 x    32 x     5             37393920       0.0%    100.0%      0.0%
 flops     5 x    32 x     1             46379520       0.0%    100.0%      0.0%
 flops     5 x     1 x    32             46379520       0.0%    100.0%      0.0%
 flops    13 x     1 x     5             52060450       0.0%    100.0%      0.0%
 flops     5 x     1 x    13             52060450       0.0%    100.0%      0.0%

For the HIP case you are using the new CP2K and indeed all kernels are executed on the GPU (but likely they are not performing as expected...).
That explains why there are far fewer calls to dbcsr_mm_accdrv_process in the CUDA version: it simply pushes fewer kernels to the GPU. Could you try to run the CUDA version with the new CP2K? I'm interested to see whether for some kernels it is still better to keep the execution on the CPU...

@Schroedingers0216
Author

DBCSR.log

OK, I would assume that the CUDA version doesn't use the latest CP2K, since some of the kernels are still executed on the CPU, e.g.:

 COUNTER                                    TOTAL       BLAS       SMM       ACC
 flops     1 x     1 x     1                 3968       0.0%    100.0%      0.0%
 flops     1 x     1 x     5                21920       0.0%    100.0%      0.0%
 flops     5 x     1 x     1                27400       0.0%    100.0%      0.0%
 flops     5 x     5 x     1                53400       0.0%    100.0%      0.0%
 flops     5 x     1 x     5               217200       0.0%    100.0%      0.0%
 flops     5 x     5 x    10               450000       0.0%      0.0%    100.0%
 flops     5 x    10 x     5               952000       0.0%      0.0%    100.0%
 flops    32 x     1 x     1              6856704       0.0%    100.0%      0.0%
 flops     1 x    32 x     1              6856704       0.0%    100.0%      0.0%
 flops     1 x     1 x    32              6856704       0.0%    100.0%      0.0%
 flops     5 x    13 x     1              7057180       0.0%    100.0%      0.0%
 flops    13 x     5 x     1              7359040       0.0%    100.0%      0.0%
 flops     1 x     1 x    26              8206848       0.0%    100.0%      0.0%
 flops     1 x     1 x    13              8206848       0.0%    100.0%      0.0%
 flops    26 x     1 x     1             10258560       0.0%    100.0%      0.0%
 flops    13 x     1 x     1             10258560       0.0%    100.0%      0.0%
 flops     5 x    26 x     1             11609520       0.0%    100.0%      0.0%
 flops    26 x     5 x     1             11708320       0.0%    100.0%      0.0%
 flops    32 x     1 x     5             37393920       0.0%    100.0%      0.0%
 flops     1 x    32 x     5             37393920       0.0%    100.0%      0.0%
 flops     5 x    32 x     1             46379520       0.0%    100.0%      0.0%
 flops     5 x     1 x    32             46379520       0.0%    100.0%      0.0%
 flops    13 x     1 x     5             52060450       0.0%    100.0%      0.0%
 flops     5 x     1 x    13             52060450       0.0%    100.0%      0.0%

For the HIP case you are using the new CP2K and indeed all kernels are executed on the GPU (but likely they are not performing as expected...). That explains why there are far fewer calls to dbcsr_mm_accdrv_process in the CUDA version: it simply pushes fewer kernels to the GPU. Could you try to run the CUDA version with the new CP2K? I'm interested to see whether for some kernels it is still better to keep the execution on the CPU...

Thank you very much. I'll provide you with the CUDA log later. As for the compilation issues with the COSMA library, I'm encountering some difficulties (see cp2k/cp2k#3611). Additionally, if I use the latest version of DBCSR, do I need to retrain the kernels?

@alazzaro
Member

alazzaro commented Aug 8, 2024

I would say that the performance is more or less OK for HIP. No need to retrain the kernels; just use the new DBCSR. But really, there is no need to train anything: I see you are using the generic kernels, which is OK.

Now the question is whether we are "abusing" the generic kernel for all cases; maybe we should still keep some kernels executing on the CPU (I see you have a lot of kernels with size 1)... So, yes, the test of CUDA with the new CP2K can give us some hints.

@Schroedingers0216
Author

I would say that the performance is somehow OK for the HIP. No need to retrain the kernels, just use the new DBCSR. But really, no need to train anything, I see you are using the generic kernels, which is OK.

Now the question is if we are "abusing" the generic kernel for all cases, maybe we should still keep some kernels to be executed on the CPU (I see you have a lot of kernels with size 1)... So, yes, the test of the CUDA with the new CP2K can give us some hints.

I'll provide it to you as soon as possible.

@alazzaro
Member

alazzaro commented Aug 8, 2024

One more thing, you can try:

export DBCSR_MM_DENSE=1

and see if the kernels in the DBCSR statistics will change.

@Schroedingers0216
Author

One more thing, you can try:

export DBCSR_MM_DENSE=1

and see if the kernels in the DBCSR statistics will change.

This is the calculation result using the tuned kernels, and I guess not all tuned kernels are suitable for running on the GPU.
cuda-new.log

@alazzaro
Member

alazzaro commented Aug 9, 2024

One more thing, you can try:

export DBCSR_MM_DENSE=1

and see if the kernels in the DBCSR statistics will change.

This is the calculation result using tune kernel, and I guess not all tunes are suitable for calculation on GPU. cuda-new.log

Something is wrong with the attachment; I can't download it...

@Schroedingers0216
Author

One more thing, you can try:

export DBCSR_MM_DENSE=1

and see if the kernels in the DBCSR statistics will change.

This is the calculation result using tune kernel, and I guess not all tunes are suitable for calculation on GPU. cuda-new.log

Something is wrong with the attachment; I can't download it...
cuda-new.log

@alazzaro
Member

alazzaro commented Aug 9, 2024

There is something wrong with the last run, at least I cannot compare it directly. I would assume that only DBCSR should change, while I see the following:

pw_gpu_fg                        108.515     -----
grid_collocate_task_list         113.191   159.864
yz_to_x                          113.945   174.592
dbcsr_complete_redistribute      130.837   179.396
pw_derive                        141.166   198.227
multiply_multrec                 151.375   246.651
mp_alltoall_d11v                 155.790   226.452
pbe_lsd_eval                     179.861   232.371
multiply_cannon_sync_h2d         180.573   283.476
x_to_yz                          200.783   280.289
grid_integrate_task_list         315.792   466.755
mp_alltoall_z22v                 358.638   429.072
mp_waitall_1                     419.426   560.347
cp_fm_syevd_base                 439.105   639.631
cp_fm_redistribute_end           443.968   647.219
cp_gemm_cosma                   1164.564  1652.117
CP2K_Total                      5368.958  7521.072

The number of multiplications is different:

  1. old file has 6846
  2. new 9194

We cannot compare them...

@Schroedingers0216
Author

There is something wrong with the last run, at least I cannot compare it directly. I would assume that only DBCSR should change, while I see the following:

pw_gpu_fg                        108.515     -----
grid_collocate_task_list         113.191   159.864
yz_to_x                          113.945   174.592
dbcsr_complete_redistribute      130.837   179.396
pw_derive                        141.166   198.227
multiply_multrec                 151.375   246.651
mp_alltoall_d11v                 155.790   226.452
pbe_lsd_eval                     179.861   232.371
multiply_cannon_sync_h2d         180.573   283.476
x_to_yz                          200.783   280.289
grid_integrate_task_list         315.792   466.755
mp_alltoall_z22v                 358.638   429.072
mp_waitall_1                     419.426   560.347
cp_fm_syevd_base                 439.105   639.631
cp_fm_redistribute_end           443.968   647.219
cp_gemm_cosma                   1164.564  1652.117
CP2K_Total                      5368.958  7521.072

The number of multiplications is different:

  1. old file has 6846
  2. new 9194

We cannot compare them...

This is my rerun with the old version.
cuda-old.log

@alazzaro
Member

OK, thanks, so the conclusion is that the two runs (old and new) are compatible. In terms of DBCSR, I see a difference of 30s on top of ~1300s, so I would say this is simply noise. I can conclude that the generic kernel is good enough (at least this is not the dominant factor for sure).

Still, the comparison between the new CUDA run and the corresponding HIP run remains open (I have the HIP run with only 6846 multiplications). Indeed, to my surprise, the timer dbcsr_mm_accdrv_process is not present in CUDA, probably because of the timer filtering (see my comment above).

@Schroedingers0216
Author

OK, thanks, so the conclusion is that the two runs (old and new) are compatible. In terms of DBCSR, I see a difference of 30s on top of ~1300s, so I would say this is simply noise. I can conclude that the generic kernel is good enough (at least this is not the dominant factor for sure).

Still, the comparison between the new CUDA run and the corresponding HIP run remains open (I have the HIP run with only 6846 multiplications). Indeed, to my surprise, the timer dbcsr_mm_accdrv_process is not present in CUDA, probably because of the timer filtering (see my comment above).

Thank you. Now it seems that we can't attribute all the problems to DBCSR and HIP; there is no basis for that. If there are new problems in the future, I will get in touch with you promptly.

Schroedingers0216 changed the title from "CP2K performs poorly on AMD platforms" to "Discussion on DBCSR" on Aug 10, 2024