Discussion on DBCSR #815
If possible, can you share the input file and perhaps the profile output from running the workload? The profile output contains the timings printed by CP2K at the end. What's already clear is that this is not only about DBCSR but also CP2K's GRID components (collocate/integrate), and perhaps even some PW, etc. Regarding "H2D -> LaunchKernel -> D2H": this is idealized, assuming a single transfer/array as the input of such a kernel, and likewise for the output/result.
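For illustration, the idealized pattern would look roughly like the following HIP sketch: one H2D copy feeding a kernel launch, one D2H copy returning the result. This is a hand-written toy (the kernel and sizes are made up), not DBCSR code; in practice a sparse multiplication involves many blocks and the runtime interleaves many transfers.

```cpp
#include <hip/hip_runtime.h>
#include <vector>

// Stand-in for a real compute kernel (illustrative only).
__global__ void scale(const double* in, double* out, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.0 * in[i];
}

int main() {
  const int n = 1 << 20;
  std::vector<double> host(n, 1.0);
  double* d_in = nullptr;
  double* d_out = nullptr;
  hipMalloc((void**)&d_in, n * sizeof(double));   // error checking omitted
  hipMalloc((void**)&d_out, n * sizeof(double));  // for brevity

  // The idealized flow: H2D -> LaunchKernel -> D2H.
  hipMemcpy(d_in, host.data(), n * sizeof(double), hipMemcpyHostToDevice);
  hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0,
                     d_in, d_out, n);
  hipMemcpy(host.data(), d_out, n * sizeof(double), hipMemcpyDeviceToHost);

  hipFree(d_in);
  hipFree(d_out);
  return 0;
}
```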
I tried setting the DBCSR backend to other options, and I did not see a large number of H2D transfers in HIPprof. Therefore, I believe DBCSR is causing the issue; it might also be due to the transpose_d kernel. I could not locate the specific code responsible for the numerous H2D transfers. Below, I have attached the test file and the output file. Thank you. @hfp
For the record, if there are "unnecessary" data transfers that can be combined or avoided, this issue applies to all backends as well as all GPUs/vendors. The hint about transposes might be a first step. @zhl201226 you may try …
Looking at CP2K's profile, local GEMMs (…)
I have identified that the H2D issue occurs in the dbcsr_mm_accdrv_process module. Is this module dividing the data into small chunks for transfer? Could they be merged into larger chunks? Additionally, I previously did not use ACC to accelerate DBCSR, but it seems to take longer now, so I am not sure whether DBCSR_RUN_ON_GPU=0 is effective. Could you please provide more optimization suggestions?
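To make the question concrete, here is a minimal sketch of the kind of batching being asked about: packing many small blocks into one pinned staging buffer and issuing a single asynchronous H2D copy. This is not DBCSR's actual code; the function name and buffer layout are assumptions for illustration.

```cpp
// Sketch of the batching idea (hypothetical, not DBCSR internals): instead
// of one hipMemcpy per small block, pack the blocks contiguously into a
// pinned staging buffer and issue a single asynchronous transfer.
#include <hip/hip_runtime.h>
#include <cstring>
#include <vector>

void upload_blocks(const std::vector<std::vector<double>>& blocks,
                   double* device_buf, hipStream_t stream) {
  size_t total = 0;
  for (const auto& b : blocks) total += b.size();

  // Pinned (page-locked) host memory lets the DMA engine run at full speed
  // and avoids an extra staging copy inside the runtime.
  double* staging = nullptr;
  hipHostMalloc((void**)&staging, total * sizeof(double), hipHostMallocDefault);

  size_t offset = 0;
  for (const auto& b : blocks) {
    std::memcpy(staging + offset, b.data(), b.size() * sizeof(double));
    offset += b.size();
  }

  // One large H2D transfer instead of blocks.size() small ones.
  hipMemcpyAsync(device_buf, staging, total * sizeof(double),
                 hipMemcpyHostToDevice, stream);
  hipStreamSynchronize(stream);
  hipHostFree(staging);
}
```

In a real implementation the staging buffer would be allocated once and reused across multiplications, since pinned allocation (hipHostMalloc) is itself expensive.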
Sorry, I guess …
How do I contact contributors? |
Just give it some time; they will see this open issue ;-)
thank you :-) |
(Side note: …)
Regarding the test input, it is missing the restart file for the SCF initial guess. Commenting it out starts from an unreasonable guess and fails in the Cholesky decomposition.
This restart file is too large to upload. Is there another way to send it to you?
Hmm, others may have the same request, so Dropbox or something like that comes to mind. My e-mail is …
I have already sent it to you via e-mail. Thank you.
(Let's see, the e-mail has not arrived yet; perhaps size restrictions.)
I have resent it to my.name@intel.com. Please check it. Best regards
Literally? I envisioned my.name would be my name, taken from https://github.com/hfp (…)
Sure, I also sent an e-mail to hans.pabst@intel.com; my e-mail address is zhanghl20126@gmail.com.
The important CP2K timers for your execution are the following:
Now, I would assume you are running COSMA on the GPU, so you cannot gain more there. Concerning DBCSR, the important part is the DBCSR kernel output:
Basically, 98.8% of the block multiplications are running on the CPU (SMM column); only 1.2% run on the GPU. The reason is that your kernels are not present in the GPU tuning parameters. There are several ways to improve the situation (in order of preference): …
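As a back-of-the-envelope check of why that 98.8%/1.2% split dominates everything else, treat all block multiplications as equal cost and ignore CPU/GPU overlap (both are simplifications). With a fraction $p = 0.012$ offloaded and a GPU that is $s\times$ faster per block, Amdahl's law bounds the overall speedup of the multiplication phase:

$$
S(s) = \frac{1}{(1 - p) + p/s}, \qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - p} = \frac{1}{0.988} \approx 1.012
$$

So even an infinitely fast GPU would shave off barely 1.2% of the multiplication time; the leverage is in moving more kernels onto the GPU, which is what the suggestions above aim at.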
I will debug based on your suggestions later, but since that process is relatively long, I will close the issue for now. Thank you very much.
I am sure the OpenCL backend can be mixed with HIP as well (just like with CUDA). However, I have not spent any time exercising this; it comes down to support in the build system on CP2K's side. In any case, I will keep HIP in mind when taking on this task (it is still open for me to get OpenCL-based DBM/DBT and DBCSR into CP2K's CMake).
Sorry, but I have to reopen this issue.
1. When using the default GPU kernel, the dbcsr_mm_accdrv_process module is called very frequently, with call counts …
2. During kernel training, there are a large number of VMFAULT errors. I have tried making modifications, but there have been no significant improvements. How should I resolve this?
```
HOSTQUEUE <0x1b59b0f0>: device id: 0
HOSTQUEUE <0x1b59b0f0>: >>>>>>>> FIND MATCH KERNEL COMMAND <<<<<<<<<
HOSTQUEUE <0x1b59b0f0>: >>>>>>>> DUMP KERNEL ARGS: size: 40 <<<<<<<<<
00 00 c0 05 3b 2b 00 00 85 3e 00 00 00 00 00 00
HOSTQUEUE <0x1b59b0f0>: >>>>>>>> DUMP KERNEL ARGS PTR INFO <<<<<<<<<
HOSTQUEUE <0x1b59b0f0>: ptr arg index: 2, ptr: 0x2b3af5200000
HOSTQUEUE <0x1b59b0f0>: ptr arg index: 3, ptr: 0x2b3b00000000
HOSTQUEUE <0x1b59b0f0>: ptr arg index: 4, ptr: 0x2b3b05200000
=========> HOSTQUEUE <0x1b1dbb60>: VMFault HSA QUEUE ANALYSIS <=========
```
In my comment above I gave some suggestions, especially on DBCSR. Since then, the new DBCSR and CP2K (2024.2) are out; have you tried them?
Yes, I have tried all the suggestions you gave, but the results are not ideal, especially for these two parts.
Please post the two CP2K logs (CUDA and HIP). There is no reason why the two calls of the function should be different.
CP2K.log
Additionally, the VMFault error is preventing me from training a suitable kernel.
You don't need the training part if you are using the new CP2K 2024.2.
I can analyze your initial log and check the timers (first column is HIP, second is CUDA):
I take COSMA as a reference for the HIP vs. CUDA performance, i.e. 2744.748/1164.564 ≈ 2.4x. Now we can analyze DBCSR. The main call is:
Here the ratio is much less than 2.4x, so I would say the two versions are comparable.
Still, they are comparable. One step further down, we have:
Here we see something consistent with the COSMA performance ratio, which is still OK. The fact that you don't see …
OK, I would assume that the CUDA version doesn't use the latest CP2K, since some of the kernels are still executed on the CPU, e.g.:
For the HIP case, you are using the new CP2K, and indeed all kernels are executed on the GPU (but they are likely not performing as expected...).
Thank you very much. I'll provide you with the CUDA log later. As for compiling the COSMA library, I'm encountering some difficulties (see cp2k/cp2k#3611). Additionally, if I use the latest version of DBCSR, do I need to retrain the kernels?
I would say that the performance is somewhat OK for HIP. No need to retrain the kernels; just use the new DBCSR. But really, there is no need to train anything: I see you are using the generic kernels, which is OK. Now the question is whether we are "abusing" the generic kernel for all cases; maybe we should still keep some kernels executing on the CPU (I see you have a lot of kernels with size 1)... So, yes, the CUDA test with the new CP2K can give us some hints.
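For illustration, the kind of size-based dispatch hinted at here could look like the sketch below; the cutoff value and all names are invented for this example, not DBCSR's actual policy.

```cpp
// Hypothetical dispatch heuristic: route tiny block multiplications to the
// host SMM path and everything else to the generic GPU kernel.
constexpr int kGpuCutoff = 8;  // assumed minimum block dimension for the GPU

enum class Backend { HostSMM, GpuGeneric };

inline Backend choose_backend(int m, int n, int k) {
  // Blocks with any dimension of size 1 (common in this log) rarely amortize
  // the launch and transfer overhead of a GPU kernel.
  if (m < kGpuCutoff || n < kGpuCutoff || k < kGpuCutoff)
    return Backend::HostSMM;
  return Backend::GpuGeneric;
}
```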
I'll provide it to you as soon as possible. |
One more thing, you can try:
and see if the kernels in the DBCSR statistics change.
This is the calculation result using the tuned kernels; I guess not all tuned kernels are suitable for running on the GPU.
Something is wrong with the attachment, I can't download it...
There is something wrong with the last run; at least, I cannot compare it directly. I would assume that only DBCSR should change, but I see the following:
The number of multiplications is different:
We cannot compare them...
This is my rerun of the old version.
OK, thanks, so the conclusion is that the two runs (old and new) are comparable. In terms of DBCSR, I see a difference of 30s on top of ~1300s (about 2%), so I would say this is simply noise. I can conclude that the generic kernel is good enough (at least it is not the dominant factor, for sure). Still, the comparison between the new CUDA and the corresponding HIP remains open (I have the HIP run with only 6846 multiplications). Indeed, to my surprise, the timer …
Thank you. It now seems that we can't attribute all the problems to DBCSR and HIP; there is no basis for that. If new problems come up in the future, I will get in touch with you promptly.
I am writing to seek your assistance. When running CP2K simulations on the AMD MI50 platform with the DBCSR backend set to HIP, the execution time is longer than when using the CPU. Examining the API calls with HIPprof, I noticed a large number of H2D (host-to-device) transfers but no kernel launches. Normally, the call flow should be H2D -> LaunchKernel -> D2H (device-to-host). I would like to understand why there are so many H2D transfers and where in the code they occur. Below, I have attached the JSON file, which you can open in chrome://tracing.
Thank you.