COSMA cublas crash after job finished #100
Hi @yaoyi92! I am not sure if we already discussed this by email? COSMA does not have any wrapper around `MPI_Finalize()`. What happens in COSMA is that the user has the following options: either create a COSMA context explicitly (1.), or rely on the implicitly created global context (2.).
The context is then reused in all multiplications and it mostly contains the memory pool (both CPU and GPU). It is possible to control how much CPU and GPU memory you allow COSMA to use through environment variables.
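For example, the memory cap can be set before launching the application. The variable name `COSMA_CPU_MAX_MEMORY` below is taken from my recollection of COSMA's README (value in megabytes); please verify the exact names against the COSMA version you are using:

```shell
# Cap the host-side memory pool that COSMA's context is allowed to
# allocate (value in MB). Set before the MPI launch so every rank sees it.
export COSMA_CPU_MAX_MEMORY=4096
echo "COSMA CPU cap: ${COSMA_CPU_MAX_MEMORY} MB"
```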
I assume you used the implicitly created global context (as in 2.). In that case, you should be able to call the destructor explicitly; this will release all CPU and GPU memory that was allocated. Let me know if you have any other questions!
Is the problem you mentioned still present with the latest version, COSMA v2.6.1?
Btw, the GPU devices should be set outside of COSMA (e.g. in CP2K) and COSMA will just inherit the devices that were previously set. This could cause the issue you are referring to.
Sorry for the late reply. Yes, the problem is still there. However, the message seems to show up only after the job has finished, so it doesn't bother us too much for now. We used the pxgemm_cosma wrapper you provided, and we initialize our own GPU devices in our code (FHI-aims). It feels like the same situation as with CP2K. Is there a solution for that? Do you think it is worth trying the destructor? Do you have a Fortran wrapper for the destructor?
The CP2K problem was really not due to COSMA. The MPI implementation registers some of the GPU buffers used in MPI communications. The error is that you get a double finalization (in MPI and in COSMA); the problem is now fixed and we leave the finalization to COSMA.
I realized I cannot reopen issue #87, so I am copying the question here. Sorry for the confusion.
=====
Hi @kabicm ,
I was able to redo the test and the problem still exists with v2.5.0 and the master branch. Does COSMA use a wrapper over MPI_Finalize()? I noticed a similar issue on Summit with another code that wraps MPI_Finalize: LLNL/Caliper#392.
If that's the case, my question becomes: Is it possible to manually finalize COSMA?
A related question: if I call COSMA, does it keep holding GPU memory after the gemm calls? Is it possible to control that GPU memory? I guess I am looking for something like initializing and finalizing a COSMA environment over a certain code region, freeing the GPU memory outside that region.
Best wishes,
Yi