
COSMA cublas crash after job finished #100

Closed
yaoyi92 opened this issue Nov 10, 2021 · 5 comments

yaoyi92 commented Nov 10, 2021

I realized I cannot reopen issue #87, so I am copying the question here. Sorry for the confusion.

=====

Hi @kabicm,
I was able to redo the test and the problem still exists with v2.5.0 and the master branch. Does COSMA use a wrapper over MPI_Finalize()? I noticed a similar issue on Summit with another code that wraps MPI_Finalize: LLNL/Caliper#392.

If that's the case, my question becomes: Is it possible to manually finalize COSMA?

A related question: after I call COSMA, does it keep holding GPU memory once the gemm calls are done? Is it possible to control that GPU memory? I am essentially looking for a way to initialize and finalize a COSMA environment over a certain code region and to free the GPU memory outside that region.

Best wishes,
Yi

kabicm (Collaborator) commented Jul 22, 2022

Hi @yaoyi92!

I am not sure whether we already discussed this by email.

COSMA does not have any wrapper around MPI_Finalize and, in general, COSMA does not have an "initialize/finalize" logic.

Instead, the user has the following options:

  1. Create a context explicitly and pass it to the multiply function. In this case, the context is destroyed (something like a finalize) when it goes out of scope; see the sketch below.
  2. Let the global context be created implicitly during the first multiplication. In this case, it is destroyed when main goes out of scope.

The context is then reused in all multiplications and it mostly contains the memory pool (both CPU and GPU).
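
A rough sketch of option 1 (scoping COSMA's memory pool to a code region) could look like the following; the make_context name is an assumption based on the context API described above, so please double-check it against the COSMA headers you build against:

```cpp
// Hedged sketch of option 1: tie COSMA's memory pool to a scope.
// make_context<T>() is assumed here and may differ between COSMA versions.
#include <cosma/context.hpp>

void cosma_heavy_region() {
    auto ctx = cosma::make_context<double>();   // explicit context (option 1)
    // ... pass ctx to the COSMA multiply calls inside this region ...
}   // ctx goes out of scope here, releasing its CPU and GPU memory pool
```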

It is also possible to control how much CPU and GPU memory you allow COSMA to use.

I assume you used the implicitly created global context (as in 2.). In that case, you should be able to call the destructor explicitly:
get_context_instance<T>()->~cosma_context();

This will release all CPU and GPU memory that was allocated.
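
As a concrete (unofficial) illustration, for double-precision multiplications the call could look roughly like this, with a quick check that the GPU memory actually went back to the driver; the cosma/context.hpp header path is an assumption:

```cpp
// Hedged sketch: destroy the implicitly created global context (option 2)
// and verify that GPU memory was returned. Assumes double was the scalar
// type used in the multiplications.
#include <cosma/context.hpp>
#include <cuda_runtime.h>
#include <cstdio>

void release_cosma_pool() {
    std::size_t free_before = 0, free_after = 0, total = 0;
    cudaMemGetInfo(&free_before, &total);

    // Explicit destructor call on the global context, as described above.
    cosma::get_context_instance<double>()->~cosma_context();

    cudaMemGetInfo(&free_after, &total);
    std::printf("GPU memory released: %zu bytes\n", free_after - free_before);
}
```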

Let me know if you have any other questions!

kabicm (Collaborator) commented Jul 22, 2022

Is the problem you mentioned still present with the latest version, COSMA v2.6.1?

kabicm (Collaborator) commented Jul 22, 2022

By the way, the GPU devices should be set outside of COSMA (e.g. in CP2K), and COSMA will just inherit the devices that were previously set. This could be the cause of the issue you are referring to.
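
For reference, the kind of device setup meant here, done by the host application before the first COSMA call, might look roughly like this:

```cpp
// Hedged sketch: bind one GPU to each MPI rank in the host application
// (FHI-aims, CP2K, ...) before the first COSMA call, so that COSMA simply
// inherits the already-selected device.
#include <mpi.h>
#include <cuda_runtime.h>

void bind_gpu_to_rank() {
    int rank = 0, ndev = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);   // simple round-robin rank-to-device mapping
}
```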

yaoyi92 (Author) commented Aug 1, 2022

Sorry for the late reply. Yes, the problem is still there. However, the message seems to show up after the job has finished, so it doesn't bother us too much for now. We used the pxgemm_cosma wrapper you provided, and we initialize our own GPU devices in our code (FHI-aims). It feels like the same situation as with CP2K. Is there a solution for that?

Do you think it is worth trying the destructor? Do you have a Fortran wrapper for it?
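
(For what it is worth, a Fortran-callable entry point around that destructor might look roughly like the C-linkage shim below; cosma_free_context is a made-up name and not part of COSMA's API, so this is only a sketch of what such a wrapper could be.)

```cpp
// Hypothetical sketch, not something COSMA ships: a C-linkage shim around
// the explicit destructor call, which Fortran could bind to via ISO_C_BINDING.
#include <cosma/context.hpp>

extern "C" void cosma_free_context() {
    // Assumes double-precision multiplications; adjust the type if needed.
    cosma::get_context_instance<double>()->~cosma_context();
}
```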


alazzaro commented Aug 2, 2022

The CP2K problem was really not due to COSMA. The MPI implementation registers some of the GPU buffers used in MPI communications. The error was that you got a double finalization (in MPI and in COSMA), but the problem is now fixed and we let COSMA do the finalization.
